I've finally released version 0.8 of the Convolutional Neural Network class. Since MATLAB Central does not accept submissions that include MEX files, the most complete version can be found
here.
Besides a number of bug fixes, this version brings three major new features.
The first is the GUI:

As can be seen, there are RMSE (root mean squared error) and MCR (misclassification rate) plots. The MCR is rather expensive to compute because it loops over a subset of the training set, but in my opinion it is worth it: the MCR is much more informative than the RMSE.
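For the curious, here is a rough sketch of how such an MCR estimate can be computed. The forward-pass call (cnnForward) and all variable names below are illustrative assumptions, not the actual class API:

```matlab
% Estimate the misclassification rate on a random subset of the
% training set. cnet: the network object; images: H x W x N stack of
% training images; labels: 1 x N vector of class indices.
subsetSize = 1000;                            % assumed subset size
idx = randperm(size(images, 3), subsetSize);  % random sample indices
nErrors = 0;
for k = idx
    out = cnnForward(cnet, images(:, :, k));  % hypothetical forward pass
    [~, predicted] = max(out);                % winning output neuron
    nErrors = nErrors + (predicted ~= labels(k));
end
mcr = nErrors / subsetSize;                   % misclassification rate
```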
At the bottom of the window there are progress bars for the overall training and for the Hessian calculation. A very useful thing, in my opinion. At the top right corner is information about the current epoch, iteration, RMSE, and MCR values. The last addition is the Abort button. In the previous version the only way to interrupt training was pressing Ctrl+C, losing all training progress; the Abort button stops the training and saves all network parameters.
The second feature is the new training modes. In previous versions the diagonal Hessian approximation (which significantly speeds up convergence) was recalculated on every iteration using a so-called running estimate. But, as many authors have noted, the Hessian changes very slowly, so there is no need to recalculate it on every iteration. I've therefore added a training mode in which the Hessian is computed on a relatively small subset of the training set, and this is repeated every few thousand iterations. This trick really speeds up training without any noticeable effect on convergence. I've also added pure stochastic gradient descent as a third training mode.
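To make the idea concrete, here is a minimal sketch of how such a scheme can look, in the spirit of a stochastic diagonal Levenberg-Marquardt update. diagHessianSample and gradientSample are hypothetical helpers, and all parameter values are assumptions, not the class defaults:

```matlab
% x: inputs (one column per sample), t: targets, w: weight vector.
mu  = 0.02;               % damping term that bounds the effective step
eta = 0.001;              % global learning rate
refreshPeriod = 2000;     % recompute the Hessian every N iterations
hessSubset = randperm(size(x, 2), 500);  % small subset for the Hessian
for iter = 1:numIters
    if mod(iter - 1, refreshPeriod) == 0
        % average the per-sample diagonal Hessians over the subset
        hDiag = zeros(size(w));
        for k = hessSubset
            hDiag = hDiag + diagHessianSample(w, x(:, k), t(:, k));
        end
        hDiag = hDiag / numel(hessSubset);
    end
    j = randi(size(x, 2));                    % one random training sample
    g = gradientSample(w, x(:, j), t(:, j));  % stochastic gradient
    w = w - (eta ./ (hDiag + mu)) .* g;       % per-weight learning rate
end
```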
The third and main feature is the CUDA-based CNN implementation. CUDA has become very popular for speeding up computationally intensive applications, and of course neural networks seem to have been born for this technology. So it is no surprise that neural network programmers tried to implement NNs on GPUs even when there were no such convenient things as floating-point support or conditional execution in kernels. Implementing classical fully-connected neural networks is quite trivial, because it is just a matter of matrix-matrix multiplications. More interesting are implementations of convolutional neural networks, especially considering their very exciting results in many applications such as handwritten digit recognition, face detection, object classification, etc. Here is what I've found on GPU implementations of CNNs:
[1] Kumar Chellapilla, Sidd Puri, Patrice Simard. High Performance Convolutional Neural Networks for Document Processing. 2006.
[2] Fabian Nasse, Christian Thurau, Gernot A. Fink. Face Detection Using GPU-Based Convolutional Neural Networks. 2009.
[3] Dominik Scherer, Sven Behnke. Accelerating Large-Scale Convolutional Neural Networks with Parallel Graphics Multiprocessors. In Proceedings of the NIPS 2009 Workshop on Large-Scale Machine Learning: Parallelism and Massive Datasets, Whistler, Canada, December 2009.
[4] Narayanan Sundaram, Anand Raghunathan, Srimat T. Chakradhar. A Framework for Efficient and Scalable Execution of Domain-Specific Templates on GPUs. 2009.
In [1], colleagues from Microsoft Research proposed a very interesting idea: represent the convolutions in C-layers as matrix multiplications. This requires some extra memory and computation, but the benefits are convenience, scalability, and ultimately simulation and training speed. There was no CUDA in 2006, so they used shaders to speed up their CNNs.
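To illustrate the idea, here is a minimal sketch of the unrolling trick on a single input plane. im2col is from the Image Processing Toolbox; the sizes and variable names are assumptions for illustration, not taken from [1] or from the class:

```matlab
K = 5;                                      % kernel size (assumed)
numMaps = 6;                                % number of feature maps (assumed)
input   = rand(28, 28);                     % example input plane
kernels = rand(numMaps, K * K);             % one flattened kernel per row
[H, W] = size(input);
outH = H - K + 1;  outW = W - K + 1;
% each column of 'patches' is one K-by-K input patch (column-major)
patches = im2col(input, [K K], 'sliding');  % (K*K) x (outH*outW)
% one matrix product computes all feature maps at once; with unflipped
% kernels this is cross-correlation, as is common in CNNs
maps = kernels * patches;                   % numMaps x (outH*outW)
maps = reshape(maps.', outH, outW, numMaps);
```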
In [2], the researchers describe a very good practical application of CUDA to real-time face detection, though training was not implemented on the GPU.
In [3], scientists from the University of Bonn presented a CUDA implementation of convolutional neural networks covering both simulation and training, but from the paper the implementation appears to be hard-wired.
In [4], researchers from NEC presented a framework for executing various tasks on the GPU, with convolutional neural networks shown as an example of such a task. No information about training was provided.
It should be noted that all these results were presented only in papers; no code or even executables were released.
The only CUDA CNN implementation known to me is
this one. Good job, but it implements only simulation and is very hard-wired.
All of that motivated me to write my own CUDA-based implementation of CNNs: one that is scalable, trainable, and publicly available.
The main disadvantage of my implementation is that it requires MATLAB as the source of the CNN class object and as the sink for processing the results. But I'm planning to turn cudacnn into a standalone library.
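For illustration only, a hypothetical sketch of this MATLAB-as-source-and-sink workflow; every name here (cnn, cudacnn_train, perf) is an assumption for the sake of the example, not the actual API:

```matlab
cnet = cnn();                    % network object is built in MATLAB
% training itself runs on the GPU through a MEX call
[cnet, perf] = cudacnn_train(cnet, images, labels);
plot(perf.rmse);                 % results come back to MATLAB for analysis
```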