MNIST is, for better or worse, one of the standard benchmarks for machine learning and is also widely used in the neural networks community as a toy vision problem.
Just for the unlikely case that anyone is not familiar with it:
It is a dataset of handwritten digits, 0-9, in black on white background.
It looks something like this:
There are 60000 training and 10000 test images, each 28x28 grayscale.
There are roughly the same number of examples of each category in the test and training datasets.
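In case anyone wants to read the raw files (e.g. train-images-idx3-ubyte) directly: the IDX format they use is just a big-endian header (magic number 2051 for image files, then the number of images, rows, and columns) followed by the raw pixel bytes. A minimal sketch in Python (the function name is mine):

```python
import struct
import numpy as np

def read_idx_images(data: bytes) -> np.ndarray:
    """Parse the MNIST IDX image format: big-endian header, then raw uint8 pixels."""
    magic, n, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError("not an IDX image file")
    return np.frombuffer(data, dtype=np.uint8, offset=16).reshape(n, rows, cols)

# Example with a synthetic two-image "file" instead of the real download:
fake = struct.pack(">IIII", 2051, 2, 28, 28) + bytes(2 * 28 * 28)
images = read_idx_images(fake)
print(images.shape)  # (2, 28, 28)
```

The label files work the same way, with magic number 2049 and no row/column fields.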
I have used it in some papers myself, even though there are reasons why it is a little weird.
Some not-so-obvious (or maybe they are) facts are:
- The images actually contain the digit in a 20x20 patch and were padded to 28x28 for use in LeNet in Yann LeCun's classic 1998 paper on convolutional neural networks.
- Even within the central 20x20 patch, there are some pixels of zero variance.
- The dataset as a whole has very little variance, as can be seen from the eigenvalue spectrum of a PCA.
- The dataset is very easy: random guessing is at 10% correct, a naive Bayes classifier scores about 90% correct, and k-nearest-neighbors about 96.9% (I got that with k=3).
- It is very easy to exploit the special structure of the dataset: a lot of the variation is caused by skewed and scaled digits. A relatively recent paper using a blown-up training set and an MLP scored an excellent 99.6%.
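The k-nearest-neighbor baseline is simple enough to sketch with brute-force distances in plain NumPy. The data below is a synthetic stand-in, not MNIST, so the accuracy here says nothing about the 96.9% above:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    # Squared Euclidean distance from every test point to every training point.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]  # indices of the k closest neighbors
    votes = y_train[nearest]                 # their labels
    return np.array([np.bincount(v).argmax() for v in votes])

# Synthetic stand-in: 10 well-separated class centers in 784 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(10, 784))
y_train = np.repeat(np.arange(10), 20)
X_train = centers[y_train] + rng.normal(size=(200, 784))
y_test = np.repeat(np.arange(10), 5)
X_test = centers[y_test] + rng.normal(size=(50, 784))

accuracy = (knn_predict(X_train, y_train, X_test, k=3) == y_test).mean()
```

With real MNIST, X_train would be the 60000x784 training matrix and X_test the 10000 test images.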
A thread in the kernel-machines forum motivated me to try to reproduce some of the results listed on the MNIST webpage using support vector machines with an RBF kernel.
I am relatively new to that area and thought this would be a nice thing to try, since the website gives no source for the reported performance of 98.6%.
I tried pretty long and hard on this. Since the dataset is quite big, doing a grid search took quite a while. Also, I did not remove the padding, which could have sped up the process.
I used the excellent LIBSVM implementation with an RBF kernel and the provided grid search tool. I distributed the work over 10 local processors, which was quite easy.
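A sketch of how such a grid can be fanned out over 10 workers (this is not grid.py's own mechanism, just an illustration; the parameter ranges and the dummy scoring function are made up). Threads suffice here because each real worker would mostly sit waiting on an external svm-train process:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def cv_accuracy(params):
    c_exp, g_exp = params
    # Placeholder for "run 5-fold cross-validation at C=2**c_exp, gamma=2**g_exp";
    # the quadratic below is a dummy score with a made-up optimum.
    return (c_exp, g_exp, -(c_exp - 1.5) ** 2 - (g_exp + 7.0) ** 2)

# Hypothetical (log2 C, log2 gamma) grid.
grid = list(product([0.5 + 0.2 * i for i in range(10)],
                    [-7.5 + 0.2 * j for j in range(5)]))

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(cv_accuracy, grid))

best = max(results, key=lambda r: r[2])
```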
I tried several scalings of the original dataset:
Scaling between 0 and 1, scaling between -1 and 1, and normalizing each image to unit Euclidean length.
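As a sketch, for MNIST's 0-255 pixel values the three scalings reduce to simple array operations (svm-scale does the per-feature version of the first two):

```python
import numpy as np

# Toy pixel rows standing in for flattened 784-dimensional MNIST images.
X = np.array([[0, 128, 255], [255, 0, 64]], dtype=np.float64)

X01 = X / 255.0                  # scale to [0, 1]
Xpm1 = X / 127.5 - 1.0           # scale to [-1, 1]
norms = np.linalg.norm(X, axis=1, keepdims=True)
Xunit = X / np.where(norms == 0, 1.0, norms)  # unit Euclidean length per row
```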
First I did a coarse grid search using 5 fold cross validation on a 5000 sample subset of the training set. I also tried 10000 samples and got similar results.
The grid search looked something like:
The scaling between -1 and 1 seemed to work best, so I did a finer grid-search using this scaling and arrived at
gamma = 0.00728932024638 and C = 2.82842712475 using
python grid.py -log2c 0.5,2.5,0.2 -log2g -7.5,-6.5,0.2 -v 5 -h
Using these parameters, I trained on the whole training set and tested on the test set to obtain....
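The real run used LIBSVM's command-line tools, but for illustration the same model can be set up with scikit-learn's SVC (which wraps LIBSVM), passing the parameters found above. The data here is a synthetic stand-in for MNIST, chosen so that the fixed gamma is a reasonable bandwidth:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in: 10 class centers in 784 dimensions with small noise,
# so that gamma ~ 0.0073 matches the within-class distance scale.
rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 784))
y_train = np.repeat(np.arange(10), 20)
X_train = centers[y_train] + rng.normal(scale=0.3, size=(200, 784))
y_test = np.repeat(np.arange(10), 5)
X_test = centers[y_test] + rng.normal(scale=0.3, size=(50, 784))

# The grid-search optimum from above, with LIBSVM's default tolerance.
clf = SVC(kernel="rbf", C=2.82842712475, gamma=0.00728932024638, tol=0.001)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```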
... which is close enough so that I can finally rest ...
By the way, I used the default tolerance of the LIBSVM command-line interface, which is 0.001. [/edit]