On Twitter, Andrej Karpathy complained that the list is a bit hard to browse through. I agree, and even though this is probably not the nice visualization he had in mind, I felt that topical reading lists would go some way toward mitigating the problem.
Here is my reading list on deep learning and unsupervised feature extraction:
A Generative Process for Contractive Auto-Encoders
Abstract: The contractive auto-encoder learns a representation of the input data that captures the local manifold structure around each data point, through the leading singular vectors of the Jacobian of the transformation from input to representation. The corresponding singular values specify how much local variation is plausible in directions associated with the corresponding singular vectors, while remaining in a high-density region of the input space. This paper proposes a procedure for generating samples that are consistent with the local structure captured by a contractive auto-encoder. The associated stochastic process defines a distribution from which one can sample, and which experimentally appears to converge quickly and mix well between modes, compared to Restricted Boltzmann Machines and Deep Belief Networks. The intuitions behind this procedure can also be used to train the second layer of contraction that pools lower-level features and learns to be invariant to the local directions of variation discovered in the first layer. We show that this can help learn and represent invariances present in the data and improve classification error.
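To make the "leading singular vectors of the Jacobian" concrete, here is a minimal sketch. It assumes a single-layer sigmoid encoder h = sigmoid(Wx + b) with random, untrained weights; the function names and sizes are illustrative and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder_jacobian(W, b, x):
    """Jacobian dh/dx of the assumed encoder h = sigmoid(W @ x + b)."""
    h = sigmoid(W @ x + b)
    # d sigmoid(z)/dz = h * (1 - h), applied row-wise to W
    return (h * (1.0 - h))[:, None] * W

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))   # 8 inputs, 5 hidden units
b = np.zeros(5)
x = rng.normal(size=8)

J = encoder_jacobian(W, b, x)
# Rows of Vt are input-space directions; s holds the singular values
# in descending order (how strongly each direction moves the code).
U, s, Vt = np.linalg.svd(J, full_matrices=False)
print("singular values:", s)
print("leading input direction:", Vt[0])
```

In a trained contractive auto-encoder, the directions with large singular values are the ones the abstract describes as plausible local variations; with random weights, as here, they carry no such meaning.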
Building high-level features using large scale unsupervised learning
Abstract: We consider the challenge of building feature detectors for high-level concepts from only unlabeled data. For example, we would like to understand if it is possible to learn a face detector using only unlabeled images downloaded from the Internet. To answer this question, we trained a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (which has 10 million images, each image has 200x200 pixels). On contrary to what appears to be a widely-held negative belief, our experimental results reveal that it is possible to achieve a face detector via only unlabeled data. Control experiments show that the feature detector is robust not only to translation but also to scaling and 3D rotation. Also via recognition and visualization, we find that the same network is sensitive to other high-level concepts such as cat faces and human bodies.
Evaluating Bayesian and L1 Approaches for Sparse Unsupervised Learning
Abstract: The use of L_1 regularisation for sparse learning has generated immense research interest, with many successful applications in diverse areas such as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of L_1 methods, in this paper we find that L_1 regularisation often dramatically under-performs in terms of predictive performance when compared with other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop L_1 minimising factor models, Bayesian variants of “L_1”, and Bayesian models with a stronger L_0-like sparsity induced through spike-and-slab distributions. These spike-and-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner, and avoid unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods outperform L_1 minimisation, even on a computational budget. We thus highlight the need to re-assess the wide use of L_1 methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance.
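The "unnecessary shrinkage of non-zero values" can be seen in a toy comparison. The sketch below contrasts soft thresholding (the proximal operator of the L_1 penalty) with hard thresholding, which I use here only as a simple L_0-like stand-in; it is not the spike-and-slab inference from the paper, and the coefficient values are made up.

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_1: zeroes small entries and
    shrinks every surviving entry toward zero by lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def hard_threshold(v, lam):
    """L_0-like selection: zeroes small entries, leaves the rest untouched."""
    return np.where(np.abs(v) > lam, v, 0.0)

# Made-up coefficients for illustration.
coeffs = np.array([3.0, -0.5, 1.5, 0.2])
lam = 1.0

print("soft:", soft_threshold(coeffs, lam))
print("hard:", hard_threshold(coeffs, lam))
```

Both operators zero out the two small entries, but soft thresholding also pulls 3.0 down to 2.0 and 1.5 down to 0.5, while hard thresholding keeps the surviving values as they are.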
On multi-view feature learning
Abstract: Sparse coding is a common approach to learning local features for object recognition. Recently, there has been an increasing interest in learning features from spatio-temporal, binocular, or other multi-observation data, where the goal is to encode the relationship between images rather than the content of a single image. We discuss the role of multiplicative interactions and of squaring non-linearities in learning such relations. In particular, we show that training a sparse coding model whose filter responses are multiplied or squared amounts to jointly diagonalizing a set of matrices that encode image transformations. Inference amounts to detecting rotations in the shared eigenspaces. Our analysis helps explain recent experimental results showing that Fourier features and circular Fourier features emerge when training complex cell models on translating or rotating images. It also shows how learning about transformations makes it possible to learn invariant features.
Deep Mixtures of Factor Analysers
Abstract: An efficient way to learn deep density models that have many layers of latent variables is to learn one layer at a time using a model that has only one layer of latent variables. After learning each layer, samples from the posterior distributions for that layer are used as training data for learning the next layer. This approach is commonly used with Restricted Boltzmann Machines, which are undirected graphical models with a single hidden layer, but it can also be used with Mixtures of Factor Analysers (MFAs) which are directed graphical models. In this paper, we present a greedy layer-wise learning algorithm for Deep Mixtures of Factor Analysers (DMFAs). Even though a DMFA can be converted to an equivalent shallow MFA by multiplying together the factor loading matrices at different levels, learning and inference are much more efficient in a DMFA and the sharing of each lower-level factor loading matrix by many different higher level MFAs prevents overfitting. We demonstrate empirically that DMFAs learn better density models than both MFAs and two types of Restricted Boltzmann Machines on a wide variety of datasets.
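The remark about multiplying factor loading matrices can be checked numerically for the simplest case. This sketch assumes one linear-Gaussian component per layer with zero means, x = W1 z1 + e1 and z1 = W2 z2 + e2, which is a simplification of the mixture model in the paper; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K1, K2 = 6, 4, 2                      # hypothetical dimensions

W1 = rng.normal(size=(D, K1))            # first-layer factor loadings
W2 = rng.normal(size=(K1, K2))           # second-layer factor loadings
Psi1 = np.diag(rng.uniform(0.1, 1.0, size=D))    # diagonal noise, layer 1
Psi2 = np.diag(rng.uniform(0.1, 1.0, size=K1))   # diagonal noise, layer 2

# Deep view: z2 ~ N(0, I), z1 = W2 z2 + e2, x = W1 z1 + e1
cov_z1 = W2 @ W2.T + Psi2
cov_deep = W1 @ cov_z1 @ W1.T + Psi1

# Collapsed view: x = (W1 W2) z2 + (W1 e2 + e1)
W_shallow = W1 @ W2
noise_shallow = W1 @ Psi2 @ W1.T + Psi1
cov_shallow = W_shallow @ W_shallow.T + noise_shallow

print(np.allclose(cov_deep, cov_shallow))
```

In this sketch the collapsed noise term W1 Psi2 W1^T + Psi1 is generally not diagonal, so the check only confirms that the product W1 W2 acts as the effective loading matrix on the top-level factors.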
Learning Local Transformation Invariance with Restricted Boltzmann Machines
Abstract: The difficulty of developing feature learning algorithms that are robust to the novel transformations (e.g., scale, rotation, or translation) has been a challenge in many applications (e.g., object recognition problems). In this paper, we address this important problem of transformation invariant feature learning by introducing the transformation matrices into the energy function of the restricted Boltzmann machines. Specifically, the proposed transformation-invariant restricted Boltzmann machines not only learn the diverse patterns by explicitly transforming the weight matrix, but it also achieves the invariance of the feature representation via probabilistic max pooling of hidden units over the set of transformations. Furthermore, we show that our transformation-invariant feature learning framework is not limited to this specific algorithm, but can be also extended to many unsupervised learning methods, such as an autoencoder or sparse coding. To validate, we evaluate our algorithm on several benchmark image databases such as MNIST variation, CIFAR-10, and STL-10 as well as the customized digit datasets with significant transformations, and show very competitive classification performance to the state-of-the-art. Besides the image data, we apply the method for phone classification tasks on TIMIT database to show the wide applicability of our proposed algorithms to other domains, achieving state-of-the-art performance.
Large-Scale Feature Learning With Spike-and-Slab Sparse Coding
Abstract: We consider the problem of object recognition with a large number of classes. In order to scale existing feature learning algorithms to this setting, we introduce a new feature learning and extraction procedure based on a factor model we call spike-and-slab sparse coding (S3C). Prior work on this model has not prioritized the ability to exploit parallel architectures and scale to the enormous problem sizes needed for object recognition. We present an inference procedure appropriate for use with GPUs which allows us to dramatically increase both the training set size and the amount of latent factors. We demonstrate that this approach improves upon the supervised learning capabilities of both sparse coding and the ssRBM on the CIFAR-10 dataset. We use the CIFAR-100 dataset to demonstrate that our method scales to large numbers of classes better than previous methods. Finally, we use our method to win the NIPS 2011 Workshop on Challenges In Learning Hierarchical Models’ Transfer Learning Challenge.
Deep Lambertian Networks
Abstract: Visual perception is a challenging problem in part due to illumination variations. A possible solution is to first estimate an illumination invariant representation before using it for recognition. The object albedo and surface normals are examples of such representation. In this paper, we introduce a multilayer generative model where the latent variables include the albedo, surface normals, and the light source. Combining Deep Belief Nets with the Lambertian reflectance assumption, our model can learn good priors over the albedo from 2D images. Illumination variations can be explained by changing only the lighting latent variable in our model. By transferring learned knowledge from similar objects, albedo and surface normals estimation from a single image is possible in our model. Experiments demonstrate that our model is able to generalize as well as improve over standard baselines in one-shot face recognition.
Scene parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers
Abstract: Scene parsing consists in labeling each pixel in an image with the category of the object it belongs to. We propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features. In parallel to feature extraction, a tree of segments is computed from a graph of pixel dissimilarities. The feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier which produces an estimate of the distribution of object categories contained in the segment. A subset of tree nodes that cover the image are then selected so as to maximize the average 'purity' of the class distributions, hence maximizing the overall likelihood that each segment will contain a single object. The system yields record accuracies on the Sift Flow Dataset (33 classes) and the Barcelona Dataset (170 classes) and near-record accuracy on Stanford Background Dataset (8 classes), while being an order of magnitude faster than competing approaches, producing a 320x240 image labeling in less than 1 second, including feature extraction.