1 Introduction
The main purpose of unsupervised learning methods is to extract generally useful features from unlabelled data, to detect and remove input redundancies, and to preserve only essential aspects of the data in robust and discriminative representations. Unsupervised methods have been routinely used in many scientific and industrial applications. In the context of neural network architectures, unsupervised layers can be stacked on top of each other to build deep hierarchies [7]. Input layer activations are fed to the first layer, which feeds the next, and so on, for all layers in the hierarchy. Deep architectures can be trained in an unsupervised layer-wise fashion, and later fine-tuned by back-propagation to become classifiers [9]. Unsupervised initializations tend to avoid local minima and increase the network's performance stability [6].
Most methods are based on the encoder-decoder paradigm, e.g., [20]. The input is first transformed into a typically lower-dimensional space (encoder), and then expanded to reproduce the initial data (decoder). Once a layer is trained, its code is fed to the next, to better model highly non-linear dependencies in the input. Methods using this paradigm include stacks of: Low-Complexity Coding and Decoding machines (LOCOCODE) [10], Predictability Minimization layers [23,24], Restricted Boltzmann Machines (RBMs) [8], auto-encoders [20] and energy-based models [15].
In visual object recognition, CNNs [1,3,4,14,26] often excel. Unlike patch-based methods [19] they preserve the input's neighborhood relations and spatial locality in their latent higher-level feature representations.
2 Preliminaries
2.1 Auto-Encoder
We recall the basic principles of auto-encoder models, e.g., [2]. An auto-encoder takes an input x ∈ R^d and first maps it to the latent representation h ∈ R^{d′} using a deterministic function of the type h = f_θ(x) = σ(Wx + b) with parameters θ = {W, b}. This “code” is then used to reconstruct the input by a reverse mapping of f: y = f_{θ′}(h) = σ(W′h + b′) with θ′ = {W′, b′}. The two parameter sets are usually constrained to be of the form W′ = W^T, using the same weights for encoding the input and decoding the latent representation. Each training pattern x_i is then mapped onto its code h_i and its reconstruction y_i. The parameters are optimized, minimizing an appropriate cost function over the training set D_n = {(x_0, t_0), ..., (x_n, t_n)}.
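To make the mapping concrete, the following is a minimal NumPy sketch of an auto-encoder with tied weights W′ = W^T (for an auto-encoder the target t_i is the input x_i itself); the dimensions, logistic activation, and initialization are illustrative assumptions, not prescribed by the text:

import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    # Logistic activation; the concrete choice of sigma is our assumption.
    return 1.0 / (1.0 + np.exp(-z))

d, d_prime = 784, 64                      # input and latent dimensions (illustrative)
W = rng.normal(0.0, 0.01, size=(d_prime, d))
b = np.zeros(d_prime)                     # encoder bias
b_prime = np.zeros(d)                     # decoder bias

def encode(x):
    # h = f_theta(x) = sigma(W x + b)
    return sigma(W @ x + b)

def decode(h):
    # y = f_theta'(h) = sigma(W' h + b'), with tied weights W' = W^T
    return sigma(W.T @ h + b_prime)

x = rng.random(d)                         # one training pattern
y = decode(encode(x))                     # its reconstruction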
3 Convolutional Auto-Encoder (CAE)
For a mono-channel input x, the latent representation of the k-th feature map is given by

h^k = σ(x ∗ W^k + b^k),  (1)

where σ is an activation function, ∗ denotes the 2D convolution, and a single bias b^k is broadcast over the whole map. The reconstruction is obtained as

y = σ( Σ_{k∈H} h^k ∗ W̃^k + c ),  (2)

where again there is one bias c per input channel. H identifies the group of latent feature maps; W̃ identifies the flip operation over both dimensions of the weights. The 2D convolutions in equations (1) and (2) are determined by context: the convolution of an m × m matrix with an n × n matrix may in fact result in an (m + n − 1) × (m + n − 1) matrix (full convolution) or in an (m − n + 1) × (m − n + 1) matrix (valid convolution). The cost function to minimize is the mean squared error (MSE):
E(θ) = (1/2n) Σ_{i=1}^{n} (x_i − y_i)^2 .  (3)
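An illustrative sketch of equations (1)–(3) for a single training pattern, using SciPy's 2D convolution; the number of maps, kernel size, tanh activation, and the per-pixel normalization of the MSE are our assumptions:

import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
sigma = np.tanh                           # activation function; an assumption

x = rng.random((28, 28))                  # mono-channel input
W = rng.normal(0.0, 0.1, size=(8, 5, 5))  # 8 latent maps with 5 x 5 kernels
b = np.zeros(8)                           # one bias per latent map
c = 0.0                                   # one bias per input channel

# Eq. (1): h^k = sigma(x * W^k + b^k); valid convolution -> 24 x 24 maps
h = np.array([sigma(convolve2d(x, W[k], mode='valid') + b[k])
              for k in range(len(W))])

# Eq. (2): y = sigma(sum_{k in H} h^k * W~^k + c); W~ flips both kernel
# dimensions, and the full convolution restores the 28 x 28 input size
W_flip = W[:, ::-1, ::-1]
y = sigma(sum(convolve2d(h[k], W_flip[k], mode='full')
              for k in range(len(W))) + c)

# Eq. (3): mean squared reconstruction error for this single pattern
E = ((x - y) ** 2).sum() / (2.0 * x.size)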
Just as for standard networks the backpropagation algorithm is applied to compute the gradient of the error function with respect to the parameters. This can be easily obtained by convolution operations using the following formula:
∂E(θ)/∂W^k = x ∗ δh^k + h̃^k ∗ δy .  (4)
δh and δy are the deltas of the hidden states and the reconstruction, respectively.
The weights are then updated using stochastic gradient descent.
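A hedged sketch of equation (4) for one kernel, assuming the shape conventions above (m × m input, n × n kernel) and that the deltas δh^k and δy have already been computed by backpropagation; all names here are ours:

import numpy as np
from scipy.signal import convolve2d

def grad_Wk(x, delta_h_k, h_k, delta_y):
    # x * delta_h^k: valid convolution of the m x m input with the
    # (m-n+1) x (m-n+1) hidden delta -> an n x n gradient contribution.
    term1 = convolve2d(x, delta_h_k, mode='valid')
    # h~^k * delta_y: the hidden map, flipped over both dimensions,
    # convolved with the m x m reconstruction delta -> n x n as well.
    term2 = convolve2d(h_k[::-1, ::-1], delta_y, mode='valid')
    return term1 + term2

# Stochastic gradient descent step (learning rate eta is illustrative):
# W[k] -= eta * grad_Wk(x, delta_h[k], h[k], delta_y)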
3.1 Max-Pooling
For hierarchical networks in general and CNNs in particular, a max-pooling layer [22] is often introduced to obtain translation-invariant representations. Max-pooling down-samples the latent representation by a constant factor, usually taking the maximum value over non-overlapping sub-regions. This helps improve filter selectivity, as the activation of each neuron in the latent representation is determined by the “match” between the feature and the input field over the region of interest. Max-pooling was originally intended for fully-supervised feed-forward architectures only.
Here we add a max-pooling layer that introduces sparsity over the hidden representation by erasing all non-maximal values in non-overlapping sub-regions. This forces the feature detectors to become more broadly applicable, avoiding trivial solutions such as having only one weight “on” (identity function). During the reconstruction phase, such a sparse latent code decreases the average number of filters contributing to the decoding of each pixel, forcing filters to be more general. Consequently, with a max-pooling layer there is no obvious need for L1 and/or L2 regularization over hidden units and/or weights.
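A minimal sketch of this sparsifying pooling step, assuming square non-overlapping sub-regions; the function name and the variant that keeps each maximum at its original position are our reading of the text:

import numpy as np

def sparse_maxpool(h, p=2):
    # Erase all non-maximal values in each non-overlapping p x p sub-region,
    # keeping only the single maximum of every region.
    out = np.zeros_like(h)
    rows, cols = h.shape
    for i in range(0, rows - rows % p, p):
        for j in range(0, cols - cols % p, p):
            block = h[i:i + p, j:j + p]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            out[i + r, j + c] = block[r, c]
    return out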
4 Experiments
We begin by visually inspecting the filters of various CAEs, trained in different setups on a digit dataset (MNIST [14]) and on natural images (CIFAR10 [13]). In Figure 1 we compare twenty 7 × 7 filters (learned on MNIST) from four CAEs of the same topology, but trained differently. The first is trained on original digits
(a), the second on noisy inputs with 50% binomial noise added (b), the third
has an additional max-pooling layer of size 2 × 2 (c), and the fourth is trained
on noisy inputs (30% binomial noise) and has a max-pooling layer of size 2 × 2
(d). We add only 30% noise in conjunction with max-pooling layers, to avoid losing too much relevant information. The CAE without any additional constraints (a)
learns trivial solutions. Interesting and biologically plausible filters only emerge
once the CAE is trained with a max-pooling layer. With additional noise the
filters become more localized. For this particular example, max-pooling yields the most visually appealing filters; those of the other approaches lack a well-defined shape. A max-pooling layer is an elegant way of enforcing the sparse code required to deal with the overcomplete representations of convolutional architectures.
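For reference, the corruption steps used above might be sketched as follows; we read “binomial noise” as the masking corruption of denoising auto-encoders [27], and the Gaussian variant is the one used later for natural color images. Both readings are assumptions:

import numpy as np

rng = np.random.default_rng(0)

def binomial_noise(x, p):
    # Zero each pixel independently with probability p (masking corruption).
    return x * rng.binomial(1, 1.0 - p, size=x.shape)

def gaussian_noise(x, std):
    # Additive Gaussian corruption, used for natural color images (Sect. 4).
    return x + rng.normal(0.0, std, size=x.shape)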
Fig. 1. A randomly selected subset of the first layer's filters learned on MNIST to compare noise and pooling. (a) No max-pooling, 0% noise; (b) no max-pooling, 50% noise; (c) 2 × 2 max-pooling, 0% noise; (d) 2 × 2 max-pooling, 30% noise.
Fig. 2. A randomly selected subset of the first layer's filters learned on CIFAR10 to compare noise and pooling (best viewed in colour). (a) No pooling, 0% noise; (b) no pooling, 50% noise; (c) 2 × 2 pooling, 0% noise; (d) 2 × 2 pooling, 50% noise.
When dealing with natural color images, Gaussian noise instead of binomial
noise is added to the input of a denoising CAE. We repeat the above experiment
on CIFAR10. The corresponding filters are shown in Figure 2. The impact of a
max-pooling layer is striking (c), whereas adding noise (b) has almost no visual
effect except on the weight magnitudes (d). As for MNIST, only a max-pooling
layer guarantees convincing solutions, indicating that max-pooling is essential.
It seems to at least partially solve the problems that usually arise when training
auto-encoders by gradient descent. Another welcome aspect of our approach is
that except for the max-pooling kernel size, no additional parameters have to be
set by trial and error or time-consuming cross-validation.
5 Conclusion
We introduced the convolutional auto-encoder (CAE), an unsupervised method for hierarchical feature extraction that, unlike fully connected auto-encoders, preserves the 2D structure of the input. Combining the CAE with a max-pooling layer over the latent representation enforces sparse codes and yields biologically plausible, localized filters, without requiring L1/L2 regularization and leaving only the pooling kernel size to be set.
References
1. Behnke, S.: Hierarchical Neural Networks for Image Interpretation. LNCS,
vol. 2766, pp. 1–13. Springer, Heidelberg (2003)
2. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training
of deep networks. In: Neural Information Processing Systems, NIPS (2007)
3. Cireşan, D.C., Meier, U., Masci, J., Gambardella, L.M., Schmidhuber, J.: High-
Performance Neural Networks for Visual Object Classification. ArXiv e-prints,
arXiv:1102.0183v1 (cs.AI) (February 2011)
4. Cireşan, D.C., Meier, U., Masci, J., Schmidhuber, J.: Flexible, high performance convolutional neural networks for image classification. In: International Joint Conference on Artificial Intelligence, IJCAI (to appear, 2011)
5. Coates, A., Lee, H., Ng, A.: An analysis of single-layer networks in unsupervised
feature learning. Advances in Neural Information Processing Systems (2010)
6. Erhan, D., Bengio, Y., Courville, A., Manzagol, P.A., Vincent, P.: Why Does Unsupervised Pre-training Help Deep Learning? Journal of Machine Learning Research 11, 625–660 (2010)
7. Fukushima, K.: Neocognitron: A self-organizing neural network for a mechanism
of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4),
193–202 (1980)
8. Hinton, G.E.: Training products of experts by minimizing contrastive divergence.
Neural Comp. 14(8), 1771–1800 (2002)
9. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief
nets. Neural Computation (2006)
10. Hochreiter, S., Schmidhuber, J.: Feature extraction through LOCOCODE. Neural
Computation 11(3), 679–714 (1999)
11. Hubel, D.H., Wiesel, T.N.: Receptive fields and functional architecture of
monkey striate cortex. The Journal of Physiology 195(1), 215–243 (1968),
http://jp.physoc.org/cgi/content/abstract/195/1/215
12. Krizhevsky, A.: Convolutional deep belief networks on CIFAR-10 (2010)
13. Krizhevsky, A.: Learning multiple layers of features from tiny images. Master’s
thesis, Computer Science Department, University of Toronto (2009)
14. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
15. LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., Huang, F.: A tutorial on energy-
based learning. In: Bakir, G., Hofmann, T., Schölkopf, B., Smola, A., Taskar, B.
(eds.) Predicting Structured Data. MIT Press, Cambridge (2006)
16. Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations. In: Proceedings
of the 26th International Conference on Machine Learning, pp. 609–616 (2009)
17. Lowe, D.: Object recognition from local scale-invariant features. In: The Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157 (1999)
18. Norouzi, M., Ranjbar, M., Mori, G.: Stacks of convolutional Restricted Boltzmann Machines for shift-invariant feature learning. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2735–2742 (June 2009), http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5206577
19. Ranzato, M., Boureau, Y., LeCun, Y.: Sparse feature learning for deep belief networks. In: Advances in Neural Information Processing Systems, NIPS 2007 (2007)
20. Ranzato, M., Huang, F.J., Boureau, Y.L., LeCun, Y.: Unsupervised learning of invariant
feature hierarchies with applications to object recognition. In: Proc. of Computer
Vision and Pattern Recognition Conference (2007)
21. Ranzato, M., Hinton, G.E.: Modeling pixel means and covariances using factorized third-order Boltzmann machines. In: Proc. of Computer Vision and Pattern Recognition Conference, CVPR 2010 (2010)
22. Scherer, D., Müller, A., Behnke, S.: Evaluation of pooling operations in convolutional architectures for object recognition. In: International Conference on Artificial Neural Networks (2010)
23. Schmidhuber, J.: Learning factorial codes by predictability minimization. Neural
Computation 4(6), 863–879 (1992)
24. Schmidhuber, J., Eldracher, M., Foltin, B.: Semilinear predictability minimization
produces well-known feature detectors. Neural Computation 8(4), 773–786 (1996)
25. Serre, T., Wolf, L., Poggio, T.: Object recognition with features inspired by visual
cortex. In: Proc. of Computer Vision and Pattern Recognition Conference (2007)
26. Simard, P., Steinkraus, D., Platt, J.: Best practices for convolutional neural networks applied to visual document analysis. In: Seventh International Conference on Document Analysis and Recognition, pp. 958–963 (2003)
27. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and Composing
Robust Features with Denoising Autoencoders. In: Neural Information Processing
Systems, NIPS (2008)
28. Zeiler, M.D., Krishnan, D., Taylor, G.W., Fergus, R.: Deconvolutional Networks.
In: Proc. Computer Vision and Pattern Recognition Conference, CVPR 2010
(2010)