Introduction
Standard ways to improve image classification are to collect more samples or to change the representation and the processing of the data. In practice, the number of samples is typically limited, so that the second approach becomes relevant. Important tools for this second approach are feature learning algorithms, which aim at easing the classification task by transforming the data. Recently proposed deep learning methods intend to jointly learn a feature transformation and the classification [Krizhevsky, Sutskever, and Hinton2012]. In this work, however, we focus on unsupervised feature learning, especially on Sparse Filtering, because of its simplicity and scalability.
Feature learning algorithms for image classification pipelines typically consist of three steps: preprocessing, (un)supervised dictionary learning, and encoding. An abundance of procedures is available for each of these steps, but for accurate image classification, we need procedures that are effective and interact beneficially with each other [Agarwal and Triggs2006, Coates and Ng2011, Coates, Ng, and Lee2011, Jia, Huang, and Darrell2012, Le2013, LeCun, Huang, and Bottou2004]. Therefore, a profound understanding of these procedures is crucial to ensure accurate results and efficient computations.
In this paper, we study the performance of Sparse Filtering [Ngiam et al.2011] for image classification. Our main contributions are:

we show that Sparse Filtering can strongly benefit from early stopping;

we show that the performance of Sparse Filtering is correlated with spectral properties of feature matrices on test sets;

we introduce the Optimal Roundness Criterion (ORC), a stopping criterion for Sparse Filtering based on the above correlation, and demonstrate that the ORC can considerably improve image classification.
Feature Learning for Image Classification
Feature learning algorithms often consist of two steps: In a first step, a dictionary is learned, and in a second step, the samples are encoded based on this dictionary. A typical dictionary learning step for image classification is sketched in Figure 1: First, random patches (samples) are extracted from the training images. These patches are then preprocessed using, for example, Statistical Whitening or Contrast Normalization. Finally, an unsupervised learning algorithm is applied to learn a dictionary from the preprocessed patches. Once a dictionary is learned, several further steps need to be applied to finally train an image classifier, see, for example, [Coates and Ng2011, Coates, Ng, and Lee2011, Jia, Huang, and Darrell2012, Le2013]. Our pipeline is similar to the one in [Coates and Ng2011]: We extract square patches comprising pixels, preprocess them with Contrast Normalization (that is, subtracting the mean and dividing by the standard deviation of the pixel values) and/or Statistical Whitening, and finally pass them to Random Patches or Sparse Filtering. (Note that our outcomes differ slightly from those in [Coates and Ng2011] because we use square patches comprising pixels instead of pixels.) Subsequently, we apply soft-thresholding for encoding, spatial max pooling for extracting features from the training images, and finally L2 SVM classification (cf. [Coates and Ng2011]).
Numerous examples show that feature learning can considerably improve classification. Therefore, insight into the underlying principles of feature learning algorithms such as Statistical Whitening and Sparse Filtering is of great interest.
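The two simplest steps of this pipeline, Contrast Normalization and soft-thresholding, can be sketched as follows. This is a minimal illustration and not the implementation used for the experiments: the one-sided form max(0, Dx - alpha) of the soft-thresholding encoder, the threshold `alpha`, the guard `eps`, and all function names are assumptions made for this example.

```python
import numpy as np

def contrast_normalize(patches, eps=1e-8):
    """Subtract the mean and divide by the standard deviation of each patch.

    patches: array of shape (n_features, n_samples), one patch per column.
    eps: assumed small constant that avoids division by zero for flat patches.
    """
    mean = patches.mean(axis=0, keepdims=True)
    std = patches.std(axis=0, keepdims=True)
    return (patches - mean) / (std + eps)

def soft_threshold_encode(dictionary, patches, alpha=0.25):
    """Assumed one-sided soft-thresholding encoder: max(0, D x - alpha)."""
    return np.maximum(0.0, dictionary @ patches - alpha)

# Synthetic stand-ins for image patches and a dictionary (not real data).
rng = np.random.default_rng(0)
patches = rng.normal(loc=3.0, scale=2.0, size=(12, 5))
dictionary = rng.normal(size=(4, 12))

normalized = contrast_normalize(patches)
codes = soft_threshold_encode(dictionary, normalized)
```

After normalization, every patch (column) has mean zero and unit standard deviation, and the encoded features are nonnegative by construction.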
In mathematical terms, a feature learning algorithm provides a transformation
(1)  
of an original feature matrix to a new feature matrix . We adopt the convention that the rows of the matrices correspond to the features, the columns to the samples; this convention implies in particular that is the number of original features, the number of samples, and the number of new features.
The Optimal Roundness Criterion
Roundness of Feature Matrices
Feature learning can be seen as a tradeoff between reducing the correlations of the feature representation and preserving relevant information. This tradeoff can be readily understood by looking at Statistical Whitening. For this, recall that preprocessing with Statistical Whitening transforms a set of image patches into a new set of patches by changing the local correlation structure. More precisely, Statistical Whitening transforms patches (), that is, subsets of the entire feature matrix, into new patches such that
Statistical Whitening therefore acts locally: while the correlation structures of the single patches are directly and radically changed, the structure of the entire matrix is affected only indirectly. However, these indirect effects on the entire matrix are important for the following. To capture these effects, we therefore introduce the roundness of a feature matrix given an original feature matrix . On a high level, we say that the new feature matrix is round if the spectrum of the associated Gram matrix
is narrow. To specify this notion, we denote the ordered eigenvalues of
by and their mean by and define roundness as follows:
Definition 1.
For any matrix , we define its roundness as
The largest eigenvalue measures the width of the spectrum of the Gram matrix; alternative measures of the width such as the standard deviation of the eigenvalues would serve the same purpose. The mean of the eigenvalues , on the other hand, is basically a normalization as the following result illustrates (the proof is found in the supplementary material):
Theorem 1.
Denote the columns of by . Then, the mean of the eigenvalues of the Gram matrix is constant on
Definition 1 therefore states that the larger is , the narrower is the spectrum of the eigenvalues of the Gram matrix of , and therefore, the rounder is the matrix . With this notion of roundness at hand, we can now understand the effects of Statistical Whitening: On the one hand, Definition 1 indicates that Statistical Whitening renders single patches perfectly round, that is, . On the other hand, Statistical Whitening preserves global structures in the feature matrix. In particular, the entire feature matrix is made rounder but not rendered perfectly round, that is, . In this sense, Statistical Whitening can be seen as a tradeoff between increasing the roundness and preserving global structures. It therefore remains to connect roundness and randomization.
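The effect of whitening on roundness can be checked numerically. The sketch below assumes that the Gram matrix is the feature-by-feature product F Fᵀ and that roundness is the mean eigenvalue divided by the largest eigenvalue, which is consistent with the description above but is not copied from the displayed definition. The ZCA form of whitening, the guard `eps`, and all names are choices made for this example.

```python
import numpy as np

def roundness(F):
    """Assumed form of Definition 1: mean / largest eigenvalue of F F^T."""
    eigenvalues = np.linalg.eigvalsh(F @ F.T)  # Gram matrix is symmetric
    return eigenvalues.mean() / eigenvalues.max()

def whiten(X, eps=1e-12):
    """ZCA whitening of X (features x samples); eps is an assumed guard."""
    Xc = X - X.mean(axis=1, keepdims=True)          # center each feature
    cov = Xc @ Xc.T / Xc.shape[1]                   # empirical covariance
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return W @ Xc

# Synthetic correlated features (not real image patches).
rng = np.random.default_rng(0)
mixing = rng.normal(size=(8, 8))
X = mixing @ rng.normal(size=(8, 500))

r_before = roundness(X)
r_after = roundness(whiten(X))
```

Under these assumptions, `r_after` is numerically equal to one, because the whitened matrix has a Gram matrix proportional to the identity, while `r_before` is strictly smaller.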
Roundness and Randomness
A connection between roundness and randomization is provided by random matrix theory. To illustrate this connection, we first recall Gordon’s theorem for Gaussian random matrices (see
[Eldar and Kutyniok2012, Chapter 5] for a recent introduction to random matrix theory):
Theorem 2 (Gordon).
Let be a random matrix with independent standard normal entries. Then,
Such exact bounds are available only for matrices with independent standard normal entries, but sharp bounds in probability are available also for other random matrices. For our purposes, the common message of all these bounds is that random matrices with sufficiently many columns (number of samples) have a narrow spectrum. This means in particular that such matrices are round, as the following asymptotic result illustrates (the proof is based on well-known results from random matrix theory and therefore omitted):
Lemma 1.
Let the number of features be a function of the number of samples such that . Moreover, for all , let be a random matrix with independent standard normal entries. Then, for all ,
that is, converges in probability to .
Similar results can be derived for non-Gaussian or correlated entries, indicating that random matrices are typically round.
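The asymptotic statement can be illustrated with a small simulation. The sketch again assumes that roundness is the ratio of the mean to the largest eigenvalue of F Fᵀ; the number of features and the sample sizes are arbitrary values chosen for this example.

```python
import numpy as np

def roundness(F):
    """Assumed form of Definition 1: mean / largest eigenvalue of F F^T."""
    eigenvalues = np.linalg.eigvalsh(F @ F.T)
    return eigenvalues.mean() / eigenvalues.max()

rng = np.random.default_rng(0)
n_features = 20
sample_sizes = [50, 500, 5000, 50000]

# Roundness of Gaussian matrices with a fixed number of features (rows)
# and a growing number of samples (columns).
values = [roundness(rng.standard_normal((n_features, n))) for n in sample_sizes]

for n, r in zip(sample_sizes, values):
    print(f"n = {n:6d}  roundness = {r:.3f}")
```

With the number of features fixed, the printed roundness should approach one as the number of samples grows, in line with Lemma 1.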
Besides the connection between roundness and randomization, the above results for random matrices also provide a link between roundness and sample sizes. Indeed, the above results indicate that large sample sizes lead to round matrices. To make this link more tangible, we conduct simulations with Toeplitz matrices, which can model local correlations that are typical for nearby pixels in natural images [Girshick and Malik2013]. To this end, we first recall that for any fixed parameter , the entries of a Toeplitz matrix are defined as (we set for and for ). We now construct a feature matrix
by drawing each of its columns, that is, the samples, from the normal distribution with mean zero and covariance matrix
. Toeplitz matrices with lead to feature matrices with independent entries; Toeplitz matrices with lead to feature matrices with dependence structures that are more similar to dependence structures found in natural images. In Figure 2, we report the roundness of for (plot on the left) and (plot on the right) as a function of the numbers of samples for different numbers of features . The results are commensurate with the theoretical findings above: First, both plots illustrate that the roundness of matrices increases if the number of samples is increased but decreases if the number of features is increased (cf. Theorem 2 and Lemma 1). Second, a comparison of the two plots illustrates that the roundness is larger for than for (since is perfectly round while is not).
Optimal Roundness Criterion (ORC)
The above discussion suggests that optimal feature learning is the result of a tradeoff between increasing the roundness of the feature matrix and preserving global structures in the data. In this part, we want to exploit this insight to understand and improve iterative feature learning algorithms. Common feature learning algorithms consist of transformations that are defined as minimizers of a functional. These minimizers are then often computed iteratively via a sequence of gradient-based operations. In this paper, we therefore focus on feature learning algorithms where the transformation as in (1) is the limit of a sequence of transformations , that is,
where for all ,
A prominent representative of such iterative algorithms is Sparse Filtering. Sparse Filtering consists of normalizations and the minimization of an ℓ1 criterion (see next Section). It is reasonable to assume that these operations, similar to the local changes by Statistical Whitening, preserve certain global structures of the feature matrix. In view of a tradeoff between roundness and preservation of global structures, we are therefore interested in stopping the iterations as soon as the roundness is maximized. More formally, we introduce the ORC, which serves as a stopping criterion to maximize the roundness:
Definition 2.
Let be the roundness introduced in Definition 1. The Optimal Roundness Criterion (ORC) replaces the transformation by
for
if the argmaximum is finite and otherwise.
The ORC ensures that the computations continue only as long as the roundness increases. Assuming that certain global structures are preserved by the transformations, the ORC provides an optimization scheme for the performance of iterative feature learning algorithms. One could also think of modifications of the ORC that include an additive constant or a factor to force larger increases or to allow for temporary decreases of the roundness.
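As a rough illustration, the stopping rule can be wrapped around any iterative procedure. The sketch below is one possible reading of the ORC, namely "stop at the first iteration at which the roundness no longer increases"; the helper names `run_with_orc`, `step`, and `features_of`, the assumed roundness formula, and the toy sequence of feature matrices are all hypothetical.

```python
import numpy as np

def roundness(F):
    """Assumed form of Definition 1: mean / largest eigenvalue of F F^T."""
    eigenvalues = np.linalg.eigvalsh(F @ F.T)
    return eigenvalues.mean() / eigenvalues.max()

def run_with_orc(step, state, features_of, max_iter=100):
    """Iterate `step` only as long as the roundness of the features increases.

    Returns the best state, its roundness, and the number of accepted iterations.
    """
    best_state = state
    best_roundness = roundness(features_of(state))
    for iteration in range(1, max_iter + 1):
        state = step(state)
        current = roundness(features_of(state))
        if current <= best_roundness:
            # Roundness stopped increasing: keep the previous iterate.
            return best_state, best_roundness, iteration - 1
        best_state, best_roundness = state, current
    return best_state, best_roundness, max_iter

# Toy example: the state is an iteration counter t, and the feature matrix
# is perfectly round at t = 5 and less round before and after.
toy_features = lambda t: np.diag([1.0, 1.0 + abs(t - 5)])
best_t, best_r, n_iter = run_with_orc(lambda t: t + 1, 0, toy_features)
```

In the toy example, the loop accepts five iterations and stops when the sixth one lowers the roundness. A variant with an additive constant or a factor in the comparison would implement the modifications mentioned above.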
Image Classification on CIFAR-10
For all our experiments, we use the CIFAR-10 dataset [Krizhevsky and Hinton2009] (available at http://www.cs.toronto.edu/~kriz/cifar.html). This dataset consists of color images partitioned into classes, each containing images. Each of the images comprises pixels. The dataset is split into a training set with images and a test set with images. From the training set, we randomly select patches for the unsupervised feature learning. These patches are also used to determine the parameters of Contrast Normalization and Statistical Whitening (if applied).
Random Patches
For the dictionary learning step, it was shown that simple randomized procedures combined with Statistical Whitening work surprisingly well [Coates and Ng2011, Jarrett et al.2009, Saxe et al.2011]. A popular example is Random Patches, which creates a dictionary matrix by simply stacking up randomly selected samples. In Table 1, we report the influence of Contrast Normalization and Statistical Whitening on Random Patches (cf. [Coates and Ng2011]). We see that Statistical Whitening is very beneficial for Random Patches and increases the roundness of the transformed feature matrix. This suggests that the roundness can be used as an indicator for the performance of feature learning. (Note that the roundness is on different scales for different numbers of features and can therefore not be compared for different numbers of features.)
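Table 1: Influence of Contrast Normalization (Norm.) and Statistical Whitening (White.) on Random Patches.

Num. features   Norm.   White.   Roundness   Accuracy
243             No      No       0.0041      32.50%
243             Yes     No       0.0131      63.65%
243             No      Yes      0.2080      65.01%
243             Yes     Yes      0.1548      64.34%
486             No      No       0.0021      31.67%
486             Yes     No       0.0062      66.14%
486             No      Yes      0.1134      67.08%
486             Yes     Yes      0.0965      67.84%

Table 2: Sparse Filtering on CIFAR-10 with and without application of the ORC.

Num. features   ORC   Roundness   Accuracy
243             No    0.0519      57.66%
243             Yes   0.1425      62.47%
486             No    0.0495      58.19%
486             Yes   0.0908      63.80%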
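The Random Patches procedure described in this section amounts to only a few lines of code. The following sketch assumes that samples are stored as columns and that the selected samples are stacked as rows of the dictionary; the function name, the sampling without replacement, and the sizes are illustrative choices.

```python
import numpy as np

def random_patches_dictionary(patches, n_atoms, rng):
    """Build a dictionary by stacking randomly selected samples.

    patches: array of shape (n_features, n_samples), one sample per column.
    Returns an array of shape (n_atoms, n_features), one sample per row.
    """
    indices = rng.choice(patches.shape[1], size=n_atoms, replace=False)
    return patches[:, indices].T

# Synthetic stand-in for preprocessed patches (not real data).
rng = np.random.default_rng(0)
patches = rng.normal(size=(27, 200))
D = random_patches_dictionary(patches, 10, rng)

# New feature matrix: one row per dictionary element, one column per sample.
features = D @ patches
```

Because no optimization is involved, the quality of such a dictionary depends entirely on the preprocessing of the patches, which matches the strong influence of Statistical Whitening reported in Table 1.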
Sparse Filtering
Sparse Filtering [Ngiam et al.2011] is an unsupervised feature learning algorithm that computationally scales particularly well with the dimensions. To recall the definition of Sparse Filtering, we denote by the function that first normalizes the rows of a matrix in to unit Euclidean norm and then normalizes the columns of the resulting matrix to unit Euclidean norm (we set in the corresponding operations to ensure that (2) is well defined). For any fixed matrix , we then define a matrix such that
(2) 
if the minimum is finite and otherwise. Sparse Filtering is then the transformation
(Sparse Filtering) 
However, we now show by making the normalizations explicit that these normalizations make Sparse Filtering intricate. For this, we define the rank one matrices via
and via
where is the usual Kronecker delta. This then yields the following form of Definition (2).
Theorem 3.
The matrix in (2) is the minimizer of
over all matrices .
Although Sparse Filtering is sometimes claimed to have sparsity properties due to the involvement of the ℓ1-norm (similarly to the Lasso [Tibshirani1996], for example), the above reformulation demonstrates that this is far from obvious and needs further clarification.
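To make the two normalizations concrete, the sketch below evaluates an objective of the kind described in this section: the ℓ1-norm of the feature matrix after row and then column normalization. It is an assumed reading of (2), not a transcription of it; the names, the guard `eps` (which maps zero rows or columns to zero), and the synthetic data are illustrative.

```python
import numpy as np

def normalize_rows_then_columns(M, eps=1e-12):
    """Scale rows to unit Euclidean norm, then columns of the result."""
    row_norms = np.linalg.norm(M, axis=1, keepdims=True)
    rows = M / np.maximum(row_norms, eps)    # zero rows stay zero
    col_norms = np.linalg.norm(rows, axis=0, keepdims=True)
    return rows / np.maximum(col_norms, eps)  # zero columns stay zero

def sparse_filtering_objective(D, X):
    """Assumed objective: l1-norm of the doubly normalized matrix D X."""
    return np.abs(normalize_rows_then_columns(D @ X)).sum()

# Synthetic data: 6 original features, 40 samples, 4 new features.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 40))
D = rng.normal(size=(4, 6))

N = normalize_rows_then_columns(D @ X)
value = sparse_filtering_objective(D, X)
```

Since every column of the normalized matrix has unit Euclidean norm, the objective lies between the number of samples and the number of samples times the square root of the number of new features. The dependence on `D` enters only through the normalizations, which is what makes the minimization intricate.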
It is apparent that the choice of the number of features influences the performance of Sparse Filtering. We will see below, however, that the choice of the number of iterations surprisingly can have an even larger influence. We are therefore interested in choosing an appropriate number of iterations. A standard approach would involve cross-validation over several folds, but this requires training one model per fold and is therefore computationally costly. The ORC, on the other hand, can be a computationally feasible alternative to cross-validation.
To illustrate this, we compare in Table 2 the outcomes of Sparse Filtering on the CIFAR-10 dataset with and without application of the ORC. We have also computed the intermediate outcomes of Sparse Filtering at every 20 iterations and report in Figure 3 the corresponding test accuracy, training accuracy, roundness on the training set, and correlations with the test accuracy. The roundness on the test set is basically indistinguishable from the roundness on the training set and is therefore not shown. We make three crucial observations: (i) the test accuracy of Sparse Filtering peaks at around 20 iterations and then decreases monotonically; (ii) the roundness on the training set is highly correlated with the test accuracy; in particular, the locations of the peaks of these curves coincide; (iii) the roundness on the training set is highly correlated with the roundness on the test set. These observations suggest that (i) Sparse Filtering should be stopped early; (ii) the ORC can optimize the performance of Sparse Filtering; (iii) it is sufficient to compute the roundness on the training set.
To further support these claims, we have also computed the intermediate outcomes of Sparse Filtering at every 2 iterations in the region around the peaks, that is, we have computed a zoomed-in version of Figure 3. We report the results in Figure 4. We observe that training accuracy, test accuracy, and roundness are highly correlated, which corroborates the above claims and therefore confirms the potential of the ORC. We finally note that the curves in the zoomed-in version are wiggly not only because of the randomness involved but also because computations of gradients over a small number of iterations involve numerical imprecisions.
Conclusions and Outlook
The spectral analysis of feature matrices is a novel and promising approach to feature learning. In particular, our results show that this “geometric” approach can provide new interpretations and substantial improvements of widespread feature learning tools such as Statistical Whitening, Random Patches, and Sparse Filtering. For example, we have revealed that Sparse Filtering can, quite surprisingly, deteriorate with increasing number of iterations and can be made considerably faster and more accurate by early stopping according to the spectrum of the intermediate feature matrices.
Regarding the theory, it would be of interest to obtain, for specific procedures, predictions on how the roundness changes with the iterations and to what it converges in the limit.
In an extended version of this paper, we are planning to include an analysis of roundness in Convolutional Neural Networks (CNNs) [Fukushima1980]. After being neglected for many years, CNNs have received a great deal of attention recently, see [Krizhevsky, Sutskever, and Hinton2012, Girshick et al.2013] and many others. We therefore expect that the application of our approach to CNNs can be of substantial interest.
Acknowledgments
We thank the reviewers for their insightful comments.
Appendix: Proofs
We denote in the following the columns and rows of any matrix by and , respectively.
Proof of Theorem 1.
The matrix
is symmetric and can therefore be diagonalized. This implies that there is an orthogonal matrix
such that the diagonal entries of are . For this matrix , it then holds
Next, we invoke the cyclic property of the trace and the orthogonality of the matrix to obtain
Finally, we note that the normalization of the columns of yields
for all . The desired result
can now be derived by combining the three displays. ∎
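For orientation, the chain of equalities behind this argument can be written out as follows. The notation is an assumption made for this sketch and is not taken from the displays of the proof: A is a matrix with p rows and n unit-norm columns a_1, ..., a_n, its Gram matrix is A Aᵀ with eigenvalues σ_1 ≥ ... ≥ σ_p and mean eigenvalue σ̄, and U is the orthogonal matrix that diagonalizes the Gram matrix.

```latex
% Assumed notation: A \in \mathbb{R}^{p \times n} with columns a_1, \dots, a_n,
% \lVert a_j \rVert_2 = 1 for all j, and U orthogonal with
% U^\top A A^\top U = \operatorname{diag}(\sigma_1, \dots, \sigma_p).
\begin{align*}
\bar\sigma
  &= \frac{1}{p}\sum_{i=1}^{p}\sigma_i
   = \frac{1}{p}\operatorname{tr}\bigl(U^\top A A^\top U\bigr)
   && \text{(diagonalization)} \\
  &= \frac{1}{p}\operatorname{tr}\bigl(A A^\top U U^\top\bigr)
   = \frac{1}{p}\operatorname{tr}\bigl(A A^\top\bigr)
   && \text{(cyclic property, orthogonality)} \\
  &= \frac{1}{p}\operatorname{tr}\bigl(A^\top A\bigr)
   = \frac{1}{p}\sum_{j=1}^{n}\lVert a_j\rVert_2^2
   = \frac{n}{p}
   && \text{(unit-norm columns)}
\end{align*}
```

Under these assumptions, the mean eigenvalue depends only on the dimensions of the matrix and not on its entries, which is the sense in which it acts as a normalization in the definition of roundness.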
Proof of Theorem 3.
We first show that for a matrix , the corresponding matrix with normalized rows can be written as
(Claim 1) 
To this end, we observe that the normalization of the rows of the matrix corresponds to the matrix multiplication
where is the diagonal matrix with nonzero entries
Next, we note that
and therefore
This yields the matrix equation
and therefore
This proves the first claim.
We now show that for a matrix , the corresponding matrix with normalized columns is given by
(Claim 2) 
We first note that we can write the normalization step, this time for the columns, as the matrix multiplication
where is the diagonal matrix with entries
Next, we note that
and therefore for the inverse diagonal matrix
This yields the matrix equation
and therefore
This proves the second claim.
We now consider for an arbitrary matrix and apply Claim 1 and Claim 2: Setting , we obtain from Claim 1 that normalizing the rows of the matrix yields the matrix given by
This implies in particular
Setting then , we obtain from Claim 2 and the two previous displays that the matrix becomes after normalizing its rows and then its columns the matrix
The desired result can then be deduced from the definition of in (2). ∎
References

[Agarwal and Triggs2006] Agarwal, A., and Triggs, B. 2006. Hyperfeatures–multilevel local coding for visual recognition. In European Conference for Computer Vision (ECCV), 30–43.
[Coates and Ng2011] Coates, A., and Ng, A. 2011. The importance of encoding versus training with sparse coding and vector quantization. In 28th International Conference on Machine Learning (ICML ’11), 921–928.
[Coates, Ng, and Lee2011] Coates, A.; Ng, A.; and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 215–223.
[Eldar and Kutyniok2012] Eldar, Y., and Kutyniok, G., eds. 2012. Compressed Sensing: Theory and Applications. Cambridge University Press.
[Fukushima1980] Fukushima, K. 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4):193–202.
[Girshick and Malik2013] Girshick, R., and Malik, J. 2013. Training deformable part models with decorrelated features. In Proceedings of the International Conference on Computer Vision (ICCV).
[Girshick et al.2013] Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2013. Rich feature hierarchies for accurate object detection and semantic segmentation. preprint arxiv:1311.2524.
[Jarrett et al.2009] Jarrett, K.; Kavukcuoglu, K.; Ranzato, M.; and LeCun, Y. 2009. What is the Best Multi-Stage Architecture for Object Recognition? In IEEE International Conference on Computer Vision (ICCV), 2146–2153.
[Jia, Huang, and Darrell2012] Jia, Y.; Huang, C.; and Darrell, T. 2012. Beyond spatial pyramids: Receptive field learning for pooled image features. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 3370–3377.
[Krizhevsky and Hinton2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Tech. Rep. Computer Science Department, University of Toronto.
[Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 1097–1105.
[Le2013] Le, Q. 2013. Building high-level features using large scale unsupervised learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8595–8598. IEEE.
[LeCun, Huang, and Bottou2004] LeCun, Y.; Huang, F.; and Bottou, L. 2004. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition (CVPR), II–97.
[Ngiam et al.2011] Ngiam, J.; Koh, P. W.; Chen, Z.; Bhaskar, S.; and Ng, A. 2011. Sparse filtering. In Advances in Neural Information Processing Systems (NIPS), 1125–1133.
[Saxe et al.2011] Saxe, A.; Koh, P.; Chen, Z.; Bhand, M.; Suresh, B.; and Ng, A. 2011. On random weights and unsupervised feature learning. In 28th International Conference on Machine Learning (ICML ’11), 1089–1096.
[Tibshirani1996] Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58(1):267–288.