1 Introduction
Classification of a vector of real numbers (called “feature activations”) into one of several discrete categories is well established and well studied, with generally satisfactory solutions such as the ubiquitous multinomial logistic regression reviewed, for example, by
Hastie et al. (2009). However, the canonical classification stage may not couple well with generation of the feature activations via convolutional networks (convnets) trained using stochastic gradient descent, as discussed, for example, by LeCun et al. (1998). Fitting (also known as learning or training) the combination of the convnet and the classification stage by minimizing the cost/loss/objective function associated with the classification suggests designing a stage specifically for use in such joint fitting/learning/training. In particular, the convnets presented in the present paper are “equivariant” with respect to scalar multiplication: multiplying the input values by any real number multiplies the output by the same factor. The present paper leverages this equivariance via a “scale-invariant” classification stage, that is, a stage for which multiplying the input values by any nonzero real number leaves the output unchanged. The scale-invariant classification stage turns out to be more robust to outliers (including obviously mislabeled data), fits/learns/trains precisely at the rate that the user specifies, and apparently results in slightly lower errors on several standard test sets when used in conjunction with some typical convnets for generating the feature activations. The computational costs are comparable to those of multinomial logistic regression. Similar classification stages have been introduced earlier in other contexts by
Hill & Doucet (2007), Lange & Wu (2008), Wu & Lange (2010), Saberian & Vasconcelos (2011), Mroueh et al. (2012), Wu & Wu (2012), and others. Complementary normalization includes the work of Carandini & Heeger (2012), Ioffe & Szegedy (2015), and the associated references. The key to effective learning is rescaling, as described in Section 3 below (see especially the last paragraph there). This rescaled learning, while necessary for training convolutional networks, is unnecessary in the aforementioned earlier works. The remainder of the present paper has the following structure: Section 2 sets the notation. Section 3 introduces the scale-invariant classification stage. Section 4 analyzes its robustness. Section 5 illustrates the performance of the classification on several standard data sets. Section 6 concludes the paper.
2 Notational conventions
All numbers used in the classification stage will be real valued (though the numbers used for generating the inputs to the stage may in general be complex valued). We follow the recommendations of Magnus & Neudecker (2007): all vectors are column vectors (aside from gradients of a scalar with respect to a column vector, which are row vectors), and we use ‖x‖ to denote the Euclidean norm of a vector x; that is, ‖x‖ is the square root of the sum of the squares of the entries of x. We use ‖A‖₂ to denote the spectral norm of a matrix A; that is, ‖A‖₂ is the greatest singular value of A, which is also the maximum of ‖Ax‖ over every vector x such that ‖x‖ = 1. The terminology “Frobenius norm” of A refers to the square root of the sum of the squares of the entries of A. The spectral norm of a vector viewed as a matrix having only one column or one row is the same as the Euclidean norm of the vector; the Euclidean norm of a matrix viewed as a vector is the same as the Frobenius norm of the matrix.

3 A scale-invariant classification stage
We study a linear classification stage that assigns one of k classes to each real-valued vector x of feature activations (together with a measure of confidence in its classification), with the assignment being independent of the Euclidean norm ‖x‖ of x; the Euclidean norm of x is its “scale.” We associate to the k classes target vectors q_1, q_2, …, q_k that are the vertices of either a standard simplex or a regular simplex embedded in a Euclidean space of dimension m; taking the dimension m of the embedding space strictly greater than the minimum (k − 1) required to contain the simplex gives extra space to help facilitate learning. Hill & Doucet (2007), Lange & Wu (2008), Wu & Lange (2010), Saberian & Vasconcelos (2011), Mroueh et al. (2012), and Wu & Wu (2012) (amongst others) discuss these simplices and their applications to classification. For the standard simplex, the targets are just the standard basis vectors, each of which consists of zeros for all but one entry. For both the regular and standard simplices,

(1)  ‖q_j‖ = 1  for j = 1, 2, …, k.
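As an aside from the text, unit-norm targets satisfying (1) can be generated, for example, as in the following Python/NumPy sketch (an illustration, not the paper's code; `simplex_targets` is a hypothetical helper name, and this simple construction uses k coordinates, so it requires m ≥ k, which the paper's choice m > k − 1 permits):

```python
import numpy as np

def simplex_targets(k, m, regular=True):
    """Rows are unit-norm targets q_1, ..., q_k in R^m, per eq. (1)."""
    assert m >= k, "this construction embeds the simplex via k coordinates"
    Q = np.zeros((k, m))
    if regular:
        C = np.eye(k) - np.full((k, k), 1.0 / k)      # center the standard basis
        C /= np.linalg.norm(C, axis=1, keepdims=True)  # normalize: eq. (1)
        Q[:, :k] = C                                   # regular simplex vertices
    else:
        Q[:, :k] = np.eye(k)                           # standard simplex vertices
    return Q
```

For the regular simplex the rows have unit norm and all pairwise distances are equal; for the standard simplex the rows are the standard basis vectors.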
Given an input vector x of feature activations, we identify the target vector that is nearest in the Euclidean distance to

(2)  y = z / ‖z‖,

where

(3)  z = Ax

for an m × n matrix A determined via learning as discussed shortly. The index j such that ‖y − q_j‖ is minimal is the index of the class to which we assign x. The classification is known as “linear” or “multilinear” due to (3). The index to which we assign x is clearly independent of the Euclidean norm of x due to (2), and the assignment is “scale-invariant”: it is unchanged if we rescale x by a nonzero scalar multiple.
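A minimal sketch of this assignment rule in Python/NumPy (an illustration under the definitions (2) and (3), not the paper's Torch7 code):

```python
import numpy as np

def classify(A, x, Q):
    """Assign x to the class whose target (a row of Q) is nearest to y."""
    z = A @ x                            # eq. (3)
    y = z / np.linalg.norm(z)            # eq. (2): unit-norm "direction" of Ax
    return int(np.argmin(np.linalg.norm(Q - y, axis=1)))
```

Because y in (2) is unchanged when x (or A) is multiplied by a positive scalar, `classify(A, 3.7 * x, Q)` returns the same index as `classify(A, x, Q)`.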
To determine A, we first initialize all its entries to random numbers, then divide each entry by the Frobenius norm of A and multiply by the square root of m, the number of rows in A. We then conduct iterations of stochastic gradient descent as advocated by LeCun et al. (1998), updating A to Ã on each iteration via

(4)  Ã = A − η ∂c/∂A,

where η is a positive real number (known as the “learning rate” or “step length”) and c is the cost to be minimized that is associated with an input chosen at random from among the input samples, together with its associated vector x of feature activations,

(5)  c = ‖q_j − y‖²,

where q_j is the target for the correct class associated with x, and y is the vector-valued function of x specified in (2) and (3).
As elaborated by LeCun et al. (1998), usually we combine stochastic gradient descent with backpropagation to update the entries of A associated with the chosen input, which requires propagating the gradient back into the network generating the feature activations that are the entries of x for the chosen input sample. We use the same learning rate η from the classification stage throughout the network generating the feature activations. Fortunately, a straightforward calculation shows that the Euclidean norm of the gradient ∂c/∂x is bounded independent of the scaling of A:

(6)  ‖∂c/∂x‖ ≤ 4 ‖A‖₂ / ‖Ax‖;

please note that scaling the matrix A by any nonzero scalar multiple has no effect on the right-hand side of (6): the gradient propagating in backpropagation is independent of the size of A.
Critically, after every update as in (4), we rescale the matrix A: we divide every entry by the Frobenius norm of A and multiply by the square root of m, the number of rows in A. We use the rescaled matrix for subsequent iterations of stochastic gradient descent. Rescaling A yields precisely the same vector y in (2) and cost c in (5); together with the scale-invariance of the right-hand side of (6), rescaling ensures that the stochastic gradient iterations are effective and numerically stable for any learning rate η.
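One update (4) on the cost (5), followed by the rescaling just described, can be sketched as follows in Python/NumPy (an illustrative reimplementation under the paper's definitions, with the gradient of (5) with respect to A written out explicitly):

```python
import numpy as np

def sgd_step(A, x, q, lr):
    """One stochastic gradient step (4) on c = ||q - y||^2 (5),
    then rescale A to Frobenius norm sqrt(m), m = number of rows."""
    z = A @ x
    nz = np.linalg.norm(z)
    y = z / nz                                            # eq. (2)
    # gradient of c with respect to z: dy/dz = (I - y y^T)/||z||, so
    # dc/dz = 2 (I - y y^T)(y - q)/||z||; then dc/dA = (dc/dz) x^T
    g_z = 2.0 * (np.eye(len(y)) - np.outer(y, y)) @ (y - q) / nz
    A = A - lr * np.outer(g_z, x)                         # eq. (4)
    m = A.shape[0]
    A *= np.sqrt(m) / np.linalg.norm(A)                   # rescale after the step
    return A
```

The rescaling leaves y in (2), and hence the cost c in (5), unchanged, so only the effective step size is controlled.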
4 Robustness
Combining (1) and the fact that the Euclidean norm of y from (2) is 1 yields that the cost c from (5) satisfies

(7)  0 ≤ c ≤ 4.

As reviewed by Hastie et al. (2009), the cost associated with classification via multinomial logistic regression is

(8)  c = −ln( exp(z_j) / (exp(z_1) + exp(z_2) + ⋯ + exp(z_k)) ),

where j is the index among 1, 2, …, k of the correct class, and z_1, z_2, …, z_k are the entries of the vector z from (3), with m = k for multinomial logistic regression. Whereas the cost in (5) is bounded as in (7), the cost in (8) is bounded only for positive values of z_j, growing linearly for negative z_j. Thus, the cost in (5) is more robust than that in (8) to outliers; logistic regression is less robust to outliers (including obviously mislabeled inputs).
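A small numeric illustration (Python/NumPy, not from the paper) of the two costs on an outlier-like input:

```python
import numpy as np

def cost_simplex(z, q):
    """Cost (5): squared distance from y = z/||z|| to the unit-norm
    target q; by (7), this always lies in [0, 4]."""
    y = z / np.linalg.norm(z)
    return float(np.sum((q - y) ** 2))

def cost_logistic(z, j):
    """Cost (8): negative log of the softmax probability of class j,
    computed with the usual max-shift for numerical stability."""
    m = z.max()
    return float(np.log(np.sum(np.exp(z - m))) - (z[j] - m))
```

Driving the correct entry z_j far negative, as an extreme (or mislabeled) sample might, leaves `cost_simplex` at most 4 while `cost_logistic` grows without bound.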
5 Numerical experiments
The present section provides a brief empirical evaluation of rescaling in comparison with the usual multinomial logistic regression, performing the learning for both via stochastic gradient descent (the learning is end-to-end, training the entire network — including both the convolutional network and the classification stage — jointly, with the same learning rate everywhere). The experiments (and corresponding figures) consider various choices for the learning rate and for the dimension of the space containing the simplex targets. When rescaling (but not with the multinomial logistic regression), we renormalize the parameters in the classification stage after every minibatch of 100 samples, as detailed in the last paragraph of Section 3 and the penultimate paragraph of the present section. The rescaled approach appears to perform somewhat better than multinomial logistic regression in all the experiments detailed in the present section except that of Figure 2. The remainder of the present section provides the details.
Following LeCun et al. (1998), the architectures for generating the feature activations are convolutional networks (convnets) consisting of a series of stages, with each stage feeding its output into the next (except for the last, which feeds into the classification stage). Each stage convolves each image from its input against several learned convolutional kernels, sums together the convolved images from all the inputs into several output images, then takes the absolute value of each pixel of each resulting image, and finally averages over each patch in a partition of each image into a grid of patches. All convolutions are complex valued and produce pixels only where the original images cover all necessary inputs (that is, a convolution reduces each dimension of the image by one less than the size of the convolutional kernel). We subtract the mean of the pixel values from each input image before processing with the convnets, and we append an additional feature activation to those obtained from the convnets, namely the standard deviation of the set of values of the pixels in the image. For each data set, we use two network architectures, where the second is a somewhat smaller variant of the first. We consider three data sets whose training properties are reasonably straightforward to investigate, with each set consisting of 10 classes of images; the first two are the usual CIFAR10 and MNIST of Krizhevsky (2009) and LeCun et al. (1998). The third is a subset of the 2012 ImageNet data set of Russakovsky et al. (2015), retaining 10 classes of images, representing each class by 100 samples in a training set and 50 per class in a testing set. CIFAR10 contains 50,000 images in its training set and 10,000 images in its testing set. MNIST contains 60,000 images in its training set and 10,000 images in its testing set. The images in the MNIST set are grayscale. The images in both the CIFAR10 and ImageNet sets are full color, with three color channels. We neither augmented the input data nor regularized the cost/loss functions. We used the Torch7 platform — http://torch.ch — for all computations.

Tables 1–4 display the specific configurations we used. “Stage” specifies the positions of the indicated layers in the convnet. “Input images” specifies the number of images input to the given stage for each sample from the data. “Output images” specifies the number of images output from the given stage. Each input image is convolved against a separate, learned convolutional kernel for each output image (with the results of all these convolutions summed together for each output image). “Kernel size” specifies the size of the square grid of pixels used in the convolutions. “Input image size” specifies the size of the square grid of pixels constituting each input image. “Output image size” specifies the size of the square grid of pixels constituting each output image. Tables 1 and 2 display the two configurations used for processing both CIFAR10 and MNIST. Tables 3 and 4 display the two configurations used for processing the subset of ImageNet described above.
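The per-stage processing described above (complex-valued “valid” convolution, pixelwise absolute value, patchwise averaging) can be sketched as follows in Python/NumPy; this is an illustration of the description only, not the Torch7 code actually used, and the inner loops compute cross-correlation (flipping each kernel would give true convolution):

```python
import numpy as np

def stage(images, kernels, grid):
    """One convnet stage (a sketch).
    images:  (n_in, h, w) complex input images.
    kernels: (n_out, n_in, s, s) complex learned kernels.
    grid:    side of the grid of patches averaged over at the end."""
    n_out, n_in, s, _ = kernels.shape
    h = images.shape[1] - s + 1          # 'valid' output: each dimension
    w = images.shape[2] - s + 1          # shrinks by s - 1
    out = np.zeros((n_out, h, w), dtype=complex)
    for o in range(n_out):
        for i in range(n_in):            # sum over input images
            for a in range(h):
                for b in range(w):
                    out[o, a, b] += np.sum(images[i, a:a+s, b:b+s] * kernels[o, i])
    out = np.abs(out)                    # pixelwise absolute value
    ph, pw = h // grid, w // grid        # average over each patch of the grid
    return out[:, :ph*grid, :pw*grid].reshape(n_out, grid, ph, grid, pw).mean(axis=(2, 4))
```

The result is a real-valued (n_out, grid, grid) array, which the next stage (or the classification stage) consumes.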
Figures 1–6 plot the accuracies attained by the different schemes for classification while varying η from (4) (η is the “learning rate,” as well as the length of the learning step relative to the magnitude of the gradient) and varying the dimension m of the space containing the simplex targets; m is the number of rows in the matrix A from (3) and (4). In each figure, the top panel — that labeled “(a)” and “rescaled” — plots the error rates for classification using rescaling, with the targets being the vertices on the hypersphere of a regular simplex; the middle panel — that labeled “(b)” and “logistic” — plots the error rates for classification using multinomial logistic regression; the bottom panel — that labeled “(c)” and “best of both” — plots the error rates for the best-performing instance from the top panel (a) together with the best-performing instance from the middle panel (b). All error rates refer to performance on the test set. The label “epoch” for the horizontal axes refers, as usual, to the number of training sweeps through the data set, as reviewed in the coming paragraph.
As recommended by LeCun et al. (1998), we learn via (minibatched) stochastic gradient descent, with 100 samples per minibatch; rather than updating the parameters being learned for randomly selected individual images from the training set exactly as in Section 3, we instead randomly permute the training set, partition this permuted set of images into subsets of 100 (known as “minibatches”), and update the parameters simultaneously for all 100 images constituting each subset, processing the minibatches one after another. Each sweep through the entire training set is known as an “epoch.” The horizontal axes in the figures count the number of epochs.
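The epoch loop just described can be sketched as follows (Python/NumPy, an illustration only; `update` stands in for the per-minibatch parameter update of Section 3, averaged over the minibatch):

```python
import numpy as np

def run_epochs(X, labels, params, update, n_epochs, batch=100, seed=0):
    """Minibatched SGD driver: each epoch, randomly permute the training
    set, partition it into minibatches of `batch` samples, and apply
    `update` once per minibatch."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    for _ in range(n_epochs):
        perm = rng.permutation(n)                 # fresh shuffle each epoch
        for start in range(0, n - batch + 1, batch):
            idx = perm[start:start + batch]       # one minibatch of indices
            params = update(params, X[idx], labels[idx])
    return params
```

Each pass of the outer loop is one epoch; the figures' horizontal axes count these passes.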
In the experiments of the present section, the accuracies attained using the scale-invariant classification stage are comparable to (if not better than) those attained using the usual multinomial logistic regression. Running the experiments with several different random seeds produces entirely similar results. The scale-invariant classification stage is stable for all values of η, that is, for all learning rates.
6 Conclusion
Combining [1] a convolutional network that is equivariant to scalar multiplication, [2] a classification stage that is invariant to scalar multiplication, and [3] the rescaled learning of the last paragraph of Section 3 fully realizes and leverages invariance to scalar multiplication. This combination is more robust to outliers (including obviously mislabeled data) than the standard multinomial logistic regression “softmax” classification scheme, results in marginally better errors on several standard test sets, and fits/learns/trains precisely at the user-specified rate, all while costing about the same computationally. The attained invariance is clean and convenient — a good goal all on its own.
Acknowledgments
We would like to thank Léon Bottou and Rob Fergus for critical contributions to this project. The reviewers also helped immensely in improving the paper.
Table 1: the first configuration for CIFAR10 and MNIST

Stage    Input images    Output images    Kernel size    Input image size    Output image size
first    †               16
second   16              128
third    128             1024
Table 2: the second configuration for CIFAR10 and MNIST

Stage    Input images    Output images    Kernel size    Input image size    Output image size
first    †               16
second   16              64
third    64              256
Table 3: the first configuration for the subset of ImageNet

Stage    Input images    Output images    Kernel size    Input image size    Output image size
first    3               16
second   16              64
third    64              256
fourth   256             1024
Table 4: the second configuration for the subset of ImageNet

Stage    Input images    Output images    Kernel size    Input image size    Output image size
first    3               16
second   16              64
third    64              256
fourth   256             256
References
Carandini & Heeger (2012) Carandini, Matteo and Heeger, David J. Normalization as a canonical neural computation. Nature Reviews Neurosci., 13(1):51–62, 2012.
 Hastie et al. (2009) Hastie, Trevor, Tibshirani, Robert, and Friedman, Jerome. Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009.
Hill & Doucet (2007) Hill, Simon I. and Doucet, Arnaud. A framework for kernel-based multicategory classification. J. Artificial Intel. Research, 30:525–564, 2007.
 Ioffe & Szegedy (2015) Ioffe, Sergey and Szegedy, Christian. Batch normalization: accelerating deep network training by reducing internal covariate shift. Technical Report 1502.03167, arXiv, 2015.
Krizhevsky (2009) Krizhevsky, Alex. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto Department of Computer Science, 2009.
 Lange & Wu (2008) Lange, Kenneth and Wu, Tong Tong. An MM algorithm for multicategory vertex discriminant analysis. J. Comput. Graph. Statist., 17(3):527–544, 2008.
LeCun et al. (1998) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
 Magnus & Neudecker (2007) Magnus, Jan R. and Neudecker, Heinz. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley and Sons, 3rd edition, 2007.
Mroueh et al. (2012) Mroueh, Youssef, Poggio, Tomaso, Rosasco, Lorenzo, and Slotine, Jean-Jacques. Multiclass learning with simplex coding. In Advances in Neural Information Processing Systems, volume 25, pp. 2789–2797. Curran Associates, 2012.
Russakovsky et al. (2015) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet large scale visual recognition challenge. Technical Report 1409.0575v3, arXiv, 2015.
 Saberian & Vasconcelos (2011) Saberian, Mohammad J. and Vasconcelos, Nuno. Multiclass boosting: theory and algorithms. In Advances in Neural Information Processing Systems, volume 24, pp. 2124–2132. Curran Associates, 2011.

Wu & Lange (2010) Wu, Tong Tong and Lange, Kenneth. Multicategory vertex discriminant analysis for high-dimensional data. Annals Appl. Statist., 4(4):1698–1721, 2010.
 Wu & Wu (2012) Wu, Tong Tong and Wu, Yichao. Nonlinear vertex discriminant analysis with reproducing kernels. Statist. Anal. Data Mining, 5(2):167–176, 2012.