Filter-level sparsity, commonly known as “dying kernels” or “dying ReLUs”, is a common phenomenon in Convolutional Neural Networks (CNNs). This phenomenon has been attributed to adaptive optimizers (Adam, Adagrad, Adadelta), high L2 regularization or weight decay, and the use of the ReLU activation function (mehta2019implicit; lu2019dying; yaguchi2018adam). Attempts to mitigate the dying ReLU issue with Leaky ReLUs have not been entirely effective, with only a minor overall impact on the emergent sparsity (mehta2019implicit).
While kernel sparsity or pruning can benefit CNNs through faster inference and smaller models (han2015deep; molchanov2016pruning; li2016pruning; gale2019state), sparsity in the lower layers could be detrimental, as the first few layers, especially the first layer, are “sensitive to initialization” and critical to forming good predictions (zhang2019all). Since the initial layers learn simple patterns and subsequent layers build upon them to develop complex features, sparsity in the initial layers would decrease model capacity.
Besides sparsity, kernel diversity has garnered much attention: Zeiler et al. proposed a set of best practices to improve upon AlexNet with different kernel sizes, stride lengths and feature-scale clipping (zeiler2014visualizing) to prevent specific filters from “dominating”. Capturing distinctive features, especially in lower-level layers, ensures that each filter learns something different and increases the computational capacity that the CNN brings to bear on the classification problem (graff2016correlating).
Here, we seek to address two problems, namely sparsity and lack of diversity of kernels, especially for the most important first few layers. We propose Grassmannian subspace packings as an initialization that projects channel information onto a space where the basis vectors are maximally distant, in order to capture as much diversity as possible and increase model capacity. We refer to such layers as “channel expander” layers. We demonstrate that models with first-layer Grassmannian kernels achieve better accuracy and a faster convergence rate, with diverse, non-sparse kernels that improve model capacity.
2 Grassmannian line and subspace packings
Grassmannian subspace packing (conway1996packing) in experimental mathematics and information theory has been used in beamforming techniques for physical layer wireless communications, limited feedback codebook-based downlink beamforming schemes and sparse signal reconstruction (malioutov2005sparse; love_1; prabhu2009performance).
Set-theoretic definition. The real Grassmann manifold $\mathcal{G}(k, M)$ is defined as the set of all $k$-dimensional subspaces in $\mathbb{R}^M$, where $k \le M$. The Grassmannian $N$-subspace packing problem is the problem of finding a set of $N$ $k$-dimensional subspaces in $\mathbb{R}^M$ that maximizes the minimum pairwise distance between subspaces in the set under some metric (dhillon2008constructing). We define the pairwise distance between any two subspaces in $\mathcal{G}(k, M)$ in terms of principal angles.
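Principal angles. Suppose that $S$ and $T$ are two $k$-dimensional subspaces of $\mathbb{R}^M$. These subspaces are inclined against one another by $k$ different principal angles, with the smallest principal angle $\theta_1$ formed by a pair of unit-length vectors $(s_1, t_1)$ drawn from $S \times T$, where
$$\theta_1 = \min_{(s_1, t_1) \in S \times T} \arccos \langle s_1, t_1 \rangle \quad \text{subject to} \quad \|s_1\|_2 = \|t_1\|_2 = 1,$$
with $\langle s, t \rangle$ denoting the inner product of vectors $s$ and $t$, and $\|\cdot\|_2$ denoting the norm such that $\|s\|_2 = \sqrt{\langle s, s \rangle}$.
The second principal angle $\theta_2$ is defined as the smallest angle obtained by a pair of unit-length vectors $(s_2, t_2)$ from the same subspaces that is orthogonal to the first pair, where
$$\theta_2 = \min_{(s_2, t_2) \in S \times T} \arccos \langle s_2, t_2 \rangle \quad \text{subject to} \quad \|s_2\|_2 = \|t_2\|_2 = 1, \; \langle s_1, s_2 \rangle = \langle t_1, t_2 \rangle = 0.$$
The remaining principal angles are defined similarly. The sequence of principal angles is non-decreasing and bounded in $[0, \pi/2]$.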
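The definition above can be evaluated numerically without solving the nested minimizations directly. A minimal sketch, assuming each subspace is given as a matrix whose columns span it: after orthonormalizing both bases, the singular values of their cross product are the cosines of the principal angles. The function name `principal_angles` and the example matrices are illustrative and not taken from the original work.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between span(A) and span(B).

    A and B are (M, k) arrays whose columns span each subspace.
    """
    # Orthonormalize each basis so that singular values become cosines.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are cos(theta_i), returned in descending order,
    # so the resulting angles are in ascending order.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# Two planes in R^3 that share the x-axis and are tilted by 45 degrees about it.
S = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
T = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
angles = principal_angles(S, T)
```

In this example the shared direction gives a first angle of (numerically) zero, and the tilt gives a second angle of 45 degrees.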
Optimizing Grassmannians using principal angles. For a packing problem of $N$ subspaces, let $\{S_1, \dots, S_N\}$ denote the set of $k$-dimensional subspaces in $\mathbb{R}^M$. We find the set for which the minimum distance between any pair $S_i, S_j$ with $i \ne j$ is maximized. Two popular distance metrics are the chordal distance and the Fubini-Study distance, which are defined below. With the restriction that , for two $k$-dimensional subspaces $S$ and $T$,
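For reference, the two metrics are commonly written in terms of the principal angles as follows. This is the standard form from the packing literature (conway1996packing; dhillon2008constructing), stated under the assumption that $\theta_1, \dots, \theta_k$ are the principal angles between $S$ and $T$ and that $U_S$, $U_T$ are orthonormal bases for them; the symbols $d_c$, $d_{FS}$, $U_S$ and $U_T$ are notation introduced here.

```latex
d_c(S, T) = \sqrt{\sum_{i=1}^{k} \sin^2 \theta_i}
          = \sqrt{k - \lVert U_S^{\top} U_T \rVert_F^2},
\qquad
d_{FS}(S, T) = \arccos\!\left( \prod_{i=1}^{k} \cos \theta_i \right)
             = \arccos \left| \det\!\left( U_S^{\top} U_T \right) \right| .
```

The second equality in each case follows because the singular values of $U_S^{\top} U_T$ are the cosines $\cos \theta_i$.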
For our experiments, we generate a codebook matrix for subspace packing based on the Fubini-Study distance, which gave better experimental results (medra2014flexible; see https://www.mathworks.com/matlabcentral/fileexchange/41652-grassmannian-design-package). We first generate a codebook matrix, . is a rank- codebook characterized by the minimum distance of packing ,
We seek to find that maximizes ,
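The max-min objective can be illustrated with a deliberately naive search. The sketch below is not the optimization used by the referenced design package: it only draws random codebooks and keeps the one whose smallest pairwise Fubini-Study distance is largest. All function names, the trial count and the small problem size are hypothetical choices made for illustration.

```python
import numpy as np

def fubini_study(U, V):
    """Fubini-Study distance between subspaces with orthonormal bases U, V."""
    det = abs(np.linalg.det(U.T @ V))
    return float(np.arccos(min(det, 1.0)))  # guard against rounding above 1

def min_pairwise_distance(codebook):
    """Smallest Fubini-Study distance over all pairs in the codebook."""
    n = len(codebook)
    return min(fubini_study(codebook[i], codebook[j])
               for i in range(n) for j in range(i + 1, n))

def random_codebook(n, m, k, rng):
    """n random k-dimensional subspaces of R^m as orthonormal (m, k) bases."""
    return [np.linalg.qr(rng.standard_normal((m, k)))[0] for _ in range(n)]

def random_search_packing(n, m, k, trials=200, seed=0):
    """Keep the random codebook with the largest minimum pairwise distance."""
    rng = np.random.default_rng(seed)
    best, best_delta = None, -1.0
    for _ in range(trials):
        candidate = random_codebook(n, m, k, rng)
        delta = min_pairwise_distance(candidate)
        if delta > best_delta:
            best, best_delta = candidate, delta
    return best, best_delta

# Small illustrative problem: 4 two-dimensional subspaces of R^6.
codebook, delta = random_search_packing(n=4, m=6, k=2)
```

A dedicated packing algorithm would replace the random draws with an iterative optimization, but the quantity being maximized is the same minimum pairwise distance.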
The Rankin bound (barg2002bounds) provides the theoretical upper bound of the maximum of the minimum pairwise distances between subspaces and is given by,
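A commonly quoted form of the bound, assuming the chordal distance $d_c$ and a packing of $N$ subspaces of dimension $k$ in $\mathbb{R}^M$, is given below (conway1996packing; barg2002bounds). Whether this is the exact variant intended here is an assumption, since the experiments above use the Fubini-Study distance.

```latex
% Simplex bound
\min_{i \neq j} d_c^2(S_i, S_j) \;\le\; \frac{k (M - k)}{M} \cdot \frac{N}{N - 1},
% Orthoplex bound, valid for N > M(M+1)/2
\qquad
\min_{i \neq j} d_c^2(S_i, S_j) \;\le\; \frac{k (M - k)}{M}
\quad \text{for } N > \tfrac{1}{2} M (M + 1).
```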
Grassmannian packing seeks to answer "what is the best way to pack N k-dim subspaces in an M-dim space?". For the first convolutional kernel with width = 3, height = 3 (9 parameters per channel), input channels = 3 and output channels = 32, we ask:
"How can we optimally pack nine 3-dim subspaces in a 32-dim space,
such that any two subspaces (kernels) are maximally separated?"
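One way to read this question in terms of tensor shapes is sketched below. The mapping from codebook to kernel is an assumption made for illustration: each of the nine packed subspaces is taken to be an orthonormal 32 x 3 basis (output channels by input channels) and is assigned to one of the nine spatial positions of the 3 x 3 kernel. The codebook here is a random placeholder, not an optimized packing, and the (out, in, height, width) layout follows the usual convolution weight convention.

```python
import numpy as np

rng = np.random.default_rng(0)
out_ch, in_ch, kh, kw = 32, 3, 3, 3

# Placeholder codebook: nine random orthonormal (32, 3) bases.
# A real run would load an optimized Grassmannian packing here instead.
codebook = np.stack([
    np.linalg.qr(rng.standard_normal((out_ch, in_ch)))[0]
    for _ in range(kh * kw)
])  # shape (9, 32, 3)

# Move the packing index to the last axis and unfold it into the 3x3 grid,
# so that weight[:, :, h, w] is the basis of packing element h * kw + w.
weight = codebook.transpose(1, 2, 0).reshape(out_ch, in_ch, kh, kw)
```

The resulting array has the 32 x 3 x 3 x 3 = 864 entries of the first convolutional layer and could be copied into that layer's weight tensor, either frozen or left trainable.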
We first generate Grassmannian codebooks for subspace packings, and load the codebook as CNN kernels. We compare the accuracy and rate of convergence of shallow CNNs and deep CNNs. For a shallow 4-conv CNN, we trained three models: a Kaiming-initialized baseline, and the same CNN with a Grassmannian first layer, either frozen or trainable. These models had their first layer initialized as 32 packings of for CIFAR10 and CIFAR100, and 32 packings of for single-channel MNIST and KMNIST. (The first conv layer has parameters (width 3, height 3, 1 or 3 input channels and 32 output channels).)
For deep CNNs, we used the ResNet18 architecture and initialized the first layer with 64 packings of . (The first layer of ResNets are conv kernels (width 7, height 7, 3 input channels and 64 output channels).) We trained ResNet with a batch size of 64 and an SGD optimizer with a learning rate of 0.01 and momentum of 0.9, with a learning rate decay by a factor of every 10 epochs (he2016deep). Besides the Kaiming-initialized baseline, the first-layer Grassmannian model and the first-layer frozen Grassmannian model, we also scaled the Grassmannian kernels as in Kaiming initialization, so that the activations are correctly scaled, which facilitates learning without vanishing gradients (he2015delving). We scale the Grassmannian basis such that the magnitude is of factor . Since the scaling factor is constant, the directions of the Grassmannian subspace basis vectors are preserved.
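The effect of a constant rescaling can be checked with a short sketch. The target used below, a weight standard deviation of sqrt(2 / fan_in) as in He et al. initialization, is an assumption about what the scaling aims for; the exact factor used in the experiments is not reproduced here. The helper name `kaiming_rescale` is hypothetical, and a random array stands in for a Grassmannian-initialized first layer.

```python
import numpy as np

def kaiming_rescale(weight):
    """Rescale a conv weight (out, in, kh, kw) by a single constant so that its
    standard deviation equals the assumed He et al. target sqrt(2 / fan_in)."""
    fan_in = weight.shape[1] * weight.shape[2] * weight.shape[3]
    target_std = np.sqrt(2.0 / fan_in)
    return weight * (target_std / weight.std())

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 3, 7, 7))  # stand-in for the ResNet first layer
w_scaled = kaiming_rescale(w)

# Cosine similarity between each original and rescaled filter.
flat, flat_scaled = w.reshape(64, -1), w_scaled.reshape(64, -1)
cosine = (flat * flat_scaled).sum(axis=1) / (
    np.linalg.norm(flat, axis=1) * np.linalg.norm(flat_scaled, axis=1)
)
```

Because every entry is multiplied by the same positive constant, each filter's cosine similarity with its rescaled version is 1, which is the sense in which the directions are preserved.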
We assess convergence rate by examining the first-epoch accuracy of shallow CNNs on MNIST, KMNIST, CIFAR10 and CIFAR100, and we also report accuracy, kernel sparsity and diversity. As adaptive optimizers and weight decay could affect sparsity, we test with different optimizers (SGD, Adadelta and Adam), each with and without a weight decay of 0.0001.
4 Results and Discussions
Grassmannian models achieve higher accuracy on the 10-class Imagenette problem (imagenette) with ResNets, even when the Grassmannian kernels are frozen upon initialization. With weight decay, Grassmannian kernels achieve the highest accuracy of the four models, whereas Kaiming-scaled Grassmannian packings outperform the rest without weight decay, as per Table 1.
Table 1: Validation accuracy (%) by model, without and with weight decay.
Grassmannian initialization also improves convergence for both shallow (Figure 6) and deep CNNs (Figure 3), with both frozen and trainable Grassmannian models achieving higher first-epoch accuracy (Table 2). Grassmannian subspace initialization, where each convolutional filter was assigned the subspace basis vectors of its corresponding Grassmannian packing scheme, outperformed the baseline on every dataset.
Table 2: First-epoch validation accuracy (%).
Interestingly, we see a marginal improvement over the trainable Grassmannian initialization by simply rendering the Grassmannian layer untrainable. We postulate that in certain cases, simply projecting the input feature maps to be as far apart as possible, as opposed to using another trainable convolutional layer, is enough for later layers to make better use of their own capacity, at least in early training.
Qualitative inspection of the kernels (Figure 7) reveals that Grassmannian kernels learn diverse features, with a mixture of edge, color and pattern detectors in each kernel and without any kernel dominating, which indicates improved diversity and less sparsity. While both baseline and Grassmannian packings gave a mean close to 0, intra-kernel variance and norms are higher for Grassmannian packings, indicating more diversity and less sparsity; the distribution across kernels is shown in Figure 2.
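The statistics mentioned above (per-kernel mean, intra-kernel variance and norm) can be computed directly from a layer's weight array. The sketch below is an illustration rather than the evaluation code behind the reported figures; the near-zero threshold `dead_tol` and the mean absolute cosine similarity, used here as a rough diversity proxy, are assumptions introduced for the example.

```python
import numpy as np

def kernel_stats(weight, dead_tol=1e-3):
    """Summary statistics for a conv weight of shape (out, in, kh, kw)."""
    flat = weight.reshape(weight.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1)
    # Kernels whose norm is below the tolerance are counted as sparse ("dead").
    dead_fraction = float((norms < dead_tol).mean())
    # Diversity proxy: mean |cosine| between distinct live kernels (lower = more diverse).
    live = flat[norms >= dead_tol]
    n = len(live)
    if n > 1:
        unit = live / np.linalg.norm(live, axis=1, keepdims=True)
        gram = np.abs(unit @ unit.T)
        mean_abs_cosine = float((gram.sum() - n) / (n * (n - 1)))
    else:
        mean_abs_cosine = 0.0
    return {
        "mean": flat.mean(axis=1),
        "variance": flat.var(axis=1),
        "norm": norms,
        "dead_fraction": dead_fraction,
        "mean_abs_cosine": mean_abs_cosine,
    }

# Toy layer: three mutually orthogonal kernels and one all-zero kernel.
w = np.zeros((4, 1, 2, 2))
w[0, 0, 0, 0] = 1.0
w[1, 0, 0, 1] = 1.0
w[2, 0, 1, 0] = 1.0
stats = kernel_stats(w)
```

In the toy layer, one kernel out of four is flagged as dead and the remaining three are orthogonal, so the diversity proxy is zero.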
Interestingly, when adaptive optimizers are used (especially Adam, see Figure 5), trainable Grassmannian packings outperform the baseline, which in turn outperforms frozen Grassmannian packings. Since Adam contributes to sparsity (mehta2019implicit; yaguchi2018adam), it is at odds with Grassmannian packings, which favor diversity over sparsity. Nonetheless, most models such as ResNet, RandWire, NASNet and DenseNet achieve SOTA results through SGD with a careful learning rate scheduler, and could further benefit from Grassmannian packings given their demonstrated performance with SGD (wilson2017marginal).
In this work, we showed that using Grassmannian packings for first-layer initialization improved performance in both shallow and deep networks, even with untrainable Grassmannian kernels, across multiple datasets. Grassmannian kernels are best used with SGD optimization and work with or without weight decay. Our results suggest that a model's capacity to learn improves simply by initializing the first-layer filters to be maximally distant, so that they capture as many diverse features as possible. In the future, we hope to extend this analysis to other options for the metric used in the Grassmannian packing, and to search for ways to inform the choice of metric.