Grassmannian Packings in Neural Networks: Learning with Maximal Subspace Packings for Diversity and Anti-Sparsity

Kernel sparsity ("dying ReLUs") and lack of diversity are commonly observed in CNN kernels, and both decrease model capacity. Drawing inspiration from information theory and wireless communications, we demonstrate the intersection of coding theory and deep learning through the Grassmannian subspace packing problem in CNNs. We propose initializing the first kernel layers with Grassmannian packings, so that the kernels start maximally far apart under the chordal or Fubini-Study distance. Convolutional kernels initialized with Grassmannian packings exhibit diverse features and obtain diverse representations. We show that Grassmannian packings, especially in the initial layers, address kernel sparsity and encourage diversity, while improving classification accuracy across shallow and deep CNNs with better convergence rates.




1 Introduction

Filter-level sparsity, commonly known as “dying kernels” or “dying ReLUs”, is a common phenomenon in Convolutional Neural Networks (CNNs). It has been attributed to adaptive optimizers (Adam, Adagrad, Adadelta), high L2 regularization or weight decay, and the use of the ReLU activation function (mehta2019implicit; lu2019dying; yaguchi2018adam). Attempts to mitigate the dying ReLU issue with Leaky ReLUs have not been entirely effective, with only a minor overall impact on the emergent sparsity (mehta2019implicit).

Figure 1: Left: sparse filters in CNNs; Right: diverse, non-sparse Grassmannian kernels.

While kernel sparsity or pruning could benefit CNNs through faster inference and smaller models (han2015deep; molchanov2016pruning; li2016pruning; gale2019state), sparsity in the lower layers could be detrimental, as the first few layers, especially the first layer, are “sensitive to initialization” and critical to forming good predictions (zhang2019all). Since the initial layers learn simple patterns and subsequent layers build upon them to develop complex features, sparsity in the initial layers would decrease model capacity.

Besides sparsity, kernel diversity has garnered much attention: Zeiler et al. proposed a set of best practices to improve upon AlexNet with different kernel sizes, stride lengths and feature-scale clipping (zeiler2014visualizing) to prevent specific filters from “dominating”. Capturing distinctive features, especially in the lower layers, ensures that each filter learns something different and improves the computational capacity that the CNN brings to the classification problem (graff2016correlating).

Here, we seek to address two problems, namely sparsity and lack of diversity of kernels, especially for the most important first few layers. We propose Grassmannian subspace packings as an initialization that projects channel information onto a space where the basis vectors are maximally distant, to capture as much diversity as possible and increase model capacity. We refer to layers initialized this way as “channel expander” layers. We demonstrate that models with first-layer Grassmannian kernels achieve better accuracy and convergence rate with diverse, non-sparse kernels, which improves model capacity.

2 Grassmannian line and subspace packings

Grassmannian subspace packing (conway1996packing) in experimental mathematics and information theory has been used in beamforming techniques for physical layer wireless communications, limited feedback codebook-based downlink beamforming schemes and sparse signal reconstruction (malioutov2005sparse; love_1; prabhu2009performance).

Set theoretic definition. The real Grassmann manifold $G(k, M)$ is defined as the set of all $k$-dimensional subspaces in $\mathbb{R}^M$, where $k \le M$. The Grassmannian $N$-subspace packing problem is the problem of finding a set of $N$ $k$-dimensional subspaces in $\mathbb{R}^M$ that maximizes the minimum pairwise distance between subspaces in the set under some metric (dhillon2008constructing). We define the pairwise distance between any two subspaces in $G(k, M)$ in terms of principal angles.

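Principal angles. Suppose that $S$ and $T$ are two subspaces in $G(k, M)$. These subspaces are inclined against one another by $k$ different principal angles, with the smallest principal angle $\theta_1$ formed by a pair of unit-length vectors $(s_1, t_1)$ drawn from $S \times T$, where

$$\theta_1 = \min_{s \in S,\; t \in T} \arccos \langle s, t \rangle \quad \text{subject to} \quad \|s\|_2 = \|t\|_2 = 1,$$

with $\langle s, t \rangle$ denoting the inner product of vectors $s$ and $t$, and $\|\cdot\|_2$ denoting the norm metric such that $\|s\|_2 = \sqrt{\langle s, s \rangle}$.

The second principal angle $\theta_2$ is defined as the smallest angle obtained by a pair of unit-length vectors $(s_2, t_2)$ from the same subspaces that is orthogonal to the first pair, where

$$\theta_2 = \min_{s \in S,\; t \in T} \arccos \langle s, t \rangle \quad \text{subject to} \quad \|s\|_2 = \|t\|_2 = 1,\;\; \langle s, s_1 \rangle = 0,\;\; \langle t, t_1 \rangle = 0.$$

The remaining principal angles are defined similarly. The sequence of principal angles is non-decreasing and bounded in $[0, \pi/2]$.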
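The definition above can be checked numerically. The following is a minimal sketch, assuming NumPy and the standard fact that the cosines of the principal angles are the singular values of $Q_S^\top Q_T$ for orthonormal bases $Q_S$, $Q_T$. The function name `principal_angles` and the example subspaces are illustrative choices, not taken from the paper.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, non-decreasing) between span(A) and span(B).

    A and B are (M, k) arrays whose columns span each subspace.
    """
    Qa, _ = np.linalg.qr(A)  # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis of span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles,
    # returned in descending order, so the angles come out non-decreasing.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, 0.0, 1.0))

# Two planes in R^3 that share the x-axis: the xy-plane and a plane tilted by 30 degrees.
xy = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
t = np.deg2rad(30.0)
tilted = np.array([[1.0, 0.0], [0.0, np.cos(t)], [0.0, np.sin(t)]])
angles = principal_angles(xy, tilted)
```

For these two planes the first principal angle is zero (they share a direction) and the second equals the tilt.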

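Optimizing Grassmannians using principal angles. For an $N$-packing problem of $G(k, M)$, let $\mathcal{X} = \{S_1, \dots, S_N\}$ denote the set of $N$ subspaces in $G(k, M)$. We find $\mathcal{X}$ where the minimum distance between any $S_i, S_j \in \mathcal{X}$, $i \neq j$, is maximized. The two popular distance metrics are the chordal distance and the Fubini-Study distance, which are defined below. With the restriction that …, for two $k$-dimensional subspaces $S$ and $T$ with principal angles $\theta_1, \dots, \theta_k$,

$$d_{\mathrm{chord}}(S, T) = \sqrt{\sin^2 \theta_1 + \dots + \sin^2 \theta_k}, \qquad d_{\mathrm{FS}}(S, T) = \arccos \left( \prod_{i=1}^{k} \cos \theta_i \right).$$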
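As a hedged illustration of the two metrics, the sketch below computes the chordal distance as $\sqrt{\sum_i \sin^2 \theta_i}$ and the Fubini-Study distance as $\arccos \prod_i \cos \theta_i$ from the principal-angle cosines. It assumes NumPy; the helper names are hypothetical and the example (two lines in the plane at 60 degrees) is chosen only because the result is easy to verify by hand.

```python
import numpy as np

def _principal_cosines(A, B):
    # Cosines of the principal angles between span(A) and span(B).
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), 0.0, 1.0)

def chordal_distance(A, B):
    c = _principal_cosines(A, B)
    return np.sqrt(np.sum(1.0 - c ** 2))  # sqrt of the sum of sin^2(theta_i)

def fubini_study_distance(A, B):
    c = _principal_cosines(A, B)
    return np.arccos(np.prod(c))  # arccos of the product of cos(theta_i)

# Two lines through the origin in R^2, 60 degrees apart.
a = np.array([[1.0], [0.0]])
phi = np.deg2rad(60.0)
b = np.array([[np.cos(phi)], [np.sin(phi)]])
d_chord = chordal_distance(a, b)
d_fs = fubini_study_distance(a, b)
```

For one-dimensional subspaces both metrics are monotone in the single angle between the lines, so they rank packings identically; they differ once $k > 1$.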


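For our experiments, we generate a codebook matrix for subspace packing based on the Fubini-Study distance, which gave better experimental results. We first generate a codebook matrix, $W$. $W$ is a rank-$k$ codebook characterized by the minimum distance of packing $\delta$,

$$\delta(W) = \min_{1 \le i < j \le N} d_{\mathrm{FS}}(W_i, W_j),$$

where $W_i$ denotes the $i$-th subspace in the codebook and $d_{\mathrm{FS}}$ is the Fubini-Study distance.

We seek to find $W$ that maximizes $\delta$,

$$W^{\ast} = \arg\max_{W} \delta(W).$$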
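The text does not say which optimizer was used to build the codebook. As an illustration only, the sketch below uses a naive random search: it draws many random codebooks and keeps the one with the largest minimum pairwise Fubini-Study distance. This is an assumed stand-in, not the authors' procedure, and the function names, the trial count, and the example sizes are all hypothetical.

```python
import numpy as np

def fs_distance(Qa, Qb):
    # Fubini-Study distance between subspaces given by orthonormal bases.
    c = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(np.prod(c), 0.0, 1.0))

def min_pairwise_distance(codebook):
    # delta(W): smallest distance over all unordered pairs in the codebook.
    n = len(codebook)
    return min(fs_distance(codebook[i], codebook[j])
               for i in range(n) for j in range(i + 1, n))

def random_codebook(n, k, m, rng):
    # QR of a Gaussian matrix gives an orthonormal basis of a random
    # k-dimensional subspace of R^m; stack n of them into shape (n, m, k).
    return np.stack([np.linalg.qr(rng.standard_normal((m, k)))[0]
                     for _ in range(n)])

def search_packing(n, k, m, trials=200, seed=0):
    # Keep the best of `trials` random codebooks (a crude maximizer of delta).
    rng = np.random.default_rng(seed)
    best, best_delta = None, -np.inf
    for _ in range(trials):
        candidate = random_codebook(n, k, m, rng)
        delta = min_pairwise_distance(candidate)
        if delta > best_delta:
            best, best_delta = candidate, delta
    return best, best_delta

# Example: 5 lines in R^3.
codebook, delta = search_packing(n=5, k=1, m=3)
```

Random search only shows what the objective rewards. A practical codebook would more plausibly come from a dedicated method such as the alternating projection of (dhillon2008constructing) or from published packing tables.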


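The Rankin bound (barg2002bounds) provides a theoretical upper bound on the maximum of the minimum pairwise distance between subspaces. In its simplex form for the chordal distance, for $N$ subspaces in $G(k, M)$, it is given by

$$\delta \le \sqrt{\frac{k (M - k)}{M} \cdot \frac{N}{N - 1}}.$$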
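To give a feel for the magnitude of such a bound, the sketch below evaluates the simplex form of the Rankin bound for the chordal distance, $\sqrt{\frac{k(M-k)}{M} \cdot \frac{N}{N-1}}$, for 5 lines in three dimensions. The function name is hypothetical, and choosing the simplex form is an assumption, since the bound's expression did not survive in the text.

```python
import math

def rankin_simplex_bound(n, k, m):
    """Upper bound on the minimum pairwise chordal distance
    of n k-dimensional subspaces in R^m (simplex form)."""
    return math.sqrt(k * (m - k) / m * n / (n - 1))

# 5 lines through the origin in R^3: for lines the chordal distance
# is sin(theta), so the bound translates into a maximal angle.
bound = rankin_simplex_bound(n=5, k=1, m=3)
max_angle_deg = math.degrees(math.asin(bound))
```

Here the bound equals $\sqrt{5/6}$, which caps the smallest angle between any two of the five lines at roughly 66 degrees.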


Grassmannian packing seeks to answer "what is the best way to pack N k-dim subspaces in an M-dim space?". For the first convolutional kernel with width = 3, height = 3 (9 parameters per channel), input channels = 3 and output channels = 32, we ask:

"How can we optimally pack 9 3-dim subspaces in a 32-dim space, such that any two subspaces (kernels) are maximally separated?"

3 Experiments

We first generate Grassmannian codebooks for subspace packings, and load the codebook as CNN kernels. We compare the accuracy and rate of convergence of shallow CNNs and deep CNNs. For a shallow 4-conv CNN, we trained 3 models: a Kaiming-initialized CNN, and the same CNN with a Grassmannian first layer, either frozen or trainable. These models had their first layer initialized as 32 packings of … for CIFAR10 and CIFAR100, and 32 packings of … for single-channel MNIST and KMNIST. The first conv layer has width 3, height 3, 1 or 3 input channels and 32 output channels.

For deep CNNs, we used the ResNet18 architecture and initialized the first layer with 64 packings of …. The first layer of ResNet18 is a conv layer with width 7, height 7, 3 input channels and 64 output channels. We trained ResNet18 with a batch size of 64 and an SGD optimizer with a learning rate of 0.01 and momentum of 0.9, with a learning rate decay by a factor of … every 10 epochs (he2016deep). Besides the Kaiming-initialized baseline, the first-layer Grassmannian model and the first-layer frozen Grassmannian model, we also scaled the Grassmannian kernels as per Kaiming initialization, so that the activations are correctly scaled, which facilitates learning without vanishing gradients (he2015delving). We scale the Grassmannian basis such that the magnitude is of factor …. Since the scaling factor is constant, the directions of the Grassmannian subspace basis vectors are preserved.
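The loading and scaling step can be sketched as follows. The exact scaling factor is not recoverable from the text, so this sketch assumes the usual Kaiming target standard deviation $\sqrt{2 / \text{fan\_in}}$ and rescales all weights by one global constant. The layout (one subspace per output filter, one basis vector per input channel) is likewise an assumption, and the random codebook is only a stand-in for an optimized packing.

```python
import numpy as np

def codebook_to_kernels(codebook, height, width, kaiming_scale=False):
    """Turn a codebook of shape (out_channels, height*width, in_channels)
    into conv weights of shape (out_channels, in_channels, height, width)."""
    out_ch, hw, in_ch = codebook.shape
    assert hw == height * width
    # Each basis vector (a column of length height*width) becomes the
    # spatial slice for one input channel of the filter.
    weights = codebook.transpose(0, 2, 1).reshape(out_ch, in_ch, height, width)
    if kaiming_scale:
        fan_in = in_ch * height * width
        target_std = np.sqrt(2.0 / fan_in)  # assumed Kaiming (He) target
        # A single global constant: magnitudes change, directions do not.
        weights = weights * (target_std / weights.std())
    return weights

rng = np.random.default_rng(0)
# Stand-in codebook: 32 random 3-dim subspaces of R^9 (not an optimized packing).
codebook = np.stack([np.linalg.qr(rng.standard_normal((9, 3)))[0]
                     for _ in range(32)])
plain = codebook_to_kernels(codebook, 3, 3)
scaled = codebook_to_kernels(codebook, 3, 3, kaiming_scale=True)
```

The resulting array has the shape a framework expects for a first conv layer with 3 input and 32 output channels. It could be copied into that layer's weights and, for the frozen variant, excluded from gradient updates.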

We check for convergence rate by examining first-epoch accuracy of shallow CNNs on MNIST, KMNIST, CIFAR10 and CIFAR100, as well as accuracy, kernel sparsity and diversity. As adaptive optimizers and weight decay could affect sparsity, we test on different optimizers such as SGD, Adadelta and Adam, and with/without weight decay of 0.0001.

4 Results and Discussions

Grassmannian models achieve higher accuracy on the 10-class Imagenette (imagenette) problem with ResNets, even when the Grassmannian kernels are frozen upon initialization. With weight decay, Grassmannian kernels achieve the highest accuracy of the four models, whereas Kaiming-scaled Grassmannian packings outperform the rest without weight decay, as per Table 1.

| Model | Validation accuracy (%), without decay | Validation accuracy (%), with decay |
|---|---|---|
| Baseline | 90.7 | 90.4 |
| Grassmannian | 92.4 | 91.8 |
| Grassmannian, Frozen | 90.8 | 91.6 |
| Grassmannian, Kaiming-Scaled | 92.8 | 91.5 |

Table 1: ResNet18 on 10-class ImageNette with SGD optimizer, with and without weight decay.

Grassmannian initialization also improves convergence for both shallow (Figure 6) and deep CNNs (Figure 3), with both frozen and trainable Grassmannian models achieving higher first-epoch accuracy (Table 2). Grassmannian subspace initialization, where each convolutional filter was assigned the subspace basis vectors of its corresponding Grassmannian packing scheme, outperformed the baseline on every dataset.

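| Model | MNIST | KMNIST | CIFAR10 | CIFAR100 |
|---|---|---|---|---|
| Baseline | 90.9 | 59.7 | 33.6 | 2.57 |
| Grassmannian, Frozen | 91.9 | 66.2 | 40.5 | 4.73 |
| Grassmannian | 92.6 | 66.4 | 41.1 | 4.70 |

Table 2: Faster convergence of the shallow CNN: first-epoch validation accuracy (%).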

Interestingly, rendering the Grassmannian initialization untrainable costs little: the frozen variant stays close to the trainable one and is marginally ahead in one case (4.73 vs. 4.70 in the last column of Table 2). We postulate that in certain cases, simply projecting the input feature maps to be as far apart as possible, as opposed to using another trainable convolutional layer, is enough for later layers to make better use of their own capacity, at least in early training.

Qualitative inspection of the kernels (Figure 7) reveals that Grassmannian kernels learn diverse features, with a mixture of edge, color and pattern detectors in each kernel and without any kernel dominating, which indicates improved diversity and less sparsity. While both baseline and Grassmannian packings give means close to 0, intra-kernel variance and norms are higher for Grassmannian packings, indicating diversity and less sparsity, with the distribution across kernels shown in Figure 2.

Figure 2: Distribution of mean, variance and norm over the 64 kernels of the first layer of ResNet.
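The per-kernel statistics behind a plot like Figure 2 are simple to compute. The sketch below assumes the first-layer weights are available as a NumPy array of shape (out_channels, in_channels, height, width). The function name and the norm threshold used to flag "dead" kernels are hypothetical, and the random weights merely stand in for a trained layer.

```python
import numpy as np

def kernel_statistics(weights):
    """Per-kernel mean, variance and L2 norm for conv weights
    shaped (out_channels, in_channels, height, width)."""
    flat = weights.reshape(weights.shape[0], -1)  # one row per output kernel
    return flat.mean(axis=1), flat.var(axis=1), np.linalg.norm(flat, axis=1)

rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 3, 7, 7))  # stand-in for a ResNet first layer
weights[0] = 0.0                              # simulate one dead (all-zero) kernel
means, variances, norms = kernel_statistics(weights)
# Fraction of kernels whose norm falls below an illustrative threshold.
dead_fraction = float(np.mean(norms < 1e-6))
```

Comparing these three distributions between a baseline and a Grassmannian-initialized layer mirrors the comparison described above: near-zero norms and variances point to sparse kernels.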

Interestingly, when adaptive optimizers are used (especially Adam, see Figure 5), trainable Grassmannian packings outperform the baseline, which in turn outperforms frozen Grassmannian packings. Since Adam contributes to sparsity (mehta2019implicit; yaguchi2018adam), it is at odds with Grassmannian packings, which counteract sparsity and promote diversity. Nonetheless, most models such as ResNet, RandWire, NASNet and DenseNet achieve SOTA results through SGD with a careful learning rate scheduler, and could further benefit from Grassmannian packings given their demonstrated performance with SGD.

5 Conclusion

In this work, we showed that the use of Grassmannian packings for first-layer initialization improved performance in both shallow and deep networks, even with untrainable Grassmannian kernels, across multiple datasets. Grassmannian kernels are best used with SGD optimization and work with or without weight decay. Our results suggest that a model's capacity to learn can be improved by simply initializing the first-layer filters to be maximally distant so that they capture as many diverse features as possible. In the future, we hope to extend this analysis to other choices of metric for the Grassmannian packing, and to search for ways to inform that choice.


We would like to acknowledge Matthew McAteer for helpful discussions and for the contribution of designs and plots for Figures 5 and 6.



Figure 3: ResNet with and without weight decay, with Grassmannian achieving better convergence and better final accuracy with everything else kept constant. Last image: zooming into last few epochs for ResNet with weight decay.
Figure 4: Illustration of how the first conv layer is initialized with Grassmannians, with visualizations of Grassmannian packings. Left: 5-packing of …, which answers "how should 5 laser beams passing through a single point in … be arranged so as to make the minimum principal angle between any two of the beams as large as possible?". Right: 2-packing of …. These concepts can be extended to higher dimensions.
Figure 5: Distributions of first-epoch test accuracy over 30 runs with different optimizers on the same shallow CNN architecture, comparing initialization of the first layer using standard Xavier initialization, frozen Grassmannian packings, and trainable Grassmannian packings. Panels: (a) SGD, (b) Adadelta, (c) Adam.
Figure 6: Distributions of first-epoch test accuracy over 30 runs with SGD on different datasets, comparing initialization of the first layer using standard Xavier initialization, frozen Grassmannian packings, and trainable Grassmannian packings. Panels shown: (c) CIFAR10, (d) CIFAR100.
Figure 7: Visualizations of trainable Grassmannian kernels of ResNet trained using SGD, with and without weight decay, after training for 80 epochs on ImageNette. Panels: (a) Grassmannian, no weight decay; (b) Kaiming-scaled Grassmannian, no weight decay; (c) Grassmannian, weight decay; (d) Kaiming-scaled Grassmannian, weight decay.
| Initialization | Test accuracy (%) |
|---|---|
| Standard, Xavier | 91.85 |
| Grassmannian, Fixed | 91.98 |
| Grassmannian, Trainable | 92.31 |

Table 3: Final accuracy of ResNet-56 on CIFAR-10 with the first layer initialized as Xavier, trainable Grassmannian, and frozen Grassmannian.
Figure 8: Comparison of standard initialization and Grassmannian initialization of the first layer, as both trainable and untrainable parameters, on ResNet56 trained on CIFAR-10. The Grassmannian approaches converge faster and reach marginally better test accuracy; the Adam optimizer is used in all 3 cases.