Towards Learning Convolutions from Scratch

07/27/2020 ∙ by Behnam Neyshabur, et al. ∙ Google

Convolution is one of the most essential components of architectures used in computer vision. As machine learning moves towards reducing the expert bias and learning it from data, a natural next step seems to be learning convolution-like structures from scratch. This, however, has proven elusive. For example, current state-of-the-art architecture search algorithms use convolution as one of the existing modules rather than learning it from data. In an attempt to understand the inductive bias that gives rise to convolutions, we investigate minimum description length as a guiding principle and show that in some settings, it can indeed be indicative of the performance of architectures. To find architectures with small description length, we propose β-LASSO, a simple variant of the LASSO algorithm that, when applied on fully-connected networks for image classification tasks, learns architectures with local connections and achieves state-of-the-art accuracies for training fully-connected nets on CIFAR-10 (85.19%), CIFAR-100 (59.56%) and SVHN (94.07%), bridging the gap between fully-connected and convolutional nets.




1 Introduction

Since its inception, machine learning has moved from inserting expert knowledge as explicit inductive bias toward general-purpose methods that learn these biases from data, a trend significantly accelerated by the advent of deep learning (Krizhevsky et al., 2012). This trend is pronounced in areas such as computer vision (He et al., 2016b), speech recognition (Chan et al., 2016), natural language processing (Senior et al., 2020) and computational biology (Senior et al., 2020). In computer vision, for example, tools such as deformable part models (Felzenszwalb et al., 2009), SIFT features (Lowe, 1999) and Gabor filters (Mehrotra et al., 1992) have all been replaced with deep convolutional architectures. Crucially, in certain cases, these general-purpose models learn similar biases to the ones present in the traditional, expert-designed tools.

Gabor filters in convolutional nets are a prime example of this phenomenon: convolutional networks learn Gabor-like filters in their first layer (Zeiler and Fergus, 2014). However, simply replacing the first convolutional layer by a Gabor filter worsens performance, showing that the learned parameters differ from the Gabor filter in a helpful way. A natural analogy can then be drawn between Gabor filters and the use of convolution itself: convolution is a ‘hand-designed’ bias which expresses local connectivity and weight sharing. Is it possible to learn these convolutional biases from scratch, the same way convolutional networks learn to express Gabor filters?

Reducing inductive bias requires more data, more computation, and larger models—consider replacing a convolutional network with a fully-connected one with the same expressive capacity. Therefore it is important to reduce the bias in a way that does not damage the efficiency significantly, keeping only the core bias which enables high performance. Is there a core inductive bias that gives rise to local connectivity and weight sharing when applied to images, enabling the success of convolutions? The answer to this question would enable the design of algorithms which can learn, e.g., to apply convolutions to images, and apply more appropriate biases for other types of data.

Current research in architecture search is an example of efforts to reduce inductive bias, but often without explicitly substituting the inductive bias with a simpler one (Zoph and Le, 2017; Real et al., 2019; Tan and Le, 2019). However, searching without guidance makes the search computationally expensive. Consequently, current techniques in architecture search are not able to find convolutional networks from scratch; they only take the convolution layer as a building block and focus on learning the interaction of the blocks.

Related Work

Previous work attempted to understand or improve the gap between convolutional and fully-connected networks. Perhaps the most related work is Urban et al. (2017), titled “Do deep convolutional nets really need to be deep and convolutional?” (the abstract begins “Yes, they do.”). Urban et al. (2017) demonstrate empirically that even when trained with distillation techniques, fully-connected networks achieve subpar performance on CIFAR-10, with a best-reported accuracy of 74.3%. Our work, however, suggests a different view, with results that significantly bridge this gap between fully-connected and convolutional networks. In another attempt, Mocanu et al. (2018) proposed sparse evolutionary training, achieving 74.84% accuracy on CIFAR-10. Fernando et al. (2016) also propose an evolution-based approach, but they only evaluate it on the MNIST dataset. To the best of our knowledge, the highest accuracy achieved by fully-connected networks on CIFAR-10 is 78.62% (Lin et al., 2016), achieved through heavy data augmentation and pre-training with a zero-bias auto-encoder.

In order to understand the inductive bias of convolutional networks, d’Ascoli et al. (2019) embedded convolutional networks in fully-connected architectures, finding that combinations of parameters expressing convolutions comprise regions which are difficult to arrive at using SGD. Other works have studied simplified versions of convolutions from a theoretical perspective (Gunasekar et al., 2018; Cohen and Shashua, 2017; Novak et al., 2019). Relatedly, motivated by compression, many recent works (Frankle and Carbin, 2019; Lee et al., 2019; Evci et al., 2020; Dettmers and Zettlemoyer, 2019) study sparse neural networks; studying their effectiveness on learning architectural bias from data would be an interesting direction.

Other interesting recent related work shows that variants of transformers are capable of succeeding in vision tasks and learning locally connected patterns (Cordonnier et al., 2020; Chen et al., 2020). In order to do so, one needs to provide the pixel location as input, which enables the attention mechanism to learn locality. Furthermore, it is not clear that such a complex architecture is required in order to learn convolutions. Finally, in a parallel work, Zhou et al. (2020) proposed a method for meta-learning symmetries from data and showed that it is possible to use such a method to meta-learn convolutional architectures from synthetic data.


Our contributions in this paper are as follows:

  • We introduce shallow (s-conv) and deep (d-conv) all-convolutional networks (Springenberg et al., 2015) with desirable properties for studying convolutions. Through systematic experiments on s-conv and d-conv and their locally connected and fully-connected counterparts, we make several observations about the role of depth, local connectivity and weight sharing (Section 2):

      • Local connectivity appears to have the greatest influence on performance.

      • The main benefit of depth appears to be efficiency in terms of memory and computation. Consequently, training shallow architectures with many more parameters for a long time would compensate for most of the performance lost due to lack of depth.

      • The benefit of depth diminishes even further without weight sharing.

  • We look at Minimum Description Length (MDL) as a guiding principle for which architectures generalize better (Section 3):

      • Showing that MDL can be bounded by the number of parameters, we argue and demonstrate empirically that architecture families that need fewer parameters to fit a training set to a certain degree tend to generalize better in the over-parameterized regime.

      • We prove an MDL-based generalization bound for architecture search which suggests that the sparsity of the found architecture has a great effect on generalization. However, weight sharing is only effective if it has a simple structure.

  • Inspired by MDL, we propose a training algorithm, β-lasso, a variant of lasso with a more aggressive soft-thresholding, to find architectures with few parameters and hence a small description length. We present the following empirical findings for β-lasso (Section 4):

      • β-lasso achieves state-of-the-art results on training fully-connected networks on the CIFAR-10, CIFAR-100 and SVHN tasks. The results are on par with the reported performance of multi-layer convolutional networks around 2013 (Hinton et al., 2012; Zeiler and Fergus, 2013). Moreover, unlike convolutional networks, these results are invariant to permuting pixels.

      • We show that the learned networks have fewer parameters than their locally connected counterparts. By visualizing the filters, we observe that β-lasso has indeed learned local connectivity, but it has also learned to sample more sparsely in a local neighborhood to increase the receptive field while keeping the number of parameters low.

      • Somewhat related to the main goal of the paper, we trained ResNet18 with different kernel sizes using β-lasso and observed that, for all kernel sizes, β-lasso improves over SGD on the CIFAR-10, CIFAR-100 and SVHN datasets.

    Figure 4: d-conv and s-conv architectures and their scaling. The left panel shows the architectures. Each convolution or fully-connected layer (except the last one) has batch-normalization followed by ReLU activation. The right panel indicates how the number of parameters in each architecture and its corresponding locally and fully-connected networks scales with respect to the number of base channels (shown in the left panel) and image dimension. The dotted black line is the maximum model size for training using 16-bit precision on a V100 GPU.

    2 Disentangling Depth, Weight Sharing and Local Connectivity

    To study the inductive bias of convolutions, we take a similar approach to d’Ascoli et al. (2019), examining two large hypothesis classes that encompass convolutional networks: locally-connected and fully-connected networks. Given a convolutional network, its corresponding locally-connected version features the same connectivity, but eliminates weight sharing. The corresponding fully-connected network then adds connections between all nodes in adjacent layers. Functions expressible by the fully-connected networks encompass those of locally-connected networks, which encompass convolutional networks.
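The nesting of the three hypothesis classes can be checked numerically on a small case. The sketch below is illustrative only: it assumes a single-channel image, stride 1 and no padding, and the helper names `conv2d_valid` and `conv_as_dense` are hypothetical, not taken from the paper. It builds the dense weight matrix that reproduces a convolution; a locally connected layer may place any values on the entries marked by `mask`, and a fully-connected layer may use every entry.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2-D cross-correlation, stride 1, no padding."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def conv_as_dense(kernel, H, W):
    """Dense weight matrix and locality mask that reproduce conv2d_valid."""
    k = kernel.shape[0]
    oh, ow = H - k + 1, W - k + 1
    weights = np.zeros((oh * ow, H * W))
    mask = np.zeros((oh * ow, H * W), dtype=bool)
    for i in range(oh):
        for j in range(ow):
            row = i * ow + j                      # one output unit per position
            for a in range(k):
                for b in range(k):
                    col = (i + a) * W + (j + b)   # input pixel seen by that unit
                    weights[row, col] = kernel[a, b]  # shared value: convolution
                    mask[row, col] = True             # allowed entry: local layer
    return weights, mask

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
weights, mask = conv_as_dense(kernel, 8, 8)
dense_out = (weights @ image.ravel()).reshape(6, 6)
conv_out = conv2d_valid(image, kernel)
```

Here `dense_out` and `conv_out` agree up to floating-point error, while only 9 of the 64 entries in each row of `weights` are non-zero.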

    One challenge in studying the inductive bias of convolutions is that the existence of other components such as pooling and residual connections makes it difficult to isolate the effect of convolutions in modern architectures. One solution is to simply remove those from the current architectures. However, that would result in a considerable performance loss, since the other design choices of the architecture family were optimized with those components. Alternatively, one can construct an all-convolutional network with a desirable performance.

    Springenberg et al. (2015) have proposed a few all-convolutional architectures with favorable performance. Unfortunately, these models cannot be used for studying the convolution since fully-connected networks that can represent such architectures are too large. Another way to resolve this is by scaling down the number of parameters in the conventional architecture which unfortunately degrades the performance significantly. To this end, below we propose two all-convolutional networks to overcome the discussed issues.

    2.1 Introducing d-conv and s-conv for Studying Convolutions

    In this work, we propose d-conv and s-conv, two all-convolutional networks that perform relatively well on image classification tasks while also enjoying desirable scaling with respect to the number of channels in the corresponding convolutional network and input image size. As shown in the left panel of Figure 4, d-conv has 8 convolutional layers followed by two fully-connected layers, whereas s-conv has only one convolutional layer followed by two fully-connected layers. The right panel of Figure 4 shows how the number of parameters of these models and their corresponding locally-connected (d-local, s-local) and fully-connected (d-fc, s-fc) networks scale with respect to the number of base channels (channels in the first convolutional layer) and the input image size. d-fc has the highest number of parameters given the same number of base channels, which means the largest d-conv that can be studied will not have many base channels. On the other hand, s-fc scales better, which allows us to have more base channels in the corresponding s-conv. The other interesting observation is that the number of parameters in fully-connected networks depends on the fourth power of the input image dimension (e.g. 32 for CIFAR-10), whereas the local and convolutional networks have a quadratic dependency on the image dimension. Note that the quadratic dependency of d-conv and s-conv on the image size is due to the lack of global pooling.
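The scaling claims can be illustrated with a back-of-the-envelope count for a single layer. This is a sketch under simplifying assumptions (stride 1, output of the same spatial size as the input, no biases); the function name and the channel counts are hypothetical and do not correspond to the exact d-conv or s-conv configurations. In this single-layer view the convolution itself does not depend on the image dimension d; the quadratic dependence noted in the text for d-conv and s-conv arises from the fully-connected layers that follow when there is no global pooling.

```python
def layer_param_counts(d, c_in, c_out, k):
    """Weights in one layer mapping a d x d x c_in input to a d x d x c_out output."""
    conv = k * k * c_in * c_out             # one shared k x k filter per channel pair
    local = d * d * conv                    # a separate filter at each output position
    fc = (d * d * c_in) * (d * d * c_out)   # every input unit connects to every output unit
    return {"conv": conv, "local": local, "fc": fc}

# Doubling the image dimension: conv is unchanged, local grows 4x, fc grows 16x.
small = layer_param_counts(d=16, c_in=3, c_out=8, k=3)
large = layer_param_counts(d=32, c_in=3, c_out=8, k=3)
for name in ("conv", "local", "fc"):
    print(name, small[name], large[name], large[name] // small[name])
```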

    Figure 8: Performance scaling of different architectures. The left and middle panels show the test accuracy of four architectures trained on CIFAR-10 and CIFAR-100 datasets. See the Appendix for SVHN plot and experiment details. The right panel shows the maximum test accuracy of architectures in the over-parameterized regime against the minimum number of parameters they need in order to achieve a certain training accuracy that is fixed for each dataset.

    Figure 8 compares the performance of d-conv and s-conv against ResNet18 (He et al., 2016a) and a 3-layer (2 hidden layers) fully-connected network with an equal number of hidden units in each hidden layer, denoted by 3-fc. On all tasks, the performance of d-conv is comparable to but slightly worse than that of ResNet18, which is expected. Moreover, s-conv performs significantly better than fully-connected networks but worse than d-conv.

    Models    #param (M)         CIFAR10             CIFAR100            SVHN
              orig.   fc-emb.    400 ep.  4000 ep.   400 ep.  4000 ep.   400 ep.  4000 ep.
    d-conv    1.45    256        88.84    89.79      63.73    62.26      95.65    95.86
    d-local   3.42    256        86.13    86.07      58.58    55.71      95.71    95.85
    d-fc      256     256        64.78    63.62      36.61    35.51      92.52    90.97
    s-conv    138     256        84.14    87.05      59.48    62.51      92.34    93.38
    s-local   147     256        81.52    85.86      56.64    62.03      92.51    93.98
    s-fc      256     256        72.77    78.63      47.72    51.43      88.64    91.80
    3-fc      256     256        69.19    75.12      44.95    50.75      85.98    86.02

    Table 1: Performance of d-conv, s-conv and their locally and fully-connected counterparts. The results indicate that in this regime, the performance of s-local is comparable to that of d-local and s-conv. Please refer to the Appendix for details of the training procedure.

    2.2 Empirical Investigation

    In this section, we disentangle the effect of depth, weight sharing and local connectivity. Table 1 reports the test accuracy of d-conv, s-conv, their counterparts and 3-fc on three datasets. For each architecture, the number of base channels is chosen such that the corresponding fully-connected network has roughly 256 million parameters. First note that with that constraint, deep convolutional and locally connected architectures have a much smaller number of parameters compared to the others. Moreover, for both deep and shallow architectures, there is not a considerable difference in the number of parameters of the convolutional and locally connected networks. Please refer to Figure 4 for the scaling of each architecture. For each experiment, the results for training 400 and 4000 epochs are reported. Several observations can be made from Table 1:


    Importance of locality: For both deep and shallow architectures and across all three datasets, the gap between a locally connected network and its corresponding fully-connected version is much larger than that between convolutional and locally-connected networks. That suggests that the main benefit of convolutions comes from local connectivity.

    Shallow architectures eventually catch up to deep ones (mostly): While training longer for deep architectures does not seem to improve the performance, it significantly does so in the case of shallow architectures across all datasets. As a result, the gap between deep and shallow architectures shrinks significantly after training for 4000 epochs.

    Without weight sharing, the benefit of depth disappears: s-fc outperforms d-fc in all experiments. Moreover, when training for 4000 epochs, neither d-local nor s-local has a clear advantage over the other.

    Structure of fully-connected network matters: s-fc outperforms 3-fc and d-fc in all experiments by a considerable margin. Even more interesting is that s-fc and 3-fc have the same number of parameters and depth, but s-fc has many more hidden units in the first layer compared to 3-fc. Please see the appendix for the exact details.

    The results in Table 1 suggest that s-conv performs comparably to d-conv and that the gap between s-conv and s-local is negligible compared to the gap between s-local and s-fc. Therefore, in the rest of the paper, we try to bridge the gap between the performance of s-fc and s-local.

    3 Minimum Description Length as a Guiding Principle

    In this section, we take a short break from experiments and look at the minimum description length as a way to explain the differences in the performance of architectures and as a guiding principle for finding models that generalize well. Consider the supervised learning setup where, given an input data $x \in \mathcal{X}$, we want to predict a label $y \in \mathcal{Y}$ under the common assumption that the pair $(x, y)$ is generated from a distribution $\mathcal{D}$. Let $S = \{(x_i, y_i)\}_{i=1}^m$ be a training set sampled i.i.d. from $\mathcal{D}$ and $ℓ$ be a loss function. The task is then to learn a function/hypothesis $h: \mathcal{X} \to \mathcal{Y}$ with low population loss $L_{\mathcal{D}}(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[ℓ(h(x), y)]$, also known as generalization error. In order to find the hypothesis $h$, we start by picking a hypothesis class $\mathcal{H}$ (e.g. linear classifiers, neural networks of a certain family, etc.) and a learning algorithm $\mathcal{A}$. Finally, $\mathcal{A}(S)$ returns a hypothesis $h \in \mathcal{H}$ which will be used for prediction. The learning algorithm usually finds a hypothesis among the ones that have low sample loss $L_S(h) = \frac{1}{m} \sum_{i=1}^m ℓ(h(x_i), y_i)$.

    A fundamental question in learning concerns the generalization gap between the training error and the generalization error. One way to think about generalization is to think of a hypothesis as an explanation for the association of a label to input data (e.g. given a picture and the label “dog”, the hypothesis is an explanation of what makes this picture a picture of a dog). The Occam’s razor principle provides an intuitive way of thinking about generalization of a hypothesis (Shalev-Shwartz and Ben-David, 2014):

    A short explanation tends to be more valid than a long explanation.

    The above philosophical message can indeed be formalized as follows:

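    Theorem 1.

    (Shalev-Shwartz and Ben-David, 2014, Theorem 7.7) Let $\mathcal{H}$ be a hypothesis class and $d: \mathcal{H} \to \{0,1\}^*$ be a prefix-free description language for $\mathcal{H}$. Then for any distribution $\mathcal{D}$, sample size $m$, and probability $δ > 0$, with probability $1 - δ$ over the choice of $S \sim \mathcal{D}^m$ we have that for any $h \in \mathcal{H}$,

    $$L_{\mathcal{D}}(h) \le L_S(h) + \sqrt{\frac{|h| + \ln(2/δ)}{2m}},$$

    where $|h|$ is the length of $d(h)$.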

    The above theorem connects the generalization gap of a hypothesis to its description length using a prefix-free description language, i.e. for any two distinct $h$ and $h'$, $d(h)$ is not a prefix of $d(h')$. Importantly, the description language should be chosen before observing the training set. The simplest form of a prefix-free description language is the bit representation of the parameters. If a model has a fixed number of parameters, each of which is stored in a fixed number of bits, then the total number of bits needed to describe a model is the product of the two. Furthermore, note that this is prefix-free because all hypotheses have the same description length. According to this language, the generalization gap is simply controlled by the number of parameters.
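To get a feel for the magnitudes involved, the gap term can be evaluated for the fixed-width encoding just described. The sketch below assumes the standard form of the bound in Shalev-Shwartz and Ben-David (2014, Theorem 7.7), namely sqrt((|h| + ln(2/δ)) / (2m)), which is stated for losses bounded in [0, 1]. The function names and the chosen values of the confidence parameter and the parameter counts are illustrative assumptions, not values from the paper.

```python
import math

def fixed_width_description_bits(num_params, bits_per_param):
    """Description length when every parameter is stored with the same bit width."""
    return num_params * bits_per_param

def mdl_generalization_gap(description_bits, m, delta):
    """Gap term sqrt((|h| + ln(2/delta)) / (2m)) of the MDL bound."""
    return math.sqrt((description_bits + math.log(2.0 / delta)) / (2.0 * m))

m = 50_000      # size of the CIFAR-10 training set
delta = 0.05    # illustrative confidence parameter
for num_params in (10**3, 10**5, 10**7):
    bits = fixed_width_description_bits(num_params, bits_per_param=16)
    print(num_params, round(mdl_generalization_gap(bits, m, delta), 3))
```

Under this encoding the gap term exceeds 1, and so says nothing about a loss in [0, 1], once the description length passes roughly 2m bits, which is consistent with the remark that follows about parameter counts being a loose measure of capacity.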

    Empirical investigations on generalization of over-parameterized models suggest that the number of parameters is a very loose upper bound on the capacity (Lawrence et al., 1998; Neyshabur et al., 2015; Zhang et al., 2017; Dziugaite and Roy, 2017; Neyshabur et al., 2017) (or in MDL language, it is not a compact encoding). However, it is possible that the number of parameters is a compact encoding in the under-parameterized regime but some other encoding becomes optimal in the over-parameterized regime. One empirical observation about many neural network architectures is that with a well-chosen approach to scale the number of parameters, one can expect that over-parametrization does not make a superior architecture inferior. While this is not always the case, it might be enough to distinguish architectures that have very different inductive biases. For example, the left two panels of Figure 8 show that the performance plots of different architectures do not cross each other across the scale. This was also the central observation used in architecture search by Tan and Le (2019).

    When the ordering of architectures with the same number of parameters based on their test performance is the same for any number of parameters (similar to the left two panels of Figure 8), the performance in the under-parameterized regime can be indicative of the performance in the over-parameterized regime. Hence, if an architecture can fit the training data with fewer parameters, that most likely translates to the superiority of the architecture in the over-parameterized regime. In the right panel of Figure 8, we can see that this is indeed the case for all 4 architectures and 3 datasets that we study in this work. ResNet18 requires the fewest parameters to fit any dataset, followed by d-conv, s-conv and 3-fc respectively, and this is the exact order in terms of generalization performance in the over-parameterized regime where models have around 1 billion parameters.

    MDL-based generalization bound for architecture search

    Theorem 1 gives a bound based on the description length of a hypothesis, and we discussed how it can be bounded by the number of parameters. What if the learning algorithm searches over many architectures and finds one with a small number of parameters? In this case, the active parameters can be denoted as the parameters with non-zero values and the parameter sharing can be modeled as parameters with the same value. Does the performance then only depend on the number of parameters of the final architecture? How does the search space come into the picture and how does weight sharing affect the performance? Three quantities matter here: the number of bits used to represent each parameter (usually 16 or 32), the maximum number of parameters allowed in the architectures found by the architecture search algorithm, and the number of parameters in the architecture that the search actually finds. The next theorem shows how the generalization gap can be bounded in this case.

    Theorem 2.

    Let be a -bit presentation of real numbers (), be all possible mappings from numbers to for any and a be prefix-free description language for . For any distribution , sample size , and , with probability over the choice of , for any parameterized function where :


    where is the length of and for some . Moreover, there exists a prefix free language such that for any , .

    The proof is given in the appendix. The above theorem shows that the capacity of the learned model is bounded by two terms: the number of bits needed to represent the parameters of the learned architecture and the description length of the weight sharing. This is rewarding for finding architectures with a small number of non-zero weights, because in that case the generalization gap would mostly depend on the number of non-zero parameters. However, in the case of weight sharing, the picture is very different. The bound suggests that the generalization gap depends on the description length of the weight sharing, which in the worst case could depend on the number of non-zero parameters if there is no structure. Otherwise, when weight sharing is structured, it can be encoded more efficiently and the generalization could potentially depend only on the number of parameters of the final model found by the architecture search. This suggests that simply encouraging parameter values to be close to each other does not improve generalization unless there is a structure in place. Therefore, we leave this for future work and focus in the rest of the paper on learning networks with a small number of non-zero weights, which seemed to be the most important element based on our empirical investigation in Section 2.

    Algorithm 1 β-lasso

    1: $f(θ)$: stochastic objective function with parameters $θ$, $θ_0$: the initial parameter vector, $λ$: coefficient of $ℓ_1$ regularizer, $β$: threshold coefficient, $η$: learning rate, $τ$: number of updates
    2: for $t=1$ to $τ$ do
    3:     $θ_t ← θ_{t-1} - η(∇_θ f(θ_{t-1}) + λ \, sign(θ_{t-1}))$
    4:     $θ_t ← θ_t (|θ_t| βλ)_+$
    5: Return $θ_t$
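The update loop can be sketched in a few lines of NumPy. The operator inside the thresholding step is not legible in this copy of the algorithm, so the sketch makes a loud assumption: after each regularized gradient step, every weight whose magnitude does not exceed βλ is set to zero. The function names and the toy quadratic objective are hypothetical and serve only to illustrate the loop; this is not the authors' implementation.

```python
import numpy as np

def beta_lasso_step(theta, grad, lr, lam, beta):
    """One update: l1-regularized gradient step, then thresholding at beta * lam."""
    theta = theta - lr * (grad + lam * np.sign(theta))
    # ASSUMPTION: weights with magnitude at most beta * lam are zeroed.
    keep = np.abs(theta) > beta * lam
    return theta * keep

def beta_lasso(grad_fn, theta0, lr, lam, beta, num_updates):
    """Run num_updates steps starting from theta0; grad_fn returns the gradient of f."""
    theta = np.array(theta0, dtype=float)
    for _ in range(num_updates):
        theta = beta_lasso_step(theta, grad_fn(theta), lr, lam, beta)
    return theta

# Toy objective f(theta) = 0.5 * ||theta - target||^2 with a mostly sparse target.
target = np.array([2.0, 0.0, -1.5, 0.01])
theta = beta_lasso(lambda th: th - target, [0.5, 0.5, -0.5, 0.5],
                   lr=0.1, lam=0.05, beta=2.0, num_updates=200)
print(theta)
```

In this toy run the two coordinates with a negligible target value end at exactly zero, while the two large coordinates survive with the usual shrinkage by λ.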

    4 β-lasso: Learning Local Connectivity from Scratch

    In the previous section, we discussed that if a learning algorithm can find an architecture with a small number of non-zero weights, then the generalization gap of that architecture would mostly depend on the number of non-zero weights and the dependence on the number of original parameters would be only logarithmic. This is very encouraging. On the other hand, we know from our empirical investigation in Section 2 that locally connected networks perform considerably better than fully-connected ones. Is it possible to find locally connected networks by simply encouraging sparse connections when training on images? Could such networks bridge the gap between the performance of fully-connected and convolutional networks?

    We are interested in architectures with a small number of parameters. In order to achieve this, we try the simplest form of encouraging sparsity by adding an $ℓ_1$ regularizer. In particular, we propose β-lasso, a simple algorithm that is very similar to lasso (Tibshirani, 1996) except that it has an extra parameter $β$ that allows for more aggressive soft thresholding. The algorithm is shown in Algorithm 1.

    4.1 Training Fully-Connected Networks

    Table 2 compares the performance of s-fc trained with β-lasso to state-of-the-art methods in training fully-connected networks. The results show a significant improvement over previous work, even considering complex methods such as distillation or pretraining. Moreover, there is only a very small gap between the performance of s-fc trained with β-lasso and its locally connected and convolutional counterparts. However, note that unlike s-conv and s-local, the performance of s-fc is invariant to permuting the pixels, which is a considerable advantage when little is known about the structure of the data. To put these results into perspective, these accuracies are on par with the best results for convolutional networks in 2013 (Hinton et al., 2012; Zeiler and Fergus, 2013). We fix $β$ in the experiments but tune the regularization parameter $λ$ for each dataset. We also observed that having a higher $λ$ for the layer that corresponds to the convolution improves the performance, which is aligned with our understanding that such a layer could benefit the most from sparsity.

    Model                            Training Method                        CIFAR-10   CIFAR-100   SVHN
    s-conv                           SGD                                    87.05      62.51       93.38
    s-local                          SGD                                    85.86      62.03       93.98
    mlp (Neyshabur et al., 2019)     SGD (no Augmentation)                  58.1       -           84.3
    mlp (Mukkamala and Hein, 2017)                                          72.2       39.3        -
    mlp (Mocanu et al., 2018)        SET (Sparse Evolutionary Training)     74.84      -           -
    mlp (Urban et al., 2017)         deep convolutional teacher             74.3       -           -
    mlp (Lin et al., 2016)           unsupervised pretraining with ZAE      78.62      -           -
    mlp (3-fc)                       SGD                                    75.12      50.75       86.02
    mlp (s-fc)                       SGD                                    78.63      51.43       91.80
    mlp (s-fc)                       β-lasso ()                             82.45      55.58       93.80
    mlp (s-fc)                       β-lasso ()                             82.52      55.96       93.66
    mlp (s-fc)                       β-lasso ()                             85.19      59.56       94.07

    Table 2: Comparing the performance of s-fc trained with β-lasso to other methods for training fully-connected networks. Please see the appendix for details of the training procedure.

    Figure 12: Number of non-zero parameters in different layers of s-fc trained with β-lasso. The left panel indicates that even though the first layers of s-conv and s-local have far fewer parameters (red dashed lines) compared to s-fc, the number of non-zero parameters in trained s-fc is between that of s-local and s-conv. The middle and right panels show the same quantity for the second and third layers of s-fc, which have the same parametrization as s-conv and s-local. The plots suggest that β-lasso encourages the weights of the second layer to be much sparser but the last layer remains dense.

    Furthermore, to see if β-lasso succeeds in learning architectures that are as sparse as s-local, we measure the number of non-zero weights in each layer separately. The results are shown in Figure 12. The left panel corresponds to the first layer, which is convolutional in s-conv and locally connected in s-local but fully-connected in s-fc. As you can see, the solution found by β-lasso has fewer non-zero parameters than the number of parameters in s-local and has only slightly more parameters than s-conv, even though s-conv benefits from weight sharing. The middle and right plots show that in other layers, the solution found by β-lasso is still sparse but less so in the final layer.
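The per-layer measurement described here amounts to counting non-zero entries in each weight array. A minimal sketch follows; the function name, the layer names and the random toy weights are hypothetical stand-ins for a trained network, not results from the paper.

```python
import numpy as np

def nonzero_report(layers):
    """Return (non-zero count, total count, density) for each named weight array."""
    report = {}
    for name, weights in layers.items():
        nonzero = int(np.count_nonzero(weights))
        total = int(weights.size)
        report[name] = (nonzero, total, nonzero / total)
    return report

# Toy weights: two sparsified layers followed by a dense output layer.
rng = np.random.default_rng(0)
layers = {
    "layer1": rng.normal(size=(64, 256)) * (rng.random((64, 256)) < 0.02),
    "layer2": rng.normal(size=(32, 64)) * (rng.random((32, 64)) < 0.10),
    "layer3": rng.normal(size=(10, 32)),
}
report = nonzero_report(layers)
for name, (nonzero, total, density) in report.items():
    print(name, nonzero, total, round(density, 3))
```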

    Our further investigation into the learned filters led to the surprising results shown in Figure 13. s-fc trained with SGD learned filters that are very dense but with locally correlated values. However, the same network trained with β-lasso learns a completely different set of filters. The filters learned by β-lasso are sparse and locally connected in a similar fashion to s-local filters. Moreover, it appears that the networks trained with β-lasso have learned that immediately adjacent pixels carry little new information. In order to benefit from larger receptive fields while remaining local and sparse, these networks have learned filters that look like a sparse sampling of a local neighborhood in the image. This encouraging finding validates that by using a learning algorithm with better inductive bias, one can start from an architecture without much bias and learn it from data.
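One simple way to quantify "local but sparsely sampled" for a single first-layer unit is to compare the bounding box of its non-zero weights with the number of non-zero weights inside that box. The sketch below is an illustrative assumption rather than the analysis used in the paper: it treats the incoming weights of one unit as a single-channel grid, and the function name and the toy filter are hypothetical.

```python
import numpy as np

def filter_locality(filter_weights, height, width):
    """Bounding box of the non-zero weights of one unit and how densely it is filled."""
    grid = np.asarray(filter_weights).reshape(height, width)
    rows, cols = np.nonzero(grid)
    if rows.size == 0:
        return {"nonzero": 0, "box": (0, 0), "fill": 0.0}
    box_h = int(rows.max() - rows.min() + 1)
    box_w = int(cols.max() - cols.min() + 1)
    return {
        "nonzero": int(rows.size),
        "box": (box_h, box_w),
        "fill": rows.size / (box_h * box_w),
    }

# Toy filter on an 8 x 8 input: five non-zero weights spread over a 3 x 3 neighborhood.
grid = np.zeros((8, 8))
for r, c in [(2, 2), (2, 4), (4, 2), (4, 4), (3, 3)]:
    grid[r, c] = 1.0
stats = filter_locality(grid.ravel(), 8, 8)
print(stats)
```

A small box relative to the image indicates locality, and a fill ratio well below 1 indicates that the unit samples its neighborhood sparsely instead of using every pixel in it.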

    Figure 13: Comparing first-layer filters of s-fc trained with β-lasso to those of s-fc and s-local trained with SGD. The filters learned by SGD training of s-fc are dense but locally correlated. However, the filters learned when trained with β-lasso are locally connected in a similar way to s-local. Furthermore, it seems that the network has learned that nearby pixels have similar information and therefore, to get more information while remaining local, it learns to look at a sparse sampling of a local neighborhood.

    4.2 Training Convolutional Networks with Larger Kernel Size

    Slightly deviating from our main goal, we tried training ResNet18 with different kernel sizes using β-lasso and compared it with SGD. Figure 17 shows that β-lasso improves over SGD for all kernel sizes across all datasets. The improvement is predictably larger when the kernel size is large, since β-lasso would learn to adjust it automatically. These results again confirm that β-lasso can be used in many different settings to adaptively learn the structure.

    Figure 17: Performance of ResNet18 trained with different kernel sizes.

    5 Discussion and Future Work

    In this work, we studied the inductive bias of convolutional networks through empirical investigations and MDL theory. We proposed a simple algorithm called β-lasso that significantly improves training of fully-connected networks. β-lasso does not have any specific inductive bias related to images. For example, permuting all pixels in our experiments does not change the performance of β-lasso. It is therefore interesting to see if β-lasso can be used in other contexts such as natural language processing, or in domains such as computational biology where the data is structured but our knowledge of the structure of the data is much more limited compared to computer vision. Another promising direction is to improve the efficiency of β-lasso to benefit from sparsity for faster computation and less memory usage. The current implementation does not benefit from sparsity in terms of memory or computation. In order to scale up β-lasso to handle larger networks and input data dimensions, these barriers should be removed. Finally, we want to emphasize that general-purpose algorithms such as β-lasso that are able to learn the structure become even more promising as training larger models becomes more accessible.


    We thank Sanjeev Arora, Ethan Dyer, Guy Gur-Ari, Aitor Lewkowycz and Yuhuai Wu for many fruitful discussions and Vaishnavh Nagarajan and Vinay Ramasesh for their useful comments on this draft.


    • Chan et al. (2016) Chan, W., Jaitly, N., Le, Q., and Vinyals, O. (2016). Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE.
    • Chen et al. (2020) Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Dhariwal, P., Luan, D., and Sutskever, I. (2020). Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning.
    • Cohen and Shashua (2017) Cohen, N. and Shashua, A. (2017). Inductive bias of deep convolutional networks through pooling geometry. In Proceedings of the International Conference on Learning Representations.
    • Cordonnier et al. (2020) Cordonnier, J.-B., Loukas, A., and Jaggi, M. (2020). On the relationship between self-attention and convolutional layers. In International Conference on Learning Representations.
    • d’Ascoli et al. (2019) d’Ascoli, S., Sagun, L., Biroli, G., and Bruna, J. (2019). Finding the needle in the haystack with convolutions: on the benefits of architectural bias. In Advances in Neural Information Processing Systems, pages 9330–9340.
    • Dettmers and Zettlemoyer (2019) Dettmers, T. and Zettlemoyer, L. (2019). Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840.
    • Dziugaite and Roy (2017) Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Conference on Uncertainty in Artificial Intelligence.
    • Evci et al. (2020) Evci, U., Gale, T., Menick, J., Castro, P. S., and Elsen, E. (2020). Rigging the lottery: Making all tickets winners. In Proceedings of the 37th International Conference on Machine Learning.
    • Felzenszwalb et al. (2009) Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and Ramanan, D. (2009). Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645.
    • Fernando et al. (2016) Fernando, C., Banarse, D., Reynolds, M., Besse, F., Pfau, D., Jaderberg, M., Lanctot, M., and Wierstra, D. (2016). Convolution by evolution: Differentiable pattern producing networks. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pages 109–116.
    • Frankle and Carbin (2019) Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In Proceedings of the International Conference on Learning Representations.
    • Gunasekar et al. (2018) Gunasekar, S., Lee, J. D., Soudry, D., and Srebro, N. (2018). Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471.
    • He et al. (2016a) He, K., Zhang, X., Ren, S., and Sun, J. (2016a). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
    • He et al. (2016b) He, K., Zhang, X., Ren, S., and Sun, J. (2016b). Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer.
    • Hinton et al. (2012) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
    • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
    • Lawrence et al. (1998) Lawrence, S., Giles, C. L., and Tsoi, A. C. (1998). What size neural network gives optimal generalization? Convergence properties of backpropagation. Technical report.
    • Lee et al. (2019) Lee, N., Ajanthan, T., and Torr, P. H. (2019). Snip: Single-shot network pruning based on connection sensitivity. In Proceedings of the International Conference on Learning Representations.
    • Lim et al. (2019) Lim, S., Kim, I., Kim, T., Kim, C., and Kim, S. (2019). Fast autoaugment. In Advances in Neural Information Processing Systems, pages 6665–6675.
    • Lin et al. (2016) Lin, Z., Memisevic, R., and Konda, K. (2016). How far can we go without convolution: Improving fully-connected networks. Proceedings of the International Conference on Learning Representations workshop track.
    • Lowe (1999) Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE international conference on computer vision, volume 2, pages 1150–1157. IEEE.
    • Mehrotra et al. (1992) Mehrotra, R., Namuduri, K. R., and Ranganathan, N. (1992). Gabor filter-based edge detection. Pattern recognition, 25(12):1479–1494.
    • Mocanu et al. (2018) Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. (2018). Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):1–12.
    • Mukkamala and Hein (2017) Mukkamala, M. C. and Hein, M. (2017). Variants of rmsprop and adagrad with logarithmic regret bounds. In Proceedings of the 34th International Conference on Machine Learning, pages 2545–2553. JMLR.org.
    • Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017). Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956.
    • Neyshabur et al. (2019) Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. (2019). Towards understanding the role of over-parametrization in generalization of neural networks. In Proceedings of the International Conference on Learning Representations.
    • Neyshabur et al. (2015) Neyshabur, B., Tomioka, R., and Srebro, N. (2015). In search of the real inductive bias: On the role of implicit regularization in deep learning. International Conference on Learning Representations workshop track.
    • Novak et al. (2019) Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. (2019). Bayesian deep convolutional networks with many channels are gaussian processes. Proceedings of the International Conference on Learning Representations.
    • Real et al. (2019) Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. (2019). Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4780–4789.
    • Ritchie et al. (2020) Ritchie, S., Slone, A., and Ramasesh, V. (2020). Caliban: Docker-based job manager for reproducible workflows.
    • Senior et al. (2020) Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A. W., Bridgland, A., et al. (2020). Improved protein structure prediction using potentials from deep learning. Nature, pages 1–5.
    • Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press.
    • Springenberg et al. (2015) Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2015). Striving for simplicity: The all convolutional net. In International Conference on Learning Representations.
    • Tan and Le (2019) Tan, M. and Le, Q. V. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning.
    • Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
    • Urban et al. (2017) Urban, G., Geras, K. J., Kahou, S. E., Aslan, O., Wang, S., Caruana, R., Mohamed, A., Philipose, M., and Richardson, M. (2017). Do deep convolutional nets really need to be deep and convolutional? In International Conference on Learning Representations.
    • Zeiler and Fergus (2013) Zeiler, M. D. and Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. In International Conference on Learning Representations.
    • Zeiler and Fergus (2014) Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer.
    • Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations.
    • Zhou et al. (2020) Zhou, A., Knowles, T., and Finn, C. (2020). Meta-learning symmetries by reparameterization. arXiv preprint arXiv:2007.02933.
    • Zoph and Le (2017) Zoph, B. and Le, Q. V. (2017). Neural architecture search with reinforcement learning. International Conference on Learning Representations.

    Appendix A Experimental Setup

    We used Caliban (Ritchie et al., 2020) to manage all experiments in a reproducible environment in Google Cloud’s AI Platform. In all experiments, we used 16-bit operations on V100 GPUs except Batch Normalization which was kept at 32-bit.

    A.1 Architectures

    The ResNet18 architecture used in our experiments is the same as the original version introduced in He et al. (2016a) for training on the ImageNet dataset, except that the first convolution layer has kernel size 3 by 3 and is not followed by a max-pooling layer. These changes are made to adjust the architecture for the smaller image sizes used in our experiments. The 3-fc architecture is a fully connected network with two hidden layers having the same number of hidden units, with ReLU activation and Batch Normalization. Finally, please see Tables 3 and 4 for details of d-conv, s-conv and their locally and fully connected counterparts.

    Layer   Description   #param (d-conv / d-local / d-fc)
    conv1 conv3x3
    conv2 conv3x3
    conv3 conv3x3
    conv4 conv3x3
    conv5 conv3x3
    conv6 conv3x3
    conv7 conv3x3
    conv8 conv3x3
    fc1 fc
    fc2 fc(c)
    Total d-conv()
    Table 3: d-conv and its counterparts. The number of base channels determines the total number of parameters of the architecture. The convolutional modules in d-conv turn into locally connected and fully connected modules for d-local and d-fc respectively.
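To make the relationship between a convolutional module and its locally connected and fully connected counterparts concrete, the following sketch counts weights for one layer under each of the three connectivity patterns. It is an illustration under stated assumptions, not a reconstruction of the table entries. Biases are ignored, and the function names and the example sizes (a 32 by 32 input with 3 channels, 16 output channels, spatial size preserved) are hypothetical choices made for this example.

```python
def conv_params(kernel, c_in, c_out):
    """Convolution: one kernel x kernel filter per (input, output) channel
    pair, shared across all spatial positions."""
    return kernel * kernel * c_in * c_out


def local_params(kernel, c_in, c_out, out_h, out_w):
    """Locally connected: same receptive fields as the convolution, but a
    separate set of weights at every output position (no weight sharing)."""
    return conv_params(kernel, c_in, c_out) * out_h * out_w


def fc_params(in_h, in_w, c_in, out_h, out_w, c_out):
    """Fully connected: every input unit is connected to every output unit."""
    return (in_h * in_w * c_in) * (out_h * out_w * c_out)


# Hypothetical example layer: 3x3 kernel, 32x32x3 input, 32x32x16 output.
n_conv = conv_params(3, 3, 16)                  # 432
n_local = local_params(3, 3, 16, 32, 32)        # 442368
n_fc = fc_params(32, 32, 3, 32, 32, 16)         # 50331648

print(n_conv, n_local, n_fc)
```

Removing weight sharing multiplies the count by the number of output positions, and removing locality multiplies it again by the ratio of the full input size to the receptive field size.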

    A.2 Hyperparameters

    All architectures in this paper are trained with a cosine annealing learning rate schedule with initial learning rate 0.1, batch size , and the data augmentation proposed in Lim et al. (2019). We also looked at learning rates 0.01 and 1 but observed that 0.1 works well across experiments. For models trained with SGD, we tried values 0 and 0.9 for the momentum, and 0, , , , for the weight decay. As for β-lasso, we did not use momentum, set and tried , , , and for the choice of regularization for layers that correspond to convolutional and fully connected layers separately. Moreover, for all models, we tried both adding dropout to the last two fully connected layers and not adding dropout to any layer. Models in Table  are trained for 4000 epochs. All other models in this paper are trained for 400 epochs unless mentioned otherwise. For each experiment, we picked the model with the highest validation accuracy among these choices of hyper-parameters.
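As a reference for the schedule mentioned above, here is a minimal sketch of cosine annealing in its common form. It uses the learning rate 0.1 that the text reports as working well across experiments. The function name, the decay to a final rate of 0, and the choice to anneal per epoch are assumptions made for this example and are not stated in the text.

```python
import math


def cosine_annealing_lr(step, total_steps, initial_lr=0.1, final_lr=0.0):
    """Learning rate at `step`, decayed from initial_lr to final_lr along
    half a cosine period. final_lr=0.0 is an assumed default."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (initial_lr - final_lr) * cosine


# Example: a 400-epoch run, annealed once per epoch (an assumption).
for epoch in (0, 100, 200, 300, 400):
    print(epoch, cosine_annealing_lr(epoch, 400))
```

The rate starts at the initial value, passes through half of it at the midpoint of training, and reaches the final value at the last step.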

    Layer   Description   #param (s-conv / s-local / s-fc)
    conv1 conv9x9
    fc1 fc
    fc2 fc(c)
    Total s-conv()
    Table 4: s-conv and its counterparts. The number of base channels determines the total number of parameters of the architecture. The convolutional modules in s-conv turn into locally connected and fully connected modules for s-local and s-fc respectively.

    Appendix B Proof of Theorem 2

    In this section, we restate and prove Theorem 2.


    We build our proof based on theorem d construct a prefix-free description language for . We use the first bits to encode which is the dimension of parameters. The next bits then would be used to store the values of parameters. Note that so far, the encoding has varied length based on but it is prefix free because for any two dimensions , the first bits are different. Finally, we add the encoding for which takes bit based on theorem statement. After adding the encoding remains prefix-free. The reason is that for any two , if they have different number of parameters, the first bits will be different. Otherwise, if they have the exact same number of parameters and the same parameter values, the length of the encoding before will be exactly and since is prefix free, the whole encoding remains prefix-free.

    Next we construct a prefix-free description such that for any , . Instead of assigning all weights to parameters, it is enough to specify the weights that are assigned to non-zero parameters and the rest of the weights will then be assigned to the parameter with value zero. Similar to before, we use the first bits to specify the number of non-zero weights, i.e. and the next bits to specify the number of parameters . This again helps to keep the encoding prefix-free. Finally for each non-zero weight, we use bits to identify the index and to specify the index of its corresponding parameter. Therefore, the total length will be and this completes the proof. ∎

    Appendix C Supplementary Figures

    Figure 24: Performance scaling of different architectures. The top row shows the training accuracy and the bottom row shows the test accuracy.
    (a) CIFAR-10
    (b) CIFAR-100
    (c) SVHN
    Figure 28: First layer filters of s-fc trained with SGD. These filters are chosen randomly among all filters with at least 20 non-zero values.
    (a) CIFAR-10
    (b) CIFAR-100
    (c) SVHN
    Figure 32: First layer filters of s-fc trained with β-lasso. These filters are chosen randomly among all filters with at least 20 non-zero values.
    (a) CIFAR-10
    (b) CIFAR-100
    (c) SVHN
    Figure 36: First layer filters of s-local trained with SGD. These filters are chosen randomly among all filters with at least 20 non-zero values.
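The selection rule used in the captions above (random filters among those with at least 20 non-zero values) can be sketched as follows. This is an illustrative sketch, not the authors' plotting code. The function name, the representation of each filter as a flat list of weights, the exact-zero test, and the fixed seed are all assumptions made for this example.

```python
import random


def sample_filters(filters, min_nonzero=20, num_samples=4, seed=0):
    """Return indices of randomly chosen filters that have at least
    `min_nonzero` non-zero weights. Each filter is a flat list of floats."""
    eligible = [
        index
        for index, weights in enumerate(filters)
        if sum(1 for w in weights if w != 0.0) >= min_nonzero
    ]
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return rng.sample(eligible, min(num_samples, len(eligible)))


# Toy filters with 30 weights each: 25, 5 and 20 non-zero values.
toy_filters = [
    [1.0] * 25 + [0.0] * 5,
    [1.0] * 5 + [0.0] * 25,
    [0.5] * 20 + [0.0] * 10,
]
chosen = sample_filters(toy_filters, num_samples=2)
print(sorted(chosen))
```

Filtering before sampling keeps nearly empty filters, which are common after sparsity-inducing training, out of the visualization.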