1 Introduction
In the classic dichotomy between model-based and data-based approaches to solving complex tasks, Convolutional Neural Networks (CNNs) correspond to a particularly efficient trade-off. CNNs capture key geometric prior information for spatial/temporal tasks through the notion of local translation invariance. Yet they combine this prior with high flexibility, which allows them to be scaled to millions of parameters and to leverage large datasets with gradient-descent learning strategies, typically operating in the ‘interpolating’ regime, i.e. where the training data is fit perfectly.
Such a regime challenges the classic notion of model selection in statistics, whereby increasing the number of parameters trades off bias against variance (Zhang et al., 2016). On the one hand, several recent works studying the role of optimization in this trade-off argue that model size is not always a good predictor for overfitting (Neyshabur et al., 2018; Zhang et al., 2016; Neal et al., 2018; Geiger et al., 2019; Belkin et al., 2018), and consider instead other complexity measures of the function class, which favor CNNs due to their smaller complexity (Du et al., 2018). On the other hand, authors have also considered geometric aspects of the energy landscape, such as the width of basins (Keskar et al., 2016), as a proxy for generalization. However, these properties of the landscape do not appear to account for the benefits associated with specific architectures. Additionally, considering the implicit bias due to the optimization scheme (Soudry et al., 2018; Gunasekar et al., 2018) is not enough to justify the performance gains without considering the architectural bias. Despite important insights on the role of overparametrization in optimization (Du et al., 2017; Arora et al., 2018; Venturi et al., 2018), the architectural bias prevails as a major factor in explaining good generalization in visual classification tasks: overparametrized CNN models generalize well, but large neural networks without any convolutional constraints do not.

In this work, we attempt to further disentangle the bias stemming from the architecture from that of the optimization scheme by hypothesizing that the CNN prior plays a favorable role mostly at the beginning of optimization. Geometrically, the CNN prior defines a low-dimensional subspace within the space of parameters of generic Fully-Connected Networks (FCNs); this subspace is linear, since the CNN constraints of weight sharing and locality are linear (see Figure 1 for a sketch of the core idea). Even though the optimization scheme is able to minimize the training loss with or without the constraints (for sufficiently overparametrized models; Geiger et al., 2018; Zhang et al., 2016), the CNN subspace provides a “better route” that navigates the optimization landscape to solutions with better generalization performance. Yet, surprisingly, we observe that leaving this subspace at an appropriate time results in a non-CNN model with equivalent or even better generalization. Our numerical experiments suggest that the CNN subspace, as well as its vicinity, are good candidates for high-performance solutions. Furthermore, we observe a threshold distance to the CNN subspace beyond which the performance degrades quickly, down to close to the regular FCN accuracy level. Our results offer a new perspective on the success of the convolutional architecture: within FCN loss landscapes there exist rare basins associated with very good generalization, characterized not only by their width but rather by their distance to the CNN subspace. These can be accessed thanks to the CNN prior, and are otherwise missed in the usual training of FCNs.
The rest of the paper is structured as follows. Section 2 discusses prior work relating architecture and optimization biases. Section 3 presents our CNN-to-FCN embedding algorithm and training procedure, and Section 4 describes and analyzes the experiments performed on the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). We conclude in Section 5 by describing theoretical setups compatible with our observations and consequences for practical applications.
2 Related Work
The relationship between CNNs and FCNs is an instance of trading off prior information against expressivity within neural networks. There is an abundant literature exploring the relationship between different neural architectures, for different purposes. One can roughly classify these works according to whether they attempt to map a large model into a smaller one, or vice versa.
In the first category, one of the earliest efforts to introduce structure within FCNs with the goal of improving generalization was Nowlan and Hinton’s soft weight-sharing networks (Nowlan and Hinton, 1992), in which the weights are regularized via a mixture of Gaussians. Another highly popular line of work attempts to distill the “knowledge” of a large model (or an ensemble of models) into a smaller one (Buciluǎ et al., 2006; Hinton et al., 2015), with the goal of improving both computational efficiency and generalization performance. Network pruning (Han et al., 2015) and the recent “Lottery Ticket Hypothesis” (Frankle and Carbin, 2018) are other remarkable instances of the benefits of model reduction.
In the second category, which is more directly related to our work, authors have attempted to build larger models by embedding small architectures into larger ones, such as the Net2Net model (Chen et al., 2015) or more evolved follow-ups (Saxena and Verbeek, 2016). In these works, however, the motivation is to accelerate learning by some form of knowledge transfer between the small model and the large one, whereas our motivation is to understand the specific role of architectural bias in generalization.
The links between generalization error and the geometry and topology of the optimization landscape have also been extensively studied in recent times. Du et al. (2018) compare generalization bounds between CNNs and FCNs, establishing a sample complexity advantage in the case of linear activations. Long and Sedghi (2019) and Lee and Raginsky (2018) obtain specific generalization bounds for CNN architectures. Chaudhari et al. (2016) proposed a different optimization objective, whereby a bilateral filtering of the landscape favors dynamics into wider valleys. Keskar et al. (2016) explored the link between the sharpness of local minima and generalization through Hessian analysis (Sagun et al., 2017), and Wu et al. (2017) argued in terms of the volume of basins of attraction. The characterization of the loss landscape along paths connecting different models has been studied recently, e.g. in Freeman and Bruna (2016), Garipov et al. (2018), and Draxler et al. (2018). The existence of rare basins leading to better generalization was found and highlighted in simple models by Baldassi et al. (2016, 2019). The role of the CNN prior within the ambient FCN loss landscape, and its implications for generalization, were not considered in any of these works. In the following, we address this point by building on these previous investigations of landscape properties.
3 CNN to FCN Embedding
In both FCNs and CNNs, each feature of a layer is calculated by applying a nonlinearity to a weighted sum over the features of the previous layer (or over all the pixels of the image, for the first hidden layer). CNNs are a particular type of FCNs, which make use of two key ingredients to reduce their number of redundant parameters: (1) locality and (2) weight sharing.
Locality: In FCNs, the sum is taken over all the features of the previous layer. In locally connected networks (LCNs), the locality is imposed by restricting the sum to a small receptive field (a box of adjacent features of the previous layer). The set of weights of this restricted sum is called a filter. For a given receptive field, one may create multiple features (or channels) by using several different filters. This procedure makes use of the spatial structure of the data and reduces the number of fitting parameters.
Weight sharing:
CNNs are a particular type of LCN in which all the filters of a given channel use the same set of weights. This makes use of the somewhat universal properties of feature-extracting filters, such as edge detectors, and reduces the number of fitting parameters even more drastically.
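As a rough illustration of how much each ingredient reduces the parameter count, the snippet below compares a fully-connected layer, a locally connected layer, and a convolutional layer on a small toy input (the 8×8 input size and channel counts are illustrative choices, not the paper's architecture):

```python
import torch.nn as nn

# One layer mapping a 3-channel 8x8 input to 16 channels at the same
# resolution, with a 3x3 receptive field. Sizes are purely illustrative.
fc = nn.Linear(3 * 8 * 8, 16 * 8 * 8)             # no locality, no sharing
cnn = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # locality + weight sharing

n_fc = sum(p.numel() for p in fc.parameters())
n_cnn = sum(p.numel() for p in cnn.parameters())
# An LCN keeps locality but unties the filters: one 3x3x3 filter (plus a
# bias) per output channel and per spatial position.
n_lcn = 16 * 8 * 8 * (3 * 3 * 3 + 1)
print(n_fc, n_lcn, n_cnn)
```

Locality alone shrinks the count by orders of magnitude, and weight sharing shrinks it again by a factor of the number of spatial positions.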
When mapping a CNN to its equivalent FCN (eFCN), we obtain very sparse (due to locality) and redundant (due to weight sharing) weight matrices. This typically results in a large memory overhead, as the eFCN of a simple CNN can require several orders of magnitude more memory. Therefore, we present the core ideas on a simple 3-layer CNN on CIFAR-10 (Krizhevsky and Hinton, 2009), and show similar results for AlexNet on CIFAR-100 in Appendix B.
In the mapping (the source code may be found at: https://github.com/sdascoli/anarchitecturalsearch), all layers apart from the convolutional layers (ReLU, Dropout, MaxPool and fully-connected) are left unchanged except for proper reshaping. Each convolutional layer is mapped to a fully-connected layer: for a convolutional layer with $n$ input features per channel, $c$ input channels, $n'$ output features per channel and $c'$ output channels, the corresponding fully-connected layer has a weight matrix of size $(n'c') \times (nc)$. As a result, for a given CNN, we obtain its eFCN counterpart: an end-to-end fully-connected architecture which is functionally identical to the original CNN.
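The mapping of a single convolutional layer can be sketched as follows. The helper `conv_to_fc` and its identity-basis construction are our own illustration, not the authors' released code, but the functional-identity check at the end is exactly the property the embedding must satisfy:

```python
import torch
import torch.nn as nn

def conv_to_fc(conv: nn.Conv2d, in_shape):
    """Return an nn.Linear functionally identical to `conv` on flattened
    inputs of shape (C_in, H, W). Illustrative sketch of the CNN-to-eFCN
    embedding for one layer (hypothetical helper, not the paper's code)."""
    c, h, w = in_shape
    n_in = c * h * w
    with torch.no_grad():
        # The bias contribution is the response to an all-zero input.
        bias = conv(torch.zeros(1, c, h, w))
        # Column i of the dense matrix is the (bias-free) response to the
        # i-th unit input, so push the identity basis through the conv.
        eye = torch.eye(n_in).reshape(n_in, c, h, w)
        cols = conv(eye) - bias                      # (n_in, C_out, H', W')
        W = cols.reshape(n_in, -1).t().contiguous()  # (n_out, n_in)
    fc = nn.Linear(n_in, W.shape[0])
    with torch.no_grad():
        fc.weight.copy_(W)   # sparse in content: zeros off the local blocks
        fc.bias.copy_(bias.reshape(-1))
    return fc
```

Note the memory blow-up this makes concrete: a $3 \times 3$ convolution stores a few dozen weights per channel pair, while its dense equivalent stores one weight per (input pixel, output pixel) pair, most of them zero.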
4 Experiments
Given input-label pairs $(x, y)$ for a supervised classification task, let $x$ denote the input image and $y$ the index of the correct class for a given image (typically, $y$ is represented as a vector of dimension equal to the number of classes to separate; however, we shall not worry about this for the purposes of the present text). The network, parametrized by $\theta$, outputs a prediction $f(\theta; x)$. To distinguish between different architectures we denote the CNN weights by $\theta_{\rm CNN}$ and the eFCN weights by $\theta_{\rm eFCN}$. We denote the embedding function described in Sec. 3 by $\mathcal{E}$, where $\theta_{\rm eFCN} = \mathcal{E}(\theta_{\rm CNN})$, and with a slight abuse of notation use $f$ for both the CNN and the eFCN. Dropping the explicit input dependency for simplicity, functional equivalence reads $f(\theta_{\rm CNN}) = f(\mathcal{E}(\theta_{\rm CNN}))$.

For the experiments, we prepare the CIFAR-10 dataset for training without data augmentation. The optimizer is set to stochastic gradient descent with a constant learning rate of 0.1 and a minibatch size of 250. We turn off momentum and weight decay so as to focus on the pure stochastic gradient dynamics, and we do not adjust the learning rate throughout the training process. In the following, we focus on a convolutional architecture with 3 layers of 64 channels each, followed by ReLU and MaxPooling operators, and a single fully-connected layer that outputs prediction probabilities. In our experience, this VanillaCNN strikes a good balance between simplicity and performance, in that its equivalent FCN version does not suffer from memory issues, yet it significantly outperforms any FCN model trained from scratch. We study the following protocol:
1. Initialize the VanillaCNN at $\theta_{\rm CNN}(0)$ and train for 150 epochs. At the end of training, $\theta_{\rm CNN}(150)$ reaches its final test accuracy.

2. Along the way, save snapshots of the weights at logarithmically spaced epochs $t_w$, which we call the switch time. This provides CNN points denoted by $\theta_{\rm CNN}(t_w)$.

3. Lift each one to its equivalent fully-connected space: $\theta_{\rm eFCN}(t_w) = \mathcal{E}(\theta_{\rm CNN}(t_w))$ (so that only a small fraction of the total number of eFCN parameters is nonzero).

4. Initialize fully-connected models at $\theta_{\rm eFCN}(t_w)$ and train in the FCN space for 100 epochs on the same training data with the same optimizer, except with a smaller constant learning rate of 0.01 (so as to not blow up training), and obtain solutions $\theta^*_{\rm eFCN}(t_w)$.

5. Finally, train a standard FCN (with the same architecture as the eFCNs but with the default PyTorch initialization) for 100 epochs on the same training data with the same optimizer and the smaller constant learning rate of 0.01; denote the resulting weights by $\theta^*_{\rm FCN}$.

This process gives us one CNN solution, one FCN solution, and a family of eFCN solutions labeled by their switch time,

$\{\theta^*_{\rm eFCN}(t_w)\}_{t_w}$,   (1)

which we analyze in the following subsections.
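One concrete (and purely illustrative) reading of "logarithmically spaced epochs" for the switch times $t_w$ is sketched below; the count of 10 snapshots, plus the untrained network at epoch 0, is our assumption, not a value taken from the paper:

```python
import numpy as np

# Log-spaced snapshot epochs over a 150-epoch CNN run; the untrained
# network (epoch 0) is added so the CNN prior alone can also be tested.
switch_epochs = sorted({0} | {int(round(e)) for e in np.logspace(0, np.log10(150), 10)})
print(switch_epochs)
```

Early epochs are sampled densely and late epochs sparsely, which matches the observation that most of the interesting behavior happens at early switch times.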
4.1 Performance and training dynamics of eFCNs
Our first aim is to characterize the training dynamics of eFCNs and to study how their training evolution depends on their switch time $t_w$ (in epochs). When the architectural constraint is relaxed, the loss decreases monotonically to zero (see the left panel of Fig. 2). The initial losses are smaller for larger $t_w$, as expected, since those correspond to CNNs trained for longer. In the right panel of Fig. 2, we show a more surprising result: the test accuracy increases monotonically in time for all $t_w$, showing that relaxing the constraints does not lead to overfitting or catastrophic forgetting. Hence, from the point of view of the eFCN space, it is not as if the CNN dynamics took place on an unstable “cliff”, with the constraints of locality and weight sharing preventing it from falling off. Quite the contrary: the CNN dynamics takes place in a basin, and when the constraints are relaxed, the system keeps going down in training loss and up in test accuracy, as opposed to falling back to the standard FCN regime.
In Fig. 3 (left) we compare the final test accuracies reached by the eFCNs with those of the CNN and the standard FCN. We find two main results. First, the accuracy of the eFCN for $t_w = 0$ is well above the standard FCN result. This shows that even imposing an untrained CNN prior is already enough to find a solution with much better performance than a standard FCN. The second result, perhaps even more remarkable, is that at intermediate switch times, the eFCN reaches, and even exceeds, the final test accuracy of the CNN it stemmed from. This supports the idea that the constraints play a favorable role mostly at the beginning of optimization. At late switch times, the eFCN is initialized close to the bottom of the landscape and has little room for improvement, hence its test accuracy converges to that of the fully trained CNN.
(Figure 3 caption: Largest eigenvalue of the Hessian for eFCNs at the beginning and at the end of training. In all figures, the horizontal axis, $t_w$, indicates the time index of the CNN used to initialize the eFCN.)

4.2 A closer look at the landscape
A widespread idea in the deep learning literature is that the sharpness of the minima of the training loss is related to generalization performance (Keskar et al., 2016; Jastrzebski et al., 2017), the intuition being that flat minima reduce the effect of the difference between training loss and test loss. This motivates us to compare the first- and second-order properties of the landscape explored by the eFCNs and by the CNNs they stem from. To do so, we investigate the norm of the gradient of the training loss, $\|\nabla_\theta \mathcal{L}\|$, and the top eigenvalue of the Hessian of the training loss, $\lambda_{\max}$, in the central and right panels of Fig. 3 (we calculate the latter using a power method).

We point out several interesting observations. First, the sharpness indicators $\|\nabla_\theta \mathcal{L}\|$ and $\lambda_{\max}$ of the eFCNs at initialization display a maximum at an intermediate switch time, which coincides with the switch time of best improvement for the eFCN. Second, we see that after training the eFCNs these indicators plummet by an order of magnitude, which is particularly surprising at very late switch times, where it appeared in the left panel of Fig. 3 (see also Fig. 4) as if the eFCN was hardly moving away from its initialization. This supports the idea that making use of the CNN prior and then relaxing it leads to wide basins, possibly explaining the gain in performance.
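The power method for $\lambda_{\max}$ mentioned above can be sketched with Hessian-vector products, which avoid ever forming the (huge) Hessian; this is a generic sketch of the technique, not the authors' code:

```python
import torch

def top_hessian_eig(loss, params, iters=100):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products. `loss` must be a
    scalar built from `params` with the autograd graph still alive."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat_grad)
    v = v / v.norm()
    lam = 0.0
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) once more.
        hv = torch.autograd.grad(flat_grad @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        lam = torch.dot(v, hv).item()   # Rayleigh quotient estimate
        v = hv / (hv.norm() + 1e-12)
    return lam
```

In practice the loss would be evaluated on a batch of training data; the iterate converges to the eigenvalue of largest magnitude, which for the positive-curvature directions of interest is $\lambda_{\max}$.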
4.3 How far does the eFCN escape from the CNN space?
A major question naturally arises: how far do the eFCNs move away from their initial condition? In other words, do they stay in the sparse configuration they were initialized in (whether they preserve weight sharing will be studied later)? To answer this question, we need to know whether the locality constraint is violated once the constraints are relaxed. To this end, we consider a natural decomposition of the weights in the FCN space into two parts, $\theta_{\rm eFCN} = \theta_{\rm local} + \theta_{\rm off}$, where $\theta_{\rm off} = 0$ for an eFCN at the moment it is initialized from a CNN. A visualization of these blocks may be found in Appendix A. We then study the ratio of the norm of the off-local weights to the total norm, $r = \|\theta_{\rm off}\| / \|\theta_{\rm eFCN}\|$, which is a measure of the deviation of the model from the CNN subspace.
Fig. 4 (left) shows that the deviation at the end of eFCN training decreases monotonically with the switch time $t_w$. In other words, the earlier we relax the constraints (and therefore the higher the initial loss of the eFCN), the further the eFCN escapes from the CNN subspace, as emphasized in Fig. 4 (middle). Fig. 4 (right) shows that as we move away from the CNN subspace, performance stays constant and even increases a bit, then plummets. This allows one to define a critical distance from the CNN subspace within which eFCNs behave like CNNs, and beyond which they fall back to the standard FCN regime. Note that since the number of off-local weights is much larger than the number of local weights, the deviation ratio is close to unity for a standard FCN, whereas it never exceeds 8% for eFCNs, which overall stay rather close to the CNN subspace. This underlines the persistence of the architectural bias under the stochastic gradient dynamics.
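The deviation measure can be sketched as below; the plain L2 norms and the name `off_local_ratio` are our reading of the ratio described above, not a quoted formula:

```python
import torch

def off_local_ratio(theta: torch.Tensor, local_mask: torch.Tensor) -> float:
    """Deviation of an eFCN weight matrix from the CNN subspace: the
    fraction of the L2 weight norm carried by the off-local entries.
    `local_mask` is True where the weight was nonzero at embedding time
    (the local/shared blocks). Hypothetical helper for illustration."""
    off_norm = theta[~local_mask].norm()
    return (off_norm / theta.norm()).item()
```

In practice the mask is simply the nonzero pattern of the eFCN weight matrix at the moment of embedding, recorded per layer before relaxed training begins.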
4.4 What is the role of offlocal blocks in learning?
It is interesting to study the kind of representation learned by the off-local blocks during training of the eFCNs. To this end, we show in Fig. 5 a “filter” from the first layer of the eFCN, whose receptive field is the size of the full image since locality is relaxed. Note that each CNN filter gives rise to many eFCN filters, one for each position on the image, since weight sharing is relaxed; here we show the one obtained when the CNN filter (local block) is at the top left. We see that the off-local blocks stay orders of magnitude smaller than the local ones, as expected from Sec. 4.3 where we saw that locality was almost conserved, and that the local blocks change very little during training, showing that weight sharing is also almost conserved.
More surprisingly, we see that at early switch times, distinctive shapes of the images are learned by the eFCN off-local blocks, which perform some kind of template matching. Note that the silhouettes are particularly clear for the intermediate switch time (middle row), at which we know from Sec. 4.1 that the eFCN had the best improvement over the CNN. This learning procedure is usually very inefficient for complicated images such as those of the CIFAR-10 dataset, as shown in Fig. 6, where we reproduce the counterpart of Fig. 5 for the FCN in the left and middle images (they correspond to initial and final training times, respectively). By taking the difference between the two images, i.e. focusing on the change due to training (right image of Fig. 6), some signal emerges. It remains nevertheless much more blurred than the silhouettes obtained by the eFCN off-local blocks.
From Fig. 7, it is clear that the off-local part is useless on its own; however, when combined with the local part of the eFCN, it may greatly improve performance when the constraints are relaxed early enough. This hints at the fact that the eFCNs perform a complementary optimization of the local and off-local parts of the weights, combining template matching with convolutional feature extraction.
5 Discussion and Conclusion
In this work, we considered the question of CNN architectural bias in the context of visual tasks, and challenged the accepted view that CNNs provide an essential inductive bias for good generalization. Specifically, we asked whether such an inductive bias is necessary throughout the whole training process, or only useful at the early stages of training, to prevent the unconstrained FCN from falling prey to spurious solutions with poor generalization too early.
Our experimental results favor the latter hypothesis, suggesting that there exists a vicinity of the CNN subspace with very good generalization properties, and that one may even enhance the performance of CNNs by exploring it, provided one relaxes the CNN constraints at an appropriate time during training. This hypothesis offers interesting theoretical perspectives, in relation to other high-dimensional estimation problems such as spiked tensor models (Anandkumar et al., 2016), where a smart initialization, containing prior information on the problem, provides an initial condition that bypasses the regions where the estimation landscape is “rough” and full of spurious minima.

Another result that is evident from our experiments is that the correlation between local geometric properties of a solution and its generalization performance is not robust under architectural changes. This finding once again implies the importance of the prior induced by the architecture, rather than the pure geometry of the solution.
On the practical front, despite the performance gains obtained, our algorithm remains highly impractical due to the large number of degrees of freedom required by our eFCNs. However, more efficient strategies involving a less drastic relaxation of the CNN constraints (e.g., relaxing the weight sharing but keeping the locality constraint, as in locally connected networks (Coates and Ng, 2011)) could be of potential interest to practitioners.
Acknowledgments
We would like to thank Alp Riza Guler and Ilija Radosavovic for helpful discussions. We acknowledge funding from the Simons Foundation (#454935, Giulio Biroli). JB acknowledges the partial support by the Alfred P. Sloan Foundation, NSF RI1816753, NSF CAREER CIF 1845360, and Samsung Electronics.
References
 Anandkumar et al. (2016) Anima Anandkumar, Yuan Deng, Rong Ge, and Hossein Mobahi. Homotopy analysis for tensor pca. arXiv preprint arXiv:1610.09322, 2016.
 Arora et al. (2018) Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.
 Baldassi et al. (2016) Carlo Baldassi, Christian Borgs, Jennifer T Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences, 113(48):E7655–E7662, 2016.
 Baldassi et al. (2019) Carlo Baldassi, Fabrizio Pittorino, and Riccardo Zecchina. Shaping the learning landscape in neural networks around wide flat minima. arXiv preprint arXiv:1905.07833, 2019.
 Belkin et al. (2018) Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the biasvariance tradeoff. arXiv preprint arXiv:1812.11118, 2018.
 Buciluǎ et al. (2006) Cristian Buciluǎ, Rich Caruana, and Alexandru NiculescuMizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541. ACM, 2006.
 Chaudhari et al. (2016) Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropysgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
 Chen et al. (2015) Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
 Coates and Ng (2011) Adam Coates and Andrew Y Ng. Selecting receptive fields in deep networks. In Advances in neural information processing systems, pages 2528–2536, 2011.
 Draxler et al. (2018) Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred A Hamprecht. Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885, 2018.
 Du et al. (2017) Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns onehiddenlayer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
 Du et al. (2018) Simon S Du, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Ruslan R Salakhutdinov, and Aarti Singh. How many samples are needed to estimate a convolutional neural network? In Advances in Neural Information Processing Systems, pages 373–383, 2018.
 Frankle and Carbin (2018) Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
 Freeman and Bruna (2016) C Daniel Freeman and Joan Bruna. Topology and geometry of deep rectified network optimization landscapes. arXiv preprint arXiv:1611.01540, 2016.
 Garipov et al. (2018) Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems, pages 8789–8798, 2018.
 Geiger et al. (2018) Mario Geiger, Stefano Spigler, Stéphane d’Ascoli, Levent Sagun, Marco BaityJesi, Giulio Biroli, and Matthieu Wyart. The jamming transition as a paradigm to understand the loss landscape of deep neural networks. arXiv preprint arXiv:1809.09349, 2018.
 Geiger et al. (2019) Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. arXiv preprint arXiv:1901.01608, 2019.
 Gunasekar et al. (2018) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018.
 Han et al. (2015) Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Jastrzebski et al. (2017) Stanisław Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
 Keskar et al. (2016) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On largebatch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
 Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 Lee and Raginsky (2018) Jaeho Lee and Maxim Raginsky. Learning finitedimensional coding schemes with nonlinear reconstruction maps. arXiv preprint arXiv:1812.09658, 2018.
 Long and Sedghi (2019) Philip M Long and Hanie Sedghi. Sizefree generalization bounds for convolutional neural networks. arXiv preprint arXiv:1905.12600, 2019.
 Neal et al. (2018) Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon LacosteJulien, and Ioannis Mitliagkas. A modern take on the biasvariance tradeoff in neural networks. arXiv preprint arXiv:1810.08591, 2018.
 Neyshabur et al. (2018) Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of overparametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
 Nowlan and Hinton (1992) Steven J Nowlan and Geoffrey E Hinton. Simplifying neural networks by soft weightsharing. Neural computation, 4(4):473–493, 1992.
 Sagun et al. (2017) Levent Sagun, Utku Evci, V. Uğur Güney, Yann Dauphin, and Léon Bottou. Empirical analysis of the hessian of overparametrized neural networks. ICLR 2018 Workshop Contribution, arXiv:1706.04454, 2017.
 Saxena and Verbeek (2016) Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In Advances in Neural Information Processing Systems, pages 4053–4061, 2016.
 Soudry et al. (2018) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70), 2018.
 Venturi et al. (2018) Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with finite intrinsic dimension have no spurious valleys. arXiv preprint arXiv:1802.06384, 2018.
 Wu et al. (2017) Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
 Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Appendix A Visualizing the embedding
We give in Fig. 8 an idea of the structure of the weight matrices of eFCNs.
(Figure 8 caption, right panel: same after training the eFCN for 100 epochs. The off-local blocks appear as blue squares and the local blocks as yellow parallelograms; note that the weights of the off-local blocks are several orders of magnitude smaller in absolute value than those of the local blocks. Note that due to the padding, many weights stay at zero even after relaxing the constraints. Each one of the blue squares gives rise to an image like the one shown in Fig. 11 (left).)

Appendix B Results with AlexNet on CIFAR-100
In this section, we show that the ideas presented in the main text hold for other choices of data and architecture. We show results obtained using AlexNet (Krizhevsky et al., 2012) on the CIFAR-100 dataset. Each subsection contains figures that are the counterparts of those in the main text: the performance and training dynamics of eFCNs in Fig. 9, the deviation of the eFCN from the CNN subspace in Fig. 10, and the role of off-local blocks in learning in Fig. 11.
Appendix C Interpolating between CNNs and FCNs
Another way to understand the dynamics of the eFCNs is to examine the paths connecting them, in the FCN weight space, to the CNN they stemmed from. Interpolating in weight space has received some attention in the recent literature, e.g. in Draxler et al. (2018) and Garipov et al. (2018), where it was shown that, contrary to previous beliefs, the bottom of the landscape of deep neural networks resembles a flat, connected level set, since one can always find a path of low energy connecting minima.
Here we use two interpolation methods in weight space. The first method, labeled "linear", consists in sampling equally spaced points along the linear path connecting the weights. Of course, the interpolated points generally have higher training loss than the endpoints.
The second method, labeled "string", consists in starting from the linear interpolation path $(\theta_1, \dots, \theta_P)$ and letting the interpolated points fall down the landscape following gradient descent, while ensuring that they stay close enough together by adding an elastic term to the loss:

$\mathcal{L}_{\rm string}(\theta_1, \dots, \theta_P) = \sum_{i=1}^{P} \mathcal{L}(\theta_i) + \frac{k}{2} \sum_{i=1}^{P-1} \|\theta_{i+1} - \theta_i\|^2$.   (2)
By adjusting the stiffness constant $k$ we can control how straight the string is: at high $k$ we recover the linear interpolation, whereas at low $k$ the points decouple and reach the bottom of the landscape, but end up far apart and no longer form an actual path. Note that this method is a simpler form of the one used in Draxler et al. (2018), as we do not use the "nudging" trick.
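A minimal sketch of the string method is given below, under the simplifying assumptions that the loss takes a flat weight vector and that plain SGD relaxes the interior points (the actual experiments use the full network loss); the helper name and hyperparameters are illustrative:

```python
import torch

def string_interpolate(loss_fn, theta_a, theta_b, n_points=10, k=1.0,
                       lr=0.1, steps=100):
    """Relax a linear interpolation path between two flat weight vectors
    by gradient descent on loss plus an elastic coupling of stiffness k.
    Endpoints stay fixed; only interior points move. Illustrative sketch."""
    ts = torch.linspace(0, 1, n_points)
    path = [(1 - t) * theta_a + t * theta_b for t in ts]
    free = [p.clone().requires_grad_(True) for p in path[1:-1]]
    opt = torch.optim.SGD(free, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pts = [theta_a] + free + [theta_b]
        loss = sum(loss_fn(p) for p in free)
        # elastic term keeps consecutive points on the path close together
        loss = loss + 0.5 * k * sum((pts[i + 1] - pts[i]).pow(2).sum()
                                    for i in range(len(pts) - 1))
        loss.backward()
        opt.step()
    return [theta_a] + [p.detach() for p in free] + [theta_b]
```

At high `k` the elastic term dominates and the relaxed path stays close to the straight line; at low `k` the interior points sag toward nearby minima of `loss_fn`.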
For comparison, we also show the performance obtained when interpolating directly in output space (as done in ensembling methods).
Results are shown in Fig. 12. We see that for both the linear and string interpolations, the training loss displays a barrier, except at late $t_w$, where the eFCN has not escaped far from the CNN subspace. A similar phenomenon is seen in the training accuracy.
However, the behavior of the test accuracy is much more interesting. From subfigures (a) to (d), the test accuracy of the eFCN endpoint increases, as we know from Fig. 9. What is very surprising is that in all cases, the interpolated paths, with both the linear and string methods, reach higher test accuracies than the endpoints, even at early $t_w$, when the eFCN and the CNN are quite far from each other. This suggests that although relaxing the constraints can be beneficial and improve test accuracy, the optimal performance is actually found somewhere between the solution found by the CNN and the solution found by the eFCN. This offers yet another procedure to improve performance in practice. However, in all cases we note that the gain in accuracy is lower than that obtained by interpolating in output space.