Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias

06/16/2019 ∙ by Stéphane d'Ascoli, et al. ∙ EPFL, École Normale Supérieure, NYU

Despite the phenomenal success of deep neural networks in a broad range of learning tasks, there is a lack of theory to understand the way they work. In particular, Convolutional Neural Networks (CNNs) are known to perform much better than Fully-Connected Networks (FCNs) on spatially structured data: the architectural structure of CNNs benefits from prior knowledge on the features of the data, for instance their translation invariance. The aim of this work is to understand this fact through the lens of dynamics in the loss landscape. We introduce a method that maps a CNN to its equivalent FCN (denoted as eFCN). Such an embedding enables the comparison of CNN and FCN training dynamics directly in the FCN space. We use this method to test a new training protocol, which consists in training a CNN, embedding it to FCN space at a certain 'switch time' t_w, then resuming the training in FCN space. We observe that for all switch times, the deviation from the CNN subspace is small, and the final performance reached by the eFCN is higher than that reachable by the standard FCN. More surprisingly, for some intermediate switch times, the eFCN even outperforms the CNN it stemmed from. The practical interest of our protocol is limited by the very large size of the highly sparse eFCN. However, it offers an interesting insight into the persistence of the architectural bias under the stochastic gradient dynamics even in the presence of a huge number of additional degrees of freedom. It shows the existence of some rare basins in the FCN space associated with very good generalization. These can be accessed thanks to the CNN prior, and are otherwise missed.




1 Introduction

In the classic dichotomy between model-based and data-based approaches to solving complex tasks, Convolutional Neural Networks correspond to a particularly efficient tradeoff. CNNs capture key geometric prior information for spatial/temporal tasks through the notion of local translation invariance. Yet, they combine this prior with high flexibility, that allows them to be scaled to millions of parameters and leverage large datasets with gradient-descent learning strategies, typically operating in the ‘interpolating’ regime, i.e. where the training data is fit perfectly.

Such a regime challenges the classic notion of model selection in statistics, whereby increasing the number of parameters trades off bias against variance (Zhang et al., 2016). On the one hand, several recent works studying the role of optimization in this tradeoff argue that model size is not always a good predictor for overfitting (Neyshabur et al., 2018; Zhang et al., 2016; Neal et al., 2018; Geiger et al., 2019; Belkin et al., 2018), and consider instead other complexity measures of the function class, which favor CNNs due to their smaller complexity Du et al. (2018). On the other hand, authors have also considered geometric aspects of the energy landscape, such as width of basins Keskar et al. (2016), as a proxy for generalisation. However, these properties of the landscape do not appear to account for the benefits associated with specific architectures. Additionally, considering the implicit bias due to the optimization scheme (Soudry et al., 2018; Gunasekar et al., 2018) is not enough to justify the performance gains without considering the architectural bias. Despite the important insights on the role of over-parametrization in optimization Du et al. (2017); Arora et al. (2018); Venturi et al. (2018), the architectural bias prevails as a major factor to explain good generalization in visual classification tasks: over-parametrized CNN models generalize well, but large neural networks without any convolutional constraints do not.

In this work, we attempt to further disentangle the bias stemming from the architecture and the optimization scheme by hypothesizing that the CNN prior plays a favorable role mostly at the beginning of optimization. Geometrically, the CNN prior defines a low-dimensional subspace within the space of parameters of generic Fully-Connected Networks (FCN) (this subspace is linear since the CNN constraints of weight sharing and locality are linear, see Figure 1 for a sketch of the core idea). Even though the optimization scheme is able to minimize the training loss with or without the constraints (for sufficiently over-parametrized models Geiger et al. (2018); Zhang et al. (2016)), the CNN subspace provides a “better route” that navigates the optimization landscape to solutions with better generalization performance.

Yet, surprisingly, we observe that leaving this subspace at an appropriate time results in a non-CNN model with equivalent or even better generalization. Our numerical experiments suggest that the CNN subspace as well as its vicinity are good candidates for high-performance solutions. Furthermore, we observe a threshold distance to the CNN space beyond which the performance degrades quickly, down to roughly the regular FCN accuracy level. Our results offer a new perspective on the success of the convolutional architecture: within FCN loss landscapes there exist rare basins associated with very good generalization, characterised not only by their width but also by their distance to the CNN subspace. These can be accessed thanks to the CNN prior, and are otherwise missed in the usual training of FCNs.

The rest of the paper is structured as follows. Section 2 discusses prior work in relating architecture and optimization biases. Section 3 presents our CNN to FCN embedding algorithm and training procedure, and Section 4 describes and analyses the experiments performed on the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). We conclude in Section 5 by describing theoretical setups compatible with our observations and consequences for practical applications.

Figure 1: White background: ambient, high-dimensional fully-connected space. Blue subspace: linear convolutional subspace, of much lower dimension than the ambient space. Green manifold: (near-)zero loss valued, (approximate-)solution set for a given training data. Note that it is a nontrivial manifold due to continuous symmetries (also, see the related work section on mode connectivity) and it intersects with the CNN subspace. Red path: a CNN initialized and trained with the convolutional constraints. Pink path: an FCN model initialized and trained without the constraints. Orange paths, starting on the red path: snapshots taken along the CNN training that are lifted to the ambient FCN space, and trained in the FCN space without the constraints.

2 Related Work

The relationship between CNNs and FCNs is an instance of trading-off prior information with expressivity within Neural Networks. There is abundant literature that explored the relationship between different neural architectures, for different purposes. One can roughly classify these works on whether they attempt to map a large model into a smaller one, or vice-versa.

In the first category, one of the earliest efforts to introduce structure within FCNs with the goal of improving generalization was Nowlan and Hinton’s soft weight sharing networks Nowlan and Hinton (1992), in which the weights are regularized via a Mixture of Gaussians. Another highly popular line of work attempts to distill the “knowledge” of a large model (or an ensemble of models) into a smaller one Buciluǎ et al. (2006); Hinton et al. (2015), with the goal of improving both computational efficiency and generalization performance. Network pruning Han et al. (2015) and the recent “Lottery Ticket Hypothesis” Frankle and Carbin (2018) are other remarkable instances of the benefits of model reduction.

In the second category, which is more directly related to our work, authors have attempted to build larger models by embedding small architectures into larger ones, such as the Net2Net model Chen et al. (2015) or more evolved follow-ups Saxena and Verbeek (2016). In these works, however, the motivation is to accelerate learning by some form of knowledge transfer between the small model and the large one, whereas our motivation is to understand the specific role of architectural bias in generalization.

The links between generalization error and the geometry and topology of the optimization landscape have also been extensively studied in recent times. Du et al. (2018) compare generalisation bounds between CNNs and FCNs, establishing a sample complexity advantage in the case of linear activations. Long and Sedghi (2019); Lee and Raginsky (2018) obtain specific generalisation bounds for CNN architectures. Chaudhari et al. (2016) proposed a different optimization objective, whereby a bilateral filtering of the landscape favors dynamics into wider valleys. Keskar et al. (2016) explored the link between sharpness of local minima and generalization through Hessian analysis Sagun et al. (2017), and Wu et al. (2017) argued in terms of the volume of basins of attraction. The characterization of the loss landscape along paths connecting different models has been studied recently, e.g. in Freeman and Bruna (2016), Garipov et al. (2018), and Draxler et al. (2018). The existence of rare basins leading to better generalization was found and highlighted in simple models in Baldassi et al. (2016, 2019). The role of the CNN prior within the ambient FCN loss landscape and its implication for generalization properties were not considered in any of these works. In the following we address this point by building on these previous investigations of the landscape properties.

3 CNN to FCN Embedding

In both FCNs and CNNs, each feature of a layer is calculated by applying a non-linearity to a weighted sum over the features of the previous layer (or over all the pixels of the image, for the first hidden layer). CNNs are a particular type of FCNs, which make use of two key ingredients to reduce their number of redundant parameters: (1) locality and (2) weight sharing.

Locality: In FCNs, the sum is taken over all the features of the previous layer. In locally connected networks (LCNs), the locality is imposed by restricting the sum to a small receptive field (a box of adjacent features of the previous layer). The set of weights of this restricted sum is called a filter. For a given receptive field, one may create multiple features (or channels) by using several different filters. This procedure makes use of the spatial structure of the data and reduces the number of fitting parameters.

Weight sharing: CNNs are a particular type of LCNs where all the filters of a given channel use the same set of weights. This procedure makes use of the somewhat universal properties of feature extracting filters such as edge detectors and reduces even more drastically the number of fitting parameters.

When mapping a CNN to its equivalent FCN (eFCN), we obtain very sparse (due to locality) and redundant (due to weight sharing) weight matrices. This typically results in a large memory overhead, as the eFCN of a simple CNN can take several orders of magnitude more memory. Therefore, we present the core ideas on a simple 3-layer CNN on CIFAR-10 (Krizhevsky and Hinton, 2009), and show similar results for AlexNet on CIFAR-100 in Appendix B.

In the mapping (footnote 1: the source code may be found at: …), all layers apart from the convolutional layers (ReLU, Dropout, MaxPool and fully-connected) are left unchanged except for proper reshaping. Each convolutional layer is mapped to a fully-connected layer. For a convolutional layer with a given number of input features, input channels, output features and output channels, the corresponding fully-connected layer is of size (input features × input channels) × (output features × output channels).

As a result, for a given CNN, we obtain its eFCN counterpart with an end-to-end fully-connected architecture which is functionally identical to the original CNN.
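The mapping of a single convolutional layer can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes stride 1 and no padding, and the helper names `conv2d_valid` and `embed_conv_as_dense` are hypothetical. Each row of the dense matrix holds one copy of the filter (weight sharing) placed at one receptive field, with zeros everywhere else (locality).

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Direct 2D cross-correlation, stride 1, no padding (assumed setup).
    x: (C_in, H, W); kernel: (C_out, C_in, k, k) -> (C_out, H-k+1, W-k+1)."""
    c_out, c_in, k, _ = kernel.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + k, j:j + k]
            # Contract the filter with the patch over (C_in, k, k).
            y[:, i, j] = np.tensordot(kernel, patch, axes=([1, 2, 3], [0, 1, 2]))
    return y

def embed_conv_as_dense(kernel, h, w):
    """Dense matrix of shape (C_out*H_out*W_out, C_in*H*W) that computes
    the same linear map as the convolution on a flattened input."""
    c_out, c_in, k, _ = kernel.shape
    h_out, w_out = h - k + 1, w - k + 1
    dense = np.zeros((c_out, h_out, w_out, c_in, h, w))
    for i in range(h_out):
        for j in range(w_out):
            # Same filter copied at every output position; zeros elsewhere.
            dense[:, i, j, :, i:i + k, j:j + k] = kernel
    return dense.reshape(c_out * h_out * w_out, c_in * h * w)

rng = np.random.default_rng(0)
kernel = rng.normal(size=(2, 3, 3, 3))   # 2 output channels, 3 input channels, 3x3 filters
x = rng.normal(size=(3, 8, 8))           # toy 8x8 input with 3 channels

dense = embed_conv_as_dense(kernel, 8, 8)
y_conv = conv2d_valid(x, kernel)
y_dense = (dense @ x.ravel()).reshape(y_conv.shape)

print("dense shape:", dense.shape)
print("functionally identical:", np.allclose(y_conv, y_dense))
print("fraction of non-zero weights:", np.count_nonzero(dense) / dense.size)
```

Even on this toy layer most dense weights are zero, which gives a feel for why the eFCN of a realistic CNN is so large and so sparse.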

4 Experiments

Consider a supervised classification task given by input-label pairs, where each label is the index of the correct class for a given image (typically, the label is represented as a vector of dimension equal to the number of classes to separate; however, we shall not worry about this for the purposes of the present text). The network, parametrized by its weights, outputs a prediction for each input. To distinguish between different architectures we use separate symbols for the CNN weights and the eFCN weights. We denote the embedding described in Sec. 3 by a function that maps CNN weights to eFCN weights and, with a slight abuse of notation, use the same symbol for the network function of both the CNN and the eFCN. Dropping the explicit input dependency for simplicity, the embedding leaves the network function unchanged: the eFCN evaluated at the embedded weights gives the same output as the CNN evaluated at the original weights.

For the experiments, we prepare the CIFAR-10 dataset for training without data augmentation. The optimizer is set to stochastic gradient descent with a constant learning rate at 0.1 and a minibatch size of 250. We turn off the momentum and weight decay to simply focus on the stochastic gradient dynamics and we do not adjust the learning rate throughout the training process. In the following, we focus on a convolutional architecture with 3 layers, 64 channels at each layer that are followed by ReLU and MaxPooling operators, and a single fully connected layer that outputs prediction probabilities. In our experience, this VanillaCNN strikes a good balance of simplicity and performance in that its equivalent FCN version does not suffer from memory issues yet it significantly outperforms any FCN model trained from scratch. We study the following protocol:

  1. Initialize the VanillaCNN and train it for 150 epochs. The test accuracy it reaches at the end of training serves as the CNN reference.

  2. Along the way, save snapshots of the CNN weights at logarithmically spaced epochs, the switch times t_w.

  3. Lift each snapshot to its equivalent fully-connected space using the embedding of Sec. 3, so that only a small fraction of the eFCN parameters are non-zero.

  4. Initialize fully-connected models at these lifted points and train them in the FCN space for 100 epochs on the same training data and with the same optimizer, except with a smaller constant learning rate of 0.01 (so as not to blow up training), to obtain the eFCN solutions.

  5. Finally, train a standard FCN (with the same architecture as the eFCNs but with the default PyTorch initialization) for 100 epochs on the same training data and with the same optimizer, again with the smaller constant learning rate of 0.01. The test accuracy it reaches serves as the standard FCN reference.

This process gives us one CNN solution, one FCN solution, and one eFCN solution per switch time t_w, which we analyze in the following subsections.
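The exact list of snapshot epochs is not reproduced here, so the following is only an illustration of what a logarithmically spaced schedule can look like. The helper `log_spaced_epochs` and the choice of 8 points are hypothetical; epoch 0 (the untrained CNN) is always included.

```python
def log_spaced_epochs(n_points, last_epoch):
    """Return strictly increasing integer epochs, roughly log-spaced,
    starting at 0 (untrained network) and ending at last_epoch."""
    if n_points < 2 or last_epoch < 1:
        raise ValueError("need n_points >= 2 and last_epoch >= 1")
    epochs = {0}
    for i in range(n_points):
        # Geometric progression from 1 to last_epoch, rounded to integers;
        # the set removes duplicates created by rounding.
        epochs.add(round(last_epoch ** (i / (n_points - 1))))
    return sorted(epochs)

# Hypothetical schedule for a 150-epoch CNN run.
switch_times = log_spaced_epochs(8, 150)
print(switch_times)
```

A schedule of this kind samples the early phase of CNN training densely, which is where the experiments below find the switch time to matter most.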

4.1 Performance and training dynamics of eFCNs

Our first aim is to characterize the training dynamics of eFCNs and study how their training evolution depends on their switch time t_w (in epochs). When the architectural constraint is relaxed, the loss decreases monotonically to zero (see the left panel of Fig. 2). The initial losses are smaller for larger t_w, as expected since those switch times correspond to CNNs trained for longer. In the right panel of Fig. 2, we show a more surprising result: test accuracy increases monotonically in time for all t_w, thus showing that relaxing the constraints does not lead to overfitting or catastrophic forgetting. Hence, from the point of view of the eFCN space, it is not as if the CNN dynamics took place on an unstable “cliff” and the constraints of locality and weight sharing prevented it from falling off. Quite the contrary: the CNN dynamics takes place in a basin, and when the constraints are relaxed, the system keeps going down on the training surface and up in test accuracy, as opposed to falling back to the standard FCN regime.

Figure 2: Training loss (left) and test accuracy (right) on CIFAR-10 vs. training time in logarithmic scale including the initial point. Different models are color coded as follows: the VanillaCNN is shown in black, the standard FCN in red, and the eFCNs with their switch times t_w are indicated by the gradient ranging from purple to light green. Switch times are given in epochs.

In Fig. 3 (left) we compare the final test accuracies reached by the eFCNs with those of the CNN and the standard FCN. We find two main results. First, the accuracy of the eFCN for t_w = 0 is well above the standard FCN result. This shows that even imposing an untrained CNN prior is already enough to find a solution with much better performance than a standard FCN. The second result, perhaps even more remarkable, is that at intermediate switch times the eFCN reaches, and even exceeds, the final test accuracy reached by the CNN it stemmed from. This supports the idea that the constraints play a favorable role mostly at the beginning of optimization. At late switch times, the eFCN is initialized close to the bottom of the landscape and has little room for improvement, hence the test accuracy converges to that of the fully trained CNN.

Figure 3: Left: The performance of eFCNs reached at the end of training (red crosses) compared to its counterpart for the best CNN accuracy (straight line) and the best FCN accuracy (dashed line). Center: Norm of the gradient for eFCNs at the beginning and at the end of training. Right: Largest eigenvalue of the Hessian for eFCNs at the beginning and at the end of training. In all panels the x-axis, t_w, indicates the time index of the CNN used to initialize the eFCN.

4.2 A closer look at the landscape

A widespread idea in the deep learning literature is that the sharpness of the minima of the training loss is related to generalization performance (Keskar et al., 2016; Jastrzebski et al., 2017). The intuition is that flat minima reduce the effect of the difference between training loss and test loss. This motivates us to compare the first and second order properties of the landscape explored by the eFCNs and the CNNs they stem from. To do so, we investigate the norm of the gradient of the training loss and the top eigenvalue of the Hessian of the training loss, shown in the central and right panels of Fig. 3 respectively (we calculate the latter using a power method).
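The text only states that the top Hessian eigenvalue is obtained with a power method. A minimal sketch of that idea follows, under stated assumptions: the Hessian-vector product is approximated by a central finite difference of the gradient, the function name `top_hessian_eigenvalue` is hypothetical, and the loss is a toy quadratic whose Hessian is known, so the result can be checked. In a deep learning framework one would normally use automatic differentiation for the Hessian-vector product instead.

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, w, n_iter=100, eps=1e-4, seed=0):
    """Power iteration on the Hessian of a loss, using only its gradient.
    Converges to the eigenvalue of largest magnitude, which is the top
    eigenvalue when the Hessian is dominated by positive curvature."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(n_iter):
        # Hessian-vector product H v via central finite difference of the gradient.
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)
        eig = float(v @ hv)            # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm                  # next iterate
    return eig

# Toy check: loss(w) = 0.5 * w^T A w has gradient A w and Hessian A.
A = np.diag([1.0, 2.0, 5.0])
grad_fn = lambda w: A @ w
w0 = np.array([0.3, -0.2, 0.1])

lam_max = top_hessian_eigenvalue(grad_fn, w0)
print("estimated top eigenvalue:", lam_max)
```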

We point out several interesting observations. First, both sharpness indicators of the eFCNs at initialization display a maximum at an intermediate switch time, which coincides with the switch time of best improvement for the eFCN. Second, we see that after training the eFCNs these indicators plummet by an order of magnitude, which is particularly surprising at very late switch times, where it appeared in the left panel of Fig. 3 (see also Fig. 4) as if the eFCN was hardly moving away from initialization. This supports the idea that making use of the CNN priors and then relaxing them leads to wide basins, possibly explaining the gain in performance.

4.3 How far does the eFCN escape from the CNN space?

Figure 4: Left panel: Switch time of the eFCN vs. the measure of deviation from the CNN subspace through the locality constraint, at the final point of eFCN training. Middle panel: Deviation vs. the initial loss value. Even when the eFCN starts at a very low loss, it still deviates from the CNN subspace, yet this does not harm the performance. Right panel: Deviation vs. final test accuracy of eFCN models. Test performance is robust to deviation from the CNN space up to a certain threshold, at which the test accuracy drops abruptly. For reference, the blue point in the middle and right panels indicates the deviation measure for a standard FCN, which is close to unity.

A major question naturally arises: how far do the eFCNs move away from their initial condition? In other words, do they stay in the sparse configuration they were initialized in (whether they preserve the weight sharing will be studied later)? To answer this question, we need to know if the locality constraint is violated once the constraints are relaxed. To this end, we consider a natural decomposition of the weights in the FCN space into two parts, the local weights and the off-local weights, where the off-local weights are zero for an eFCN when it is initialized from a CNN. A visualization of these blocks may be found in Appendix A. We then study the ratio of the norm of the off-local weights to the total norm, which is a measure of the deviation of the model from the CNN subspace.
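The deviation measure can be sketched on a single layer as follows. This is an illustration under assumptions, not the authors' code: it uses the Euclidean (Frobenius) norm, a single-channel layer with stride 1 and no padding, and the hypothetical helpers `locality_mask` and `off_local_ratio`.

```python
import numpy as np

def locality_mask(h, w, k):
    """Boolean mask of shape (H_out*W_out, H*W): True where a k x k,
    stride-1, unpadded convolution would have a (local) weight."""
    h_out, w_out = h - k + 1, w - k + 1
    mask = np.zeros((h_out, w_out, h, w), dtype=bool)
    for i in range(h_out):
        for j in range(w_out):
            mask[i, j, i:i + k, j:j + k] = True
    return mask.reshape(h_out * w_out, h * w)

def off_local_ratio(weights, local_mask):
    """Norm of the off-local weights divided by the norm of all weights."""
    off_local = np.where(local_mask, 0.0, weights)
    return np.linalg.norm(off_local) / np.linalg.norm(weights)

rng = np.random.default_rng(0)
mask = locality_mask(8, 8, 3)

# Freshly embedded layer: only local weights are non-zero.
w_embedded = np.where(mask, rng.normal(size=mask.shape), 0.0)
# After some unconstrained training (mimicked here by small off-local noise).
w_relaxed = w_embedded + 0.01 * rng.normal(size=mask.shape) * (~mask)
# Dense layer with generic weights, as in a standard FCN.
w_dense = rng.normal(size=mask.shape)

print("embedded:", off_local_ratio(w_embedded, mask))
print("relaxed :", off_local_ratio(w_relaxed, mask))
print("dense   :", off_local_ratio(w_dense, mask))
```

The ratio is exactly zero on the CNN subspace and approaches one for generic dense weights, because off-local entries vastly outnumber local ones.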

Fig. 4 (left) shows that the deviation at the end of eFCN training decreases monotonically with its switch time t_w. In other words, the earlier we relax the constraints (and therefore the higher the initial loss of the eFCN), the further the eFCN escapes from the CNN subspace, as emphasized in Fig. 4 (middle). Fig. 4 (right) shows that when we move away from the CNN subspace, performance stays constant and even increases a bit, then plummets. This allows one to define a critical distance from the CNN subspace within which eFCNs behave like CNNs, and beyond which they fall back to the standard FCN regime. Note that since the number of off-local weights is much larger than the number of local weights, the deviation is close to unity for a standard FCN, whereas it never exceeds 8% for eFCNs, which overall stay rather close to the CNN subspace. This underlines the persistence of the architectural bias under the stochastic gradient dynamics.

4.4 What is the role of off-local blocks in learning?

Figure 5: Visualization of an eFCN “filter” from the first layer just after embedding (left column), after training for 11 epochs (middle column), and after training for 78 epochs (right column), where the eFCN is initialized at three different switch times (top, middle, and bottom rows). The colors indicate the natural logarithm of the absolute value of the weights. Note that the convolutional filters vary little and remain orders of magnitude larger than the off-local blocks, and that the off-local blocks pick up strong signals from images as sharp silhouettes appear.

Figure 6: Visualization of the same standard FCN at a randomly initialized point (left) and after training for 150 epochs (middle). The colors indicate the natural logarithm of the absolute value of the weights. Since the eFCN off-local weights start at zero, the comparable quantity for the FCN is the change in the weights due to training, which is shown in the right panel. A loose texture emerges; however, it is not as sharp a silhouette as those of the eFCN weights after training, in particular the ones that are initialized at mid/late times.

Figure 7: Contributions to the test accuracy of the local blocks (off-local blocks masked out) and off-local blocks (local blocks masked out).

It is interesting to study the kind of representation learned by the off-local blocks during training of the eFCNs. To this end, we show in Fig. 5 a “filter” from the first layer of the eFCN, whose receptive field is of the size of the images since locality is relaxed. Note that each CNN filter gives rise to many eFCN filters, one for each position on the image since weight sharing is relaxed; here we show the one obtained when the CNN filter (local block) is on the top left. We see that off-local blocks stay orders of magnitude smaller than the local ones, as expected from Sec. 4.3 where we saw that locality was almost conserved, and local blocks change very little during training, showing that weight sharing is also almost conserved.

More surprisingly, we see that distinctive shapes of the images are learned by the eFCN off-local blocks, which perform some kind of template-matching. Note that the silhouettes are particularly clear for the intermediate switch time (middle row), at which we know from Sec. 4.1 that the eFCN had the best improvement over the CNN. This learning procedure is usually very inefficient for complicated images such as those of the CIFAR-10 dataset, as shown in Fig. 6 where we reproduce the counterpart of Fig. 5 for the FCN in the left and middle images (they correspond to initial and final training times respectively). By taking the difference between the two images, i.e. focusing on the change due to training (right image of Fig. 6), some signal emerges. It nevertheless remains much more blurred than the silhouettes obtained by eFCN off-local blocks.

From Fig. 7, it is clear that the off-local part is useless on its own; however, when combined with the local part of the eFCN, it may greatly improve performance when the constraints are relaxed early enough. This hints at the fact that the eFCNs do a complementary optimization of the local and off-local parts of the weights by combining template matching with convolutional feature extraction.

5 Discussion and Conclusion

In this work, we considered the question of CNN architecture bias in the context of visual tasks, and challenged the accepted view that CNNs provide an essential inductive bias for good generalization. Specifically, we asked whether such inductive bias is necessary throughout the whole training process, or only useful at the early stages of training, to prevent the unconstrained FCN from falling prey to spurious solutions with poor generalization too early.

Our experimental results favor the latter hypothesis, suggesting that there exists a vicinity of the CNN subspace with high generalization properties, and one may even enhance the performance of CNNs by exploring it, if one relaxes the CNN constraints at an appropriate time during training. This hypothesis offers interesting theoretical perspectives, in relation to other high-dimensional estimation problems, such as in spiked tensor models Anandkumar et al. (2016), where a smart initialization, containing prior information on the problem, is used to provide an initial condition that bypasses the regions where the estimation landscape is “rough” and full of spurious minima.

Another result that is evident from our experiments is that the correlation between local geometric properties of a solution and its generalization performance is not robust under architectural changes. This finding once again implies the importance of the prior induced by the architecture, rather than the pure geometry of the solution.

On the practical front, despite the performance gains obtained, our algorithm remains highly impractical due to the large number of degrees of freedom required on our eFCNs. However, more efficient strategies that would involve a less drastic relaxation of the CNN constraints (e.g., relaxing the weight sharing but keeping the locality constraint such as locally-connected networks Coates and Ng (2011)) could be of potential interest to practitioners.


We would like to thank Alp Riza Guler and Ilija Radosavovic for helpful discussions. We acknowledge funding from the Simons Foundation (#454935, Giulio Biroli). JB acknowledges the partial support by the Alfred P. Sloan Foundation, NSF RI-1816753, NSF CAREER CIF 1845360, and Samsung Electronics.


Appendix A Visualizing the embedding

We give in Fig. 8 an idea of the structure of the weight matrices of eFCNs.

Figure 8: Left: Heatmap of a slice of the weight matrix of the first layer of the eFCN just after its initialization from the converged VanillaCNN, with the colorscale indicating the natural logarithm of the absolute value of the weights. The white blocks illustrate the sparsity due to the locality constraint, and the repeating patterns illustrate the redundancy due to the weight sharing constraint. Right: Same after training the eFCN for 100 epochs. The off-local blocks appear as blue squares and the local blocks appear as yellow parallelograms; note that the weights of the off-local blocks are several orders of magnitude smaller in absolute value than those of the local blocks. Note that due to the padding many weights stay at zero even after relaxing the constraints. Each one of the blue squares gives rise to an image like the one shown in Fig. 11 (left).

Appendix B Results with AlexNet on CIFAR-100

In this section, we show that the ideas we presented in the main text hold for various classes of data and architecture. We show results obtained using AlexNet [Krizhevsky et al., 2012] on the CIFAR-100 dataset. Each figure is the counterpart of one from the main text: Fig. 9 presents the performance and training dynamics of eFCNs, Fig. 10 the deviations of the eFCN from the CNN space, and Fig. 11 the role of off-local blocks in learning.

Figure 9: Left: This figure sums up the results obtained with the VanillaCNN described in the paper. The red curve represents the test accuracy of the VanillaCNN versus its training time in epochs. Above each point of the training, we depict as crosses the test accuracy history of the eFCN stemming from switch time t_w, with colors indicating the training time of the eFCN after embedding. For comparison, the best test accuracy reached by a standard FCN of the same size is depicted as a brown horizontal dashed line. Right: Same figure using AlexNet on the CIFAR-100 dataset. We note that results are qualitatively similar: the eFCNs always improve after initialization, outperform the standard FCN, and we again observe that for certain switch times the eFCN even exceeds the best test accuracy reached by the CNN.

Figure 10: Left panel: Switch time of the eFCN vs. the measure of deviation from the CNN subspace through the locality constraint, at the final point of eFCN training. Middle panel: Deviation vs. the initial loss value. Even when the eFCN starts at a very low loss, it still deviates from the CNN subspace, yet this does not harm the performance. Right panel: Deviation vs. final test accuracy of eFCN models. Test performance is robust to deviation from the CNN space up to a certain threshold, at which the test accuracy drops abruptly. For reference, the blue point in the middle and right panels indicates the deviation measure for a standard FCN.

Figure 11: Left: Visualization of an eFCN “filter” from the first layer just after embedding (left column), after training for 11 epochs (middle column), and after training for 78 epochs (right column), where the eFCN is initialized at three different switch times (top, middle, and bottom rows). The colors indicate the natural logarithm of the absolute value of the weights. Note that the convolutional filters vary little and remain orders of magnitude larger than the off-local blocks, and that the off-local blocks pick up strong signals from images as sharp silhouettes appear. Right: Contributions to the test accuracy of the local blocks (off-local blocks masked out) and off-local blocks (local blocks masked out).

Appendix C Interpolating between CNNs and FCNs

Another way to understand the dynamics of the eFCNs is to examine the paths that connect them to the CNN they stemmed from in the FCN weight space. Interpolating in the weight space has received some attention in recent literature, in papers such as [Draxler et al., 2018, Garipov et al., 2018], where it has been shown that contrary to previous beliefs the bottom of the landscapes of deep neural networks resembles a flat, connected level set since one can always find a path of low energy connecting minima.

Here we use two interpolation methods in weight space. The first method, labeled "linear", consists in sampling equally spaced points along the linear path connecting the weights. Of course, the interpolated points generally have higher training loss than the endpoints.

The second method, labeled "string", consists in starting from the linear interpolation path, and letting the interpolated points fall down the landscape following gradient descent, while ensuring that they stay close enough together by adding an elastic term in the loss :


By adjusting the stiffness constant we can control how straight the string is: at high stiffness we recover the linear interpolation, whereas at low stiffness the points decouple and reach the bottom of the landscape, but are far apart and don’t give us an actual path. Note that this method is a simpler form of the one used in [Draxler et al., 2018], where we don’t use the "nudging" trick.
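Both interpolation methods can be sketched on a toy two-dimensional landscape. Everything specific here is an assumption for illustration: the elastic term is taken to be (stiffness / 2) times the sum of squared distances between consecutive points, the endpoints are held fixed, the toy loss has a ring of minima, and the helper names are hypothetical. The exact elastic term used in the experiments may differ.

```python
import numpy as np

def loss(w):
    """Toy landscape with a ring of minima at |w| = 1."""
    return (w @ w - 1.0) ** 2

def grad(w):
    return 4.0 * (w @ w - 1.0) * w

def linear_path(w_start, w_end, n_points):
    """Equally spaced points on the straight segment between two weight vectors."""
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    return (1.0 - t) * w_start + t * w_end

def relax_string(path, grad_fn, stiffness, lr=0.01, n_steps=2000):
    """Gradient descent on sum_i loss(w_i) + (stiffness/2) * sum_i |w_{i+1} - w_i|^2
    (assumed form of the elastic term), with both endpoints held fixed."""
    path = path.copy()
    for _ in range(n_steps):
        step = np.stack([grad_fn(p) for p in path])
        # Gradient of the elastic term with respect to each interior point.
        step[1:-1] += stiffness * (2.0 * path[1:-1] - path[:-2] - path[2:])
        step[0] = 0.0
        step[-1] = 0.0
        path -= lr * step
    return path

w_a = np.array([1.0, 0.0])   # two minima of the toy loss
w_b = np.array([0.0, 1.0])

linear = linear_path(w_a, w_b, n_points=11)
relaxed = relax_string(linear, grad, stiffness=1.0)

print("max loss on linear path:", max(loss(p) for p in linear))
print("max loss on string path:", max(loss(p) for p in relaxed))
```

On this toy problem the straight segment crosses a loss barrier, while the relaxed string bends along the ring of minima and stays at low loss, which is the qualitative behaviour the string method is designed to reveal.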

For comparison, we also show the performance obtained when interpolating directly in output space (as done in ensembling methods).

Figure 12: Interpolation between the solution reached by the CNN after 100 epochs and the solution found by the eFCN after 100 epochs, for four different switch times indicated below the subfigures. In each subfigure, top left panel: train loss, top right panel: test loss, bottom left panel: train accuracy, bottom right panel: test accuracy. The green line corresponds to linear interpolation, the blue line corresponds to the string method, and the red line corresponds to interpolation in output space.

Results are shown in Fig. 12. We see that for both the linear and string interpolations, the training loss displays a barrier, except at late switch times, where the eFCN has not escaped far from the CNN subspace. A similar phenomenon may be seen in training accuracy.

However, the behavior of test accuracy is much more interesting. From subfigures (a) to (d), the test accuracy at the eFCN endpoint increases, as we know from Fig. 9. What is very surprising is that in all cases, the interpolated paths, with both linear and string methods, reach higher test accuracies than the endpoints, even at early switch times, when the eFCN and the CNN are quite far from each other. This suggests that although relaxing the constraints can be beneficial and improve test accuracy, the optimum performance is actually found somewhere in between the solution found by the CNN and the solution found by the eFCN. This offers yet another procedure to improve the performance in practice. However, in all cases we note that the gain in accuracy is lower than the gain obtained by interpolating in output space.