1 Introduction
Since the introduction of backpropagation (Rumelhart et al., 1986), stochastic gradient descent (SGD) has been the most commonly used optimization algorithm for deep neural networks. While yielding remarkable performance on a variety of learning tasks, a downside of the SGD algorithm is that it requires a schedule for the decay of its learning rate. In the convex setting, curvature properties of the objective function can be used to design schedules that are hyperparameter-free and guaranteed to converge to the optimal solution (Bubeck, 2015). However, there is no analogous result of practical interest for the nonconvex optimization problem of a deep neural network. An illustration of this issue is the diversity of learning rate schedules used to train deep convolutional networks with SGD: Simonyan & Zisserman (2015) and He et al. (2016) adapt the learning rate according to the validation performance, while Szegedy et al. (2015); Huang et al. (2017) and Loshchilov & Hutter (2017) use predetermined schedules, which are respectively piecewise constant, geometrically decaying, and cyclic with a cosine annealing. While these protocols result in competitive or state-of-the-art results on their learning task, there does not seem to be a consistent methodology. As a result, finding such a schedule for a new setting is a time-consuming and computationally expensive effort.
To alleviate this issue, adaptive gradient methods have been developed (Zeiler, 2012; Kingma & Ba, 2015; Reddi et al., 2018), and borrowed from online convex optimization (Duchi et al., 2011). Typically, these methods only require the tuning of the initial learning rate, the other hyperparameters being considered robust across applications. However, it has been shown that such adaptive gradient methods obtain worse generalization than SGD (Wilson et al., 2017). This observation is corroborated by our experimental results.
In order to bridge this performance gap between existing adaptive methods and SGD, we introduce a new optimization algorithm, called Deep Frank-Wolfe (DFW). The DFW algorithm exploits the composite structure of deep neural networks to design an optimization algorithm that leverages efficient convex solvers. In more detail, we consider a composite (nested) optimization problem, with the loss as the outer function and the function encoded by the neural network as the inner one. At each iteration, we define a proximal problem with a first-order approximation of the neural network (linearized inner function), while keeping the loss function in its exact form (exact outer function). When the loss is the hinge loss, each proximal problem created by our formulation is exactly a linear SVM. This allows us to employ the powerful Frank-Wolfe (FW) algorithm as the workhorse of our procedure.
There are two by-design advantages to our method compared to the SGD algorithm. First, each iteration exploits more information about the learning objective, while preserving the same computational cost. Second, an optimal step-size is computed in closed form by using the Frank-Wolfe algorithm in the dual (Frank & Wolfe, 1956; Lacoste-Julien et al., 2013). Consequently, we do not need a hand-designed schedule for the learning rate. As a result, our algorithm is the first to provide competitive generalization error compared to SGD, all the while requiring a single hyperparameter and often converging significantly faster.
We present two additional improvements to customize the use of the DFW algorithm for deep neural networks. First, we show how to smooth the loss function to avoid the optimization difficulties that arise from learning deep models with SVMs (Berrada et al., 2018). Second, we incorporate Nesterov momentum (Nesterov, 1983) to accelerate our algorithm. We demonstrate the efficacy of our method on image classification with wide residual networks (Zagoruyko & Komodakis, 2016) and densely connected convolutional neural networks (Huang et al., 2017) on the CIFAR data sets (Krizhevsky, 2009), and on natural language inference with a BiLSTM on the SNLI corpus (Bowman et al., 2015). We show that the DFW algorithm often strongly outperforms previous methods based on adaptive learning rates. Furthermore, it provides comparable or better accuracy than SGD with hand-designed learning rate schedules. In conclusion, our contributions can be summed up as follows:

- We propose a proximal framework which preserves information from the loss function.
- For the first time for deep neural networks, we demonstrate how our formulation gives at each iteration (i) an optimal step-size in closed form and (ii) an update at the same computational cost as SGD.
- We design a novel smoothing scheme for the dual optimization of SVMs.
- To the best of our knowledge, the resulting DFW algorithm is the first to offer comparable or better generalization than SGD with a hand-designed schedule on the CIFAR data sets, while converging several times faster and requiring only a single hyperparameter.
2 Related Work
Non-Gradient-Based Methods.
The success of a simple first-order method such as SGD has led to research into more sophisticated techniques based on relaxations (Heinemann et al., 2016; Zhang et al., 2017a), learning theory (Goel et al., 2017), Bregman iterations (Taylor et al., 2016), and even second-order methods (Roux et al., 2008; Martens & Sutskever, 2012; Ollivier, 2013; Desjardins et al., 2015; Martens & Grosse, 2015; Grosse & Martens, 2016; Ba et al., 2017; Botev et al., 2017; Martens et al., 2018). While such methods hold a lot of promise, their relatively large per-iteration cost limits their scalability in practice. As a result, gradient-based methods continue to be the most popular optimization algorithms for learning deep neural networks.
Adaptive Gradient Methods.
As mentioned earlier, one of the main challenges of using SGD is the design of a learning rate schedule. Several works proposed alternative first-order methods that do not require such a schedule, by either modifying the descent direction or adaptively rescaling the step-size (Duchi et al., 2011; Zeiler, 2012; Schaul et al., 2013; Kingma & Ba, 2015; Zhang et al., 2017b; Reddi et al., 2018). However, such adaptive variants of SGD sometimes provide subpar generalization (Wilson et al., 2017).
Learning to Learn and Meta-Learning.
Learning to learn approaches have also been proposed to optimize deep neural networks. Baydin et al. (2018) and Wu et al. (2018) learn the learning rate to avoid a hand-designed schedule and to improve practical performance. Such methods can be combined with our proposed algorithm to learn its proximal coefficient, instead of considering it as a fixed hyperparameter to be tuned. Meta-learning approaches have also been suggested to learn the optimization algorithm (Andrychowicz et al., 2016; Ravi & Larochelle, 2017; Wichrowska et al., 2017; Li & Malik, 2017). This line of work, which is orthogonal to ours, could benefit from the use of DFW to optimize the meta-learner.
Optimization and Generalization.
Several works study the relationship between optimization and generalization in deep learning. In order to promote generalization within the optimization algorithm itself, Neyshabur et al. (2015; 2016) proposed the Path-SGD algorithm, which implicitly controls the capacity of the model. However, their method requires the model to employ only ReLU non-linearities, which is an important restriction for practical purposes. Hardt et al. (2016); Arpit et al. (2017); Neyshabur et al. (2017); Hoffer et al. (2017) and Chaudhari & Soatto (2018) analyzed how existing optimization algorithms implicitly regularize deep neural networks. However, this phenomenon is not yet fully understood, and the resulting empirical recommendations are sometimes opposing (Hardt et al., 2016; Hoffer et al., 2017).
Proximal Methods.
The backpropagation algorithm has been analyzed in a proximal framework in (Frerix et al., 2018). Yet, the resulting approach still requires the same hyperparameters as SGD and incurs a higher computational cost per iteration.
Linear SVM Sub-Problems.
A key component of our approach is the formulation of sub-problems as linear SVMs. Berrada et al. (2017) showed that neural networks with piecewise linear activations could be trained with the CCCP algorithm (Yuille & Rangarajan, 2002), which yielded approximate SVM problems to be solved with the BCFW algorithm (Lacoste-Julien et al., 2013). However, their algorithm only updates the parameters of one layer at a time, which slows down convergence considerably in practice. Closest to our approach are the works of Hochreiter & Obermayer (2005) and Singh & Shawe-Taylor (2018). Hochreiter & Obermayer (2005) suggested to create a local SVM based on a first-order Taylor expansion and a proximal term, in order to lower the error of every data sample while minimizing the changes in the weights. However, their method operated in a non-stochastic setting, making the approach infeasible for large-scale data sets. Singh & Shawe-Taylor (2018), a parallel work to ours, also created an SVM problem using a first-order Taylor expansion, this time in a mini-batch setting. Their work provided interesting insights from a statistical learning theory perspective. While their method is well-grounded, its significantly higher cost per iteration impairs its practical speed and scalability. As such, it can be seen as complementary to our empirical work, which exploits a powerful solver and provides state-of-the-art scalability and performance.
3 Problem Formulation
Before describing our formulation, we introduce some necessary notation. We use $\|\cdot\|$ to denote the Euclidean norm. Given a function $f$ and a point $\hat{u}$, $\partial f(\hat{u})$ is the derivative of $f$ with respect to its input, evaluated at $\hat{u}$. According to the situation, this derivative can be a gradient, a Jacobian or even a directional derivative. Its exact nature will be clear from context throughout the paper. We also introduce the first-order Taylor expansion of $f$ around the point $\hat{u}$: $T_{\hat{u}} f(u) = f(\hat{u}) + \langle \partial f(\hat{u}), u - \hat{u} \rangle$. For a positive integer $p$, we denote the set $\{1, \dots, p\}$ as $[p]$. For simplicity, we assume that stochastic algorithms process only one sample at each iteration, although the methods can be trivially extended to mini-batches of size larger than one.
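As a minimal numerical sketch (ours, not part of the paper; function names are illustrative), the first-order Taylor expansion $T_{\hat{u}} f$ can be implemented and checked as follows:

```python
import numpy as np

def taylor(f, grad_f, u_hat):
    """First-order Taylor expansion of f around u_hat:
    T_{u_hat} f(u) = f(u_hat) + <grad f(u_hat), u - u_hat>."""
    f0, g0 = f(u_hat), grad_f(u_hat)
    return lambda u: f0 + g0 @ (u - u_hat)

# Example: f(u) = ||u||^2 / 2, whose gradient is u itself.
f = lambda u: 0.5 * u @ u
grad_f = lambda u: u
u_hat = np.array([1.0, 2.0])
T = taylor(f, grad_f, u_hat)

assert np.isclose(T(u_hat), f(u_hat))            # exact at the expansion point
assert np.isclose(T(np.array([2.0, 2.0])), 3.5)  # 2.5 + <[1, 2], [1, 0]> = 3.5
```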
3.1 Learning Objective
We suppose we are given a data set $\mathcal{D} = \{(x_i, y_i)\}_{i \in [N]}$, where each sample $x_i$ is annotated with a label $y_i$ from the output space $\mathcal{Y}$. The data set is used to estimate a parameterized model, represented by a function $f$. Given its (flattened) parameters $\mathbf{w} \in \mathbb{R}^p$ and an input $x$, the model predicts $f(\mathbf{w}, x) \in \mathbb{R}^{|\mathcal{Y}|}$, a vector with one score per element of the output space $\mathcal{Y}$. For instance, $f$ can be a linear map or a deep neural network. Given a vector of scores per label $s \in \mathbb{R}^{|\mathcal{Y}|}$, we denote by $\mathcal{L}(s, y)$ the loss function that computes the risk of the prediction scores $s$ given the ground-truth label $y$. For example, the loss can be cross-entropy or the multi-class hinge loss:

$\mathrm{CE}(s, y) = -\log\left( \exp(s_y) / \sum_{j \in \mathcal{Y}} \exp(s_j) \right)$   (1)

$\mathrm{SVM}(s, y) = \max_{j \in \mathcal{Y}} \left\{ s_j - s_y + \mathbb{1}[j \neq y] \right\}$   (2)
The cross-entropy loss (1) tries to match the empirical distribution by driving incorrect scores as far as possible from the ground-truth one. The hinge loss (2) attempts to create a margin of at least one between the correct score and the incorrect ones. The hinge loss has been shown to be more robust to overfitting than cross-entropy when combined with smoothing techniques that are common in the optimization literature (Berrada et al., 2018). To simplify notation, we introduce $f_i(\mathbf{w}) = f(\mathbf{w}, x_i)$ and $\mathcal{L}_i(s) = \mathcal{L}(s, y_i)$ for each $i \in [N]$. Finally, we denote by $\rho(\mathbf{w})$ the regularization (typically the squared Euclidean norm). We now write the learning problem under its empirical risk minimization form:

$\min_{\mathbf{w} \in \mathbb{R}^p} \; \rho(\mathbf{w}) + \frac{1}{N} \sum_{i \in [N]} \mathcal{L}_i\big( f_i(\mathbf{w}) \big)$   (3)
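As an illustrative sketch (these NumPy helpers are ours, not the paper's), the two losses in (1) and (2) can be written as:

```python
import numpy as np

def cross_entropy(s, y):
    # CE(s, y) = -log( exp(s_y) / sum_j exp(s_j) ), computed stably.
    s = s - np.max(s)
    return np.log(np.sum(np.exp(s))) - s[y]

def multiclass_hinge(s, y):
    # SVM(s, y) = max_j { s_j - s_y + 1[j != y] }: zero once the correct
    # score beats every other score by a margin of at least one.
    margins = s - s[y] + (np.arange(len(s)) != y)
    return np.max(margins)

scores = np.array([2.0, 0.0, 0.0])
print(multiclass_hinge(scores, 0))    # margin satisfied -> 0.0
print(cross_entropy(np.zeros(3), 0))  # uniform scores -> log(3)
```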
3.2 A Proximal Approach
Our main contribution is a formulation which exploits the composite nature of deep neural networks in order to obtain a better approximation of the objective at each iteration. Thanks to this careful design of the approximation, the approach yields sub-problems that are amenable to efficient optimization by powerful convex solvers. In order to understand the intuition behind our approach, we first present a proximal gradient perspective on SGD.
The SGD Algorithm.
At iteration $t$, the SGD algorithm selects a sample $j(t) \in [N]$ at random and observes the objective estimate $\rho(\mathbf{w}) + \mathcal{L}_{j(t)}(f_{j(t)}(\mathbf{w}))$. Then, given the learning rate $\eta_t$, it performs the following update on the parameters:

$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \left[ \partial \rho(\mathbf{w}_t) + \partial \big( \mathcal{L}_{j(t)} \circ f_{j(t)} \big)(\mathbf{w}_t) \right]$   (4)
Equation (4) is the closed-form solution of a proximal problem where the objective has been linearized by its first-order Taylor expansion (Bubeck, 2015):

$\mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w} \in \mathbb{R}^p} \left\{ T_{\mathbf{w}_t}\big[ \rho + \mathcal{L}_{j(t)} \circ f_{j(t)} \big](\mathbf{w}) + \frac{1}{2\eta_t} \|\mathbf{w} - \mathbf{w}_t\|^2 \right\}$   (5)
To see the relationship between (4) and (5), one can set the gradient with respect to $\mathbf{w}$ to zero in equation (5), and observe that the resulting equation is exactly (4). In other words, SGD minimizes a first-order approximation of the objective, while encouraging proximity to the current estimate $\mathbf{w}_t$.
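This equivalence is easy to verify numerically. The following sketch (our illustration, with arbitrary values) checks that the SGD step is a stationary point of the linearized proximal objective:

```python
import numpy as np

rng = np.random.default_rng(0)
w0 = rng.normal(size=5)   # current iterate w_t
g = rng.normal(size=5)    # stochastic gradient of the linearized objective at w_t
eta = 0.1                 # learning rate eta_t

# SGD step, equation (4)
w_sgd = w0 - eta * g

# Gradient of the proximal objective of (5):  g + (w - w0) / eta.
# Setting it to zero recovers exactly the SGD update.
prox_grad = g + (w_sgd - w0) / eta
assert np.allclose(prox_grad, 0.0)
```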
However, one can also choose to linearize only a part of the composite objective (Lewis & Wright, 2016). Choosing which part to approximate is a crucial decision, because it yields optimization problems with widely different properties. In this work, we suggest an approach that lends itself to fast optimization with robust convex solvers and preserves information about the learning task by keeping an exact loss function.
Loss-Preserving Linearization.
In detail, at iteration $t$, with selected sample $j(t)$, we introduce the proximal problem that linearizes the regularization $\rho$ and the model $f_{j(t)}$, but not the loss function $\mathcal{L}_{j(t)}$:

$\mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w} \in \mathbb{R}^p} \left\{ T_{\mathbf{w}_t}\rho(\mathbf{w}) + \mathcal{L}_{j(t)}\big( T_{\mathbf{w}_t} f_{j(t)}(\mathbf{w}) \big) + \frac{1}{2\eta} \|\mathbf{w} - \mathbf{w}_t\|^2 \right\}$   (6)
In figure 1, we provide a visual comparison of equations (5) and (6) in the case of a piecewise linear loss. As will be seen, by preserving the loss function, we are able to achieve good performance across a number of tasks with a fixed $\eta$. Consequently, we provide the first algorithm to accurately learn deep neural networks with only a single hyperparameter while offering similar performance to SGD with a hand-designed schedule.
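The effect of preserving the loss can be seen on a toy problem (our construction, echoing the comparison of figure 1): a 1-D "model" f(w) = w with the hinge loss L(s) = max(0, 1 - s), no regularization, and a large step. Full linearization as in (5) overshoots past the kink of the loss, while keeping the loss exact as in (6) stops at the hinge:

```python
import numpy as np

w0, eta = 0.0, 2.0
loss = lambda w: np.maximum(0.0, 1.0 - w)  # hinge loss of the linear model f(w) = w

# Fully linearized step, as in (5): the loss gradient at w0 = 0 is -1,
# so the step is w0 + eta, regardless of where the kink of the loss lies.
w_linearized = w0 + eta

# Loss-preserving proximal problem, as in (6) (f is linear, so T f = f here).
# Solve it by brute force on a fine grid.
grid = np.linspace(-3.0, 5.0, 800001)
w_prox = grid[np.argmin(loss(grid) + (grid - w0) ** 2 / (2 * eta))]

print(w_linearized)       # 2.0: overshoots the margin
print(round(w_prox, 3))   # 1.0: stops at the hinge, an automatic step-size decay
```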
4 The Deep Frank-Wolfe Algorithm
4.1 Algorithm
We focus on the optimization of equation (6) when $\mathcal{L}$ is the multi-class hinge loss (2). The results of this section were originally derived for linear models (Lacoste-Julien et al., 2013). Our contribution is to show for the first time how they can be exploited for deep neural networks thanks to our formulation (6). We will refer to the resulting algorithm for neural networks as Deep Frank-Wolfe (DFW). We begin by stating the key advantage of our method.
Proposition 1 (Optimal step-size, (Lacoste-Julien et al., 2013)).
Problem (6) with a hinge loss is amenable to optimization with Frank-Wolfe in the dual, which yields an optimal step-size in closed form at each iteration $t$.
This optimal step-size can be obtained in closed form because the hinge loss is convex and piecewise linear. In fact, the approach presented here can be applied to any loss function that is convex and piecewise linear (for instance, the $\ell_1$ distance for regression).
Since the step-size can be computed in closed form, the main computational challenge is to obtain the update direction, that is, the conditional gradient of the dual. In the following result, we show that by taking a single step per proximal problem, this dual conditional gradient can be computed at the same cost as a standard stochastic gradient. The proof is available in appendix A.5.
Proposition 2 (Cost per iteration).
If a single step is performed on the dual of (6), its conditional gradient is given by the stochastic gradient of the loss, $\partial \big( \mathcal{L}_{j(t)} \circ f_{j(t)} \big)(\mathbf{w}_t)$. Given the step-size $\gamma_t$, the resulting update can be written as:

$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \left[ \partial \rho(\mathbf{w}_t) + \gamma_t \, \partial \big( \mathcal{L}_{j(t)} \circ f_{j(t)} \big)(\mathbf{w}_t) \right]$   (7)
In other words, the cost per iteration of the DFW algorithm is the same as that of SGD, since the update only requires standard stochastic gradients. In addition, we point out that in a mini-batch setting, the conditional gradient is given by the average of the gradients over the mini-batch. As a consequence, we can use batch Frank-Wolfe in the dual rather than coordinate-wise updates, with the same parallelism as SGD over the samples of a mini-batch.
One can observe how the update (7) exploits the optimal step-size given by Proposition 1. There is a geometric interpretation to the role of this step-size $\gamma_t$. When $\gamma_t$ is set to its minimal value 0, the resulting iterate does not move along the direction of the conditional gradient. Since the step-size is optimal, this can only happen if the current iterate is detected to be at a minimum of the piecewise linear approximation. Conversely, when $\gamma_t$ reaches its maximal value 1, the algorithm moves as far as possible along this direction. In that case, the update is the same as the one obtained by SGD (as given by equation (4)). In other words, $\gamma_t$ can automatically decay the effective learning rate, thereby removing the need to design a learning rate schedule by hand.
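To make the role of the optimal step-size concrete, the following minimal sketch (our own illustration; the gradient values are made up) applies an update of the form w ← w − η(∂ρ + γ ∂L), interpolating between a regularization-only step (γ = 0) and the full SGD step (γ = 1):

```python
import numpy as np

def dfw_update(w, grad_rho, grad_loss, eta, gamma):
    # Interpolated update: w_{t+1} = w_t - eta * (d rho(w_t) + gamma * d loss(w_t)).
    return w - eta * (grad_rho + gamma * grad_loss)

w = np.array([1.0, -2.0])
grad_rho = np.array([0.01, -0.02])  # gradient of the regularization at w
grad_loss = np.array([0.5, 0.3])    # stochastic gradient of the loss at w
eta = 0.1

# gamma = 1: identical to the SGD step on the full stochastic gradient.
assert np.allclose(dfw_update(w, grad_rho, grad_loss, eta, 1.0),
                   w - eta * (grad_rho + grad_loss))
# gamma = 0: no movement along the loss direction; only the linearized
# regularization is followed.
assert np.allclose(dfw_update(w, grad_rho, grad_loss, eta, 0.0),
                   w - eta * grad_rho)
```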
As mentioned previously, the DFW algorithm performs only one step per proximal problem. Since problem (6) is only an approximation of the original problem (3), it may be unnecessarily expensive to solve it very accurately. Therefore, taking a single step per proximal problem may help the DFW algorithm to converge faster. This is confirmed by our experimental results, which show that DFW is often able to minimize the learning objective (3) at greater speed than SGD.
4.2 Improvements for Deep Neural Networks
We present two improvements to customize the application of our algorithm to deep neural networks.
Smoothing.
The SVM loss is non-smooth and has sparse derivatives, which can cause difficulties when training a deep neural network (Berrada et al., 2018). In Appendix A.6, we derive a novel result that shows how we can exploit the smooth primal cross-entropy direction and inexpensively detect when to switch back to the standard conditional gradient.
Nesterov Momentum.
To take advantage of acceleration in the same way as the SGD baseline, we adapt Nesterov momentum (Nesterov, 1983) to the DFW algorithm. We defer the details to Appendix A.7 for space reasons. We further note that the momentum coefficient is typically set to a high value, say 0.9, and does not contribute significantly to the computational cost of cross-validation.
4.3 Algorithm Summary
The main steps of DFW are shown in Algorithm 1. As the key feature of our approach, note that the step-size is computed in closed form in step 11 of the algorithm (colored in blue).
Note that only the hyperparameter $\eta$ will be tuned in our experiments: we will use the same batch-size, momentum and number of epochs as the baselines. In addition, we point out again that when $\gamma_t = 1$, we recover the SGD step with Nesterov momentum.
In sections A.5 and A.6 of the appendix, we detail the derivation of the optimal step-size (step 11) and the computation of the search direction (step 8). The computation of the dual search direction is omitted here for space reasons. However, its implementation is straightforward in practice, and its computational cost is linear in the size of the output space.
Finally, we emphasize that the DFW algorithm is motivated by an empirical perspective. While our method is not guaranteed to converge, our experiments show an effective minimization of the learning objective for the problems encountered in practice.
5 Experiments
We compare the Deep Frank-Wolfe (DFW) algorithm to state-of-the-art optimizers. We show that, across diverse data sets and architectures, the DFW algorithm outperforms adaptive gradient methods (with the exception of one setting, DN on CIFAR-10, where it obtains similar performance to AMSGrad and BPGrad). In addition, the DFW algorithm offers competitive and sometimes superior performance to SGD at considerably less computational cost, even though SGD has the advantage of a schedule hand-designed separately for each of these tasks.
Our experiments are implemented in PyTorch (Paszke et al., 2017), and the code will be made publicly available. All models are trained on a single Nvidia Titan Xp card.
5.1 Image Classification with Convolutional Neural Networks
Data Set & Architectures.
The CIFAR-10/100 data sets contain 60,000 RGB natural images of size 32×32 with 10/100 classes (Krizhevsky, 2009). We split the training set into 45,000 training samples and 5,000 validation samples, and use 10,000 samples for testing. The images are centered and normalized per channel. As is standard practice, we use random horizontal flipping and random crops with four pixels of padding. We perform our experiments on two modern architectures of deep convolutional neural networks: wide residual networks (Zagoruyko & Komodakis, 2016) and densely connected convolutional networks (Huang et al., 2017). Specifically, we employ a wide residual network of depth 40 and width factor 4, which has 8.9M parameters, and a bottleneck densely connected convolutional network of depth 40 and growth factor 40, which has 1.9M parameters. We refer to these architectures as WRN and DN respectively. All the following experimental details follow the protocol of Zagoruyko & Komodakis (2016) and Huang et al. (2017). The only difference is that, instead of using 50,000 samples for training, we use 45,000 samples for training and 5,000 samples for validation, which we found to be essential for all adaptive methods. While DFW uses an SVM loss, the baselines are trained with the cross-entropy (CE) loss, since this resulted in better performance for them.
Method.
We compare DFW to the most commonly used adaptive gradient methods: Adagrad (Duchi et al., 2011), Adam (Kingma & Ba, 2015), the corrected version of Adam called AMSGrad (Reddi et al., 2018), and BPGrad (Zhang et al., 2017b). For these methods and for DFW, we cross-validate the initial learning rate as a power of 10. We also evaluate the performance of SGD with momentum (simply referred to as SGD), for which we follow the protocol of (Zagoruyko & Komodakis, 2016) and (Huang et al., 2017). For all methods, we set a budget of 200 epochs for WRN and 300 epochs for DN. For DN, the regularization is set as in (Huang et al., 2017). For WRN, the regularization coefficient is cross-validated between the value of (Zagoruyko & Komodakis, 2016) and a more usual value that we have found to perform better for some of the methods (in particular DFW, since the corresponding loss function is an SVM instead of the CE loss for which the published value was designed). The value of the Nesterov momentum is set to 0.9 for BPGrad, SGD and DFW. DFW has only one hyperparameter to tune, namely $\eta$, which is analogous to an initial learning rate. For SGD, the initial learning rate is set to 0.1 on both WRN and DN. Following (Zagoruyko & Komodakis, 2016) and (Huang et al., 2017), it is then divided by 5 at epochs 60, 120 and 180 for WRN, and by 10 at epochs 150 and 225 for DN.
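For reference, the baseline SGD schedules just described amount to the following (a sketch; the function name and interface are ours):

```python
def sgd_learning_rate(epoch, arch):
    """Hand-designed SGD schedules of the baselines, starting from 0.1:
    WRN: divide by 5 at epochs 60, 120 and 180; DN: divide by 10 at 150 and 225."""
    lr = 0.1
    milestones, factor = ((60, 120, 180), 5) if arch == "WRN" else ((150, 225), 10)
    for m in milestones:
        if epoch >= m:
            lr /= factor
    return lr

print(sgd_learning_rate(130, "WRN"))  # 0.1 / 5 / 5
print(sgd_learning_rate(200, "DN"))   # 0.1 / 10
```

In contrast, DFW only requires the single value $\eta$, with no milestones to design.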
Results.
Table 1: Test accuracy (%) on CIFAR-10.
Architecture  Optimizer  Test Accuracy (%)
WRN  Adagrad  86.07
WRN  Adam  84.86
WRN  AMSGrad  86.08
WRN  BPGrad  88.62
WRN  DFW  90.18
WRN  SGD  90.08
DN  Adagrad  87.32
DN  Adam  88.44
DN  AMSGrad  90.53
DN  BPGrad  90.85
DN  DFW  90.22
DN  SGD  92.02
Table 2: Test accuracy (%) on CIFAR-100.
Architecture  Optimizer  Test Accuracy (%)
WRN  Adagrad  57.64
WRN  Adam  58.46
WRN  AMSGrad  60.73
WRN  BPGrad  60.31
WRN  DFW  67.83
WRN  SGD  66.78
DN  Adagrad  56.47
DN  Adam  64.61
DN  AMSGrad  68.32
DN  BPGrad  59.36
DN  DFW  69.55
DN  SGD  70.33
Observe that DFW significantly outperforms the adaptive gradient methods, particularly on the more challenging CIFAR-100 data set. On the WRN-CIFAR-100 task in particular, DFW obtains a testing accuracy that is about 7% higher than that of all other adaptive methods, and outperforms SGD with a hand-designed schedule by 1%. The inferior generalization of adaptive gradient methods is consistent with the findings of Wilson et al. (2017). On all tasks, the accuracy of DFW is comparable to SGD. Note also that DFW converges significantly faster than SGD: the network reaches its final performance several times faster in all cases. We illustrate this with an example in figure 2, which plots the training and validation errors on DN-CIFAR-100. In figure 3, one can see how the step-size is automatically decayed by DFW on this same experiment: we compare the effective step-size of DFW to the manually designed schedule of SGD.
5.2 Natural Language Inference with Recurrent Neural Networks
Data Set.
The Stanford Natural Language Inference (SNLI) data set is a large corpus of 570k pairs of sentences (Bowman et al., 2015). Each sentence pair is labeled with one of three possible labels: entailment, neutral, or contradiction. This three-way classification problem allows the model to learn the semantics of the text data. Thanks to its scale and its supervised labels, this data set allows large neural networks to learn high-quality text embeddings. As Conneau et al. (2017) demonstrate, the SNLI corpus can thus be used as a basis for transfer learning in natural language processing, in the same way that the ImageNet data set is used for pre-training in computer vision.
Method.
We follow the protocol of Conneau et al. (2017) to learn their best model, namely a bi-directional LSTM of about 47M parameters. In particular, the reported results use SGD with an initial learning rate of 0.1 and a hand-designed schedule that adapts to the validation performance: if the validation accuracy does not improve, the learning rate is divided by 5. We also report results with Adagrad, Adam, AMSGrad and BPGrad. Following the official SGD baseline, Nesterov momentum is deactivated. Using their open-source implementation, we replace the optimizer by the DFW algorithm, the CE loss by an SVM, and leave all other components unchanged. In this experiment, we use the conditional gradient direction rather than the CE gradient, since three-way classification does not cause sparsity in the derivative of the hinge loss (the issue that originally motivated our use of a different direction). We cross-validate our initial proximal coefficient $\eta$ as a power of ten, and do not manually tune any schedule. In order to disentangle the importance of the loss function from that of the optimization algorithm, we run the baselines with both an SVM loss and a CE loss. The initial learning rate of the baselines is also cross-validated as a power of ten.
Results.
The results are presented in Table 3.
Table 3: Test accuracy (%) on SNLI.
Loss  Adagrad  Adam  AMSGrad  BPGrad  DFW  SGD  SGD
CE  83.8  84.5  84.2  83.6  –  84.7  84.5
SVM  84.6  85.0  85.1  84.2  85.2  85.2  –
Note that these results match or outperform the testing accuracy of 84.5% reported in (Conneau et al., 2017), which is obtained with CE. This experiment, which is performed on a completely different architecture and data set than the previous one, confirms that DFW outperforms adaptive gradient methods and matches the performance of SGD with a hand-designed learning rate schedule.
6 The Importance of the Step-Size
6.1 Impact on Generalization
It is worth discussing the subtle relationship between optimization and generalization. As an illustrative example, consider the following experiment: we take the protocol to train the DN network on CIFAR-100 with SGD, and simply change the initial learning rate to be ten times smaller, and the budget of epochs to be ten times larger. As a result, the final training objective significantly decreases from 0.33 to 0.069. Yet at the same time, the best validation accuracy decreases from 70.94% to 68.7%. A similar effect occurs when decreasing the value of the momentum, and we have observed this across various convolutional architectures. In other words, accurate optimization is less important for generalization than the implicit regularization of a high learning rate.
We have observed DFW to accurately optimize the learning objective in our experiments. However, given the above observation, we believe that its good generalization properties are rather due to its capability to usually maintain a high learning rate at an early stage. Similarly, the good generalization performance of SGD may be due to its schedule with a large number of steps at a high learning rate.
6.2 Sensitivity Analysis
The previous section has qualitatively hinted at the importance of the step-size for generalization. Here we quantitatively analyze the impact of the initial learning rate on both the training accuracy (quality of optimization) and the validation accuracy (quality of generalization). We compare results of the DFW and SGD algorithms on the CIFAR data sets when varying the value of $\eta$ as a power of 10. The results on the validation set are summarized in figure 4, and the performance on the training set is reported in Appendix B.
On the training set, both methods obtain nearly perfect accuracy across at least three orders of magnitude of $\eta$ (details in Appendix B.3). In contrast, the results of figure 4 confirm that the validation performance is sensitive to the choice of $\eta$ for both methods.
In some cases where $\eta$ is high, SGD obtains better performance than DFW. This is because the hand-designed schedule of SGD enforces a decay of the learning rate, while the DFW algorithm relies on an automatic decay of the step-size for effective convergence. This automatic decay may not happen if a small proximal term (large $\eta$) is combined with a local approximation that is not sufficiently accurate (for instance with a small batch-size).
However, if we allow the DFW algorithm to use a larger batch-size, the local approximation becomes more accurate and DFW can handle large values of $\eta$ as well. Interestingly, choosing a larger batch-size and a larger value of $\eta$ can result in better generalization. For instance, by using a batch-size of 256 (instead of 64) and a larger $\eta$, DFW obtains a test accuracy of 72.64% on CIFAR-100 with the DN architecture (SGD obtains 70.33% with the settings of (Huang et al., 2017)).
6.3 Discussion
Our empirical evidence indicates that the initial learning rate can be a crucial hyperparameter for good generalization. We have observed in our experiments that a high initial learning rate provides a consistent improvement for convolutional neural networks: accurate minimization of the training objective with large initial steps usually leads to good generalization. Furthermore, as mentioned in the previous section, it is sometimes beneficial to increase the batch-size in order to be able to train the model with large initial steps.
In the case of recurrent neural networks, however, this effect is not as distinct. Additional experiments on different recurrent architectures have shown variations in the impact of the learning rate and in the best-performing optimizer. Further analysis would be required to understand the effects at play.
7 Conclusion
We have introduced DFW, an efficient algorithm to train deep neural networks. DFW predominantly outperforms adaptive gradient methods, and obtains similar performance to SGD without requiring a handdesigned learning rate schedule.
We emphasize the generality of our framework in Section 3, which enables the training of deep neural networks to benefit from any advance on optimization algorithms for linear SVMs. This framework could also be applied to other loss functions that yield efficiently solvable proximal problems. In particular, our algorithm already supports the use of structured prediction loss functions (Taskar et al., 2003; Tsochantaridis et al., 2004), which can be used, for instance, for image segmentation.
We have mentioned the intricate relationship between optimization and generalization in deep learning. This illustrates a major difficulty in the design of effective optimization algorithms for deep neural networks: the learning objective does not include all the regularization needed for good generalization. We believe that in order to further advance optimization for deep neural networks, it is essential to alleviate this problem and expose a clear objective function to optimize.
References
 Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Neural Information Processing Systems, 2016.

 Arpit et al. (2017) Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. International Conference on Machine Learning, 2017.
 Ba et al. (2017) Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using Kronecker-factored approximations. International Conference on Learning Representations, 2017.
 Bach (2015) Francis Bach. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 2015.
 Baydin et al. (2018) Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. International Conference on Learning Representations, 2018.
 Berrada et al. (2017) Leonard Berrada, Andrew Zisserman, and M Pawan Kumar. Trusting SVM for piecewise linear CNNs. International Conference on Learning Representations, 2017.
 Berrada et al. (2018) Leonard Berrada, Andrew Zisserman, and M Pawan Kumar. Smooth loss functions for deep topk classification. International Conference on Learning Representations, 2018.
 Botev et al. (2017) Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. International Conference on Machine Learning, 2017.
 Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. Conference on Empirical Methods in Natural Language Processing, 2015.
 Bubeck (2015) Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 2015.
 Chaudhari & Soatto (2018) Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. International Conference on Learning Representations, 2018.
 Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. Conference on Empirical Methods in Natural Language Processing, 2017.
 Desjardins et al. (2015) Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. Neural Information Processing Systems, 2015.
 Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
 Frank & Wolfe (1956) Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.

 Frerix et al. (2018) Thomas Frerix, Thomas Möllenhoff, Michael Moeller, and Daniel Cremers. Proximal backpropagation. International Conference on Learning Representations, 2018.
 Goel et al. (2017) Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the ReLU in polynomial time. Conference on Learning Theory, 2017.
 Grosse & Martens (2016) Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. International Conference on Machine Learning, 2016.
 Hardt et al. (2016) Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. International Conference on Machine Learning, 2016.

 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition, 2016.
 Heinemann et al. (2016) Uri Heinemann, Roi Livni, Elad Eban, Gal Elidan, and Amir Globerson. Improper deep kernels. International Conference on Artificial Intelligence and Statistics, 2016.
 Hochreiter & Obermayer (2005) Sepp Hochreiter and Klaus Obermayer. Optimal gradient-based learning using importance weights. International Joint Conference on Neural Networks, 2005.
 Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Neural Information Processing Systems, 2017.
 Huang et al. (2017) Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. Conference on Computer Vision and Pattern Recognition, 2017.
 Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical Report, 2009.
 Lacoste-Julien et al. (2013) Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. International Conference on Machine Learning, 2013.
 Lewis & Wright (2016) Adrian S Lewis and Stephen J Wright. A proximal method for composite minimization. Mathematical Programming, 2016.
 Li & Malik (2017) Ke Li and Jitendra Malik. Learning to optimize. International Conference on Learning Representations, 2017.
 Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. International Conference on Learning Representations, 2017.
 Martens & Grosse (2015) James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. International Conference on Machine Learning, 2015.
 Martens & Sutskever (2012) James Martens and Ilya Sutskever. Training deep and recurrent networks with Hessian-free optimization. Neural Networks: Tricks of the Trade, 2012.
 Martens et al. (2018) James Martens, Jimmy Ba, and Matt Johnson. Kronecker-factored curvature approximations for recurrent neural networks. International Conference on Learning Representations, 2018.
 Mohapatra et al. (2016) Pritish Mohapatra, Puneet Dokania, C. V. Jawahar, and M. Pawan Kumar. Partial linearization based optimization for multiclass SVM. European Conference on Computer Vision, 2016.
 Nesterov (1983) Yurii Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 1983.
 Neyshabur et al. (2015) Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. Neural Information Processing Systems, 2015.
 Neyshabur et al. (2016) Behnam Neyshabur, Yuhuai Wu, Ruslan R Salakhutdinov, and Nati Srebro. Path-normalized optimization of recurrent neural networks with ReLU activations. Neural Information Processing Systems, 2016.
 Neyshabur et al. (2017) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. Neural Information Processing Systems, 2017.
 Ollivier (2013) Yann Ollivier. Riemannian metrics for neural networks. Information and Inference: a Journal of the IMA, 2013.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. NIPS Autodiff Workshop, 2017.
 Ravi & Larochelle (2017) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. International Conference on Learning Representations, 2017.
 Reddi et al. (2018) Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. International Conference on Learning Representations, 2018.
 Roux et al. (2008) Nicolas L Roux, PierreAntoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. Neural Information Processing Systems, 2008.
 Rumelhart et al. (1986) David Rumelhart, Geoffrey Hinton, and Ronald Williams. Learning representations by back-propagating errors. Nature, 1986.
 Schaul et al. (2013) Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. International Conference on Machine Learning, 2013.
 Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
 Singh & ShaweTaylor (2018) Gaurav Singh and John ShaweTaylor. Faster convergence & generalization in DNNs. arXiv preprint, 2018.
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. Conference on Computer Vision and Pattern Recognition, 2015.
 Taskar et al. (2003) Benjamin Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. Neural Information Processing Systems, 2003.
 Taylor et al. (2016) Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: A scalable ADMM approach. International Conference on Machine Learning, 2016.
 Tsochantaridis et al. (2004) Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. International Conference on Machine Learning, 2004.
 Wichrowska et al. (2017) Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha SohlDickstein. Learned optimizers that scale and generalize. International Conference on Machine Learning, 2017.
 Wilson et al. (2017) Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. Neural Information Processing Systems, 2017.
 Wu et al. (2018) Xiaoxia Wu, Rachel Ward, and Léon Bottou. WNGrad: Learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865, 2018.
 Yuille & Rangarajan (2002) Alan L. Yuille and Anand Rangarajan. The concave-convex procedure (CCCP). Neural Information Processing Systems, 2002.
 Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. British Machine Vision Conference, 2016.
 Zeiler (2012) Matthew Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint, 2012.
 Zhang et al. (2017a) Yuchen Zhang, Percy Liang, and Martin J. Wainwright. Convexified convolutional neural networks. International Conference on Machine Learning, 2017a.
 Zhang et al. (2017b) Ziming Zhang, Yuanwei Wu, and Guanghui Wang. BPGrad: Towards global optimality in deep learning via branch and pruning. Conference on Computer Vision and Pattern Recognition, 2017b.
Appendix A Proofs & Algorithms
For completeness, we prove results for our specific instance of the structural SVM problem. We point out that the proofs of Sections A.1, A.2 and A.3 are adaptations from Lacoste-Julien et al. (2013). Propositions are numbered according to their appearance in the paper.
A.1 Preliminaries
In this section, we assume the loss to be a hinge loss:
(8) 
We suppose that we have received a sample . We simplify the notation and . For simplicity of the notation, and without loss of generality, we consider the proximal problem obtained at time :
(9) 
Let us define the classification task loss:
(10) 
Using this notation, the multiclass hinge loss can be written as:
(11) 
Indeed, we can successively write:
(12) 
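Since the equations above were lost in extraction, a minimal numerical sketch of the multiclass hinge loss with a 0-1 task loss may help fix ideas; the names `scores` and `y` are ours, chosen for illustration:

```python
import numpy as np

def multiclass_hinge(scores, y):
    # max over labels j of: Delta(y, j) + s_j - s_y,
    # with the 0-1 task loss Delta(y, j) = 1 if j != y, else 0.
    delta = np.ones_like(scores)
    delta[y] = 0.0
    return float(np.max(delta + scores - scores[y]))

# The maximum is attained at j = y (value 0) when every margin is at least 1.
print(multiclass_hinge(np.array([2.0, 0.5, -1.0]), 0))  # prints 0.0
```

When some margin is violated, the loss equals the largest augmented-score gap, e.g. `multiclass_hinge(np.array([0.0, 1.0, 0.0]), 0)` gives 2.0.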
We are now going to rewrite problem (9) as the sum of a quadratic term and a pointwise maximum of linear functions. For , let us define:
(13) 
Then we have that:
(14) 
Therefore, problem (9) can be written as:
(15) 
We notice that the term in is a constant that does not depend on nor , therefore we can simplify the expression of to:
(16) 
We introduce the following notation:
(17)  
(18)  
(19) 
We will also use the indicator vector: , which is equal to 1 at index and 0 elsewhere.
A.2 Dual Objective
Lemma 1 (Dual Objective).
The Lagrangian dual of (9) is given by:
(20) 
Given the dual variables , the primal can be computed as .
Proof.
We derive the Lagrangian of the primal problem. For that, we write the problem in the following equivalent ways:
(21)  
(22)  
(23)  
(24) 
We can now write the KKT conditions of the inner minimization problem:
(25) 
This gives and , since by definition. By injecting these constraints in , we obtain:
(26) 
which finally gives the desired result. ∎
A.3 Derivation of the Optimal Step-Size
Lemma 2 (Optimal Step-Size).
Suppose that we make a step in the direction of in the dual. We define the corresponding primal variables and , as well as . Then the optimal step-size is given by:
(27) 
Proof.
Given the direction , we take the step . The new objective is given by:
(28) 
In order to compute the optimal step-size, we differentiate the above expression with respect to γ and set the derivative to 0:
(29) 
We can isolate the unique term containing :
(30) 
This yields:
(31) 
We can then inject the primal variables and simplify:
(32) 
∎
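Since equation (27) itself was lost in extraction, it may help to recall the shape this line search takes in the batch Frank-Wolfe setting of Lacoste-Julien et al. (2013), with primal iterate $\mathbf{w}$, corner $\mathbf{w}_s$, loss terms $\ell, \ell_s$, and regularization $\lambda$; this is a hedged reconstruction in that notation, not necessarily the paper's exact variables:

```latex
\gamma^{*} \;=\; \operatorname{clip}_{[0,1]}\!\left(
  \frac{\lambda\, \mathbf{w}^{\top}(\mathbf{w}-\mathbf{w}_{s}) \;-\; (\ell - \ell_{s})}
       {\lambda\, \lVert \mathbf{w}-\mathbf{w}_{s} \rVert^{2}}
\right)
```

The numerator is the Frank-Wolfe duality gap along the direction, and the denominator its curvature, so the unconstrained minimizer is clipped to the feasible interval $[0,1]$.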
A.4 Primal-Dual Proximal Frank-Wolfe Algorithm
We present here the primal-dual algorithm that solves using the previous results:
Note that when is linear, and when the search direction is given by the conditional gradient, we recover the standard Frank-Wolfe algorithm for SVM (Lacoste-Julien et al., 2013).
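As a concrete illustration of that special case, the sketch below runs batch Frank-Wolfe on the dual of a binary linear SVM, maintained through its primal representation as in Lacoste-Julien et al. (2013). The binary simplification, variable names, and toy data are ours, not the paper's:

```python
import numpy as np

def fw_linear_svm(X, y, lam=0.01, iters=200):
    """Batch Frank-Wolfe on the dual of a binary linear SVM (labels in {-1,+1})."""
    n, d = X.shape
    w = np.zeros(d)   # primal image of the current dual variables
    ell = 0.0         # running combination of the task-loss terms
    for _ in range(iters):
        # Loss-augmented inference with the 0-1 task loss: a point contributes
        # to the Frank-Wolfe corner iff its margin is violated.
        viol = y * (X @ w) < 1.0
        w_s = (X[viol] * y[viol, None]).sum(axis=0) / (lam * n)
        ell_s = viol.mean()
        # Closed-form line search on the quadratic dual, clipped to [0, 1].
        d_w = w - w_s
        denom = lam * (d_w @ d_w)
        if denom == 0.0:
            gamma = 1.0
        else:
            gamma = float(np.clip((lam * (w @ d_w) - (ell - ell_s)) / denom, 0.0, 1.0))
        w = (1.0 - gamma) * w + gamma * w_s
        ell = (1.0 - gamma) * ell + gamma * ell_s
    return w

# Toy separable problem (illustrative data only).
X = np.array([[2.0, 0.0], [1.0, 0.5], [-1.0, -0.5], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = fw_linear_svm(X, y)
```

Each iteration performs loss-augmented inference, builds the corresponding corner of the dual feasible set, and moves towards it with the closed-form optimal step-size; no learning rate schedule is needed.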
A.5 Single-Step Proximal Frank-Wolfe Algorithm
We now provide some simplifications of steps 7, 9 and 10 of Algorithm 2 when a single step is taken, as is the case in the DFW algorithm. This corresponds to the iteration .
Proposition 2 (Cost per iteration).
If a single step is performed on the dual of (6), its conditional gradient is given by . The resulting update can be written as:
(33) 
Proof.
It is known that for linear SVMs, the direction of the dual conditional gradient is given by the negative subgradient of the primal (Lacoste-Julien et al., 2013; Bach, 2015). We apply this result to the Taylor expansion of the network, which is the local model used for the proximal problem. Then we have that at iteration , the conditional gradient is given by:
(34) 
It now suffices to notice that a first-order Taylor expansion does not modify the derivative at its point of linearization: for a function , . By applying this property and the chain rule to (34), we obtain that the conditional gradient is given by: (35)
This completes the proof that the conditional gradient direction is given by a stochastic gradient. We now prove equation (33) in the next lemma. ∎
Lemma 3.
Proof.
Again, since we perform a single step of FW, we assume . To prove equation (36), we note that:
(39) 
We point out the two following results:
(40) 
and:
(41) 
Since by definition, equation (37) is obtained with a simple application of equations (40) and (41). Finally, we prove equation (38) by writing:
(42) 
∎
A.6 Smoothing the Loss
As pointed out in the paper, the SVM loss is non-smooth and has sparse derivatives, which can prevent the effective training of deep neural networks (Berrada et al., 2018). Partial linearization can solve this problem by locally smoothing the dual (Mohapatra et al., 2016). However, this would introduce a temperature hyperparameter, which is undesirable. Therefore, we note that DFW can be applied with any direction that is feasible in the dual, since it computes an optimal step-size. In particular, the following result states that we can use the well-conditioned and non-sparse gradient of cross-entropy.
Proposition 3.
The gradient of cross-entropy in the primal gives a feasible direction in the dual. Furthermore, we can inexpensively detect when this feasible direction cannot provide any improvement in the dual, and automatically switch to the conditional gradient when that is the case.
For simplicity, we divide Proposition 3 into two distinct parts: first we show how the CE gradient gives a feasible direction in the dual, and then how it can be detected to be an ascent direction.
Lemma 4.
The gradient of cross-entropy in the primal gives a feasible direction in the dual. In other words, the gradient of cross-entropy in the primal is such that there exists a dual search direction verifying .
Proof.
We consider the vector of scores . We compute its softmax: . Clearly, by property of the softmax. Furthermore, by going back to the definition of , one can easily verify that is exactly the primal gradient given by a backward pass through the cross-entropy loss instead of the hinge loss. This concludes the proof. ∎
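The two facts used in this proof can be checked numerically: the softmax of the scores lies on the probability simplex (hence gives feasible dual coordinates), and the cross-entropy gradient is the softmax minus the one-hot label. Names are ours, for illustration:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())   # shift by the max for numerical stability
    return e / e.sum()

def ce_gradient(s, y):
    # d/ds of -log softmax(s)[y] is softmax(s) - one_hot(y)
    g = softmax(s)
    g[y] -= 1.0
    return g

s = np.array([2.0, 0.5, -1.0])
p = softmax(s)
# p is non-negative and sums to one: a valid point of the simplex,
# hence a feasible set of dual coordinates.
```

Note that the gradient `ce_gradient(s, y)` sums to zero, which mirrors the fact that moving along it in the dual preserves the simplex constraint.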
The previous lemma has shown that we can use the gradient of cross-entropy as a feasible direction in the dual. The next step is to make it a dual ascent direction, that is, a direction which always permits improvement on the dual objective (unless at the optimal point). In what follows, we show that we can inexpensively (and approximately) compute a sufficient condition for to be an ascent direction. If the condition is not satisfied, then we can automatically switch to the subgradient of the hinge loss (which is known to be an ascent direction in the dual).
Lemma 5.
Let be a feasible direction in the dual, and be the vector of augmented scores output by the linearized model.
Let us assume that we apply the single-step Proximal Frank-Wolfe algorithm (that is, we have ), and that is a nonnegative function.
Then is a sufficient condition for to be an ascent direction in the dual.
Proof.
Let , . By definition, we have that:
(43) 
Therefore:
(44) 
We have just shown that if , then . Since is an optimal step-size, this indicates that is an ascent direction (we would obtain for a direction that cannot provide improvement). ∎
Approximate Condition.
In practice, we consider that . Indeed, for , we have that , and , which is typically very small (we use a weight decay coefficient on the order of in our experimental settings). Therefore, we replace by in the above criterion, which becomes inexpensive since is already computed by the forward pass.
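Under the assumption (ours, since the exact criterion was lost in extraction) that the approximate test amounts to checking whether the hinge loss on the current scores is strictly positive, the switch between the two directions could be sketched as:

```python
import numpy as np

def choose_direction(scores, y):
    # Approximate ascent test: use the cross-entropy direction only when the
    # hinge loss on the current scores is strictly positive; otherwise fall
    # back to the hinge subgradient (the conditional gradient), which is
    # always an ascent direction in the dual.
    delta = np.ones_like(scores)
    delta[y] = 0.0
    hinge = np.max(delta + scores - scores[y])
    return "cross_entropy" if hinge > 0 else "conditional_gradient"
```

The test is essentially free: the scores are already produced by the forward pass, matching the remark above.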
A.7 Nesterov Momentum
As can be seen in the previous primal-dual algorithms, taking a step in the dual can be decomposed into two stages: the initialization and the movement along the search direction. The initialization step is not informative about the optimization problem. Therefore, we discard it from the momentum velocity, and only accumulate the step along the conditional gradient (scaled by ). This results in the following velocity update:
(45) 
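A minimal sketch of the velocity update described above; the names `mu` (momentum coefficient), `gamma` (optimal step-size) and `conditional_step` are ours, since equation (45) was lost in extraction:

```python
def velocity_update(z, mu, gamma, conditional_step):
    # Accumulate only the movement along the conditional-gradient direction,
    # scaled by the optimal step-size gamma; the initialization part of the
    # dual step is deliberately excluded from the velocity z.
    return [mu * zi + gamma * si for zi, si in zip(z, conditional_step)]
```

The resulting velocity is then used as in standard Nesterov momentum, the only difference being which part of the dual step is accumulated.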
Appendix B Experimental Details on the CIFAR Data Sets
B.1 Adaptive Gradient Baselines: Cross-Validation
Weight decay  Learning rate  Accuracy CIFAR-10 (%)  Accuracy CIFAR-100 (%)
0.0001  0.001  71.6  39.44 
0.0001  0.01  88.18  55.72 
0.0001  0.1  86.4  55.44 
0.0001  1  68.48  20.68 
Weight decay  Learning rate  Accuracy CIFAR-10 (%)  Accuracy CIFAR-100 (%)
0.0001  0.001  68.98  31.86 
0.0001  0.01  86.4  53.82 
0.0001  0.1  83.6  51.18 
0.0005  0.001  68.66  32.5 
0.0005  0.01  86.3  56.16 
0.0005  0.1  77.92  44.12 
Weight decay  Learning rate  Accuracy CIFAR-10 (%)  Accuracy CIFAR-100 (%)
0.0001  0.0001  86.26  50.7 
0.0001  0.001  89.42  63.9 
0.0001  0.01  81.12  51.82 
Weight decay  Learning rate  Accuracy CIFAR-10 (%)  Accuracy CIFAR-100 (%)
0.0001  0.0001  79.7  41.42 
0.0001  0.001  86.1  58.7 
0.0001  0.01  80.06  50.86 
0.0005  0.0001  78.88  40.08 
0.0005  0.001  85.14  55.26 
0.0005  0.01  72.54  36.82 
Weight decay  Learning rate  Accuracy CIFAR-10 (%)  Accuracy CIFAR-100 (%)
0.0001  0.0001  84.28  49.54 
0.0001  0.001  90.4  68.54 
0.0001  0.01  83.98  50.44 
Weight decay  Learning rate  Accuracy CIFAR-10 (%)  Accuracy CIFAR-100 (%)
0.0001  0.0001  75.86  41.6 
0.0001  0.001  87.02  59.6 
0.0001  0.01  82.32  52.12 
0.0005  0.0001  75.74  42.28 
0.0005  0.001  86.16  57.82 
0.0005  0.01  75.82  36.48 
Weight decay  Learning rate  Accuracy CIFAR-10 (%)  Accuracy CIFAR-100 (%)
0.0001  0.001  72.72  40.96 
0.0001  0.01  83.26  53.12 
0.0001  0.1  91.7  59.7 
0.0001  1  10.16  1.16 
Weight decay  Learning rate  Accuracy CIFAR-10 (%)  Accuracy CIFAR-100 (%)
0.0001  0.001  64.98  31.9 
0.0001  0.01  78.46  44.26 
0.0001  0.1  89.24  54.42 
0.0001  1  16.1  1.16 
0.0005  0.001  68.08  33.26 
0.0005  0.01  85.44  59.9 
0.0005  0.1  88.44  51.28 
0.0005  1  10.16  1.16 
B.2 Convergence Plots
In this section we provide the convergence plots of the different algorithms on the CIFAR data sets. In some cases the training performance can show some oscillations. We emphasize that this is the result of cross-validating the initial learning rate based on the validation set performance: sometimes a better-behaved convergence would be obtained on the training set with a lower learning rate. However, this lower learning rate is not selected because it does not provide the best validation performance.
B.3 SGD & DFW: Sensitivity Analysis
We provide here a sensitivity analysis of the DFW algorithm with respect to its hyperparameter , and we compare it against the SGD algorithm with its custom schedule.
Appendix C Experimental Details on the SNLI Data Set
C.1 Cross-Validation
Learning rate  Accuracy CE (%)  Accuracy SVM (%)
0.001  83.43  84.16 
0.01  83.77  84.62 
0.1  62.09  34.5 
Learning rate  Accuracy CE (%)  Accuracy SVM (%)
1e05  83.18  83.02 
0.0001  84.56  84.69 
0.001  84.42  83.31 
0.01  33.82  33.82 
Learning rate  Accuracy CE (%)  Accuracy SVM (%)
1e05  82.81  82.95 
0.0001  84.69  84.83 
0.001  84.66  83.59 
0.01  36.78  38.25 
Learning rate  Accuracy CE (%)  Accuracy SVM (%)
0.001  75.51  74.87 
0.01  83.09  83.02 
0.1  83.93  84.24 
1.0  84.28  84.73 
10  33.82  33.31 
Learning rate  Accuracy (%)
0.1  84.87 
1.0  85.21 
10  84.76 
Learning rate  Accuracy CE (%)  Accuracy SVM (%)
0.01  84.22  84.59 
0.1  84.63  85.15 
1.0  85.06  84.7 
10  34.59  34.51 