1 Introduction
Machine learning tasks can often be expressed as general constrained convex optimization problems of the form

(1) $\min_{x \in \mathcal{D}} f(x),$

where $f$ is a convex and continuously differentiable function, and $\mathcal{D}$ is a compact convex subset of a Hilbert space. For such optimization problems, one of the simplest and earliest known iterative optimizers is the Frank-Wolfe (FW) algorithm Frank & Wolfe (1956); Jaggi (2013), also known as conditional gradient, summarized in Fig. 2(bottom). In each iteration, it considers the linearization of the objective at the current position and moves towards a minimizer of this linear function (taken over the same domain). In other words, Frank-Wolfe effectively turns the constrained convex optimization problem into a series of simple linear optimization problems.
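To make the iteration concrete, the following is a minimal sketch of Frank-Wolfe over the unit simplex (the quadratic toy objective and all names are illustrative, not taken from the paper): each step solves the linear subproblem exactly by picking the unit basis vector at the smallest gradient coordinate, so no projection is ever needed.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, n_iters=2000):
    """Frank-Wolfe over the unit simplex: the linear subproblem
    min_{s in simplex} <s, grad(x)> is solved by the unit basis
    vector at the coordinate with the smallest gradient entry."""
    x = x0.copy()
    for t in range(n_iters):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0           # vertex minimizing the linearization
        lam = 2.0 / (t + 2.0)           # standard Frank-Wolfe step-size schedule
        x = (1.0 - lam) * x + lam * s   # convex combination: always feasible
    return x

# toy objective f(x) = ||x - c||^2 with minimizer c inside the simplex
c = np.array([0.2, 0.5, 0.3])
x = frank_wolfe_simplex(lambda x: 2.0 * (x - c), np.array([1.0, 0.0, 0.0]))
```

Note that every iterate is a convex combination of simplex vertices, which is what makes the method projection-free.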
Recently, Bauckhage (2017) employed this view to implement Frank-Wolfe optimization over the unit simplex—the convex hull of the unit basis vectors—in terms of a recurrent neural network (RNN). Since the domain is given as an intersection of linear constraints, the subproblems can be solved using softmin activation functions. This paper significantly extends our understanding of such neural conditional gradients.
As a warm-up, we show that the resulting Frank-Wolfe Networks (FWNets)—the generalized architecture is shown in Fig. 2(top)—allow one to implement (training) support vector machines directly within neural networks. Unfortunately, the resulting neural optimizer is too dense to scale to large classification problems: it hinges on the quadratically sized Gram matrix. Consequently, as our second contribution, we introduce sparse FWNets for convex optimization over the unit ball of the trace norm, i.e., $\{X : \|X\|_{tr} \le \delta\}$. Since the subproblems amount to approximating the unit left and right top singular vectors of the gradient matrix, we replace the softmin activation functions by sparse RNNs that are structurally equivalent to the well-known power iteration. This allows one to realize neural conditional gradients for, e.g., sparse softmax classifiers that scale well to large datasets.
The closest in spirit to FWNets are probably OptNets Amos & Kolter (2017). They integrate constrained optimization problems, in particular quadratic ones, as individual layers into neural networks. This also has the potential for richer end-to-end training of complex tasks that require such optimization. However, OptNets do not cast the optimizer itself as a neural network. Instead, external optimizers are invoked to solve OptNets. This hampers a seamless integration with other deep learning concepts.
Consider, e.g., learning to learn (L2L), which has a long history in psychology Ward (1937); Harlow (1949); Kehoe (1988) and has inspired many recent attempts within the machine learning community to build agents capable of learning to learn Schmidhuber (1987); Naik & Mammone (1992); Thrun & Pratt (1998); Hochreiter et al. (2001); Santoro et al. (2016); Andrychowicz et al. (2016); Wang et al. (2016); Ravi & Larochelle (2017); Li & Malik (2017). So far, however, learning to learn has mainly been considered for gradient(-free) optimizers (L2LG); learning to learn by conditional gradients (L2LC) has not been proposed. Our third contribution fills this gap. We show how to boost the performance of neural conditional gradients by learning parts of them instead of using hand-coded ones. Our learned conditional gradient optimizers, implemented by LSTMs, outperform hand-designed as well as unconstrained but learned competitors. We demonstrate this on a number of classification tasks, including training deep SVM and softmax classifiers.
We proceed as follows. We start off by reviewing L2L. Then we illustrate FWNets and use them to devise L2LC. Afterwards, we introduce sparse FWNets for trace-norm problems. Before concluding, we present our experimental evaluation.
2 Learning to learn by gradients by gradients
Let us start off by briefly reviewing learning to learn by gradients by gradients Andrychowicz et al. (2016). The goal is to optimize an objective function $f$ defined over some domain $\mathcal{D}$. To this end, we seek the minimizer $x^* = \arg\min_{x \in \mathcal{D}} f(x)$. While any method capable of minimizing this objective can be applied, the standard approach for differentiable functions is some form of gradient descent, resulting in a sequence of updates $x_{t+1} = x_t - \eta_t \nabla f(x_t)$. To realize L2L, Andrychowicz et al. (2016) proposed to replace hand-designed update rules with a learned update rule, called the optimizer $g$, specified by its own set of parameters $\phi$. This results in updates to the optimizee $f$ of the form $x_{t+1} = x_t + g_t(\nabla f(x_t), \phi)$. More precisely, Andrychowicz et al. advocated realizing the update rule $g$ using a recurrent neural network (RNN), which maintains its own state and hence dynamically updates as a function of its iterates.
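As a toy illustration of the difference between a hand-designed and a learned update rule, consider the following sketch. The momentum-like "learned" rule and its parameter `phi` are illustrative stand-ins for the LSTM optimizer of Andrychowicz et al., not their actual architecture:

```python
import numpy as np

def hand_designed_step(x, g, lr=0.1):
    """Classical hand-designed update: plain gradient descent."""
    return x - lr * g

def learned_step(x, g, state, phi=0.1):
    """Sketch of a learned update rule g_phi with internal state:
    a momentum-like accumulator stands in for an RNN cell here."""
    state = 0.9 * state + g        # the optimizer's own state evolves with its iterates
    return x - phi * state, state

# optimizee: f(x) = 0.5 * ||x||^2, so grad f(x) = x
x, state = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(300):
    x, state = learned_step(x, x, state)
```

In L2L proper, `phi` (and the whole cell) would itself be trained by unrolling this loop and backpropagating the optimizee's loss through it.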
Indeed, this learning to learn by gradients by gradients is widely applicable due to the simplicity of gradient computations. When facing a constrained optimization problem, however, maintaining feasibility typically requires a projection step, which is potentially computationally expensive, especially for complex feasible regions in very large dimensions. To overcome this, we advocate the use of the Frank-Wolfe algorithm Jaggi (2013), which eschews the projection step and rather uses a linear optimization oracle to stay within the feasible region. While convergence rates and regret bounds are often suboptimal, in many cases the gain due to only having to solve a single linear optimization problem over the feasible region in every iteration still leads to significant computational advantages. This may explain its popularity for problems such as computing the distance to a convex hull, computing a minimum enclosing ball, or training a support vector machine.
3 Neural support vector machines
Support vector machines (SVMs) are workhorses of machine learning. Frank-Wolfe algorithms for training them Ouyang & Gray (2010) solve a quadratic program (QP) over the unit simplex, i.e., the convex hull of the unit basis vectors. To implement them as neural networks, we can proceed as follows. Consider, e.g., the (L2-)SVM formulation for binary classification

$\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + \tfrac{C}{2}\textstyle\sum_i \xi_i^2 \quad \text{s.t.}\quad y_i\big(w^\top \varphi(x_i) + b\big) \ge 1 - \xi_i,$

where $y_i \in \{-1, +1\}$ are the labels. The corresponding Lagrangian dual problem for SVMs can be expressed as:

(2) $\min_{\alpha \in \Delta_n} f(\alpha) = \alpha^\top \tilde{K} \alpha.$

Here, $\tilde{K}$ is a positive definite kernel matrix. As shown in Frandi et al. (2014), the dual can be written in exactly this form, hence we have $\tilde{K}_{ij} = y_i y_j\, k(x_i, x_j) + y_i y_j + \delta_{ij}/C$, where $k$ denotes the kernel function and $\Delta_n$ the unit simplex.

So, the gradient of the new objective is $\nabla f(\alpha) = 2\tilde{K}\alpha$. Therefore, FW requires computing

(3) $s_t = \mathrm{softmin}_\beta\big(\nabla f(\alpha_t)\big) = \mathrm{softmin}_\beta\big(2\tilde{K}\alpha_t\big),$
where the nonlinear, vector-valued function $\mathrm{softmin}_\beta$ is the well-known softmin operator defined as

(4) $\big[\mathrm{softmin}_\beta(x)\big]_i = \frac{\exp(-\beta x_i)}{\sum_j \exp(-\beta x_j)},$
for which we note $\lim_{\beta \to \infty} \mathrm{softmin}_\beta(x) = e_{\arg\min_i x_i}$. Plugging in the relaxed optimization step (3), we can now rewrite the Frank-Wolfe updates for SVMs as

$\alpha_{t+1} = (1-\lambda_t)\,\alpha_t + \lambda_t\, s_t,$

where $\lambda_t = 2/(t+2)$. But this is then to say that, by choosing an appropriate parameter $\beta$ for the softmin function, the following nonlinear dynamical system

(5) $\alpha_{t+1} = (1-\lambda_t)\,\alpha_t + \lambda_t\, \mathrm{softmin}_\beta\big(2\tilde{K}\alpha_t\big)$

mimics Frank-Wolfe up to arbitrary precision.
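A minimal sketch of the dynamical system (5) in NumPy (the small positive definite matrix below is a made-up stand-in for the kernel matrix): the hard argmin of standard Frank-Wolfe is replaced by a softmin activation, turning the update into an RNN-style recurrence.

```python
import numpy as np

def softmin(x, beta):
    """Softmin activation: a softmax of -beta*x. For large beta it
    approaches the basis vector at the smallest entry of x."""
    z = np.exp(-beta * (x - x.min()))   # shift by the min for numerical stability
    return z / z.sum()

def fwnet_step(alpha, K, lam, beta):
    """One step of the recurrence (5) for the dual f(a) = a^T K a:
    move toward the softmin of the gradient 2 K a."""
    s = softmin(2.0 * K @ alpha, beta)
    return (1.0 - lam) * alpha + lam * s

# toy positive definite matrix standing in for the kernel matrix
K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
alpha = np.array([0.5, 0.5])
for t in range(200):
    alpha = fwnet_step(alpha, K, lam=2.0 / (t + 2.0), beta=10.0)
```

Since the softmin output lies on the simplex, every update is a convex combination of simplex points, so feasibility is preserved by construction.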
The underlying FW over the unit simplex is summarized in Fig. 2(bottom). Structurally, it is equivalent to the system of equations governing the dynamics of echo state networks, a particular form of recurrent neural networks (RNNs), shown in Fig. 2(top) for training SVMs. For inference, we can unroll the RNN into a multi-layer neural network. Due to well-known FW convergence results Jaggi (2013), we know that $O(1/\epsilon)$ layers are likely to provide an $\epsilon$-approximate solution to the SVM problem.
Moreover, the neural view on training SVMs allows one to "deepify" SVMs: we replace the final classification layer of a deep network by an FWNet that trains an SVM. This enables end-to-end training akin to Tang (2013); Zhang et al. (2015), but in a simpler and fully neural fashion: the SVM parameters are updated via forward propagation only, and the parameters of the kernel neural network are updated by gradient descent using backpropagation of the error starting at the FWNet, cf. Fig. 3. The output of the kernel network serves as input to the FWNet. During training, the FWNet computes the kernel matrix at each iteration, and the weights $\alpha_t$ (respectively $\alpha_{t+1}$ of Eq. (5)) are updated as described above. To predict the class of a new example, we make one forward pass through the network.
4 Learning to learn by conditional gradients by gradients (L2LC)
The performance of FWNets is hampered by the fact that they only make use of linearizations, ignoring other information such as curvature. To speed them up, we now introduce learning to learn by conditional gradients by gradients (L2LC), as depicted in Fig. 4.
The FWNet is unrolled over time $t$, and at each step parts of the Frank-Wolfe update are produced by an RNN, the optimizer (orange). This way, the optimizer adapts these parts, which are then used to form the Frank-Wolfe update. Consider, e.g., learning to learn SVMs. Instead of using the typical hand-coded rule $\lambda_t = 2/(t+2)$ or implementing a line search, we learn to adapt, e.g., the learning rate:
(6) $\lambda_t = \sigma\big(g_\phi(\alpha_t, h_t)\big),$

where $\alpha_t$ are the weights of the Frank-Wolfe layer, $h_t$ is the state of an RNN $g_\phi$, e.g., an LSTM, and $\sigma$ is the sigmoid, keeping $\lambda_t \in (0,1)$. Or, we learn to adapt the conditional gradient itself. For that, one has to be a little bit more careful. We have to ensure that the predictions lie on the unit simplex:
(7) $\alpha_{t+1} = (1-\lambda_t)\,\alpha_t + \lambda_t\,\sigma\big(g_\phi(\nabla f(\alpha_t), h_t)\big),$

where $\lambda_t = 2/(t+2)$ or is constant. That is, the RNN predicts the unconstrained gradient, which is then projected onto the unit simplex using a sigmoid. Overall, the FWNet is unrolled over the learning iterations $t$, and at each step the unconstrained gradient is used as input to the RNN, the optimizer (orange). The prediction is then squeezed through a sigmoid, and we update the weight vector.
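The step-size variant (6) can be sketched with a tiny vanilla RNN cell in place of the LSTM (all weight names and shapes below are illustrative): the cell consumes the current gradient and its hidden state, and its sigmoid-squashed output becomes the step size, so the Frank-Wolfe update always remains a convex combination.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2lc_step(alpha, grad, h, W_h, W_g, w_out):
    """One L2LC update: an RNN cell proposes the step size lambda_t,
    which gates a standard Frank-Wolfe move toward the best vertex."""
    h = np.tanh(W_h @ h + W_g @ grad)   # RNN state update
    lam = sigmoid(w_out @ h)            # learned step size in (0, 1)
    s = np.zeros_like(alpha)
    s[np.argmin(grad)] = 1.0            # Frank-Wolfe vertex from the linear oracle
    return (1.0 - lam) * alpha + lam * s, h

rng = np.random.default_rng(0)
d, n = 4, 3                             # hidden size, simplex dimension (illustrative)
W_h = rng.standard_normal((d, d))
W_g = rng.standard_normal((d, n))
w_out = rng.standard_normal(d)
alpha, h = np.full(n, 1.0 / n), np.zeros(d)
for _ in range(10):
    alpha, h = l2lc_step(alpha, rng.standard_normal(n), h, W_h, W_g, w_out)
```

Because the learned quantity only gates a convex combination, feasibility never depends on how well the RNN is trained.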
5 Neural sparse softmax classifiers
Unfortunately, neural SVMs are not likely to scale well. The underlying SVM scales quadratically in the number of training examples due to the Gram matrix. Indeed, one may resort to devising neural implementations of stochastic Frank-Wolfe algorithms Jaggi (2013) or frame the learning problem within L2LC, generalizing local FWNets to a global model Vinyals et al. (2016); Ravi & Larochelle (2017). Here, we introduce FWNets for training large-scale, sparse softmax classifiers Liu & Tsang (2017), i.e., for optimization problems of the following form:

(8) $\min_{X \in \mathcal{D}} f(X),$

where $\mathcal{D} = \{X \in \mathbb{R}^{m \times n} : \|X\|_{tr} \le \delta\}$, with $\|\cdot\|_{tr}$ being the trace norm (also called the nuclear or Schatten 1-norm). The trace-norm ball is the convex hull of the scaled rank-1 matrices, which is also compact. The averaged multiclass objective is $f(X) = \frac{1}{N}\sum_{i=1}^N f_i(X)$, with $f_i$ the multinomial logistic (softmax) loss on the $i$-th training example; the individual gradients $\nabla f_i(X)$ are averaged to form the gradient matrix $\nabla f(X)$.
Since Schatten norms are invariant under orthogonal transformations, we can employ the singular value decomposition (SVD) to minimize the induced linear subproblems. Therefore, the main computational cost of a single FW layer on a Schatten-norm domain remains the computation of the SVD of $\nabla f(X)$, which is in $O(mn \min(m,n))$ Jaggi (2013). For the bounded trace norm, however, the subproblems can be solved by a single approximate eigenvector computation instead of a complete SVD, which is much more efficient, especially if the matrix dimensions are large and the optimal solution is low-rank Allen-Zhu et al. (2017). This gives Frank-Wolfe a significant computational advantage over projected and proximal gradient descent approaches. The vectors $u_1$ and $v_1$ can be efficiently computed via power iteration Zheng et al. (2017). This results in a rank-1 solution of the linear subproblem of Eq. (8), which can be written as $S = -\delta\, u_1 v_1^\top$, where $u_1$ and $v_1$ are the unit left and right top singular vectors of the gradient matrix $\nabla f(X)$.

The Frank-Wolfe algorithm and corresponding FWNet for the trace-norm domain are shown in Fig. 5. Here, lines 4–11 instantiate the general Frank-Wolfe algorithm in Alg. 2 with power iterations to compute the top singular vectors $u_1$ and $v_1$ of $\nabla f(X)$. Zheng et al. (2017) showed that a small number of power iterations is sufficient to ensure sublinear convergence in expectation, and that if the number of power iterations is held constant across iterations, the Frank-Wolfe algorithm converges in expectation to a neighborhood of the optimal solution whose size decreases with the number of power iterations. In any case, the power iteration can naturally be implemented within FWNets using an RNN, as summarized in Fig. 5. Everything else remains conceptually the same and, in turn, we may even realize L2LC over low-rank matrices following similar arguments as for the unit simplex.
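The power-iteration subroutine and the resulting rank-1 Frank-Wolfe step can be sketched as follows (a plain NumPy illustration of the scheme, not the paper's TensorFlow implementation; function names are ours):

```python
import numpy as np

def top_singular_pair(G, n_iters=100, seed=0):
    """Power iteration for the top singular vectors of G:
    alternate v <- G^T u and u <- G v with normalization,
    mirroring the RNN unrolled inside the FWNet."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    u = G @ v
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = G.T @ u
        v /= np.linalg.norm(v)
        u = G @ v
        u /= np.linalg.norm(u)
    return u, v

def fw_tracenorm_step(X, G, delta, lam):
    """Frank-Wolfe step on the trace-norm ball of radius delta:
    the linear subproblem min_{||S||_tr <= delta} <S, G> is solved
    by the rank-1 matrix S = -delta * u v^T."""
    u, v = top_singular_pair(G)
    S = -delta * np.outer(u, v)
    return (1.0 - lam) * X + lam * S
```

Only a rank-1 matrix is ever materialized per step, which is the source of the scalability advantage over a full SVD.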
6 Experimental evidence
Our intention here is to evaluate neural conditional gradients by investigating the following questions: (Q1) Can FWNets compete with popular, non-neural gradient descent approaches such as ADAM Kingma & Ba (2014)? (Q2) Can we train C-SVMs end-to-end using FWNets? (Q3) Can L2LC be faster than L2LG? (Q4) Do neural rank-1 softmax classifiers perform and scale well?
To this end, we implemented FWNets, neural SVMs, neural softmax classifiers, and L2LC using the TensorFlow API version 1.3 and the L2L implementation of Andrychowicz et al. (2016). All experiments were run on a Linux machine with an NVIDIA GeForce GTX 1080 Ti with 11 GB memory and an AMD Ryzen Threadripper 1950X CPU with 16 physical cores (32 threads in total). We considered several datasets. For comparing FWNets with classical, non-neural gradient optimization, we used the synthetic "concentric circles" dataset (Fig. 8) as well as the real-world datasets MNIST LeCun et al. (1998), containing images of handwritten digits, and Cifar10 respectively Cifar100 Krizhevsky (2009), containing images of different animals and vehicles. For the L2L experiments, we split the data into three disjoint sets. One split was used to train the optimizee, one for training the optimizer, and the final one to test the corresponding learned model. The neural SVMs are compared to ADAM gradient optimizers based on (1) reparameterization and (2) Lagrange multipliers to deal with the "sum to one" constraint. The objective of the Lagrangian approach reads

(9) $L(\alpha) = \alpha^\top \tilde{K} \alpha + \mu \big(\textstyle\sum_i \alpha_i - 1\big)^2.$
Furthermore, we evaluated the performance of learning to learn the optimizer for solving the given problem. For that, we used an LSTM as optimizer. More precisely, following Andrychowicz et al. (2016), we introduced an additional LSTM to optimize the step size, respectively two different LSTMs when optimizing the fully connected and convolutional layers. In all experiments we used two-layer LSTMs with 20 hidden units in each layer, aiming at minimizing (2) respectively (9) using truncated backpropagation through time and early stopping in order to avoid overfitting.
Few-Shot Neural SVMs (Q1, Q3). In our first experiment we considered classes 1 and 2, denoted as Cifar2, from the Cifar10 dataset. We extracted their features from an inception network and used them for training the base models (Q1) using a linear kernel. Additionally, we trained an optimizer for FW (Q3). The remaining hyperparameters were set via random search. Fig. 7 summarizes the results. The optimizer is trained on the classes 1 and 2, but then used to optimize neural SVMs on all pairwise combinations (1-3, 1-4, …, 2-3, …) of classes from Cifar10, therefore transferring to a completely novel dataset. As one can see, FWNets and L2LC outperformed the other baselines. FWNets with a hand-designed, adaptive step size can be slightly faster than L2LC, but the LSTM learns to control FW in a similar way and shows a much smaller variance. This answers (Q1, Q3) affirmatively.

Deep SVMs (Q2, Q3).
Next we considered training deep SVMs on the "concentric circles" dataset, i.e., we placed an FWNet as the last layer of a neural network, trained in an end-to-end (Q2) as well as in a learning to learn fashion (Q3). The neural network we used as kernel contained three layers; the first and second layers are fully connected with four and two neurons, respectively, trained using ADAM. Fig. 7 summarizes the results. As one can see, the learned optimizers outperformed the hand-coded ones. Moreover, as Fig. 8 illustrates, a learned optimizer may find smoother hyperplanes achieving the same loss in fewer iterations when also adapting the kernel: just using FWNets takes 500 iterations; when the LSTM controls the step size, it takes 200 iterations; when also adapting the kernel, it takes just 20 iterations and the hyperplane is considerably smoother. This answers (Q2, Q3) affirmatively.
To investigate deep SVMs further, we also considered Cifar10. We split the labels of the dataset into two classes, namely natural and man-made. The natural class contains bird, cat, deer, dog, frog, and horse; the man-made class contains airplane, automobile, ship, and truck. As kernel we used a neural network with both convolutional and fully connected layers: three convolutional layers with max pooling, followed by a fully connected layer with 32 hidden units; all nonlinearities were ReLU activations with batch normalization. The final layer is an FWNet that trains an SVM, and the rest of the network was trained using ADAM. Fig. 9 summarizes the results. As one can see, the step-size-learned conditional gradient (L2LC) outperforms hand-coded optimizers, even those with an adaptive step size, requiring less than half the iterations to converge. Training the kernel as well is harder, as it is a non-convex problem; exploring this further is an interesting avenue for future work. In any case, the results answer (Q2, Q3) affirmatively.
Sparse Neural Softmax Classifiers (Q1, Q3, Q4).
Finally, we investigated FWNets for training deep softmax classifiers on MNIST. We used a simple CNN with two convolutional layers and one fully connected layer consisting of 16 neurons, followed by a fully connected softmax layer. For both FWNets and ADAM-based optimizers we used the same constant step size. The FWNets unrolled the power iteration networks for five steps in order to compute the left and right top singular vectors of the gradient matrix. Additionally, we trained an optimizer (L2LC) for FW. To this end, we split the training set of MNIST into two disjoint sets of equal size. The results are summarized in Fig. 10. As one can see, the sparse FWNet and L2LC outperformed ADAM, both in terms of convergence and predictive performance, reaching the same top-1 accuracy in less than a third of the iterations. Furthermore, the classifier trained with the learned optimizer yields much more confident predictions, as one can see from the behavior of the loss function. This answers (Q1, Q3, Q4) affirmatively.

To investigate this further, we also considered wider and deeper CNNs on MNIST as well as on Cifar10 and Cifar100. The denser the network became, the better ADAM performed. This validates our assumption of low-rank solutions: if the low-rank assumption does not hold or is not required, there is no point in estimating a sparse model using the trace-norm constraint Allen-Zhu et al. (2017).

7 Conclusion
We have introduced the learning to learn by conditional gradients (L2LC) framework based on Frank-Wolfe Networks (FWNets). This enables one to train sparse convex optimizers that are specialized to particular classes of problems. We illustrated this for training SVMs and sparse softmax classifiers. Our experimental results confirm that learned conditional gradients compare favorably against state-of-the-art optimization methods used in deep learning.
There are several interesting avenues for future work. One should develop FWNets for other ML tasks, such as graph classification Kersting et al. (2014) and Bayesian quadrature Briol et al. (2015), as well as for other FW approaches Jaggi (2013). One may also adapt the power iteration in an end-to-end fashion Duvenaud et al. (2015). Finally, hierarchical RNNs Wichrowska et al. (2017) have the potential to speed up learning to learn by conditional gradients.
Acknowledgments: This work was supported by the Federal Ministry of Food, Agriculture and Consumer Protection (BMELV) based on a decision of the German Federal Office for Agriculture and Food (BLE); grant nr. “2818204715”.
References
 Allen-Zhu et al. (2017) Allen-Zhu, Z., Hazan, E., Hu, W., and Li, Y. Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls. In Proc. NIPS, 2017.
 Amos & Kolter (2017) Amos, B. and Kolter, J.Z. OptNet: Differentiable Optimization as a Layer in Neural Networks. In Proc. ICML, 2017.
 Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Colmenarejo, S.G., Hoffman, M.W., Pfau, D., Schaul, T., and de Freitas, N. Learning to Learn by Gradient Descent by Gradient Descent. CoRR, 2016.
 Bauckhage (2017) Bauckhage, C. A Neural Network Implementation of FrankWolfe Optimization. In Proc. ICANN, 2017.
 Briol et al. (2015) Briol, F.X., Oates, C.J., Girolami, M.A., and Osborne, M.A. Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees. In Proc. NIPS, 2015.
 Duvenaud et al. (2015) Duvenaud, D.K., Maclaurin, D., Aguilera-Iparraguirre, J., Gómez-Bombarelli, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R.P. Convolutional Networks on Graphs for Learning Molecular Fingerprints. In Proc. NIPS, 2015.
 Frandi et al. (2014) Frandi, E., Ñanculef, R., and Suykens, J.A.K. Complexity Issues and Randomization Strategies in Frank-Wolfe Algorithms for Machine Learning. arXiv:1410.4062, 2014.
 Frank & Wolfe (1956) Frank, M. and Wolfe, P. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956.
 Harlow (1949) Harlow, H.F. The Formation of Learning Sets. Psychological Review, 56(1):51–65, 1949.
 Hochreiter et al. (2001) Hochreiter, S., Younger, A.S., and Conwell, P.R. Learning to Learn Using Gradient Descent. In Proc. ICANN, 2001.
 Jaggi (2013) Jaggi, M. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Proc. ICML, 2013.

 Kehoe (1988) Kehoe, E. A Layered Network Model of Associative Learning: Learning to Learn and Configuration. Psychological Review, 95(4):411–433, 1988.
 Kersting et al. (2014) Kersting, K., Mladenov, M., Garnett, R., and Grohe, M. Power Iterated Color Refinement. In Proc. AAAI, 2014.
 Kingma & Ba (2014) Kingma, D.P. and Ba, J. Adam: A Method for Stochastic Optimization. In Proc. ICLR, 2014.
 Krizhevsky (2009) Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Technical report, Univ. of Toronto, 2009.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Li & Malik (2017) Li, K. and Malik, J. Learning to Optimize Neural Nets. arXiv:1703.00441, 2017.
 Liu & Tsang (2017) Liu, Z. and Tsang, I.W. Approximate Conditional Gradient Descent on MultiClass Classification. In Proc. AAAI, 2017.
 Naik & Mammone (1992) Naik, D.K. and Mammone, R.J. Meta-Neural Networks that Learn by Learning. In Proc. IJCNN, 1992.
 Ouyang & Gray (2010) Ouyang, H. and Gray, A.G. Fast Stochastic Frank-Wolfe Algorithms for Nonlinear SVMs. In Proc. SDM, 2010.
 Ravi & Larochelle (2017) Ravi, S. and Larochelle, H. Optimization as a Model for Few-shot Learning. In Proc. ICLR, 2017.
 Santoro et al. (2016) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T.P. MetaLearning with MemoryAugmented Neural Networks. In Proc. ICML, 2016.
 Schmidhuber (1987) Schmidhuber, J. Evolutionary Principles in Self-Referential Learning. Master’s thesis, TU Munich, 1987.
 Tang (2013) Tang, Y. Deep Learning Using Support Vector Machines. arXiv:1306.0239, 2013.
 Thrun & Pratt (1998) Thrun, S. and Pratt, L. (eds.). Learning to Learn. Kluwer Academic Publishers, 1998.
 Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. Matching Networks for One Shot Learning. In Proc. NIPS, 2016.
 Wang et al. (2016) Wang, J.X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J.Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to Reinforcement Learn. arXiv:1611.05763, 2016.
 Ward (1937) Ward, L.B. Reminiscence and Rote Learning. Psychological Monographs, 49(4), 1937.
 Wichrowska et al. (2017) Wichrowska, O., Maheswaranathan, N., Hoffman, M.W., Gomez Colmenarejo, S., Denil, M., de Freitas, N., and Sohl-Dickstein, J. Learned Optimizers that Scale and Generalize. arXiv:1703.04815, 2017.
 Zhang et al. (2015) Zhang, S.X., Liu, C., Yao, K., and Gong, Y. Deep Neural Support Vector Machines for Speech Recognition. In Proc. ICASSP, 2015.
 Zheng et al. (2017) Zheng, W., Bellet, A., and Gallinari, P. A Distributed Frank-Wolfe Framework for Learning Low-Rank Matrices with the Trace Norm. arXiv:1712.07495, 2017.