Machine learning tasks can often be expressed as general constrained convex optimization problems of the form
$$\min_{x \in \mathcal{D}} f(x),$$
where $f$ is a convex and continuously differentiable function, and $\mathcal{D}$ is a compact convex subset of a Hilbert space. For such optimization problems, one of the simplest and earliest known iterative optimizers is the Frank-Wolfe (FW) algorithm Jaggi (2013), also known as conditional gradient and summarized in Fig. 2(bottom). In each iteration, it considers the linearization of the objective at the current position and moves towards a minimizer of this linear function (taken over the same domain). In other words, Frank-Wolfe effectively turns the constrained convex optimization problem into a series of simple linear optimization problems.
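The iteration described above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the implementation used in this paper: the domain is taken to be the unit simplex, the step size is the classical $2/(t+2)$ schedule, and the function name `frank_wolfe_simplex` as well as the toy quadratic objective are hypothetical.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters=200):
    """Frank-Wolfe over the unit simplex (illustrative sketch)."""
    x = np.asarray(x0, dtype=float).copy()
    for t in range(num_iters):
        g = grad(x)
        # Linear subproblem: a linear function over the simplex is
        # minimized at the vertex with the smallest gradient entry.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        # Assumed step size: the classical 2 / (t + 2) schedule.
        eta = 2.0 / (t + 2.0)
        # Convex combination, so the iterate never leaves the simplex.
        x = x + eta * (s - x)
    return x

# Toy objective (assumption): f(x) = ||x - c||^2 with c inside the simplex.
c = np.array([0.2, 0.5, 0.3])
x_fw = frank_wolfe_simplex(lambda x: 2.0 * (x - c), np.array([1.0, 0.0, 0.0]))
```

No projection is needed at any point: every iterate is a convex combination of simplex vertices and is therefore feasible by construction.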
Recently, Bauckhage (2017) employed this view to implement Frank-Wolfe optimization over the unit simplex (the convex hull of the unit basis vectors) in terms of a recurrent neural network (RNN). Since the domain
$$\Delta^{n-1} = \{x \in \mathbb{R}^n \mid x \geq 0,\ \mathbf{1}^\top x = 1\}$$
is given as an intersection of linear constraints, the subproblems can be solved using softmin activation functions. This paper significantly extends our understanding of such neural conditional gradients.
As a warm-up, we show that the resulting Frank-Wolfe Networks (FWNets), whose generalized architecture is shown in Fig. 2(top), allow one to implement (the training of) support vector machines directly within neural networks. Unfortunately, the resulting neural optimizer is too dense to scale to large classification problems: it hinges on the Gram matrix, whose size is quadratic in the number of training examples. Consequently, as our second contribution, we introduce sparse FWNets for convex optimization over the unit ball of the trace norm, i.e., over matrices whose trace norm is at most one. Since the subproblems amount to approximating the unit left and right top singular vectors of the gradient matrix, we replace the softmin activation functions by sparse RNNs that are structurally equivalent to the well-known power iteration. This allows one to realize neural conditional gradients for, e.g., sparse softmax classifiers that scale well to large datasets.
The closest in spirit to FWNets are probably OptNets Amos & Kolter (2017). They integrate constrained optimization problems, in particular quadratic ones, as individual layers into neural networks. This also has the potential to enable richer end-to-end training for complex tasks that require such optimization. However, OptNets do not cast the optimizer itself as a neural network. Instead, external optimizers are invoked to solve the optimization problems inside OptNets. This hampers a seamless integration with other deep learning concepts.
Consider e.g. learning to learn (L2L), which has a long history in psychology Ward (1937); Harlow (1949); Kehoe (1988) and has inspired many recent attempts within the machine learning community to build agents capable of learning to learn Schmidhuber (1987); Naik & Mammone (1992); Thrun & Pratt (1998); Hochreiter et al. (2001); Santoro et al. (2016); Andrychowicz et al. (2016); Wang et al. (2016); Ravi & Larochelle (2017); Li & Malik (2017). So far, however, learning to learn has mainly been considered for gradient(-free) optimizers (L2LG); learning to learn by conditional gradients (L2LC) has not been proposed. Our third contribution fills this gap. We show how to boost the performance of neural conditional gradients by learning parts of them instead of using hand-coded ones. Our learned conditional gradient optimizers, implemented by LSTMs, outperform hand-designed as well as unconstrained but learned competitors. We demonstrate this on a number of classification tasks, including training deep SVM and softmax classifiers.
We proceed as follows. We start off by reviewing L2L. Then we illustrate FWNets and use them to devise L2LC. Afterwards, we introduce sparse FWNets for trace-norm problems. Before concluding, we present our experimental evaluation.
2 Learning to learn by gradients by gradients
Let us start off by briefly reviewing learning to learn by gradients by gradients Andrychowicz et al. (2016). The goal is to optimize an objective function $f(\theta)$ defined over some domain $\theta \in \Theta$. To this end, we find the minimizer $\theta^* = \arg\min_{\theta \in \Theta} f(\theta)$. While any method capable of minimizing this objective can be applied, the standard approach for differentiable functions is some form of gradient descent, resulting in a sequence of updates
$$\theta_{t+1} = \theta_t - \alpha_t \nabla f(\theta_t).$$
To realize L2L, Andrychowicz et al. (2016) proposed to replace hand-designed update rules with a learned update rule, called the optimizer $g$, specified by its own set of parameters $\phi$. This results in updates to the optimizee $f$ of the form
$$\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi).$$
More precisely, Andrychowicz et al. advocated realizing the update rule $g$ using a recurrent neural network (RNN), which maintains its own state and hence dynamically updates as a function of its iterates.
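The difference between a hand-designed and a stateful update rule can be illustrated with a small sketch. It is deliberately simplified: instead of a trained RNN, the stateful rule below is a fixed momentum-like recurrence that merely stands in for a learned optimizer, and all names (`run_optimizer`, `gradient_descent_rule`, `stateful_rule`, `phi`) are hypothetical.

```python
import numpy as np

def run_optimizer(grad, theta0, update_rule, state, num_steps):
    """Apply theta <- theta + step, where the step comes from update_rule."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(num_steps):
        step, state = update_rule(grad(theta), state)
        theta = theta + step
    return theta

def gradient_descent_rule(g, state, alpha=0.1):
    # Hand-designed rule: stateless step along the negative gradient.
    return -alpha * g, state

def stateful_rule(g, state, phi=(0.1, 0.9)):
    # Stand-in for a learned rule with its own state.
    # In L2L, phi would be trained; here it is fixed by hand (assumption).
    alpha, mu = phi
    new_state = mu * state - alpha * g
    return new_state, new_state

grad = lambda theta: theta  # gradient of the toy objective 0.5 * ||theta||^2
theta0 = np.array([1.0, -2.0])
theta_gd = run_optimizer(grad, theta0, gradient_descent_rule, None, 100)
theta_st = run_optimizer(grad, theta0, stateful_rule, np.zeros(2), 300)
```

Both rules share the same interface: they map the current gradient and an internal state to an update. This is what makes it possible to swap a hand-coded rule for a trained recurrent model.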
Indeed, learning to learn by gradients by gradients is widely applicable due to the simplicity of gradient computations. When facing a constrained optimization problem, however, maintaining feasibility typically requires a projection step, which is potentially computationally expensive, especially for complex feasible regions in very high dimensions. To overcome this, we advocate the use of the Frank-Wolfe algorithm Jaggi (2013), which eschews the projection step and instead uses a linear optimization oracle to stay within the feasible region. While convergence rates and regret bounds are often suboptimal, in many cases the gain due to only having to solve a single linear optimization problem over the feasible region in every iteration still leads to significant computational advantages. This may explain its popularity for problems such as computing the distance to a convex hull, computing a minimum enclosing ball, or training a support vector machine.
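To make the contrast between a projection step and a linear optimization oracle concrete, the following sketch implements both for the unit simplex. It is an illustration only: the function names are hypothetical, and the sort-based Euclidean projection is the standard textbook routine, not something taken from this paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the unit simplex (sort-based, O(n log n))."""
    n = v.size
    u = np.sort(v)[::-1]                 # entries in decreasing order
    css = np.cumsum(u) - 1.0             # shifted cumulative sums
    ks = np.arange(1, n + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    tau = css[rho] / (rho + 1.0)         # threshold to subtract
    return np.maximum(v - tau, 0.0)

def lmo_simplex(g):
    """Linear oracle: argmin over the simplex of <s, g> is a vertex (O(n))."""
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

p_inside = project_simplex(np.array([0.5, 0.5]))
p_outside = project_simplex(np.array([2.0, 0.0]))
vertex = lmo_simplex(np.array([3.0, -1.0, 2.0]))
```

For the simplex both operations are cheap. The gap widens for domains such as the trace-norm ball, where a projection needs a full singular value decomposition while the linear oracle only needs the top singular vector pair.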
3 Neural support vector machines
Support vector machines (SVMs) are workhorses of machine learning. Frank-Wolfe algorithms for training them Ouyang & Gray (2010) solve a quadratic program (QP) over the unit simplex, i.e., the convex hull of the unit basis vectors. To implement them as neural networks, we can proceed as follows. Consider, e.g., the -SVM formulation for binary classification
where . The corresponding Lagrangian dual problem for SVMs can be expressed as:
Here, is a positive definite kernel matrix. As shown in Frandi et al. (2014), , hence we have where
So, the gradient for the new objective is . Therefore FW requires computing
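where the non-linear, vector-valued function $\sigma_\beta$ is the well-known softmin operator, defined componentwise as
$$\sigma_\beta(z)_i = \frac{e^{-\beta z_i}}{\sum_j e^{-\beta z_j}},$$
for which we note that, whenever the smallest entry of $z$ is unique,
$$\lim_{\beta \to \infty} \sigma_\beta(z) = e_{i^*}, \qquad i^* = \arg\min_i z_i,$$
with $e_{i^*}$ denoting the corresponding unit basis vector. Plugging in the relaxed optimization step (3), we can now rewrite the Frank-Wolfe updates for SVMs as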
where . But this is then to say that by choosing an appropriate parameter for the softmin function the following non-linear dynamical system
mimics Frank-Wolfe up to arbitrary precision.
The underlying FW over the unit simplex is summarized in Fig. 2(bottom). Structurally, it is equivalent to the system of equations governing the dynamics of echo state networks, a particular form of recurrent neural networks (RNNs), shown in Fig. 2(top) for training SVMs. For inference, we can unroll the RNN into a multi-layer neural network. Due to well-known FW convergence results Jaggi (2013), we know that $O(1/\varepsilon)$ layers are likely to provide an $\varepsilon$-approximate solution to the SVM problem.
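The unrolled dynamics can be sketched as follows. This is a hedged illustration rather than the architecture of Fig. 2: it assumes a quadratic objective $\alpha^\top K \alpha$ over the unit simplex with gradient $2K\alpha$, a softmin with sharpness parameter `beta`, and the step size $2/(t+2)$; the names `softmin` and `fwnet_forward` and the toy matrix are hypothetical.

```python
import numpy as np

def softmin(z, beta):
    """Softmin activation; approaches a one-hot argmin as beta grows."""
    a = -beta * (z - z.min())    # shift by the minimum for numerical stability
    e = np.exp(a)
    return e / e.sum()

def fwnet_forward(K, alpha0, num_layers=100, beta=50.0):
    """Unrolled softmin-relaxed Frank-Wolfe; each loop pass is one layer."""
    alpha = np.asarray(alpha0, dtype=float).copy()
    for t in range(num_layers):
        eta = 2.0 / (t + 2.0)                    # assumed step-size schedule
        s = softmin(2.0 * K @ alpha, beta)       # relaxed linear subproblem
        alpha = (1.0 - eta) * alpha + eta * s    # convex combination
    return alpha

# Toy positive definite matrix (assumption, not a kernel from the paper).
K = np.diag([1.0, 2.0, 4.0])
alpha0 = np.array([0.0, 0.0, 1.0])
alpha = fwnet_forward(K, alpha0)
objective_start = alpha0 @ K @ alpha0
objective_end = alpha @ K @ alpha
sharp = softmin(np.array([3.0, 1.0, 2.0]), 100.0)
```

Each layer consists only of a matrix-vector product, a softmin activation, and a convex combination with the previous state, which is what allows the optimizer to be read as a recurrent network.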
Moreover, the neural view on training SVMs allows one to deepify SVMs: we replace the final classification layer of a deep network by an FWNet that trains an SVM. This enables end-to-end training akin to Tang (2013); Zhang et al. (2015) but in a simpler and fully neural fashion: the SVM parameters are updated via forward propagation only, and the parameters of the kernel neural network are updated by gradient descent using back-propagation of the error starting at the FWNet, cf. Fig. 3. Here, denotes the input to the FWNets. During training, the FWNet computes the kernel at each iteration and the weights (respectively of Eq. (5)) are updated as described above. To predict the class of a new example , we make one forward pass through the network.
4 Learning to learn by conditional gradients by gradients (L2LC)
The performance of FWNets is hampered by the fact that they only make use of the linearizations, ignoring other information such as curvature. To speed them up, we now introduce learning to learn by conditional gradients by gradients (L2LC) as depicted in Fig. 4.
The FWNet is unrolled over time, and at each step parts of the optimizer are trained using an RNN as optimizer (orange). This way, the learned optimizer adapts these parts, which are then used to form a Frank-Wolfe update. Consider, e.g., learning to learn SVMs. Instead of using the typical hand-coded rule or implementing a line search, we learn to adapt, e.g., the learning rate:
where are the weights of the Frank-Wolfe layer and $h$ is the state of an RNN, e.g., an LSTM. Alternatively, we learn to adapt the conditional gradient itself. For that, one has to be a little more careful. We have to ensure that the predictions are on the unit simplex:
where or is constant. That is, the RNN predicts the unconstrained gradient, which is then projected onto the unit simplex using a sigmoid. Overall, the FWNet is unrolled over the learning iterations, and at each step the unconstrained gradient is used as input to the RNN, the optimizer (orange). The prediction is then squeezed through a sigmoid and we update the weight vector.
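Why a learned step size cannot break feasibility can be seen in a small sketch. It rests on assumptions that are not taken from the paper: the RNN is replaced by an arbitrary raw scalar `raw_eta` (a placeholder for a network output), and that scalar is squashed through a sigmoid so that the step size lies in $(0, 1)$. The function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fw_step_learned_eta(alpha, grad, raw_eta):
    """One Frank-Wolfe step on the simplex with a predicted step size."""
    # Conditional gradient: simplex vertex with the smallest gradient entry.
    s = np.zeros_like(alpha)
    s[np.argmin(grad)] = 1.0
    # raw_eta stands in for an RNN output (assumption); the sigmoid keeps
    # the step size strictly between 0 and 1.
    eta = sigmoid(raw_eta)
    # Convex combination of two simplex points stays on the simplex.
    return (1.0 - eta) * alpha + eta * s

alpha = np.array([0.2, 0.3, 0.5])
grad = np.array([1.0, -2.0, 0.5])
steps = [fw_step_learned_eta(alpha, grad, r) for r in (-5.0, 0.0, 5.0)]
```

Whatever value the predictor emits, the update remains a convex combination of the current iterate and a vertex, so the learned optimizer can only change how fast the iterate moves, not whether it stays feasible.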
5 Neural sparse softmax classifiers
Unfortunately, neural SVMs are not likely to scale well. The underlying SVM scales quadratically in the number of training examples due to the Gram matrix. Indeed, one may resort to devising neural implementations of stochastic Frank-Wolfe algorithms Jaggi (2013) or frame the learning problem within L2LC, generalizing local FWNets to a global model Vinyals et al. (2016); Ravi & Larochelle (2017). Here, we introduce FWNets for training large-scale, sparse softmax classifiers Liu & Tsang (2017), i.e., for optimization problems of the following form:
where with being the trace norm (also called the nuclear norm or Schatten 1-norm). The unit trace-norm ball is the convex hull of the rank-1 matrices $uv^\top$ with unit vectors $u$ and $v$, and it is compact. The averaged multi-class objectives are with The individual gradients are
Since Schatten norms are invariant under orthogonal transformations, we can employ the singular value decomposition (SVD) to minimize the induced linear subproblems. Therefore, the main computational cost of a single FWLayer on a Schatten-norm domain remains the computation of the SVD of the gradient matrix Jaggi (2013). For bounded trace norm, however, the subproblems can be solved by a single approximate eigenvector computation instead of a complete SVD, which is much more efficient, especially if the matrix dimensions are large and the optimal solution is low-rank Allen-Zhu et al. (2017). This gives Frank-Wolfe a significant computational advantage over projected and proximal gradient descent approaches. The required singular vectors can be computed efficiently via power iteration Zheng et al. (2017). This results in a rank-1 solution of Eq. (8), which can be written as $-uv^\top$, where $u$ and $v$ are the unit left and right top singular vectors of the gradient matrix.
The Frank-Wolfe algorithm and the corresponding FWNet for the trace-norm domain are shown in Fig. 5. Here, lines 4-11 instantiate the general Frank-Wolfe algorithm in Alg. 2 with power iterations to compute the top singular vectors of the gradient matrix. Zheng et al. (2017) showed that a small number of power iterations is sufficient to ensure sublinear convergence in expectation, and that if the number of power iterations is constant (i.e., the same for all Frank-Wolfe iterations), the Frank-Wolfe algorithm converges in expectation to a neighborhood of the optimal solution whose size decreases with the number of power iterations. In any case, the power iteration can naturally be implemented within FWNets using an RNN, as summarized in Fig. 5. Everything else remains conceptually the same and, in turn, we may even realize L2LC over low-rank matrices following similar arguments as for the unit simplex.
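The linear subproblem over the unit trace-norm ball can be sketched with a plain power iteration. The code below is an assumption-laden illustration, not the procedure of Fig. 5: it uses a random unit start vector, a fixed number of alternating multiplications, and assumes a non-zero gradient matrix; the names `top_singular_pair` and `lmo_trace_norm_ball` are hypothetical.

```python
import numpy as np

def top_singular_pair(G, num_iters=5, seed=0):
    """Approximate unit top left/right singular vectors of G."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        # Alternate between the left and right vector, renormalizing each time.
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    u = G @ v
    u /= np.linalg.norm(u)
    return u, v

def lmo_trace_norm_ball(G, num_iters=5):
    """Rank-1 minimizer of <S, G> over the unit trace-norm ball."""
    u, v = top_singular_pair(G, num_iters)
    return -np.outer(u, v)

# Toy gradient matrix (assumption) with largest singular value 3.
G = np.array([[3.0, 0.0], [0.0, 1.0]])
S = lmo_trace_norm_ball(G, num_iters=50)
inner = float(np.sum(S * G))
```

Each power-iteration step is a matrix-vector product followed by a normalization, so a fixed number of steps unrolls into a small recurrent block, and the resulting update direction is always a rank-1 matrix.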
6 Experimental evidence
Our intention here is to evaluate neural conditional gradients by investigating the following questions: (Q1) Can FWNets compete with popular, non-neural gradient descent approaches such as ADAM Kingma & Ba (2014)? (Q2) Can we train CSVMs end-to-end using FWNets? (Q3) Can L2LC be faster than L2LG? (Q4) Do neural rank-1 softmax classifiers perform and scale well?
To this end, we implemented FWNets, neural SVMs, neural softmax classifiers, and L2LC using the TensorFlow API version 1.3 and the L2L implementation of Andrychowicz et al. (2016). All experiments were run on a Linux machine with an NVIDIA GeForce GTX 1080 Ti with 11 GB of memory and an AMD Ryzen Threadripper 1950X CPU with 16 physical cores and 32 threads in total. We considered several datasets. For comparing FWNets with classical, non-neural gradient optimization, we used both the synthetic “concentric circles” dataset (Fig. 8) and the real-world datasets MNIST LeCun et al. (1998), containing images of handwritten digits, and Cifar-10 and Cifar-100 Krizhevsky (2009), containing images of different animals and vehicles. For the L2L experiments, we split the data into three disjoint sets. One split was used to train the optimizee, one for training the optimizer, and the final one to test the corresponding learned model. The neural SVMs are compared to ADAM gradient optimizers based on (1) reparameterization and (2) Lagrange multipliers to deal with the “sum to one” constraint. The objective of the Lagrangian approach reads
Furthermore, we evaluate the performance of learning to learn the optimizer for solving the given problem. For that, we used an LSTM as optimizer. More precisely, following Andrychowicz et al. (2016), we introduced an additional LSTM to optimize the step size, or two different LSTMs when optimizing the fully-connected and convolutional layers. In all experiments we used two-layer LSTMs with 20 hidden units in each layer, aiming at minimizing (2) and (9), respectively, using truncated backpropagation through time and early stopping in order to avoid overfitting.
Few-Shot Neural SVMs (Q1, Q3). In our first experiment we considered classes 1 and 2, denoted as Cifar-2, from the Cifar-10 dataset. We extracted their features from an Inception network and used them for training the base models (Q1) using a linear kernel. Additionally, we trained an optimizer for FW (Q3). A random search set . Fig. 7 summarizes the results. The optimizer is trained on classes 1 and 2, but then used to optimize neural SVMs on all pairwise combinations (1-3, 1-4, …, 2-3, …) of classes from Cifar-10, thereby transferring to a completely novel dataset. As one can see, FWNets and L2LC outperformed the other baselines. FWNets with a hand-designed, adaptive step size can be slightly faster than L2LC, but the LSTM learns to control FW in a similar way and shows a much smaller variance. This answers (Q1, Q3) affirmatively.
Deep SVMs (Q2, Q3).
Next, we considered training deep SVMs on the “concentric circles” dataset, i.e., we placed an FWNet as the last layer of a neural network, trained in an end-to-end (Q2) as well as in a learning to learn fashion (Q3). The neural network we used as kernel contained three layers. The first and second layers are fully-connected layers with four and two neurons, respectively, trained using ADAM. Fig. 7 summarizes the results. As one can see, the learned optimizers outperformed the hand-coded ones. Moreover, as Fig. 8 illustrates, a learned optimizer may find smoother hyperplanes achieving the same loss in fewer iterations, in particular when also adapting the kernel: just using FWNets takes 500 iterations; when the LSTM controls the step size, it takes 200 iterations; when also adapting the kernel, it takes just 20 iterations and the hyperplane is considerably smoother. This answers (Q2, Q3) affirmatively.
To investigate deep SVMs further, we also considered Cifar-10. We split the labels of the dataset into two different classes, namely natural and manmade. The class natural contains the classes bird, cat, deer, dog, frog and horse, and the class manmade contains the classes airplane, automobile, ship and truck. As kernel we used a neural network with both convolutional and fully-connected layers: three convolutional layers with max pooling followed by a fully-connected layer with 32 hidden units; all non-linearities were ReLU activations with batch normalization. The final layer is an FWNet that simulates training an SVM, and the rest of the network was trained using ADAM. Fig. 9 summarizes the results.
As one can see, the step-size-learned conditional gradient L2LC outperforms hand-coded optimizers, even those with adaptive step size, requiring less than half of the iterations to converge. Training the kernel as well is harder, as it is a non-convex problem; exploring this further is an interesting avenue for future work. In any case, the results answer (Q2, Q3) affirmatively.
Sparse Neural Softmax Classifiers (Q1, Q3, Q4).
Finally, we investigated FWNets for training deep softmax classifiers on MNIST. We used a simple CNN with two convolutional layers and one fully-connected layer consisting of 16 neurons, followed by a fully-connected softmax layer. For both FWNets and ADAM-based optimizers we used the same constant step size of. The FWNets unrolled the power iteration networks for five steps in order to compute the left and right top singular vectors of the gradient matrix. The step size was set . Additionally, we trained an optimizer (L2LC) for FW. To this end, we split the MNIST training set into two disjoint sets of the same size. The results are summarized in Fig. 10. As one can see, the sparse FWNet and L2LC outperformed ADAM, both in terms of convergence and predictive performance: they reach the same top-1 accuracy in less than a third of the iterations. Furthermore, the classifier trained with the learned optimizer yields much more confident predictions, as one can see from the behavior of the loss function. This answers (Q1, Q3, Q4) affirmatively.
To investigate this further, we also considered wider and deeper CNNs on MNIST as well as on Cifar-10 and Cifar-100. The denser the network became, the better ADAM performed. This validates our assumption of low-rank solutions: if the low-rank assumption does not hold or is not required, there is no point in estimating a sparse model using the trace-norm constraint Allen-Zhu et al. (2017).
We have introduced the learning to learn by conditional gradients (L2LC) framework based on Frank-Wolfe Networks (FWNets). This enables one to train sparse convex optimizers that are specialized to particular classes of problems. We illustrated this for training SVMs and sparse softmax classifiers. Our experimental results confirm that learned conditional gradients compare favorably against state-of-the-art optimization methods used in deep learning.
There are several interesting avenues for future work. One should develop FWNets for other ML tasks such as graph classification Kersting et al. (2014) and Bayesian Quadrature Briol et al. (2015) as well as for other FW approaches Jaggi (2013). One may also adapt the Power Iteration in an end-to-end fashion Duvenaud et al. (2015). Finally, hierarchical RNNs Wichrowska et al. (2017) have the potential to speed up learning to learn by conditional gradients.
Acknowledgments: This work was supported by the Federal Ministry of Food, Agriculture and Consumer Protection (BMELV) based on a decision of the German Federal Office for Agriculture and Food (BLE); grant nr. “2818204715”.
- Allen-Zhu et al. (2017) Allen-Zhu, Z., Hazan, E., Hu, W., and Li, Y. Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls. In Proc. NIPS, 2017.
- Amos & Kolter (2017) Amos, B. and Kolter, J.Z. OptNet: Differentiable Optimization as a Layer in Neural Networks. In Proc. ICML, 2017.
- Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Colmenarejo, S.G., Hoffman, M.W., Pfau, D., Schaul, T., and de Freitas, N. Learning to Learn by Gradient Descent by Gradient Descent. CoRR, 2016.
- Bauckhage (2017) Bauckhage, C. A Neural Network Implementation of Frank-Wolfe Optimization. In Proc. ICANN, 2017.
- Briol et al. (2015) Briol, F.-X., Oates, C.J., Girolami, M.A., and Osborne, M.A. Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees. In Proc. NIPS, 2015.
- Duvenaud et al. (2015) Duvenaud, D.K., Maclaurin, D., Aguilera-Iparraguirre, J., Gómez-Bombarelli, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R.P. Convolutional Networks on Graphs for Learning Molecular Fingerprints. In Proc. NIPS, 2015.
- Frandi et al. (2014) Frandi, E., Ñanculef, R., and Suykens, J.A.K. Complexity Issues and Randomization Strategies in Frank-Wolfe Algorithms for Machine Learning. arXiv:1410.4062, 2014.
- Frank & Wolfe (1956) Frank, M. and Wolfe, P. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956.
- Harlow (1949) Harlow, H.F. The Formation of Learning Sets. Psychological Review, 56(1):51–65, 1949.
- Hochreiter et al. (2001) Hochreiter, S., Younger, A.S., and Conwell, P.R. Learning to Learn Using Gradient Descent. In Proc. ICANN, 2001.
- Jaggi (2013) Jaggi, M. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Proc. ICML, 2013.
- Kehoe (1988) Kehoe, E.J. A Layered Network Model of Associative Learning: Learning to Learn and Configuration. Psychological Review, 95(4):411–433, 1988.
- Kersting et al. (2014) Kersting, K., Mladenov, M., Garnett, R., and Grohe, M. Power Iterated Color Refinement. In Proc. AAAI, 2014.
- Kingma & Ba (2014) Kingma, D.P. and Ba, J. Adam: A Method for Stochastic Optimization. In Proc. ICLR, 2014.
- Krizhevsky (2009) Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Technical report, Univ. of Toronto, 2009.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Li & Malik (2017) Li, K. and Malik, J. Learning to Optimize Neural Nets. arXiv:1703.00441, 2017.
- Liu & Tsang (2017) Liu, Z. and Tsang, I.W. Approximate Conditional Gradient Descent on Multi-Class Classification. In Proc. AAAI, 2017.
- Naik & Mammone (1992) Naik, D.K. and Mammone, R.J. Meta-Neural Networks that Learn by Learning. In Proc. IJCNN, 1992.
- Ouyang & Gray (2010) Ouyang, H. and Gray, A.G. Fast Stochastic Frank-Wolfe Algorithms for Nonlinear SVMs. In Proc. SDM, 2010.
- Ravi & Larochelle (2017) Ravi, S. and Larochelle, H. Optimization as a Model for Few-shot Learning. In Proc. ICLR, 2017.
- Santoro et al. (2016) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T.P. Meta-Learning with Memory-Augmented Neural Networks. In Proc. ICML, 2016.
- Schmidhuber (1987) Schmidhuber, J. Evolutionary Principles in Self-Referential Learning. Master’s thesis, TU Munich, 1987.
- Tang (2013) Tang, Y. Deep Learning Using Support Vector Machines. arXiv:1306.0239, 2013.
- Thrun & Pratt (1998) Thrun, S. and Pratt, L. (eds.). Learning to Learn. Kluwer Academic Publishers, 1998.
- Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. Matching Networks for One Shot Learning. In Proc. NIPS, 2016.
- Wang et al. (2016) Wang, J.X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J.Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to Reinforcement Learn. arXiv:1611.05763, 2016.
- Ward (1937) Ward, L.B. Reminiscence and Rote Learning. Psychological Monographs, 49(4), 1937.
- Wichrowska et al. (2017) Wichrowska, O., Maheswaranathan, N., Hoffman, M.W., Gomez Colmenarejo, S., Denil, M., de Freitas, N., and Sohl-Dickstein, J. Learned Optimizers that Scale and Generalize. arXiv:1703.04815, 2017.
- Zhang et al. (2015) Zhang, S.-X., Liu, C., Yao, K., and Gong, Y. Deep Neural Support Vector Machines for Speech Recognition. In Proc. ICASSP, 2015.
- Zheng et al. (2017) Zheng, W., Bellet, A., and Gallinari, P. A Distributed Frank-Wolfe Framework for Learning Low-Rank Matrices with the Trace Norm. arXiv:1712.07495, 2017.