Complexity Issues and Randomization Strategies in Frank-Wolfe Algorithms for Machine Learning

Frank-Wolfe algorithms for convex minimization have recently gained considerable attention from the Optimization and Machine Learning communities, as their properties make them a suitable choice in a variety of applications. However, as each iteration requires optimizing a linear model, a clever implementation is crucial to make such algorithms viable on large-scale datasets. For this purpose, approximation strategies based on random sampling have been proposed by several researchers. In this work, we perform an experimental study on the effectiveness of these techniques, analyze possible alternatives and provide some guidelines based on our results.

1 Introduction

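The Frank-Wolfe algorithm [7], hereafter denoted as FW, is a general method to solve

    $\min_{\alpha \in \Sigma} f(\alpha),$

where $f : \mathbb{R}^m \to \mathbb{R}$ is a convex differentiable function, and $\Sigma \subset \mathbb{R}^m$ is a convex polytope. Given the current iterate $\alpha^{(k)} \in \Sigma$, a standard FW iteration consists of the following steps:

  1. Define a search direction $d^{(k)} = u^{(k)} - \alpha^{(k)}$ by optimizing a linear model:

    $u^{(k)} \in \operatorname{argmin}_{u \in V(\Sigma)} (u - \alpha^{(k)})^T \nabla f(\alpha^{(k)}),$    (1)

    where $V(\Sigma)$ denotes the set of vertices of $\Sigma$.

  2. Choose a stepsize $\lambda^{(k)}$, e.g. by a line-search: $\lambda^{(k)} \in \operatorname{argmin}_{\lambda \in [0,1]} f(\alpha^{(k)} + \lambda d^{(k)})$.

  3. Update: $\alpha^{(k+1)} = \alpha^{(k)} + \lambda^{(k)} d^{(k)}$.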
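The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: it assumes a quadratic objective $f(\alpha) = \frac{1}{2} \alpha^T K \alpha$ over the unit simplex (so that the vertices are the coordinate vectors and the line search has a closed form), and the function name `frank_wolfe_simplex` and its parameters are hypothetical.

```python
def frank_wolfe_simplex(K, iters=100, tol=1e-6):
    """Sketch: minimize 0.5 * a^T K a over the unit simplex with plain FW steps.

    Assumes K is a symmetric positive semidefinite matrix (list of lists).
    """
    m = len(K)
    alpha = [0.0] * m
    alpha[0] = 1.0  # start from a vertex of the simplex
    for _ in range(iters):
        # Full gradient of the quadratic: grad_i = sum_j K[i][j] * alpha[j]
        grad = [sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]
        a_dot_g = sum(a * g for a, g in zip(alpha, grad))
        # Step 1: the linear model over the vertices e_1..e_m is minimized
        # by the vertex with the smallest gradient component.
        i_star = min(range(m), key=grad.__getitem__)
        # Duality gap (alpha - e_{i*})^T grad, used as stopping criterion.
        gap = a_dot_g - grad[i_star]
        if gap <= tol:
            break
        # Step 2: exact line search along d = e_{i*} - alpha, clipped to [0, 1].
        # For the quadratic, d^T K d = K[i*][i*] - 2 * grad[i*] + alpha^T grad.
        curv = K[i_star][i_star] - 2.0 * grad[i_star] + a_dot_g
        lam = 1.0 if curv <= 0.0 else min(1.0, gap / curv)
        # Step 3: alpha <- alpha + lam * d = (1 - lam) * alpha + lam * e_{i*}
        alpha = [(1.0 - lam) * a for a in alpha]
        alpha[i_star] += lam
    return alpha
```

Each pass recomputes the whole gradient, which is exactly the per-iteration cost that the strategies discussed later try to avoid.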

Recently, the Optimization and Machine Learning communities have shown a renewed surge of interest in the family of FW algorithms [10, 9, 15]. These algorithms enjoy bounds on the number of iterations which are independent of the problem size, as well as sparsity guarantees [3, 10]. Furthermore, variants of the above basic procedure exist which attain a linear convergence rate [18, 8, 15, 12]. Such properties make FW a good choice for problems arising in a variety of applications [1, 5, 13].

Complexity of Frank-Wolfe Iterations. As the total number of FW iterations can be large in practice, devising a convenient way to find a solution to the subproblem (1) is often mandatory in order to make the algorithm viable. A typical situation arises when (1) has an analytical solution or the problem structure makes it easy to solve [15, 14]. Still, the resulting complexity can be impractical when handling large-scale data. As a motivating example, we consider the problem

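    $\min_{\alpha \in \Sigma} f(\alpha) = \frac{1}{2} \alpha^T K \alpha, \quad \Sigma = \{\alpha \in \mathbb{R}^m : \sum_{i=1}^m \alpha_i = 1,\ \alpha \geq 0\},$    (2)

which stems from the task of training a nonlinear $L_2$-SVM model for binary classification [17, 4]. Here, $K \in \mathbb{R}^{m \times m}$ is a positive definite kernel matrix. In this case, $V(\Sigma) = \{e_1, \ldots, e_m\}$, hence we have

    $u^{(k)} = e_{i_*}, \quad i_* \in \operatorname{argmin}_{i = 1, \ldots, m} \nabla f(\alpha^{(k)})_i.$

The theoretical cost of an iteration is therefore $O(m\,|\mathcal{I}^{(k)}|)$, where $\mathcal{I}^{(k)} = \{i : \alpha_i^{(k)} \neq 0\}$, proportional to the number of examples $m$. (More generally, it is proportional to $|V(\Sigma)|$ and to the cost of computing each gradient component $\nabla f(\alpha^{(k)})_i$.) In order to circumvent the dependence on the dataset size, the use of approximation strategies based on random sampling has been proposed by several researchers [17, 5], but, to our knowledge, never systematically studied on practical problems. We attempt to fill this gap by performing an experimental study on the effect of using such techniques.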

2 Randomization Strategies and Possible Alternatives

In this section, we consider two different techniques to reduce the computational effort in each FW iteration, and try to identify the kind of problems where each can be applied effectively.

2.1 Random Working Set Selection

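A simple and yet effective way to avoid the dependence on $m$ is to explore only a fixed number of points in $V(\Sigma)$. In the case of (2), this means extracting a sample $\mathcal{S} \subseteq \{1, \ldots, m\}$ and solving

    $u^{(k)} = e_{i_*}, \quad i_* \in \operatorname{argmin}_{i \in \mathcal{S}} \nabla f(\alpha^{(k)})_i.$

The iteration cost becomes in this case $O(|\mathcal{S}|\,|\mathcal{I}^{(k)}|)$, where $\mathcal{I}^{(k)}$ is the set of nonzero components of $\alpha^{(k)}$. The following result motivates this kind of approximation, suggesting that it is reasonable to keep the samples very small, i.e. to pick $|\mathcal{S}| \ll m$.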

Theorem 1 ([16], Theorem 6.33).

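Let $\mathcal{D} \subset \mathbb{R}$ be a set of cardinality $m$, and let $\mathcal{D}' \subset \mathcal{D}$ be a random subset of size $\kappa$. Then, the probability that the smallest element in $\mathcal{D}'$ is less than or equal to $\tilde{n}$ elements of $\mathcal{D}$ is at least $1 - (\tilde{n}/m)^{\kappa}$.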

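In the case of (2), where $\mathcal{D} = \{\nabla f(\alpha^{(k)})_i : i = 1, \ldots, m\}$ and $\mathcal{D}' = \{\nabla f(\alpha^{(k)})_i : i \in \mathcal{S}\}$, this means that a sample size $|\mathcal{S}|$ chosen independently of $m$ is enough to guarantee that, with a prescribed probability, the selected gradient component lies among a prescribed fraction of the smallest gradient components.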
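To make the bound concrete, the short sketch below computes the smallest sample size $\kappa$ for which $1 - (1 - q)^{\kappa} \geq p$, i.e. for which the sampled minimum falls among the fraction $q$ of smallest elements with probability at least $p$ (this corresponds to $\tilde{n}/m = 1 - q$ in Theorem 1). The helper name `min_sample_size` and the values of $p$ and $q$ are illustrative choices made here, not figures taken from the experiments.

```python
import math


def min_sample_size(prob, frac):
    """Smallest kappa such that 1 - (1 - frac)**kappa >= prob.

    prob: required probability that the sampled minimum is among the
          `frac` fraction of smallest elements (0 < prob < 1).
    frac: fraction of smallest elements considered acceptable (0 < frac < 1).
    """
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - frac))


# Illustrative values: the required sample size does not depend on m.
for p, q in [(0.95, 0.05), (0.98, 0.02)]:
    print(p, q, min_sample_size(p, q))
```

With these illustrative values, 59 samples suffice for $p = 0.95$, $q = 0.05$, and 194 samples for $p = 0.98$, $q = 0.02$, whatever the size of the dataset.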

Choice of the Stopping Criterion and Implications. The stopping criterion for FW algorithms is usually based on the duality gap [10]:

    $\Delta^{(k)} = \max_{u \in \Sigma} (\alpha^{(k)} - u)^T \nabla f(\alpha^{(k)}) \leq \varepsilon.$

This criterion, however, is not applicable without computing the entire gradient $\nabla f(\alpha^{(k)})$, which is not done in the randomized case. As a possible alternative, we can use the approximate quantity

    $\Delta_{\mathcal{S}}^{(k)} = \max_{i \in \mathcal{S}} (\alpha^{(k)} - e_i)^T \nabla f(\alpha^{(k)}).$

Since $\Delta_{\mathcal{S}}^{(k)} \leq \Delta^{(k)}$, this simplification entails a tradeoff between the reduction in computational cost and the risk of premature stopping. Although this can be considered acceptable in contexts such as SVM classification, where solving the optimization problem with a high accuracy is usually not needed, it is important to make sure that the impact of this approximation can be kept to an acceptable level. The experiments in the next section aim precisely at investigating this issue.
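The relation between the two quantities can be checked numerically. The following sketch assumes the quadratic objective $f(\alpha) = \frac{1}{2} \alpha^T K \alpha$ on the unit simplex, for which the gap reduces to $\alpha^T \nabla f(\alpha)$ minus the smallest gradient component over the explored indices; the function name `duality_gaps` and the small matrix in the usage note are hypothetical.

```python
def duality_gaps(K, alpha, sample):
    """Return (exact_gap, sampled_gap) for f(a) = 0.5 * a^T K a on the simplex.

    sample: indices explored by the randomized strategy (assumed non-empty).
    """
    m = len(K)
    # Only the nonzero components of alpha contribute to the gradient.
    support = [j for j in range(m) if alpha[j] != 0.0]

    def grad_i(i):
        return sum(K[i][j] * alpha[j] for j in support)

    # alpha^T grad only needs the gradient on the support of alpha.
    a_dot_g = sum(alpha[j] * grad_i(j) for j in support)
    # Exact gap: minimum over all m vertices (needs the full gradient).
    exact = a_dot_g - min(grad_i(i) for i in range(m))
    # Sampled gap: minimum over the sampled vertices only.
    approx = a_dot_g - min(grad_i(i) for i in sample)
    return exact, approx
```

For instance, with `K = [[2.0, 0.5, 0.1], [0.5, 1.0, 0.3], [0.1, 0.3, 1.5]]` and `alpha = [0.5, 0.5, 0.0]`, sampling only index 0 gives a sampled gap below the exact one, so a stopping test based on the sampled gap can trigger while the exact gap is still above the tolerance.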

2.2 Analytical Gradient Update

Another possibility to obtain a more efficient iteration is to exploit the structure of the problem to keep the exact gradient updated at each iteration [11]. In the case of problem (2), where $\nabla f(\alpha) = K\alpha$, this can be done in $O(m)$ operations, since it is easy to see by using the formula for the FW step that

    $\nabla f(\alpha^{(k+1)}) = (1 - \lambda^{(k)}) \nabla f(\alpha^{(k)}) + \lambda^{(k)} K e_{i_*}.$

Compared to a naive implementation, we get rid of a factor $|\mathcal{I}^{(k)}|$ (the number of nonzero components of $\alpha^{(k)}$) and, as an important by-product, the duality gap can be updated exactly without any additional cost.
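A minimal sketch of this update is given below, assuming again $f(\alpha) = \frac{1}{2} \alpha^T K \alpha$ so that $\nabla f(\alpha) = K\alpha$. The function name `fw_step_with_gradient_update` is hypothetical; the point is only that the new gradient is obtained from the old one and a single column of $K$, without recomputing $K\alpha$.

```python
def fw_step_with_gradient_update(K, alpha, grad, i_star, lam):
    """One FW step towards vertex e_{i_star} with an O(m) gradient refresh.

    grad must equal K @ alpha on entry; the returned gradient equals
    K @ new_alpha, using only column i_star of K.
    """
    m = len(alpha)
    # alpha <- (1 - lam) * alpha + lam * e_{i_star}
    new_alpha = [(1.0 - lam) * a for a in alpha]
    new_alpha[i_star] += lam
    # grad <- (1 - lam) * grad + lam * K[:, i_star]
    new_grad = [(1.0 - lam) * grad[i] + lam * K[i][i_star] for i in range(m)]
    return new_alpha, new_grad
```

Since the full gradient is available after every step, the exact duality gap can be read off from `new_alpha` and `new_grad` directly.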

3 Numerical Results

In order to assess the effectiveness of the above implementations of the FW step, we conducted numerical tests on the benchmark datasets Adult a9a, Web w8a, IJCNN and USPS-ext [2, 6]. All the experiments were coded in C++, and executed on a 3.40GHz 4-core Intel machine with 16GB RAM running Linux.

Table 1 presents the statistics (averaged over several runs) for classification accuracy on the test set, CPU time, number of iterations and support vectors, obtained with samplings of increasing size. A fixed tolerance parameter was used for the stopping criterion, and a Gaussian kernel was used for all the experiments. An LRU caching strategy was implemented to avoid the recomputation of recently used entries of $K$.

Table 1: Average statistics with different sampling sizes. For each dataset (Adult a9a, Web w8a, IJCNN, USPS-ext), the table reports test accuracy (%), CPU time (s), number of iterations (Iter) and number of support vectors (SVs) for the full dataset and for samples of 1000, 500, 250 and 125 points.

First of all, note that the effect of sampling is substantially problem-dependent. On some datasets, such as USPS-ext, FW clearly encounters an early stopping even with a fairly large sampling size, while other results, such as those on Adult a9a, appear more stable. In some cases, e.g. on Web w8a, there seems to be a cutoff point after which the performance degrades considerably. Still, some general trends can be established: the number of iterations decreases monotonically with the sample size $|\mathcal{S}|$, as expected from the observations in Section 2, and CPU times decrease accordingly. On the contrary, as seen from the results on IJCNN, the model size is not always monotonic with respect to $|\mathcal{S}|$. This arguably happens because solving (1) approximately can lead to spurious points being selected as FW vertices. Finally, note that the full sampling solution (which employs the strategy in Section 2.2) is very competitive on the smaller problems, while it is still very time consuming on the largest dataset USPS-ext. This intuitively suggests that random sampling is computationally convenient when it can still produce a good solution with $|\mathcal{S}| \ll m / \bar{I}$, where $\bar{I}$ is an estimate of the average number of nonzero components of the iterate across iterations. Some of these conclusions are summarized in Table 2.

In the next experiment, we analyze, on the datasets Adult a9a and USPS-ext, the effect of sampling on the computation of the duality gap (and therefore on the stopping criterion) and on the minimization of the linear model. Figures 1 and 2 report, respectively, the exact gap $\Delta^{(k)}$ and the approximate gap $\Delta_{\mathcal{S}}^{(k)}$, plotted in logarithmic scale against the iteration number for various sampling sizes.

Figure 1: Exact duality gap path on datasets Adult a9a (a) and USPS-ext (b).
Figure 2: Approximate duality gap path on datasets Adult a9a (a) and USPS-ext (b).

The figures shed light on the results in Table 1. On the dataset Adult a9a, the randomized strategy appears very effective: the duality gap does not deviate much from the ideal figure obtained with the full dataset, even for small sampling sizes. Furthermore, there are no significant differences between computing the exact and approximate duality gap. On the other hand, on USPS-ext, the exact gap $\Delta^{(k)}$ is noticeably larger than its approximate counterpart, indicating that the algorithm is making less progress than predicted by the approximate gap $\Delta_{\mathcal{S}}^{(k)}$. Furthermore, the approximate gap exhibits large oscillations due to the random nature of the sampling, and it is possible that an “unlucky” iteration leads to premature stopping, as can be seen from the figure. It is interesting to note that the degradation in optimization quality (as measured by the exact gap) is not reflected in this case by a corresponding loss in test accuracy, which is a phenomenon typical of classification problems. However, this is not true in general, as other applications such as function estimation are known to be more sensitive to a less accurate solution.

Randomized Working Set Selection
  - Applicable whenever $\Sigma$ is a polytope
  - Large computational gain when a small sample $\mathcal{S}$ suffices
  - Performance depends on the problem

Analytical Gradient Update
  - Convenient for structured (e.g. quadratic) $f$
  - Saves a factor equal to the number of nonzero components of the iterate at each iteration
  - Deterministic results

Table 2: Some recommendations on the implementation of the FW step.

Adaptive Strategies. Taking into account all the above, one would ideally want to be able to select an optimal strategy automatically, based on the data and the actual performance. Provided both strategies can be applied to the problem at hand, one could for example start by performing a fixed number of iterations using both, and then devise some criterion based on the difference in duality gap to decide whether the approximation is adequate. However, a discussion on how to effectively implement such a strategy would be nontrivial, and as such is deferred to a separate work.

4 Conclusions

Using SVM classification problems as a motivation, we have performed an experimental study on the effectiveness and impact of some techniques designed to alleviate the computational burden of the optimization step in an FW iteration. Our results suggest that, while it comes with some caveats, random sampling may be the most viable choice on very large-scale problems. On the other hand, when the problem size is not prohibitive (e.g. batch training tasks with medium to large datasets), fast updating schemes which exploit the problem structure might provide a better choice.

Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) / ERC AdG A-DATADRIVE-B (290923). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information. Research Council KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; Flemish Government: FWO: projects: G.0377.12 (Structured systems), G.088114N (Tensor based data similarity); PhD/Postdoc grants; iMinds Medical Information Technologies SBO 2014; IWT: POM II SBO 100031; Belgian Federal Science Policy Office: IUAP P7/19 (DYSCO, Dynamical systems, control and optimization, 2012-2017).

References

  • [1] Andreas Argyriou, Marco Signoretto, and Johan Suykens. Hybrid algorithms with applications to sparse and low rank regularization. In Johan Suykens, Marco Signoretto, and Andreas Argyriou, editors, Regularization, Optimization, Kernels, and Support Vector Machines, chapter 3. Chapman & Hall/CRC (Boca Raton, USA), 2014.
  • [2] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2011.
  • [3] Kenneth Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 6(4):63:1–63:30, 2010.
  • [4] Emanuele Frandi, Maria Grazia Gasparo, Stefano Lodi, Ricardo Ñanculef, and Claudio Sartori. A new algorithm for training SVMs using approximate minimal enclosing balls. In Proceedings of the 15th Iberoamerican Congress on Pattern Recognition, Lecture Notes in Computer Science, pages 87–95. Springer, 2010.
  • [5] Emanuele Frandi, Maria Grazia Gasparo, Stefano Lodi, Ricardo Ñanculef, and Claudio Sartori. Training support vector machines using Frank-Wolfe methods. International Journal of Pattern Recognition and Artificial Intelligence, 27(3), 2011.
  • [6] Andrew Frank and Arthur Asuncion. The UCI KDD Archive. http://kdd.ics.uci.edu, 2010.
  • [7] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1:95–110, 1956.
  • [8] Jacques Guélat and Patrice Marcotte. Some comments on Wolfe’s “away step”. Mathematical Programming, 35:110–119, 1986.
  • [9] Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 13(1):1–38, 2014.
  • [10] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, 2013.
  • [11] Piyush Kumar and Alper Yildirim. A linearly convergent linear-time first-order algorithm for support vector classification with a core set result. INFORMS Journal on Computing, 23(3):377–391, 2011.
  • [12] Simon Lacoste-Julien and Martin Jaggi. An affine invariant linear convergence analysis for Frank-Wolfe algorithms. arXiv.org, December 2013.
  • [13] Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.
  • [14] Giampaolo Liuzzi and Francesco Rinaldi. Solving $\ell_0$-penalized problems with simple constraints via the Frank-Wolfe reduced dimension method. Optimization Letters (in press), 2014.
  • [15] Ricardo Ñanculef, Emanuele Frandi, Claudio Sartori, and Héctor Allende. A novel Frank-Wolfe algorithm: Analysis and applications to large-scale SVM training. Information Sciences (in press), 2014.
  • [16] Bernhard Schölkopf and Alexander Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
  • [17] Ivor Tsang, James Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.
  • [18] Philip Wolfe. Convergence theory in nonlinear programming. In J. Abadie, editor, Integer and Nonlinear Programming, pages 1–36. North-Holland, Amsterdam, 1970.