The Frank-Wolfe (FW) or conditional gradient algorithm (frank1956algorithm; jaggi2013revisiting) is designed to solve optimization problems of the form
$$\min_{\boldsymbol{x} \in \operatorname{conv}(\mathcal{A})} f(\boldsymbol{x}), \tag{OPT}$$
where $\mathcal{A}$ is a (possibly infinite) set of vectors which we call atoms, and $\operatorname{conv}(\mathcal{A})$ is its convex hull. The FW algorithm and its variants have seen an impressive revival in recent years, due to their low memory requirements and projection-free iterations, which make them particularly appropriate for solving large-scale convex problems, for instance convex relaxations of problems written over combinatorial polytopes (zaslavskiy2009path; joulin2014efficient; vogelstein2015fast).
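The Frank-Wolfe algorithm is projection-free, i.e., unlike most methods to solve (OPT), it does not require computing a projection onto the feasible set $\operatorname{conv}(\mathcal{A})$. Instead, it relies on a linear minimization oracle (LMO) over a set $\mathcal{A}$, written $\operatorname{LMO}(\cdot, \mathcal{A})$, which solves the following linear problem:
$$\operatorname{LMO}(\boldsymbol{r}, \mathcal{A}) \in \operatorname*{arg\,min}_{\boldsymbol{s} \in \mathcal{A}} \langle \boldsymbol{s}, \boldsymbol{r} \rangle.$$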
For some constraint sets, such as the nuclear or latent group norm ball (jaggi2010simple; vinyes2017fast), computing the LMO can be orders of magnitude faster than projecting. Another feature of FW that has greatly contributed to its practical success is its low memory requirements. The algorithm maintains its iterates as a convex combination of a few atoms, enabling the resulting sparse and low rank iterates to be stored efficiently. This feature allows the FW algorithm to be used in situations with a huge or even infinite number of features, such as architecture optimization in neural networks (ping2016learning) or estimation of an infinite-dimensional sparse matrix arising in multi-output polynomial networks (NIPS2017_6927).
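To illustrate why an LMO can be cheap, the following minimal sketch implements it for the $\ell_1$ ball, a simpler set than the nuclear or latent group norm balls mentioned above. The function name `lmo_l1_ball` and its arguments are hypothetical and chosen only for this example. The minimizer is a single signed, scaled basis vector, so the oracle reduces to finding the gradient coordinate of largest magnitude.

```python
def lmo_l1_ball(grad, radius=1.0):
    """Return a minimizer of <s, grad> over the l1 ball of the given radius."""
    # All the mass goes on the coordinate with the largest |gradient| entry.
    i = max(range(len(grad)), key=lambda j: abs(grad[j]))
    atom = [0.0] * len(grad)
    # Move against the sign of the gradient on that coordinate.
    atom[i] = -radius if grad[i] > 0 else radius
    return atom


grad = [0.5, -2.0, 1.0]
atom = lmo_l1_ball(grad, radius=3.0)
```

The returned atom has a single nonzero entry, which is also what makes FW iterates on this set sparse.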
Despite these attractive properties, for problems with a large number of variables or with a very large atomic set (or both), computing the full gradient and LMO at each iteration can become prohibitive. Designing variants of the FW algorithm which alleviate this computational burden would have a significant practical impact on performance.
One recent direction to achieve this is to replace the LMO with a randomized linear oracle in which the linear minimization is performed only over a random sample of the original atomic domain. This approach has proven to be highly successful on specific problems such as structured SVMs (lacoste2012block) and $\ell_1$-constrained regression (Frandi2016); however, little is known in the general case. Is it possible to design a FW variant with a randomized oracle that achieves the same convergence rate (up to a constant factor) as the non-randomized variant? Can this be extended to linearly-convergent FW algorithms (lacoste2013affine; lacoste2015global; garber2015faster)? In this paper we give a positive answer to both questions and explore the trade-offs between subsampling and convergence rate.
Outline and main contribution.
The main contribution of this paper is to develop and analyze two algorithms that share the low memory requirements and projection-free iterations of FW, but in which the LMO is computed only over a random subset of the original domain. In many cases, this results in large computational gains in computing the LMO which can also speed up the overall FW algorithm. In practice, the algorithm will run a larger number of cheaper iterations, which is typically more efficient for very large data sets (e.g. in a streaming model where the data does not fit in core memory and can only be accessed by chunks). The paper is structured as follows
§2 describes the Randomized FW algorithm, proving a sublinear convergence rate.
§3 describes “Randomized Away FW”, a variant of the above algorithm with linear convergence rate on polytopes. To the best of our knowledge this is the first provably convergent randomized version of the Away-steps FW algorithm.
Finally, in §4 we discuss implementation aspects of the proposed algorithms and study their performance on lasso and latent group lasso problems.
Note that with the proven sub-linear rate of convergence for Randomized FW, the cost of the LMO is reduced by the subsampling rate, but this is compensated by the fact that the number of iterations required by RFW to reach the same convergence guarantee as FW is itself divided by the sampling rate. Similarly, the linear convergence rate of Randomized AFW does not theoretically show a computational advantage, since the number of iterations is divided by the squared sampling rate, at least in our highly conservative bounds. Nevertheless, our numerical experiments show that randomized versions are often numerically superior to their deterministic counterparts.
1.1 Related work
Several references have focused on reducing the cost of computing the linear oracle. The analysis of (jaggi2013revisiting) allows for an error term in the LMO, and so a randomized linear oracle could in principle be analyzed under this framework. However, this is not fully satisfactory as it requires the approximation error to decrease towards zero as the algorithm progresses. In our algorithm, the subsampling approximation error doesn’t need to decrease.
lacoste2012block studied a randomized FW variant named block-coordinate FW in which at each step the LMO is computed only over a subset (block) of variables. In this case, the approximation error need not decrease to zero, but the method can only be applied to a restricted class of problems: those with block-separable domain, leaving out important cases such as $\ell_1$-constrained minimization. Because of the block separability, a more aggressive step-size strategy can be used in this case, resulting overall in a different algorithm.
Finally, frandi2014complexity proposed a FW variant which can be seen as a special case of our Algorithm 1 for the Lasso problem, analyzed in (Frandi2016). Our analysis here brings three key improvements on this last result. First, it is provably convergent for arbitrary atomic domains, not just the $\ell_1$ ball (furthermore, the proof in (Frandi2016) has technical issues discussed in Appendix C). Second, it allows a choice of step size that does not require exact line-search (Variant 2), which is typically only feasible for quadratic loss functions. Third, we extend our analysis to linearly-convergent FW variants such as the Away-steps FW.
A different technique to alleviate the cost of the linear oracle was recently proposed by braun2017lazifying. In that work, the authors propose a FW variant that replaces the LMO by a “weak” separation oracle and show significant speedups in wall-clock performance on problems such as video co-localization. This approach was combined with gradient sliding in (lan2017conditional), a technique (lan2016conditional) that allows skipping the computation of gradients from time to time. However, for problems such as Lasso or latent group lasso, a randomized LMO avoids all full gradient computations, while the lazy weak separation oracle still requires them. Combining these various techniques is an interesting open question.
Proximal coordinate-descent methods (richtarik2014iteration) (not based on FW) have also been used to solve problems with a huge number of variables. They are particularly effective when combined with variable screening rules such as (strongrules; fercoq2015mind). However, for constrained problems they require evaluating a projection operator, which on some sets such as the latent group lasso ball can be much more expensive than the LMO. Furthermore, these methods require that the projection operator is block-separable, while our method does not.
We denote vectors with boldface lower case letters (e.g., $\boldsymbol{x}$), and sets with calligraphic letters (e.g., $\mathcal{A}$). We denote . Probability is denoted $\mathbb{P}$. The cardinality of a set $\mathcal{A}$ is denoted $|\mathcal{A}|$. For a solution of (OPT), we denote .
Randomized vs stochastic. We denote FW variants with randomness in the LMO randomized and reserve the name stochastic for FW variants that replace the gradient with a stochastic approximation, as in (hazan2016variance).
2 Randomized Frank-Wolfe
In this section we present our first contribution, a FW variant that we name Randomized Frank-Wolfe (RFW). The method is detailed in Algorithm 1. Compared to the standard FW algorithm, it has the following two distinct features.
First, the LMO is computed over a random subset $\mathcal{A}_t \subseteq \mathcal{A}$ of the original atomic set in which each atom is equally likely to appear, i.e., in which $\mathbb{P}(\boldsymbol{v} \in \mathcal{A}_t) = \eta$ for all $\boldsymbol{v} \in \mathcal{A}$ (Line 1). For discrete sets this can be implemented simply by drawing uniformly at random a fixed number of elements at each iteration. The sampling parameter $\eta \in (0, 1]$ controls the fraction of the domain that is considered by the LMO at each iteration. If $\eta = 1$, the LMO considers the full domain at each iteration and the algorithm reduces to the classical FW algorithm. However, for $\eta < 1$, the LMO only needs to consider a fraction of the atoms in the original atomic set and can be faster than the FW LMO.
Second, because of this subsampling we can no longer guarantee that the atom chosen by the LMO is a descent direction, and so it is no longer possible to use the “oblivious” (i.e., independent of the result of the LMO) step-size commonly used in the FW algorithm. We provide two possible choices for this step-size. The first variant (Line 1) chooses the step-size by exact line search and requires solving a 1-dimensional convex optimization problem. This approach is efficient when this sub-problem has a closed form solution, as happens for example in the case of quadratic loss functions. The second variant does not need to solve this sub-problem, but in exchange requires an estimate of the curvature constant $C_f$ (defined in the next subsection). Note that in the absence of an estimate of this quantity, one can use the bound $C_f \le L \operatorname{diam}(\mathcal{D})^2$, where $L$ is the Lipschitz constant of $\nabla f$ and $\operatorname{diam}(\mathcal{D})$ is the diameter of the domain in Euclidean norm.
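The two features above can be sketched as follows for least squares over the $\ell_1$ ball. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function names, the choice of drawing a fixed number of atoms equal to the sampling fraction times the number of atoms, and the start at the origin are all choices made for this example. The second function reflects our reading of Variant 2 as the minimizer over $[0, 1]$ of the quadratic upper bound given by the curvature constant.

```python
import random


def rfw_l1_least_squares(X, y, radius, eta, n_iter, seed=0):
    """Sketch of RFW with exact line search for 0.5*||Xw - y||^2 on the l1 ball."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Atoms of the l1 ball of the given radius: +/- radius * e_j.
    atoms = [(j, sign) for j in range(d) for sign in (1.0, -1.0)]
    k = max(1, round(eta * len(atoms)))  # atoms drawn per iteration
    w = [0.0] * d
    resid = [-yi for yi in y]  # residual Xw - y at w = 0
    for _ in range(n_iter):
        # Randomized LMO: minimize <grad, atom> over the sampled atoms only.
        best = None
        for j, sign in rng.sample(atoms, k):
            grad_j = sum(X[i][j] * resid[i] for i in range(n))
            score = sign * radius * grad_j
            if best is None or score < best[0]:
                best = (score, j, sign)
        _, j, sign = best
        # X @ (s - w), using the identity X @ w = resid + y.
        Xd = [sign * radius * X[i][j] - (resid[i] + y[i]) for i in range(n)]
        denom = sum(v * v for v in Xd)
        if denom == 0.0:
            continue
        # Exact line search, clipped to [0, 1] so the iterate stays feasible.
        gamma = -sum(r * v for r, v in zip(resid, Xd)) / denom
        gamma = min(1.0, max(0.0, gamma))
        w = [(1.0 - gamma) * wj for wj in w]
        w[j] += gamma * sign * radius
        resid = [r + gamma * v for r, v in zip(resid, Xd)]
    return w


def step_size_variant2(gap, curvature):
    """Minimizer over [0, 1] of -gamma * gap + 0.5 * gamma**2 * curvature."""
    return min(1.0, max(0.0, gap / curvature))


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 0.0, 1.0]
w = rfw_l1_least_squares(X, y, radius=1.0, eta=0.5, n_iter=50)
```

Because the clipped line search always allows a zero step, the objective never increases even when the sampled atom is not a descent direction.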
Gradient coordinate subsampling.
We note that the gradient of only enters Algorithm 1 through the computation of the randomized LMO, and so only the dot products between the gradient and the subsampled atomic set are truly necessary. In some cases the elements of the atomic set have a specific structure that makes computing dot products particularly efficient. For example, when the atomic elements are sparse, only the coordinates of the gradient that are in the support of the atomic set need to be evaluated. As a result, for sparse atomic sets such as the $\ell_1$ ball, the group lasso ball (also known as $\ell_1/\ell_2$ ball), or even the latent group lasso (obozinski2011group) ball, only a few coordinates of the gradient need to be evaluated at each iteration. The number of exact gradients that need to be evaluated will depend on both the sparsity of this atomic set and the subsampling rate. For example, in the case of the $\ell_1$ ball, the extreme atoms have a single nonzero coefficient, and so RFW only needs to compute on average gradient coefficients at each iteration, where denotes the ambient dimension.
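As a small illustration of gradient coordinate subsampling for the $\ell_1$ ball, the sketch below draws a subset of atoms and evaluates the least-squares gradient only on the coordinates those atoms touch. The helper names and the way atoms are drawn (a fixed number equal to the sampling fraction times the number of atoms) are assumptions made for this example.

```python
import random


def sampled_gradient_coords(d, eta, rng):
    """Coordinates touched by one randomized LMO call on the l1 ball."""
    # Each coordinate j carries two atoms, +e_j and -e_j.
    atoms = [(j, sign) for j in range(d) for sign in (1, -1)]
    k = max(1, round(eta * len(atoms)))
    return {j for j, _ in rng.sample(atoms, k)}


def partial_gradient(X, resid, coords):
    """Least-squares gradient X^T resid restricted to the given coordinates."""
    return {j: sum(row[j] * r for row, r in zip(X, resid)) for j in coords}


rng = random.Random(0)
coords = sampled_gradient_coords(d=1000, eta=0.05, rng=rng)

X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
g = partial_gradient(X, resid=[1.0, -1.0], coords={0, 2})
```

Here at most 100 of the 1000 gradient coefficients are computed per call, since the 100 sampled atoms each involve a single coordinate.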
A side-effect of subsampling the linear oracle is that the quantity $\langle -\nabla f(\boldsymbol{x}_t), \boldsymbol{s}_t - \boldsymbol{x}_t \rangle$, where $\boldsymbol{s}_t$ is the atom selected by the randomized linear oracle, is not, unlike in the non-randomized algorithm, an upper bound on the suboptimality $f(\boldsymbol{x}_t) - f(\boldsymbol{x}^\star)$. This property is a feature of FW algorithms that cannot be retrieved in our variant. As a replacement, the stopping criterion that we propose is to compute a full LMO every iterations, with ( is a good default value).
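The stopping criterion described above can be sketched for the $\ell_1$ ball as follows. The names `fw_gap_l1`, `should_stop`, `check_every` and `tol` are hypothetical; the period between full LMO calls is left as a free parameter here, since no particular default is assumed.

```python
def fw_gap_l1(grad, w, radius):
    """Full FW gap <-grad, s - w>, with s the exact LMO output on the l1 ball."""
    # For the minimizing atom s, <grad, s> = -radius * max_j |grad_j|.
    inner = sum(g * wj for g, wj in zip(grad, w))
    return inner + radius * max(abs(g) for g in grad)


def should_stop(t, grad_fn, w, radius, tol, check_every):
    """Evaluate the full (non-subsampled) gap only every `check_every` iterations."""
    if t % check_every != 0:
        return False
    return fw_gap_l1(grad_fn(w), w, radius) <= tol


gap = fw_gap_l1([1.0, -2.0], [0.0, 0.0], radius=1.0)
```

Only the iterations where the check fires pay for a full gradient and a full LMO; all other iterations keep the reduced cost of the randomized oracle.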
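In this subsection we prove an $\mathcal{O}(1/t)$ convergence rate for the RFW algorithm. As is often the case for FW-related algorithms, our convergence result will be stated in terms of the curvature constant $C_f$, which is defined as follows for a convex and differentiable function $f$ and a convex and compact domain $\mathcal{D}$:
$$C_f := \sup_{\substack{\boldsymbol{x}, \boldsymbol{s} \in \mathcal{D},\ \gamma \in [0,1] \\ \boldsymbol{y} = \boldsymbol{x} + \gamma(\boldsymbol{s} - \boldsymbol{x})}} \frac{2}{\gamma^2}\big(f(\boldsymbol{y}) - f(\boldsymbol{x}) - \langle \nabla f(\boldsymbol{x}), \boldsymbol{y} - \boldsymbol{x} \rangle\big).$$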
It is worth mentioning that a bounded curvature constant corresponds to a Lipschitz assumption on the gradient of (jaggi2013revisiting).
Proof. See Appendix A.
The rate obtained in the previous theorem is similar to known bounds for FW. For example, (jaggi2013revisiting, Theorem 1) established for FW a bound of the form
$$f(\boldsymbol{x}_t) - f(\boldsymbol{x}^\star) \le \frac{2 C_f}{t + 2}.$$
This is similar to the rate of Theorem 2.1, except for the factor $\eta$ (the sampling parameter) in the denominator. Hence, if our updates are $\eta$ times as costly as the full FW update (as is the case, e.g., for the $\ell_1$ ball), then the theoretical convergence rate is the same. This bound is likely tight, as in the worst case one will need to sample the whole atomic set to decrease the objective if there is only one descent direction. This is however a very pessimistic scenario, and in practice good descent directions can often be found without sampling the whole atomic set. As we will see in the experimental section, despite these conservative bounds, the algorithm often exhibits large computational gains with respect to the deterministic algorithm.
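To make this cost comparison concrete, here is a short back-of-the-envelope calculation. It assumes, as a simplification, that the RFW bound has the form $B/(\eta t + 2)$ for some constant $B$ independent of $t$, that one full FW iteration costs $c$, and that one RFW iteration costs $\eta c$. The symbols $B$, $c$ and $\epsilon$, and the name $\eta$ for the sampling parameter, are introduced here only for illustration. Using the classical bound $2 C_f/(t+2)$ for FW, the number of iterations needed to guarantee accuracy $\epsilon$ is roughly

$$T_{\mathrm{FW}} \approx \frac{2 C_f}{\epsilon} - 2, \qquad T_{\mathrm{RFW}} \approx \frac{1}{\eta}\Big(\frac{B}{\epsilon} - 2\Big),$$

so that the total costs are

$$c\, T_{\mathrm{FW}} \approx c\Big(\frac{2 C_f}{\epsilon} - 2\Big), \qquad \eta c\, T_{\mathrm{RFW}} \approx c\Big(\frac{B}{\epsilon} - 2\Big).$$

Under these assumptions the factor $\eta$ cancels in the total cost: RFW runs about $1/\eta$ times more iterations, each about $\eta$ times cheaper, so the worst-case guarantees differ only through the constants $B$ and $2 C_f$.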
3 Randomized Away-steps Frank-Wolfe
A popular variant of the FW algorithm is the Away-steps FW variant of guelat1986some. This algorithm adds the option to move away from an atom in the current representation of the iterate. In the case of a polytope domain, it was recently shown to have much better convergence properties, such as linear (i.e. exponential) convergence rates for generally-strongly convex objectives (garber2013linearly; beck2013convergence; lacoste2015global).
In this section we describe the first provably convergent randomized version of the Away-steps FW, which we name Randomized Away-steps FW (RAFW). We will assume throughout this section that the domain is a polytope, i.e. that , where is a finite set of atoms. We will make use of the following notation.
Active set. We denote by the active set of the current iterate, i.e. decomposes as , where are positive weights that are iteratively updated.
Subsampling parameter. The method depends on a subsampling parameter . It controls the amount of computation per iteration of the LMO. In this case, the atomic set is finite and denotes an integer . This sampling rate is approximately in the RFW formulation of §2.
The method is described in Algorithm 2 and, as in the Away-steps FW, requires computing two linear minimization oracles at each iteration. Unlike the deterministic version, the first oracle is computed on the subsampled set (Line 2), where is a subset of size , sampled uniformly at random from . The second LMO (Line 2) is computed on the active set, which is also typically much smaller than the atomic domain.
As a result of both oracle calls, we obtain two potential descent directions, the RFW direction and the Away direction . The chosen direction is the one that correlates the most with the negative gradient, and a maximum step size is chosen to guarantee that the iterates remain feasible (Lines 2–2).
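The choice between the two candidate directions can be sketched as below. The maximum step sizes used here (1 for a FW step, and the weight of the away atom divided by one minus that weight for an away step) follow the standard Away-steps FW of (lacoste2015global) and are an assumption with respect to Algorithm 2, which is not reproduced in this text. The function name and the tie-breaking in favor of the FW direction are likewise choices made for this example.

```python
def choose_direction(grad, x, fw_atom, away_atom, away_weight):
    """Pick the FW or away direction most correlated with the negative gradient."""
    d_fw = [s - xi for s, xi in zip(fw_atom, x)]      # towards the sampled FW atom
    d_away = [xi - v for xi, v in zip(x, away_atom)]  # away from the worst active atom
    gap_fw = -sum(g * di for g, di in zip(grad, d_fw))
    gap_away = -sum(g * di for g, di in zip(grad, d_away))
    if gap_fw >= gap_away:
        return "fw", d_fw, 1.0
    # Largest step that keeps the weight of the away atom non-negative.
    # If the iterate is a single atom, the away direction is zero anyway.
    gamma_max = away_weight / (1.0 - away_weight) if away_weight < 1.0 else 0.0
    return "away", d_away, gamma_max


# Atoms are the basis vectors e1, e2, e3; the subsample missed the best atom e2.
grad = [1.0, -1.0, 0.0]
x = [0.4, 0.0, 0.6]  # 0.4 * e1 + 0.6 * e3
kind, direction, gamma_max = choose_direction(
    grad, x, fw_atom=[0.0, 0.0, 1.0], away_atom=[1.0, 0.0, 0.0], away_weight=0.4
)
```

In this example the subsampled FW atom is poorly aligned with the negative gradient, so the away direction is selected instead.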
Updating the support.
Line 2 requires updating the support and the associated coefficients. For a FW step we have if and otherwise . The corresponding update of the weights is when and otherwise.
For an away step we instead have the following update rule. When (which is called a drop step), then . Combined with (or equivalently ), we call it a bad drop step, as it corresponds to a situation in which we are not able to guarantee a geometric decrease of the dual gap.
For away steps in which , the away atom is not removed from the current representation of the iterate. Hence , for and otherwise.
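The support and weight updates described above can be sketched with the active set stored as a dictionary from atom identifiers to weights. The update formulas are those of the standard Away-steps FW in (lacoste2015global), which we assume carry over unchanged; the function names and the dictionary representation are hypothetical.

```python
def fw_update(weights, s, gamma):
    """Weights after a FW step of size gamma towards atom s."""
    if gamma >= 1.0:
        return {s: 1.0}  # the iterate collapses onto the single atom s
    new = {v: (1.0 - gamma) * a for v, a in weights.items()}
    new[s] = new.get(s, 0.0) + gamma
    return new


def away_update(weights, v, gamma, gamma_max):
    """Weights after an away step of size gamma from active atom v."""
    if gamma >= gamma_max:
        # Drop step: the weight of v reaches zero, so v leaves the active set.
        return {u: (1.0 + gamma) * a for u, a in weights.items() if u != v}
    new = {u: (1.0 + gamma) * a for u, a in weights.items()}
    new[v] -= gamma
    return new


weights = {0: 0.4, 2: 0.6}
after_fw = fw_update(weights, s=1, gamma=0.5)
gamma_max = weights[0] / (1.0 - weights[0])
after_drop = away_update(weights, v=0, gamma=gamma_max, gamma_max=gamma_max)
after_away = away_update(weights, v=0, gamma=0.25, gamma_max=gamma_max)
```

In every case the weights remain non-negative and sum to one, so the iterate stays a convex combination of the active atoms.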
Per iteration cost.
Establishing the per iteration cost of this algorithm is not as straightforward as for RFW, as the cost of some operations depends on the size of the active set, which varies throughout the iterations. However, for problems with sparse solutions, we have observed empirically that the size of the active set remains small, making the cost of the second LMO and the comparison of Line 2 negligible compared to the cost of an LMO over the full atomic domain. In this regime, and assuming that the atomic domain has a sparse structure that allows gradient coordinate subsampling, RAFW can achieve a per iteration cost that is, like RFW, roughly times lower than that of its deterministic counterpart.
We now provide a convergence analysis of the Randomized Away-steps FW algorithm. These convergence results are stated in terms of the away curvature constant and the geometric strong convexity , which are described in Appendix B and in (lacoste2015global). Throughout this section we assume that has bounded (note that the usual assumption of Lipschitz continuity of the gradient over compact domain implies this) and strictly positive geometric strong convexity constant .
Consider the set , with a finite set of extreme atoms, after iterations of Algorithm 2 (RAFW) we have the following linear convergence rate
with , and .
Proof. See Appendix B.
Proof sketch. Our proof structure roughly follows that of the deterministic case in (lacoste2015global; beck2013convergence) with some key differences due to the LMO randomness, and can be decomposed into three parts.
The first part consists in upper bounding and is no different from the proof of its deterministic counterpart (lacoste2015global; beck2013convergence).
The second part consists in lower bounding the progress . For this algorithm we can guarantee a decrease of the form
where is the partial pair-wise dual gap while is the pair-wise dual gap, in which is replaced by the result of a full (and not subsampled) LMO.
We can guarantee a possible geometric decrease on at each iteration, except for bad drop steps, where we can only secure . We mark these by setting .
One crucial issue is then to quantify . This can be seen as a measure of the quality of the subsampled oracle: if it selects the same atom as the non-subsampled oracle the quotient will be 1, in all other cases it will be .
To ensure a geometrical decrease we further study the probability of events and : first, we produce a simple bound on the number of bad drop steps (where ). Second, when holds, Lemma 3 provides a lower bound on the probability of .
The third and last part of the proof analyzes the expectation of the decrease rate given the above discussion. We produce a conservative bound assuming the maximum possible number of bad drop steps. The key element in this part is to make this maximum a function of the size of the support of the initial iterate and of the number of iterations. The convergence bound is then proven by induction.
Comparison with deterministic convergence rates.
The rate for away Frank-Wolfe in (lacoste2015global, Theorem 8), after iterations, is
Due to the dependency on of the convergence rate in Theorem 3.1, our bound does not show that RAFW is computationally more efficient than AFW. Indeed we use a very conservative proof technique in which we measure progress only when the sub-sampling oracle equals the full one. Also, the cost of both LMOs depends on the support of the iterates which is unknown a priori except for a coarse upper bound (e.g. the support cannot be more than the number of iterations). Nevertheless, the numerical results do show speed ups compared to the deterministic method.
Beyond strong convexity.
The strongly convex objective assumption may not hold for many problem instances. However, the linear rate easily holds for of the form where is strongly convex and a linear operator. This type of function is commonly known as a -generally strongly convex function (beck2013convergence; wang2014iteration) or (lacoste2015global) (see “Away curvature and geometric strong convexity” in Appendix B for the definition). The proof simply adapts that of (lacoste2015global, Th. 11) to our setting.
Suppose has bounded smoothness constant and is -generally-strongly convex. Consider the set , with a finite set of extreme atoms. Then after iterations of Algorithm 2, with and a parameter of sub-sampling, we have
with and .
Proof. See end of Appendix B.
In this section we compare the proposed methods with their deterministic versions. We consider two regularized least squares problems: one with regularization and another one with latent group lasso (LGL) regularization. In the first case, the domain is a polytope and as such the analysis of AFW and RAFW holds.
We display the FW gap versus the number of iterations, and also versus the cumulative number of computed gradient coefficients, which we label “nbr coefficients of grad”. This better reflects the true complexity of our experiments, since sub-sampling the LMO in the problems we consider amounts to computing the gradient on a batch of coordinates.
In the case of latent group lasso, we also compare the performance of RFW against FW in terms of wall-clock time on a large dataset stored on disk and accessed sequentially in chunks (i.e., in a streaming model).
4.1 Lasso problem
We generate a synthetic dataset following the setting of (lacoste2015global), with a Gaussian design matrix of size and noisy measurements , with a random Gaussian vector and a vector with of nonzero coefficients and values in .
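A synthetic dataset of this kind can be generated along the following lines. The sizes, the fraction of nonzero coefficients, the noise level and the range of the nonzero values used below are placeholders chosen for this sketch and are not the values used in our experiments; the function name is likewise hypothetical.

```python
import random


def make_sparse_regression(n, d, frac_nonzero, noise_std, seed=0):
    """Gaussian design, sparse ground truth, and noisy linear measurements."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    k = max(1, int(frac_nonzero * d))
    w_true = [0.0] * d
    for j in rng.sample(range(d), k):
        # Placeholder range for the nonzero values (an assumption of this sketch).
        w_true[j] = rng.uniform(-1.0, 1.0)
    y = [
        sum(xij * wj for xij, wj in zip(row, w_true)) + rng.gauss(0.0, noise_std)
        for row in X
    ]
    return X, y, w_true


X, y, w_true = make_sparse_regression(n=20, d=50, frac_nonzero=0.1, noise_std=0.1)
```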
Figure 1 compares FW and RFW. Each call to the randomized LMO outputs a direction, likely less aligned with the opposite of the gradient than the direction proposed by FW, which explains why RFW requires more iterations to converge on the upper left graph of Figure 1. Each call of the randomized LMO is cheaper than the LMO in terms of number of computed coefficients of the gradient, and the trade-off is beneficial as can be seen on the bottom left graph, where RFW outperforms its deterministic variant in terms of nbr coefficients of grad.
Finally, the right panels of Figure 1 provide an insight on the evolution of the sparsity of the iterate, depending on the algorithm. FW and RFW perform similarly in terms of the fraction of recovered support (bottom right graph). In terms of the sparsity of the iterate, RFW under-performs FW (upper right graph). This can be explained as follows: because of the sub-sampling, each atom of the randomized LMO provides a direction less aligned with the opposite of the gradient than the one provided by the LMO. Each update in such a direction may result in putting weight on an atom that would better be off the representation of the iterate. It impacts the iterate all along the algorithm as vanilla FW removes past atoms from the representation only by multiplicatively shrinking their weight.
Unlike RFW, the RAFW method outperforms AFW in terms of number of iterations in the upper left graph in Figure 2. These graphs also show that both have a linear rate of convergence. The bottom left graph shows that the gap between RAFW and AFW is even larger when comparing the cumulative number of computed coefficients of the gradient required to reach a certain target precision.
This out-performance of RAFW over AFW in terms of number of iterations to converge is not predicted by our convergence analysis. We conjecture that the away mechanism improves the trade-off between the cost of the LMO and the alignment of the descent direction with the opposite of the gradient. Indeed, because of the oracle subsampling, the partial FW gap (i.e., the scalar product of the Randomized FW direction with the opposite of the gradient) in RAFW is smaller than in the non-randomized variant, and so there is a higher likelihood of performing an away step.
Finally, the away mechanism enables the support of the RAFW to stay close to that of AFW, which was not the case in the comparison of RFW versus FW. This is illustrated in the right panels of Figure 2.
In Figure 3, we test the Lasso problem on the E2006-tf-idf data set (kogan2009predicting), which gathers volatility of stock returns from companies with financial reports. Each financial report is then represented through its TF-IDF embedding ( and after an initial round of feature selection). The regularizing parameter is chosen to obtain a solution with a fraction of nonzero coefficients.
4.2 Latent Group-Lasso
We write the set of indices from to . Consider and , represents the projection vector of onto its -coordinate. We use the notation to denote the gradient with respect to the variables in group . Similarly is the vector that equals in the coordinates of and elsewhere.
As outlined by jaggi2013revisiting, FW algorithms are particularly useful when the domain is a ball of the latent group norm (obozinski2011group). Consider a set of subsets of such that and denote by any norm on . Frank-Wolfe can be tuned to solve (OPT) with being the ball corresponding to the latent group norm
This formulation matches a constrained version of the regularized (obozinski2011group, equation (5)) when each is proportional to the Euclidean norm. From now on we will consider to be the euclidean norm.
When forms a partition of (i.e., there is no overlap between groups), this norm coincides with the group lasso norm.
Given an element of , consider the hyper-disk
(obozinski2011group, lemma 8) shows that this constraint set is the convex hull of .
At iteration of RFW for a random subset of size of we then propose to simply run RFW (algorithm 1) with . Denoting by the LMO in RFW becomes
This means that we only need to compute the gradient on the index. Depending on and on the sub-sampling rate, this can be a significant computational benefit.
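The randomized LMO over group atoms can be sketched as follows, assuming the Euclidean norm on each group so that the minimizer over a group's disk is the negative, rescaled restriction of the gradient to that group. The function name, the `radius` argument and the list-of-index-lists representation of the groups are assumptions of this example.

```python
import math


def randomized_lmo_latent_group(grad, groups, sampled, radius):
    """Minimize <grad, s> over the disks of the sampled groups only."""
    best_norm, best_group = -1.0, None
    for g in sampled:
        # Only gradient coordinates inside the sampled groups are read.
        norm = math.sqrt(sum(grad[i] ** 2 for i in groups[g]))
        if norm > best_norm:
            best_norm, best_group = norm, g
    atom = [0.0] * len(grad)
    if best_norm > 0.0:
        for i in groups[best_group]:
            atom[i] = -radius * grad[i] / best_norm
    return best_group, atom


groups = [[0, 1], [1, 2], [2, 3]]  # overlapping groups of size 2
chosen, atom = randomized_lmo_latent_group(
    [3.0, 4.0, 0.0, 1.0], groups, sampled=[0, 2], radius=2.0
)
```

The selected group is the sampled one whose gradient restriction has the largest Euclidean norm, and the returned atom is supported on that group only.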
We illustrate the convergence speed-up of using RFW over FW for latent group lasso regularized least square regression.
For we consider a collection of groups of size with an overlap of and the associated atomic set . We chose the ground truth parameter vector with a fraction of nonzero coefficients, where on each active group, the coefficients are generated from a Gaussian distribution. The data is a set of pairs randomly generated from a Gaussian with some additive Gaussian noise. The regularizing parameter is , set so that the unconstrained optimum lies outside of the constraint set.
Large dataset and Streaming Model.
The design matrix is stored on disk. We allow both RFW and FW to access it only through chunks of size . This streaming model allows a wall-clock comparison of the two methods on very large scale problems.
Computing the gradient when the objective is the least squares loss consists of a matrix-vector product. Computing it on a batch of coordinates then requires the same operation with a smaller matrix. When computing the gradient at each randomized LMO call, the cost of slicing the design matrix can then offset the gain from doing a smaller matrix-vector product.
With data not loaded in memory, which is typically the case for large datasets, both the LMO and the randomized LMO incur this data access cost. Note also that RFW allows any sampling scheme, including one that minimizes the cost of data retrieval.
5 Conclusion and future work
We give theoretical guarantees of convergence for randomized versions of FW that exhibit the same order of convergence as their deterministic counterparts. As far as we know, for the case of RAFW, this is the first contribution of its kind. While the theoretical complexity bounds do not necessarily imply this, our numerical experiments show that the randomized versions often outperform their deterministic counterparts on $\ell_1$-regularized and latent group lasso regularized least squares. In both cases, randomizing the LMO allows us to compute the gradient only on a subset of its coordinates. We use it to speed up the method in a streaming model where the data is accessed by chunks, but there might be other situations where the structure of the polytope can be leveraged to make subsampling computationally beneficial.
There are also linearly-convergent FW variants other than AFW, for which it might be possible to derive randomized variants.
Finally many recent results (Goldfarb2016; goldfarb2017linear; hazan2016variance) on FW have combined various improvements of FW (away mechanism, sliding, lazy oracles, stochastic FW, etc.). Randomized oracles add to this toolbox and could further improve its benefits.
A.A. is at the département d’informatique de l’ENS, École normale supérieure, UMR CNRS 8548, PSL Research University, 75005 Paris, France, and INRIA Sierra project-team. T.K. is a PhD student under the supervision of A.A. and acknowledges funding from the CFM-ENS chaire les modèles et sciences des données. FP is funded through the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement 748900. The authors would like to thank Robert Gower, Vincent Roulet and Federico Vaggi for helpful discussions.
How to Get Away with Subsampling: a Frank-Wolfe
Algorithm for Optimizing over Large Atomic Domains
We denote by the conditional expectation at iteration , conditioned on all the past and by a full expectation. We denote by a tilde the values that come from the deterministic analysis of FW. Denote by . For , denote by all integer between and .
Appendix A Proof of sub-linear convergence for Randomized Frank-Wolfe
In this section we provide a convergence proof for Algorithm 1. The proof is loosely inspired by that of (locatello17a, Appendix B.1), with the obvious difference that the result of the LMO is a random variable in our case.
Proof. By definition of the curvature constant, at iteration we have
By minimizing with respect to on we obtain
which is the definition of in the algorithm with Variant 2. Hence, we have
an inequality which is also valid for Variant 1, since by the line search procedure the objective value at obtained with Variant 1 is always smaller than or equal to that obtained with Variant 2. Denote by ,
We write the FW atom if we had started the FW algorithm at , and the expectation conditioned on all the past until . We have
where the second inequality follows from the definition of expectation and the fact that the minimum is non-positive since it is zero for . The last inequality is a consequence of uniform sampling and of the fact that the FW gap is an upper bound on the dual gap, i.e., .
Induction. From (15) the following is true for any
Taking unconditional expectation and writing , we get for any
With , we get by induction
where . Initialization follows the fact that the curvature constant is positive. For , from (17) and the induction hypothesis
The last inequality comes from the fact that . Indeed, with , it is equivalent to
Since the last inequality holds, this concludes the proof.
Appendix B Proof of linear convergence for RAFW
Away curvature and geometric strong convexity
. The away curvature constant is a modification of the curvature constant described in the previous subsection, in which the FW direction is replaced with an arbitrary direction :
The geometric strong convexity constant depends on both the function and the domain (in contrast to the standard strong convexity definition) and is defined as (see “An Affine Invariant Notion of Strong Convexity” in (lacoste2015global) for more details)
where and the positive step-size quantity:
In particular is the Frank Wolfe atom starting from . is the away atom when considering all possible expansions of as a convex combinations of atoms in . Denote by and by . is finally defined by
Similarly following (lacoste2015global, Lemma 9 in Appendix F), the geometric -generally-strongly-convex constant is defined as
where represents the solution set of (OPT).
In the context of RAFW, denotes the finite set of extreme atoms such that . At iteration , is a random subset of elements of , where is the current support of the iterate. The randomized LMO is performed over so that for Algorithm 2, is the FW atom at iteration for RAFW.
Note that when , Algorithm 2 does exactly the same as AFW. For the sake of simplicity we will consider that this is not the case. Indeed we would otherwise fall back into the deterministic setting and the proof would just be that of (lacoste2015global).
We use tilde notation for quantities that are specific to the deterministic FW setting. For instance, is the FW atom for AFW starting at .
Similarly the Away atom is such that and it does not depend on the sub-sampling at iteration . Here we do not use any tilde because it is a quantity that appears both in AFW and its Randomized counter-part.
In AFW, is an upper bound on the dual gap, named the pair-wise dual gap (lacoste2015global). We consider the corresponding partial pair-wise dual gap . It is partial in the sense that the maximum is computed on a subset of , so that it is no longer guaranteed to be an upper bound on the dual gap.
Structure of the proof.
The main proof follows the scheme of the deterministic one of AFW in (lacoste2015global, Theorem 8). It is divided in three parts. The first part consists in upper bounding with . It does not depend on the specific construction of the iterates and thus remains the same as that in (lacoste2015global). The second part provides a lower bound on the progress on the algorithm, namely
with , when it is not doing a bad drop step (defined above). As a proxy for this event, we use the binary variable that equals for bad drop steps and otherwise.
The difficulty lies in that we guarantee a geometrical decrease only when and . Because of the sub-sampling and unlike in the deterministic setting, is a random variable. Lemma 3 provides a lower bound on the probability of interest, , for the last part of the main proof.
Finally, the last part of the proof constructs a bound on the number of times we can expect both and subject to the constraint that at least half of the iterates satisfy . It is done by recurrence.
Appendix B.1 Lemmas
This lemma ensures the chosen direction in RAFW is a good descent direction, and links it with which may be equal to .
Let and be as defined in Algorithm 2. Then for , we have
Proof. The first inequality appeared already in the convergence proof of lacoste2015global, which we repeat here for completeness. By the definition of we have:
so that we have . By definition of , it implies .
Lemma 2 is just a simple combinatorial result needed in Lemma 3. Consider a sequence of numbers, we lower bound the probability for the maximum of a subset of size greater than to be equal to the maximum of the sequence.
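The probability in question can be checked exactly with a short computation. The sketch below counts the uniformly drawn subsets that miss every maximizer of the sequence; the function name and the parameter `n_maximizers` are introduced only for this example. With a single maximizer the probability equals the subset size divided by the sequence length, and additional maximizers can only increase it.

```python
from fractions import Fraction
from math import comb


def prob_subsample_hits_max(n, p, n_maximizers=1):
    """P(a uniform size-p subset of n items contains at least one maximizer)."""
    misses = comb(n - n_maximizers, p)  # subsets avoiding every maximizer
    return 1 - Fraction(misses, comb(n, p))


single = prob_subsample_hits_max(n=10, p=3)
double = prob_subsample_hits_max(n=10, p=3, n_maximizers=2)
```

Exact rational arithmetic is used so that the comparison with the ratio of subset size to sequence length is not affected by rounding.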
Consider any sequence in with , and a subset of size . We have
Proof. Consider . We have if and only if at least one element of belongs to :
By definition has at least one element . Since all subsets are taken uniformly at random, we just have to count the number of subsets of of size with