Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences

02/19/2012
by Benjamin Recht, et al.

Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized coordinate descent. This paper makes progress toward theoretically evaluating the difference in performance between sampling with- and without-replacement in such algorithms. Focusing on least mean squares optimization, we formulate a noncommutative arithmetic-geometric mean inequality that would prove that the expected convergence rate of without-replacement sampling is faster than that of with-replacement sampling. We demonstrate that this inequality holds for many classes of random matrices and for some pathological examples as well. We provide a deterministic worst-case bound on the discrepancy between the two sampling models, and explore some of the impediments to proving this inequality in full generality. We detail the consequences of this inequality for stochastic gradient descent and the randomized Kaczmarz algorithm for solving linear systems.


1 Introduction

Randomized sequential algorithms abound in machine learning and optimization. The most famous is the stochastic gradient method (see Bottou, 1998; Bertsekas, 2012; Nemirovski et al., 2009; Shalev-Shwartz and Srebro, 2008), but other popular methods include algorithms for alternating projections (see Strohmer and Vershynin, 2009; Leventhal and Lewis, 2010), proximal point methods (see Bertsekas, 2011), coordinate descent (see Nesterov, 2010) and derivative free optimization (see Nesterov, 2011; Nemirovski and Yudin, 1983). In all of these cases, an iterative procedure is derived where, at each iteration, an independent sample from some distribution determines the action at the next stage. This sample is selected with-replacement from a pool of possible options.

In implementations of many of these methods, however, practitioners often choose to break the independence assumption. For instance, in stochastic gradient descent, many implementations pass through each item exactly once in a random order (i.e., according to a random permutation). In randomized coordinate descent, one can cycle over the coordinates in a random order. These strategies, employing without-replacement sampling, are often easier to implement efficiently, guarantee that every item in the data set is touched at least once, and often have better empirical performance than their with-replacement counterparts (see Bottou, 2009; Recht and Ré, 2011; Feng et al., 2012).
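As a concrete illustration (ours, not from the paper), the following Python sketch contrasts one pass of stochastic gradient descent over a quadratic loss using i.i.d. with-replacement draws versus a single random permutation; the loss, stepsize, and data are placeholder choices.

```python
import numpy as np

def sgd_epoch(A, b, x0, stepsize, sampling="without", rng=None):
    """One pass of SGD on f(x) = (1/2) * sum_i (a_i^T x - b_i)^2.

    sampling="without": visit every row exactly once in a random order.
    sampling="with":    draw n independent uniform indices (rows may repeat).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    order = rng.permutation(n) if sampling == "without" else rng.integers(0, n, size=n)
    x = x0.copy()
    for i in order:
        grad = (A[i] @ x - b[i]) * A[i]   # gradient of the i-th summand
        x = x - stepsize * grad
    return x
```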

Unfortunately, the analysis of without-replacement sampling schemes is quite difficult. The independence assumption underlying with-replacement sampling provides an elegant Markovian framework for analyzing incremental algorithms. The iterates in without-replacement sampling are correlated, and studying them requires sophisticated probabilistic tools. Consequently, most analyses of without-replacement optimization assume that the iterations are ordered deterministically. Such deterministic orders might incur exponentially worse convergence rates than randomized methods (Nedic and Bertsekas, 2000), and deterministic orders still require careful estimation of accumulated errors (see Luo, 1991; Tseng, 1998). The goal of this paper is to make progress toward closing the discrepancy between the theory and practice of without-replacement sampling in randomized algorithms.

In particular, in many cases, we demonstrate that without-replacement sampling outperforms with-replacement sampling provided a noncommutative version of the arithmetic-geometric mean inequality holds. Namely, if $A_1, \dots, A_n$ are a collection of positive semidefinite matrices, we define the arithmetic mean to be the $n$-th power of the average $\tfrac{1}{n}\sum_{j=1}^{n} A_j$, and the (symmetrized) geometric mean to be the average of the ordered products $A_{\sigma(1)} A_{\sigma(2)} \cdots A_{\sigma(n)}$ over all permutations $\sigma$ in the group of permutations of the indices. Our conjecture is that the norm of the geometric mean is always less than the norm of the arithmetic mean. Assuming this inequality, we show that without-replacement sampling leads to faster convergence for both the least mean squares algorithm and the randomized Kaczmarz algorithm of Strohmer and Vershynin (2009).
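For intuition, here is a small numerical sketch (ours, not from the paper) that forms the two means for a few random positive semidefinite matrices and compares their operator norms; this only probes the conjectured inequality empirically.

```python
import itertools
import math
import numpy as np

def arithmetic_mean_power(mats):
    """The n-th power of the average (1/n) * sum_j A_j."""
    n = len(mats)
    return np.linalg.matrix_power(sum(mats) / n, n)

def symmetrized_geometric_mean(mats):
    """Average of the ordered products A_{s(1)} ... A_{s(n)} over all permutations s."""
    n = len(mats)
    total = np.zeros_like(mats[0])
    for perm in itertools.permutations(range(n)):
        prod = np.eye(mats[0].shape[0])
        for i in perm:
            prod = prod @ mats[i]
        total += prod
    return total / math.factorial(n)

rng = np.random.default_rng(0)
mats = []
for _ in range(4):
    X = rng.standard_normal((3, 3))
    mats.append(X @ X.T)            # random 3x3 positive semidefinite matrices

geo_norm = np.linalg.norm(symmetrized_geometric_mean(mats), 2)
ari_norm = np.linalg.norm(arithmetic_mean_power(mats), 2)
print(geo_norm, "<=", ari_norm)
```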

Using established work in matrix analysis, we show that these noncommutative arithmetic-geometric mean inequalities hold when there are only two matrices in the pool. We also prove that the inequality is true when all of the matrices commute. We demonstrate that if we do not symmetrize, there are deterministically ordered products of matrices whose norm exceeds the norm of the with-replacement average by an exponential factor. That is, symmetrization is necessary for the noncommutative arithmetic-geometric mean inequality to hold.

While we are unable to prove the noncommutative arithmetic-geometric mean inequality in full generality, we verify that it holds for many classes of random matrices. Random matrices are, in some sense, the most interesting case for machine learning applications. This is particularly evident in applications such as empirical risk minimization and online learning, where the data are conventionally assumed to be generated by some i.i.d. random process. In Section 4, we show that if the matrices are generated i.i.d. from certain distributions, then the noncommutative arithmetic-geometric mean inequality holds in expectation with respect to their randomness. Section 4.1 assumes that the matrices are built from random factors with independent entries, identically sampled from some symmetric distribution. In Section 4.2, we analyze the random matrices that commonly arise in stochastic gradient descent and related algorithms, again proving that without-replacement sampling exhibits faster convergence than with-replacement sampling. We close with a discussion of other open conjectures that could impact machine learning theory, algorithms, and software.

2 Sampling in incremental gradient descent

To illustrate how with- and without-replacement sampling methods differ in randomized optimization algorithms, we focus on one core algorithm, the Incremental Gradient Method (IGM). Recall that the IGM minimizes the function

(2.1)

via the iteration

(2.2)

Here, the initial iterate is an arbitrary starting vector, the step sizes form a sequence of nonnegative scalars, and the indices are chosen using some (possibly deterministic) sampling scheme. When the objective is strongly convex, the IGM iteration converges to a near-optimal solution of (2.1) for any starting point under a variety of step-size protocols and sampling schemes, including constant and diminishing step sizes (see Anstreicher and Wolsey, 2000; Bertsekas, 2012; Nemirovski et al., 2009). When the increments are selected uniformly at random at each iteration, IGM is equivalent to stochastic gradient descent. We use the term IGM here to emphasize that we are studying many possible orderings of the increments. In the next examples, we study the specialized case where the summands are quadratic and the IGM is equivalent to the least mean squares algorithm of Widrow and Hoff (1960).
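To make the iteration (2.2) concrete in the quadratic case, here is a minimal sketch (ours); the order of the indices is supplied by the caller so that any sampling scheme can be plugged in.

```python
import numpy as np

def incremental_gradient(A, b, order, stepsizes, x0=None):
    """IGM on f(x) = (1/2) * sum_i (a_i^T x - b_i)^2.

    `order` is the sequence of indices i_1, i_2, ... produced by any sampling
    scheme, and `stepsizes` is the matching sequence of nonnegative step sizes.
    """
    x = np.zeros(A.shape[1]) if x0 is None else np.array(x0, dtype=float)
    for gamma, i in zip(stepsizes, order):
        residual = A[i] @ x - b[i]
        x -= gamma * residual * A[i]   # gradient step on the i-th quadratic term
    return x
```

Passing `order = rng.permutation(n)` gives one without-replacement pass, while `order = rng.integers(0, n, size=n)` gives the with-replacement counterpart.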

2.1 One-dimensional Examples

First consider the following toy one-dimensional least-squares problem

(2.3)

where the data are a sequence of scalars with a given mean and variance. Applying (2.2) to (2.3) results in the iteration.

If we initialize the method with and take steps of incremental gradient with stepsize , we have

where the subscript denotes the index drawn at each iteration. If the indices are chosen using a without-replacement sampling scheme, the final iterate is exactly the average of all of the data, i.e., the global minimum. On the other hand, using with-replacement sampling, the final iterate is an average of independently drawn data points, which has a positive mean square error.
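A quick simulation (ours; it assumes the stepsize 1/k, under which the iterate is a running average as described above) makes the gap visible for this one-dimensional problem.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_normal(50) + 3.0            # scalar data with mean near 3
n, trials = len(y), 2000
mse = {"without": 0.0, "with": 0.0}

for _ in range(trials):
    for scheme in ("without", "with"):
        idx = rng.permutation(n) if scheme == "without" else rng.integers(0, n, size=n)
        x = 0.0
        for k, i in enumerate(idx, start=1):
            x += (y[i] - x) / k              # stepsize 1/k: x is a running average
        mse[scheme] += (x - y.mean()) ** 2 / trials   # distance to the global minimum

print(mse)   # "without" is essentially zero; "with" is roughly var(y)/n
```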

Another toy example that further illustrates the discrepancy is the least-squares problem

where the weights are positive scalars. Here, the decision variable is a scalar, and the global minimum is clearly zero. Let’s consider the incremental gradient method with a constant stepsize. Then after a number of iterations we will have

If we perform without-replacement sampling, this error is given by

On the other hand, using with-replacement sampling yields

By the arithmetic-geometric mean inequality, we then have that the without-replacement sample is always closer to the optimal value in expectation. This sort of discrepancy is not simply a feature of these toy examples. We now demonstrate that similar behavior arises in multi-dimensional examples.

2.2 IGM in more than one dimension

Now consider IGM in higher dimensions. Let be a vector in and set

where the test vectors are fixed and the noise terms are i.i.d. Gaussian random variables with mean zero and common variance.

We want to compare with- versus without-replacement sampling for the IGM on the cost function

(2.4)

Suppose we take several steps of the IGM with a constant stepsize and access the terms in a specified order. Then we have

Subtracting from both sides of this equation then gives

(2.5)

Here, the product notation means we multiply by the matrix with smallest index first, then left multiply by the matrix with the next index and so on up to the largest index.

Our goal is to estimate the risk after a given number of steps, namely the expected squared distance from the iterate to the optimum, and to demonstrate that this error is smaller for the without-replacement model. The expectation is with respect to both the IGM ordering and the noise sequence. To simplify things a bit, we first take a partial expectation with respect to the noise:

(2.6)

In this case, we need to compare the expected value of matrix products under the with- and without-replacement sampling schemes in order to conclude which is better. Is there a simple conjecture, analogous to the arithmetic-geometric mean inequality, that would guarantee without-replacement sampling is always better?
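The comparison demanded by (2.6) can also be probed numerically. The sketch below (ours; the dimensions, stepsize, and vectors are arbitrary choices) averages the ordered products of contraction matrices of the form I - γ a aᵀ over all with- and without-replacement index sequences and compares the resulting operator norms.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
d, n, gamma = 3, 5, 0.1
vecs = [rng.standard_normal(d) for _ in range(n)]
M = [np.eye(d) - gamma * np.outer(a, a) for a in vecs]   # contraction matrices

def averaged_product(index_tuples):
    """Average of M[i_k] ... M[i_1] over the given index sequences."""
    total = np.zeros((d, d))
    for seq in index_tuples:
        P = np.eye(d)
        for i in seq:
            P = M[i] @ P        # multiply by the smallest index first
        total += P
    return total / len(index_tuples)

without = averaged_product(list(itertools.permutations(range(n))))
with_repl = averaged_product(list(itertools.product(range(n), repeat=n)))
print(np.linalg.norm(without, 2), "vs", np.linalg.norm(with_repl, 2))
```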

2.3 The Randomized Kaczmarz algorithm

As another high-dimensional example in the same spirit, we consider the randomized Kaczmarz algorithm of Strohmer and Vershynin (2009). The Kaczmarz algorithm is used to solve an overdetermined linear system in which the matrix has more rows than columns, and we assume there exists an exact solution satisfying the system. Kaczmarz’s method solves this system by alternating projections (Kaczmarz, 1937) and was implemented in the earliest medical scanning devices (Hounsfield, 1973). In computed tomography, this method is called the Algebraic Reconstruction Technique (Herman, 1980; Natterer, 1986) or Projections onto Convex Sets (Sezan and Stark, 1987).

Kaczmarz’s algorithm consists of iterations of the form

(2.7)

where the rows of are accessed in some deterministic order. This sequence can be interpreted as an incremental variant of Newton’s method on the least squares cost function

with step size equal to  (see Bertsekas, 1999).

Establishing the convergence rate of this method proved difficult in imaging science. Strohmer and Vershynin (2009), on the other hand, proposed a randomized variant of the Kaczmarz method, choosing the next row with-replacement with probability proportional to the norm of that row. Strohmer and Vershynin established linear convergence rates for their iterative scheme. Expanding out (2.7) for several iterations, we see that

Let us suppose that we modify Strohmer and Vershynin’s procedure to employ without-replacement sampling. After a fixed number of steps, is the with-replacement or the without-replacement model closer to the optimal solution?
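As a sketch (ours, not the authors'), the Kaczmarz update (2.7) under either sampling scheme can be written as follows; the with-replacement branch uses row-norm-weighted sampling in the spirit of Strohmer and Vershynin, while the without-replacement branch simply permutes the rows each pass.

```python
import numpy as np

def kaczmarz(A, b, passes=10, sampling="with", rng=None):
    """Kaczmarz iterations x <- x + (b_i - a_i^T x) * a_i / ||a_i||^2."""
    rng = np.random.default_rng() if rng is None else rng
    m, d = A.shape
    row_norms_sq = np.sum(A * A, axis=1)
    x = np.zeros(d)
    for _ in range(passes):
        if sampling == "with":
            # with-replacement, rows weighted by their squared norms
            idx = rng.choice(m, size=m, p=row_norms_sq / row_norms_sq.sum())
        else:
            # without-replacement: each row exactly once, in random order
            idx = rng.permutation(m)
        for i in idx:
            x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x
```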

3 Conjectures concerning the norm of geometric and arithmetic means of positive definite matrices

To formulate a sufficient conjecture which would guarantee that without-replacement sampling outperforms with-replacement sampling, let us first formalize some notation. Throughout, $[n]$ denotes the set of integers from $1$ to $n$. Let a domain be given together with a function of several arguments and a set of $n$ elements of that domain. We define the without-replacement expectation as

That is, we average the value of the function over all ordered tuples of distinct elements from the set. Similarly, the with-replacement expectation is defined as

With these conventions, we can list our main conjectures as follows:

Conjecture 3.1 (Operator Inequality of Noncommutative Arithmetic and Geometric Means)

Let $A_1, \dots, A_n$ be a collection of positive semidefinite matrices. Then we conjecture that the following two inequalities always hold:

(3.1)
(3.2)

Note that in (3.1), the with-replacement expectation of the product is exactly the corresponding power of the arithmetic mean $\tfrac{1}{n}\sum_{i=1}^{n} A_i$.

Assuming this conjecture holds, let us return to the analysis of the IGM (2.6). Assuming that the starting vector is arbitrary and that (3.2) holds, we have that each term in this summation is smaller for the without-replacement sampling model than for the with-replacement sampling model. In turn, we expect the without-replacement implementation to return lower risk after one pass over the data set. Similarly, for the randomized Kaczmarz iteration (2.7), Conjecture 3.1 implies that a without-replacement sample will have lower error after the same number of iterations.

In the remainder of this document we provide several case studies illustrating that these noncommutative variants of the arithmetic-geometric mean inequality hold in a variety of settings, establishing along the way tools and techniques that may be useful for proving Conjecture 3.1 in full generality.

3.1 Two matrices and a search for the geometric mean

Both of the inequalities (3.1) and (3.2) are true when $n = 2$. These inequalities follow from a well-established line of research on estimating the norms of products of matrices, started by the seminal work of Bhatia and Kittaneh (1990).

Proposition 3.2

Both (3.1) and (3.2) hold when $n = 2$.

Proof Let $A$ and $B$ be positive definite matrices. Both of our arithmetic-geometric mean inequalities follow from the stronger inequality

(3.3)

This bound was proven by Bhatia and Kittaneh (2000). In particular, (3.1) follows immediately. For (3.2), note that

(3.4)

and

We can bound (3.4) by

Here, the first inequality is the triangle inequality, and the subsequent equality follows because the norm of a matrix times its transpose is equal to the squared norm of that matrix. The second inequality is (3.3).

To complete the proof we show

in the semidefinite ordering. But this follows by observing

(3.5)

where if and otherwise. Since and both take possible values, the matrix is positive definite, which means that it can be written as a nonnegative sum of products of the given form. We conclude that the difference must be positive definite, and hence the claimed semidefinite ordering holds, completing the proof. (An explicit decomposition (3.5) into Hermitian squares was initially found using the software NCSOSTools by Cafuta et al. (2011). This software finds decompositions of matrix polynomials into sums of Hermitian squares. Our argument was constructed after discovering this decomposition.)

Note that this proposition actually verifies a stronger statement: for two matrices, the arithmetic-geometric mean inequality holds even for deterministic orderings. We will discuss below how symmetrization is necessary for more than two matrices. In fact, considerably stronger inequalities hold for symmetrized products of two matrices. As a striking example, the symmetrized geometric mean actually precedes the square of the arithmetic mean in the positive definite order. Let $A$ and $B$ be positive semidefinite. Then we have

This ordering breaks for three or more matrices, as evinced by the counterexample

The interested reader should consult Bhatia and Kittaneh (2008) for a comprehensive list of inequalities concerning pairs of positive semidefinite matrices.
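As a sanity check of the two-matrix ordering just described (ours; it verifies the identity (A+B)²/4 − (AB+BA)/2 = (A−B)²/4, whose right-hand side is manifestly positive semidefinite):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 4))
Y = rng.standard_normal((4, 4))
A, B = X @ X.T, Y @ Y.T          # random positive semidefinite matrices

gap = 0.25 * (A + B) @ (A + B) - 0.5 * (A @ B + B @ A)
# The gap equals (A - B)^2 / 4, so it is symmetric with nonnegative eigenvalues.
print(np.allclose(gap, 0.25 * (A - B) @ (A - B)))
print(np.linalg.eigvalsh(gap).min() >= -1e-8)
```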

Unfortunately, these techniques are specialized to the case of two matrices, and no proof currently exists for the inequalities when there are more than two matrices. There have been various attempts to extend the noncommutative arithmetic-geometric mean inequalities to more than two matrices. Much of the work in this space has focused on how to properly define the geometric mean of a collection of positive semidefinite matrices. For instance, Ando et al. (2004) list properties desirable for any geometric mean, one of which is that the geometric mean must precede the arithmetic mean in the positive-definite ordering. Ando et al. derive a geometric mean satisfying all of these properties, but the resulting mean in no way resembles the means of matrices discussed in this paper. Instead, their geometric mean is defined as a fixed point of a nonlinear map on matrix tuples. Bhatia and Holbrook (2006) and Bonnabel and Sepulchre (2009) propose geometric means based on geodesic flows on the Riemannian manifold of positive definite matrices; however, these means also do not correspond to the averaged matrix products that we study in this paper.

3.2 When is it not necessary to symmetrize the order?

When the matrices commute, Conjecture 3.1 is a consequence of the standard arithmetic-geometric mean inequality (more precisely, a consequence of Maclaurin’s inequalities).

Theorem 3.3 (Maclaurin’s Inequalities)

Let $a_1, \dots, a_n$ be positive scalars, and let

$$S_k = \binom{n}{k}^{-1} \sum_{1 \le i_1 < \cdots < i_k \le n} a_{i_1} a_{i_2} \cdots a_{i_k}$$

be the normalized $k$-th symmetric sum. Then we have

$$S_1 \ge S_2^{1/2} \ge S_3^{1/3} \ge \cdots \ge S_n^{1/n}.$$

Note that $S_1 \ge S_n^{1/n}$ is the standard form of the arithmetic-geometric mean inequality. See Hardy et al. (1952) for a discussion and proof of this chain of inequalities.

To see that these inequalities immediately imply Conjecture 3.1 when the matrices are mutually commutative, note first that when the matrices are scalars, we have

The higher-dimensional analogs follow similarly. If all of the matrices commute, then they are mutually diagonalizable. That is, we can write each $A_i = Q \Lambda_i Q^T$, where $Q$ is an orthogonal matrix and the $\Lambda_i$ are diagonal matrices of the eigenvalues of $A_i$ in descending order. In this common eigenbasis both the symmetrized geometric mean and the power of the arithmetic mean are diagonal, and comparing them entrywise reduces to the scalar Maclaurin inequalities applied to the eigenvalues, verifying our conjecture. In fact, in this case, any order of the matrix products will satisfy the desired arithmetic-geometric mean inequalities.
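A tiny numerical check of the commuting case (ours; diagonal matrices are used so that everything commutes and the comparison reduces to scalar inequalities on the eigenvalues):

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(4)
n, d = 4, 3
mats = [np.diag(rng.uniform(0.1, 2.0, size=d)) for _ in range(n)]  # commuting PSD matrices

# Symmetrized geometric mean: average of ordered products over all permutations.
geo = sum(np.linalg.multi_dot([mats[i] for i in p])
          for p in itertools.permutations(range(n))) / math.factorial(n)
# Arithmetic mean raised to the n-th power.
ari = np.linalg.matrix_power(sum(mats) / n, n)

print(np.linalg.norm(geo, 2) <= np.linalg.norm(ari, 2) + 1e-12)
```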

3.3 When is it necessary to symmetrize the order?

In contrast, symmetrizing over the order of the product is necessary for noncommutative operators. The following example, communicated to us by Aram Harrow, provides deterministic without-replacement orderings that have exponentially larger norm than the with-replacement expectation. For a given number of matrices, define the collection of vectors

(3.6)

Note that all of the have norm and, for , . The matrices are all positive semidefinite for , and we have the identity

(3.7)

Any set of unit vectors satisfying (3.7) is called a normalized tight frame, and the vectors (3.6) form a harmonic frame due to their trigonometric origin (see Hassibi et al., 2001; Goyal et al., 2001). The product of the is given by

and hence

Therefore, the arithmetic mean is less than the deterministically ordered matrix product for all .
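The flavor of this example is easy to reproduce numerically. The sketch below uses our reconstruction of (3.6), namely the planar unit vectors u_k = (cos(kπ/n), sin(kπ/n)) with A_k = u_k u_kᵀ; under this assumption the deterministically ordered product has norm close to one, while the n-th power of the arithmetic mean has norm 2⁻ⁿ.

```python
import numpy as np

n = 12
u = [np.array([np.cos(k * np.pi / n), np.sin(k * np.pi / n)]) for k in range(1, n + 1)]
A = [np.outer(v, v) for v in u]              # rank-one PSD matrices from the frame

# Deterministically ordered product A_1 A_2 ... A_n.
P = np.eye(2)
for M in A:
    P = P @ M

ordered_norm = np.linalg.norm(P, 2)          # roughly cos(pi/n)^(n-1), close to 1
mean_power_norm = np.linalg.norm(np.linalg.matrix_power(sum(A) / n, n), 2)  # 2**(-n)

print(ordered_norm, "vs", mean_power_norm)
```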

It turns out that this harmonic frame example is in some sense the worst case. The following proposition shows that the geometric mean is always within a factor of of the arithmetic mean for any ordering of the without-replacement matrix product.

Proposition 3.4

Let $A_1, \dots, A_n$ be positive semidefinite matrices. Then

Proof If we sample uniformly from , then we have

Here, the first inequality follows from the triangle inequality. The second holds because the operator norm is submultiplicative. The third inequality follows because the trace dominates the operator norm. The fourth inequality is Maclaurin’s. The fifth inequality follows because the trace of a positive semidefinite matrix is upper bounded by the dimension times the operator norm. The final inequality is again the triangle inequality.

Note that this worst-case bound holds for deterministic orders of matrix products as well. Once we apply the submultiplicative property of the operator norm, all of the non-commutativity is washed out of the problem. Examples of deterministic matrix products saturating this upper bound can be constructed in higher dimensions using frames. If even, set

(3.8)

and for odd

(3.9)

Then one can verify again using standard trigonometric identities that

and that the inner products of adjacent are

These inner products are approximately one when the number of vectors is large. Thus, each of these cases violates the arithmetic-geometric mean inequality for the deterministic ordering by approximately the worst-case factor of Proposition 3.4, provided the problem size is large enough.

At first glance, the harmonic frames example appears to cast doubt on the validity of Conjecture 3.1. However, after symmetrizing over the symmetric group, we can show that the harmonic frames do obey (3.1).

Theorem 3.5

Let . With the defined in (3.6),

This theorem additionally verifies that there is an asymptotic gap between the arithmetic and geometric means of the harmonic frames example after symmetrization. We include a full proof of this result in Appendix B. The proof treats the norm variationally, using the fact that the norm of a positive semidefinite matrix is the maximum of the associated quadratic form over all unit vectors. Our computation then reduces to effectively computing a Fourier transform of a function defined on an appropriately chosen finite group. We show that the Fourier coefficients can be viewed as enumerating sets, and we compute them exactly using generating functions.

The combinatorial argument that we use to prove Theorem 3.5 is very specialized. To provide a broader set of examples, we now turn to show that Conjecture 3.1 does in fact hold for many classes of random matrices.

4 Random matrices

In this section, we show that if the matrices are generated i.i.d. from certain distributions, then Conjecture 3.1 holds in expectation with respect to their randomness. Section 4.1 assumes that each matrix is built from a random factor with independent entries, identically sampled from some symmetric distribution. In Section 4.2, we explore the case when the matrices are random rank-one perturbations of the identity, as was the case in the IGM and Kaczmarz examples.
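Although the results below are analytic, the expectation statements are easy to probe by Monte Carlo. The following sketch (ours; the Gaussian factors and the small sizes are arbitrary choices) estimates the expected operator norms of the symmetrized without-replacement product and of the with-replacement product for i.i.d. random positive semidefinite matrices.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(5)
n, d, trials = 4, 3, 500
norm_wo = norm_w = 0.0

for _ in range(trials):
    mats = []
    for _ in range(n):
        X = rng.standard_normal((d, d))
        mats.append(X @ X.T)                     # i.i.d. positive semidefinite matrices

    geo = sum(np.linalg.multi_dot([mats[i] for i in p])
              for p in itertools.permutations(range(n))) / math.factorial(n)
    ari = np.linalg.matrix_power(sum(mats) / n, n)

    norm_wo += np.linalg.norm(geo, 2) / trials   # symmetrized without-replacement product
    norm_w += np.linalg.norm(ari, 2) / trials    # with-replacement product

print(norm_wo, "vs", norm_w)
```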

4.1 Random matrices satisfy the noncommutative arithmetic-geometric mean inequality

In this section, we prove the following

Proposition 4.1

For each index, suppose the matrix is built from a random factor whose entries are i.i.d. samples from some symmetric distribution. Then Conjecture 3.1 holds in expectation.

Proof Suppose the entries of each factor have finite variance (the theorem would otherwise be vacuous). Let the entries of the factors be denoted by indexed scalars, and collect all of the factors, stacked as columns, into a single matrix.

Let’s first prove that (3.1) holds in expectation for these matrices. First, consider the without-replacement samples, which are considerably easier to analyze. Let a without-replacement sample be drawn from the pool. Then

For the arithmetic mean, we can compute

(4.1)
(4.2)

Note that since the entries are i.i.d. symmetric random variables, each term in this sum is zero if it contains an odd power of any entry. If all of the powers in a summand are even, its expected value is bounded from below. A simple lower bound for this final term (4.2) thus looks only at the contribution from the case when all of the indices are set equal.

Here the inequality is Jensen’s. This calculation proves (3.1) for our family of random matrices. That is, we have demonstrated that the expected value of the with-replacement sample has greater norm than the expected value of the without-replacement sample.

To verify that (3.2) holds for our random matrix model, we first record the following property of the fourth moments of the entries of the factors. We can verify by direct calculation that

(4.3)

A consequence of this lemma is that . Using this identity, we can set and we then have

We compute this identity in a second way that describes its combinatorics more explicitly, which we will use to derive our lower bound.

The second equality uses linearity coupled with the fact that the indices are distinct, so that entries from distinct matrices are independent. Many of the terms in this sum contain odd powers and are therefore zero. Using the fact recorded above, we see that all terms that are non-zero must contain only products of two forms. Then, we can write the sum:

Now consider the case that some index may be repeated (i.e., there exist such that for ). The key observation is the following. Let be a real-valued random variable with a finite second moment. Then,

(4.4)

with equality only in the trivial case. This is Jensen’s inequality applied to a power function: since the random variable is real, its square is nonnegative, and the power function is convex on the nonnegative reals for exponents at least one. To verify the inequality, let a multiplicity denote the number of times each index is repeated and observe

(4.5)
(4.6)

(4.5) follows from (4.3), since all terms are non-negative. The inequality (4.6) is a repeated application of (4.4). The final expression is precisely equal to the without-replacement average. Now, the with-replacement average can be bounded as