We study unconstrained optimization of the form:
where we assume that the function is smooth, convex, and potentially high dimensional. This problem commonly arises in empirical risk minimization (ERM, see Shalev-Shwartz2014)
. State-of-the-art approaches for minimization of convex ERM objectives with large numbers of data points include variants of stochastic gradient descent (SGD) such as SVRG(Johnson2013), SARAH (Nguyen2017) and a plethora of others. Alternatively, one can approach the ERM problem via a dual formulation, where fast coordinate minimization techniques such as SDCA (Shalev-Shwartz2013), or parallel coordinate descent (PCDM; Richtarik2015a) can be applied. This is especially desirable in distributed and parallel environments (HYDRA; Ma2015; Duenner2016). In the dual formulation, these approaches are closely related to methods that subsample the Hessian (Pilanci2015; subsampled-newton; Roosta-Khorasani2016).
We study a block coordinate descent algorithm first introduced by Qu2015Feb. In each iteration of this algorithm, we sample a block of coordinates and then solve a Newton step on the chosen coordinate subspace. However, in place of the true Hessian, a fixed over-approximation matrix is used for the sake of efficiency. The Newton step is computed on a sparsified version of this matrix with all but the selected coordinates set to zero, denoted (see Section 1.2 for the complete notation). Originally, Qu2015Feb called this method Stochastic Dual Newton Ascent (SDNA), appealing to the fact that it operates in a dual ERM formulation. Later, it was also called a Stochastic Newton method (Mutny2018a), while we use the name Randomized Newton Method (RNM) following Gower2019.111They use the name for a more general algorithm.
The sampling strategy for the coordinate blocks has a dramatic impact on the convergence rate (Qu2016). Gower2015 demonstrate that by optimizing the sampling probabilities one can obtain very significant speedups, however this optimization is a semidefinite program which may be even more challenging than the original optimization problem itself. Even when using a basic sampling strategy (such as uniform), the convergence analysis of RNM is challenging because it hinges on deriving the expected pseudoinverse of , henceforth denoted . Prior to this work, no simple closed form expression was known for this quantity.
To overcome this challenge, we focus on a strategy of randomly sampling blocks of coordinates proportionally to the determinant of the corresponding submatrix of , which we call determinantal sampling. Similar sampling schemes have been analyzed in the context of stochastic optimization before (dpp-minibatch; Borsos2019). Recently, Rodomanov2019 analyzed determinantal sampling for randomized block coordinate descent, however they imposed cardinality constraints on the block size, and as a result were unable to obtain a simple expression for .
We use determinantal sampling with randomized block size, which allows us to obtain a simple closed form expression for the expected pseudoinverse:
where is a tunable parameter that is used to control the expected block size. With the use of this new expectation formula, we establish novel bounds on the convergence rate of RNM depending on the spectral properties of the over-approximating matrix . For many instances of the problem, the matrix coincides with the data covariance, and spectral decays of such covariances are well understood (Blanchard2007). This allows us to predict the decay-specific behavior of RNM with determinantal sampling and recommend the optimal block size.
The cost of each iteration of RNM scales cubically with the size of the block due to matrix inversion. Qu2015Feb
demonstrate numerically that for small blocks the optimization time decreases but at some point it starts to increase again. They surmise that the improvement is obtained only as long as the inversion cost is dominated by the other fixed per-iteration costs such as fetching from memory. However, whether the only possible speedup stems from this has remained unclear. We answer this question for determinantal sampling by deriving the optimal subset size in the case of kernel ridge regression. We show that when the eigenvalue decay is sufficiently rapid, then the gain in convergence rate can dominate the cost of inversion even for larger block sizes.
The main contributions of this paper can be summarized as follows:
We obtain a novel and remarkably simple expectation formula for determinantal sampling that allows us to derive a simple and closed form expression for the convergence rate of the Randomized Newton Method.
This allows us to improve the previous bounds on the theoretical speedup of using coordinate blocks of larger sizes. For example, we show that in the case of kernel regression with a covariance operator that has exponentially decreasing spectrum, the theoretical speedup is exponential.
We take into account the actual per iteration cost, and analyze not only the convergence rate of the algorithm, but also its numerical effort to solve a problem up to some relative precision. This allows us to classify the problems into categories where the optimal block size is one, the full matrix, or somewhere in between.
We numerically validate the discovered theoretical properties of determinantal sampling, and demonstrate cases when it improves over uniform sampling, and when it performs similarly.
Let be a non-empty subset of . We let be the matrix composed of columns of the identity matrix . Note that is the
identity matrix. Given an invertible matrix, we can extract its principal sub-matrix with the corresponding rows and columns indexed by via , and additionally keeping the sub-matrix in its canonical place we can define the following operation,
Note that is the matrix obtained from by retaining elements for and ; and all the other elements set to zero. By we denote the Moore-Penrose pseudoinverse. The matrix can be calculated by inverting , and then placing it back into the matrix.
The key assumption that motivates RNM is a smoothness condition that goes beyond the standard assumptions in the optimization literature, where smoothness would be characterized by a symmetric quadratic with the radius . Instead, Assumption 2 below is tighter, allowing for more refined analysis, and can be related to the standard assumption by .
[Smoothness] There exists a symmetric p.d. matrix such that ,
This assumption is satisfied for quadratic problems such as ridge regression with squared loss, where is the data matrix, and
is the vector of responses, which is corrupted via the noise. In this case, Assumption 2 holds with being the offset covariance matrix , where
is the regularization parameter. Beyond quadratic problems, it holds for many common problems such as logistic regression, where. Section 5.1 provides examples in the dual formulation.
2.1 Randomized Newton Method
Let be the iteration count and be the initial point. The Randomized Newton Method algorithm is defined via the following update rule:
where is a subset of coordinates chosen at iteration from random sampling to be defined. Notice that since is a sparse matrix with only a principal submatrix that is non-zero, its inversion costs arithmetic operations. Moreover, only elements of are needed for the update. Note that if then we are in the classical case of coordinate descent, while if , then we are performing a Newton step (with in place of the true Hessian).
The strategy with which one chooses blocks in (3) is of great importance and it influences the algorithm significantly. This strategy, called a sampling and denoted , is a random set-valued mapping with values being subsets of . A proper sampling is such that for all .
The most popular are uniform samplings, i.e., those for which the marginal probabilities are equal:
This class includes -nice and -list samplings (Qu2016). The -nice sampling considers all elements of a power set of with a fixed cardinality s.t. . There are of such subsets and each of them is equally probable. Consequently, the probability . On the other hand, the -list sampling is restricted to ordered and consecutive subsets of the power set, with cardinality fixed to .
Data dependent (and potentially non-uniform) samplings, which sample according to the diagonal elements of , have been analyzed in the context of coordinate descent (Qu2016; Allen-Zhu2016; Hanzely2018; Richtarik2015a).
3 Determinantal Sampling
Our proposed sampling for the Randomized Newton Method is based on a class of distributions called Determinantal Point Processes (DPPs). Originally proposed by dpp-physics
, DPPs have found numerous applications in machine learning(dpp-ml) as well as optimization (dpp-minibatch; Borsos2019)
, for their variance reduction properties and the ability to produce diverse samples. For ap.s.d. matrix , we define as a distribution over all subsets , so that
Even though this is a combinatorial distribution, the normalization constant can be computed exactly. We state this well known fact (e.g., see dpp-ml) separately because it is crucial for proving our main result. [Normalization] For a matrix ,
Note that the distribution samples out of a power set of . While cardinality constrained versions have also been used, they lack certain properties such as a simple normalization constant. Even though the subset size of
is a random variable, it is highly concentrated around its mean, and it can also be easily adjusted by replacing the matrix with a rescaled version, where . This only affects the distribution of the subset sizes, with the expected size given by the following lemma (see dpp-ml). [Subset Size] If , then
By varying the value of , we can obtain any desired expected subset size between and . As we increase , the subset size decreases, whereas if we take , then in the limit the subset size becomes , i.e., always selecting the . While the relationship between and
cannot be easily inverted analytically, it still provides a convenient way of smoothly interpolating between the full Newton and coordinate descent. To give a sense of whatcan be used to ensure subset size bounded by some , we give the following lemma. Let be the eigenvalues of in a decreasing order. If , then .
3.1 New expectation formula
We are now ready to state our main result regarding DPPs, which is a new expectation formula that can be viewed as a matrix counterpart of the determinantal identity from Lemma 4. If and , then
If we let , then the equality in (6) must be replaced by a p.s.d. inequality . We postpone the proof to the appendix. The remarkable simplicity of our result leads us to believe that it is of interest not only in the context of the Randomized Newton Method, but also to the broader DPP community. While some matrix expectation formulas involving the pseudoinverse have been recently shown for some special DPPs (e.g., unbiased-estimates-journal), this result for the first time relates an unregularized subsampled pseudoinverse with a -regularized inverse of the full matrix . Moreover, the amount of regularization that appears in the formula is directly related to the expected sample size.
3.2 Efficient sampling
Efficient DPP sampling has been an active area of research over the past decade. Several different approaches have been developed, such as an algorithm based on the eigendecomposition of (dpp-independence; dpp-ml) as well as an approximate MCMC sampler (rayleigh-mcmc) among others. For our problem, it is important to be able to sample from without having to actually construct the entire matrix , and much faster than it takes to compute the full inverse . Moreover, being able to rapidly generate multiple independent samples is crucial because of the iterative nature of the Randomized Newton Method. A recently proposed DPP sampler satisfies all of these conditions. We quote the time complexity of this method (the bounds hold with high probability relative to the randomness of the algorithm). [dpp-sublinear] For a p.s.d. matrix let where . Given , we can sample
the first in: time,
each next sample of in: time.
Note that the time it takes to obtain the first sample (i.e., the preprocessing cost) is , meaning that we do not actually have to read the entire matrix . Moreover, the cost of producing repeated samples only depends on the sample size , which is typically small. The key idea behind the algorithm of dpp-sublinear is to produce a larger sample of indices drawn i.i.d. proportionally to the marginal probabilities of . For any , the marginal probability of in is:
In the randomized linear algebra literature, this quantity is often called the th -ridge leverage score (ridge-leverage-scores), and sampling i.i.d. according to ridge leverage scores is known to have strong guarantees in approximating p.s.d. matrices.
Approximate ridge leverage score sampling incurs a smaller preprocessing cost compared to a DPP (Calandriello2017), and basically no resampling cost. Motivated by this, we propose to use this sampling as a fast approximation to and our experiments demonstrate that it exhibits similar convergence properties for Randomized Newton. We numerically compare the sampler from Lemma 3.2 against leverage score sampling in Appendix B.
4 Convergence Analysis
In this section, we analyze the convergence properties of the update scheme (3) with determinantal sampling defined by (4). In order to establish linear rate of convergence, we need to assume strong convexity. [Strong Convexity] Under Assumption 2, there exists a such that ,
Intuitively, the parameter measures the degree of accuracy of our quadratic approximation. For a quadratic function .
Strong convexity is not necessary to run RNM (3). In the cases where the function is only convex, we recover the standard sublinear rate depending on . [Karimireddy2018a] Let be convex and satisfy Assumption 2. Then using the update scheme in (3) with any proper sampling,
where is as in (7), and is the set diameter in geometry at the initial level sets.
The preceding two lemmas introduced the quantity characterizing the theoretical convergence rate of the method. By applying our new expectation formula (Theorem 3.1) we obtain a simple form for this quantity under DPP sampling.
Under Assumption 2, given :
Note that depends solely on the smallest eigenvalue and the parameter controlling the expected size. This is not the case for other samplings, and other closed forms are not known in general (Qu2015Feb).
Recall that the smaller the the bigger the subsets. The closed form expression from Theorem 4 combined with Lemma 3 allows us to formulate a recurrence relation between the convergence rates with different expected set sizes. [Recurrence relation] Let be the eigenvalues of in a decreasing order. Let be a positive integer, , and . Then,
and while .
This result allows us to further improve the theoretical bounds from Qu2015Feb on the parameter . Namely, it has been previously established that grows at least linearly with the increasing subset size of -uniform sampling, i.e., . We can establish more informative bounds depending on the eigenvalue decay. Specifically, for a decreasing sequence of eigenvalues ,
For example, given exponentially decaying eigenvalues where , the increase is at least exponential, and the convergence rate is at least bigger. The case with linear speed-up is recovered when all eigenvalues are equal.
5 Optimal Block Size
Our results such as Proposition 4 and inequality (9) describe the convergence speedup of using larger coordinate blocks for RNM with determinantal sampling as a function of the eigenvalues of . In this section, we demonstrate that covariance matrices arising in kernel ridge regression have known asymptotic eigenvalue decays, which allows for a precise characterization of RNM performance.
5.1 Kernel Ridge Regression
The motivating example for our analysis is the dual formulation of kernel ridge regression which is a natural application for block coordinate descent because of its high dimensionality. Suppose our (primal) regression problem is defined by the following objective:
where represents the kernel feature mapping and is the regularization parameter. Due to the Fenchel duality theorem (Borwein2005), the dual formulation of this problem is:
where . It is easy to see that the minimization problem (10) is exactly in the right form for RNM to be applied with the matrix . Notice that is an matrix and sampling sub-matrices of has the interpretation of subsampling the dataset. However, to keep the notation consistent with earlier discussion, w.l.o.g. we will let for the remainder of this section so that is . We will also assume that the minimization problem is solved with the RNM update where each coordinate block is sampled as .
5.2 Exponentially decreasing spectrum
Let be the eigenvalues of in decreasing order. Suppose that the eigenvalue decay is exponentially decreasing:
A classical motivating example for the exponential eigenvalue decay is the squared exponential kernel
, where an analytical form of the decay can be derived for normally distributed data(Rasmussen2006a). In particular, assuming , and using the kernel function in one dimension, the eigenvalues satisfy for a general constant independent of , where
For the ease of exposition, suppose that , where is the regularization constant and . Intuitively, this means that the regularization parameter flattens the decay at , which will play a role in the analysis.
To control the expected size of determinantal sampling, let , where . We get:
where . Asymptotically, if , i.e., the parameter dominates the regularization , then the expected subset is . However, in the regime where , the expected subset size rapidly goes up to (see Figure 0(b) (left)).
We now derive the convergence rate of RNM under determinantal sampling:
Likewise one can see that the convergence rate improves exponentially with .
From Theorem 4, we know that in order to reach an accurate solution from the initial accuracy under the convergence rate , the number of needed steps can be bounded by
Using the bound derived for , we obtain . Since, the computation step is dominated by the inversion operation , the number of arithmetic operations is
The upper bound on the numerical effort in the previous equation has two regimes. At first, for small subset sizes it is increasing, but then exponential decay starts to dominate and using larger blocks significantly improves the convergence rate. Finally it flattens around . Note that when , i.e., for , this phenomenon is visualized in Figure 0(a) where the vertical bars correspond to . In the regime where , inverting the whole matrix seems to be the best option. When , the term dominates the term , and the best subset size is either 1 or on the order of , depending on the value of .
These observations are contrary to the intuition from the previous works. We suspect that, due to fixed memory fetching costs, for small sizes the initial phase is unobserved but the second phase should be observed. Figure 0(a) suggests that for sufficiently small values of the numerical performance is maximized at the attenuation point and the predicted optimal block size is .
5.3 Polynomially decaying spectrum
Suppose that the eigenvalues are decreasing polynomially, i.e., so that for .
As a motivating example, consider a Matérn kernel of order , which has the form , where , are constants, and is a modified Bessel function of order (for details refer to Rasmussen2006a). This class of kernels exhibits asymptotically polynomial decay of eigenvalues (see Seeger2008).
Suppose, for . To control the expected size let us parameterize the tuning parameters as , where is a suitable general constant. Then the convergence rate becomes:
and . If , we can establish by integral approximation that , otherwise the expected size grows faster. Additionally, with increasing the convergence rate always improves.
When , similarly as in the preceding subsection, the numerical cost becomes . This suggests that for the total numerical cost decreases for larger subsets, while for the problems with smaller , the cost increases. In general, it is difficult to obtain general insights from the formulas, but the visalization in Figure 1b (right) suggests that if the regularization constant is large (small ), even problems with large might incur more cost as the subset size increases.
This suggests that small block sizes matching the memory fetching costs should be optimal if either the regularization is large or if is small. With the same assumption, if the desired accuracy is very high, performing full matrix inversion can be more efficient, corresponding to in Figure 1b (right). Note that increasing the accuracy to which we optimize the problem shifts the curves up in the logarithmic plot, while keeping the end point fixed.
5.4 Sparse spectrum
Suppose that only out of the eigenvalues are relatively large, while the remaining ones are very small. This scenario occurs with a linear kernel where the number of large eigenvalues corresponds to the number of features, and the remaining ones are proportional to the regularization parameter .
For the ease of exposition, let the large eigenvalues all be equal to . Lemma 5 implies that if then . The convergence rate can be split to two cases:
We see that once a discontinuity in the spectrum implies a discontinuity in the convergence rate. Consequently, the optimal subset size is of the order of as long as is sufficiently small.
We numerically validate the theoretical findings from the previous sections. Our main objective is to demonstrate that the convergence behavior of RNM under DPP sampling aligns well with the behavior of RNM under uniform sampling (called -nice), which is more commonly used. This would suggest that our convergence analysis under DPP sampling is also predictive for other standard samplings. In addition to providing evidence for this claim, we also show that there are cases where DPP sampling leads to superior performance of RNM.
Even though there exist efficient algorithms for DPP sampling, we chose to use approximate ridge leverage score sampling as a cheaper proxy for DPP sampling, as suggested in a recent line of work (dpp-intermediate; dpp-sublinear). The real data experiments were performed with sampling according to the -approximate ridge leverage scores (Calandriello2017). We always report the mean value of reruns of the experiment with the given parameters.
The first experiment deals with data sampled from a Gaussian distribution. The optimization using a kernelwith sparse spectrum (Figure 1(a)) verifies the theoretical findings that the optimal block size should be of the same order as . Using similarly generated data, and the relation in (11) to relate lengthscale and of squared exponential kernel, we reproduce the prediction of the theory that for sharper decays the optimal expected size should be larger (see Figure 1(b), compared with theory, Figure 0(a)). The performance of DPP and uniform sampling is on par as the intuition suggests, since for normally distributed data even a uniform subsample provides good summary statistics.
Gaussian Mixture Data
Akin to results from the sketching literature (e.g., see unbiased-estimates-journal)
, we suspect that the superior convergence of DPP sampling over uniform presents itself primarily if the dataset is heterogeneous. By heterogeneity we mean that a uniform subsampling of the points is likely not a good summary of the dataset. Consider a dataset where the points are sampled from a Gaussian Mixture Model withclusters that are equally likely. In order to have a good summary, a point from each cluster should be present in the sample. DPP samples are generally more diverse than uniform samples which makes it more likely that they will cover all the clusters. In Figure 3, we see that DPP significantly outperforms uniform sampling for this dataset because it allows RNM to solve more representative subproblems.
Real Data Experiments
We perform two real data experiments on standard UCI datasets where we optimize until statistical precision. In Figure 4a, we optimize linear ridge regression on the superconductivity dataset. Next, in Figure 4b we fit kernel ridge regression with squared exponential kernel on the cpusmall dataset. For both datasets, the optimal subset size under DPP sampling roughly matches the optimal size under uniform sampling. Moreover, in the case of the superconductivity dataset, as suggested by the theory for linear kernels, the optimal size is of the same order as the feature dimensionality.
We analyzed a sampling strategy for the Randomized Newton Method, where coordinate blocks of the Hessian over-approximation are sampled according to their determinant. This sampling allows for a simple interpretation of the convergence rate of the algorithm, which was previously not well understood. We demonstrated that for empirical risk minimization this convergence analysis allows us to predict the optimal size for the sampled coordinate blocks in order to minimize the total computational cost of the optimization.
This work was supported by SNSF grant 407540_167212 through the NRP 75 Big Data program. Also, MD thanks the NSF for funding via the NSF TRIPODS program.
Convergence Analysis of the Randomized Newton Method
with Determinantal Sampling
Appendix A Proofs
Proof of Theorem 3.1.
First, assume that . Since , we have for all . We will next use the following standard determinantal formula which holds for any and any invertible matrix :
Applying this formula to the submatrices of and denoting by the sub-vector of indexed by , we show that for any :
Since the above holds for all , the equality also holds for the matrices. To obtain the result with , it suffices to replace with . ∎
Proof of Lemma 3.
The eigenvalues of are so
which concludes the proof. ∎
a.2 Convergence Analysis
Proof of Theorem 4.
where . ∎
Proof of Proposition 4.
Dividing the denominator and the numerator by finishes the proof.
a.3 Dual convergence rate
The dual convergence rate established in Qu2015Feb relies on the notion of expected separable over-approximation. Namely, the existence of s.t. , where is the vector of marginal probabilities. In case of DPP sampling, one can choose , and apply dual convergence results established in this literature. By we denote element-wise product.
Appendix B Leverage Score Sampling vs Dpp Sampling
We perform a simple experiment on the Gaussian Mixtures dataset where the matrix has a sparse spectrum. In Figure 5 we see that the optimization process is influenced minimally.
Appendix C Other Samplings
The convergence properties of RNM with determinantal sampling depend solely on the spectral properties of . This is not true of other common samplings such as -nice. Indeed we can improve or worsen the performance of -nice sampling when is transformed via spectrum preserving operation such as unitary transformation
Suppose that we are given an eigenvalues of the matrix , for any sampling is it possible to find a spectrum preserving rotation such that is at least as small as which corresponds to DPP sampling with the same expected cardinality? The answer turns out to be negative, and we show counter-example.
[Counter-example] Let be a sampling such that is sampled with probability and and probability. The expected size of the subset and irrespective of the matrix .
Suppose matrix has degenerate spectrum such that is eigenvalue with multiplicity and is eigenvalue with multiplicity where . In order s.t. , , then . In what circumstances does DPP sampling perform better than a uniform sampling? First, we consider circumstances where uniform sampling is optimal.
c.1 Uniform sampling
It is important to allow for variation in the off-diagonal of . If we consider only diagonal , the optimal sampling is uniform sampling. Let be diagonal. The quantity of a sampling over a power set constrained by is maximized for uniform samplings.
Proof of Lemma c.1 .
We want to maximize the minimum eigenvalue of a matrix . For a diagonal we know that . Hence, , where is a vector of marginals . Hence, the minimum eigenvalue is the minimum marginal probability subject to a constraint that . This leads to an optimum where for all . Hence the optimal sampling distribution is uniform. ∎
c.2 Parallel Sampling
The parallel extension of the update method 3 has been considered in Mutny2018a and Karimireddy2018a. Namely, the authors consider a case, when the updates with machines are aggregated together to form a single update in the form , where is the aggregating parameter. It is known that for parallel disjoint samplings the convergence rate increases linearly with the number of processors. For independent samplings the aggregating parameter depends on the quantity,
which in the case of DPP sampling is equal to . The quantity , and as , the aggregation operation becomes averaging . For DPP sampling, we can see an inverse relationship between increasing by increasing block size, which inherently makes the parallelization problem more difficult by increasing .