1 Introduction
Quadratic function minimization/maximization is a versatile tool used in machine learning, statistics, and data mining, and it can represent many fundamental problems such as linear regression, $k$-means clustering, principal component analysis (PCA), support vector machines, kernel machines, and more (see [Mur12]). In general, quadratic function minimization/maximization is NP-hard. When the problem is convex (for minimization) or concave (for maximization), we can solve it by solving a system of linear equations, which requires $O(n^3)$ time, where $n$ is the number of variables. There are faster approximation methods based on stochastic gradient descent [Bot04] and on the multiplicative update algorithm [CHW12]. However, these methods still require time at least linear in the number of nonzero entries of the input, which is prohibitive when we need to handle a huge number of variables.
Quadratic function minimization over a sphere is also an important problem. This minimization problem is often called the trust region subproblem, since it must be solved in each step of a trust region algorithm. Trust region algorithms are among the most important tools for solving nonlinear programming problems, as they are robust and can be applied to ill-conditioned problems. In addition, trust region subproblems are useful in many other problems such as constrained eigenvalue problems [GGvM89], least-squares problems [ZCS10], combinatorial optimization problems [Bus06], and many more. While the problem is non-convex, it has been shown that it exhibits strong duality and can be solved in polynomial time (see [BTT96, YZ03]). In particular, it was shown to be equivalent to certain semidefinite programming problems that can be solved in polynomial time ([NN94, Ali95]). As in the unconstrained case, there are approximation algorithms based on gradient descent [Nes83] and on reducing the problem to a sequence of eigenvalue computations [HK16]. However, as in the unconstrained case, these methods require running time that is linear in the number of nonzero elements of the matrix (which might be as large as $n^2$).
1.1 Our Contributions
In this work, we provide sublinear-time approximation algorithms for minimizing quadratic functions, assuming random access to the entries of the input matrix and vectors.
First, we consider unconstrained minimization. Specifically, for a matrix $A \in \mathbb{R}^{n \times n}$ and vectors $d, b \in \mathbb{R}^n$, we consider the following quadratic function minimization problem:
$$\min_{v \in \mathbb{R}^n} \; \langle v, Av \rangle + n \langle v, \mathrm{diag}(d) v \rangle + n \langle b, v \rangle. \qquad (1)$$
Here, $\mathrm{diag}(d)$ is a matrix whose diagonal entries are specified by $d$, and $\langle \cdot, \cdot \rangle$ denotes the standard inner product.
Theorem 1.1.
Fix $\epsilon > 0$, and let $v^*$ and $z^*$ be an optimal solution and the optimal value, respectively, of Problem (1). Let $S \subseteq [n]$ be a random set such that each index is taken into $S$ independently with probability $p$.
Then, the following holds with probability at least
: Let $\tilde{v}$ and $\tilde{z}$ be an optimal solution and the optimal value, respectively, of the sampled problem, where $\cdot|_S$ is an operator that extracts a submatrix (or subvector) specified by an index set $S$. Then,
where .
Recently, Hayashi and Yoshida [HY16] proposed a constant-time sampling method for this problem with an additive error of for , where and are the optimal solutions to the original and sampled problems, respectively. Although their algorithm runs in constant time, the guarantee is not meaningful when because the optimal value is always of order . Theorem 1.1 shows that we can improve the additive error to , where , as long as the number of samples is polylogarithmic (or more). We note that we always have , and the difference is significant when and are sparse. For example, if and have only nonzero elements, then we have . Our new bound provides a trade-off between the additive error and the time complexity, which was unclear from the argument of Hayashi and Yoshida [HY16].
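To make the sampling idea concrete, the following is a minimal sketch (not the paper's exact algorithm, and with the rescaling of the sampled optimum back to the original scale omitted): sample a random index set $S$, extract the corresponding submatrix and subvector, and solve the small problem. The function names and the choice of a simple positive definite instance are illustrative assumptions.

```python
import numpy as np

def sample_indices(n, p, rng):
    """Include each index in S independently with probability p."""
    return np.flatnonzero(rng.random(n) < p)

def solve_quadratic(A, b):
    """Minimize x^T A x + b^T x for symmetric positive definite A.
    The minimizer solves the linear system 2 A x = -b."""
    x = np.linalg.solve(2 * A, -b)
    return x, x @ A @ x + b @ x

def sampled_minimum(A, b, p, rng):
    """Solve the problem restricted to a random index set S.
    The work depends only on |S|, not on the full dimension n."""
    S = sample_indices(len(b), p, rng)
    return solve_quadratic(A[np.ix_(S, S)], b[S])

rng = np.random.default_rng(0)
n = 200
A = 2.0 * np.eye(n)          # a simple positive definite instance
b = np.ones(n)
x_sub, z_sub = sampled_minimum(A, b, 0.1, rng)
```

Solving the sampled problem takes time polynomial in $|S|$ only; the theorem's content is that (after appropriate rescaling) its optimal value is close to the original one.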
Moreover, we consider minimization over a sphere. Specifically, given a matrix $A \in \mathbb{R}^{n \times n}$, vectors $d, b \in \mathbb{R}^n$, and a radius $r > 0$, we consider the following quadratic function minimization problem over a sphere of radius $r$:
$$\min_{v \in \mathbb{R}^n} \; \langle v, Av \rangle + n \langle v, \mathrm{diag}(d) v \rangle + n \langle b, v \rangle \quad \text{subject to } \|v\| = r. \qquad (2)$$
We give the first sublinear-time approximation algorithm for this problem.
Theorem 1.2.
Let $v^*$ and $z^*$ be an optimal solution and the optimal value, respectively, of Problem (2). Fix $\epsilon > 0$, and let $S$ be a random set such that each index is taken into $S$ independently with probability $p$.
Then, the following holds with probability at least : Let $\tilde{v}$ and $\tilde{z}$ be an optimal solution and the optimal value, respectively, of the sampled problem
Then,
where .
We can design a constant-time algorithm for (2) by using the result of [HY16], but the resulting error bound would be , which is times worse than the bound in Theorem 1.2.
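For intuition on why the sphere-constrained problem is tractable at all, recall the classical approach to the trust region subproblem (this is background, not the paper's sampling algorithm): the stationarity condition is $2(A + \lambda I)x = -b$, and $\|x(\lambda)\|$ is decreasing in $\lambda$, so the multiplier can be found by bisection. A minimal NumPy sketch for the positive definite case, with illustrative names, assuming the unconstrained minimizer lies outside the ball:

```python
import numpy as np

def sphere_min(A, b, r, tol=1e-10):
    """Minimize x^T A x + b^T x subject to ||x|| = r, for symmetric
    positive definite A whose unconstrained minimizer lies outside the
    ball of radius r. Stationarity gives x(lam) = -(2(A + lam I))^{-1} b,
    and ||x(lam)|| decreases in lam >= 0, so we bisect on lam."""
    n = len(b)
    def x_of(lam):
        return np.linalg.solve(2 * (A + lam * np.eye(n)), -b)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(x_of(hi)) > r:   # grow hi to bracket the root
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if np.linalg.norm(x_of(mid)) > r:
            lo = mid
        else:
            hi = mid
    return x_of(hi)
```

Each bisection step costs one linear solve on the (small, sampled) instance, which is why solving the subproblem on a polylogarithmic-size sample is cheap.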
The proofs of Theorems 1.1 and 1.2 rely on a novel decomposition theorem of matrices, which will be discussed in Section 1.3.
As another application of this decomposition theorem, we show that for any (small) $\epsilon > 0$, we can approximate the $t$-th largest singular value of a matrix (denoted $\sigma_t$) to within an additive error of in time
Our algorithms are very simple to implement, and do not require any structure in the input matrix. However, similar results (with better running time) can be obtained by applying known sampling techniques from [FKV04]. Formally, we prove the following.
Theorem 1.3.
Given a matrix , , let
Then, for every , there is an algorithm that runs in time, and outputs a value such that with probability at least ,
We note that since $\sigma_t \le \|A\|_F / \sqrt{t}$ (see Fact 2.1), the relative error the algorithm achieves is at least . Therefore, to get a meaningful approximation, one must have that . So, if we wish to set , then we must have .
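A minimal sketch of a sampling estimator in the spirit of Theorem 1.3 (the paper's exact sampling distribution and rescaling may differ; the uniform sampling and the $n/k$ factor here are assumptions): compute the singular values of a random principal submatrix and rescale.

```python
import numpy as np

def estimate_sigma(A, t, k, rng):
    """Estimate the t-th largest singular value of the n x n matrix A
    from a uniformly random k x k principal submatrix, rescaled by n/k."""
    n = A.shape[0]
    S = rng.choice(n, size=k, replace=False)
    sub = A[np.ix_(S, S)]
    return (n / k) * np.linalg.svd(sub, compute_uv=False)[t - 1]

rng = np.random.default_rng(1)
n, k = 100, 20
A = np.ones((n, n))   # rank one with sigma_1 = n; here the estimate is exact
est = estimate_sigma(A, 1, k, rng)
```

The running time depends only on $k$, not on $n$, once random access to the entries of $A$ is available.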
Finally, we present numerical experiments that confirm the empirical accuracy and runtime of our singular value algorithm (see Section 5.2.3).
1.2 Related work
In the machine learning context, Clarkson et al. [CHW12] considered several machine learning optimization problems and gave sublinear-time approximation algorithms for them. In particular, they considered approximate minimization of a quadratic function over the unit simplex. Namely, given a positive semidefinite matrix and , they showed that it is possible to obtain an approximate solution (up to an additive error of ) in time, which is sublinear in the input size. In contrast, our algorithms run in polylogarithmic time and are much more efficient. Hayashi and Yoshida [HY17] proposed a constant-time approximation algorithm for the Tucker decomposition of tensors, which can be seen as minimizing low-degree polynomials.
In addition to the work of Hayashi and Yoshida [HY16] mentioned above, another relevant line of work is constant-time approximation algorithms for the max cut problem on dense graphs [FK96, MS08]. Let $L$ be the Laplacian matrix of a graph on $n$ vertices. Then, the max cut problem can be seen as maximizing $\langle x, Lx \rangle$ subject to $x \in \{-1, +1\}^n$, and these methods approximate the optimal value to within . Our method for approximating the largest singular values can be seen as an extension of these methods to a continuous setting.
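The identity behind this reformulation is that for a cut encoded by signs $x \in \{-1,+1\}^n$, the number of cut edges equals $\langle x, Lx \rangle / 4$, so maximizing the quadratic form over the hypercube is (up to scaling) the max cut problem. A quick NumPy check on a 3-vertex path (illustrative only, not the cited algorithms):

```python
import numpy as np

# Path graph on vertices 0-1-2: edges (0,1) and (1,2).
edges = [(0, 1), (1, 2)]
n = 3
L = np.zeros((n, n))
for i, j in edges:
    L[i, i] += 1; L[j, j] += 1   # degrees on the diagonal
    L[i, j] -= 1; L[j, i] -= 1   # minus the adjacency off-diagonal

x = np.array([1, 1, -1])          # cut separating vertex 2 from {0, 1}
cut_edges = sum(1 for i, j in edges if x[i] != x[j])
quad = x @ L @ x / 4              # Laplacian quadratic form, scaled by 1/4
```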
1.3 Techniques
The main ingredient in our proof is a novel spectral decomposition theorem for matrices, which may be of independent interest. The theorem states that we can decompose a matrix into the sum of a structured matrix and a pseudorandom matrix. Here, the structured part is a block matrix with a polylogarithmic number of blocks such that the entries in each block are equal, and the pseudorandom part has a small spectral norm. Formally, we prove the following. For a matrix $A$, we define its max norm as $\|A\|_{\max} = \max_{i,j} |A_{ij}|$.
Theorem 1.4.
For any matrix and , there exists a decomposition with the following properties for :

is structured in the sense that it is a block matrix with blocks, such that the entries in each block are equal.

.

.
Our decomposition theorem is a strengthening of the matrix decomposition result of Frieze and Kannan [FK99, FK96]. In particular, they showed that any matrix can be decomposed as for , where the matrices are block matrices and . Here, is the cut norm, defined for a matrix $B \in \mathbb{R}^{n \times m}$ as $\|B\|_C = \max_{S \subseteq [n], T \subseteq [m]} \big| \sum_{i \in S, j \in T} B_{ij} \big|$.
By combining a result of Nikiforov [Nik09] relating the spectral norm to the cut norm with the bound implied by the Frieze-Kannan result, one obtains a spectral-norm bound that is too loose, and thus insufficient, for our applications.
Given our decomposition theorem, we can conclude the following. When approximating (1) and (2), we can disregard the pseudorandom part; this affects our approximation very little, since that part has a small spectral norm. In addition, as the structured part consists of a polylogarithmic number of blocks, such that the entries in each block are equal, we can hit all the blocks by sampling a polylogarithmic number of indices. Hence, we can expect the sampled problem to be a good approximation to the original one. To formally define the distance between the two and to show that it is small, we exploit graph limit theory, initiated by Lovász and Szegedy [LS06] (see the book [Lov12]).
2 Preliminaries
For an integer $n$, we let $[n] = \{1, 2, \ldots, n\}$. Given a set of indices $S \subseteq [n]$ and a vector $v \in \mathbb{R}^n$, we let $v|_S$ be the restriction of $v$ to $S$; that is, $(v|_S)_i = v_i$ for every $i \in S$. Similarly, for a matrix $A \in \mathbb{R}^{n \times m}$ and sets $S \subseteq [n]$ and $T \subseteq [m]$, we denote the restriction of $A$ to $S \times T$ by $A|_{S \times T}$; that is, $(A|_{S \times T})_{ij} = A_{ij}$ for every $i \in S$ and $j \in T$. When $S = T$, we often use $A|_S$ as a shorthand for $A|_{S \times S}$. We use the notation as a shorthand for .
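In NumPy, these restriction operators correspond directly to fancy indexing; a small illustrative example:

```python
import numpy as np

A = np.arange(16).reshape(4, 4)   # a 4 x 4 matrix with entries 0..15
v = np.array([10, 20, 30, 40])
S = [0, 2]                        # index set for rows
T = [1, 3]                        # index set for columns

v_S = v[S]                        # the restriction v|_S
A_ST = A[np.ix_(S, T)]            # the restriction A|_{S x T}
A_S = A[np.ix_(S, S)]             # the shorthand A|_S = A|_{S x S}
```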
Given a matrix $A \in \mathbb{R}^{n \times m}$, we define the Frobenius norm of $A$ as $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ and the max norm of $A$ as $\|A\|_{\max} = \max_{i,j} |A_{ij}|$. For a matrix $A$, we let $\sigma_t(A)$ denote the $t$-th largest singular value of $A$. It is well known that the largest singular value can be evaluated as $\sigma_1(A) = \max_{\|x\| = \|y\| = 1} x^\top A y$.
In addition, we state the following fact regarding the singular values.
Fact 2.1.
Let $A \in \mathbb{R}^{n \times m}$, and consider the singular values of $A$: $\sigma_1 \ge \sigma_2 \ge \cdots$. Then, for every $t$, $\sigma_t \le \|A\|_F / \sqrt{t}$.
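Fact 2.1 is presumably the standard bound $\sigma_t \le \|A\|_F/\sqrt{t}$, which follows in one line from the identity $\|A\|_F^2 = \sum_i \sigma_i^2$:

```latex
t \,\sigma_t^2 \;\le\; \sum_{i=1}^{t} \sigma_i^2 \;\le\; \sum_{i} \sigma_i^2 \;=\; \|A\|_F^2
\qquad\Longrightarrow\qquad
\sigma_t \;\le\; \frac{\|A\|_F}{\sqrt{t}},
```

where the first inequality uses that the singular values are sorted in non-increasing order.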
3 Spectral Decomposition Theorem
In this section we will prove the following decomposition theorem.
Theorem 3.1 (Spectral decomposition, restatement of Theorem 1.4).
For any matrix and , there exists a decomposition with the following properties for :

is structured in the sense that it is a block matrix with blocks, such that the entries in each block are equal.

.

.
The above theorem will serve as a central tool in the analysis of our algorithms. The fact that the structured part is a block matrix with a polylogarithmic number of blocks, such that the entries in each block are equal, implies that by using a polylogarithmic number of samples we can query (with high probability) an entry from each of the blocks. In addition, the fact that the pseudorandom part has a small spectral norm allows us to disregard it, while paying only a small cost in the error of our approximation.
In order to prove the theorem, we introduce the following definition, two lemmas and a claim.
Definition 3.2.
We say that a partition $\mathcal{Q}$ is a refinement of a partition $\mathcal{P}$ if $\mathcal{Q}$ is obtained from $\mathcal{P}$ by splitting some of its sets into one or more parts.
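Refinement can be checked mechanically: every part of the finer partition must be contained in some part of the coarser one. A small illustrative sketch (the function name and example partitions are ours):

```python
def is_refinement(finer, coarser):
    """Return True if every part of `finer` is contained in some part of
    `coarser`, i.e. `finer` is obtained by splitting parts of `coarser`."""
    return all(any(part <= big for big in coarser) for part in finer)

P = [{1, 2}, {3, 4}]       # a partition of {1, 2, 3, 4}
Q = [{1}, {2}, {3, 4}]     # obtained from P by splitting {1, 2}
```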
Lemma 3.3.
Given a matrix and , there exists a block matrix with blocks such that the entries in each block are equal and , where .
In order to prove Lemma 3.3, we will need to prove the following.
Lemma 3.4.
Given a matrix and , let . Then, .
Proof: Assume that has singular values such that . For any , let , and let denote . Then, we can write as,
Consider any , , and let .
Let denote the th column of the matrix . So, . For any , is perpendicular to and all such that , and thus,
Let denote the th row of the matrix . Then similarly, we have , and it follows that .
On the other hand, since , we have and , and therefore,
By the fact that , we get that , which concludes the proof.
With this lemma at hand, we are ready to prove Lemma 3.3.
Proof of Lemma 3.3: Recall that can be written as,
where are the singular values of and and are the corresponding left and right singular vectors. If we let be such that
then we have that for any .
Next we show the existence of , which is a block matrix (with blocks), such that has the same value on every block. We construct as follows.
Let be determined later and let . Let and , and partition the interval into buckets such that
For every such that , we define a partition of the indices in so that for each and . We eliminate empty sets from the partition, if any exist. Note that by the definition of we have that . Next, for every such that , we define as follows.
Next, let and let be a partition of that refines all for which is such that .
Similarly define from by setting , and defining analogously for each . Let and let be a partition of that refines all for which is such that .
Define,
By Fact 2.1 we have that for all . Therefore, we can have at most indices such that . Thus, the partition satisfies . Similarly, we have . Therefore, the resulting matrix is a block matrix with many blocks, such that all the entries in each block are the same. We refer to these as the row partition and the column partition, respectively.
Next, we have that
Consider the “error” term .
where and .
We analyze each of these terms separately. First, note that
So,
Here, follows from the fact that for two indices we have that . follows from the fact that there can be at most indices , such that , and therefore
(3) 
Next, we have that,
Here, follows from the fact that when one of the indices is in or we set the corresponding entry in the rounded vector to . In addition, the last inequality follows from the fact that when we remove the lower singular part of the matrix, we can only increase the value by at most a factor of (see Lemma 3.4). Finally,
Combining all the three terms and setting gives,
Therefore, we get that,
and the lemma follows.
We are left with bounding the max norm of .
Claim 3.5.
Given a matrix and , let be the block approximation matrix defined above. Then,
Proof: By the definition of we have that . By the definition of the rounding process, for every we have that (recall that and ). Similarly, for every we have that . Therefore,
where the last inequality uses (3).
4 Dikernels and Sampling Lemmas
In this section we formalize the idea that is a good approximation of when and are uniformly random subsets of indices. (The proof for , where is a uniformly random subset of indices, is almost identical, and we omit it.) We start by providing some background on dikernels and their connection to matrices, and then move on to proving our sampling lemmas.
4.1 Dikernels and Matrices
We call a (measurable) function $W \colon [0,1]^2 \to \mathbb{R}$ a dikernel. We can regard a dikernel as a matrix whose indices are specified by real values in $[0,1]$. For two dikernels $W, W' \colon [0,1]^2 \to \mathbb{R}$, we define their inner product as $\langle W, W' \rangle = \int_0^1 \int_0^1 W(x, y) W'(x, y) \, dx \, dy$.