## 1 Introduction

A quadratic function is one of the most important function classes in machine learning, statistics, and data mining. Many fundamental problems such as linear regression, $k$-means clustering, principal component analysis, support vector machines, and kernel methods Murphy:2012 can be formulated as a minimization problem of a quadratic function. In some applications, it suffices to compute the minimum value of a quadratic function rather than its minimizer. For example, Yamada *et al.* Yamada:2011 proposed an efficient method for estimating the Pearson divergence, which provides useful information about data, such as the density ratio Sugiyama:2012 . They formulated the estimation problem as the minimization of a squared loss and showed that the Pearson divergence can be estimated from the minimum value. The least-squares mutual information Suzuki:2011 is another example that can be computed in a similar manner.

Despite its importance, the minimization of a quadratic function has the issue of scalability. Let $n$ be the number of variables (the "dimension" of the problem). In general, such a minimization problem can be solved by quadratic programming (QP), which requires $\mathrm{poly}(n)$ time. If the problem is convex and there are no constraints, then the problem is reduced to solving a system of linear equations, which requires $O(n^3)$ time. Both methods easily become infeasible, even for medium-scale problems, say, when $n$ is in the tens of thousands.

Although several techniques have been proposed to accelerate quadratic function minimization, they require at least linear time in $n$. This is problematic when handling problems of ultrahigh dimension, for which even linear time is slow or prohibitive. For example, stochastic gradient descent (SGD) is an optimization method that is widely used for large-scale problems. A nice property of this method is that, if the objective function is strongly convex, it outputs a point that is sufficiently close to an optimal solution after a constant number of iterations Bottou:2004 . Nevertheless, in each iteration, we need at least $O(n)$ time to access the $n$ variables. Another technique is low-rank approximation, such as Nyström's method Williams:2001 . The underlying idea is to approximate the problem using a low-rank matrix, which drastically reduces the time complexity. However, we still need to compute matrix–vector products of size $n$, which requires $O(n)$ time. Clarkson *et al.* Clarkson:2012 proposed sublinear-time algorithms for special cases of quadratic function minimization. However, "sublinear" there is with respect to the number of pairwise interactions of the variables, which is $O(n^2)$, and their algorithms require $O(n^c)$ time for some constant $c > 0$.

#### Our contributions:

Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $d, b \in \mathbb{R}^n$ be vectors. Then, we consider the following quadratic problem:

$$\min_{v \in \mathbb{R}^n} p_{n,A,d,b}(v), \quad \text{where } p_{n,A,d,b}(v) = \langle v, Av \rangle + n \langle v, \mathrm{diag}(d) v \rangle + n \langle b, v \rangle. \tag{1}$$

Here, $\langle \cdot, \cdot \rangle$ denotes the inner product and $\mathrm{diag}(d)$ denotes the diagonal matrix whose diagonal entries are specified by $d$. Note that a constant term can be included in (1); however, it is irrelevant when optimizing (1), and hence we ignore it.
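For concreteness, the objective can be evaluated directly; a minimal NumPy sketch, assuming the form $p_{n,A,d,b}(v) = \langle v, Av \rangle + n\langle v, \mathrm{diag}(d)v \rangle + n\langle b, v \rangle$ (the helper name `quadratic_objective` is ours):

```python
import numpy as np

def quadratic_objective(A, d, b, v):
    """Evaluate p_{n,A,d,b}(v) = <v, Av> + n<v, diag(d) v> + n<b, v>."""
    n = len(v)
    return v @ A @ v + n * (d * v * v).sum() + n * (b @ v)
```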

Let $z^*$ be the optimal value of (1) and
let $\epsilon, \delta \in (0,1)$ be parameters. Then, the main goal
of this paper is the computation of a value $z$ with
$|z - z^*| \leq \epsilon n^2$ with probability at least $1 - \delta$ in
*constant time*, that is, time independent of $n$. Here, we assume
the real RAM model Brattka:1998cv , in which we can perform
basic algebraic operations on real numbers in one step. Moreover, we
assume that we have query accesses to $A$, $d$, and $b$, with
which we can obtain an entry of them by specifying an index. We note
that an additive error of $\epsilon n^2$ is typically tolerable because
$\langle v, Av \rangle$ consists of $n^2$ terms, and
$n\langle v, \mathrm{diag}(d) v \rangle$ and $n\langle b, v \rangle$
consist of $n$ terms, each scaled by $n$. Hence, we can regard the error of
$\epsilon n^2$ as an error of $\epsilon$ for each
term, which is reasonably small in typical situations.

Let $\cdot|_S$ be an operator that extracts a submatrix (or subvector) specified by an index set $S \subseteq [n]$; then, our algorithm is defined as follows, where the parameter $k := k(\epsilon, \delta)$ will be determined later.

In other words, we sample a constant number $k$ of indices from the set $[n]$, and then solve the problem (1) restricted to these indices, that is, $\min_{v \in \mathbb{R}^k} p_{k, A|_S, d|_S, b|_S}(v)$. Note that the number of queries and the time complexity are $O(k^2)$ and $\mathrm{poly}(k)$, respectively. In order to analyze the difference between the optimal values of $p_{n,A,d,b}$ and $p_{k,A|_S,d|_S,b|_S}$, we want to measure the "distances" between $A$ and $A|_S$, $d$ and $d|_S$, and $b$ and $b|_S$, and to show that they are small. To this end, we exploit graph limit theory, initiated by Lovász and Szegedy Lovasz:2006jj (refer to the book Lovasz:2012wn ), in which the distance between two graphs on different numbers of vertices is measured by considering their continuous versions. Although the primary interest of graph limit theory is graphs, we can extend the argument to analyze matrices and vectors.
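The sampling scheme can be sketched end to end as follows. This is our own illustration, not the paper's implementation: it assumes the unconstrained convex case, where the restricted problem is solved in closed form from the stationarity condition $(A + A^\top)v + 2k\,\mathrm{diag}(d)v + kb = 0$, and the restricted minimum is rescaled by $(n/k)^2$:

```python
import numpy as np

def minimize_quadratic(n, A, d, b):
    """Exact minimum of p(v) = <v, Av> + n<v, diag(d) v> + n<b, v>,
    assuming the Hessian (A + A^T) + 2n diag(d) is positive definite."""
    M = A + A.T + 2 * n * np.diag(d)
    v = np.linalg.solve(M, -n * b)           # stationary point
    return v @ A @ v + n * (d * v * v).sum() + n * (b @ v)

def constant_time_estimate(A, d, b, k, rng):
    """Sample k indices with replacement, solve the restricted problem,
    and rescale its optimal value by (n/k)^2."""
    n = len(d)
    S = rng.integers(0, n, size=k)           # k uniform indices from [n]
    z_S = minimize_quadratic(k, A[np.ix_(S, S)], d[S], b[S])
    return (n / k) ** 2 * z_S
```

On an instance whose entries are all constant, every sample yields an exact rescaled copy of the problem, so the estimate matches the true minimum; in general, only the additive $\epsilon n^2$ guarantee discussed above holds.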

Using synthetic and real settings, we demonstrate that our method is orders of magnitude faster than standard polynomial-time algorithms and that the accuracy of our method is sufficiently high.

#### Related work:

Several constant-time approximation algorithms are known for combinatorial optimization problems such as the max cut problem on dense graphs Frieze:1996er ; Mathieu:2008vs , constraint satisfaction problems Alon:2002be ; Yoshida:2011da , and the vertex cover problem Nguyen:2008fr ; Onak:2012cl ; Yoshida:2012jv . However, as far as we know, no such algorithm is known for continuous optimization problems.

A related notion is property testing Goldreich:1998wa ; Rubinfeld:1996um , which aims to design constant-time algorithms that distinguish inputs satisfying some predetermined property from inputs that are "far" from satisfying it. Characterizations of constant-time testable properties are known for the properties of dense graphs Alon:2009gn ; Borgs:2006el and the affine-invariant properties of functions on a finite field Yoshida:2014tq ; Yoshida:2016zz .

#### Organization:

In Section 2, we introduce the basic notions from graph limit theory. In Section 3, we show that we can obtain a good approximation to (a continuous version of) a matrix by sampling a constant-size submatrix in the sense that the optimizations over the original matrix and the submatrix are essentially equivalent. Using this fact, we prove the correctness of Algorithm 1 in Section 4. We show our experimental results in Section 5.

## 2 Preliminaries

For an integer $n \geq 1$, let $[n]$ denote the set $\{1, 2, \ldots, n\}$. The notation $a = b \pm c$ means that $b - c \leq a \leq b + c$. In this paper, we only consider functions and sets that are measurable.

Let $S = (x_1, \ldots, x_k)$ be a sequence of indices in $[n]$. For
a vector $v \in \mathbb{R}^n$, we denote the *restriction* of
$v$ to $S$ by $v|_S \in \mathbb{R}^k$; that is, $(v|_S)_i = v_{x_i}$ for
every $i \in [k]$. For a matrix $A \in \mathbb{R}^{n \times n}$, we
denote the *restriction* of $A$ to $S$ by
$A|_S \in \mathbb{R}^{k \times k}$; that is, $(A|_S)_{ij} = A_{x_i x_j}$ for
every $i, j \in [k]$.
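In code, the restriction operator is plain fancy indexing (a sketch; note that $S$ may contain repeated indices because it is a sequence, not a set, and we use 0-based indices here):

```python
import numpy as np

def restrict_vector(v, S):
    """(v|_S)_i = v_{x_i} for the index sequence S."""
    return v[np.asarray(S)]

def restrict_matrix(A, S):
    """(A|_S)_{ij} = A_{x_i x_j}."""
    S = np.asarray(S)
    return A[np.ix_(S, S)]
```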

### 2.1 Dikernels

Following Lovasz:2013bv , we call a (measurable) function $W : [0,1]^2 \to \mathbb{R}$ a *dikernel*.
A dikernel is a generalization of a *graphon* Lovasz:2006jj , which is symmetric and whose range is bounded in $[0,1]$.
We can regard a dikernel as a matrix whose indices are specified by real values in $[0,1]$.
We stress that the term dikernel has nothing to do with kernel methods.

For two functions $f, g : [0,1] \to \mathbb{R}$, we define their inner product as $\langle f, g \rangle = \int_0^1 f(x) g(x)\, dx$. For a dikernel $W : [0,1]^2 \to \mathbb{R}$ and a function $f : [0,1] \to \mathbb{R}$, we define a function $Wf : [0,1] \to \mathbb{R}$ as $(Wf)(x) = \langle W(x, \cdot), f \rangle$.

Let $W : [0,1]^2 \to \mathbb{R}$ be a dikernel. The *$L_p$ norm* $\|W\|_p$
for $p \geq 1$ and the *cut norm* $\|W\|_\square$ of $W$
are defined as

$$\|W\|_p = \Bigl( \int_0^1 \int_0^1 |W(x,y)|^p\, dx\, dy \Bigr)^{1/p} \quad \text{and} \quad \|W\|_\square = \sup_{S, T \subseteq [0,1]} \Bigl| \int_S \int_T W(x,y)\, dx\, dy \Bigr|,$$

respectively, where the supremum is over all pairs of measurable subsets $S, T \subseteq [0,1]$.
We note that these norms satisfy the triangle inequality and that $\|W\|_\square \leq \|W\|_1$.
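For the step dikernel $W^A$ arising from a matrix $A$ (defined in Section 2.2), the supremum in the cut norm is attained on unions of grid cells, so it reduces to a maximization over subsets of rows and columns. A brute-force sketch (exponential in $n$, for illustration only; the helper name `cut_norm` is ours):

```python
import itertools
import numpy as np

def cut_norm(A):
    """||W^A||_cut = max over subsets S, T of [n] of
    |sum_{i in S, j in T} A_ij| / n^2.  Exponential-time illustration."""
    n = A.shape[0]
    best = 0.0
    for S in itertools.product([0, 1], repeat=n):
        for T in itertools.product([0, 1], repeat=n):
            masked = np.outer(S, T).astype(float) * A
            best = max(best, abs(masked.sum()))
    return best / n ** 2
```

The bound $\|W\|_\square \leq \|W\|_1$ can be checked numerically: for $W^A$, $\|W^A\|_1$ is the mean of $|A_{ij}|$.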

Let $\lambda$ be the Lebesgue measure on $[0,1]$. A map $\pi : [0,1] \to [0,1]$
is said to be *measure-preserving* if the pre-image
$\pi^{-1}(X)$ is measurable for every measurable set $X$, and
$\lambda(\pi^{-1}(X)) = \lambda(X)$. A *measure-preserving
bijection* is a measure-preserving map whose inverse map exists and
is also measurable (and then also measure-preserving). For a measure-preserving
bijection $\pi$ and a dikernel
$W : [0,1]^2 \to \mathbb{R}$, we define the dikernel
$\pi(W) : [0,1]^2 \to \mathbb{R}$ as $\pi(W)(x, y) = W(\pi(x), \pi(y))$.

### 2.2 Matrices and Dikernels

Let $W : [0,1]^2 \to \mathbb{R}$ be a dikernel and $S = (x_1, \ldots, x_k)$ be a sequence of elements in $[0,1]$. Then, we define the matrix $W|_S \in \mathbb{R}^{k \times k}$ so that $(W|_S)_{ij} = W(x_i, x_j)$.

We can construct the dikernel $W^A : [0,1]^2 \to \mathbb{R}$ from a matrix $A \in \mathbb{R}^{n \times n}$ as follows. Let $I_1 = [0, \frac{1}{n}]$ and $I_i = (\frac{i-1}{n}, \frac{i}{n}]$ for $i = 2, \ldots, n$. For $x \in [0,1]$, we define $i_n(x) \in [n]$ as the unique integer such that $x \in I_{i_n(x)}$. Then, we define $W^A(x, y) = A_{i_n(x) i_n(y)}$. The main motivation for creating a dikernel from a matrix is that, by doing so, we can define the distance between two matrices $A$ and $B$ of different sizes via the cut norm, that is, $\|W^A - W^B\|_\square$.

We note that the distribution of $A|_S$, where $S = (x_1, \ldots, x_k)$ is a sequence of indices uniformly and independently sampled from $[n]$, exactly matches the distribution of $W^A|_{S'}$, where $S' = (x'_1, \ldots, x'_k)$ is a sequence of elements uniformly and independently sampled from $[0,1]$.
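This correspondence is easy to check in code: mapping a uniform sample $x \in [0,1]$ through $i_n$ yields a uniform index in $[n]$, and restricting $W^A$ to the samples gives exactly the matrix restriction (a sketch with 0-based indices; helper names are ours):

```python
import math
import numpy as np

def index_map(x, n):
    """i_n(x): the unique i with x in I_i (1-based), where I_1 = [0, 1/n]."""
    return max(1, math.ceil(n * x))

def dikernel_from_matrix(A):
    """Return W^A as a function on [0,1]^2."""
    n = A.shape[0]
    return lambda x, y: A[index_map(x, n) - 1, index_map(y, n) - 1]

rng = np.random.default_rng(1)
n, k = 5, 3
A = rng.standard_normal((n, n))
W = dikernel_from_matrix(A)
xs = rng.uniform(size=k)                    # S' sampled uniformly from [0,1]
S = [index_map(x, n) - 1 for x in xs]       # corresponding (0-based) indices
W_S = np.array([[W(x, y) for y in xs] for x in xs])   # W^A|_{S'}
A_S = A[np.ix_(S, S)]                       # A|_S
```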

## 3 Sampling Theorem and the Properties of the Cut Norm

In this section, we prove the following theorem, which states that, given a sequence of dikernels $W^1, \ldots, W^T$, we can obtain a good approximation to all of them simultaneously by sampling a small number of elements in $[0,1]$. Formally, we prove the following:

###### Theorem 3.1.

Let $W^1, \ldots, W^T : [0,1]^2 \to [-L, L]$ be dikernels. Let $S$ be a sequence of $k$ elements uniformly and independently sampled from $[0,1]$. Then, with a probability of at least $1 - \exp(-\Omega(k / \log_2 k))$, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that, for any functions $f, g : [0,1] \to [-K, K]$ and any $t \in [T]$, we have

$$\bigl| \langle f, W^t g \rangle - \langle f, \pi(W^{W^t|_S}) g \rangle \bigr| = O\Bigl( L K^2 \sqrt{T / \log_2 k} \Bigr).$$

We start with the following lemma, which states that, if a dikernel $W$ has a small cut norm, then $\langle f, Wg \rangle$ is negligible no matter what $f$ and $g$ are. Hence, we can focus on the cut norm when proving Theorem 3.1.

###### Lemma 3.2.

Let $\epsilon > 0$ and let $W : [0,1]^2 \to \mathbb{R}$ be a dikernel with $\|W\|_\square \leq \epsilon$. Then, for any functions $f, g : [0,1] \to [-K, K]$, we have $|\langle f, Wg \rangle| \leq 4 \epsilon K^2$.

###### Proof.

For $t \in [0, K]$ and a function $h : [0,1] \to [-K, K]$, let $L_h(t) := \{x \in [0,1] : h(x) \geq t\}$ be the level set of $h$ at $t$. For every $x \in [0,1]$, we have $h(x) = \int_0^K \bigl( \mathbf{1}_{L_h(t)}(x) - \mathbf{1}_{L_{-h}(t)}(x) \bigr)\, dt$. Hence,

$$\langle f, Wg \rangle = \int_0^K \!\! \int_0^K \Bigl( W(L_f(s), L_g(t)) - W(L_f(s), L_{-g}(t)) - W(L_{-f}(s), L_g(t)) + W(L_{-f}(s), L_{-g}(t)) \Bigr)\, ds\, dt,$$

where $W(S, T) := \int_S \int_T W(x, y)\, dx\, dy$. Each of the four terms is at most $\epsilon$ in absolute value because $\|W\|_\square \leq \epsilon$, and hence $|\langle f, Wg \rangle| \leq 4 \epsilon K^2$. ∎

To introduce the next technical tool, we need several definitions. We
say that a partition $\mathcal{Q}$ is a *refinement* of a partition
$\mathcal{P} = (V_1, \ldots, V_p)$ if $\mathcal{Q}$ is obtained by splitting each set
$V_i$ into one or more parts. The partition $\mathcal{P} = (V_1, \ldots, V_p)$
of the interval $[0,1]$ is called an *equipartition* if
$\lambda(V_i) = 1/p$ for every $i \in [p]$. For a dikernel $W : [0,1]^2 \to \mathbb{R}$
and an equipartition $\mathcal{P} = (V_1, \ldots, V_p)$
of $[0,1]$, we define $W_{\mathcal{P}} : [0,1]^2 \to \mathbb{R}$ as the function
obtained by averaging $W$ over each $V_i \times V_j$ for $i, j \in [p]$.
More formally, we define

$$W_{\mathcal{P}}(x, y) = \frac{1}{\lambda(V_i)\lambda(V_j)} \int_{V_i} \int_{V_j} W(x', y')\, dx'\, dy' = p^2 \int_{V_i} \int_{V_j} W(x', y')\, dx'\, dy',$$

where $i$ and $j$ are the unique indices such that $x \in V_i$ and $y \in V_j$, respectively.

The following lemma states that any dikernel $W$ can be well approximated by $W_{\mathcal{P}}$ for an equipartition $\mathcal{P}$ into a small number of parts.

###### Lemma 3.3 (Weak regularity lemma for functions on $[0,1]^2$ Frieze:1996er ).

Let $\mathcal{P}$ be an equipartition of $[0,1]$ into $m$ sets. Then, for any dikernel $W : [0,1]^2 \to \mathbb{R}$ and any $\epsilon > 0$, there exists a refinement $\mathcal{Q}$ of $\mathcal{P}$ with $|\mathcal{Q}| \leq m c^{1/\epsilon^2}$ sets for some constant $c > 0$ such that

$$\|W - W_{\mathcal{Q}}\|_\square \leq \epsilon \|W\|_2.$$

###### Corollary 3.4.

Let $W^1, \ldots, W^T : [0,1]^2 \to \mathbb{R}$ be dikernels. Then, for any $\epsilon > 0$, there exists an equipartition $\mathcal{P}$ into $|\mathcal{P}| \leq c^{T/\epsilon^2}$ parts for some constant $c > 0$ such that, for every $t \in [T]$,

$$\|W^t - W^t_{\mathcal{P}}\|_\square \leq \epsilon \|W^t\|_2.$$

###### Proof.

Let $\mathcal{P}^0$ be a trivial partition, that is, a partition consisting of a single part $[0,1]$. Then, for each $t \in [T]$, we iteratively apply Lemma 3.3 with $\mathcal{P}^{t-1}$, $W^t$, and $\epsilon/2$, and we obtain the equipartition $\mathcal{P}^t$ into at most $|\mathcal{P}^{t-1}| c_0^{4/\epsilon^2}$ parts such that $\|W^t - W^t_{\mathcal{P}^t}\|_\square \leq \frac{\epsilon}{2} \|W^t\|_2$, where $c_0$ is the constant of Lemma 3.3. Since $\mathcal{P}^T$ is a refinement of $\mathcal{P}^t$, we have $W^t_{\mathcal{P}^T} - W^t_{\mathcal{P}^t} = (W^t - W^t_{\mathcal{P}^t})_{\mathcal{P}^T}$, and since averaging does not increase the cut norm, $\|W^t - W^t_{\mathcal{P}^T}\|_\square \leq \epsilon \|W^t\|_2$ for every $t \in [T]$. Then, $\mathcal{P}^T$ satisfies the desired property with $c = c_0^4$. ∎

As long as $k$ is sufficiently large, the cut norms of $W|_S$ and $W$ are close in expectation:

###### Lemma 3.5 ((4.15) of Borgs:2008hd ).

Let $W : [0,1]^2 \to [-L, L]$ be a dikernel and $S = (x_1, \ldots, x_k)$ be a sequence of $k$ elements uniformly and independently sampled from $[0,1]$. Then, we have

$$\mathbb{E}_S \|W^{W|_S}\|_\square = \|W\|_\square \pm O\Bigl( \frac{L}{k^{1/4}} \Bigr),$$

where $W^{W|_S}$ is the dikernel constructed from the matrix $W|_S$.

Finally, we need the following concentration inequality.

###### Lemma 3.6 (Azuma’s inequality).

Let $k$ be a positive integer and $c > 0$. Let $Z = (Z_1, \ldots, Z_k)$, where $Z_1, \ldots, Z_k$ are independent random variables and each $Z_i$ takes values in some measure space $\Omega_i$. Let $f : \Omega_1 \times \cdots \times \Omega_k \to \mathbb{R}$ be a function. Suppose that $|f(z) - f(z')| \leq c$ whenever $z$ and $z'$ only differ in one coordinate. Then, for any $\lambda > 0$,

$$\Pr\bigl[ |f(Z) - \mathbb{E} f(Z)| > \lambda \bigr] \leq 2 \exp\Bigl( -\frac{\lambda^2}{2 c^2 k} \Bigr).$$

Now we prove the counterpart of Theorem 3.1 for the cut norm.

###### Lemma 3.7.

Let $W^1, \ldots, W^T : [0,1]^2 \to [-L, L]$ be dikernels. Let $S$ be a sequence of $k$ elements uniformly and independently sampled from $[0,1]$. Then, with a probability of at least $1 - \exp(-\Omega(k / \log_2 k))$, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that, for every $t \in [T]$, we have

$$\|W^t - \pi(W^{W^t|_S})\|_\square = O\Bigl( L \sqrt{T / \log_2 k} \Bigr).$$

###### Proof.

First, we bound the expectations and then prove their concentrations. We apply Corollary 3.4 to $W^1, \ldots, W^T$ with a parameter $\epsilon > 0$ to be chosen later, and let $\mathcal{P} = (V_1, \ldots, V_p)$ be the obtained equipartition with $p \leq c^{T/\epsilon^2}$ parts such that

$$\|W^t - W^t_{\mathcal{P}}\|_\square \leq \epsilon \|W^t\|_2$$

for every $t \in [T]$. By Lemma 3.5, for every $t \in [T]$, we have

$$\mathbb{E}_S \|(W^t - W^t_{\mathcal{P}})|_S\|_\square \leq \|W^t - W^t_{\mathcal{P}}\|_\square + O\Bigl( \frac{L}{k^{1/4}} \Bigr) \leq \epsilon \|W^t\|_2 + O\Bigl( \frac{L}{k^{1/4}} \Bigr).$$

Then, for any measure-preserving bijection $\pi : [0,1] \to [0,1]$ and every $t \in [T]$, the triangle inequality gives

$$\mathbb{E}_S \|W^t - \pi(W^{W^t|_S})\|_\square \leq 2 \epsilon \|W^t\|_2 + O\Bigl( \frac{L}{k^{1/4}} \Bigr) + \mathbb{E}_S \|W^t_{\mathcal{P}} - \pi(W^{W^t_{\mathcal{P}}|_S})\|_\square. \tag{2}$$

Thus, we are left with the problem of sampling from $W^t_{\mathcal{P}}$. Recall that $S = (x_1, \ldots, x_k)$ is a sequence of independent random variables that are uniformly distributed in $[0,1]$, and let $Z_i$ be the number of points $x_j$ that fall into the set $V_i$. It is easy to compute that $\mathbb{E} Z_i = k \lambda(V_i)$ and $\mathrm{Var}(Z_i) \leq k \lambda(V_i)$.

We construct a partition $\mathcal{P}' = (V'_1, \ldots, V'_p)$ of $[0,1]$ into sets such that $\lambda(V'_i) = Z_i / k$ and $\lambda(V_i \cap V'_i)$ is as large as possible for each $i \in [p]$. For each $t \in [T]$, we construct the dikernel $W'^t$ whose value on $V'_i \times V'_j$ equals the value of $W^t_{\mathcal{P}}$ on $V_i \times V_j$. Then, there exists a measure-preserving bijection $\pi : [0,1] \to [0,1]$ such that $W'^t$ agrees with $\pi(W^{W^t_{\mathcal{P}}|_S})$ for each $t \in [T]$. Then, for every $t \in [T]$, we have

$$\|W^t_{\mathcal{P}} - \pi(W^{W^t_{\mathcal{P}}|_S})\|_\square \leq \|W^t_{\mathcal{P}} - W'^t\|_1 \leq 2L \sum_{i \in [p]} \bigl| \lambda(V_i) - Z_i/k \bigr|,$$

which we rewrite as

$$\mathbb{E}_S \|W^t_{\mathcal{P}} - \pi(W^{W^t_{\mathcal{P}}|_S})\|_\square \leq 2L \sum_{i \in [p]} \mathbb{E}_S \bigl| \lambda(V_i) - Z_i/k \bigr|.$$

The expectation on the right-hand side satisfies $\mathbb{E}_S |\lambda(V_i) - Z_i/k| \leq \sqrt{\mathrm{Var}(Z_i)}/k \leq \sqrt{\lambda(V_i)/k}$. By the Cauchy-Schwarz inequality, $\sum_{i \in [p]} \sqrt{\lambda(V_i)/k} \leq \sqrt{p/k}$.

Inserting this into (2), we obtain

$$\mathbb{E}_S \|W^t - \pi(W^{W^t|_S})\|_\square \leq 2 \epsilon \|W^t\|_2 + O\Bigl( \frac{L}{k^{1/4}} \Bigr) + O\Bigl( L \sqrt{\frac{p}{k}} \Bigr), \quad \text{where } p \leq c^{T/\epsilon^2}.$$

Choosing $\epsilon = \sqrt{2 T \log_2 c / \log_2 k}$ (so that $p \leq \sqrt{k}$), we obtain the upper bound

$$\mathbb{E}_S \|W^t - \pi(W^{W^t|_S})\|_\square = O\Bigl( L \sqrt{T / \log_2 k} \Bigr).$$

Observing that $\|W^t - \pi(W^{W^t|_S})\|_\square$ changes by at most $O(L/k)$ if one element in $S$ changes, we apply Azuma's inequality with $\lambda = O(L \sqrt{T / \log_2 k})$ and the union bound over $t \in [T]$ to complete the proof. ∎

## 4 Analysis of Algorithm 1

In this section, we analyze Algorithm 1. Because we want to use dikernels for the analysis, we introduce a continuous version $P_{A,d,b}$ of $p_{n,A,d,b}$ (recall (1)). The real-valued function $P_{A,d,b}$ on the functions $f : [0,1] \to \mathbb{R}$ is defined as

$$P_{A,d,b}(f) = \langle f, W^A f \rangle + \langle f^2, W^d \mathbf{1} \rangle + \langle f, W^b \mathbf{1} \rangle,$$

where $f^2$ is the function such that $f^2(x) = f(x)^2$ for every $x \in [0,1]$, $\mathbf{1}$ is the constant function that has a value of $1$ everywhere, and $W^d$ and $W^b$ are the dikernels constructed from the vectors $d$ and $b$, that is, $W^d(x, y) = d_{i_n(x)}$ and $W^b(x, y) = b_{i_n(x)}$. The following lemma states that the minimizations of $p_{n,A,d,b}$ and $P_{A,d,b}$ are equivalent:

###### Lemma 4.1.

Let $A \in \mathbb{R}^{n \times n}$ be a matrix and $d, b \in \mathbb{R}^n$ be vectors. Then, we have

$$\min_{v \in [-L, L]^n} p_{n,A,d,b}(v) = n^2 \inf_{f : [0,1] \to [-L, L]} P_{A,d,b}(f)$$

for any $L > 0$.

###### Proof.

First, we show that $n^2 \inf_f P_{A,d,b}(f) \leq \min_v p_{n,A,d,b}(v)$. Given a vector $v \in [-L, L]^n$, we define $f_v : [0,1] \to [-L, L]$ as $f_v(x) = v_{i_n(x)}$. Then,

$$P_{A,d,b}(f_v) = \langle f_v, W^A f_v \rangle + \langle f_v^2, W^d \mathbf{1} \rangle + \langle f_v, W^b \mathbf{1} \rangle = \frac{\langle v, Av \rangle}{n^2} + \frac{\langle v, \mathrm{diag}(d) v \rangle}{n} + \frac{\langle b, v \rangle}{n} = \frac{p_{n,A,d,b}(v)}{n^2}.$$

Then, we have $n^2 \inf_f P_{A,d,b}(f) \leq n^2 P_{A,d,b}(f_{v^*}) = p_{n,A,d,b}(v^*) = \min_v p_{n,A,d,b}(v)$, where $v^*$ is a minimizer of $p_{n,A,d,b}$ over $[-L, L]^n$.

Next, we show that $\min_v p_{n,A,d,b}(v) \leq n^2 \inf_f P_{A,d,b}(f)$. Let $f : [0,1] \to [-L, L]$ be a measurable function. Then, for $x \in I_i$, we have

$$\frac{\partial P_{A,d,b}(f)}{\partial f(x)} = \int_0^1 \bigl( W^A(x, y) + W^A(y, x) \bigr) f(y)\, dy + 2 f(x) d_i + b_i.$$

Note that the form of this partial derivative depends on $x$ only through the index $i = i_n(x)$ and the value $f(x)$; hence, in an optimal solution, we can assume that $f(x) = f(y)$ if $i_n(x) = i_n(y)$. In other words, $f$ is constant on each of the intervals $I_1, \ldots, I_n$. For such $f$, we define the vector $v_f \in \mathbb{R}^n$ as $(v_f)_i = f(x)$, where $x$ is any element in $I_i$. Then, we have $p_{n,A,d,b}(v_f) = n^2 P_{A,d,b}(f)$ by the same calculation as above. Finally, we have $\min_v p_{n,A,d,b}(v) \leq n^2 \inf_f P_{A,d,b}(f)$. ∎
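The identity $p_{n,A,d,b}(v) = n^2 P_{A,d,b}(f_v)$ used in this proof can be sanity-checked numerically. Assuming the form of $P_{A,d,b}$ given above, for a step function $f_v$ all integrals reduce to exact sums over the grid cells:

```python
import numpy as np

def p(A, d, b, v):
    """p_{n,A,d,b}(v) = <v, Av> + n<v, diag(d) v> + n<b, v>."""
    n = len(v)
    return v @ A @ v + n * (d * v * v).sum() + n * (b @ v)

def P_step(A, d, b, v):
    """P_{A,d,b}(f_v) for the step function f_v(x) = v_{i_n(x)}:
    <f, W^A f> = <v, Av>/n^2, <f^2, W^d 1> = <v, diag(d) v>/n,
    <f, W^b 1> = <b, v>/n."""
    n = len(v)
    return (v @ A @ v) / n**2 + (d * v * v).sum() / n + (b @ v) / n
```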

###### Theorem 4.2.

Let $v^*$ and $z^*$ be an optimal solution and the optimal value, respectively, of problem (1). By choosing $k(\epsilon, \delta) = 2^{\Theta(1/\epsilon^2)} + \Theta(\log\frac{1}{\delta} \log\log\frac{1}{\delta})$, with a probability of at least $1 - \delta$, a sequence $S$ of $k$ indices independently and uniformly sampled from $[n]$ satisfies the following: Let $\tilde{v}$ and $\tilde{z}$ be an optimal solution and the optimal value, respectively, of the problem $\min_{v \in \mathbb{R}^k} p_{k, A|_S, d|_S, b|_S}(v)$. Then, we have

$$\frac{n^2}{k^2} \tilde{z} = z^* \pm O(\epsilon K L^2 n^2),$$

where $K = \max\{\max_{i,j} |A_{ij}|, \max_i |d_i|, \max_i |b_i|\}$ and $L = \max\{1, \|v^*\|_\infty, \|\tilde{v}\|_\infty\}$.
