Quantum-Inspired Classical Algorithms for Singular Value Transformation

10/13/2019
by Dhawal Jethwani, et al.

A recent breakthrough by Tang (STOC 2019) showed how to "dequantize" the quantum algorithm for recommendation systems by Kerenidis and Prakash (ITCS 2017). The resulting algorithm, classical but "quantum-inspired", efficiently computes a low-rank approximation of the users' preference matrix. Subsequent works have shown how to construct efficient quantum-inspired algorithms for approximating the pseudo-inverse of a low-rank matrix as well, which can be used to (approximately) solve low-rank linear systems of equations. In the present paper, we pursue this line of research and develop quantum-inspired algorithms for a large class of matrix transformations that are defined via the singular value decomposition of the matrix. In particular, we obtain classical algorithms with complexity polynomially related (in most parameters) to the complexity of the best quantum algorithms for singular value transformation recently developed by Chakraborty, Gilyén and Jeffery (ICALP 2019) and Gilyén, Su, Low and Wiebe (STOC19).


1 Introduction

Background. One of the most celebrated quantum algorithms discovered so far is the HHL algorithm [12]. This quantum algorithm solves a system of linear equations of the form $Ax = b$, where $A$ is an $n \times n$ matrix and $b$ is an $n$-dimensional vector, in time polynomial in $\log n$ when the matrix $A$ is sufficiently sparse and well-conditioned. This is exponentially better than the best known classical algorithms, which run in time polynomial in $n$ (see also [1, 6, 7, 20] for improvements and relaxations of the assumptions). There are nevertheless two significant caveats. First, the input should be given in a way that allows very specific quantum access. In particular, the HHL algorithm requires the ability to efficiently create a quantum state proportional to $b$. The second, and main, caveat is that the output of the HHL algorithm is not the solution $x$ of the linear system (which is an $n$-dimensional vector) but only an $O(\log n)$-qubit quantum state proportional to this vector. While measuring this quantum state can give some meaningful statistics about the solution $x$, this naturally does not give enough information to obtain the whole vector $x$. In this perspective, the HHL algorithm does not explicitly solve the system of equations, but instead enables sampling from the solution in a very efficient way.

There have been several proposals to apply the HHL algorithm to linear-algebra based machine learning tasks, leading for instance to the discovery of quantum algorithms for principal component analysis (PCA) [15] and quantum support vector machines [16]. We refer to [3] for a recent survey on this field, called quantum machine learning. One of the most convincing applications of quantum algorithms to machine learning has been speeding up recommendation systems [14]. In machine learning, recommendation systems are used to predict the preferences of users. From a mathematical perspective, the core task in recommendation systems can be modeled as follows: given an $m \times n$ matrix $A$ (representing the preferences of the users) and an index $i \in [m]$ (representing one specific user), sample from the $i$-th row of a low-rank approximation of $A$. Kerenidis and Prakash [14] showed how to adapt the HHL algorithm to solve this problem in time poly-logarithmic in $m$ and $n$, which was exponentially better than the best known classical algorithms for recommendation systems.

Similarly to the HHL algorithm, the quantum algorithm from [14] works only under the assumption that the input is stored in an appropriate structure (called “Quantum Random-Access Memory”, or “QRAM”) that allows specific quantum access. Very recently, Tang [18] has shown that, assuming the input is stored in a classical data structure that allows $\ell_2$-norm sampling access (i.e., allows sampling rows with probability proportional to the square of their $\ell_2$-norm), classical algorithms for recommendation systems running in time poly-logarithmic in the dimensions can be designed as well. This result eliminates one of the best examples of quantum speedup for machine learning. The paper [18] also introduced the term “quantum-inspired algorithms” to refer to such classical algorithms obtained by “dequantizing” quantum algorithms.

More quantum-inspired algorithms were soon developed: Tang [17] first showed how to construct classical algorithms for PCA that essentially match the complexity of the quantum algorithm for PCA from [15] mentioned above. Gilyén, Lloyd and Tang [10] and, independently, Chia, Lin and Wang [5] have shown how to obtain new classical algorithms for solving linear systems of equations, which also essentially match the complexity of the quantum algorithms when the input matrix has low rank (see below for details). We also refer to [2] for a discussion of the performance of these quantum-inspired algorithms in practice.

Singular value transformation. The Singular Value Decomposition (SVD) of a matrix $A$ is a factorization of the form $A = U \Sigma V^\dagger$, where $U$ and $V$ are unitary matrices and $\Sigma$ is a diagonal matrix with non-negative real numbers on the diagonal, and where $V^\dagger$ denotes the complex-conjugate transpose of $V$. A crucial property is that this decomposition exists for any complex matrix. Given a function $f$, the singular value transformation associated with $f$ maps the matrix $A$ to the matrix $f(A) = U f(\Sigma) V^\dagger$, where $f(\Sigma)$ is the diagonal matrix obtained from $\Sigma$ by replacing each diagonal entry $\sigma$ by $f(\sigma)$. We refer to Definition 1 in Section 2 for more details.
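To make the definition concrete, the following minimal numpy sketch applies a function to the singular values via a full dense SVD. This is only an illustrative reference implementation (the helper name svt and the use of a dense SVD are our own assumptions); the whole point of the quantum-inspired algorithms discussed below is to avoid ever computing a full SVD of the input matrix.

```python
import numpy as np

def svt(A, f):
    """Apply the singular value transformation associated with f to A.

    Computes U f(Sigma) V^dagger from a full (dense) SVD; this is only a
    reference implementation, much slower than the sampling-based
    algorithms described in this paper.
    """
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag([f(x) for x in s]) @ Vh

# Sanity check: the identity function recovers A itself.
A = np.random.randn(5, 3) + 1j * np.random.randn(5, 3)
assert np.allclose(svt(A, lambda x: x), A)
```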

An important example is obtained by taking the “pseudo-inverse” function $f$ such that $f(x) = 1/x$ if $x \neq 0$ and $f(0) = 0$. Solving a linear system of equations $Ax = b$ corresponds to calculating (or approximating) the vector $A^+ b$, where $A^+$ represents the Moore-Penrose pseudo-inverse of the matrix $A$ (or simply the inverse when $A$ is invertible); indeed, $A^+ b$ is one solution of the system whenever a solution exists. If all the singular values of $A$ are between $1/\kappa$ and $1$, for some value $\kappa \geq 1$, the quantum-inspired algorithms from [5, 10] solve this task in time polynomial in $\kappa$, in the rank of $A$, in the Frobenius norm $\|A\|_F$ of $A$ and in the inverse of the approximation error; this includes the time complexity of implementing the sampling and query operations (see Proposition 1 in Section 2.3). One crucial point here is that the dependence on the dimensions of the matrix is only poly-logarithmic. Another important point is that the best known quantum algorithms (see [4, 10]) enable $\ell_2$-norm sampling from the output with only poly-logarithmic dependence on the dimensions in the QRAM input model. This means that, except for the dependence on some of the parameters, for low-rank matrices the classical running time is polynomially related to the quantum running time.

The core computational problem in recommendation systems can also be described as approximating the $i$-th row of the matrix $f(A)$ for the threshold function $f$ such that $f(x) = x$ if $x \geq \sigma$ and $f(x) = 0$ otherwise (for some appropriate threshold value $\sigma > 0$). This corresponds to approximating the vector $(f(A))^\dagger e_i$, where $e_i$ is the vector with $1$ in the $i$-th coordinate and zero elsewhere. Ref. [18] shows how to solve this problem in time poly-logarithmic in the dimensions of $A$ (and polynomial in the remaining parameters). (For the value of $\sigma$ chosen for recommendation systems, the quantity $\|A\|_F^2/\sigma^2$ becomes an upper bound on the rank of a low-rank approximation of $A$.)
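As an illustration of this threshold transformation, the sketch below (again under our own assumptions: the reference svt helper, a synthetic preference matrix, and the threshold value 10.0 are all illustrative) builds the low-rank approximation used for recommendations and checks the rank bound mentioned above.

```python
import numpy as np

def svt(A, f):
    # reference singular value transformation (see the previous sketch)
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag([f(x) for x in s]) @ Vh

rng = np.random.default_rng(0)
# A noisy low-rank "preference matrix": 200 users, 100 products, rank-5 signal.
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))
A += 0.01 * rng.standard_normal(A.shape)

sigma = 10.0                                   # illustrative threshold value
threshold = lambda x: x if x >= sigma else 0.0
A_low = svt(A, threshold)                      # low-rank approximation of A
i = 17                                         # one specific user
row_i = A_low[i, :]                            # the row a recommendation system samples from
# The number of singular values >= sigma is at most ||A||_F^2 / sigma^2.
assert np.linalg.matrix_rank(A_low) <= np.linalg.norm(A, 'fro') ** 2 / sigma ** 2
```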

Our results. In this paper we significantly extend the class of functions for which the singular value transformation can be efficiently computed by “quantum-inspired” classical algorithms. The formal and most general statements of our results are given in Section 3. For the sake of readability, in this introduction we only describe our results for a restricted (but still very general) class of “smooth” functions. Let $\mathbb{R}_{\geq 0}$ and $\mathbb{R}_{>0}$ denote the sets of non-negative and positive real numbers, respectively. We say below that a function $f$ defined on $\mathbb{R}_{\geq 0}$ is “smooth” if $f$ is differentiable on $\mathbb{R}_{>0}$ and if there exist three polynomials controlling the growth of $f$ and of its derivative $f'$ (we refer to Section 3 for the precise conditions). We are mostly interested in functions such that $f(0) = 0$, since typically we do not want the transformation to increase the rank.

Our main results are the following two theorems (we refer to Section 3 for the formal versions). These informal versions can be derived from the formal versions given in Section 3 by observing that the condition number $\kappa(A)$ is at most $\kappa$ if all the singular values of $A$ are between $1/\kappa$ and $1$.

Theorem 1 (Informal Version).

Let $f$ be any smooth function such that $f(0) = 0$. For any sufficiently small $\epsilon > 0$, there exists a classical algorithm that has sampling access to a matrix $A$ with singular values in $[1/\kappa, 1]$ and to a vector $b$, receives as input an index $i$, outputs with high probability an approximation of the $i$-th coordinate of the vector $f(A^\dagger)\, b$ with additive error $\epsilon$, and has time complexity polynomial in $\kappa$, $\|A\|_F$ and $1/\epsilon$, with only poly-logarithmic dependence on the dimensions of $A$.

Theorem 2 (Informal Version).

Let $f$ be any smooth function such that $f(0) = 0$. For any sufficiently small $\epsilon > 0$, there exists a classical algorithm that has sampling access to a matrix $A$ with singular values in $[1/\kappa, 1]$ and to a vector $b$, samples with high probability from a distribution $\epsilon$-close in total variation distance to the distribution associated with the vector $f(A^\dagger)\, b$, and has time complexity polynomial in $\kappa$, $\|A\|_F$ and $1/\epsilon$, with only poly-logarithmic dependence on the dimensions of $A$.

Note that instead of stating our results for the transformation $f(A)$, we state them for the transformation $f(A^\dagger)$ in Theorems 1 and 2. The reason is that this simplifies the presentation of our algorithms and makes the comparison with prior works easier.

Theorems 1 and 2 show that under the same assumptions (namely, sampling access to the input) and similar requirements for the output (i.e., outputting one coordinate of the output vector or sampling from the associated distribution) as the prior works on quantum-inspired algorithms, we can efficiently compute classically the singular value transformation for any smooth enough function. This extends the results from [5, 10, 18] and significantly broadens the applicability of quantum-inspired algorithms.

Fast quantum algorithms for singular value transformation have been constructed in recent works [4, 11]. For the class of smooth functions we consider, the quantum running time obtained in the QRAM input model is poly-logarithmic in the dimensions of the matrix (and polynomial in the remaining parameters). Our results thus show that, except possibly for the dependence on some of the parameters, we can again obtain classical algorithms with running time polynomially related to the quantum running time.

Overview of our approach. We use the same sampling methods as in [2, 5, 8, 10, 13, 18]: we first sample rows of the input matrix $A$ with probability proportional to their squared norms, which gives (after normalization) a matrix $R$. We then do the same with the matrix $R$, this time sampling columns, which gives (after normalization) a matrix $C$. The analysis of this process, which goes back to the seminal work by Frieze, Kannan and Vempala [8], shows that with high probability we have $R^\dagger R \approx A^\dagger A$ and $C C^\dagger \approx R R^\dagger$ when the numbers of sampled rows and columns are large enough (but still much smaller than the dimensions of $A$). Since $C$ is a small matrix, we can then afford to compute its SVD.
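The following numpy sketch mimics this two-stage process (row sampling to form $R$, column sampling to form $C$, then an SVD of the small matrix $C$). The rescaling factor used below is the standard Frieze-Kannan-Vempala normalization, and the column-sampling rule shown (proportional to squared column norms of $R$) is a simplification of the procedure used in Algorithm 1 below, so this is an illustrative sketch rather than the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rows(A, r):
    """Sample r rows of A with probability ||A_i||^2 / ||A||_F^2 and rescale them
    so that the expectation of R^dagger R equals A^dagger A (FKV rescaling)."""
    row_norms2 = np.sum(np.abs(A) ** 2, axis=1)
    probs = row_norms2 / row_norms2.sum()
    idx = rng.choice(A.shape[0], size=r, p=probs)
    scale = np.linalg.norm(A, 'fro') / (np.sqrt(r) * np.sqrt(row_norms2[idx]))
    return scale[:, None] * A[idx, :]

A = rng.standard_normal((2000, 1000))
R = sample_rows(A, r=150)                        # 150 x 1000: aims at R^dagger R ~ A^dagger A
C = sample_rows(R.conj().T, r=150).conj().T      # 150 x 150: column sampling applied to R
U_C, s_C, Vh_C = np.linalg.svd(C)                # cheap: C is a small matrix
rel_err = np.linalg.norm(R.conj().T @ R - A.conj().T @ A) / np.linalg.norm(A.conj().T @ A)
```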

The main contribution of this paper is the next step (and its analysis). We show how to use the SVD of the small matrix $C$ in order to approximate the singular value transformation associated with $f$. Using the SVD of $C$, we first compute three auxiliary matrices, from which we then compute the matrix that is the output of Algorithm 1 presented in Section 3.2. Our central claim is the following:

(1)

Proving (1) and quantifying the quality of the approximation is our main technical contribution. This is done in Proposition 2 (which itself relies on several lemmas proved in Sections 3.1 and 3.2).

Finally, using similar post-processing techniques as in prior works [5, 18], from the output of Algorithm 1 we can efficiently approximate coordinates of the vector $f(A^\dagger)\, b$ and sample from the associated distribution. This post-processing is described in Algorithms 2 and 3 in Section 3.3.

We now give an outline of the main ideas used to establish (1). The basic strategy is to exploit the relations $R^\dagger R \approx A^\dagger A$ and $C C^\dagger \approx R R^\dagger$ mentioned above. Our first insight is to define an auxiliary function derived from $f$ and to express the target transformation through it. We then prove, in Lemma 7, that the approximation $R^\dagger R \approx A^\dagger A$ implies that the corresponding transformations of these two matrices are close. The next natural step would be to relate the transformations associated with $R$ and with $C$, but this cannot be done directly from the sampling guarantees alone. Instead, we work with a related expression involving the matrix computed by Algorithm 1; combining the guarantee $C C^\dagger \approx R R^\dagger$ with Lemma 7, we are able to prove that this expression is also close to its counterpart for $R$ (this is proved in Lemma 9). To summarize, the matrix output by Algorithm 1 approximates the desired transformation, as needed.

Related independent work. Independently from our work, Chia, Gilyén, Li, Lin, Tang and Wang simultaneously derived similar results. They additionally provide general matrix arithmetic primitives for adding and multiplying matrices having sample and query access, show how to recover known dequantized algorithms using their techniques, and obtain new quantum-inspired algorithms for other applications, including Hamiltonian simulation and discriminant analysis.

2 Preliminaries

2.1 Notations and conventions

General notations. In this paper we use the notation $[n] = \{1, \ldots, n\}$ for any positive integer $n$. For any set $S$ we denote by $\mathrm{conv}(S)$ the convex hull of $S$.

Given a matrix $A$, we use $A_{i,\cdot}$, $A_{\cdot,j}$ and $A_{i,j}$ to denote its $i$-th row, its $j$-th column and its $(i,j)$-th entry, respectively. The complex-conjugate transpose, or Hermitian transpose, of a matrix $A$ (or a vector $v$) is denoted $A^\dagger$ (respectively, $v^\dagger$). The notations $\|\cdot\|_F$ and $\|\cdot\|$ represent the Frobenius and spectral norms, respectively. Note that $\|A\| \le \|A\|_F$ for any matrix $A$. For a vector $v$, we denote by $\|v\|$ the $\ell_2$ norm of the vector. In this paper we will use several times the following standard inequalities, which hold for any vector $v$ and any matrices $A$ and $B$ of compatible dimensions:

(2)

For a non-zero vector $v \in \mathbb{C}^n$, let $\mathcal{D}_v$ denote the probability distribution on $[n]$ in which the probability of choosing index $i$ is $|v_i|^2 / \|v\|^2$. For two non-zero vectors $u$ and $v$, the total variation distance between the distributions $\mathcal{D}_u$ and $\mathcal{D}_v$ is defined as
$$\|\mathcal{D}_u - \mathcal{D}_v\|_{TV} = \frac{1}{2} \sum_{i=1}^{n} \left| \frac{|u_i|^2}{\|u\|^2} - \frac{|v_i|^2}{\|v\|^2} \right|.$$

We will use the following easy inequality (see for instance [5, 18] for a proof): for any two non-zero vectors $u, v \in \mathbb{C}^n$,
$$\|\mathcal{D}_u - \mathcal{D}_v\|_{TV} \le \frac{2\, \|u - v\|}{\max(\|u\|, \|v\|)}. \qquad (3)$$
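A small numpy sketch of these definitions (the helper names are ours): it builds the distribution $\mathcal{D}_v$, computes the total variation distance, and checks the bound of Eq. (3) on a random example.

```python
import numpy as np

def l2_distribution(v):
    """The distribution D_v on indices: Pr[i] = |v_i|^2 / ||v||^2."""
    p = np.abs(v) ** 2
    return p / p.sum()

def tv_distance(u, v):
    """Total variation distance between D_u and D_v."""
    return 0.5 * np.sum(np.abs(l2_distribution(u) - l2_distribution(v)))

rng = np.random.default_rng(1)
u = rng.standard_normal(100)
v = u + 0.01 * rng.standard_normal(100)
# The bound of Eq. (3): the TV distance is controlled by the relative l2 distance.
assert tv_distance(u, v) <= 2 * np.linalg.norm(u - v) / max(np.linalg.norm(u), np.linalg.norm(v))
```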

Singular Value Decomposition. The Singular Value Decomposition (SVD) of a matrix $A \in \mathbb{C}^{m \times n}$ is a factorization of the form $A = U \Sigma V^\dagger$, where $U \in \mathbb{C}^{m \times m}$ and $V \in \mathbb{C}^{n \times n}$ are unitary matrices and $\Sigma$ is an $m \times n$ diagonal matrix with non-negative real numbers, in non-increasing order, down the diagonal. The columns of $U$ and $V$ are the left and right singular vectors of $A$, respectively. Each diagonal entry of $\Sigma$ is a singular value of the matrix $A$. A crucial property is that an SVD exists for any complex matrix.

We can also write the SVD of a matrix $A$ as
$$A = \sum_{\ell=1}^{\min(m,n)} \sigma_\ell\, u_\ell\, v_\ell^\dagger, \qquad (4)$$
where $u_\ell$ and $v_\ell$ are the $\ell$-th columns of the matrices $U$ and $V$, and thus the left and right singular vectors of the matrix $A$, respectively, and $\sigma_\ell$ denotes the $\ell$-th singular value (the $\ell$-th diagonal entry of the matrix $\Sigma$) for each $\ell \in \{1, \ldots, \min(m,n)\}$.

For any matrix $A$, we denote the set of all singular values of $A$ by $\sigma(A)$. We denote its $\ell$-th singular value (in non-increasing order) by $\sigma_\ell(A)$, i.e., the value $\sigma_\ell$ in the decomposition of Equation (4). We write $\sigma_{\max}(A)$ for the largest singular value (i.e., $\sigma_1(A)$) and $\sigma_{\min}(A)$ for the smallest non-zero singular value. We define the condition number of $A$ as $\kappa(A) = \sigma_{\max}(A) / \sigma_{\min}(A)$. Note that with this definition, $\kappa(A)$ is well defined even for singular matrices.

In this paper, we will use the following inequality by Weyl [19] quite often.

Lemma 1 (Weyl’s inequality [19]).

For two matrices $A, B \in \mathbb{C}^{m \times n}$ and any $\ell \in \{1, \ldots, \min(m,n)\}$,
$$|\sigma_\ell(A) - \sigma_\ell(B)| \le \|A - B\|.$$

Singular Value Transformation. We are now ready to introduce the Singular Value Transformation.

Definition 1 (Singular Value Transformation).

For any function $f$ such that $f(0) = 0$, the Singular Value Transformation associated to $f$ is the map that sends any matrix $A \in \mathbb{C}^{m \times n}$ to the matrix
$$f(A) = \sum_{\ell=1}^{\min(m,n)} f(\sigma_\ell)\, u_\ell\, v_\ell^\dagger,$$
where the $\sigma_\ell$'s, the $u_\ell$'s and the $v_\ell$'s correspond to the SVD of $A$ given in Eq. (4).

It is easy to check that the value $f(A)$ does not depend on the SVD of $A$ chosen in the definition (i.e., it does not depend on which $U$ and which $V$ are chosen). Also note that from our requirement on the function $f$ (namely $f(0) = 0$), the rank (i.e., the number of non-zero singular values) of $f(A)$ is never larger than the rank of $A$.

The Moore-Penrose pseudo-inverse of a matrix $A$ is the matrix $A^+ = \sum_{\ell=1}^{k} \frac{1}{\sigma_\ell}\, v_\ell\, u_\ell^\dagger$, where $k$ is the rank of the matrix $A$; note that only the non-zero singular values of the matrix appear in this sum. As in the introduction, we define the inverse function $f$ such that $f(0) = 0$ and $f(x) = 1/x$ for $x > 0$. Then we have $A^+ = f(A^\dagger)$. Note that $A A^+$ is the orthogonal projector onto the column space of $A$ and $A^+ A$ is the orthogonal projector onto the row space of $A$.
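The following sketch (reusing the illustrative svt helper from the introduction, with a hard-coded zero cutoff of 1e-10 as an assumption) checks numerically that applying the inverse function to $A^\dagger$ recovers the Moore-Penrose pseudo-inverse and that $A A^+$ behaves as an orthogonal projector.

```python
import numpy as np

def svt(A, f):
    # reference singular value transformation (see the earlier sketch)
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag([f(x) for x in s]) @ Vh

f_inv = lambda x: 1.0 / x if x > 1e-10 else 0.0   # inverse function, with f(0) = 0

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))                   # full column rank (almost surely)
A_pinv = svt(A.conj().T, f_inv)                   # the transformation f(A^dagger)
assert np.allclose(A_pinv, np.linalg.pinv(A))     # coincides with the Moore-Penrose pseudo-inverse
# A A^+ is the orthogonal projector onto the column space of A:
assert np.allclose(A @ A_pinv @ A, A)
assert np.allclose((A @ A_pinv).conj().T, A @ A_pinv)
```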

Eigenvalue Transformation. Let us introduce below another transformation, applicable to a diagonalizable matrix $B \in \mathbb{C}^{n \times n}$, i.e., a matrix that can be written as
$$B = P\, D\, P^{-1} \qquad (5)$$
for some invertible matrix $P \in \mathbb{C}^{n \times n}$, where $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ denotes the diagonal matrix with diagonal entries the complex numbers $\lambda_1, \ldots, \lambda_n$. We write $\mathrm{spec}(B) = \{\lambda_1, \ldots, \lambda_n\}$ for the set of eigenvalues of $B$.

Definition 2 (Eigenvalue Transformation).

For any function $f$ such that $f(0) = 0$, the Eigenvalue Transformation associated to $f$ maps any diagonalizable matrix $B$ to the matrix
$$P\, f(D)\, P^{-1},$$
where $P$ and $D$ correspond to the decomposition of $B$ given in Eq. (5) and $f(D)$ denotes the diagonal matrix obtained from $D$ by applying $f$ to each diagonal entry.

Similarly to Definition 1, due to our assumption $f(0) = 0$, the eigenvalue transformation does not increase the rank of the input matrix.
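For comparison with Definition 1, here is a minimal numpy sketch of the eigenvalue transformation for a diagonalizable matrix (the helper name evt is ours; np.linalg.eig returns the matrices $P$ and $D$ of Eq. (5) for generic matrices). With $f(x) = x^2$ it coincides with the matrix square, which gives a quick sanity check.

```python
import numpy as np

def evt(B, f):
    """Eigenvalue transformation of a diagonalizable matrix B:
    B = P diag(lambda) P^{-1}  ->  P diag(f(lambda)) P^{-1}."""
    lam, P = np.linalg.eig(B)
    return P @ np.diag([f(x) for x in lam]) @ np.linalg.inv(P)

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))          # a generic real matrix is diagonalizable almost surely
# With f(x) = x^2 the eigenvalue transformation coincides with the matrix square.
assert np.allclose(evt(B, lambda x: x * x), B @ B)
```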

We will use the following upper bound from [9] on the norm of the difference of the eigenvalue transformations of two diagonalizable matrices.

Lemma 2 (Corollary 2.3 in [9]).

Let and be diagonalizable matrices with decompositions

For any function we have

where the convention if is used.

2.2 $\ell_2$-norm sampling

We now present the assumptions needed to sample from a matrix, and then introduce the technique of $\ell_2$-norm sampling that has been used in previous works [2, 5, 8, 10, 18].

Sample accesses to matrices. Let $A$ be an $m \times n$ matrix. We say that we have sample access to $A$ if the following conditions hold:

  1. We can sample a row index $i \in [m]$ with probability $\|A_{i,\cdot}\|^2 / \|A\|_F^2$.

  2. For each $i \in [m]$, we can sample a column index $j \in [n]$ with probability $|A_{i,j}|^2 / \|A_{i,\cdot}\|^2$. (Note that this is precisely the distribution $\mathcal{D}_{A_{i,\cdot}}$ introduced in Section 2, where $A_{i,\cdot}$ is the $i$-th row of $A$.)

We define sample access to a vector $v$ using the same definition, by taking the matrix that has $v$ as its unique row. Note that with this definition, the distribution obtained is precisely the distribution $\mathcal{D}_v$ introduced in Section 2.

For an algorithm handling matrices and vectors using sample accesses, the sample complexity of the algorithm is defined as the total number of samples used by the algorithm.

$\ell_2$-norm sampling. Let $A$ be an $m \times n$ matrix for which we have sample access. Consider the following process. For some integer $r \geq 1$, sample $r$ row indices $i_1, \ldots, i_r$ using the probability distribution of item 1 above, and then form the $r \times n$ matrix $R$ by defining
$$R_{t,\cdot} = \frac{\|A\|_F}{\sqrt{r}\, \|A_{i_t,\cdot}\|}\, A_{i_t,\cdot}$$
for each $t \in [r]$. Note that this corresponds to selecting the rows of $A$ with indices $i_1, \ldots, i_r$ and renormalizing them. The central insight of the $\ell_2$-norm sampling approach introduced in [8] is that the matrix $R$ obtained by this process is in some sense close enough to $A$ to be able to perform several interesting calculations. We will in particular use the following result, which shows that when $r$ is large enough, with high probability the matrix $R^\dagger R$ is close to the matrix $A^\dagger A$. (The version we give is from the survey paper by Kannan and Vempala [13]; the actual result from [13] is stated for real matrices and for the operator/spectral norm, but the proof works for complex matrices and the Frobenius norm as well.)

Lemma 3 (Theorem 4.4 in [13]).

For any , any and for , the following inequality holds with probability at least :

We will also use the following lemma from [8], which guarantees that when $r$ is large enough, with high probability the Frobenius norm of $R$ is close to the Frobenius norm of $A$.

Lemma 4 ([8]).

With probability at least the following inequality holds:

2.3 Data structures for storing matrices

The following proposition shows that there exist low-overhead data structures that enable sampling access to matrices.

Proposition 1 ([18]).

There exists a tree-like data structure that stores a matrix $A \in \mathbb{C}^{m \times n}$ in space nearly linear in the number of non-zero entries of $A$, and supports the following operations in time poly-logarithmic in $m$ and $n$:

  • Output $\|A\|_F^2$;

  • Read and update an entry of $A$;

  • Output $\|A_{i,\cdot}\|^2$ for any $i \in [m]$;

  • Sample a row index $i \in [m]$ with probability $\|A_{i,\cdot}\|^2 / \|A\|_F^2$;

  • For any $i \in [m]$, sample a column index $j \in [n]$ with probability $|A_{i,j}|^2 / \|A_{i,\cdot}\|^2$.

The data structure of Proposition 1 can naturally be used to store vectors as well.
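Below is a minimal sketch of the kind of tree-like structure behind Proposition 1, for a single vector: a binary tree whose internal nodes store sums of squared magnitudes, supporting updates and $\ell_2$-norm sampling in logarithmic time. It follows the standard Kerenidis-Prakash / Tang constructions in spirit only; the exact layout and complexities in [18] may differ, so treat this as an illustration. A matrix is handled by keeping one such tree per row plus one tree over the squared row norms.

```python
import numpy as np

class L2SamplerTree:
    """Binary tree over |v_i|^2 supporting O(log n) updates and O(log n) sampling
    from Pr[i] = |v_i|^2 / ||v||^2 (a sketch, not the exact construction of [18])."""

    def __init__(self, v):
        self.n = len(v)
        size = 1
        while size < self.n:
            size *= 2
        self.size = size
        self.tree = np.zeros(2 * size)   # leaves at indices size .. size+n-1
        for i, x in enumerate(v):
            self.update(i, x)

    def update(self, i, value):
        node = self.size + i
        self.tree[node] = abs(value) ** 2
        node //= 2
        while node >= 1:                 # propagate the new squared mass to the root
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def squared_norm(self):
        return self.tree[1]              # the root stores ||v||^2

    def sample(self, rng):
        # walk down the tree, choosing a child with probability proportional
        # to the total squared mass stored below it
        node = 1
        while node < self.size:
            left = self.tree[2 * node]
            node = 2 * node if rng.random() * self.tree[node] < left else 2 * node + 1
        return node - self.size

rng = np.random.default_rng(4)
v = rng.standard_normal(10)
tree = L2SamplerTree(v)
samples = [tree.sample(rng) for _ in range(20000)]
# empirical frequencies approximate |v_i|^2 / ||v||^2
```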

We will need the following two technical lemmas in our main algorithms. Lemma 5 shows that a vector-matrix-vector product can be efficiently approximated given sampling access. Lemma 6 states that, given sampling access to vectors represented by a matrix, it is possible to sample from a linear combination of these vectors.

Lemma 5 ([5]).

Let and be two vectors and be a matrix, all stored in the data structure specified in Proposition 1. Then for any and , the value can be approximated with additive error with probability at least in sample complexity and time complexity.
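The quantitative bounds of Lemma 5 are not reproduced above. The sketch below shows the standard estimator used for results of this type in the quantum-inspired literature (an illustrative reconstruction, not necessarily the exact proof of [5]): sample both indices by $\ell_2$-norm sampling and average an unbiased importance-weighted term whose standard deviation is at most $\|u\|\, \|A\|_F\, \|v\| / \sqrt{N}$ for $N$ samples.

```python
import numpy as np

def estimate_uAv(u, A, v, num_samples, rng):
    """Monte Carlo estimate of u^dagger A v using only l2-norm samples of u and v
    and entry queries to A.

    Sample i with prob |u_i|^2/||u||^2 and j with prob |v_j|^2/||v||^2, then average
    the unbiased term A[i, j] * ||u||^2 * ||v||^2 / (u_i * conj(v_j)); its standard
    deviation is at most ||u|| * ||A||_F * ||v|| / sqrt(num_samples).
    """
    pu = np.abs(u) ** 2 / np.linalg.norm(u) ** 2
    pv = np.abs(v) ** 2 / np.linalg.norm(v) ** 2
    ii = rng.choice(len(u), size=num_samples, p=pu)
    jj = rng.choice(len(v), size=num_samples, p=pv)
    terms = A[ii, jj] * np.linalg.norm(u) ** 2 * np.linalg.norm(v) ** 2 / (u[ii] * np.conj(v[jj]))
    return np.mean(terms)

rng = np.random.default_rng(5)
u, v = rng.standard_normal(300), rng.standard_normal(400)
A = rng.standard_normal((300, 400))
approx = estimate_uAv(u, A, v, num_samples=50000, rng=rng)
exact = u.conj() @ A @ v   # additive error scales as ||u|| ||A||_F ||v|| / sqrt(num_samples)
```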

Lemma 6 ([18]).

Let be a matrix stored in the data structure specified in Proposition 1. Let be an input vector. Then a sample from can be obtained in expected sample complexity and expected time complexity , where
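Similarly, the sketch below illustrates the rejection-sampling idea behind Lemma 6, following the approach popularized by Tang [18]; the proposal distribution and acceptance rule shown here are an illustrative reconstruction rather than the paper's exact procedure. Given sample access to the columns of $V$ and a coefficient vector $w$, it outputs an index distributed according to $\mathcal{D}_{Vw}$.

```python
import numpy as np

def sample_from_Vw(V, w, rng, max_iters=100000):
    """Rejection sampling from the distribution D_{Vw}, i.e. Pr[i] ~ |(Vw)_i|^2,
    using only l2-norm sampling access to the columns of V."""
    n, k = V.shape
    col_norms2 = np.sum(np.abs(V) ** 2, axis=0)
    # proposal: pick a column j with probability ~ |w_j|^2 ||V_j||^2,
    # then pick a row index i from the distribution D_{V_j}
    col_weights = np.abs(w) ** 2 * col_norms2
    col_probs = col_weights / col_weights.sum()
    for _ in range(max_iters):
        j = rng.choice(k, p=col_probs)
        row_probs = np.abs(V[:, j]) ** 2 / col_norms2[j]
        i = rng.choice(n, p=row_probs)
        # accept with probability |(Vw)_i|^2 / (k * sum_j |w_j V_ij|^2) <= 1 (Cauchy-Schwarz)
        target = np.abs(V[i, :] @ w) ** 2
        envelope = k * np.sum(np.abs(w * V[i, :]) ** 2)
        if rng.random() < target / envelope:
            return i
    raise RuntimeError("rejection sampling did not terminate")

rng = np.random.default_rng(6)
V = rng.standard_normal((500, 4))
w = rng.standard_normal(4)
i = sample_from_Vw(V, w, rng)   # an index distributed according to D_{Vw}
```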

3 Formal Versions and Proofs of the Main Theorems

We now give the formal versions of Theorems 1 and 2 presented in the introduction. In this section, $\kappa$ will always denote the condition number of the matrix $A$. We define the following two intervals (which depend on $\kappa$):

(6)
Theorem 1 (Formal Version).

Let be any function such that . For any and any sufficiently small , there exists a classical algorithm that has sampling access to a matrix and to a vector as in Proposition 1, receives as input an index and has the following behavior: if is differentiable on the set , the algorithm outputs with probability at least a value such that , using

samples and

time complexity, where and .

Theorem 2 (Formal Version).

Let be any function such that . For any and any sufficiently small , there exists a classical algorithm that has sampling access to a matrix and to a vector as in Proposition 1 and has the following behavior: if is differentiable on the set and the projection of on the column space of has norm , with probability at least the algorithm samples from a distribution which is -close in total variation distance to the distribution , using

samples and

time complexity, where , and .

Theorems 1 and 2 are stated for a fixed function $f$, and their correctness is guaranteed for matrices $A$ such that $f$ is differentiable on the corresponding set (remember that this set depends on $A$). Another way of interpreting these theorems is as follows: for a matrix $A$ and a vector $b$ (given as inputs), the algorithms of Theorems 1 and 2 work for any function $f$ with $f(0) = 0$ that is differentiable on the set defined in Eq. (6).

Section 3 is organized as follows. Section 3.1 presents a crucial lemma that gives an upper bound on the distance between two singular value transformations in terms of the distance between the underlying matrices, the values of $f$ and the values of its derivative $f'$. In Section 3.2 we present our central procedure, which performs row and column sampling to compute a matrix, and we analyze this procedure using the lemma proved in Section 3.1. Finally, in Section 3.3 we prove Theorems 1 and 2 by applying appropriate post-processing to the matrix output by Algorithm 1.

3.1 Bound on the distance between two singular value transformations

The following lemma uses the result in Lemma 2 in order to derive an upper bound on the distance between two singular value transformations of positive semi-definite matrices.

Lemma 7.

Let and be two positive semi-definite matrices, and write . For any function such that and is differentiable in , we have:

Proof.

For a positive semi-definite matrix, the singular values are equal to the eigenvalues, and the matrix $P$ in the decomposition of Equation (5) can be taken to be a unitary matrix. For a function $f$ with $f(0) = 0$, this means that for a positive semi-definite matrix, its singular value transformation is equal to its eigenvalue transformation. Note that if $P$ is unitary then $\|P\| = \|P^{-1}\| = 1$.

Using Lemma 2 we thus obtain:

where we write and .

For conciseness, let us write for any . There are three cases:

  1. For any such that and we have . This happens because is differentiable in . Indeed, if we choose values and such that , we can always find a value such that by the Intermediate Value Theorem. Since this happens for all values of , we obtain .

  2. For any such that and , or and , we have ;

  3. For any such that and we have (by convention in Lemma 2).

Then

Therefore,

as claimed. ∎

3.2 Core procedure

Let us consider Algorithm 1 described below. The goal of this subsection is to analyze its behavior.

Parameters: Three real numbers , and
Input: stored in the data structure specified in Proposition 1

1:Set .
2:Set .
3:Sample row indices ,…., using operation 4) of Proposition 1. Let be the matrix whose -th row is , for each .
4:Sample column indices ,…., by repeating the following procedure times: sample a row index uniformly at random and then sample a column index with probability using operation 5) of Proposition 1.
5:Define the matrix such that , for each . Query all the entries of corresponding to entries of using operation 2) of Proposition 1.
6:Compute the singular value decomposition of matrix .
7:Compute the matrix using the output of the SVD step.
Algorithm 1 Computing the matrix .

The sampling process of Steps 3–5 is exactly the same as in prior works [2, 5, 8, 10, 13, 18], but with different values for the numbers of sampled rows and columns. The following lemma analyzes the matrices $R$ and $C$ obtained by this process.

Lemma 8.

Assume that . For any input matrix and any parameters in the specified range, with probability at least the following statements are simultaneously true for the matrices and computed by Algorithm 1:

(7)
(8)
(9)
(10)
(11)
Proof.

Since , Lemma 4 guarantees that Statement (7) holds with probability at least .

Using Lemma 3 twice, the following two inequalities simultaneously hold for matrices , and  in Algorithm 1 with probability at least :

Thus with probability at least , Statements (7), (8) and (9) simultaneously hold. We now show that in this case, Statements (10) and (11) always hold.

Using Weyl’s inequality (Lemma 1), we have

Since by the definition of , we get

By a similar argument we get

Using Weyl’s inequality (Lemma 1) again we obtain:

Since we finally obtain the lower bound

A similar argument gives the upper bound