Querying a Matrix through Matrix-Vector Products

06/13/2019
by   Xiaoming Sun, et al.

We consider algorithms with access to an unknown matrix M ∈ F^{n × d} via matrix-vector products, namely, the algorithm chooses vectors v^1, ..., v^q, and observes Mv^1, ..., Mv^q. Here the v^i can be randomized as well as chosen adaptively as a function of Mv^1, ..., Mv^{i-1}. Motivated by applications of sketching in distributed computation, linear algebra, and streaming models, as well as connections to areas such as communication complexity and property testing, we initiate the study of the number q of queries needed to solve various fundamental problems. We study problems in three broad categories: linear algebra problems, statistics problems, and graph problems. For example, we consider the number of queries required to approximate the rank, trace, maximum eigenvalue, and norms of a matrix M; to compute the AND/OR/Parity of each column or row of M; to decide whether there are identical columns or rows in M or whether M is symmetric, diagonal, or unitary; or to decide whether a graph defined by M is connected or triangle-free. We also show separations for algorithms that are allowed to obtain matrix-vector products only by querying vectors on the right, versus algorithms that can query vectors on both the left and the right. We also show separations depending on the underlying field over which the matrix-vector products are performed. For graph problems, we show separations depending on the form of the matrix (bipartite adjacency versus signed edge-vertex incidence matrix) used to represent the graph. Surprisingly, this fundamental model does not appear to have been studied on its own, and we believe a thorough investigation of problems in this model would be beneficial to a number of different application areas.



1 Introduction

Suppose there is an unknown matrix M ∈ F^{n × d} that you can only access via a sequence of matrix-vector products Mv^1, ..., Mv^q, where we call the vectors v^1, ..., v^q the query vectors; these can be chosen in a randomized, possibly adaptive way. By adaptive, we mean that v^i can depend on v^1, ..., v^{i-1} as well as Mv^1, ..., Mv^{i-1}. Here F is a field, and we study different fields for different applications. Suppose our goal is to determine whether M satisfies a specific property P, such as having approximately full rank, or, for example, whether M has two identical columns. A natural question is the following:

Question 1: How many queries are necessary to determine if M has property P?

A number of well-studied problems are special cases of this question, e.g., compressed sensing or sparse recovery, in which M is an approximately k-sparse vector and one would like a number of queries close to k. If the query sequence is non-adaptive, meaning that v^1, ..., v^q are chosen before making any queries, then the number of queries needed to recover an approximately k-sparse vector is tightly characterized [12, 6] (here the goal is to output a vector whose error is within a constant factor of that of the best k-sparse approximation to M). However, if the queries can be adaptive, then fewer queries suffice [16], while a lower bound for adaptive queries is also known [30] (see also the recent work [29, 17]).

The above problem is representative of an emerging field called linear sketching, which is the underlying technique behind a number of algorithmic advances of the past two decades. In this model one queries Mv^1, ..., Mv^q for non-adaptive queries v^1, ..., v^q. For brevity we write this as MV, where V is the matrix whose i-th column equals v^i. Linear sketching has played a central role in the development of streaming algorithms [3]. Perhaps more surprisingly, linear sketches are also known to achieve the minimal space necessary of any, possibly non-linear, algorithm for processing dynamic data streams under certain general conditions [24, 2, 19], which is an essential result for proving a number of lower bounds for approximating matchings in a stream [22, 5]. Linear sketching has also led to the fastest known algorithms for problems in numerical linear algebra, such as least squares regression and low rank approximation; for a survey see [36]. Note that given the sketches MV and M'V of two matrices M and M', by linearity one can compute (M + M')V. This basic versatility property allows for fast updates in a data stream and mergeability in environments such as MapReduce and other distributed models of computation.
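To illustrate this linearity and mergeability property, here is a minimal numpy sketch (the dimensions and variable names are ours, purely for illustration): the sketch of a sum of matrices is the sum of their sketches, computed without access to the original matrices.

import numpy as np

n, d, q = 100, 80, 10               # illustrative dimensions and number of queries
rng = np.random.default_rng(0)

M1 = rng.standard_normal((n, d))    # data held by one site / stream
M2 = rng.standard_normal((n, d))    # data held by another site / stream
V = rng.standard_normal((d, q))     # shared non-adaptive query matrix; columns are v^1, ..., v^q

sketch1 = M1 @ V                    # responses M1 v^1, ..., M1 v^q
sketch2 = M2 @ V                    # responses M2 v^1, ..., M2 v^q

# By linearity, the sketch of M1 + M2 is the sum of the two sketches.
assert np.allclose(sketch1 + sketch2, (M1 + M2) @ V)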

Given the applications above, we consider Question 1 an important question to understand for many different properties of interest, which we describe in more detail below. A central goal of this work is to answer Question 1 for such properties and to propose this model as a natural object of study in its own right.

One notable difference between our model and a number of applications of linear sketching is that we allow adaptive query sequences. In fact, our upper bounds will be non-adaptive, and our nearly matching lower bounds for each problem we consider will hold even for adaptive query sequences. Our model is also related to property testing, where one tries to infer properties of a large unknown object by (possibly adaptively) sampling a sublinear number of locations of that object. We argue that linear queries are a natural extension of sampling locations of an object, and that this is a natural “sampling model” not only because of the desired properties of the distributed, linear algebra, and streaming applications above, but sometimes also because of physical constraints, e.g., in compressed sensing, where optical devices naturally capture linear measurements.

From a theoretical standpoint, any property testing algorithm, i.e., one that samples entries of M, can be implemented in our model with linear queries. However, our model gives the algorithm much more flexibility. From a lower bound perspective, as in the case of property testing [10], some of our lower bounds will be derived from communication complexity. However, not all of our bounds can be proved this way. For example, one notable result we show is an optimal lower bound on the number of queries needed to approximate the rank of M up to a given factor by randomized, possibly adaptive algorithms; we determine the number of queries that is necessary and sufficient. A natural alternative way to prove such a lower bound would be to give part of the matrix to Alice and part to Bob, and have the players exchange the responses of their respective parts to the query vectors. Then, if the 2-player randomized communication complexity of approximating the rank up to the given factor were known to be large, we would obtain a nearly matching query lower bound, up to a factor equal to the number of bits needed to specify the entries of the matrix and the queries. However, the 2-player communication complexity of approximating the rank up to a constant factor over the reals is unknown! We are not aware of any strong lower bound for this problem with adaptive queries. We note that for non-adaptive queries, there is a sketching lower bound over the reals given in [23], and a lower bound for finite fields in [4]. There is also a property testing lower bound in [7], though such a lower bound makes additional assumptions on the input. Thus, our model gives a new lens through which to study this problem, and through it we are able to derive strong lower bounds for adaptive queries. Our techniques could also be helpful for proving lower bounds in existing models, such as two-party communication complexity.

Our model is also related to linear decision tree complexity; see, e.g., [9, 18]. However, such lower bounds typically involve observing only a threshold applied to each query response, and the unknown object is typically a vector. In our case, we observe the entire output vector Mv^i.

An interesting twist in our model is that in our formulation above, we only allowed querying M via matrix-vector products on the right, i.e., queries of the form Mv. One could ask whether there are natural properties of M for which the number of queries needed when querying on the left, via queries of the form u^T M, can be significantly smaller than the number of queries needed when querying on the right via queries of the form Mv:

Question 2: Are there natural problems for which the query complexity with left queries is significantly smaller than the query complexity with right queries?

We show that this is in fact the case: if we can only multiply on the right, then determining whether some column of the matrix is all 1s requires many queries, whereas a single query on the left (for example, multiplying by the all-ones vector, which returns every column sum) determines this.
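To make the left-query advantage concrete, here is a small sketch (assuming, for illustration, a 0/1 matrix over the reals): one left query with the all-ones vector returns every column sum, and a column is all ones exactly when its sum equals the number of rows.

import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.integers(0, 2, size=(n, n)).astype(float)  # hypothetical 0/1 matrix
M[:, 3] = 1.0                                       # plant an all-ones column

ones = np.ones(n)
col_sums = ones @ M               # one left query: (1, ..., 1) M gives all column sums
has_all_ones_col = np.any(col_sums == n)
print(has_all_ones_col)           # True: column 3 is all ones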

We study a few problems around Question 2, which is motivated by several considerations. First, matrices might be stored on computers in a specific encoding, e.g., a sparse row format, from which it may be much easier to multiply on the right than on the left. Also, in compressed sensing, it may be natural for physical reasons to obtain linear combinations of columns rather than rows.

Another important question is how the query complexity depends on the underlying field over which the matrix-vector products are performed. Might it be that for a natural problem the query complexity when the matrix-vector products are performed modulo 2 is much higher than when they are performed over the reals?

Question 3: Is there a natural problem for which the query complexity in our model over F_2 is much larger than that over the reals?

Yet another important application of this model is to querying graphs. A natural question is which matrix representation to use for the graph. For example, a natural representation of a graph on n vertices is its adjacency matrix A, where A_{i,j} = 1 if and only if {i, j} occurs as an edge. A natural representation for a bipartite graph with n vertices in each part is an n × n matrix A in which A_{i,j} = 1 iff there is an edge from the i-th left vertex to the j-th right vertex. Yet another representation is the edge-vertex incidence matrix, in which the row corresponding to a potential edge {i, j} is either all zeros or has exactly two ones, one in position i and one in position j. One often considers a signed edge-vertex incidence matrix, where one first arbitrarily fixes an ordering on the vertices and the row for edge {i, j} has a 1 in the i-th position and a -1 in the j-th position if i precedes j in the ordering, and otherwise the positions of the 1 and -1 are swapped. Yet another possible representation of a graph is its Laplacian.
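For concreteness, the following small sketch builds the representations just described from an edge list (the function name and the sign convention are our own choices, purely for illustration):

import numpy as np

def graph_matrices(n, edges):
    """Build several matrix representations of an undirected graph on vertices 0..n-1.

    `edges` is a list of pairs (i, j) with i < j; the signed incidence convention
    (+1 at the smaller endpoint, -1 at the larger) is one arbitrary but fixed choice.
    """
    A = np.zeros((n, n))                    # adjacency matrix
    B = np.zeros((len(edges), n))           # edge-vertex incidence matrix
    Bs = np.zeros((len(edges), n))          # signed edge-vertex incidence matrix
    for e, (i, j) in enumerate(edges):
        A[i, j] = A[j, i] = 1
        B[e, i] = B[e, j] = 1
        Bs[e, i], Bs[e, j] = 1, -1
    L = np.diag(A.sum(axis=1)) - A          # graph Laplacian; note L = Bs.T @ Bs
    return A, B, Bs, L

A, B, Bs, L = graph_matrices(4, [(0, 1), (1, 2), (2, 3), (0, 3)])
assert np.allclose(L, Bs.T @ Bs)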

Question 4: Do some natural representations of graphs admit much more efficient query algorithms for certain problems than other natural representations?

We note that in the data stream model, where one sees a long sequence of insertions and deletions to the edges of a graph, each of the matrix representations above can be simulated, and so they all lead to the same complexity. We will show, perhaps surprisingly, that in our query model there can be an exponential difference in the query complexity between two different natural representations of a graph for the same problem.

We next get into the details of our results. We would like to stress that it is not immediately obvious how to tackle even basic problems in this model. As a puzzle for the reader: what is the query complexity of determining whether a matrix is symmetric if one can only query vectors on the right? We will answer this later in the paper.

1.1 Formal Model and Our Results

We now describe our model and results formally in terms of an oracle. The oracle holds a matrix M ∈ F^{n × d}, for an underlying field F that we specify in each application. We can only query this matrix via matrix-vector products: we pick an arbitrary vector v ∈ F^d and send it to the oracle, and the oracle responds with the vector Mv. We focus our attention on the setting in which queries occur only on the right. Our goal is to approximate or test a number of properties of M with a minimal number of queries, i.e., to answer Question 1 for a large number of different application areas.
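A minimal sketch of this oracle interface (the class and method names below are ours, not part of the paper):

import numpy as np

class MatVecOracle:
    """Holds a hidden matrix M and answers only right matrix-vector product queries."""

    def __init__(self, M):
        self._M = np.asarray(M)
        self.num_queries = 0

    def query(self, v):
        """Return M v for a chosen query vector v; the algorithm never sees M itself."""
        self.num_queries += 1
        return self._M @ v

# An (adaptive) algorithm interacts with the oracle only through query():
# oracle = MatVecOracle(M_hidden)
# response = oracle.query(v1)   # v2 may then be chosen as a function of response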

We study a number of problems, summarized in Table 1. We assume M is an n × d matrix, and an additional parameter of the problem (such as an approximation factor) is specified where relevant. The bounds hold for constant probability algorithms. In some problems, such as testing whether the matrix is a diagonal matrix, we always assume the matrix is square, and in the graph testing problems we explicitly describe how the graph is represented as a matrix M. Interestingly, we are able to prove very strong lower bounds for approximating the rank, which, as described above, are not known to hold for randomized communication complexity.

Motivated by streaming and statistics questions, we next study the query complexity of approximating the norm of each row of M. We also study computing the majority or parity of each column or row of M, the AND/OR of each column or row of M (equivalently, whether M has an all-ones column or row), whether M has two identical columns or rows, and whether M contains a row of unusually large norm, i.e., a “heavy hitter”. Here we show there are natural problems, such as computing the parity of all columns, which can be solved with a single query when sketching on the left but require many queries when sketching on the right, thus answering Question 2. We also answer Question 3, observing that for the natural problem of testing whether a row is all ones, a single deterministic query suffices over the reals, whereas over F_2 a deterministic algorithm requires many more queries.
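For row-norm approximation, one natural approach (a sketch of a standard Johnson-Lindenstrauss-style estimator, not necessarily the exact algorithm analyzed in Section 4.3) uses a few non-adaptive Gaussian right queries and reads off each row's norm from its sketched row:

import numpy as np

rng = np.random.default_rng(2)
n, d, q = 50, 40, 15                       # illustrative dimensions and query budget
M = rng.standard_normal((n, d))

G = rng.standard_normal((d, q))            # q non-adaptive Gaussian right queries
S = M @ G                                  # oracle responses Mg^1, ..., Mg^q, stacked as columns

est = np.linalg.norm(S, axis=1) ** 2 / q   # unbiased estimate of each squared row norm
true = np.linalg.norm(M, axis=1) ** 2
print(np.max(np.abs(est - true) / true))   # relative error shrinks as q grows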

Linear Algebra Problems
Approximate Rank (distinguishing two given ranks) (Section 3.1)
Trace Estimation (Section 3.2)
Symmetric Matrix / Diagonal Matrix (Sections 3.3 and 3.4)
Unitary Matrix: 1 query (Section 3.5)
Approximate Maximum Eigenvalue, adaptive and non-adaptive queries (Section 3.6)
Streaming and Statistics Problems
All Ones Column, over F_2 and over the reals (Section 4.1)
Two Identical Columns / Two Identical Rows (Section 4.2)
Approximate Row Norms / Heavy Hitters (Section 4.3)
Majority of Columns / Majority of Rows (Section 4.4)
Parity of Columns / Parity of Rows (Section 4.5)
Graph Problems
Connectivity given Bipartite Adjacency Matrix (Section 5.1)
Connectivity given Signed Edge-Vertex Incidence Matrix ([20], noted in Section 5.1)
Triangle Detection (Section 5.2)
Table 1: Our Results (the query complexity bounds are stated in the referenced sections)

For graph problems, we first argue that if the graph is presented as a bipartite adjacency matrix, then a large number of possibly adaptive queries is required to determine whether the graph is connected. In contrast, if the graph is presented as a signed vertex-edge incidence matrix, then a much smaller number of non-adaptive queries suffices. This answers Question 4, showing that the choice of representation of the graph is critical in this model. Motivated by a large body of recent work on triangle counting (see, e.g., [13] and the references therein), we also give strong negative results for this problem in our model, which, as with all of our lower bounds unless explicitly stated otherwise, hold even for algorithms that perform adaptive queries.

2 Preliminaries

We use capital bold letters, e.g., M, to denote matrices, and lowercase bold letters, e.g., v, to denote column vectors. Sometimes we write a matrix as a list of column vectors in square brackets. We use calligraphic letters to denote probability distributions, and write x ∼ D to denote that x is sampled from the distribution D. In particular, we use N(0, 1) to denote the standard Gaussian distribution, and we say a matrix is a Gaussian matrix if its entries are sampled independently and identically distributed (denoted as i.i.d. in the following) from a Gaussian distribution.

We call a matrix i.i.d. Gaussian if each of its entries is drawn i.i.d. from N(0, 1). It is easy to check that if G is an i.i.d. Gaussian matrix and R is a rotation matrix of compatible dimensions, then RG is still i.i.d. Gaussian and has the same probability distribution as G.

The total variation distance, sometimes called the statistical distance, between two probability measures P and Q on the same space is defined as d_TV(P, Q) = sup_A |P(A) - Q(A)|, where the supremum is over measurable events A.

Let X be an n × m matrix with each row drawn i.i.d. from an m-variate normal distribution with mean zero and covariance matrix Σ. Then the distribution of the random matrix W = X^T X is called the Wishart distribution with n degrees of freedom and covariance matrix Σ, denoted by W_m(n, Σ). The distribution of the eigenvalues of W is characterized in the following lemma.

Lemma 1 (Corollary 3.2.19 in [21]).

If is , with , the joint density function of the eigenvalues of (in descending order) is

In particular, for and , a constant independent from , such that
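For reference, when the covariance matrix is the identity, the joint eigenvalue density has the following well-known form (our paraphrase, with the normalizing constant omitted; not a verbatim restatement of [21]):

$$ f(\lambda_1,\dots,\lambda_m) \;\propto\; \exp\!\Big(-\tfrac{1}{2}\sum_{i=1}^m \lambda_i\Big)\, \prod_{i=1}^m \lambda_i^{(n-m-1)/2}\, \prod_{1\le i<j\le m} (\lambda_i-\lambda_j), \qquad \lambda_1\ge \cdots \ge \lambda_m\ge 0. $$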

3 Linear Algebra Problems

In this section we present our lower bound for rank approximation in Section 3.1. In the following, we provide our results about trace estimation in Section 3.2, testing symmetric matrices in Section 3.3, testing diagonal matrices in Section 3.4, testing unitary matrices in Section 3.5, and approximating the maximum eigenvalue in Section 3.6.

3.1 Lower Bound for Rank Approximation

In this section, we discuss how to approximate the rank of a given matrix over the reals when the queries consist of right multiplication by vectors. A naive algorithm to learn the rank is to pick random Gaussian query vectors non-adaptively. To approximate the rank, that is, to distinguish between two candidate ranks, this algorithm needs a number of queries comparable to the rank being tested, and it is not hard to see that it then succeeds with probability 1. Indeed, if G is the random Gaussian query matrix and M the unknown matrix, then writing M in its thin singular value decomposition as M = UΣV^T, where U and V have orthonormal columns and Σ has positive diagonal entries, we have that rank(MG) = rank(ΣV^T G), which by rotational invariance of the Gaussian distribution is the same as the rank of a random Gaussian matrix, which in turn equals the minimum of the number of queries and the rank of M with probability 1.
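A minimal numpy sketch of this naive algorithm (the hidden matrix, its rank, and the query budget below are illustrative):

import numpy as np

rng = np.random.default_rng(3)
n, d, k = 30, 30, 5
# Hidden matrix of rank k (illustrative construction).
M = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))

q = k + 1                                  # query budget slightly above the rank we test for
G = rng.standard_normal((d, q))            # non-adaptive Gaussian queries
R = M @ G                                  # oracle responses

# rank(MG) = min(q, rank(M)) with probability 1, so comparing it to q reveals
# whether rank(M) falls below the query budget.
est_rank = np.linalg.matrix_rank(R)
print(est_rank, est_rank < q)              # prints k and True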

In the following, we show that we cannot expect anything better. We first show that for non-adaptive queries, a nearly matching number of queries is necessary to learn the approximate rank. We then generalize our results to adaptive queries. Our results hold for randomized algorithms by applying Yao's minimax principle.

3.1.1 Non-Adaptive Query Protocols

Theorem 1.

Let constant be the error tolerance and let be an oracle matrix and suppose to start that we make non-adaptive queries. For integer , at least queries are necessary to distinguish from with advantage .

Proof.

Given any algorithm distinguishing the two ranks in question, we can determine whether a matrix has full rank or substantially lower rank, by padding it to a larger matrix. Therefore, in what follows it suffices to prove the lower bound for two matrices and where and :

  1. ;

  2. .

Here has columns and has columns such that forms an random orthonormal basis, and are and matrices whose entries are sampled i.i.d. from the standard Gaussian distribution, and is a function in which will be specified later. It immediately follows that and with overwhelmingly high probability. Then we assume and discuss the query lower bound for distinguishing from .

Given the number of queries, without loss of generality we denote the non-adaptive queries by an orthonormal matrix (non-orthonormal queries can be made orthonormal using a change of basis in post-processing), where each column vector is a query to the oracle matrix, which returns the corresponding response. Then, it suffices to show that the following two distributions are hard to distinguish:

  1. , where ;

  2. , where .

Note that is orthonormal, and hence , . We introduce Lemma 2 to eliminate in the representation of .

Lemma 2.

For and defined as above,

Proof.

The direction is trivial by the data processing inequality (i.e., for every and function , ). In what follows we only prove the other direction.

First we notice that for every fixed orthonormal matrix and for a random matrix sampled as or , the product follows exactly the same distribution of . Thus and are identically distributed.

Then, from a random sample we can find such that and for some orthonormal matrix and orthonormal query matrix . Although is not necessarily the same as because of , we have for a uniformly random orthonormal matrix . Thus we transform a random sample from into a sample from via , and hence, we have . ∎

Using Lemma 2, it suffices to prove an upper bound for as follows:

where are diagonal matrices such that and for orthonormal matrices and . The inequality follows because any algorithm separating from implies a separation of from with the same advantage, by multiplying by random orthonormal matrices.

By Weyl’s inequality [35, 38], for every , , and hence . Notice that is an i.i.d. Gaussian matrix, and hence is a chi-squared variable with degrees of freedom, which is bounded by with high probability (cf. Example 2.12 in [34]). Recalling that , in what follows we condition on the event .

We then show the gaps between eigenvalues are sufficiently large. Note that since is i.i.d. Gaussian and is an orthonormal matrix, each row in is independently drawn from an -variate normal distribution, thus the probability distribution of is a Wishart distribution . Let and be sorted in descending order. Then by Lemma 1 the density function of is:

(1)

Let denote the event that and .

Lemma 3.

For defined as above and sufficiently small , .

Proof.

By equation (2) in [31] we know that . Thus for and we get:

Also, we note that for every , , by setting in Corollary 5.35 of [33]. In what follows we condition on the event that for every .

Then we consider the joint distribution

of in . Let be the event that and has a gap smaller than . Thus . To lower bound , we need to upper bound the probability of for .

Let be the density function of as in (1), and let be the Lebesgue measure in dimensions. Then for every ,

Note that conditioning on such that , the density function is bounded as:

As a result, we get .

Therefore, the probability of is lower bounded for sufficiently small ,

Conditioned on event and recalling that , the probability density of has only a negligible difference from that of , since the small disturbance of eigenvalues is dominated by the corresponding terms in .

Similarly we can prove . Thus the total variation distance between and conditioned on is for sufficiently large . Thus, for sufficiently large , we have:

Therefore, with as many as non-adaptive queries to the oracle matrix , the two distributions and cannot be distinguished with advantage greater than . At least queries are necessary to distinguish those two matrices and of rank and rank , respectively.

Indeed, the above argument holds for every constant advantage if , , and is sufficiently small in the proof of Lemma 3, and letting be sufficiently large. ∎

3.1.2 Equivalence Between Adaptive and Non-Adaptive Protocols

Now, we consider the adaptive query matrix where is the -th query vector. Without loss of generality, we can assume that is a unit vector and it is orthogonal to query vectors . This gives us the following formal definition of an adaptive query protocol.

Definition 1.

For a target matrix , an adaptive query protocol will output a sequence of query vectors . It is called a normalized adaptive protocol if for any , the query vector output by satisfies

  1. is a unit vector;

  2. is orthogonal to the vectors ;

  3. is deterministically determined by .

Let be a standard protocol which outputs where is the -th standard basis vector. We then show that adaptivity is unnecessary by proving that has the same power as any normalized adaptive protocol.

More formally, we show the following lemma:

Lemma 4.

Fix any matrix and any normalized adaptive protocol . Let be an i.i.d. Gaussian matrix. Fix the number of queries. Let and be the query matrices output by protocol and , respectively. Then, the probability distribution of is the same as the distribution of .

Proof.

Since is i.i.d. Gaussian, it is enough to show is also i.i.d. Gaussian. We will show it column-by-column.

Let and . Note that are unit vectors and orthogonal to each other. We first define unitary rotation matrices recursively as follows. The matrix will take to . The matrix will take to for any and takes to . Note, only depends on the first query vectors. We have for any , and . In the following, we use induction to show is i.i.d. Gaussian for any .

For , since is determined by which is independent of and is a unitary matrix, is i.i.d. Gaussian. Thus, is the first column which is also i.i.d. Gaussian.

Now, suppose is i.i.d. Gaussian. We prove that is also i.i.d. Gaussian. Let , which is i.i.d. Gaussian. Since is determined by , it is determined by the responses to the first queries, that is, by . This means is determined by the first columns of , and therefore it depends only on the first columns of . On the other hand, for any , and thus , where is the identity matrix, and depends only on the first columns of . Consequently, in the product , the first columns are the same as those of . In the -th column, the -th element is , where the are the corresponding elements of . Since depends only on the first columns of , it is independent of when . Thus, the -th column is also i.i.d. Gaussian and independent of the first columns. Therefore, is still i.i.d. Gaussian.

By induction is i.i.d. Gaussian. This finishes our proof. ∎

We then show for , adaptivity is also unnecessary by a similar argument.

Corollary 2.

Consider . For any fixed , and any fixed normalized adaptive protocol , has the same distribution as .

Proof.

It is enough to show both and are i.i.d. Gaussian. ∎

Combining these results with Theorem 1 and Yao’s minimax principle [37], we obtain the following theorem.

Theorem 3.

Let constant be the error tolerance and let be an oracle matrix with adaptive queries. For every integer , at least queries are necessary for any randomized algorithm to distinguish whether or with advantage .

3.2 Lower Bound for Trace Estimation

We lower bound the number of queries needed to approximate the trace of a matrix. In particular, our lower bound follows by a reduction from triangle detection, whose query lower bound is proved in Theorem 9.

Theorem 4.

For any integer and symmetric matrix with entries in , the number of possibly adaptively chosen query vectors, with entries in , needed to approximate up to any relative error, is .

Proof.

Suppose we had a possibly adaptive query algorithm which, for a symmetric matrix, could approximate its trace up to any relative error. If the matrix is the cube A^3 of a symmetric matrix A, we can run the trace estimation algorithm on A^3 as follows: if v^1 is the first query, we compute Av^1, then A(Av^1), then A(A(Av^1)) = A^3 v^1. This then determines the second query v^2, and we similarly compute Av^2, then A(Av^2), then A^3 v^2, and so on. Thus, given only query access to A, we can simulate the algorithm on A^3 using three adaptive queries to A per query to A^3.

Now, it is well known that for an undirected graph G with adjacency matrix A, the trace of A^3 is proportional to the number of triangles in G (each triangle is counted six times). By the argument above, it follows that with a factor of three more queries to A, we can determine whether G has a triangle or has no triangles. On the other hand, by Theorem 9 below, many queries to A are necessary for any adaptive algorithm to decide whether there is a triangle in G. The claimed lower bound for trace estimation follows. ∎
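The simulation in the proof can be made concrete as follows (a sketch; the Oracle class, the cubed_query helper, and the Hutchinson-style estimator at the end are our illustrative choices, not the paper's construction): each query to A^3 is answered with three successive queries to A.

import numpy as np

class Oracle:
    """Answers right matrix-vector queries to a hidden matrix A."""
    def __init__(self, A):
        self._A = A
        self.num_queries = 0

    def query(self, v):
        self.num_queries += 1
        return self._A @ v

def cubed_query(oracle, v):
    """Simulate a single query to A^3 using three queries to A."""
    return oracle.query(oracle.query(oracle.query(v)))

# A trace-estimation algorithm for A^3 can therefore be run with three times as many
# queries to A; e.g., a Hutchinson-style estimate of trace(A^3):
rng = np.random.default_rng(4)
n = 8
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1)
A = A + A.T                                  # random adjacency matrix
oracle = Oracle(A.astype(float))
q = 200
est = np.mean([s @ cubed_query(oracle, s) for s in rng.choice([-1.0, 1.0], size=(q, n))])
print(est, np.trace(np.linalg.matrix_power(A, 3)))   # estimate vs exact trace(A^3)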

3.3 Deciding if is a Symmetric Matrix

Theorem 5.

Given an n × n matrix M over any finite field or over the real or complex numbers, queries are enough to test whether M is symmetric or not with probability .

Proof.

We choose two random vectors x and y, where over a finite field we choose them from the uniform distribution and over the real or complex numbers we choose them from the Gaussian distribution. We then compute Mx and My. We declare M to be symmetric if and only if y^T(Mx) = x^T(My). It is easy to check that if M is symmetric, the test succeeds. We then show that if M is not symmetric, the test fails with constant probability, so we obtain the claimed success probability by repeating the test several times.

Let N = M - M^T. When M is not symmetric, N is not the zero matrix. Thus,
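A minimal sketch of this symmetry test over the reals (Gaussian test vectors, two right queries per repetition; the function name is ours): it compares the bilinear forms x^T(My) and y^T(Mx), which agree for all x, y exactly when M is symmetric.

import numpy as np

def looks_symmetric(oracle_query, n, reps=20, rng=None):
    """Test symmetry using only right matrix-vector queries.

    oracle_query(v) is assumed to return M v for the hidden n x n matrix M.
    """
    rng = rng or np.random.default_rng()
    for _ in range(reps):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        # x^T (M y) and y^T (M x) coincide for every x, y iff M is symmetric.
        if not np.isclose(x @ oracle_query(y), y @ oracle_query(x)):
            return False
    return True

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 6))
print(looks_symmetric(lambda v: M @ v, 6))           # False with high probability
print(looks_symmetric(lambda v: (M + M.T) @ v, 6))   # True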