 # Distributed Statistical Estimation of Matrix Products with Applications

We consider statistical estimations of a matrix product over the integers in a distributed setting, where we have two parties Alice and Bob; Alice holds a matrix A and Bob holds a matrix B, and they want to estimate statistics of A · B. We focus on the well-studied ℓ_p-norm, distinct elements (p = 0), ℓ_0-sampling, and heavy hitter problems. The goal is to minimize both the communication cost and the number of rounds of communication. This problem is closely related to the fundamental set-intersection join problem in databases: when p = 0 the problem corresponds to the size of the set-intersection join. When p = ∞ the output is simply the pair of sets with the maximum intersection size. When p = 1 the problem corresponds to the size of the corresponding natural join. We also consider the heavy hitters problem which corresponds to finding the pairs of sets with intersection size above a certain threshold, and the problem of sampling an intersecting pair of sets uniformly at random.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

We study the problem of statistical estimations of a matrix product in the distributed setting. Consider two parties Alice and Bob; Alice holds a matrix and Bob holds a matrix , and they want to jointly compute a function defined on and by exchanging messages. The goal is to minimize both the total communication cost and number of rounds of interaction.

One of the main statistical quantities we consider is the -norm of the product , defined as

 ∥C∥p=(∑i,j∈[n]∣∣Ci,j∣∣p)1/p.

Here the matrix product is the standard matrix product over the integers. Interpreting as , we see that corresponds to the number of non-zero entries of , which, interpreting the rows of and columns of as sets, corresponds to the set-intersection join size (see Section 1.1 for the formal definition). This can also be viewed as a matrix form of the well-studied distinct elements problem in the data stream literature (see, e.g., (FM85, ; BJKST02, ; KNW10, )). Again interpreting the rows of and the columns of as sets, the case corresponds to the size of the corresponding natural join (again see Section 1.1 for the formal definition). The case corresponds to the (squared) Frobenius norm of the matrix product , which is a norm of fundamental importance in a variety of distributed linear algebra problems, such as low rank approximation (for a recent survey, see (w14, )). The case corresponds to the pair of sets of maximum intersection size. Estimating the largest entry in a Boolean matrix product has also been studied in the centralized setting. We refer readers to the recent paper (AR17, ) and references therein.

As a closely related problem, we also consider the -sampling problem for which the goal is to sample each non-zero entry in

with probability

, which corresponds to approximately outputting a random pair among the intersecting pairs of sets. -sampling is also extensively studied in the data stream literature (FIS08, ; MW10, ; JST11, ), and is used as a building block for sketching various dynamic graph problems (see (McGregor14, ) for a survey).

We also study the approximate heavy hitter problem defined as follows. Let

The --heavy-hitter () problem asks to output a set such that

 HHpϕ(AB)⊆S⊆HHpϕ−ϵ(AB).

As outputting the matrix product requires outputting numbers, it is natural to output the set as a sparse approximation of ; indeed this can be viewed as a matrix form of the well-studied compressed sensing problem.

As mentioned, these basic statistical problems, being interesting for their own sake, have strong relationships to fundamental problems in databases. We describe such relationships more formally below.

Despite a large amount of work on computing

-norms and heavy hitters on frequency vectors in the streaming literature (see, e.g.,

(Muthu05, ) for a survey), we are not aware of any detailed study of these basic statistical functions on matrix products. The purpose of this paper is to introduce a systematic study of statistical estimations on matrix products.

### 1.1. Motivation and Applications

Estimating the norm of a matrix product is closely related to two of the most important operations in relational databases – the composition and the natural join. Suppose we are given two relations and , where is defined over attributes and is defined over attributes . Assume for simplicity that . We thus have and . The composition of and is defined to be

 A∘B={(i,j) | ∃k:(i,k)∈A∧(k,j)∈B}.

The natural join is defined to be

 A⋈B={(i,k,j) | (i,k)∈A∧(k,j)∈B}.

It is easy to see that the natural join corresponds to the composition together with the requirement that all the “witnesses” are output.

We further define “projection” sets for each , and for each . Then we can rewrite the composition and natural joins as follows:

• ,

• .

We thus also refer to compositions as set-intersection joins, and natural joins as set-intersection joins with witnesses.

As an application of set-intersection joins, consider a job application scenario: we have applicants, with the -th applicant having a set of skills from the universe , and jobs, with the -th job requiring a set of skills . Our goal is to find all the possible applicant-job matches, namely, those pairs such that . One may also be interested in the number of such matches (the -norm) or the most qualified applicants (the entry realizing the -norm, or the heavy hitters).

We can further relate set-intersection joins to Boolean matrix multiplication. Let and be two matrices such that each row is the indicator vector of , and each column is the indicator vector of . Then the non-zero entries of exactly correspond to the outputs of the set-intersection joins on and . If we are interested in estimates to the sizes of the joins, which are very useful for guiding query optimization since they can be computed using much less communication than computing the actual joins, then we have

• , that is, the -norm of is the size of the composition of and ,

• , that is, the -norm of is the size of the natural join of and .

Finally, corresponds to the pair with the maximum overlap, and for a threshold corresponds to the set of heavy hitters, i.e., those pairs of sets whose intersection size exceeds the threshold. These two problems have natural applications in inner product similarity joins on a set of vectors; we refer the reader to recent work (APR016, ) on inner product similarity joins and references therein.

###### Remark 1 ().

We note that all of these problems and the results in this paper can be straightforwardly modified to handle the general case where , and , which corresponds to where and . See Section 6 for more discussions.

### 1.2. Our Results

For simplicity we use the notation to hide factors where is the multiplicative approximation ratio and is the error probability of a randomized communication algorithm. We say that approximates within a factor of if where and .

Set-Intersection Join Size. We give a -round -bit algorithm that approximates , , within a factor. For the important case of , this provides a significant improvement over the previous result in (GWWZ15, ). Also, due to the lower bound in (GWWZ15, ) for one-round algorithms (i.e., algorithms for which Alice sends a single message to Bob, who outputs the answer), this gives a separation in the complexity of this problem for one and two-round algorithms. As the algorithm in (GWWZ15, ) is a direct application of an space streaming algorithm, our algorithm illustrates the power to go beyond streaming algorithms in this framework.

Pair of Sets with Maximum Intersection Size. We first give a constant round -bit algorithm that approximates within a factor. We complement our algorithm by showing a few different lower bounds that hold for algorithms with any (not necessarily constant) number of rounds. First, we show that any algorithm that approximates within a factor of needs bits of communication, thus necessitating our factor approximation. Moreover, we show that any algorithm achieving any constant factor approximation must use bits of communication, which shows that our factor approximation algorithm has optimal communication, up to polylogarithmic factors.

We next look at approximation algorithms that achieve approximation factors to that are larger than constant. We show it is possible to achieve a -approximation factor using bits of communication. We complement this with an bit lower bound.

Finally we show that the fact that the matrices and are binary is crucial. Namely, we first show that for general matrices and with -bounded integer entries, there is an lower bound for any constant factor approximation. For general approximation factors that may be larger than constant, we show an upper and lower bound of communication. This shows an arguably surprising difference in approximation factor versus communication for binary and non-binary matrices.

Heavy Hitters. We give an -round protocol that computes --heavy-hitters, , and , with various tradeoffs depending on whether Alice and Bob’s matrices are arbitrary integer matrices, or whether they correspond to binary matrices. For arbitrary integer matrices, we achieve bits of communication for every .

We are able to significantly improve these bounds for binary matrices, which as mentioned above, have important applications to database joins. Here we show for every an -round protocol with bits of communication.

### 1.3. Related Work

Early work on studying joins in a distributed model can be found in (ME92, ) (Section ) and (Kossmann00, ). Here the goal is to output the actual join rather than its size, and such algorithms, in the worst case, do not achieve communication better than the trivial algorithm in which Alice sends her entire input to Bob for a centralized computation.

With the rise of the MapReduce-type models of computation, a number of works have been devoted to studying parallel and distributed computations of joins. Such works have looked at natural joins, multi-way joins, and similarity joins, in a model called the massively parallel computation model (MPC) (AU11, ; KS11, ; BKS13, ; BKS14, ; KBS16, ; KS17, ; HTY17, ). Unlike our two-party communication model, in MPC there are multiple parties/machines, and the primary goal is to understand the round-load (maximum message size received by any server in any round) tradeoffs of the computation.

In a recent paper (GWWZ15, ) the authors and collaborators studied several join problems in the two-party communication model. The studied problems include set-intersection joins, set-disjointness joins, set-equality joins, and at-least- joins. Our results can be viewed as a significant extension to the results in (GWWZ15, ), as well as a systematic study of classical data stream problems in the context of matrix products. In particular, (GWWZ15, ) did not study estimating the -norms of , for any other than . For , they obtain an algorithm using communication, which we significantly improve to communication, and extend to any . Moreover, we obtain the first bounds for approximating , where perhaps surprisingly, we are able to obtain an -approximation in communication, beating the naïve amount of communication. This leads us to the first algorithms for finding the frequent entries, or heavy hitters of .

While a number of recent works (kvw14, ; lbkw14, ; blswx16, ; bwz16, ; wz16, ) look at distributed linear algebra problems (for a survey, see (w14, )), in all papers that we are aware of, the matrix is distributed additively. What this means is that we want to estimate statistics of a matrix , where and are held by Alice and Bob, respectively, who exchange messages with each other. In this paper, we instead study the setting for which we want to estimate statistics of a matrix , where and are again held by Alice and Bob, respectively, who exchange messages with each other. Thus, in our setting the underlying matrix of interest is distributed multiplicatively. When is distributed additively, a common technique is for the players to agree on a random linear sketching matrix , and apply it to their inputs to reduce their size. For example, if Alice has matrix and Bob has matrix , then Alice can send to Bob, who can compute . A natural extension of it in the multiplicative case is for Alice to send to Bob, who can compute . This is precisely how the algorithm for of (GWWZ15, ) proceeds. We show by using the product structure of and more than one round, it is possible to obtain significantly less expensive algorithms than this direct sketching approach.

Finally, we would like to mention several papers considering similar problems but working in the centralized model. In (c98, )

, Cohen uses exponential random variables and applies a minimum operation to obtain an unbiased estimator of the number of non-zero entries in each column of a matrix product

. However, a direct adaptation of this algorithm to the distributed model would result bits of communication and -round, which is the same as using the -round -sketching protocol applied to each of the columns in earlier work (GWWZ15, ). In contrast we show that surprisingly, at least to the authors, bits of communication is possible with only rounds. In (ACP14, ), Amossen, Campagna, and Pagh improve the time complexity of (c98, ), provided is not too small. However, a direct adaptation of this algorithm to the distributed model would result an even higher communication cost of .

In (cl99, ), the -sampling problem is considered. In this paper we do not emphasize estimation of , since this quantity can be computed exactly using bits of communication, as stated in Remark 2. Similarly -sampling can also be done in bits of communication, as illustrated in Remark 3.

In (p13, ), it is shown how to apply CountSketch to the entries of a matrix product where . The time complexity is , where denotes the number of non-zero entries of , and is the number of hash buckets in CountSketch which is at least . This outperforms the naïve time complexity of first computing and then hashing the entries of one-by-one. While interesting from a time complexity perspective, it does not provide an advantage over CountSketch in a distributed setting. Indeed, for each of the hashes on Alice’s side of the outer products computed in (p13, ), the size of the hash is , and consequently communicating this to Bob takes bits in total.

## 2. Preliminaries

In this section we give background on several sketching algorithms that we will make use of, as well as some basic concepts in communication complexity. We will also describe some mathematical tools and previous results that will be used in the paper.

For convenience we use to differentiate from a binary matrix, but we will assume that all the input matrices have polynomially bounded integer entries. For all sketching matrices we will make use of, without explicitly stated, each of their entries can be stored in bits.

Sketches.   A sketch of a data object is a summary of of small size (sublinear or even polylogarithmic in the size of ) such that if we want to perform a query (denoted by a function ) on the original data object , we can instead apply another function on such that . Sketches are very useful tools in the development of space-efficient streaming algorithms and communication-efficient distributed algorithms. Many sketching algorithms have been developed in the data stream literature. In this paper we will make use of the following.

###### Lemma 2.1 ((Indyk00, ; Knw10, ), ℓp-Sketch (0≤p≤2)).

For and a data vector , there is a sketch where is a random sketching matrix, and a function such that with probability , approximates within a factor of .

Communication Complexity.   We will use two-party communication complexity to prove lower bounds for the problems we study. In the two-party communication complexity model, there are parties Alice and Bob. Alice gets an input , and Bob gets an input . They want to jointly compute a function via a communication protocol. Let be a (randomized) communication protocol, and let be the private randomness used by Alice and Bob, respectively. Let denote the transcript (the concatenation of all messages) when Alice and Bob run on input using private randomness , and let denote the output of the protocol. We say errs with probability if for all ,

 PrrA,rB[ΠX,Y,rA,rB≠f(x,y)]≤δ.

We define the randomized communication complexity of , denoted by , to be , where denotes the length of the transcript .

We next introduce a concept called the distributional communication complexity. Let be a distribution over the inputs . We say a deterministic protocol computes with error probability on if

 Pr(X,Y)∼μ[ΠX,Y≠f(x,y)]≤δ.

The -error distributional communication complexity under input distribution , denoted by , is the minimum communication complexity of a deterministic protocol that computes with error probability on . The following lemma connects distributional communication complexity with randomized communication complexity.

###### Lemma 2.2 (Yao’s Lemma).

For any function and any , .

A standard method to obtain randomized communication complexity lower bounds is to first find a hard input distribution for a function , and then try to obtain a lower bound on the distributional communication complexity of under inputs . By Yao’s Lemma, this is also a lower bound on the randomized communication complexity of .

We now introduce two well-studied problems in communication complexity.

Set-Disjointness (DISJ).   In this problem we have Alice and Bob. Alice holds , and Bob holds . They want to compute

 {DISJ}(x,y)=∨ti=1(xi∧yi).
###### Lemma 2.3 ((Bjkst04, )).

.

Gap-.   In this problem Alice holds , and Bob holds , with the following promise: either for all ; or for some , . Define if , and otherwise.

###### Lemma 2.4 ((Bjkst04, )).

.

Tools and Previous Results.  We will make use of the following results on distributed matrix multiplication and -sampling on vectors.

###### Lemma 2.5 ((Gwwz15, ), Distributed Matrix Multiplication).

Suppose Alice holds a matrix , and Bob holds a matrix . There is an algorithm for Alice and Bob to compute and such that with probability , . The algorithm uses bits of communication and rounds.

###### Lemma 2.6 ((Jst11, ), ℓ0-Sampling).

For a data vector , there is a sketch where is a random sketching matrix, and a function such that returns for each coordinate with probability . The process fails with probability at most .

We will also need the standard Chernoff bound.

###### Lemma 2.7 (Chernoff Bound).

Let be independent Bernoulli random variables such that . Let . Let . It holds that and for any .

## 3. (1+ϵ)-Approximation of ℓp (p∈[0,2])

For notational convenience (in order to unify and for constant ), we define to be the number of non-zero entries of .

Note that for a constant , approximating within a factor and approximating within a factor are asymptotically equivalent – we can always scale the multiplicative error by a factor of (a constant), which will not change the asymptotic communication complexity. We will thus use these interchangeably for convenience.

The Idea.   The high level idea of the algorithm is as follows. We first perform a rough estimation – we try to estimate the -norm of each row of within a factor. We then sample rows of with respect to their estimated (-th power of their) -norm, obtaining a matrix . We finally use to obtain a finer estimation (i.e., a -approximation) of .

Algorithm.   Set parameters , . The algorithm for approximating -norms for is presented in Algorithm 1. We describe it in words below.

Alice and Bob first try to estimate the -norm of each row in within a factor of . This can be done by letting Bob send an -sketch of of size to Alice using the sketch in Lemma 2.1; Alice then computes . With probability , we have that for all ,

 (1) ∥∥˜Ci,∗∥∥pp∈[∥Ci,∗∥pp,(1+β)⋅∥Ci,∗∥pp].

We note that we can set (instead of ) and directly get a approximation of for each row (and thus ). This is exactly what was done in (GWWZ15, ). However, the communication cost in this case is , which is higher than our goal by a factor of .

Alice then sends Bob for all . Both parties partition all the rows of into up to groups , such that the -th group contains all for which

 (2) (1+β)ℓ≤∥∥˜Ci,∗∥∥pp<(1+β)ℓ+1.

By (1) and (2), we have that for each ,

 (3) (1+β)ℓ≤∥Ci,∗∥pp<(1+3β)⋅(1+β)ℓ.

For a fixed group , let and . For each , set

 pℓ=ρ|Gℓ|⋅∥∥˜Gℓ∥∥pp/∥∥˜C∥∥pp.

By (1) we have

 (4) pℓ∈[12⋅ρ|Gℓ|⋅∥Gℓ∥pp∥C∥pp, 2⋅ρ|Gℓ|⋅∥Gℓ∥pp∥C∥pp]

For each , Alice randomly samples each with probability . Alice then sends Bob which consists of all the sampled rows of with other rows being replaced by all- vectors. Bob then computes , and outputs as the approximation to .

We can show the following regarding Algorithm 1.

###### Theorem 3.1 ().

For any , there is an algorithm that approximates for within a factor with probability , using bits of communication and rounds.

Correctness.   For each , and each , let be a random variable such that if is sampled by Alice, and otherwise. Define

It is clear that

. We now compute its variance.

 Var[Zℓ] = 1p2ℓ∑i∈Gℓ⎛⎝(∥Ci,∗∥pp−∥Gℓ∥pp|Gℓ|)2Var[Xℓi]⎞⎠ ≤ 1pℓ∑i∈Gℓ(∥Ci,∗∥pp−∥Gℓ∥pp|Gℓ|)2 ≤ 1pℓ∑i∈Gℓ(3β⋅∥Gℓ∥pp|Gℓ|)2(by (???)) = 9β2⋅(∥Gℓ∥pp)2pℓ|Gℓ| ≤ 18β2ρ⋅∥Gℓ∥pp⋅∥C∥pp.(by (???))

Define . We then have , and

 Var[Z] = ∑ℓ∈[L]Var[Zℓ] ≤ 18β2ρ⋅∥C∥pp⋅∑ℓ∈[L]∥Gℓ∥pp ≤ 18β2ρ(∥C∥pp)2.

By Chebyshev’s inequality, we have

 Pr[|Z|≥ϵ⋅∥C∥pp]≤Var[Z](ϵ⋅∥C∥pp)2=18β2ρϵ2≤0.01.

We thus have with probability (conditioned on (1) holding, which happens with probability as well).

Finally note that we can always boost the success probability of the algorithm from to using the standard median trick and paying another factor in the communication cost (which will be absorbed by the notation).

Complexity.   The communication cost of sending the -sketch in the first round is words. The cost of sending the sampled rows is bounded by . Thus the total communication cost is bounded by

 = ~O(n)⋅(ρ+1β2) = ~O(n/ϵ)(by our choices of% ρ and β).

It is clear that the whole algorithm finishes in rounds of communication.

###### Remark 2 ().

We comment that for , can actually be computed exactly using bits of communication and round: Alice simply sends for each to Bob, and then Bob computes , which is exactly .

###### Remark 3 ().

We can also perform -sampling on using bits of communication and round. Alice sends for each the value and a random sample from column . Bob computes for each the value as well as , from which he samples a proportional to . Finally, Bob samples a random entry , and if is the uniform sample in that Alice sent to Bob, Bob outputs the pair as the -sample.

### 3.1. ℓ0-Sampling

We now present a simple algorithm for -sampling. Recall that the goal of -sampling on matrix is to sample each non-zero entry in with probability .

The idea is fairly simple: we employ an -sketch and -samplers in parallel. We first use the -sketch to sample a column of proportional to its -norm, and then apply the -sampler to that column. For the first step, we use the one-way -sketching algorithm in Lemma 2.1 to approximate the -norm of each column of within a factor of . For the second step, we use the one-way -sampling algorithm for vectors in Lemma 2.6 for each column of .

###### Theorem 3.2 ().

There is an algorithm that performs -sampling on with success probability using bits of communication and round.

###### Proof.

The size of the -sampler (i.e., the sketching matrix ) in Lemma 2.6 is bounded by , and the size of the -sketch in Lemma 2.1 is bounded by . Thus the total number of bits of communication is bounded by . The algorithm finishes in round since both the -sketch and -sampler can be computed in one round.

The success probability follows from a union bound on the success probabilities of the -sketch and -sampler for each of the columns of . ∎

## 4. (2+ϵ)-Approximation of ℓ∞

In this section we give almost tight upper and lower bounds for approximating , that is, the maximum entry in the matrix product . We first consider the product of binary matrices, and then consider the product of general matrices.

### 4.1. Upper Bounds for Binary Matrices

#### 4.1.1. An Upper Bound for 2+ϵ Approximation

The Idea.   The high level idea is to scale down each entry of so that is as small as possible subject to the constraint that the largest entry of is still approximately preserved (after scaling back). This down-scaling can be done by sampling each -entry of with a certain probability (we replace the non-sampled ’s by ’s). Let be the matrix of after applying sampling. Alice and Bob then communicate for each item the number of rows and columns in and respectively that contain item (i.e., those rows and columns with -th coordinate equal to ), and the one with the smaller number sends all the indices of those rows/columns to the other party. After this, Alice and Bob can compute matrices and independently such that , and then output as an approximation to .

Algorithm.   Let . Set . We present the algorithm in Algorithm 2, and describe it in words below.

For , Alice samples each -entry in with probability (i.e., with probability the -entry is replaced by a -entry). Let be the matrix after sampling with probability , and let .

For each , Alice and Bob compute using Remark 2. Let be the smallest index such that .

Let us focus on and , and consider each item . For convenience we identify the rows of and columns of as sets and respectively. Suppose appears times in Alice’s sets, and times in Bob’s sets. Alice and Bob exchange the information of and for all . Then for each , if then Alice sends all the indices of sets containing to Bob, otherwise Bob sends all the indices of sets containing to Alice.

At this point, Alice and Bob can form matrices and respectively so that , where corresponds to the portion of each entry of restricted to the items for which Alice knows the intersections (in other words, Alice knows the inner product defining the entry restricted to a certain subset of items), and similarly define . Finally Alice and Bob output as the approximation of .

We have the following theorem.

###### Theorem 4.1 ().

Algorithm 2 approximates for two Boolean matrices within a factor with probability using bits of communication and rounds.

Correctness.   We first show that the claimed approximation holds. The following lemma is a key ingredient.

###### Lemma 4.2 ().

With probability , approximates within a factor of .

###### Proof.

We assume that since otherwise there is nothing to prove (in this case we have and ).

We first define a few events.

1. .

2. For all pairs , if , then approximates within a factor of .

3. For all pairs , if , then .

In words, states that the maximum entry of will be large. states that for all large entries in , the values , after rescaling by a factor of , can be used to approximate within a factor of . states that for all small entries in , the corresponding values cannot be the maximum in the matrix .

It is not difficult to see that if all three events hold then Lemma 4.2 holds. Indeed, by we can approximate each by within a factor of as long as , and by we have . Therefore

 (5) ∥C∥∞≥12γ/(pℓ∗(1+ϵ))>14γ/p