Distributed and Streaming Linear Programming in Low Dimensions

03/13/2019 ∙ by Sepehr Assadi, et al. ∙ Princeton University

We study linear programming and general LP-type problems in several big data (streaming and distributed) models. We mainly focus on low dimensional problems in which the number of constraints is much larger than the number of variables. Low dimensional LP-type problems appear frequently in various machine learning tasks such as robust regression, support vector machines, and core vector machines. As supporting large-scale machine learning queries in database systems has become an important direction for database research, obtaining efficient algorithms for low dimensional LP-type problems on massive datasets is of great value. In this paper we give both upper and lower bounds for LP-type problems in distributed and streaming models. Our bounds are almost tight when the dimensionality of the problem is a fixed constant.


1 Introduction

As machine learning becomes pervasive, how to effectively support machine learning tasks in database systems has become a pressing question. In a recent paper [MVPV17], Makrynioti et al. observed that many machine learning problems can be expressed as linear programs (LP). They designed a level of abstraction called SolverBlox on top of the declarative language LogiQL (an extended version of Datalog [ACGKOPVW15]) as a framework for expressing linear program formulations. A query written in the SolverBlox format is then translated to a format supported by an LP solver, which computes the solution. In this paper we consider the algorithmic side of this research direction, that is, we focus on the design of efficient LP solvers for large-scale datasets. In particular, we propose algorithms for linear programming in three popular “big data” models, namely, the coordinator model [PVZ12], the streaming model [MP80, AMS99], and massively parallel computation (MPC) [KSV10, GSZ11, BKS17]. We also provide almost matching lower bounds when the dimensionality of the linear program is a fixed constant.

In the rest of the introduction we will start with the definition of the problem and the description of the computation models, and then present our results and discuss previous work.

Problem Definition.

The basic linear programming problem can be described as follows: we have a set of $d$ variables $x = (x_1, \ldots, x_d)$ and a set of $n$ linear constraints, each of which (indexed by $i$) is a linear inequality $\langle a_i, x \rangle \ge b_i$, where $a_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$ are coefficients and $d$ is the dimension of the problem. We also have a linear objective function. The goal is to find an assignment to the variables that minimizes the objective function while satisfying all the constraints.
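For reference, the standard form described above can be written as follows (the notation $a_i$, $b_i$, $c$ is ours, chosen for illustration):

\[
\min_{x \in \mathbb{R}^d} \; c^{\top} x
\quad \text{subject to} \quad
\langle a_i, x \rangle \ge b_i \quad \text{for all } i \in \{1, \ldots, n\},
\]

where $a_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$ are the coefficients of the $i$-th constraint and $c \in \mathbb{R}^d$ defines the objective.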

Linear programming is a special case of a more general problem called LP-type problem [MSW96], which we will discuss in detail in Section 2.1. Besides linear programming, LP-type problems also include several other important problems in machine learning, such as Linear Support Vector Machines (SVM) [BGV92], which are widely used in classification and regression analysis [GJ09, Burges98, CB99], and Core Vector Machines [TKC05], which are used to speed up general SVM computation (that is, Linear SVM augmented by the kernel trick [BGV92]). We will give the formal definitions of these problems in Section 4. The algorithms we propose in this paper work for general LP-type problems.

In this paper we are interested in the scenario in which the dimension of the linear program (and of the LP-type problem in general) is small compared to the number of constraints. Various examples of linear programming and LP-type problems in machine learning are of this type: SVMs and regression problems (in particular, least absolute error regression, which can be modeled by linear programming) are often over-constrained; in the problems of Chebyshev approximation and linear separability, the number of variables is typically small.

Computational Models.

We study linear programming and LP-type problems in the following big data models.


  • The (multi-pass) streaming model.   In this model, we have a single machine which can make linear scans of the input data sequence. The task is to compute some function defined on the input data sequence. The goal is to minimize the memory usage and the number of passes needed. This model captures settings in which the data cannot fit in memory, and in which sequential scans are much more efficient than random access.

  • The coordinator model.   In this model, we have a number of sites and a central coordinator. Each site is connected by a two-way communication channel to the coordinator. The input is initially partitioned among the sites. The task is for the sites and the coordinator to jointly compute some function defined on the union of the datasets. The computation proceeds in rounds: at the beginning of each round, the coordinator sends a message to each site, and then each site replies with a message back to the coordinator. At the end of the computation, the coordinator outputs the answer. The goal is to minimize the total number of bits communicated and the number of rounds of computation. This model fits data that is inherently distributed or that cannot fit in the storage of a single machine.

  • Massively parallel computation (MPC).   In this model, we have a set of machines interconnected in a network that allows communication between any pair of machines. Similar to the coordinator model, the input is partitioned among the machines, and the task is for them to compute some function defined on the union of the datasets. The computation again proceeds in rounds. In each round, the machines communicate with each other over the network by sending and receiving messages. The message sent by a machine in each round is a function of its input data and all messages it has received in previous rounds. Our goal is to minimize the number of rounds of computation and the maximum number of bits of information sent or received by a machine in any round (often called the load in the literature). MPC has become the model of choice for studying parallel computation in computer clusters.

Description of the input. Since we are dealing with low-dimensional problems, we assume that the memory on each site/machine in each model is at least proportional to $d$, the dimension of the problem, but is significantly smaller than $n$, the number of constraints. As a result, the input is presented by giving the constraints one by one to the algorithm in the streaming model, or by partitioning them across different sites/machines in the coordinator and MPC models.

1.1 Our Contributions and Related Work

In the following, we present our results for linear programming in the three big data models described above, and postpone the specifics of their generalization to LP-type problems to later sections. Our main upper bound result is the following.


Result 1.

We give the following polynomial-time algorithms for $d$-dimensional linear programming with $n$ constraints. For any integer and parameter:


  • Streaming: An -pass streaming algorithm with space.

  • Coordinator: An -round distributed algorithm with total communication.

  • MPC: An -round algorithm with load per machine.

Our algorithms are randomized and output the correct answer with probability at least $1 - 1/n^{c}$ for any desired constant $c > 0$.

Instantiating Result 1 with suitable choices of the integer and the parameter, we obtain linear programming algorithms that use few passes or rounds, and have space, communication, or load requirements in each model that are almost independent of the number of constraints. For low-dimensional instances, this results in dramatic savings compared to direct implementations of standard LP algorithms in these models.

Previously, Chan and Chen [CC07] proposed an -pass streaming algorithm for linear programming that uses space. Result 1 improves upon this result by achieving an exponentially smaller pass-complexity in terms of the dimension $d$.

In the coordinator model, Daumé et al. [DPSV12] gave an algorithm using communication based on an adaptation of the algorithm of [CC07]. The round complexity and communication cost of this algorithm again depend exponentially on the dimension $d$.

In the MPC model, very recently Tao [Tao18] gave a -round MPC algorithm with load when (for any ). This algorithm is then used as a building block for an interesting database application called entity matching with linear classification. The round complexity of our MPC algorithm in Result 1 improves that of [Tao18] by an exponential factor.

To summarize, Result 1 exponentially improves upon the pass/round complexities of the state-of-the-art, while using the same or smaller space, communication, or load, in the considered big data models.

We complement our algorithms by giving almost tight lower bounds for any fixed dimension (even $d = 2$) in the streaming and coordinator models.


Result 2.

We give the following lower bounds for 2-dimensional linear programming with $n$ constraints. For any integer:


  • Streaming: Any -pass algorithm requires space.

  • Coordinator: Any -round algorithm requires communication even when the number of sites is only .

Our lower bounds hold even for randomized algorithms that output the correct answer with probability at least .

A few remarks about Result 2: Firstly, it is easy to see that linear programming in one dimension is a trivial task in the models we consider. Result 2 thus proves the lower bound for the smallest non-trivial dimension. We note that unlike Result 1, which applies to all three models, Result 2 does not prove any lower bound for MPC algorithms. Proving lower bounds for MPC algorithms is considered a challenging task, as it has serious implications for long-standing open problems in complexity theory [RoughgardenVW16]. Hence, no unconditional lower bounds are known so far in the literature for any MPC problem, and Result 2 is no exception.

Prior to our work, Chan and Chen [CC07] gave a lower bound for -dimensional linear programming for a restricted family of deterministic streaming algorithms in the decision tree model (the only permitted operation of these streaming algorithms is testing the sign of a function evaluated at the coefficients of a subset of stored hyperplanes). Their lower bound states that this type of algorithm requires space to compute the solution in passes. Our lower bound in Result 2 is much stronger in that it proves a similar pass-space tradeoff for all streaming algorithms (even randomized ones). Finally, Guha and McGregor [GM08] showed that there is a fixed-dimensional optimization problem for which any -pass streaming algorithm requires space. However, it is not clear how to adapt their proof to linear programming since their optimization problem involves quadratic constraints [McGregor18].

Further Related Work.

Special cases of linear programming have been studied previously in the big data models. In particular, Ahn and Guha gave multi-pass streaming algorithms for $(1+\varepsilon)$-approximation of packing LPs [AG11], and Indyk et al. [IMRUVY17] gave similar algorithms for covering LPs (see also [AssadiKL16]). These results focus on high-dimensional linear programs (non-constant $d$) and only packing/covering LPs, and are hence quite different from our approach in this paper.

Unlike the case of big data models, low-dimensional linear programming has been studied extensively in the RAM model since the 1980s. Megiddo [Megiddo84] gave an algorithm for $d$-dimensional linear programming with time complexity $O(2^{2^d} n)$, which is linear in the number of constraints $n$. This bound was subsequently improved by a series of papers [Clarkson86, Dyer86, DF89, Kalai92, Clarkson95, ClarksonS89, MSW96, BCM99, Chan16].

2 Preliminaries

Notations.

For integers , we define , , and (we define and analogously). We use capital letters for sets and random variables and calligraphic letters for set families. We use the notation to denote a function of the form .

Throughout the paper, we say an event happens “with high probability” if its probability can be lower bounded by $1 - 1/n^{c}$ for any desired constant $c$ ($n$ is the number of constraints).

We use the following standard variant of Chernoff bound.

Proposition 2.1 (Chernoff bound).

Suppose $X_1, \ldots, X_n$ are independent random variables taking values in $[0,1]$ and $X := \sum_{i=1}^{n} X_i$. Then, for any $\delta \in (0,1)$,
\[
\Pr\Big[\,\big|X - \mathbb{E}[X]\big| \ge \delta \cdot \mathbb{E}[X]\,\Big] \le 2\exp\!\left(-\frac{\delta^{2} \cdot \mathbb{E}[X]}{3}\right).
\]

2.1 LP-type Problems

We consider a generalization of linear programming referred to as LP-type problems (the class of LP-type problems is also known as abstract linear programming [Bland78]). An LP-type problem consists of a pair $(S, f)$, where $S$ is a finite set of elements, and $f$ is a function defined on subsets of $S$ whose range is assumed to have a total order. The function $f$ satisfies two properties:

  • Monotonicity: for any two sets $A \subseteq B \subseteq S$, $f(A) \le f(B)$.

  • Locality: for any two sets $A \subseteq B \subseteq S$ and any element $e \in S$, if $f(A) = f(B) = f(A \cup \{e\})$, then $f(B \cup \{e\}) = f(B)$.

For an LP-type problem $(S, f)$ and a set $A \subseteq S$, we call a set $B \subseteq A$ a basis of $A$ if $f(B) = f(A)$ and for all proper subsets $B' \subsetneq B$ we have $f(B') \neq f(B)$. The goal is to compute a basis $B$ such that $f(B) = f(S)$. We say an element $e \in S$ violates $B$ if $f(B \cup \{e\}) \neq f(B)$. It helps to think of an LP-type problem as an optimization problem in which the elements of $S$ are the constraints, and $f(A)$ computes the best feasible solution on the set of constraints $A$. In the case when the optimal solution is not unique, we simply break ties arbitrarily. Computing $f(S)$ hence amounts to computing the optimal solution subject to all the constraints (we will make this connection explicit in the context of linear programming and other problems in Section 4).
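To make the abstract definition concrete, the following minimal Python sketch models an LP-type problem through a value oracle f on sets of constraints (all function names are ours); the routines below directly implement the definitions of a basis and of violation, as naive illustrations rather than the algorithms used later in the paper.

import itertools

def naive_basis(f, A):
    # Return a minimum-cardinality subset B of A with f(B) = f(A); such a B is a basis.
    target = f(frozenset(A))
    for r in range(len(A) + 1):
        for C in itertools.combinations(A, r):
            if f(frozenset(C)) == target:
                return frozenset(C)

def violators(f, S, B):
    # Elements whose addition to the basis B changes its value.
    fB = f(B)
    return {e for e in S if f(B | {e}) != fB}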

Combinatorial Dimension.

Note that an LP-type problem may have several bases of different sizes. We define the combinatorial dimension of an LP-type problem $(S, f)$, which we denote by $\dim(S, f)$ (or $\dim$ for short when $S$ and $f$ are clear from the context), to be the maximum cardinality of a basis for $(S, f)$.

2.2 $\varepsilon$-Nets and VC Dimension

We now define another important notion that we use in designing our algorithms.

VC Dimension.

A set-system is a tuple $(X, \mathcal{R})$ consisting of a universe $X$ and a set family $\mathcal{R}$ of subsets of $X$. Let $Y \subseteq X$ be a set. Define the intersection between the set family $\mathcal{R}$ and the set $Y$ to be the set family $\mathcal{R}|_Y := \{R \cap Y : R \in \mathcal{R}\}$. We say that a set $Y$ is shattered by $\mathcal{R}$ if $\mathcal{R}|_Y$ contains all the subsets of $Y$, i.e., $|\mathcal{R}|_Y| = 2^{|Y|}$. The VC dimension of the set-system $(X, \mathcal{R})$, denoted by $\mathsf{VC}(X, \mathcal{R})$ (or $\mathsf{VC}(\mathcal{R})$ for short when $X$ is clear from the context), is then the cardinality of the largest set $Y \subseteq X$ that is shattered by $\mathcal{R}$.

$\varepsilon$-Net.

Given a set-system $(X, \mathcal{R})$ and a weight function $w : \mathcal{R} \to \mathbb{R}_{\ge 0}$, for any point $x \in X$, let $w(x)$ denote the total weight of the sets in $\mathcal{R}$ that do not contain $x$ (and let $w(\mathcal{R})$ denote the total weight of all sets in $\mathcal{R}$). We say a set family $\mathcal{N} \subseteq \mathcal{R}$ is an $\varepsilon$-net of $\mathcal{R}$ with respect to $w$ for a parameter $\varepsilon \in (0,1)$, iff for any point $x \in X$ such that $w(x) \ge \varepsilon \cdot w(\mathcal{R})$ it holds that some set of $\mathcal{N}$ does not contain $x$.

The notion of $\varepsilon$-nets is well-studied in the literature (particularly in the computational geometry community [HW87, BG95, Mulmuley94]), and has been used in the design of algorithms for many problems. We use the following simple randomized construction of $\varepsilon$-nets to design a distributed version of Clarkson's algorithm for LP-type problems.

Lemma 2.2 ([HW87]).

For any set-system $(X, \mathcal{R})$ of VC dimension $v$, any weight function $w$, and parameters $\varepsilon, q \in (0,1)$, a set family obtained by randomly sampling

$O\!\left(\frac{v}{\varepsilon} \log \frac{v}{\varepsilon q}\right)$ (1)

sets with probability proportional to their weights is an $\varepsilon$-net of $(X, \mathcal{R})$ with respect to $w$, with probability at least $1 - q$.
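The sampling in Lemma 2.2 is simply independent weighted sampling of sets. A minimal Python sketch follows, with the sample size m left as a parameter (in the lemma it is determined by the VC dimension, $\varepsilon$, and the failure probability):

import random

def weighted_sample(sets, weights, m, rng=random):
    # Draw m sets independently, each with probability proportional to its weight.
    return rng.choices(sets, weights=weights, k=m)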

3 Algorithms

In this section we present our algorithms for Result 1. We will work with a special class of LP-type problems that contains the most natural LP-type problems that we are aware of, including linear programming, Linear SVMs, and Core SVMs mentioned earlier. In particular, we require the LP-type problem to satisfy the following properties:


  1. Each constraint is associated with a set of elements of $\mathcal{X}$ ($\mathcal{X}$ is the range of $f$).

  2. For any set $A$ of constraints, $f(A)$ is the minimal element of the intersection of the sets associated with the constraints in $A$.

It is useful to think of $\mathcal{X}$ as the set of feasible solutions. For example, in the case of linear programming, $\mathcal{X} = \mathbb{R}^d$ with the natural ordering induced by the scalar product with the vector in the objective function. Each constraint (inequality) corresponds to the subset of points which satisfy the constraint, and $f(A)$ is equal to the point which satisfies all constraints in $A$ and has a minimal scalar product with the objective vector. For convenience, we use a constraint and its associated set interchangeably.

For this special class of LP-type problems, we define the VC dimension of the problem as the VC dimension of the set system formed by the sets associated with the constraints.

In the following, we first give a general meta-algorithm for solving LP-type problems with Properties 1 and 2, and then show how to implement this meta-algorithm efficiently in each model.

3.1 The Meta Algorithm for LP-Type Problems

Our meta-algorithm follows Clarkson's algorithm [Clarkson95] for linear programming, but we use a different sampling procedure (based on $\varepsilon$-nets) which enables us to work with general LP-type problems with bounded VC dimension; it also significantly simplifies the analysis and facilitates the implementation of our algorithm in the big data models we consider. We further use a different weight increase rate after each iteration, which is essential for reducing the number of passes in the streaming model, and the number of rounds in the coordinator and MPC models.

The algorithm proceeds in iterations. We maintain a weight function $w$ throughout the algorithm, initialized by setting $w(e) = 1$ for every element $e \in S$. In each iteration, we first sample a family of sets from $S$ with probability proportional to their weights so as to obtain an $\varepsilon$-net of $S$ (according to Lemma 2.2). We then compute a basis of the sampled family, and the set $V$ of constraints which violate this basis. If the total weight of $V$ is at most an $\varepsilon$ fraction of the total weight, then we say this iteration “succeeds”, and increase the weights of all sets in $V$. Otherwise, we say this iteration “fails”, and continue to the next one without modifying the weights. A pseudo-code is provided in Algorithm 1.

Input: An LP-type problem satisfying Properties 1 and 2 and integer .
Output: .
1 Let , and as the VC dimension of the LP-type problem .
2 Set for every .
3 repeat
4       Sample a family of size by picking each set in with probability proportional to for the parameter in Lemma 2.2.
5       Compute a basis of .
6       Let be the family of sets in that violate .
7       if   then
8             Set for every set .
9       end if
10      
11until ;
return .
ALGORITHM 1 A Meta-Algorithm for LP-Type Problems
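For concreteness, the following Python sketch renders the meta-algorithm; the sample size m, the success threshold eps, and the multiplicative weight-update factor are stated here as explicit assumptions (the paper fixes the first two via Lemma 2.2 and leaves the exact update rate to the analysis), and basis_of and violates stand for the problem-specific subroutines.

import random

def meta_lp_type(constraints, basis_of, violates, eps, m, update_factor=2.0, rng=random):
    # Clarkson-style meta-algorithm for LP-type problems via eps-net sampling.
    #   constraints   : list of constraints (the elements of S)
    #   basis_of      : returns a basis of a given set of constraints
    #   violates(c,B) : True iff constraint c violates basis B
    #   eps, m        : eps-net parameter and sample size (Lemma 2.2)
    #   update_factor : weight increase for violators (an assumption of this sketch)
    weight = {c: 1.0 for c in constraints}
    while True:
        # Sample an eps-net: m constraints, each with probability proportional to its weight.
        net = rng.choices(constraints, weights=[weight[c] for c in constraints], k=m)
        B = basis_of(frozenset(net))
        V = [c for c in constraints if violates(c, B)]
        if not V:
            return B                       # no violators: B is a basis of the whole input
        if sum(weight[c] for c in V) <= eps * sum(weight.values()):
            for c in V:                    # "successful" iteration: boost violator weights
                weight[c] *= update_factor
        # otherwise the iteration "fails" and the weights stay unchanged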

In the following, we first establish the correctness of the meta-algorithm and then bound the number of iterations it needs.

Lemma 3.1.

When Algorithm 1 stops, it correctly computes $f(S)$.

Proof.

At the end of the algorithm, we have $V = \emptyset$, i.e., no element of $S$ violates the computed basis $B$. This means that for any $e \in S$, we have $f(B \cup \{e\}) = f(B)$, and $f(B) \le f(S)$ by the monotonicity property of $f$. By the locality property and induction we obtain that $f(B) = f(S)$, finalizing the proof.       

We now bound the number of iterations. We say that an iteration of Algorithm 1 is successful iff the condition in the if-statement of Algorithm 1 holds in this iteration, i.e., the total weight of the violating sets is at most an $\varepsilon$ fraction of the total weight.

Claim 3.2.

Each iteration of Algorithm 1 is successful with probability at least .

Proof.

Since the VC dimension of the problem is bounded, by Lemma 2.2, with probability at least , the family sampled in this iteration is an $\varepsilon$-net for $S$ with respect to the weight function $w$. In the following, we condition on this event.

Let . By Property 2 of the LP-type problems we consider, we know that is the minimal element in the intersection of all sets in according to the ordering of . For any set to violate , we need to have ; otherwise which is in contradiction with . Recall that is the family of all sets in that violate . Suppose towards a contradiction that . Since none of the sets in contain , and is an -net, by definition there is a set where does not contain . But this is in contradiction with being a basis. To see this, if , then belongs to all sets in , and consequently it should also be in . We thus have , finalizing the proof.       

Lemma 3.3.

The number of iterations in Algorithm 1 is with probability at least , where denotes the combinatorial dimension of .

Proof.

Recall that the weight function is updated only when an iteration is successful, and each iteration succeeds with probability at least by Claim 3.2. By Chernoff bound (Proposition 2.1), we have that if the algorithm terminates in iterations, then with probability at least , at least of these iterations are successful.

We now focus on successful iterations. Let be the weight function after the -th successful iteration. Initially, for any we have (and thus ). We claim that for any integer , if Algorithm 1 reaches the -th successful iteration, then

(2)

We establish Eq (2) in the following two claims.

Claim 3.4.

For any integer , we have .

Proof.

Fix an arbitrary basis of for some (recall that by definition, is the size of the largest basis). Since , we have for any . We thus only need to show .

The first observation is that in any iteration, if then we must have . Indeed, if , then where the first equality is by the locality property of and induction, and the second equality holds since is a basis for . However, this is in contradiction with the fact that .

Let us now define as the basis of the -net computed in the -th successful iteration. For any , let be the number of iterations such that violates . That is,

Since in each of the first successful iterations, there must exist at least one which violates for each . We thus have Moreover, by the weight update rule of the algorithm, we can write the weight of as By combining these and Jensen’s inequality we have

since . This concludes the proof of Claim 3.4.       

Claim 3.5.

For any integer , we have .

Proof.

For any iteration , the weight update procedure at Line 1 of Algorithm 1 gives

(3)

Moreover, by the condition at Line 1 of the algorithm, we have,

(4)

by the choice of in the algorithm. Combining (3) and (4) we have

 

We get back to the analysis of the number of iterations. By Eq (2) we have , hence, Since , we have . Therefore the number of successful iterations cannot exceed , and hence the total number of iterations is bounded by with probability .       

Remark 3.6.

We can easily turn the Las Vegas algorithm in this section (Algorithm 1) into a Monte Carlo algorithm by the following modifications: first, we pick an $\varepsilon$-net of size , and second, the algorithm returns “FAIL” whenever , which will not happen in the first iterations with probability at least .

3.2 Implementation in the Streaming Model

Starting from this section, we show how to implement Algorithm 1 in the three big data models considered in the paper. We start with the streaming algorithm. In the multi-pass streaming model the elements of $S$ arrive one by one, and $n = |S|$ is known to the algorithm at the beginning. We allow the algorithm to make multiple linear scans of the input.

The main challenge in the streaming implementation of Algorithm 1 is that we cannot afford to store the weights of all elements of $S$, which are needed for the $\varepsilon$-net sampling. To resolve this issue, we instead store the bases computed at all the successful iterations – these are the only iterations in which we change the weight function – in a collection, using which we can compute the weight of each element of $S$ on the fly. In particular, the weight of a set in an iteration of the algorithm is determined by the number of stored bases that the set violates. It is immediate to verify that this indeed implements the same weight function as Algorithm 1. It is also easy to see that, having access to these weights, we can sample each set with probability proportional to its weight using the weighted version of reservoir sampling [Chao82], and hence implement each iteration of Algorithm 1 in one pass over the stream.
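The two ingredients of this implementation, recomputing a weight from the stored bases and sampling in a single pass, can be sketched in Python as follows (the multiplicative update factor is an assumption of the sketch, and violates is the problem-specific violation test):

import random

def weight_from_bases(c, stored_bases, violates, update_factor=2.0):
    # w(c) is determined by how many of the stored (successful-iteration) bases c violates.
    k = sum(1 for B in stored_bases if violates(c, B))
    return update_factor ** k

def one_pass_weighted_sample(stream, stored_bases, violates, m, rng=random):
    # m independent weighted reservoirs; after the pass, each holds an item drawn
    # with probability proportional to its weight.
    reservoirs = [None] * m
    total = 0.0
    for c in stream:
        w = weight_from_bases(c, stored_bases, violates)
        total += w
        for j in range(m):
            if rng.random() < w / total:
                reservoirs[j] = c
    return reservoirs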

The rest of Algorithm 1 can be implemented in the streaming model in a straightforward way. Let be the time complexity of computing a basis for a set of size , and be the time complexity of finding all elements in a set of size which violate a set of size , i.e., all such that . This allows us to prove the following theorem.

Theorem 1.

Suppose is an LP-type problem with combinatorial dimension , VC dimension , and bit-complexity for each element of . For any integer , we can compute with high probability in the streaming model, using passes, and space. The total running time of the algorithm is also .

Proof.

The correctness of the algorithm follows from Lemma 3.1. As each iteration of Algorithm 1 can be implemented in one pass, the total number of passes needed by our streaming algorithm is with high probability by Lemma 3.3.

Recall that the size of each $\varepsilon$-net sampled in Algorithm 1 is determined by the choice of the parameter in the algorithm and in Lemma 2.2. The space needed by the algorithm to store the net in each iteration is , which is equal to bits. We also need to store all bases from successful iterations, which requires (since ) as each basis requires bits to represent and there are such bases in total.

Each pass of the algorithm involves performing a violation test over the elements of $S$, which takes time, and computing a basis of the sampled elements, which takes time. The run-time bound follows by multiplying these costs by the number of passes, and by the choice of the parameters.       

3.3 Implementation in the Coordinator Model

Recall that in the coordinator model the input set $S$ is arbitrarily partitioned among the sites such that for any $i$, the $i$-th site receives a subset $S_i$ of the elements. The sites and the coordinator want to jointly compute $f(S)$ via communication. The function $f$ is public knowledge, that is, all parties know how to evaluate the function on any subset of $S$, assuming the subset resides entirely on one machine.

Similar to the streaming model, the main step here is also the implementation of the $\varepsilon$-net sampling procedure in Algorithm 1.

Lemma 3.7.

The coordinator can sample a subset of size according to the weight function using rounds and bits of communication, where is the number of times the weight function has been updated when simulating Algorithm 1 in the coordinator model.

Proof.

The sampling algorithm is as follows. In the first round, each site sends the total weight of its local elements to the coordinator. Note that for any site, this total weight can be described in a bounded number of bits.

In the second round the coordinator generates i.i.d. random site indices from the distribution in which each site is picked with probability proportional to its reported total weight, and sends the $i$-th site the number of indices equal to $i$. After obtaining this number, site $i$ samples that many elements from its local set according to the distribution proportional to their local weights, and sends the sampled elements to the coordinator. The communication cost of this round is thus bounded by the size of the sample times the number of bits needed to describe an element.

Finally, the sampling is indeed with respect to the weight function $w$: the probability that a fixed element is chosen in a given draw equals the probability that its site is picked times the probability that it is then picked within the site, which is proportional to its weight. This concludes the proof.       
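A minimal Python simulation of this two-round sampling procedure (the data layout and names are ours):

import random
from collections import Counter

def coordinator_sample(site_inputs, site_weights, m, rng=random):
    # Round 1: each site reports only the total weight of its local elements.
    totals = [sum(ws) for ws in site_weights]
    # Round 2: the coordinator draws m i.i.d. site indices proportionally to the totals
    # and tells each site how many samples to provide ...
    picks = Counter(rng.choices(range(len(site_inputs)), weights=totals, k=m))
    # ... and each site samples that many elements proportionally to its local weights.
    sample = []
    for i, k in picks.items():
        sample.extend(rng.choices(site_inputs[i], weights=site_weights[i], k=k))
    return sample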

In order to implement Algorithm 1, each site should also be able to determine the set of violating elements in its input. This can be done easily by asking the coordinator to share the basis computed in each iteration with every site. The proof of Theorem 2 follows directly from that of Theorem 1 by plugging in Lemma 3.7.

Theorem 2.

Suppose is an LP-type problem with combinatorial dimension , VC dimension , and bit-complexity for each element of . For any integer , we can compute with high probability in the coordinator model with machines, using rounds, and communication in total. The local computation time of the coordinator is and the local computation time of the -th site is where .

3.4 Implementation in the MPC Model

The implementation of Algorithm 1 in the MPC model can be done similarly to that in the coordinator model, by choosing one of the machines to play the role of the coordinator. The only problem is that when the number of machines is large, the machines cannot simply send all their messages to the coordinator directly, as this would blow up the load at the coordinator.

Our general strategy is to simulate, in the MPC model, our implementation of the meta-algorithm for the coordinator model. The main challenge is that once we require a small load per machine, we need many machines to begin with in order to fit the whole input across all machines. This means that the number of sites in the simulation is large. But then, if all these machines need to send even one bit to the designated coordinator machine (or vice versa), this requires a prohibitively large load on the coordinator machine.

In order to fix this, we use the by-now standard approach of [GSZ11]. There are only two steps in which the coordinator and the machines need to communicate with each other: (1) when the machines need to send a sample of the $\varepsilon$-net, and (2) when the coordinator needs to send the basis to the machines. The latter can be done easily in MPC rounds on machines of memory : the coordinator first shares this information with other machines in one round; each of these machines next shares this information with another set of machines (unique to each original machine). In rounds all the machines would receive this information (see [GSZ11] for more details on this general approach).
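As a sanity check on this fan-out broadcast, the following sketch computes the number of rounds needed until every machine is informed, assuming each informed machine can forward the message to a bounded number of new machines per round (the fan-out stands in for the per-machine memory/load budget; all names are ours):

def broadcast_rounds(num_machines, fanout):
    # Rounds for a fanout-ary broadcast tree to reach all machines.
    informed, rounds = 1, 0
    while informed < num_machines:
        informed += informed * fanout   # every informed machine forwards to `fanout` new ones
        rounds += 1
    return rounds

# e.g. broadcast_rounds(10**6, 10**3) == 2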

To handle the part where the machines need to send the $\varepsilon$-net to the coordinator, we do as follows. Recall that the size of the net is at most , and thus it will fit in the memory of the coordinator. However, we first need to sample this family according to the correct distribution. In order to do this, we use our approach for implementing the streaming algorithm. Since by the previous part we managed to share the basis computed in each iteration with every machine, as in the case of the streaming algorithm, the machines can compute the weight of every constraint they hold. The total weight of the constraints can also be computed in rounds using the sort-and-search method of [GSZ11]. As a result, each machine can locally perform its part of the sampling of the net and send this information to the coordinator. To summarize, we have the following theorem.

Theorem 3.

Suppose is an LP-type problem with combinatorial dimension , VC dimension , and bit-complexity for each element of . For any , we can compute with high probability in the MPC model using rounds with load per machine.

4 Examples and Applications

We now give examples of the application of our algorithms to general LP-type problems. We will discuss several fundamental optimization problems in machine learning, namely, linear programming, Linear SVM, and Core SVM. Recall that when implementing our meta-algorithm in each model, we have left two functions (the time needed for performing the violation test) and (the time for computing the basis) unspecified. In this section we will provide concrete bounds for these functions in the context of the problems we study. Throughout this section, we assume that the bit-complexity of each number in the input is bits.

4.1 Linear Programming

A linear program is an optimization problem of the type:

(5)

A $d$-dimensional linear program can be modeled as an LP-type problem as follows. Let $\mathcal{H}$ be a set family of size $n$ such that for every constraint in (5), there exists a unique element of $\mathcal{H}$ which is the half-space in $d$-dimensional Euclidean space containing the points that satisfy this single constraint. We define the function $f$ over subsets of $\mathcal{H}$ such that for every $A \subseteq \mathcal{H}$, $f(A)$ is the lexicographically smallest point that minimizes the objective value of the LP while satisfying only the constraints in $A$. The linear program (5) now corresponds to the LP-type problem $(\mathcal{H}, f)$ (we use $\mathcal{H}$ as opposed to our previous notation $S$, since each element of $\mathcal{H}$ is now itself a subset of $\mathbb{R}^d$, and hence $\mathcal{H}$ forms a set family). We refer the interested reader to [MSW96] for more details on the connection between linear programming and LP-type problems.

It is known that the combinatorial dimension of this particular LP-type problem is at most $d$ [MSW96]. The VC dimension is also at most $d+1$ [VC15].

In the following, let denote the time needed to solve a linear program with $n$ constraints and $d$ variables.

Proposition 4.1.

For any linear program with constraints and dimension :

  • The time needed to compute a basis of given constraints is

  • The time needed to compute all constraints that violate a given basis of size among constraints is

Proof.

To find a basis of a set of constraints, we first solve the LP only given the constraints in to obtain a point with optimal value . Recall that in our mapping of LP to an LP-type problem, we need to find a lexicographically smallest optimal solution on constraints in , which may not be the point even though the objective value is still . Hence, we now write a separate linear program:

This allows us to find an optimal solution to the LP with the minimum value of the first coordinate. Repeating this procedure for $d$ iterations, in the $j$-th iteration fixing the coordinates computed so far and finding the minimum value of the $j$-th coordinate, allows us to find the lexicographically smallest optimal solution. These LPs are all $d$-dimensional with constraints, and hence can be solved in time in total, finalizing the first part.

A basis of size at most $d$ in a linear program consists of constraints of the LP that are all made tight by the assignment of the variables. Hence, given the basis, we only need to solve the linear program on this system of linear inequalities to determine an assignment that is tight for all the constraints in the basis. This can be done in time (as before). After this, we can simply check the resulting $d$-dimensional vector against all the constraints and add each one as a violating set if the vector does not satisfy the constraint in time, finalizing the second part.       
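The lexicographic tie-breaking in this proof can be sketched with an off-the-shelf LP solver. The Python snippet below uses scipy.optimize.linprog and assumes constraints of the form $Ax \ge b$ (as in our earlier illustration) and a feasible, bounded program; it first computes the optimal value and then solves $d$ more LPs, fixing one coordinate at a time.

import numpy as np
from scipy.optimize import linprog

def lex_smallest_optimum(c, A, b):
    # Lexicographically smallest minimizer of c^T x subject to A x >= b.
    c, A, b = np.asarray(c, float), np.asarray(A, float), np.asarray(b, float)
    d = len(c)
    free = [(None, None)] * d
    # scipy uses A_ub x <= b_ub, so encode A x >= b as -A x <= -b.
    opt = linprog(c, A_ub=-A, b_ub=-b, bounds=free).fun
    x_fixed = []
    for j in range(d):
        # Minimize x_j subject to A x >= b and c^T x <= opt (this pins the objective
        # value), with the earlier coordinates already fixed.
        obj = np.zeros(d)
        obj[j] = 1.0
        A_j = np.vstack([-A, c])
        b_j = np.append(-b, opt)
        bnd = [(v, v) for v in x_fixed] + [(None, None)] * (d - j)
        x_fixed.append(linprog(obj, A_ub=A_j, b_ub=b_j, bounds=bnd).x[j])
    return np.array(x_fixed)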

Plugging in the currently best known bound for by [LS14] in Proposition 4.1, and the aforementioned bounds on , we can prove the following theorem using Theorems 1, 2, and 3.

Theorem 4.

We give the following randomized algorithms for $d$-dimensional linear programming with $n$ constraints. For any and :


  • Streaming: An -pass algorithm with space in time.

  • Coordinator: An -round algorithm with total communication in which the coordinator and each site spend time and time, respectively, where is the number of constraints on site .

  • MPC: An -round algorithm with load per machine and time in total.

4.2 Linear Support Vector Machine

In the Linear Support Vector Machine (SVM) problem [BGV92], we have a set of tuples $(x_1, y_1), \ldots, (x_n, y_n)$ such that for each index $i$, $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. The goal is to compute a hyperplane which is the outcome of the following quadratic optimization problem [BGV92]:

(6)

From a geometrical point of view, the problem (6) corresponds to finding a hyperplane which separates the set of points according to their labels with the maximum margin (if possible); see, e.g., [BGV92] for more information on this fundamental problem. (Our algorithm works for the hard-margin Linear SVM; in the case of the soft-margin Linear SVM, the optimization problem can also be formulated as an LP-type problem, but the dimension of such a formulation is large – proportional to the size of the input.) Note that the problem (6) is not a linear program. However, one can show that it is an LP-type problem where the ground set is a set family in which every set contains the points that satisfy a particular constraint, and $f$ computes the optimal solution of (6) given only the constraints in the given subset [MSW96] (unlike linear programming, the optimal solution to (6) under any set of constraints is unique and hence we do not need to take the lexicographically smallest solution).
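For reference, a standard formulation of the hard-margin Linear SVM consistent with the description above (our notation):

\[
\min_{w \in \mathbb{R}^d,\; b \in \mathbb{R}} \; \frac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \langle w, x_i \rangle + b \right) \ge 1 \quad \text{for all } i \in \{1, \ldots, n\},
\]

where each constraint asks the labeled point $(x_i, y_i)$ to lie on the correct side of the hyperplane with margin at least $1/\lVert w \rVert$.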

The combinatorial dimension of is  [MSW96], and the VC dimension of is  [VC15]. In the following, let denote the time needed to solve an instance of the Linear SVM problem with $n$ constraints and $d$ variables. We show how to implement the basis computation and violation test for Linear SVM in the following proposition.

Proposition 4.2.

For any Linear SVM problem with constraints and dimension :

  • The time needed to compute a basis of given constraints is

  • The time needed to compute all constraints that violate a given basis of size among constraints is

Proof.

To find a basis of a set of constraints, we simply need to solve another instance of Linear SVM, i.e., (6), only on the given constraints. This can be done in the claimed time by definition. The second part can also be solved by solving a linear equation, exactly as in Proposition 4.1.       

Plugging in the currently best known bound for by quadratic programming in [YT89] in Proposition 4.2, and the aforementioned bounds on , we can prove Theorem 5 using Theorems 1, 2, and 3.

Theorem 5.

We give the following randomized algorithms for the $d$-dimensional linear support vector machine problem with $n$ constraints. For any and :


  • Streaming: An -pass algorithm with space in time.

  • Coordinator: An -round algorithm with total communication in which the coordinator and each site spend time and time, respectively, where is the number of constraints on site .

  • MPC: An -round algorithm with load per machine and time in total.

4.3 Core Vector Machine

Tsang et al. [TKC05] proposed core vector machines as a way of speeding up kernel methods in SVM training (see [BGV92]). This is achieved by reformulating the original kernel method as an instance of the minimum enclosing ball (MEB) problem, defined as follows: given a set $P$ of points in $\mathbb{R}^d$, find a center $z$ and a minimum radius $r$ such that all the points in $P$ are within a $d$-dimensional sphere of radius $r$ centered at $z$. MEB can be formulated as the following optimization problem:

(7)

This problem is also an LP-type problem formulated similarly to linear programming and Linear SVM [MSW96].
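In our notation (the symbols $P$, $z$, $r$ are ours), the minimum enclosing ball problem can be written as:

\[
\min_{z \in \mathbb{R}^d,\; r \ge 0} \; r
\quad \text{subject to} \quad
\lVert p - z \rVert \le r \quad \text{for all } p \in P,
\]

and squaring both sides of the constraints yields the convex quadratic formulation referred to below.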

The combinatorial dimension of is  [MSW96] and the VC dimension of is  [WD81]. Let denote the time needed to solve an instance of the MEB problem with $n$ constraints and $d$ variables. The following proposition shows how to implement the basis computation and violation test for MEB (the proof is identical to that of Proposition 4.2 and is hence omitted).

Proposition 4.3.

For any MEB problem with constraints and dimension :

  • The time needed to compute a basis of given constraints is

  • The time needed to compute all constraints that violate a given basis of size among constraints is

As MEB can be cast as a convex quadratic program, we have by [YT89] as before. Hence, Theorems 1, 2, and 3 imply the following result.

Theorem 6.

We give the following randomized algorithms for the $d$-dimensional core vector machine problem with $n$ constraints. For any integer :


  • Streaming: An -pass algorithm with space in time.

  • Coordinator: An -round algorithm with total communication in which the coordinator and each site spend time and time, respectively, where is the number of constraints on site .

  • MPC: An -round algorithm with load per machine and time in total.

5 Lower Bounds

In this section we prove information-theoretic lower bounds for linear programming that hold against any algorithm. We obtain our lower bounds by establishing communication complexity lower bounds for 2-dimensional linear programming, and then translating them to lower bounds in the big data models. In the following, we first give some background on communication complexity and then present an intermediate problem, called the two-curve intersection problem (TCI), that we consider en route to proving our result for linear programming. We then prove a lower bound for TCI and present its implications for linear programming in the streaming and coordinator models.

5.1 Background

Communication Complexity.

We focus on the standard two-party communication complexity model of Yao [Yao79]. In this model, Alice and Bob receive inputs $x$ and $y$, respectively. In an $r$-round protocol, Alice and Bob can communicate up to $r$ messages with each other. In particular, for an even $r$, Bob first sends a message to Alice, followed by a message from Alice to Bob, and so on, until Bob receives the last message and outputs the answer. For an odd $r$, the only difference is that Alice starts first and then the players continue as before until Bob outputs the answer.

The communication complexity of a problem $P$, denoted by $\mathsf{CC}(P)$, is the minimum worst-case communication cost of any protocol (possibly randomized) that can solve $P$ with probability at least $2/3$. The $r$-round communication complexity of $P$, denoted by $\mathsf{CC}^{(r)}(P)$, is defined similarly with respect to protocols that are allowed at most $r$ rounds of communication.

Augmented Indexing. In the Augmented Indexing problem, Alice is given a binary string $x \in \{0,1\}^N$, and Bob is given an index $i \in [N]$ plus the first $i-1$ bits of the string $x$, i.e., $x_1, \ldots, x_{i-1}$. The goal is for Bob to output the bit $x_i$. It is well-known that the one-way (one-round) communication complexity of this problem is $\Omega(N)$ (see, e.g., [MiltersenNSW98]).

Information Theory.

Throughout this section, we use bold-face fonts, say , to denote random variables, and normal font, say , to denote their realizations. For a random variable , denotes its support and its distribution. We sometimes abuse the notation and use and interchangeably. Furthermore, for a -tuple and any integer , we define and .

Our proof relies on basic concepts from information theory, which we review briefly here. For a broader introduction, we refer the interested reader to the excellent text by Cover and Thomas [ITbook].

Entropy and Mutual Information.

The Shannon entropy of a random variable $\mathbf{A}$ is defined as $H(\mathbf{A}) := \sum_{A \in \mathrm{supp}(\mathbf{A})} \Pr[\mathbf{A} = A] \cdot \log\frac{1}{\Pr[\mathbf{A} = A]}$. The conditional entropy of $\mathbf{A}$ conditioned on a random variable $\mathbf{B}$ is defined as $H(\mathbf{A} \mid \mathbf{B}) := \mathbb{E}_{B \sim \mathbf{B}}\left[H(\mathbf{A} \mid \mathbf{B} = B)\right]$. The (conditional) mutual information between $\mathbf{A}$ and $\mathbf{B}$ is $I(\mathbf{A} ; \mathbf{B}) := H(\mathbf{A}) - H(\mathbf{A} \mid \mathbf{B})$ (respectively, $I(\mathbf{A} ; \mathbf{B} \mid \mathbf{C}) := H(\mathbf{A} \mid \mathbf{C}) - H(\mathbf{A} \mid \mathbf{B}, \mathbf{C})$). We shall use the following basic properties of entropy and mutual information throughout.

Fact 5.1 (cf. [ITbook]; Chapter 2).

Let $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$ be four (possibly correlated) random variables.

  1. $0 \le H(\mathbf{A}) \le \log |\mathrm{supp}(\mathbf{A})|$. The right equality holds iff $\mathbf{A}$ is uniform over its support.

  2. $I(\mathbf{A} ; \mathbf{B} \mid \mathbf{C}) \ge 0$. The equality holds iff $\mathbf{A}$ and $\mathbf{B}$ are independent conditioned on $\mathbf{C}$.

  3. Conditioning on a random variable can only reduce the entropy: $H(\mathbf{A} \mid \mathbf{B}, \mathbf{C}) \le H(\mathbf{A} \mid \mathbf{B})$. The equality holds iff $\mathbf{A}$ and $\mathbf{C}$ are independent conditioned on $\mathbf{B}$.

  4. Chain rule for mutual information: $I(\mathbf{A}, \mathbf{B} ; \mathbf{C} \mid \mathbf{D}) = I(\mathbf{A} ; \mathbf{C} \mid \mathbf{D}) + I(\mathbf{B} ; \mathbf{C} \mid \mathbf{A}, \mathbf{D})$.

Measures of Distance Between Distributions.

For two distributions and