1 Introduction
As machine learning becomes pervasive, how to effectively support machine learning tasks in database systems has become an imminent question. In a recent paper [MVPV17], Makrynioti et al. observed that many machine learning problems can be expressed as linear programs (LPs). They designed a level of abstraction called SolverBlox on top of a declarative language, LogiQL (an extended version of Datalog [ACGKOPVW15]), as a framework for expressing linear program formulations. A query written in SolverBlox is then translated to a format supported by an LP solver, which computes the solution. In this paper we consider the algorithmic side of this research direction; that is, we focus on the design of efficient LP solvers for large-scale datasets. In particular, we propose algorithms for linear programming in three popular “big data” models, namely, the coordinator model [PVZ12], the streaming model [MP80, AMS99], and massively parallel computation (MPC) [KSV10, GSZ11, BKS17]. We also provide almost matching lower bounds when the dimensionality of the linear program is a fixed constant.
In the rest of the introduction we will start with the definition of the problem and the description of the computation models, and then present our results and discuss previous work.
Problem Definition.
The basic linear programming problem can be described as follows: we have a set of variables $x_1, \ldots, x_d$ and a set of linear constraints, each of which (indexed by $i$) is of the form $a_{i,1} x_1 + \cdots + a_{i,d} x_d \leq b_i$, where the $a_{i,j}$'s and $b_i$'s are coefficients and $d$ is the dimension of the problem. We also have a linear objective function $c_1 x_1 + \cdots + c_d x_d$. The goal is to find an assignment to the variables that minimizes the objective function while satisfying all the constraints.
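As a concrete (if naive) illustration of the problem statement, the following sketch solves a tiny 2-dimensional LP by enumerating vertices of the feasible region. This is only a brute-force baseline for intuition, not one of the algorithms proposed in this paper; the function name and the $O(n^3)$ enumeration are ours.

```python
from itertools import combinations

def solve_lp_2d(c, constraints, eps=1e-9):
    """Brute-force 2-D LP: minimize c[0]*x + c[1]*y subject to
    a*x + b*y <= t for each (a, b, t) in constraints.
    Enumerates candidate vertices (pairs of tight constraints);
    assumes a bounded, nonempty feasible region."""
    best = None
    for (a1, b1, t1), (a2, b2, t2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel constraints: no unique intersection point
        x = (t1 * b2 - t2 * b1) / det
        y = (a1 * t2 - a2 * t1) / det
        # Keep the vertex only if it satisfies every constraint.
        if all(a * x + b * y <= t + eps for a, b, t in constraints):
            val = c[0] * x + c[1] * y
            if best is None or val < best[0]:
                best = (val, (x, y))
    return best  # (objective value, optimizer), or None if none found

# Example: minimize -x - y over the unit square [0,1]^2.
square = [(1, 0, 1), (-1, 0, 0), (0, 1, 1), (0, -1, 0)]
print(solve_lp_2d((-1, -1), square))  # optimum at (1, 1) with value -2
```

The optimum of a (bounded, feasible) LP is always attained at a vertex, which is why enumerating intersections of constraint pairs suffices in two dimensions.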
Linear programming is a special case of a more general class called LP-type problems [MSW96], which we will discuss in detail in Section 2.1. Besides linear programming, LP-type problems include several other important problems in machine learning, such as Linear Support Vector Machines (SVMs) [BGV92], which are widely used in classification and regression analysis [GJ09, Burges98, CB99], and Core Vector Machines [TKC05], which are used to speed up general SVM computation (i.e., Linear SVM augmented by the kernel trick [BGV92]). We will give the formal definitions of these problems in Section 4. The algorithms we propose in this paper work for general LP-type problems.

In this paper we are interested in the scenario where the dimension of the linear program (and of the LP-type problem in general) is small compared to the number of constraints. Various examples of linear programming and LP-type problems in machine learning are of this type: SVMs and regression problems (in particular, least absolute error regression, which can be modeled by linear programming) are often over-constrained; in the problems of Chebyshev approximation and linear separability, the number of variables is typically small.
Computational Models.
We study linear programming and LP-type problems in the following big data models.


The (multi-pass) streaming model. In this model, we have a single machine that can make linear scans over the input data sequence. The task is to compute some function defined on the input data sequence. The goal is to minimize the memory usage and the number of passes needed. This model captures data that cannot fit in memory, and on which sequential scans are much more efficient than random access.

The coordinator model. In this model, we have $k$ sites and a central coordinator. Each site is connected to the coordinator by a two-way communication channel. The input is initially partitioned among the sites. The task is for the sites and the coordinator to jointly compute some function defined on the union of their datasets. The computation proceeds in rounds: at the beginning of each round, the coordinator sends a message to each site, and then each site replies with a message back to the coordinator. At the end of the computation, the coordinator outputs the answer. The goal is to minimize the total bits of communication and the number of rounds of the computation. This model fits data that is inherently distributed or cannot fit in the storage of a single machine.

Massively parallel computation (MPC). In this model, we have a set of machines interconnected by a network that allows communication between any pair of machines. Similar to the coordinator model, the input is partitioned among the machines, and the task is for them to compute some function defined on the union of their datasets. The computation again proceeds in rounds. In each round, the machines communicate with each other over the network by sending and receiving messages. The message sent by a machine in each round is a function of its input data and all messages it has received in previous rounds. Our goal is to minimize the number of rounds of the computation, and the maximum number of bits of information sent or received by a machine in any round (often called the load in the literature). MPC has become the model of choice for studying parallel computation in computer clusters.
Description of the input. Since we are dealing with low-dimensional problems, we assume that the memory on each site/machine in each model is at least proportional to $d$, the dimension of the problem, but significantly smaller than $n$, the number of constraints. As a result, the input is presented by giving the constraints one by one to the algorithm in the streaming model, or by partitioning them across different sites/machines in the coordinator and MPC models.
1.1 Our Contributions and Related Work
In the following, we present our results for linear programming in the three big data models described above, and postpone the specifics of their generalization to LPtype problems to later sections. Our main upper bound result is the following.
Result 1.
We give the following polynomial time algorithms for $d$-dimensional linear programming with $n$ constraints. For any integer and parameter :


Streaming: An pass streaming algorithm with space.

Coordinator: An round distributed algorithm with total communication.

MPC: An round algorithm with load per machine.
Our algorithms are randomized and output the correct answer with probability $1 - 1/n^c$ for any desired constant $c$. By Result 1, we obtain linear programming algorithms that use few passes or rounds, and whose space, communication, or load requirements in each model are almost independent of the number of constraints. For low-dimensional instances, this results in dramatic savings compared to direct implementations of standard LP algorithms in these models.
Previously, Chan and Chen [CC07] proposed an pass streaming algorithm for linear programming that uses space. Result 1 improves upon this result by achieving an exponentially smaller pass-complexity in terms of .
In the coordinator model, Daumé et al. [DPSV12] gave an algorithm using communication based on an adaptation of the algorithm of [CC07]. The round-complexity and communication cost of this algorithm again depend exponentially on the dimension.
In the MPC model, very recently Tao [Tao18] gave a round MPC algorithm with load when (for any ). This algorithm is then used as a building block for an interesting database application called entity matching with linear classification. The round complexity of our MPC algorithm in Result 1 improves that of [Tao18] by an exponential factor.
To summarize, Result 1 exponentially improves upon the pass/round complexities of the state-of-the-art, while using the same or smaller space, communication, or load in the considered big data models.
We complement our algorithms with almost tight lower bounds for any fixed dimension (even $d = 2$) in the streaming and coordinator models.
Result 2.
We give the following lower bounds for $d$-dimensional linear programming with $n$ constraints. For any integer :


Streaming: Any pass algorithm requires space.

Coordinator: Any round algorithm requires communication even when the number of sites is only .
Our lower bounds hold even for randomized algorithms that output the correct answer with probability at least .
A few remarks about Result 2 are in order. Firstly, it is easy to see that linear programming in one dimension is a trivial task in the models we consider. Result 2 thus proves the lower bound for the smallest non-trivial dimension. We note that unlike Result 1, which applies to all three models, Result 2 does not prove any lower bound for MPC algorithms. Proving lower bounds for MPC algorithms is considered a challenging task, as it has serious implications for long-standing open problems in complexity theory [RoughgardenVW16]. Hence, no unconditional lower bounds are known so far in the literature for any problem in the MPC model, and Result 2 is no exception.
Prior to our work, Chan and Chen [CC07] gave a lower bound for $d$-dimensional linear programming for a restricted family of deterministic streaming algorithms in the decision tree model (the only permitted operation of these streaming algorithms is testing the sign of a function evaluated at the coefficients of a subset of stored hyperplanes). Their lower bound states that algorithms of this type require space to compute the solution in passes. Our lower bound in Result 2 is much stronger in that it proves a similar pass-space tradeoff for all streaming algorithms (even randomized ones). Finally, Guha and McGregor [GM08] showed that there is a fixed-dimensional optimization problem for which any pass streaming algorithm requires space. However, it is not clear how to adapt their proof to linear programming, since their optimization problem involves quadratic constraints [McGregor18].

Further Related Work.
Special cases of linear programming have been studied previously in the big data models. In particular, Ahn and Guha gave multi-pass streaming algorithms for approximating packing LPs [AG11], and Indyk et al. [IMRUVY17] gave similar algorithms for covering LPs (see also [AssadiKL16]). These results focus on high-dimensional linear programs (non-constant $d$) and only on packing/covering LPs, and are hence quite different from our approach in this paper.
Unlike the case for big data models, low-dimensional linear programming has been studied extensively in the RAM model since the 1980s. Megiddo [Megiddo84] gave an algorithm for $d$-dimensional linear programming whose time complexity is linear in the number of constraints $n$ for any fixed $d$. This bound was subsequently improved by a series of papers [Clarkson86, Dyer86, DF89, Kalai92, Clarkson95, ClarksonS89, MSW96, BCM99, Chan16].
2 Preliminaries
Notations.
For integers , we define , , and (we define and analogously). We use capital letters for sets and random variables, and calligraphic letters for set families. We use the notation to denote a function of the form .

Throughout the paper, we say an event happens “with high probability” if its probability can be lower bounded by $1 - 1/n^c$ for any desired constant $c$ ($n$ is the number of constraints).
We use the following standard variant of Chernoff bound.
Proposition 2.1 (Chernoff bound).
Suppose $X_1, \ldots, X_m$ are independent random variables taking values in $[0,1]$, and let $X := \sum_{i=1}^{m} X_i$. Then, for any $\varepsilon \in (0,1)$,
$$\Pr\big[\, |X - \mathbb{E}[X]| \geq \varepsilon \cdot \mathbb{E}[X] \,\big] \leq 2 \exp\left(-\frac{\varepsilon^2 \cdot \mathbb{E}[X]}{3}\right).$$
2.1 LPtype Problems
We consider a generalization of linear programming referred to as LP-type problems (the class of LP-type problems is also known as abstract linear programming [Bland78]). An LP-type problem consists of a pair $(S, f)$, where $S$ is a finite set of elements, and $f$ is a function mapping subsets of $S$ to a range that is assumed to have a total order. The function $f$ satisfies two properties:

Monotonicity: for any two sets $A \subseteq B \subseteq S$, $f(A) \leq f(B)$.

Locality: for any two sets $A \subseteq B \subseteq S$ with $f(A) = f(B)$, and any element $s \in S$, if $f(B \cup \{s\}) > f(B)$, then $f(A \cup \{s\}) > f(A)$.
For an LP-type problem $(S, f)$, we call a set $B \subseteq A$ a basis of $A \subseteq S$ if $f(B) = f(A)$, and $f(B') < f(B)$ for every proper subset $B' \subsetneq B$. The goal is to compute a basis $B$ of $S$, i.e., one with $f(B) = f(S)$. We say an element $s$ violates a set $B$ if $f(B \cup \{s\}) > f(B)$. It helps to think of an LP-type problem as an optimization problem in which the elements of $S$ are the constraints, and $f(A)$ computes the best feasible solution subject to the set of constraints $A$. In the case when the optimal solution is not unique, we break ties arbitrarily. Computing $f(S)$ hence amounts to computing the optimal solution subject to all the constraints (we will make this connection explicit in the context of linear programming and other problems in Section 4).
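To make the abstract definition concrete, here is a minimal instantiation (names and conventions are ours): the 1-dimensional LP "minimize $x$ subject to $x \geq a$ for all $a \in A$", whose objective $f(A) = \max(A)$ satisfies monotonicity and locality, and whose bases are singletons.

```python
from itertools import combinations

def f(A):
    """LP-type objective for the 1-D LP 'minimize x s.t. x >= a for a in A':
    the best feasible value is max(A); f(emptyset) = -inf by convention."""
    return max(A, default=float('-inf'))

def is_basis(B, A):
    """B is a basis of A: f(B) == f(A) and no proper subset achieves f(B)."""
    if f(B) != f(A):
        return False
    return all(f(set(C)) < f(B) for r in range(len(B))
               for C in combinations(B, r))

def violates(s, B):
    """Element s violates B iff adding it increases the optimum."""
    return f(B | {s}) > f(B)

A = {3.0, 1.0, 7.0, 5.0}
assert f(A) == 7.0
assert is_basis({7.0}, A)            # one element suffices: dimension 1
assert not is_basis({7.0, 3.0}, A)   # {7.0, 3.0} is not minimal
assert violates(9.0, {7.0})          # a tighter constraint violates the basis
assert not violates(2.0, {7.0})
```

The combinatorial dimension of this toy problem is 1, since every basis is a single element.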
Combinatorial Dimension.
Note that an LP-type problem may have several bases of different sizes. We define the combinatorial dimension of an LP-type problem $(S, f)$ to be the maximum cardinality of a basis, denoted by $\delta(S, f)$ ($\delta$ for short when $S$ and $f$ are clear from the context).
2.2 Nets and VC Dimension
We now define another important notion that we use in designing our algorithms.
VC Dimension.
A set system is a pair $(X, \mathcal{F})$ consisting of a universe $X$ and a family $\mathcal{F}$ of subsets of $X$. Let $Y \subseteq X$ be a set. Define the intersection of a set family $\mathcal{F}$ with $Y$ to be the set family $\mathcal{F} \cap Y := \{F \cap Y : F \in \mathcal{F}\}$.
We say that $Y$ is shattered by $\mathcal{F}$ if $\mathcal{F} \cap Y$ contains all the subsets of $Y$. The VC dimension of the set system $(X, \mathcal{F})$, denoted by $\lambda(X, \mathcal{F})$ (or $\lambda$ for short when the set system is clear from the context), is then the cardinality of the largest set $Y$ that is shattered by $\mathcal{F}$.
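The shattering definition can be checked mechanically on small set systems. The following sketch (all names are ours, not from the paper) verifies the textbook fact that intervals on the line shatter every 2-point set but no 3-point set, so their VC dimension is 2.

```python
from itertools import combinations

def shattered(points, ranges):
    """True iff `ranges` (a family of sets) shatters `points`:
    every subset of `points` equals points ∩ R for some range R."""
    pts = set(points)
    realized = {frozenset(pts & R) for R in ranges}
    needed = {frozenset(C) for r in range(len(pts) + 1)
              for C in combinations(pts, r)}
    return needed <= realized

# Ranges: all intervals [lo, hi] restricted to a small ground set.
ground = [1, 2, 3, 4, 5]
intervals = [frozenset(x for x in ground if lo <= x <= hi)
             for lo in ground for hi in ground]

assert shattered([2, 4], intervals)         # any 2 points are shattered
assert not shattered([1, 3, 5], intervals)  # {1,5} without 3 is impossible
# Hence the VC dimension of intervals on the line is exactly 2.
```

For the halfspace systems arising from linear programs, the VC dimension is similarly bounded in terms of the dimension $d$.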
Net.
Given a set system $(X, \mathcal{F})$ and a weight function $w : \mathcal{F} \to \mathbb{R}^{+}$, for any subfamily $\mathcal{G} \subseteq \mathcal{F}$, let $w(\mathcal{G}) := \sum_{F \in \mathcal{G}} w(F)$. We say a subfamily $\mathcal{N} \subseteq \mathcal{F}$ is an $\varepsilon$-net of $\mathcal{F}$ with respect to $w$, for a parameter $\varepsilon \in (0,1)$, iff for any point $p \in X$ such that $w(\{F \in \mathcal{F} : p \notin F\}) \geq \varepsilon \cdot w(\mathcal{F})$, there exists a set $N \in \mathcal{N}$ with $p \notin N$.
The notion of $\varepsilon$-net is well-studied in the literature (particularly in the computational geometry community [HW87, BG95, Mulmuley94]), and has been used in algorithm design for many problems. We use the following simple randomized construction of an $\varepsilon$-net for designing a distributed version of Clarkson’s algorithm for LP-type problems.
Lemma 2.2 ([HW87]).
For any set system $(X, \mathcal{F})$ of VC dimension $\lambda$, any weight function $w$, and any $\varepsilon, \delta \in (0,1)$, a set family obtained by randomly sampling
(1) $\;\; O\left(\frac{\lambda}{\varepsilon} \log \frac{\lambda}{\varepsilon} + \frac{1}{\varepsilon} \log \frac{1}{\delta}\right)$
sets with probability proportional to their weights is an $\varepsilon$-net of $\mathcal{F}$ with probability at least $1 - \delta$.
3 Algorithms
In this section we present our algorithms for Result 1. We will work with a special class of LP-type problems that contains the most natural LP-type problems we are aware of, including linear programming, Linear SVMs, and Core Vector Machines mentioned earlier. In particular, we require the LP-type problem to satisfy the following properties:


(P1) Each constraint $h \in S$ is associated with a set $R_h \subseteq \mathcal{R}$ of elements ($\mathcal{R}$ is the range of $f$).

(P2) For any $A \subseteq S$, $f(A)$ is the minimal element of $\bigcap_{h \in A} R_h$.
It is useful to think of $R_h$ as the set of feasible solutions for the constraint $h$. For example, in the case of linear programming, the range is $\mathbb{R}^d$ with the natural ordering induced by the scalar product with the vector $c$ in the objective function. Each constraint (inequality) $h$ corresponds to the subset $R_h$ of points that satisfy the constraint, and $f(A)$ is equal to the point that satisfies all constraints in $A$ and has a minimal scalar product with $c$. For convenience, we use $h$ and $R_h$ interchangeably.
For this special class of LP-type problems, we define the VC dimension of the problem as the VC dimension of the set system $(\mathcal{R}, \{R_h : h \in S\})$.
In the following, we first give a general meta-algorithm for solving LP-type problems with properties (P1) and (P2), and then show how to implement this meta-algorithm efficiently in each model.
3.1 The Meta Algorithm for LPType Problems
Our meta-algorithm follows Clarkson’s algorithm [Clarkson95] for linear programming, but we use a different sampling procedure (based on $\varepsilon$-nets), which enables us to work with general LP-type problems of bounded VC dimension; it also significantly simplifies the analysis and facilitates the implementation of our algorithm in the big data models we consider. We further use a different weight increase rate after each iteration, which is essential for reducing the number of passes in the streaming model, and the number of rounds in the coordinator and MPC models.
The algorithm proceeds in iterations. We maintain a weight function throughout the algorithm, initialized by giving every constraint weight one. In each iteration, we first sample a family of sets with probability proportional to their weights so as to obtain an $\varepsilon$-net (according to Lemma 2.2). We then compute a basis of the sampled family, and the set of constraints that violate this basis. If the total weight of the violating constraints is small relative to the total weight, we say this iteration “succeeds”, and increase the weights of all violating sets by a multiplicative factor. Otherwise, we say this iteration “fails”, and continue to the next one without modifying the weights. A pseudocode is provided in Algorithm 1.
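The following sketch instantiates the meta-algorithm on a toy 1-dimensional problem (minimize $x$ subject to $x \geq a_i$, so $f = \max$ and every basis is a single constraint). The sample size and weight factor are illustrative stand-ins for the paper's parameter choices, and the names are ours.

```python
import random

def clarkson_1d(constraints, eps=0.1, sample_size=30, factor=2.0, seed=0):
    """Sketch of the iterative-reweighting meta-algorithm on a toy 1-D
    LP-type problem: minimize x subject to x >= a for every a in
    `constraints`.  f(A) = max(A), and a basis is a single element."""
    rng = random.Random(seed)
    w = {a: 1.0 for a in constraints}
    for _ in range(10_000):  # safety cap; converges long before this
        # eps-net step: sample constraints with probability ~ weight.
        population, weights = zip(*w.items())
        sample = rng.choices(population, weights=weights, k=sample_size)
        basis = max(sample)                       # basis of the sample
        violators = [a for a in constraints if a > basis]
        if not violators:
            return basis                          # satisfies everything
        if sum(w[a] for a in violators) <= eps * sum(w.values()):
            # "Successful" iteration: boost the violators' weights so they
            # are more likely to be sampled later (Clarkson-style step).
            for a in violators:
                w[a] *= factor
    raise RuntimeError("did not converge")

assert clarkson_1d(list(range(100))) == 99  # optimum is the largest a_i
```

Because the algorithm returns only when no constraint violates the basis, any returned value is guaranteed optimal; the reweighting only affects how quickly the heavy constraints get sampled.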
In the following, we first establish the correctness of the metaalgorithm and then bound the number of iterations it needs.
Lemma 3.1.
When Algorithm 1 stops, it correctly computes $f(S)$.
Proof.
At the end of the algorithm, the set of violators is empty. This means that for any constraint $h \in S$, adding $h$ to the final sample does not change the value of $f$, by the monotonicity property of $f$. By the locality property and induction, we obtain that the computed basis $B$ satisfies $f(B) = f(S)$, finalizing the proof.
We now bound the number of iterations. We say that an iteration of Algorithm 1 is successful iff the total weight of the violating constraints found in this iteration is at most an $\varepsilon$ fraction of the total weight.
Claim 3.2.
Each iteration of Algorithm 1 is successful with probability at least .
Proof.
Since the VC dimension of the set system is bounded, by Lemma 2.2, with probability at least $1 - \delta$, the family sampled in this iteration is an $\varepsilon$-net with respect to the weight function. In the following, we condition on this event.
Let $x^*$ be the value computed on the sampled family $\mathcal{N}$. By property (P2) of the LP-type problems we consider, $x^*$ is the minimal element in the intersection of all sets in $\mathcal{N}$ according to the ordering of the range. For any set to violate the computed basis, it must not contain $x^*$; otherwise adding it would not change the value of $f$, a contradiction. Recall that $\mathcal{V}$ is the family of all sets that violate the basis. Suppose towards a contradiction that $w(\mathcal{V}) > \varepsilon \cdot w(\mathcal{F})$. Since none of the sets in $\mathcal{V}$ contain $x^*$, and the sampled family is an $\varepsilon$-net, by definition there is a sampled set that does not contain $x^*$. But this is in contradiction with the definition of $x^*$: since $x^*$ is the minimal element of the intersection of all sampled sets, it belongs to every sampled set. We thus have $w(\mathcal{V}) \leq \varepsilon \cdot w(\mathcal{F})$, finalizing the proof.
Lemma 3.3.
The number of iterations in Algorithm 1 is with probability at least , where denotes the combinatorial dimension of .
Proof.
Recall that the weight function is updated only when an iteration is successful, and each iteration succeeds with probability at least by Claim 3.2. By Chernoff bound (Proposition 2.1), we have that if the algorithm terminates in iterations, then with probability at least , at least of these iterations are successful.
We now focus on successful iterations. Let $w_k$ be the weight function after the $k$-th successful iteration. Initially, every constraint has weight one. We claim that for any integer $k$, if Algorithm 1 reaches the $k$-th successful iteration, then
(2) 
We establish Eq (2) in the following two claims.
Claim 3.4.
For any integer , we have .
Proof.
Fix an arbitrary basis of for some (recall that, by definition, the combinatorial dimension is the size of the largest basis). Since , we have for any . We thus only need to show .
The first observation is that in any iteration, if then we must have . Indeed, if , then where the first equality is by the locality property of and induction, and the second equality holds since is a basis for . However, this is in contradiction with the fact that .
Let us now define $B_i$ as the basis of the $\varepsilon$-net computed in the $i$-th successful iteration. For any element, let its violation count be the number of iterations in which it violates the computed basis. That is,
Since in each of the first successful iterations, there must exist at least one which violates for each . We thus have Moreover, by the weight update rule of the algorithm, we can write the weight of as By combining these and Jensen’s inequality we have
since . This concludes the proof of Claim 3.4.
Claim 3.5.
For any integer , we have .
Proof.
For any iteration, the weight update procedure of Algorithm 1 gives
(3) 
Moreover, by the success condition of the algorithm, we have
(4) 
by the choice of in the algorithm. Combining (3) and (4) we have
We now return to the analysis of the number of iterations. By Eq (2) we have , hence . Since , we have . Therefore the number of successful iterations cannot exceed , and hence the total number of iterations is bounded by with probability .
Remark 3.6.
We can easily turn the Las Vegas algorithm of this section (Algorithm 1) into a Monte Carlo algorithm by the following modifications: first, we pick an $\varepsilon$-net of size , and second, the algorithm returns “FAIL” whenever , which will not happen in the first iterations with probability at least .
3.2 Implementation in the Streaming Model
Starting from this section, we show how to implement Algorithm 1 in the three big data models considered in this paper. We start with the streaming algorithm. In the multi-pass streaming model, the elements of $S$ arrive one by one, and the number of constraints is known to the algorithm at the beginning. We allow the algorithm to make multiple linear scans of the input.
The main challenge in the streaming implementation of Algorithm 1 is that we cannot afford to store the weights of all elements of $S$, which are needed for the $\varepsilon$-net sampling. To resolve this issue, we instead store the bases computed at all the successful iterations (these are the only iterations in which we change the weight function) in a collection, using which we can compute the weight of each element of $S$ on the fly: the weight of a set in a given iteration is determined, via the weight update rule, by the stored bases that it violates. It is immediate to verify that this indeed implements the same weight function as Algorithm 1. It is also easy to see that, having access to these weights, we can sample each set with probability proportional to its weight using the weighted version of reservoir sampling [Chao82], and hence implement each iteration of Algorithm 1 in one pass over the stream.
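The on-the-fly weight computation can be sketched as follows (a hypothetical helper with our names; we use a weight-increase factor of 2 for concreteness, while the paper's rate is a tunable parameter):

```python
def weight(h, bases, violates, K=2.0):
    """Recompute a constraint's current weight from the bases stored for
    the successful iterations: each time h violated a stored basis, its
    weight was multiplied by K, so the weight is K^(#violated bases)."""
    w = 1.0
    for B in bases:
        if violates(h, B):
            w *= K
    return w

# Toy instance matching the 1-D example: h violates basis b iff h > b.
stored_bases = [3, 5, 8]
viol = lambda h, b: h > b
assert weight(9, stored_bases, viol) == 8.0  # violated all three bases
assert weight(4, stored_bases, viol) == 2.0  # violated only basis 3
assert weight(1, stored_bases, viol) == 1.0
```

Storing only the (small) bases rather than all $n$ weights is what keeps the space of the streaming algorithm independent of $n$ up to the number of successful iterations.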
The rest of Algorithm 1 can be implemented in the streaming model in a straightforward way. Let be the time complexity of computing a basis for a set of size , and be the time complexity of finding all elements in a set of size which violate a set of size , i.e., all such that . This allows us to prove the following theorem.
Theorem 1.
Suppose is an LP-type problem with combinatorial dimension , VC dimension , and bit-complexity for each element of . For any integer , we can compute with high probability in the streaming model, using passes and space. The total running time of the algorithm is also .
Proof.
The correctness of the algorithm follows from Lemma 3.1. As each iteration of Algorithm 1 can be implemented in one pass, the total number of passes needed by our streaming algorithm is with high probability by Lemma 3.3.
Recall that the size of each $\varepsilon$-net sampled in Algorithm 1 is by the choice of in the algorithm and in Lemma 2.2. The space needed by the algorithm to store in each iteration is , which is equal to bits. We also need to store all bases of the successful iterations, which requires (since ) as each basis requires bits to represent and there is a total of such bases.
Each pass of the algorithm involves performing a violation test over the elements of , which takes time, and computing a basis of elements, which takes time. The runtime follows by multiplying these costs by the number of passes, and by the choice of .
3.3 Implementation in the Coordinator Model
Recall that in the coordinator model, the input set is arbitrarily partitioned among the sites, so that for each $i$, the $i$-th site receives a subset of the elements. The sites and the coordinator want to jointly compute $f(S)$ via communication. The function $f$ is public knowledge; that is, all parties know how to evaluate the function on any subset of the input, assuming that subset resides entirely on one machine.
Similar to the streaming model, the main step here is the implementation of the $\varepsilon$-net sampling procedure in Algorithm 1.
Lemma 3.7.
The coordinator can sample a subset of size according to the weight function using rounds and bits of communication, where is the number of times the weight function has been updated when simulating Algorithm 1 in the coordinator model.
Proof.
The sampling algorithm is as follows. In the first round, each site sends the total weight of its local elements to the coordinator; each such weight can be described in bits.

In the second round, the coordinator generates i.i.d. random site indices from the distribution in which a site is chosen with probability proportional to its reported total weight, and sends the $i$-th site the number of samples assigned to it. After obtaining this number, site $i$ samples that many elements from its local set with probability proportional to their local weights, and sends the sampled elements to the coordinator. The communication cost of this round is bounded by bits.
Finally, the sampling is indeed with respect to the weight function , since
This concludes the proof.
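The correctness argument of Lemma 3.7 (composing the two sampling stages yields overall probability proportional to the weights) can be checked mechanically. The following sketch uses exact rational arithmetic and hypothetical names of our own choosing.

```python
from fractions import Fraction

def two_round_sample_prob(site_weights, element):
    """Probability that `element` is the one sampled under the two-round
    protocol: the coordinator picks a site with probability proportional
    to its total weight, then that site samples locally by weight.
    `site_weights` is a list of dicts {element: weight}; element names
    are assumed globally unique across sites."""
    W = sum(sum(s.values()) for s in site_weights)
    for s in site_weights:
        if element in s:
            W_i = sum(s.values())           # site's total weight
            # P(site i) * P(element | site i)
            return Fraction(W_i, W) * Fraction(s[element], W_i)
    return Fraction(0)

sites = [{'a': 1, 'b': 3}, {'c': 2}, {'d': 4}]
total = 10
# The composed probability is exactly w(e)/W for every element:
for s in sites:
    for e, w in s.items():
        assert two_round_sample_prob(sites, e) == Fraction(w, total)
```

The local site weights $W_i$ cancel in the product, which is exactly the cancellation used in the proof above.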
In order to implement Algorithm 1, each site should also be able to determine the set of violating elements in its input. This can be done easily by having the coordinator share the basis computed in each iteration with every site. The proof of Theorem 2 follows directly from that of Theorem 1 by plugging in Lemma 3.7.
Theorem 2.
Suppose is an LP-type problem with combinatorial dimension , VC dimension , and bit-complexity for each element of . For any integer , we can compute with high probability in the coordinator model with machines, using rounds, and communication in total. The local computation time of the coordinator is and the local computation time of the $i$-th site is where .
3.4 Implementation in the MPC Model
The implementation of Algorithm 1 in the MPC model can be done similarly to that in the coordinator model, by choosing one of the machines to play the role of the coordinator. The only problem is that when the number of machines is large, the machines cannot simply send all their messages to the coordinator directly, as this would blow up the load on the coordinator.
Our general strategy is to simulate our implementation of the meta-algorithm for the coordinator model in the MPC model. The main challenge is that once we require a load of roughly per machine, we need to start with machines to fit the whole input across all machines. This means that the number of sites in the simulation is large. But then, if all these machines need to send even one bit to the designated coordinator machine (or vice versa), this requires a load of on the coordinator machine, which is prohibitively large.
In order to fix this, we use the by-now-standard approach of [GSZ11]. There are only two steps in which the coordinator and the machines need to communicate with each other: (1) when the machines need to send a sample of the $\varepsilon$-net, and (2) when the coordinator needs to send the basis to the machines. The latter can be done easily in a small number of MPC rounds: the coordinator first shares this information with a set of other machines in one round; each of these machines next shares it with another set of machines (unique to each original machine). After a logarithmic number of such rounds, all the machines have received this information (see [GSZ11] for more details on this general approach).
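The doubling broadcast described above can be sketched as follows (a simulation only; `fanout` is our stand-in for the per-round number of machines each informed machine can contact, which in the MPC setting is governed by the memory per machine):

```python
import math

def broadcast_rounds(num_machines, fanout):
    """Simulate a [GSZ11]-style tree broadcast: in each round, every
    informed machine informs `fanout` new machines.  Returns the number
    of rounds until all machines are informed, i.e. the round count
    grows logarithmically (base fanout+1) in the number of machines."""
    informed, rounds = 1, 0
    while informed < num_machines:
        informed += informed * fanout  # each informed machine recruits fanout more
        rounds += 1
    return rounds

assert broadcast_rounds(1, 8) == 0
assert broadcast_rounds(9, 8) == 1    # 1 -> 9 machines in one round
assert broadcast_rounds(81, 8) == 2   # 9 -> 81 in two rounds
assert broadcast_rounds(10**6, 8) == math.ceil(math.log(10**6, 9))
```

The same tree, traversed in reverse, lets many machines aggregate values (such as the total constraint weight) toward a single machine with the same logarithmic round count.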
To handle the step in which the machines need to send the $\varepsilon$-net to the coordinator, we do as follows. Recall that the size of the net is at most , and thus it fits in the memory of the coordinator. However, we first need to sample it according to the correct distribution. To do so, we use our approach for implementing the streaming algorithm. Since, by the previous part, we managed to share the basis computed in each iteration with every machine, the machines can, as in the streaming algorithm, compute the weights of all the constraints they hold. The total weight of the constraints can also be computed in rounds using the sort-and-search method of [GSZ11]. As a result, each machine can locally perform its part of the sampling of the $\varepsilon$-net and send the result to the coordinator. To summarize, we have the following theorem.
Theorem 3.
Suppose is an LP-type problem with combinatorial dimension , VC dimension , and bit-complexity for each element of . For any , we can compute with high probability in the MPC model using rounds with load per machine.
4 Examples and Applications
We now give examples of applications of our algorithms for general LP-type problems. We will discuss several fundamental optimization problems in machine learning, namely, linear programming, Linear SVM, and Core SVM. Recall that when implementing our meta-algorithm in each model, we left two functions unspecified: the time needed for performing the violation test and the time needed for computing a basis. In this section we provide concrete bounds for these functions in the context of the concrete problems we study. Throughout this section, we assume that the bit-complexity of each number in the input is bits.
4.1 Linear Programming
A linear program is an optimization problem of the form:
(5) $\;\; \min_{x \in \mathbb{R}^d} \; c^{\top} x \quad \text{subject to} \quad A x \leq b,$
where $A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^{n}$, and $c \in \mathbb{R}^{d}$.
A $d$-dimensional linear program can be modeled as an LP-type problem as follows. Let $\mathcal{H}$ be a set family of size $n$ such that for every constraint in (5), there exists a unique element of $\mathcal{H}$, namely the halfspace in $d$-dimensional Euclidean space containing the points that satisfy this single constraint. We define the function $f$ over subsets of $\mathcal{H}$ such that for every $\mathcal{G} \subseteq \mathcal{H}$, $f(\mathcal{G})$ is the lexicographically smallest point that minimizes the objective value of the LP while satisfying only the constraints in $\mathcal{G}$. The linear program (5) now corresponds to the LP-type problem $(\mathcal{H}, f)$ (we use $\mathcal{H}$ as opposed to our previous notation $S$, since each element of $\mathcal{H}$ is now itself a subset of $\mathbb{R}^d$, and hence $\mathcal{H}$ forms a set family). We refer the interested reader to [MSW96] for more details on the connection between linear programming and LP-type problems.
It is known that the combinatorial dimension of this particular LPtype problem is at most [MSW96]. The VC dimension is also at most [VC15].
In the following, let denote the time needed to solve a linear program with constraints and variables.
Proposition 4.1.
For any linear program with constraints and dimension :

The time needed to compute a basis of given constraints is

The time needed to compute all constraints that violate a given basis of size among constraints is
Proof.
To find a basis of a set $B$ of constraints, we first solve the LP given only the constraints in $B$ to obtain a point with optimal objective value $v^*$. Recall that in our mapping of LPs to LP-type problems, we need to find the lexicographically smallest optimal solution subject to the constraints in $B$, which may not be the point we found, even though its objective value is $v^*$. Hence, we now write a separate linear program: minimize $x_1$ subject to the constraints in $B$ together with the additional constraint $c^{\top} x = v^*$.
This allows us to find an optimal solution to the LP with the minimum value of $x_1$. Repeating this procedure for $d$ iterations, in the $i$-th iteration fixing the coordinates computed so far and minimizing $x_i$, allows us to find the lexicographically smallest optimal solution. These LPs are all $d$-dimensional with a comparable number of constraints, and hence can be solved in time in total, finalizing the first part.
A basis in a linear program consists of constraints of the LP that are all tight at the optimal assignment of the variables. Hence, given the basis, we only need to solve the linear program on this system of linear inequalities to determine the point $x$ that is tight for all the constraints in the basis. This can be done in time (as before). After this, we can simply check the $d$-dimensional vector $x$ against all the constraints, and add each one as a violating set if $x$ does not satisfy the constraint, in time, finalizing the second part.
Plugging the currently best known bound from [LS14] into Proposition 4.1, together with the aforementioned bounds on the combinatorial and VC dimensions, we can prove the following theorem using Theorems 1, 2, and 3.
Theorem 4.
We give the following randomized algorithms for $d$-dimensional linear programming with $n$ constraints. For any and :


Streaming: An pass algorithm with space in time.

Coordinator: An round algorithm with total communication in which the coordinator and each site spend time and time, respectively, where is the number of constraints on site .

MPC: An round algorithm with load per machine and time in total.
4.2 Linear Support Vector Machine
In Linear Support Vector Machine (SVM) problem [BGV92], we have a set of tuples such that for each index , and . The goal is to compute a hyperplane which is the outcome of the following quadratic optimization problem [BGV92]:
(6) $$\min_{w,\, b}\ \frac{1}{2}\lVert w\rVert^{2} \quad \text{s.t.} \quad y_i \left(\langle w, x_i\rangle + b\right) \ge 1 \ \ \text{for all } i.$$
From a geometric point of view, problem (6) corresponds to finding a hyperplane that separates the points according to their labels with the maximum margin (if possible); see, e.g., [BGV92] for more information on this fundamental problem. (Our algorithm works effectively for the hard-margin Linear SVM. In the case of the soft-margin Linear SVM, the optimization problem can also be formulated as an LP-type problem, but the dimension of that formulation is large, proportional to the size of the input.) Note that problem (6) is not a linear program. However, one can show that it is an LP-type problem in which every set in the set family contains the points that satisfy a particular constraint, and the objective computes the optimal solution of (6) given only the chosen constraints [MSW96] (unlike linear programming, the optimal solution to (6) under any set of constraints is unique, and hence we do not need the lexicographic tie-breaking).
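To make (6) concrete, here is a small sketch that solves the hard-margin program directly with SciPy's general-purpose SLSQP solver. This is not the paper's algorithm and is not meant for large inputs; the helper name is ours.

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    """Solve min 0.5 * ||w||^2  s.t.  y_i * (w @ x_i + b) >= 1
    with a general-purpose NLP solver; z packs (w, b), and the
    bias b is left unpenalized, matching problem (6)."""
    n, d = X.shape
    cons = [{"type": "ineq",
             "fun": lambda z, i=i: y[i] * (X[i] @ z[:d] + z[d]) - 1.0}
            for i in range(n)]
    res = minimize(lambda z: 0.5 * z[:d] @ z[:d], np.zeros(d + 1),
                   method="SLSQP", constraints=cons)
    return res.x[:d], res.x[d]
```

On two points $(0,0)$ with label $-1$ and $(2,0)$ with label $+1$, the maximum-margin separator is the line $x_1 = 1$, i.e., $w = (1, 0)$ and $b = -1$, with both margin constraints tight.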
The combinatorial dimension of is [MSW96], and the VC dimension of is [VC15]. In the following, let denote the time needed to solve an instance of Linear SVM problem with constraints and variables. We show how to implement the basis computation and violation test for Linear SVM in the following proposition.
Proposition 4.2.
For any Linear SVM problem with constraints and dimension :

The time needed to compute a basis of given constraints is

The time needed to compute all constraints that violate a given basis of size among constraints is
Proof.
Plugging in the currently best known bound for by quadratic programming in [YT89] in Proposition 4.2, and the aforementioned bounds on , we can prove Theorem 5 using Theorems 1, 2, and 3.
Theorem 5.
We give the following randomized algorithms for dimensional linear support vector machine problem with constraints. For any and :


Streaming: An pass algorithm with space in time.

Coordinator: An round algorithm with total communication in which the coordinator and each site spend time and time, respectively, where is the number of constraints on site .

MPC: An round algorithm with load per machine and time in total.
4.3 Core Vector Machine
Tsang et al. [TKC05] proposed core vector machines as a way of speeding up kernel methods in SVM training (see [BGV92]). This is achieved by reformulating the original kernel method as an instance of the minimum enclosing ball (MEB) problem, defined as follows: given a set of points, find a center and a minimum radius such that all the points lie within the sphere of that radius around the center. MEB can be formulated as the following optimization problem:
(7) $$\min_{c,\, r}\ r^{2} \quad \text{s.t.} \quad \lVert x_i - c\rVert^{2} \le r^{2} \ \ \text{for all } i.$$
This problem is also an LP-type problem, formulated similarly to linear programming and Linear SVM [MSW96].
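Beyond exact LP-type solvers, MEB also admits a strikingly simple iterative approximation due to Badoiu and Clarkson, which underlies the core-set view of core vector machines. The sketch below is ours, under the standard guarantee that roughly $1/\varepsilon^{2}$ iterations yield a $(1+\varepsilon)$-approximate ball:

```python
import numpy as np

def approx_meb(P, iters=2000):
    """Approximate minimum enclosing ball via the Badoiu-Clarkson
    iteration: repeatedly step the center toward the current
    farthest point, with step size shrinking as 1/(k+1)."""
    c = P[0].astype(float).copy()
    for k in range(1, iters + 1):
        far = P[np.argmax(np.linalg.norm(P - c, axis=1))]
        c += (far - c) / (k + 1)
    r = np.linalg.norm(P - c, axis=1).max()  # radius that covers all of P
    return c, r
```

On the four corners of the unit square, the iteration converges toward the true center $(0.5, 0.5)$ and radius $\sqrt{1/2} \approx 0.7071$; note the returned radius is always an upper bound on the optimum, since any center must be at distance at least the optimal radius from its farthest point.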
The combinatorial dimension of this problem is [MSW96] and the VC dimension is [WD81]. Let denote the time needed to solve an instance of the MEB problem with constraints and variables. The following proposition shows how to implement the basis computation and violation test for MEB (the proof is identical to that of Proposition 4.2 and is hence omitted).
Proposition 4.3.
For any MEB problem with constraints and dimension :

The time needed to compute a basis of given constraints is

The time needed to compute all constraints that violate a given basis of size among constraints is
As MEB can be cast as a convex quadratic program, we have by [YT89] as before. Hence, Theorems 1, 2, and 3 imply the following result.
Theorem 6.
We give the following randomized algorithms for dimensional core vector machine problem with constraints. For any integer :


Streaming: An pass algorithm with space in time.

Coordinator: An round algorithm with total communication in which the coordinator and each site spend time and time, respectively, where is the number of constraints on site .

MPC: An round algorithm with load per machine and time in total.
5 Lower Bounds
In this section we prove information-theoretic lower bounds for linear programming that hold against any algorithm. We obtain our lower bounds by establishing the communication complexity of low-dimensional linear programming, and then translating it to lower bounds in the big data models. In the following, we first give some background on communication complexity and then present an intermediate problem, called the two-curve intersection problem (TCI), that we consider en route to proving our result for linear programming. We then prove a lower bound for TCI and present its implications for linear programming in the streaming and coordinator models.
5.1 Background
Communication Complexity.
We focus on the standard two-party communication complexity model of Yao [Yao79]. In this model, Alice and Bob each receive an input. In a bounded-round protocol, Alice and Bob can exchange up to a fixed number of messages with each other. In particular, when the number of rounds is even, Bob first sends a message to Alice, followed by a message from Alice to Bob, and so on, until Bob receives the last message and outputs the answer. When the number of rounds is odd, the only difference is that Alice starts first, and then the players continue as before until Bob outputs the answer. The communication complexity of a problem is the minimum worst-case communication cost of any protocol (possibly randomized) that solves the problem with constant probability of success. The bounded-round communication complexity is similarly defined with respect to protocols that are allowed at most the given number of rounds of communication.
Augmented Indexing. In the Augmented Indexing problem, Alice is given a binary string, and Bob is given an index together with all the bits of the string preceding that index. The goal is for Bob to output the bit of the string at his index. It is well-known that the bounded-round communication complexity of this problem is high (see, e.g., [MiltersenNSW98]).
Information Theory.
Throughout this section, we use boldface fonts, say , to denote random variables, and normal font, say , to denote their realizations. For a random variable , denotes its support and its distribution. We sometimes abuse the notation and use and interchangeably. Furthermore, for a tuple and any integer , we define and .
Our proof relies on basic concepts from information theory, which we review briefly here. For a broader introduction, we refer the interested reader to the excellent text by Cover and Thomas [ITbook].
Entropy and Mutual Information.
The Shannon entropy of a random variable $\mathbf{X}$ is defined as
$$H(\mathbf{X}) := \sum_{x} \Pr\left[\mathbf{X} = x\right] \cdot \log\frac{1}{\Pr\left[\mathbf{X} = x\right]}.$$
The conditional entropy of $\mathbf{X}$ on a random variable $\mathbf{Y}$ is defined as $H(\mathbf{X} \mid \mathbf{Y}) := \mathbb{E}_{y \sim \mathbf{Y}}\left[H(\mathbf{X} \mid \mathbf{Y} = y)\right]$. The (conditional) mutual information between $\mathbf{X}$ and $\mathbf{Y}$ is $I(\mathbf{X} ; \mathbf{Y}) := H(\mathbf{X}) - H(\mathbf{X} \mid \mathbf{Y})$ (respectively, $I(\mathbf{X} ; \mathbf{Y} \mid \mathbf{Z}) := H(\mathbf{X} \mid \mathbf{Z}) - H(\mathbf{X} \mid \mathbf{Y}, \mathbf{Z})$). We shall use the following basic properties of entropy and mutual information throughout.
Fact 5.1 (cf. [ITbook]; Chapter 2).
Let $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, and $\mathbf{D}$ be four (possibly correlated) random variables.

$0 \le H(\mathbf{A}) \le \log{\lvert \mathrm{supp}(\mathbf{A}) \rvert}$. The right equality holds iff $\mathbf{A}$ is uniform over its support.

$I(\mathbf{A} ; \mathbf{B} \mid \mathbf{C}) \ge 0$. The equality holds iff $\mathbf{A}$ and $\mathbf{B}$ are independent conditioned on $\mathbf{C}$.

Conditioning on a random variable can only reduce the entropy: $H(\mathbf{A} \mid \mathbf{B}, \mathbf{C}) \le H(\mathbf{A} \mid \mathbf{B})$. The equality holds iff $\mathbf{A}$ and $\mathbf{C}$ are independent conditioned on $\mathbf{B}$.

Chain rule for mutual information: $I(\mathbf{A}, \mathbf{B} ; \mathbf{C} \mid \mathbf{D}) = I(\mathbf{A} ; \mathbf{C} \mid \mathbf{D}) + I(\mathbf{B} ; \mathbf{C} \mid \mathbf{A}, \mathbf{D})$.
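These definitions are easy to verify numerically for small discrete distributions. A self-contained sketch (helper names ours), with entropy measured in bits:

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a distribution {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return entropy(px) + entropy(py) - entropy(joint)
```

For example, the uniform distribution over 8 outcomes attains the maximum entropy $\log 8 = 3$ bits; two independent uniform bits have zero mutual information, while two perfectly correlated bits share exactly one bit of information.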
Measures of Distance Between Distributions.
For two distributions $\mu$ and $\nu$ over the same support, the Kullback-Leibler divergence between $\mu$ and $\nu$ is defined as
$$D(\mu \,\|\, \nu) := \mathbb{E}_{a \sim \mu}\left[\log\frac{\mu(a)}{\nu(a)}\right].$$
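As a sanity check on this definition, a minimal discrete implementation (helper name ours; it assumes the support of the first distribution is contained in that of the second):

```python
import math

def kl_divergence(mu, nu):
    """D(mu || nu) in bits for distributions {outcome: prob};
    assumes supp(mu) is contained in supp(nu)."""
    return sum(p * math.log2(p / nu[a]) for a, p in mu.items() if p > 0)
```

The divergence of a distribution from itself is zero, and it is non-negative in general (e.g., a point mass diverges from a fair coin by exactly one bit).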