One of the fundamental problems in machine learning is
linear regression, where the goal is to fit a hyperplane that minimizes the sum of squared vertical distances to a set ofinput -dimensional points (samples, vectors, training data). Formally, the input is an matrix and a vector in that contains the labels (heights, or last dimension) of the points. The goal is to minimize the sum over every -dimensional vector of coefficients,
One disadvantage of these techniques is that overfitting may occur (Bühlmann & Van De Geer, 2011). For example, if the entries in are relatively small, then may give an approximated but numerically unstable solution. Moreover, (1) is not robust to outliers, in the sense that e.g. adding a row whose entries are relatively very large would completely corrupt the desired vector .
This motivates the addition of a constraint on the norm of , where is a constant that may depend on the scale of the input and . Without loss of generality, we can assume , otherwise we divide entries ofP accordingly. The result is constrained regression problem,
The special case can be solved in
time, where the optimum is the smallest singular value ofand is the largest singular vector of (Golub & Reinsch, 1970).
A generalization of (2) for a given constant would be
where for . Note that for we obtain a non-standard norm which is a non-convex function over .
Optimization problem (3) for can be defined geometrically as follows. Compute a point on the unit sphere that minimizes the weighted sum of distances over given hyperplanes and multiplicative weights. Here, the th hyperplane is defined by its normal (unit vector) , its distance from the origin , and its weight . The weighted distance between and the th hyperplane is defined as .
In the context of machine learning, in linear regression we wish to fit a hyperplane whose unit normal is , that minimizes the sum of squared vertical distances between the hyperplane at point (predicted value) and (the actual value), over every . In low-rank approximation (such as SVD / PCA) we wish to fit a hyperplane that passes through the origin and whose unit normal is , that minimizes the sum of squared Euclidean distances between the data points and the hyperplane. Our problem is a mixture of these two problems: compute a hyperplane that passes through the origin (as in low-rank approximation) and minimizes sum of squared vertical distances (as in linear regression).
Further generalization of (3) suggests handling data with outliers. For example when one of the rows of is very noisy, or if an entry of is unknown. Let be the number of such outliers in our data. In this case, we wish to ignore the largest distances (fitting errors), i.e., consider only the closest points to . Formally,
where is a vector that consists of the smallest entries in , where is an integer.
In some cases, our set of observations is unordered, i.e., we do not know which observation in matches each point in . For example, when half of the points should be assigned to class and half of the points to class . Here, denotes a bijective function, called a matching function, and denotes the permutation of the entries in with respect to . In this case, we need to compute
where the minimum is over every unit vector and matching function .
2 Related Work
Regression problems are fundamental in statistical data analysis and have numerous applications in applied mathematics, data mining, and machine learning; see references in (Friedman et al., 2001; Chatterjee & Hadi, 2015). Computing the simple (unconstrained) Linear regression in (1) for the case was known already in the beginning of the previous century (Pearson, 1905). Since the norm is a convex function, for
it can be solved using linear programming, and in general forusing convex optimization techniques in time or using recent coreset (data summarization) technique (Dasgupta et al., 2009) in near-linear time.
To avoid overfitting and noise, there is a need to bound the norm of the solution , which yields problem (2) when . The constraint in (2) can be replaced by adding a Lagrange multiplier (Rockafellar, 1993) to obtain,
Unfortunately, (2) and (6) are non-convex problems in quadratic programming as explained in (Park & Boyd, 2017), which are also NP-hard if , so there is no hope for running time that is polynomial in ; see Conclusion section. It was proved in (Jubran & Feldman, 2018) that the problem is non-convex even if every input point (row) is on the unit circle and . Similarly, when we are allowed to ignore outlier as defined in (4), or if we use
-estimators, the problem is no longer convex.
Instead, a common leeway is to “guess” the value of in (6), i.e., turn it into an input parameter that is calibrated by the user and is called regularization term (Zou & Hastie, 2005) to obtain a relaxed convex version of the problem,
This problem is the ridge regression which is also called Tikhonov regularization in statistics (Hoerl & Kennard, 1970), weight decay in machine learning (Krogh & Hertz, 1992), and constrained linear inversion method (Twomey, 1975)
in optimization. Many heuristics were suggested to calibrate
automatically in order to remove it such as automatic plug-in estimation, cross-validation, information criteria optimization, or Markov chain Monte Carlo (MCMC)(Kohavi et al., 1995; Gilks et al., 1995) but no provable approximations for the constrained regression (2) are known; see (Karabatsos, 2018, 2014; Cule & De Iorio, 2012) and references therein.
Another reminiscent approach is LASSO (least absolute shrinkage and selection operator) (Tibshirani, 1996), which replaces the non-convex constraint in (2) with its convex inequality to obtain for some given parameter .
LASSO is most common technique in regression analysis(Zou & Hastie, 2005) for e.g. variable selection and compressed sensing (Angelosante et al., 2009) to obtain sparse solutions. As explained in (Tibshirani, 1996, 1997) these optimization problems can be easily extended to a wide variety of statistical models including generalized linear models, generalized estimating equations, proportional hazards models, and M-estimators.
Alternatively, the -norm in (1) may be replaced by the -norm for to obtain the (non-constrained) regression
which is convex for the case . Using in (7) is especially useful for handling outliers (Ding & Jiang, 2017) which arise in real-world data. However, for the (non-standard) -norm, where , (7) is non-convex.
Adding the constraint in (7) yields the constrained regression in (3). Only recently, a pair of breakthrough results were suggested for solving (3) if . (Park & Boyd, 2017) suggested a solution to the constraint regression in (3) for the case . They suggest to convert the constraint into two inequality constraints and . The other result (Jubran & Feldman, 2018) suggested a provable constant approximation for the constrained regression problem, in time for every constant . However, the result holds only for , and the case as in our paper was left as an open problem. In fact, our main algorithm solves the problem recursively where in the base case we use the result from (Jubran & Feldman, 2018).
for regression in this paper is a small weighted subset of the input that approximates for every , up to a multiplicative factor of . Solving the constrained regression on such coreset would thus yield an approximation solution to the original (large) data. Such coresets of size independent of were suggested in (Dasgupta et al., 2009). In Theorem 8.1 we obtain a little smaller coreset by combining (Dasgupta et al., 2009) and the framework from (Feldman & Langberg, 2011; Braverman et al., 2016). Such coresets can also be maintained for streaming and distributed Big Data in time that is near-logarithmic in per point as explained e.g. in (Feldman et al., 2011; Lucic et al., 2017b). Applying our main result on this coreset, thus implies its streaming and distributed versions. We note that this scenario is rare and opposite to the common case: in most coreset related papers, a solution for the problem that takes polynomial time exists and the challenge is to reduces its running time to linear in , by applying it on a coreset of size that is independent, or at least sub-linear in . In our case, the coreset exists but not a polynomial time algorithm to apply on the coreset.
3 Paper Overview
The rest of the paper is organized as follows. We state our main contributions in Section 4 and some preliminaries and notations in Section 5. Section 6 suggests an approximation algorithm for solving (3), Section 6.3 generalizes this solution of (3) to a wider range of functions, Section 7 handles the minimization problem in (5), Section 8 introduces a coreset for the constrained regression, Section 9 presents our experimental results and Section 10 concludes our work.
4 Our Contribution
Some of the proofs have been placed in the appendix to make the reading of the paper more clear.
We provide the first polynomial time algorithms that approximates, with provable guarantees, the functions in (3), (4) and (5) up to some constant factor that depends only on and some error parameter . The factor of approximation is and , respectively for (3), (4) and (5). The running time is and , respectively for (3), (4) and (5); see Table 1.
Our main algorithm takes time and is easily generalized for many objective functions via Observation 6.7, Theorem 6.8 and Theorem 7.2. It implies the results in the last three rows of Table 1. For the case that and we wish to minimize over every unit vector , we can apply our algorithm on the coreset for regression as explained in the previous section. This reduces the running time to near-linear in , and enables parallel computation over distribution and streaming data by applying it on the small coreset that is maintained on the main server. This explains the running time and approximation factor for the first three lines of Table 1.
Our experimental results
show that the suggested algorithms perform better, in both accuracy and computation time, compared to the few state of the art methods that can handle this non-convex problem; See Section 9.
Table 1 summarizes the main contributions of this paper.
In this section we first give notation and main definitions that are required for the rest of the paper.
Let be the set of real matrices. In this paper, every vector is a column vector, unless stated otherwise, that is . We denote by the length of a point , by the Euclidean distance from to a subspace of . We denote for every integer . For a bijection function (matching function) and a set of pairs, we define . For every , we denote by the th standard vector, i.e., the th column of the identity matrix. For , is the set of all unit vectors in . We denote for simplicity. For every , we define . The set denotes the union over every matching function . We remind the reader that is the set (and not a scalar) that contains all the values of that minimize over some set . We denote by .
We first give a brief geometric illustration of the later definitions. A hyperplane in that has distance from the origin can be defined by its orthogonal (normal) unit vector so that . More generally, the distance from to is . In Definition 5.1 below the input is a set of such hyperplanes that are defined by the matrix and vector , such that is the unit normal to the th hyperplane, and is its distance from the origin.
In what follows we define a (possibly infinite) set of unit vectors , which are the unit vectors that are as close as possible to the hyperplane , among all unit vectors that lie on the intersection of . If , we define this set to be the set of closest vectors on the unit sphere to the hyperplane . Observe that for every point , ; See Figure 1 and 2 for a geometric illustration of the following definition.
Let be a pair of integers, and . We define the set
In what follows, for every pair of vectors and in we denote if for every . The function is non-decreasing if for every . For a set in , and a scalar we denote .
The following definition is a generalization of Definition 2.1 in (Feldman & Schulman, 2012) from to dimensions, and from to . Intuitively, an -log-Lipschitz function is a function whose derivative may be large but cannot increase too rapidly (in a rate that depends on r).
Definition 5.2 (Log-Lipschitz function).
Let and let be an integer. Let be a subset of , and be a non-decreasing function. Then is -log-Lipschitz over , if for every and , we have The parameter is called the log-Lipschitz constant.
The following definition implies that we can partition a function which is not a log-Lipschitz function into a constant number of log-Lipschitz functions; see Figure 3 for an illustrative example.
Definition 5.3 (Piecewise log-Lipschitz (Jubran & Feldman, 2018)).
Let be a continuous function over a set , and let be a metric space, i.e. is a distance function. Let . The function is piecewise -log-Lipschitz if there is a partition of into disjoint subsets such that for every :
has a unique infimum at , i.e., .
is an -log-Lipschitz function; see Definition 5.2.
for every .
The set of minima is denoted by .
6 Regression with a Given Matching
In this section we suggest our main approximation algorithm for solving (3), i.e., when the matching between the rows of and the entries are given.
The following two corollaries lies in the heart of our main result.
Corollary 6.1 (Claims 19.1 and 19.2 in (Jubran & Feldman, 2018)).
Let and let such that . Then is a piecewise log-Lipschitz function; See Definition 5.3.
Let and . Let . Then for every such that we have
See proof of Corollary A.1 in the appendix. ∎
In what follows, a non-zero matrix is a matrix of rank at least one (i.e., not all its entries are zero). Let and , where we assume that is a non-zero matrix. For every , let be a hyperplane whose normal is and its distances from the origin is , and let denote their union. Let be a unit vector. The following lemma generalizes Corollary 6.2 to higher dimensions. It states that there is a unit vector of minimal distance to one of the hyperplanes , that approximates for every up to a multiplicative factor of .
Let be a non-zero matrix such that points, and let and . Then there exists where and such that for every
See proof of Lemma A.2 in the appendix. ∎
The following lemma states that there is a set of hyperplanes from , and a unit vector in the intersection of the first hyperplanes that is closest to the last hyperplane, , that is closer to every one of the hyperplane in , up to a factor of than , i.e., for every . The proof is based on applying Lemma 6.3 recursively times. In the following lemma we use from Definition 5.1.
Let be a non-zero matrix of rows, let , and let . Then there is a set of indices such that for and every , there is that satisfies
Moreover, (28) holds for every if .
See proof of Lemma A.3 in the appendix. ∎
6.1 Computing the set
In this section we give a suggested simple implementation for computing the set given a matrix and ; see Algorithm 1. A call to returns if . Otherwise, it returns some . In other words, the algorithm always returns a set of finite size.
6.2 Geometric interpretation and intuition behind Algorithm 1.
Algorithm 1 takes a set of hyperplanes in as input, each represented by its normal and its distance from the origin, and computes a point that minimizes its distance to , i.e., . In other words, among all vectors that lie simultaneously on , we intend to find the unit vector which minimizes its distance to . There can either be ,, or infinite such points. In an informal high-level overview, the algorithm basically starts from some unit vector , and rotates it until it either intersects , or minimizes its distance to without intersecting it. If they intersect, then we rotate while maintaining that , until either intersects or minimizes its distance to . If intersects , we rotate in the intersection until it intersects and so on. If at some iteration we observe that all the remaining subspaces are parallel, then the set is of infinite size, so we return one element from it. We stop this process when at some step , we can not intersect under the constraint that is a unit vector and . If , then is empty. If we return the vector that is as close as possible to .
More formally, at each step of the algorithm we do the following. In Line 1 we check weather the intersection contains at most point. This happens when the distance of the hyperplane to the origin is bigger than or equal to . This is the simple case in which we terminate. The interesting case is when the distance of the hyperplane to the origin is less than . In this case, if , we compute and return the two possible intersection points in . If , then . Observe that is simply a sphere of dimension , but is not a unit sphere. We rotate the coordinates system such that the hyperplane containing is orthogonal to the th standard vector . We observe that in order to minimize the distance from a point in to , we need to minimize its distance to which is simply a -subspace contained in . We thus project and every -subspace in onto the hyperplane orthogonal to and passes through the origin, obtaining a sphere of dimension , and -subspaces, all contained in . We scale the system such that becomes a unit sphere. We then continue recursively.
Overview of Algorithm 2.
The input is a matrix and a vector . The output is a set of unit vectors that satisfies Theorem 6.5. We assume without loss of generality that the entries of are non-negative, otherwise we change the corresponding signs of rows in in Lines 2-2. In Lines 2-2 we run exhaustive search over every possible set of at most indices, which corresponds to a set of unit vectors, to find the set from Lemma 6.4. The algorithm then returns the union of these sets in Line 2. See Algorithm 1 for a possible implementation for Line 2. Notice that Algorithm 1 does not return a set of infinite size. In such a case, it returns one element from the infinite set.
What follows is the main theorem of Algorithm 2. The following theorem states that the output of Algorithm 2 contains the desired solution. This is by Lemma 6.4 that ensures that one of the sets contains the desired solution.
Let be a matrix of rows, and let . Let be an output of a call to Calc-x-candidates; see Algorithm 2. Then for every there exists a unit vector such that for every ,
Moreover, the set can be computed in time and its size is .
See proof of Theorem A.5 in the appendix. ∎
We now prove that the output of Algorithm 2 contains approximations for a large family of optimization functions. Note that each function may be optimized by a different candidate in . This family of functions includes squared distances, -estimators and handling outliers. It is defined precisely via the following cost function, where the approximation error depends on the parameters and .
Definition 6.6 (Definition 4 in (Jubran & Feldman, 2018)).
Let be a finite input set of elements and let be a set of queries. Let be a function. Let be an -log-Lipschitz function and be an -log-Lipschitz function. Let . We define