A wide spectrum of problems in applied mathematics and statistics can be formulated as an instance of the convex programming problem
where is continuously differentiable and convex and the are closed convex sets in . At one extreme, problem (1
) includes classical least squares. At the other, it includes finding a feasible point in the intersection of several closed convex sets. In between, the formulation covers a variety of shape restricted regression problems such as fitting a support vector machine and projecting an exterior point onto a complicated convex set. Great progress has been made in attacking specific incarnations of problem (1). The projected gradient algorithm and its Newton and quasi-Newton extensions have been very successful when constraints are simple, for example box constraints, and admit a correspondingly simple projection operator Ber1982 ; KimSraDhi2010 ; Ros1960 ; SchBerFri2009 . However, there is still room for improvement. In the current paper we present a unified approach to solving a smoothed relaxation of problem (1) via the majorization-minimization (MM) principle Lan2010 . This approach is especially attractive when it is easy to project onto each separate set but nontrivial to project onto their intersection.
Problem (1) can be written as the unconstrained optimization problem
where the indicator function equals 0 if and if not. Although problem (2) is now unconstrained, the indicator functions introduce two challenges. The new objective function takes on infinite values and is non-differentiable. This prompts us to replace by a finite valued smooth approximation , where denotes the standard Euclidean norm. Further progress can be made by solving the related problem
where is a positive parameter that penalizes deviation from the original feasible region. The smooth approximation introduced in formulation (3) is an example of the quadratic penalty method Ber2003 ; NocWri2006 ; Rus2006 . Problem has many appealing features. The problem is unconstrained with an objective function that is convex and differentiable when is convex and differentiable. Consequently optimality conditions can be readily identified. The distance function is closely tied to the projection of onto ; specifically , and . Thus, a point solves problem if and only if it satisfies the stationarity condition
Of course finding such an is often analytically intractable due to the projection term. To solve (3) iteratively, we resort to the MM principle. Because we rely on majorizing
, we call our approach distance majorization. A key step in solving the subproblems will be calculating projection operators. Fortunately, many useful projection operators are easy to compute. The best known examples include projection onto: (a) a closed Euclidean ball, (b) a closed rectangle, (c) a hyperplane, (d) a closed halfspace, (e) a vector subspace, (f) the set of positive semidefinite matrices, (g) the unit simplex, (h) a closedball, and (i) an isotone convex cone. While there are no analytic solutions for the last three projections, there are efficient algorithms for computing them BarBarBre1972 ; DucShaSin2008 ; Mic1986 ; RobWriDyk1988 ; SilSen2005 .
The rest of the paper is organized as follows. After a brief review of the MM principle and its place among related iterative minimization schemes, we illustrate the virtues of distance majorization in five different problem areas: (a) finding a point in the intersection of a finite collection of closed convex sets, (b) projection of a point onto the closest point in the intersection of a finite collection of closed convex sets, (c) convex regression, (d) classification via support vector machines, and (e) the facilities location problem. The literature on some of the examples is enormous, so we apologize in advance for omitting relevant references and slighting the ramifications of the various models. After our tour of examples, we present relevant convergence theory in a general algorithmic framework. Our concluding discussion indicates a few extensions and limitations of distance majorization.
2 The MM Principle and Distance Majorization
Although first articulated by the numerical analysts Ortega and Rheinboldt OrtRhe1970 , the MM principle currently enjoys its greatest vogue in computational statistics BecYanLan1997 ; LanHunYan2000 . The basic idea is to convert a hard optimization problem (for example, non-differentiable) into a sequence of simpler ones (for example, smooth). The MM principle requires majorizing the objective function by a surrogate function anchored at the current point . Majorization is a combination of the tangency condition and the domination condition for all . The iterates of the associated MM algorithm are defined by
the MM iterates generate a descent algorithm driving the objective function downhill. Strict inequality usually prevails unless is a stationary point of .
The most useful majorization of follows immediately from the observations
for all pairs and . In practice, majorizing by leads to more convenient updates than majorizing by . Most of our applications can be phrased as minimizing the criterion
for a convex loss , a collection of closed convex sets, a positive penalization parameter , and a corresponding set of positive weights . Without loss in generality, we can require to sum to one, since scaling of the weights can be absorbed into the overall penalty parameter . Uniform weights equally penalize an iterate’s violation of each constraint. Nonuniform weights will penalize constraint violations differently. This can be a useful mechanism if it is more important to satisfy some constraints over others in an application. In this paper we consider examples where constraints are all equally important and consequently employ uniform weights. For notational simplicity, we drop the weights from the remainder of our exposition but note that they can be employed in all the examples we cover. The introduction of weights also leaves the convergence analysis presented later untouched. Algorithm 1 shows the pseudocode for the distance majorization algorithm.
We highlight the fact that the algorithm does not require the projection onto the intersection but rather only the projection onto each of the constituent sets . As we will see in our first example, distance majorization can be considered a generalization of the simultaneous projection algorithm for finding a point in the intersection of a collection of closed convex sets. We note, however, that distance majorization is not unique in this regard. For comparison’s sake, we will also present a dual ascent algorithm at the end of the next section that employs projections onto the constituent sets. Although the two methods exhibit comparable empirical performance, the distance majorization algorithm is guaranteed to converge under weaker conditions than the dual ascent algorithm.
Finally, we note that MM algorithms are often plagued by a slow rate of convergence in a neighborhood of the minimum point. To remedy this situation, we employ quasi-Newton acceleration. MM algorithms can be accelerated via Newton’s method just as the classic gradient descent algorithm. Adjusting the direction of steepest descent to account for the curvature in the objective yields more efficient step directions, and the number of iterations to a minimum can be drastically reduced. Newton’s method, however, requires computing and storing a full Hessian matrix, a demanding task in high-dimensional problems. To ease the computational burden, quasi-Newton methods obtain curvature information by approximating the Hessian with secants or differences between successive iterates. Using more secants leads to better approximations of the Hessian initially, but using too many secants can actually lead to a poorer approximation as a smaller collection of secants can adapt more dynamically to changes in the curvature as the iterations proceed. Moreover, using more secants entails additional storage and computation. In the following examples, we use either two or five secants. Using a handful of secants is a modest additional burden in computation and storage but leads to noticeable acceleration in our MM algorithm. For details on the scheme we employed as well as comparisons with alternative acceleration schemes, we direct readers to our earlier paper ZhoAleLan2011 .
2.1 Sequential Unconstrained Minimization
During the review of this paper, a referee brought to our attention that the MM algorithm is an instance of a broad class of methods termed sequential unconstrained minimization FiaMcC1990 . Consider minimizing over a closed, non-empty set . In sequential unconstrained minimization, we generate a sequence of iterates that minimize an unconstrained surrogate
where the auxiliary functions encode information about the constraint set .
When is chosen so that for all and , then
This is a restatement of the descent property of an MM algorithm. In fact, we can identify and . The tangency and domination conditions of the MM principle can be expressed alternatively as
where and .
Byrne Byr2008a introduced an important subset of sequential unconstrained minimization methods in which the auxiliary functions satisfy
Methods satisfying (7) are examples of sequential unconstrained minimization algorithms (SUMMA) and generate iterates for which converges to
. The SUMMA class includes a wide range of general iterative methods including barrier and penalty function methods, forward-backward splitting methods, and instances of the expectation maximization (EM) algorithm to name a few. Readers can consult the referencesByr2008a ; Byr2013 ; Byr2013a to learn more about the breadth and applicability of the SUMMA class.
Given that examples of the EM algorithms have been shown to belong to the SUMMA class Byr2013 and that EM algorithms are a special case of the MM algorithm WuLan2010 , it is natural to wonder if MM algorithms, which have now been shown to be sequential unconstrained minimization algorithms, belong to the SUMMA class. The answer to this question is not immediately obvious. It is possible to concoct majorizations that fail to meet the SUMMA condition. Rewriting the SUMMA condition (7) in terms of majorizations yields
for all . Roughly speaking, (8) says that a sequence of majorizations should be hugging uniformly more closely as the iterations proceed. While this is intuitively desirable, it is not necessary to ensure convergence of an MM algorithm.
Nonetheless, it can be non-trivial to declare an iterative algorithm to be outside the SUMMA class, since we must prove that the resulting iterative algorithm could not be derived from some sequence of auxiliary functions that do obey (7). Although majorizations chosen may violate the SUMMA condition, the resulting iterative algorithm may ultimately belong to the SUMMA class. In the Appendix we give an example of a convergent MM algorithm with a surrogate function that fails condition (8) globally. Locally the algorithm does belong to the SUMMA class. The ambiguity about the proper domain of an algorithm spills over into selection of starting points and highlights the practical benefits of the MM principle, which leaves the door ajar to less restrictive auxiliary functions. Fortunately, the qualitative features of convergence carry over to this broader set of auxiliary functions.
3 Examples of Distance Majorization
Finding a Feasible Point
When the intersection is nonempty, majorization can be employed to locate a point in . The general idea is to drive the convex combination
to 0. Minimization of the surrogate function
leads to the well-known simultaneous projection algorithm
The earliest version of this algorithm is attributed to Cimmino Cim1938 . It does not necessarily find the closest point in to . The evidence suggests that simultaneous projection converges more slowly than alternating projection CenCheCom2012 ; Gou2008 . However, simultaneous projection enjoys the advantage of being parallelizable. One can invoke the theory of paracontractive operators to prove the convergence of both simultaneous and alternating projections Byr2008 .
The alternating projection algorithm can also be derived by distance majorization. The least distance between two closed convex sets and can be found by minimimizing over . If we majorize by the surrogate function , where , then the minimum of the surrogate occurs at . When the two sets intersect, the least distance of 0 is achieved at any point in the intersection. Thus, the MM principle provides very simple and direct derivations of the simultaneous and alternating projection algorithms.
Distance majorization can be generalized by replacing Euclidean distances with Bregman divergences. For simplicity we limit our discussion to Bregman divergences generated by strictly convex twice differentiable functions . The Bregman divergence
is a convex function of anchored at and majorizing 0. For instance, the four convex functions , , , and generate the Bregman divergences
The matrix in the definition of is assumed positive definite.
The Bregman projection onto a closed convex set is defined as
Under suitable additional hypotheses, the Bregman projection exists. It is unique because is strictly convex. Moreover, (equivalently ) exactly when . The analogue of the proximity function (9) is the proximity function
If we abbreviate , then the function
majorizes . The MM principle suggests that we minimize . A brief calculation produces the stationarity condition
Projection onto the Intersection of Closed Convex Sets
We next consider how distance majorization can be used to find the closest point in the intersection to a point . This involves minimizing the strictly convex function
for large. The solution tends to the optimal point as tends to . The MM update for minimizing the surrogate function
is the convex combination
The corresponding algorithm map is strictly contractive with contraction constant . According to the contraction mapping theorem, the iterates converge to the unique fixed point at geometric rate . This fixed point coincides with the minimum point of the function . Indeed, is differentiable with gradient
Rearrangement of the stationarity condition gives the fixed point condition
One can generalize these results in various ways. For instance, if we replace Euclidean loss by weighted Euclidean loss , then the MM update of the penalized loss has components
The quadratic penalty method suffers from roundoff errors and numerical instability for large . These are mitigated in the MM algorithm since its updates (11) rely on stable projections and avoid matrix inversion. The slow rate of convergence for large is an issue. In practice one can improve the rate of convergence by starting small and gradually increasing it to its target value. For a fixed one can also accelerate the MM iterates by systematic extrapolation. For instance, our quasi-Newton acceleration ZhoAleLan2011 often reduces the required number of iterations by one or two orders of magnitude.
Projection as a Dual Program
For the sake of comparison, we describe a dual algorithm for solving the projection problem. This alternative algorithm can be accelerated by Nesterov’s method BecTeb2009 ; Nes2007 . The unaccelerated dual algorithm is a variation of Dykstra’s algorithm Dyk1983 , which solves the dual problem by block descent.
To derive the dual problem, we first observe that the primal problem consists of minimizing
subject to . The Lagrangian for the primal problem is
If denotes the concatenation of the dual variables , then the dual function
where . The dual function can be maximized by the proximal gradient algorithm
For the sake of clarity, we have adopted novel notation in this derivation; and denote the th MM iterate of the th primal and dual variables respectively.
Derivation of this algorithm and its Nesterov acceleration (FISTA) appear in the Appendix. The dual updates, which are essentially projection steps, can be computed in parallel. Thus, the dual algorithm matches the MM algorithm in this regard.
Projecting onto the set of doubly nonnegative matrices
As a numerical example, consider the problem of projecting a symmetric matrix onto the set of doubly nonnegative matrices, namely the intersection of the set of nonnegative matrices with the set of positive semi-definite matrices. Many covariance matrices – for example, kinship matrices in statistical genetics – have nonnegative entries. Projection onto each of the component sets is relatively easy while projection onto the intersection is not. Projecting onto the set of nonnegative matrices is accomplished by setting all negative entries of a matrix to zero. Projecting onto the set of positive semi-definite matrices is accomplished by truncating the eigenvalue decomposition of the matrix and rejecting all outer products with negative eigenvalues.
As a test case, we generated a 200-by-200 matrix with independent and identically distributed (i.i.d.) entries drawn from a standard normal distribution. After projecting the simulated matrix onto the space of symmetric matrices, we compared the distance majorization algorithm to its quasi-Newton acceleration (2 secants), the dual proximal gradient algorithm, and its FISTA acceleration. We implemented the MM algorithm with the geometrically increasing sequenceof penalty constants . The decision to switch to the next larger was based on the ratio
Whenever this ratio fell below , we updated . To track the progress of each algorithm, we calculated two measures of constraint violation by the current matrix: (a) the absolute value of the most negative eigenvalue, and (b) the absolute value of the most negative entry. Figure 1 plots the maximum of the two constraint violations on a log scale at each iteration. The abrupt transitions in the MM and quasi-Newton MM paths reflect the switch points for the penalty constant . Obviously, the amount of work done in each iterate varies across the methods. For a more direct comparison, Table 1 records several statistics, including run times in seconds. In the table, the distance column conveys the Frobenius norm of the difference between the simulated matrix and the fitted matrix. The two featured algorithms perform about equally well. As expected, their accelerated versions do much better.
|Method||Time (sec)||Iterations||Distance||Constraint Violation|
Isotone regression minimizes the least squares criterion subject to the isotonic constraint . This problem is readily amenable to the projection algorithm. Projection onto the isotone convex cone
is rapidly accomplished by the pool adjacent violators algorithm BarBarBre1972 ; RobWriDyk1988 ; SilSen2005 . More complicated order restrictions such as for all arcs in a directed graph can be handled as well. In this setting all components of a vector projected on the convex set are left untouched except components and , These are left untouched when . Both and are replaced by their average when .
We considered the problem of fitting a nondecreasing function to the data shown in Figure 2 (black dots). Each observed pair was generated as follows. The are equally spaced points between 1 and 3, and the satisfy
where the are i.i.d. standard normal deviates. For the MM algorithms we used the geometrically increasing sequence of penalty constants featured in the previous example and two secant conditions for the quasi-Newton acceleration. We switched to the next value of whenever the stopping criterion (12) fell below . A looser threshold resulted in unacceptably poor fits for these data.
To track the progress of each algorithm, we measured the constraint violation of an iterate as the maximum absolute constraint violation between two successive parameters. Figure 2 shows that all four algorithms return similar solutions under the specified stopping rule. Figure 3 plots the constraint violation for each method on a log scale. Table 2 compares timing results and constraint violations at convergence. In the table the distance column conveys the Euclidean norm of the difference between observed points and fitted points. Compared to the previous example, we see an even greater improvement in the performance in the accelerated versions of the two algorithms. In general, it is safe to conclude that distance majorization is a viable alternative to its most likely fastest competitor in non-smooth convex optimization.
|Method||Time (sec)||Iterations||Distance||Constraint Violation|
Least Squares Fitting with Convex Functions
Given responses , predictor vectors in , and case weights , convex regression seeks to minimize the sum of squares of residuals
subject to the constraints
for every ordered pairBoyVan2004 . In effect, is viewed as the value of the regression function at the point . The unknown vector serves as a subgradient of at . Because convexity is preserved by maxima, the formula
defines a convex function with value at
. In concave regression the opposite constraint inequalities are imposed. Interpolation of predicted values in this model is accomplished by simply taking minima or maxima. Estimation reduces to a positive semidefinite quadratic program involvingvariables and inequality constraints. Note that the feasible region is nontrivial because it contains the point , where .
The penalized objective function is
where . Let and denote the components of relevant to and , respectively. The surrogate function
admits the minimizer
The projection operator is easy to compute because is a half-space. Furthermore, if we define the quantities
then the sums entering the MM updates reduce to
evaluated at and .
Figure 4 displays a randomly generated data set with 51 data points and the corresponding least squares fit with convexity constraints. We employed the same geometrically increasing sequence of used earlier, took five secant conditions for the quasi-Newton acceleration, and set the stopping criterion (12) to . The MM algorithm requires 8940 iterations and 4.12 seconds in total to achieve the objective value of 1.0709 and the maximal constraint violation at order of .
Support Vector Machine
Given data , , where and , the goal of discriminant analysis is to choose classification labels using the -dimensional predictor . The support vector machine (SVM) Vap2000
subject to the inequality constraints
using slack variables . See Example 15.5.2 of the book Lan2012 for further details about problem formulation and passing to the dual. In the following we assume that the first element of is 1, and thus the intercept is absorbed in the parameter . Then the penalized objective function is
where . Minimizing the surrogate function
subject to the non-negativity of yields the next iterate
Because is a half-space,
where the vector has all entries equal to 0 except for a 1 at entry .
We report the results on an example SVM problem with a training data set of observations and features. We employed the same geometrically increasing sequence of and the same stopping criterion used in the previous example. At , the MM algorithm takes 14,432 iterations and 2.69 seconds to achieve the objective value 489.0058 and the maximal constraint violation .
As a generalization, consider the kernel SVM SchSmo2001 attractive in handling problems. The optimization problem is to minimize
Common choices of kernels include the polynomial kernel
and the Gaussian kernel
Since is positive semi-definite, it can be expressed in terms of a Cholesky decomposition . With reparameterization , the problem transforms to
which is essentially the same as the original SVM. The Cholesky decomposition costs flops and might be a concern for data with huge number of observations. Some kernels used in genomics are naturally low rank with trivial Cholesky factors and . Even for a full-rank kernel , one can resort to the fast Lanczos algorithm GolVan1996 to extract its top eigen-pairs and set , an matrix.
The Fire Station Problem
Finally, we give another example that distance majorization need not be fettered to Euclidean distances. Indeed, Euclidean distances may be inappropriate in some problems. Consider the problem of determining the optimal location of a new fire station in a city where the streets occur on a rectangular grid. The station should be situated to guarantee the shortest routes to several major buildings spread throughout the city. This is just the generalized Heron problem with the norm substituting for the Euclidean norm ChiLan2013 . More general treatment of the problem under arbitrary norms and infinite dimensions can be found in MorNam2011 ; MorNamSal2012 ; MorNamSal2011 . Here we are concerned with efficient computation with a particular norm. The projection operators are now harder to calculate. Indeed, they are often sets rather than points. Fortunately, when is a rectangle with sides parallel to the standard axes, is a point with components
To minimize the objective function, we minimize the surrogate function
Because the norm separates variables, we obtain a very simple update formula.
Consider the example where the buildings have centers , , , , and and half-side lengths of 0.5. Minimizing the sum of and distances yields the results shown in Figure 5. The optimal position clearly depends on the underlying norm. For more general problems, the solution may not be unique because the projection operator does not reduce to a single point.
4 Convergence Analysis
We now prove convergence of the distance majorization algorithm under conditions pertinent to Euclidean distances. For broader impact, we relax the convexity requirement on . For example, in statistics, the objective function corresponding to many widely used robust estimators are often not-convex HubRon2009 . Some of our convergence results hold for such objective functions. When is convex, it is possible to prove stronger results, and we comment on what changes when convexity is assumed. Let us first consider the convergence of the MM algorithm for solving subproblem (3). The convergence theory of MM algorithms hinges on the properties of the algorithm map . For easy reference, we state a simple version of Meyer’s monotone convergence theorem Mey1976 instrumental in proving convergence in our setting.
Let be a continuous function on a domain and be a continuous algorithm map from into satisfying for all with . Suppose for some initial point that the set
is compact. Then (a) , (b) all cluster points are fixed points of , and (c) converges to one of the fixed points if they are finite in number.
The function is the objective to be minimized. In our context, the objective function is . We make the following assumptions: (a) is continuously differentiable and is convex for some constant , (b) is coercive in the sense that , and (c) . Note that inherits coerciveness from . Assumption (b) is met in several different scenarios, for example, if at least one of the is bounded or if itself is coercive. When is convex, is convex for any . Consequently, assumption (c) holds for any positive . If is non-convex, but the smallest eigenvalue of the Hessian is bounded below by , then one can take . As a rule, it can be challenging to identify in advance, and consequently in practice we would not know how large to choose to ensure the conditions for convergence when is unknown. Nonetheless, can be explicitly determined in many useful cases. In the Appendix, we derive for the classic Tukey biweight of robust estimation.
The cluster points of the MM iterates for solving subproblem (3) are stationary points of under assumptions (a) through (c) above. If the number of stationary points is finite, then the MM iterates converge. Finally, if has a unique stationary point, then the MM iterates converge to that stationary point, which globally minimizes .