1 Introduction
Alternating optimization (AO) is a commonly used technique for finding the extremum of a multivariate function, , where . In this approach, one breaks up the (multidimensional) input variable into a few blocks, say, , and successively optimizes the objective function over each block of variables while holding all other blocks fixed. That is, one solves
(1) 
successively over until convergence is achieved. This is an especially natural approach when each individual optimization problem (1) over is relatively easy to solve. Two wellknown examples in statistics are: matrix factorization and penalized regression, but there are many others.
1.1 Matrix factorization
The Netflix contest drew much attention to the matrix factorization problem (Koren et al. 2009; Feuerverger et al. 2012; Zhu 2014). Given a useritem rating matrix , whose element is the rating of item by user , the goal is to find lowrank matrices and , such that
The vector
can be viewed as the coordinate of user in a dimensional map and the vector , the coordinate of item . With these coordinates, it is then possible to recommend item to user if and are closely aligned. Since we don’t know every user’s preferences on every item, many entries of are missing. Letbe the set of observed ratings. In order to estimate these user and itemcoordinates, we can solve the following optimization problem:
(2) 
where the bracketed terms being multiplied by are penalties on the parameters being estimated, introduced to avoid overfitting, because and are typically quite large relative to the number of known ratings (or the size of the set ). Here, we follow the work of Nguyen and Zhu (2013) and use an extra factor to balance the penalties imposed on the two matrices, and .
1.2 Penalized regression
During the last decade, penalized regression techniques have attracted much attention in the statistics literature (Tibshirani 1996; Fan and Li 2001; Zhang 2010). Suppose that are all properly standardized to have mean zero (,
) and variance one (
, ). The prototypical problem can be expressed as follows:(3) 
where is a penalty function. Many different penalty functions have been proposed. A widely used class of penalty functions is
The case of is known as the ridge penalty (Hoerl and Kennard 1970), and that of is known as the LASSO (Tibshirani 1996). In both of these cases, the function is convex. In recent years, nonconvex penalty functions have started to garner the attention of the research community, e.g., the SCAD (Fan and Li 2001), and the MC+ (Zhang 2010):
(4) 
We will focus on the MC+ in this paper; hence, details of the SCAD are omitted, and we refer the readers to the original papers (Fan and Li 2001; Zhang 2010) for explanations of for why these particular types of penalty functions are interesting.
Currently, the preferred algorithm for fitting these penalized regression models is the coordinate descent algorithm (Friedman et al. 2010). One can view the coordinate descent algorithm as the “ultimate” AO strategy. For given , the coordinate descent algorithm solves
while fixing all (), successively over until convergence. In fact, sometimes the general AO algorithm, with which we started this article, is dubbed the “blockwise coordinate descent” algorithm.
1.3 Saddle points
For the ridge penalty and the LASSO, the penalty function is convex, so the coordinate descent algorithm behaves “well”. But for nonconvex penalty functions such as the SCAD and the MC+, there is no guarantee that the coordinate descent algorithm can reach the global solution of (3). Exactly the same point can be made about the AO algorithm for solving (2).
In fact, for these nonconvex problems, not only can the AO strategy get stuck at inferior local solutions, but it also can be trapped at saddle points. Dauphin et al. (2014) found that getting stuck at a saddle point can be a far more serious problem than getting trapped at a local minimum. Tayal et al. (2014) proposed an intriguing method to improve AO by facilitating AO algorithms to escape saddle points. They introduced the concept of a shared “perspective variable” (more details in Section 3.1), but lacked intuition for why this was a good idea.
1.4 Our contribution
In this article, we begin by interpreting the approach of Tayal et al. (2014) geometrically — in particular, sharing a “perspective variable” results in an expanded search space at each step. However, searching over a slightly larger space at each step comes with a higher computational cost, which should be avoided if possible. Thus, our proposal is as follows: first, run the faster AO iterations until convergence; then, try to escape being trapped at an undesirable location by searching over a different space. Of critical importance is the choice of the search space. To this end, we introduce the important idea of defining search spaces that depend upon the particular data observed, as opposed to traditional techniques, including AO and those in Tayal et al. (2014), that fix the search spaces a priori. We apply these ideas to improve the AO algorithm for solving (2) as well as the coordinate descent algorithm for solving (3), with a focus on the MC+ penalty. However, we stress that this is a general algorithm that can be applied to other AO problems as well.
1.5 Notation
In what follows, we will use the notation to denote all other components except those in block .
2 Motivation
It is convenient for us to motivate our key ideas with a very simple example.
2.1 A simple example
Consider the function,
Suppose that we attempt to minimize with AO, and that, at iteration , we have reached the point . While fixing , is minimized at . Similarly, is the optimal point when is fixed at 0. Thus, the AO algorithm is stuck at . However, it is easy to see that is actually unbounded below, and that is a saddle point.
At , the search space defined by AO is
In this case, we can see that being restricted to this particular search space is the very reason why we are trapped at the saddle point. Therefore, if we could use a slightly different search space, we might be able to escape this saddle point. For example, Figure 1 shows that using the search space,
would suffice.
2.2 Main idea
The main lesson from the simple example above is that the restricted search space defined by the AO strategy, , may cause the search to be trapped at undesirable locations such as saddle points, and that we may escape such traps by conducting the search in a slightly different space, . This observation naturally leads us to propose the following strategy.
First, we run standard AO — that is, search in , , …, — until convergence. Then, starting from the AO solution, we continue searching in a different sequence of spaces — , , …, — until convergence. If we see a sufficient amount of improvement in the objective function — an indication that the escaping strategy “worked”, then we repeat the entire process using the improved result as the new starting point; otherwise, the algorithm terminates (see Algorithm 1).
Needless to say, the key to the strategy we outlined above lies in the definition of the escaping sequence, . We discuss this next, in Section 3.
3 Escaping strategies
The key insight from Section 2.1 is that switching to a different search space than the one defined by AO may allow us to escape being trapped at undesirable locations. In this section, we describe how these different search spaces can be specified.
3.1 Scaling
The proposal by Tayal et al. (2014) of sharing a “perspective variable”, which we alluded to earlier in Section 1.3, essentially amounts to the following. At each alternating step, rather than solving (1), they proposed that we solve
(5) 
instead, where is the socalled “perspective variable”. That is, we no longer just search for the optimal while keeping fixed. When searching for the optimal component , we are free to scale all other components as well. Suppose the optimal scaling variable coming out of solving (5) is . The component is then adjusted accordingly, i.e.,
before the next alternating step (for optimizing over and scaling ) begins.
When so described, it is somewhat mysterious why it helps to scale when solving for . However, when viewed in terms of their respective search spaces, we can interpret this proposal as one particular way to define the search space .
As illustrated in Figure 2, at iteration , the search space defined by AO is
whereas, if we are free to scale at the same time, the search space becomes
Clearly, is larger than (but still much smaller than the entire space ). Thus, one way to understand the idea of freely scaling while optimizing over is that it allows us to conduct our search in a slightly larger subspace, thereby improving our chances of finding a better solution.
3.2 Restricted joint search
This particular point of view immediately suggests that there are many other ways to expand, or simply alter, the search space. For example, once the AO steps have converged, we can try to escape by solving a restricted joint optimization problem such as
(6) 
where
are some prechosen directions (more about these later), and
The kind of search spaces generated by (6) can be described as
see Figure 3 for an illustration. The restricted joint search problem (6) can be viewed as a compromise between using a different search space — i.e., rather than — and avoiding a fullscale, simultaneous search over the entire space .
4 Improved AO for matrix factorization
In this section, we apply our escaping strategies to the matrix factorization problem described in Section 1.1. In particular, after solving the optimization problem (2) with AO, we switch to search over a different space so as to escape saddle points, and/or inferior local solutions.
4.1 Scaling
As we described in Section 3.1, introducing a shared variable allows us to search over a slightly expanded space. For fixed , this strategy minimizes
(7) 
over and simultaneously, which we solve using a quasiNewton algorithm with BFGS updates (see, e.g., Gill et al. 1981). Analogously, for fixed , we also numerically optimize over and a scaling variable for .
4.2 Restricted joint search
As the loss function for matrix factorization is generally quite highdimensional, we expect that only increasing the dimensionality of the search space by one can have only a limited effect. In addition, expanding the search space by introducing a scaling variable also limits the types of subspaces we can search over. These are the reasons why we proposed the restricted joint search problem (
6) in Section 3.2.For the matrix factorization problem, this proposal amounts to constructing a family of search vectors and corresponding to each user and itemvector, respectively, and searching over all of these directions simultaneously. Mathematically, this corresponds to solving
(8) 
over and . To make this approach computationally feasible for larger problems, we generally require that for all but a relatively small number of indices . To do so, we sample each
with probability
and each with probability , for some . That is, on average, we randomly choose uservectors and itemvectors to participate in the restricted joint optimization (8).Having established a framework for our desired search space, the most important remaining question is: how to choose our search vectors ? This requires a notion of what an informative subspace is to search over. Below, we describe two different approaches.
4.2.1 Random choices of and
Trying to determine what set of vectors will produce the largest decrease in the loss function (8) is a challenging task. As such, it is a good idea to establish a simple baseline against which to measure more sophisticated spaces. Having such a baseline space will also serve to illustrate the power of our method in its simplest form.
For our baseline, we simply choose our search vectors at random. Specifically, for each chosen index , we sample from the
variate standard normal distribution, and likewise for each chosen index
. This procedure incorporates no information about the optimization problem at hand, nor the data. But, as we shall report below (Section 6.1), even these random choices of directions can lead to better solutions. Next, we provide a more sophisticated subspace that uses such information to produce better results.4.2.2 Greedy choices of and
In general, the optimal search space for a given loss function would depend upon specific properties of the function itself. However, none of our previous approaches (Section 4.1 and Section 4.2.1) explicitly took such information into account. We now describe an approach that does.
Suppose that, for , we are searching for the optimal step size in a given direction , while keeping everything else fixed. The objective function for this particular search operation is
(9) 
where . Differentiating (9) with respect to and setting it equal to zero, we can solve for the optimal as a function of :
By plugging in the optimal step size into (9), the function becomes a function of , . We can now solve for the optimal search direction , using standard numerical optimization techniques — again, we use quasiNewton with BFGS updates.
In doing so, we have solved for a search direction such that letting take an optimal step in its direction will produce the maximal decrease in the overall loss function. We construct our set of search vectors by repeating this process for each chosen , and the set of search vectors is obtained in the same fashion.
5 Improved AO for MC+ regression
In this section, we apply our escaping strategies to the penalized regression problem described in Section 1.2. We focus on the MC+ penalty function, but our strategies can be applied to other nonconvex penalty functions as well. We also investigate the application of our method to fitting an entire regularization surface, as introduced by Mazumder et al. (2011). Although the singularity of the MC+ penalty function at places certain limits on the types of subspaces that can be feasibly optimized over, applying our ideas in their simple forms still produces notable improvements.
5.1 Scaling
Again, the idea of using a shared variable (Section 3.1) applies. In this setting, this amounts to taking a number of expanded coordinate descent steps (after the standard coordinate descent steps steps have converged; see Algorithm 1), so that we simultaneously search over a single coefficient , as well as a scaling variable for the rest of the coefficient vector, . Mathematically, these expanded coordinate descent steps solve
(10) 
for .
We can actually solve for the optimal and explicitly in this case (see Appendix A). This is attractive because it allows us to avoid using numerical optimization for these steps, which would have been more difficult due to the nondifferentiability of the penalty function at zero. The technical details for these steps are provided in the Appendix.
5.2 Selective scaling
Using our general notation (Sections 1–3), coordinate descent corresponds to each “block” being onedimensional. This puts a certain limit on the kind of restricted joint search operations (Section 3.2) we can implement. In particular, for any given , the only available choice of is itself. However, we are still free to determine which .
As we pointed out in Section 4.2.2, tailoring the search space to the observed data can yield improved results. In the context of regression, it is useful to consider how changing the coefficient in front of could affect the coefficient in front of another variable, say . If these two variables are independent, then a change in would not result in a change to the optimal . However, if these two variables are highly correlated, then one would expect that a decrease in could lead to an increase or decrease in depending on whether their correlation is positive or negative, as some of the dependence previously accounted for by can be “taken over” by .
Thus, we implement a selective scaling strategy. While searching over , we only allow the scaling of if the correlation between and is above some threshold, instead of scaling all . Let denote the set of variables that are sufficiently correlated with , i.e.,
(11) 
for some prechosen . The selective scaling steps solve
(12) 
for . As mentioned above, we can compute the optimal and explicitly for this problem (see Appendix A).
5.3 Fitting entire regularization surfaces with multiple warm starts
Using the MC+ penalty (4), the optimal solution to (3) depends on two regularization parameters, and . For different values of , one can think of as tracing out an entire regularization surface. Mazumder et al. (2011) provided a nice algorithm, called SparseNet, for fitting the entire regularization surface.
Fitting an entire surface of solutions, rather than just a single solution, introduces an interesting set of challenges for our work. When fitting a single solution, we are only concerned with how to best find a good solution for a given pair of . However, SparseNet fits the entire surface of solutions sequentially, using each point on the surface, , as a warm start for fitting the “next” point. Thus, improving the solution at may not be desirable if the improved solution provides an inferior warm start for the next point, resulting in worse solutions further down the surface. Empirically, we have found this to be a common occurrence.
To remedy this problem, our strategy is to keep track of a few different solution surfaces:

— this is the “usual” surface obtained by SparseNet, i.e., each point on this surface is obtained by running the coordinate descent algorithm (an ultimate AO strategy), using the “previous” point on this surface (A) as a warm start;

— each point on this surface is obtained by first running the coordinate descent algorithm and then switching over to search in a different space, but each point also uses the “previous” point on this surface (B) as a warm start;

— like surface B above, each point on this surface also is obtained by first running the coordinate descent algorithm and then switching over to search in a different space, except that each point uses the “previous” point from the surface as a warm start.
At each point , we keep the better of or as our solution.
Remark
In the actual implementation, it is clear that we need not start from scratch in order to obtain the surface ; we can simply start with the surface , and apply our escaping strategies directly at each point. Conceptually, we think it is easier for the reader to grasp what we are doing if we describe three separate surfaces rather than two, but this does not mean we have to triple the amount of computation.
6 Experimental results
We now present some experimental results to demonstrate the effectiveness of our method. For matrix factorization, we use a realworld data set; for MC+ regression, we use a simulated data set.
6.1 Matrix factorization
To demonstrate our method for matrix factorization, we used a data set compiled by McAuley and Leskovec (2013), which consists of approximately 35.3 million reviews from www.amazon.com between 1995 and 2013. We took a dense subset of their data consisting of approximately 5.5 million reviews, such that all users in our subset have rated at least 55 items, and all items have been rated at least 24 times.
For our restricted joint optimization approach, we allowed only a small number () of user and itemvectors in each round to participate in the joint optimization (see Section 4.2). Empirically, we obtained reasonably good and comparable performance results with a wide range of , but results reported here are for .
We tested our method by randomly splitting the ratings into two halves, using one half as the training set , and the other half as the test set . All statistics were averaged over ten runs. As our metric, we used the mean absolute error (MAE),
(13) 
McAuley and Leskovec (2013) reported a mean squared error (MSE) of about on their full Amazon data set using the baseline matrix factorization method with . This would translate to about on the root mean squared error (RMSE) scale, which is more comparable with the MAE. Here, our baseline AO produced slightly better results (see Table 2) because we used a dense subset, so there is presumably more information to be learned about each user and item in our subset.
In order to produce fair comparisons between different methods, for given and we used crossvalidation to choose an optimal value of for each method. The optimal values are shown in Table 1, with the corresponding average MAEs shown in Table 2. As can be seen in Table 2, our approach produces meaningfully better models.
6.2 MC+ regression
To demonstrate our method for MC+ regression, we used a simulated data set from Mazumder et al. (2011) — more specifically, their model . The sample size is , with
predictors generated from the Gaussian distribution with mean zero and covariance matrix
, whose th entry is equal to . The response is generated as a linear function of only of the predictors plus a random noise; in particular,That is, for and otherwise. Mazumder et al. (2011) took with so that the signaltonoise ratio is .
Using in equation (11), we estimated on a grid consisting of different ’s and different ’s. The ’s were equally spaced on the logarithmic scale between and . The ’s were equally spaced on the logarithmic scale between , which is the smallest such that for all , and .
For each point on the grid we considered, we computed the percent decrease in the value of the objective function, i.e.,
where are the values of the objective function when the coordinate descent algorithm converged, and after our restricted joint search, respectively. Table 3 shows that, for about of points on the grid, our strategy made little difference ( no more than percentage points), indicating that the original coordinate descent algorithm already found relatively good solutions at those points. For the remaining of the points, however, our strategy found a better solution — searching in a slightly expanded space further reduced the value of the objective function by an average of . For the smaller half of ’s (more nonconvex objective functions), the average percent decrease was a little over ; for the larger half (less nonconvex objective functions), the average percent decrease was close to .
For each point on the grid, we also computed the percent decrease in the variableselection error, i.e.,
where are the errors of the coordinate descent solution and of our solution, respectively. The variableselection error was measured in terms of
(14) 
Table 4 shows that, overall, our strategy led to improved variableselection results as well — a reduction in error on average.
7 Conclusion
In this article, we have proposed a general framework for improving upon alternating optimization of nonconvex functions. The main idea is that, once standard AO has converged, we switch to conduct our search in a different subspace. We have provided general guidelines for how these different subspaces can be defined, as well as illustrated with two concrete statistical problems — namely, matrix factorization and regression with the MC+ penalty — how problemspecific information can (and should) be used to help us identify good and meaningful search subspaces. By carefully selecting a relevant space to search over, we can escape undesirable locations such as saddle points and produce notable improvements. In addition to serving as examples of our general idea, we think that these improved AO algorithms, for matrix factorization and for regression with the MC+ penalty, are meaningful contributions on their own.
Acknowledgments
This research is partially supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada and by the University of Waterloo.
Appendix A Explicit solution of problem (12)
This appendix gives details of how the problem (12) can be solved explicitly. This is done by identifying all points that satisfy the first order conditions, as well as those at which the objective function is not differentiable, and choosing from all these points the one that minimizes the objective function.
a.1 First order conditions
First, we define two (vector) constants,
(15) 
and rewrite the objective function (12) as
(16) 
where is the MC+ penalty function given by (4). The (vector) constants and are terms that do not depend on either or . The firstorder conditions are then given by
(17) 
and
(18) 
where is not differentiable at , and for ,
(19) 
a.2 Expression of for fixed
Recall that we assume for all (Section 1.2). For given , then, equation (17) implies that any solution can be expressed as
(20) 
where and (and hence and as well) are different constants depending on whether and whether is positive or negative. In particular,
and
Without knowing where is a priori, our strategy is to proceed with solving for (Section A.3 below) using different pairs, and discarding “solutions” that turn out to be inconsistent. For example, if a particular “solution” is obtained using a pair that assumes — i.e., and in (20) — but turns out to be outside this interval, such a “solution” is automatically discarded.
For , we proceed to solve for by setting . If multiple solutions exist that consistently satisfy the first order conditions, then the one that minimizes (16) is chosen.
a.3 Solving for
For any given pair — including , substituting (20) into equation (18) gives
(21) 
This is a single, piecewiselinear equation in a single variable , which can be solved explicitly. First, we check whether is a solution. Then, we consider the cases of and separately.
The case of
When , we have
Let be size of the set . For all , we use the notation to indicate that is th smallest member of the set in terms of its absolute value. That is,
We then define a partition as follows:
On each interval , we have not only , but also for all , which means for all . Thus, on a given interval , equation (18) becomes
(22) 
Notice that if . Since equation (22) is linear in , to check for its zeros, it suffices to evaluate the lefthand side at the endpoints of and determine if there is a change of sign. If there is, we can solve for as
(23) 
Notice that equation (22) may have solutions on multiple intervals, that is, for more than one . Each solution will lead to a different solution for — see equation (20). As we already pointed out in Section (A.2), some of these “solutions” may be inconsistent and discarded accordingly; if more than one consistent solutions remain, then the one that minimizes (16) is kept.
The case of
When , we can perform the exact same search, except that each interval is now the reflected about , and that the term in (23) is multiplied by as if .
References
 Dauphin et al. (2014) Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in highdimensional nonconvex optimization. arXiv:1406.2572.
 Fan and Li (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
 Feuerverger et al. (2012) Feuerverger, A., He, Y., and Khatri, S. (2012). Statistical significance of the Netflix challenge. Statistical Science, 27(2), 202–231.
 Friedman et al. (2010) Friedman, J. H., Hastie, T. J., and Tibshirani, R. J. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistial Software, 33(1), 1–22.
 Gill et al. (1981) Gill, P. E., Murray, W., and Wright, M. H. (1981). Practical Optimization. Academic Press.
 Hoerl and Kennard (1970) Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
 Koren et al. (2009) Koren, Y., Bell, R., and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37.
 Mazumder et al. (2011) Mazumder, R., Friedman, J. H., and Hastie, T. J. (2011). SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495), 1125–1138.
 McAuley and Leskovec (2013) McAuley, J. and Leskovec, J. (2013). Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, pages 165–172, New York, NY, USA. ACM.
 Nguyen and Zhu (2013) Nguyen, J. and Zhu, M. (2013). Contentboosted matrix factorization techniques for recommender systems. Statistical Analysis and Data Mining, 6, 286–301.

Tayal et al. (2014)
Tayal, A., Coleman, T. F., and Li, Y. (2014).
Primal explicit max margin feature selection for nonlinear support vector machines.
Pattern Recognition, 47, 2153–2164.  Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58, 267–288.
 Zhang (2010) Zhang, C.H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 36(4), 1125–1128.
 Zhu (2014) Zhu, M. (2014). Making personalized recommendations in ecommerce. In J. F. Lawless, editor, Statistics in Action: A Canadian Outlook, pages 259–258. Chapman & Hall.
Comments
There are no comments yet.