
Minimizing a Sum of Clipped Convex Functions

10/27/2019
by Shane Barratt, et al.

We consider the problem of minimizing a sum of clipped convex functions; applications include clipped empirical risk minimization and clipped control. While the problem of minimizing the sum of clipped convex functions is NP-hard, we present some heuristics for approximately solving instances of these problems. These heuristics can be used to find good, if not global, solutions and appear to work well in practice. We also describe an alternative formulation, based on the perspective transformation, which makes the problem amenable to mixed-integer convex programming and yields computationally tractable lower bounds. We illustrate one of our heuristic methods by applying it to various examples and use the perspective transformation to certify that the solutions are relatively close to the global optimum. This paper is accompanied by an open-source implementation.


1 Introduction

Suppose $f : \mathbf{R}^n \to \mathbf{R} \cup \{\infty\}$ is a convex function, and $\alpha \in \mathbf{R}$. We refer to the function $\min\{f(x), \alpha\}$ as a clipped convex function. In this paper we consider the problem of minimizing a sum of clipped convex functions,

$$\text{minimize} \quad \sum_{i=1}^{N} \min\{f_i(x), \alpha_i\}, \qquad (1)$$

with variable $x \in \mathbf{R}^n$, where $f_i : \mathbf{R}^n \to \mathbf{R} \cup \{\infty\}$ for $i = 1, \ldots, N$ are closed proper convex functions, and $\alpha_i \in \mathbf{R}$ for $i = 1, \ldots, N$. We use infinite values of $f_i$ to encode constraints on $x$, i.e., to constrain $x \in \mathcal{C}$ for a closed convex set $\mathcal{C}$, we let $f_i(x) = \infty$ for all $x \notin \mathcal{C}$. When $f_i(x) \geq \alpha_i$, the value of the $i$th term in the sum is clipped to $\alpha_i$, which limits how large each term in the objective can be. Many practical problems can be formulated as instances of (1); we describe a few in §2.
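As a concrete illustration (ours, with hypothetical data), the objective of (1) for clipped least-squares terms $f_i(x) = (a_i^T x - b_i)^2$ with clip values $\alpha_i$ can be evaluated in a few lines of numpy:

import numpy as np

def clipped_objective(x, A, b, alpha):
    """Evaluate sum_i min{(a_i^T x - b_i)^2, alpha_i}."""
    residuals_sq = (A @ x - b) ** 2  # f_i(x) = (a_i^T x - b_i)^2
    return np.sum(np.minimum(residuals_sq, alpha))  # clip each term at alpha_i

# Hypothetical data.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)
alpha = np.ones(20)
print(clipped_objective(np.zeros(3), A, b, alpha))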

NP-hardness.

In general, problem (1) is nonconvex and as a result can be very difficult to solve. Indeed, (1) is NP-hard. We show this by giving a reduction of the subset sum problem to an instance of (1).

The subset sum problem involves determining whether or not there exists a subset of a given set of integers $a_1, \ldots, a_n$ that sums to zero. The optimal value of the problem

$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^{n} \left( \min\{x_i^2, 1/4\} + \min\{(x_i - 1)^2 - 1/4, 0\} \right) \\ \text{subject to} & a^T x = 0, \quad \mathbf{1}^T x \geq 1, \end{array}$$

with variable $x \in \mathbf{R}^n$ (the constraints are encoded using infinite values of the $f_i$), which has the form (1), is zero if and only if $x \in \{0, 1\}^n$, at least one of the $x_i$ is equal to one, and $a^T x = 0$; in other words, the set $\{a_i : x_i = 1\}$ sums to zero. Since the subset sum problem can be reduced to an instance of (1), we conclude that in general our problem is at least as hard as difficult problems like the subset sum problem.

Global solution.

There is a simple (exhaustive) method to solve (1) globally: for each subset $\Omega \subseteq \{1, \ldots, N\}$, we solve the convex problem

$$\begin{array}{ll} \text{minimize} & \sum_{i \in \Omega} f_i(x) + \sum_{i \notin \Omega} \alpha_i \\ \text{subject to} & f_i(x) \leq \alpha_i, \quad i \in \Omega, \end{array} \qquad (2)$$

with variable $x \in \mathbf{R}^n$. The solution to (2) with the lowest optimal value is the solution to (1). This general method is not practical unless $N$ is quite small, since it requires the solution of $2^N$ convex optimization problems.

In some specific instances of problem (1), we can cut down the search space if we know that a specific choice of $\Omega$ implies that the constraints $f_i(x) \leq \alpha_i$, $i \in \Omega$, cannot all hold, which means that the optimal value of (2) is $+\infty$. In this case, we do not have to solve problem (2) for this choice of $\Omega$, as we know it will be infeasible. One simple example where this happens is when the $\alpha_i$-sublevel sets $\{x : f_i(x) \leq \alpha_i\}$ are pairwise disjoint, which implies that we only have to solve $N + 1$ convex problems (as opposed to $2^N$) to find the global solution. This idea is used in [11] to guide their proposed search algorithm.
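To make the exhaustive method concrete, the following sketch (ours, not part of the paper's implementation) specializes to clipped least-squares terms $f_i(x) = (a_i^T x - b_i)^2$ and solves (2) with cvxpy for every subset $\Omega$; since it enumerates all $2^N$ subsets, it is only usable for very small $N$.

import itertools
import cvxpy as cp
import numpy as np

def solve_globally(A, b, alpha):
    """Enumerate all subsets Omega and solve (2) for each; keep the best."""
    N, n = A.shape
    best_val, best_x = np.inf, None
    for k in range(N + 1):
        for omega in itertools.combinations(range(N), k):
            inactive = [i for i in range(N) if i not in omega]
            if not omega:
                # All terms clipped: the objective is the constant sum of alpha_i.
                val, xval = alpha.sum(), np.zeros(n)
            else:
                x = cp.Variable(n)
                resid_sq = cp.square(A @ x - b)  # f_i(x) = (a_i^T x - b_i)^2
                obj = cp.sum(resid_sq[list(omega)]) + alpha[inactive].sum()
                cons = [resid_sq[list(omega)] <= alpha[list(omega)]]
                prob = cp.Problem(cp.Minimize(obj), cons)
                prob.solve()  # infeasible choices of Omega give prob.value == inf
                val, xval = prob.value, x.value
            if val < best_val:
                best_val, best_x = val, xval
    return best_val, best_x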

Related work.

The general problem of minimizing a sum of clipped convex functions was recently considered in [11]. That paper also shows that the problem is NP-hard, via a reduction from 3-SAT, and gives a global solution method for a few special cases where the dimension is small. It also provides a heuristic method based on cyclic coordinate descent, leveraging the fact that the one-dimensional subproblems are easy to solve.

The idea of using clipped convex functions has appeared in multiple application areas, the most prominent being statistics. For example, the sum of clipped absolute values (often referred to as the capped $\ell_1$-norm) has been used as a sparsity-inducing regularizer [25, 26, 13]. In particular, [25, 13] make use of the fact that problem (1) can be written as a difference-of-convex (DC) problem and can be approximately minimized via the convex-concave procedure [10] (see Appendix A). The clipped square function (also known as the skipped-mean loss) was used in [21] to estimate view relations, and in [14] to perform robust image restoration. Similar approaches have been taken for clipped loss functions, which have been used for robust feature selection [9], regression [23, 17], classification [19, 16, 22], and robust principal component analysis [18].

Summary.

We begin by presenting some applications of minimizing a sum of clipped convex functions in §2, to empirical risk minimization and control. We then provide some simple heuristics for approximately solving (1) in §3, which we have found to work well in practice. In §4, we describe a method for converting (1) into a mixed-integer convex program (MICP), which can be handled by MICP solvers. Finally, we describe an open-source Python implementation of the ideas described in this paper in §5, and apply our implementation to a few illustrative examples in §6.

2 Applications

In this section we describe some possible applications of minimizing a sum of clipped convex functions.

2.1 Clipped empirical risk minimization

Suppose we have data

$$(x_1, y_1), \ldots, (x_N, y_N) \in \mathbf{R}^n \times \mathcal{Y}.$$

Here $x_i \in \mathbf{R}^n$ is the $i$th feature vector, $y_i \in \mathcal{Y}$ is its corresponding output (or label), and $\mathcal{Y}$ is the output space.

We find the parameters $\theta \in \mathbf{R}^n$ of a linear model given the data by solving the empirical risk minimization (ERM) problem

$$\text{minimize} \quad \sum_{i=1}^{N} \ell(\theta^T x_i, y_i) + r(\theta), \qquad (3)$$

with variable $\theta \in \mathbf{R}^n$, where $\ell : \mathbf{R} \times \mathcal{Y} \to \mathbf{R}$ is the loss function and $r : \mathbf{R}^n \to \mathbf{R} \cup \{\infty\}$ is the regularization function. Here the objective is composed of two parts: the loss function, which measures the accuracy of the predictions, and the regularization function, which measures the complexity of $\theta$. We assume that $\ell$ is convex in its first argument and that $r$ is convex, so problem (3) is a convex optimization problem.

For a given feature vector $x$, our prediction of $y$ is $\hat{y}$, where $\theta^\star$ is optimal for (3). For example, in linear regression, $\ell(z, y) = (z - y)^2$, $\mathcal{Y} = \mathbf{R}$, and $\hat{y} = \theta^{\star T} x$; in logistic regression, $\ell(z, y) = \log(1 + e^{-yz})$, $\mathcal{Y} = \{-1, +1\}$, and $\hat{y} = \mathrm{sign}(\theta^{\star T} x)$, where $\mathrm{sign}(u)$ is equal to $+1$ if $u \geq 0$ and $-1$ otherwise.

While ERM often works well in practice, it can perform poorly when there are outliers in the data. One way of fixing this is to clip the loss for each data point at a value $\alpha > 0$, leading to the clipped ERM problem

$$\text{minimize} \quad \sum_{i=1}^{N} \min\{\ell(\theta^T x_i, y_i), \alpha\} + r(\theta). \qquad (4)$$

After solving (or approximately solving) the clipped problem, we can label data points where $\ell(\theta^T x_i, y_i) \geq \alpha$ as outliers. The clipped ERM problem is an instance of what is referred to in statistics as a redescending M-estimator [8, §4.8], since the derivative of the clipped loss goes to zero as the magnitude of its input goes to infinity. In this terminology, the clip value $\alpha$ is referred to as the minimum rejection point.

In §6.1, we show an example where the standard empirical risk minimization problem fails, while its clipped variant performs well.
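Using the interface described in §5, a clipped ERM problem can be formed directly. The following sketch (ours) sets up clipped logistic regression with a small ridge penalty; the data and clip value are hypothetical, and sccf.minimum and sccf.Problem are as described in §5.

import cvxpy as cp
import numpy as np
import sccf

# Hypothetical data: X is N x n, y in {-1, +1}^N.
N, n = 100, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((N, n))
y = np.sign(rng.standard_normal(N))

theta = cp.Variable(n)
alpha = 1.0  # clip value
objective = 0.0
for i in range(N):
    # Clipped logistic loss: min{log(1 + exp(-y_i * theta^T x_i)), alpha}.
    objective += sccf.minimum(cp.logistic(-y[i] * (X[i] @ theta)), alpha)
objective += 0.1 * cp.sum_squares(theta)  # regularization r(theta)

prob = sccf.Problem(objective)
prob.solve()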

2.2 Clipped control

Suppose we have a linear dynamical system with dynamics given by

$$x_{t+1} = A x_t + B u_t, \qquad t = 0, \ldots, T - 1,$$

where $x_t \in \mathbf{R}^n$ is the state of the system and $u_t \in \mathbf{R}^m$ is the input to the system at time period $t$. The dynamics matrix $A \in \mathbf{R}^{n \times n}$ and the input matrix $B \in \mathbf{R}^{n \times m}$ are given.

We are given stage cost functions $g_t : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R} \cup \{\infty\}$ for $t = 0, \ldots, T$, and an initial state $x_{\mathrm{init}} \in \mathbf{R}^n$. The standard optimal control problem is

$$\begin{array}{ll} \text{minimize} & \sum_{t=0}^{T} g_t(x_t, u_t) \\ \text{subject to} & x_{t+1} = A x_t + B u_t, \quad t = 0, \ldots, T - 1 \\ & x_t \in \mathcal{X}_t, \quad u_t \in \mathcal{U}_t, \quad t = 0, \ldots, T \\ & x_0 = x_{\mathrm{init}}, \end{array}$$

where, at time $t$, $\mathcal{X}_t$ is the convex set of allowable states and $\mathcal{U}_t$ is the convex set of allowable inputs. The variables in this problem are the states $x_1, \ldots, x_T$ and the inputs $u_0, \ldots, u_T$. If the stage cost functions $g_t$ are convex, the optimal control problem is a convex optimization problem.

We define a clipped optimal control problem as an optimal control problem in which the stage costs can be expressed as sums of clipped convex functions, i.e.,

$$g_t(x_t, u_t) = \sum_{i=1}^{K} \min\{ g_{t,i}(x_t, u_t), \alpha_{t,i} \},$$

where, for all $t$ and $i$, the functions $g_{t,i}$ are closed proper convex and $\alpha_{t,i} \in \mathbf{R}$. This gives another instance of our general problem (1).

A simple but practical example of a clipped control problem is described in §6.3. The problem is to design a lane change trajectory for a vehicle; the stage cost is small when the vehicle is centered in either lane, which we express as a sum of two clipped convex functions.

3 Heuristic methods

There are many methods for approximately solving (1). In this section we describe a few heuristic methods that we have observed to work well in practice.

Bi-convex formulation.

Throughout this section, we will make use of a simple reformulation of (1) as the bi-convex problem

$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^{N} \left( \lambda_i f_i(x) + (1 - \lambda_i)\alpha_i \right) \\ \text{subject to} & 0 \leq \lambda \leq 1, \end{array} \qquad (5)$$

with variables $x \in \mathbf{R}^n$ and $\lambda \in \mathbf{R}^N$. (We note that this reformulation was also pointed out in [23, §3].) The equivalence follows immediately from the fact that

$$\min\{f_i(x), \alpha_i\} = \min_{0 \leq \lambda_i \leq 1} \left( \lambda_i f_i(x) + (1 - \lambda_i)\alpha_i \right).$$

Nonlinear programming.

When the $f_i$ are all smooth and the constraint set is representable as the sublevel set of a smooth function, it is possible to use general nonlinear solvers to (approximately) solve (5).

Alternating minimization.

Another possibility is to perform alternating minimization on (5), since each respective minimization is a convex optimization problem. In alternating minimization, at iteration $k$, we solve (5) with $\lambda = \lambda^{k-1}$ fixed, resulting in $x^k$; we then solve (5) with $x = x^k$ fixed, resulting in $\lambda^k$. It can be shown that

$$\lambda_i^k = \begin{cases} 1 & f_i(x^k) \leq \alpha_i \\ 0 & f_i(x^k) > \alpha_i, \end{cases} \qquad (6)$$

is a solution for the minimization over $\lambda$ with $x = x^k$ fixed.

Inexact alternating minimization.

Although alternating minimization often works well, we have found that inexact minimization over $\lambda$ works better in practice. Instead of fully minimizing over $\lambda$, we compute the gradient of the objective of (5) with respect to $\lambda$,

$$\nabla_\lambda = \left( f_1(x^k) - \alpha_1, \ldots, f_N(x^k) - \alpha_N \right).$$

We then perform a signed projected gradient step on $\lambda$ with a fixed step size $\beta$ (we have found that a range of values of $\beta$ all appear to work equally well in practice). This results in the update

$$\lambda^k = \Pi\left( \lambda^{k-1} - \beta \, \mathrm{sign}(\nabla_\lambda) \right),$$

where $\mathrm{sign}$ is applied elementwise to $\nabla_\lambda$, and $\Pi$ denotes the projection onto the unit box $[0,1]^N$, given by $\Pi(\lambda)_i = \min\{\max\{\lambda_i, 0\}, 1\}$.

The final algorithm is described below.

Algorithm 3.1 Inexact alternating minimization.

given initial $\lambda^0 \in [0,1]^N$, step size $\beta > 0$, and tolerance $\epsilon > 0$.
for $k = 1, 2, \ldots$
  1. Minimize over $x$. Set $x^k$ to the solution of (5) with $\lambda = \lambda^{k-1}$ fixed.
  2. Compute the gradient. Set $g^k = (f_1(x^k) - \alpha_1, \ldots, f_N(x^k) - \alpha_N)$.
  3. Update $\lambda$. Set $\lambda^k = \Pi(\lambda^{k-1} - \beta \, \mathrm{sign}(g^k))$.
  4. Check stopping criterion. Terminate if $\|\lambda^k - \lambda^{k-1}\| \leq \epsilon$.
end for

Algorithm 3.1 is a descent algorithm, in the sense that the objective of (5) decreases after every iteration. It is also guaranteed to terminate in a finite number of iterations, since there are only finitely many values of $\lambda$ that the algorithm can visit. We also note that alternating minimization can be viewed as a special case of algorithm 3.1 with $\beta = 1$, in which case the signed projected gradient step recovers the exact update (6). In practice, we have found that algorithm 3.1 often finds the global optimum of simple problems, and it appears to work well on more complicated instances. We use algorithm 3.1 in our generic cvxpy implementation (see §5).
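The following is a self-contained sketch (ours) of algorithm 3.1 for clipped least-squares terms $f_i(x) = (a_i^T x - b_i)^2$; the initialization, step size, and stopping rule are our own illustrative choices.

import cvxpy as cp
import numpy as np

def inexact_alt_min(A, b, alpha, beta=0.1, max_iter=100):
    """Algorithm 3.1 for f_i(x) = (a_i^T x - b_i)^2 clipped at alpha_i."""
    N, n = A.shape
    lam = np.full(N, 0.5)  # initial lambda^0
    x = cp.Variable(n)
    for _ in range(max_iter):
        # 1. Minimize over x with lambda fixed (a weighted least-squares problem).
        obj = cp.sum(cp.multiply(lam, cp.square(A @ x - b))) + (1 - lam) @ alpha
        cp.Problem(cp.Minimize(obj)).solve()
        # 2. Gradient of the objective of (5) with respect to lambda.
        grad = (A @ x.value - b) ** 2 - alpha  # entries f_i(x^k) - alpha_i
        # 3. Signed projected gradient step on lambda.
        new_lam = np.clip(lam - beta * np.sign(grad), 0.0, 1.0)
        # 4. Stop when lambda no longer changes.
        if np.allclose(new_lam, lam):
            break
        lam = new_lam
    return x.value, lam

With $\beta = 1$, this reduces to exact alternating minimization, as noted above.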

4 Perspective formulation

In this section we describe the perspective formulation of (1). The perspective formulation is a mixed-integer convex program (MICP), for which specialized solvers with reasonable practical performance exist. The perspective formulation can also be used to compute a lower bound on the optimal value of the original problem, by relaxing the integrality constraints as in [12], as well as to obtain good initializations for any of the procedures described in §3.

Perspective.

Following [15, §8], we define the perspective (or recession) of a closed proper convex function $f : \mathbf{R}^n \to \mathbf{R} \cup \{\infty\}$ with $0 \in \mathop{\bf dom} f$ as¹

$$f^{\mathrm{persp}}(x, t) = \begin{cases} t f(x/t) & t > 0 \\ \lim_{s \to 0^+} s f(x/s) & t = 0, \end{cases} \qquad (7)$$

defined for $t \geq 0$. We will use the fact that the resulting function is jointly convex in $(x, t)$ [3, §3.2.6].

¹ If $0 \notin \mathop{\bf dom} f$, replace $f(x/s)$ with $f(x_0 + x/s)$ for any $x_0 \in \mathop{\bf dom} f$. See [15, Thm. 8.3] for more details.

Superlinearity assumption.

If $f$ is superlinear, i.e., for all $x \neq 0$ we have

$$\lim_{t \to \infty} \frac{f(tx)}{t} = +\infty, \qquad (8)$$

then

$$f^{\mathrm{persp}}(x, 0) = \begin{cases} 0 & x = 0 \\ +\infty & x \neq 0, \end{cases} \qquad (9)$$

since the limit in (7) is equal to the limit in (8) unless $x = 0$.

There are many convex functions that satisfy this superlinearity property; some examples are the sum of squares function and the indicator function of a compact convex set. Since we will make heavy use of property (9), for the remainder of this section we assume that each $f_i$ is superlinear. If some $f_i$ is not superlinear, it can be made superlinear by adding to it, e.g., a small positive multiple of the sum of squares function.

Conic representation of the perspective.

We note that representing the epigraph of the perspective of a function is often simple if the function has a conic representation [6]. More specifically, if $f$ has a conic representation

$$f(x) \leq s \iff \text{there exists } u \text{ such that } Ax + bs + Cu + d \in \mathcal{K}$$

for some closed convex cone $\mathcal{K}$, then the perspective of $f$ has the conic representation

$$f^{\mathrm{persp}}(x, t) \leq s \iff \text{there exists } u \text{ such that } Ax + bs + Cu + dt \in \mathcal{K},$$

i.e., the constant term $d$ is simply multiplied by $t$. This fact allows us to use a conic representation of the perspective and avoid the issues of non-differentiability and division-by-zero that we might encounter with a direct numerical implementation of the perspective [12, §2].
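As a concrete instance (our illustration), the perspective of the sum of squares function $f(x) = \|x\|_2^2$ is $f^{\mathrm{persp}}(x, t) = \|x\|_2^2 / t$, whose epigraph $\{(x, t, s) : \|x\|_2^2 \leq st, \; t \geq 0\}$ is a rotated second-order cone. In cvxpy this is exactly the quad_over_lin atom, which is implemented through this conic representation, so the boundary $t = 0$ is handled without any explicit division:

import cvxpy as cp

x = cp.Variable(3)
t = cp.Variable(nonneg=True)

# Perspective of ||x||_2^2, represented conically (no division-by-zero issues).
f_persp = cp.quad_over_lin(x, t)

prob = cp.Problem(cp.Minimize(f_persp), [cp.sum(x) == 1, t <= 0.5])
prob.solve()
print(f_persp.value, t.value)  # optimal: x = (1/3, 1/3, 1/3), t = 0.5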

Perspective formulation.

We define the perspective formulation of (1) as the following MICP:

(10)

with variables $x \in \mathbf{R}^n$, $z_i \in \mathbf{R}^n$ for $i = 1, \ldots, N$, and Boolean $\lambda \in \{0,1\}^N$, in which the $i$th term of the objective is $f_i^{\mathrm{persp}}(z_i, \lambda_i) + (1 - \lambda_i)\alpha_i$. Any MICP solver that can handle the perspective functions $f_i^{\mathrm{persp}}$, $i = 1, \ldots, N$, can be used to solve (10).

Proof of equivalence.

To show that (10) is equivalent to the original problem (1), first take $(x, z_1, \ldots, z_N, \lambda)$ feasible for (10). Since $\lambda$ is Boolean, for each $i$ we have $\lambda_i = 0$ or $\lambda_i = 1$. Since $f_i^{\mathrm{persp}}(z_i, \lambda_i)$ must be finite (as this point is feasible), $\lambda_i = 0$ implies that $z_i = 0$ (due to (9)). Similarly, when $\lambda_i = 1$ we must have $z_i = x$. Therefore the $i$th term in the sum becomes

$$f_i^{\mathrm{persp}}(z_i, \lambda_i) + (1 - \lambda_i)\alpha_i = \lambda_i f_i(x) + (1 - \lambda_i)\alpha_i.$$

Summing over the index $i$ yields that problem (10) is equivalent to

$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^{N} \left( \lambda_i f_i(x) + (1 - \lambda_i)\alpha_i \right) \\ \text{subject to} & \lambda \in \{0, 1\}^N, \end{array} \qquad (11)$$

with variables $x$ and $\lambda$. Partially minimizing (11) over $\lambda$, we find that $x$ is a feasible point for (1) with the same objective value.

Now take $x$ feasible for (1). Let

$$\lambda_i = \begin{cases} 1 & f_i(x) \leq \alpha_i \\ 0 & f_i(x) > \alpha_i, \end{cases} \qquad i = 1, \ldots, N,$$

and $z_i = \lambda_i x$. Then $(x, z_1, \ldots, z_N, \lambda)$ is feasible for (10) and has the same objective value, so the two problems are equivalent.

Lower bound via relaxation.

Since the perspective formulation is equivalent to the original problem, relaxing the Boolean constraint $\lambda \in \{0,1\}^N$ in (10) to $0 \leq \lambda \leq 1$ and solving the resulting convex optimization problem

(12)

with variables $x$, $z_1, \ldots, z_N$, and $\lambda$, yields a lower bound $L$ on the optimal value of (1). That is, given any approximate solution of (1) with objective value $\hat{p}$, the optimal value $L$ of (12) certifies that the approximate solution is suboptimal by at most $\hat{p} - L$. Additionally, a solution of the relaxed problem can be used as an initial point for any of the heuristic methods described in §3.

Efficiently solving the relaxed problem.

We note that (12) has roughly $N$ times as many variables as the original problem, so it is worth considering faster solution methods. To do so, we can convert the problem to consensus form [2, §7.1]; i.e., we introduce additional variables $x_i$ for $i = 1, \ldots, N$, and constrain $x_i = x$, resulting in the equivalent problem

$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^{N} F_i(x_i, z_i, \lambda_i) \\ \text{subject to} & x_i = x, \quad i = 1, \ldots, N, \end{array} \qquad (13)$$

where $F_i$ denotes the $i$th term of the objective of (12), written in terms of the local copy $x_i$. Since the objective is separable across the blocks $(x_i, z_i, \lambda_i)$, there exist many efficient distributed algorithms for solving this problem, e.g., the alternating direction method of multipliers (ADMM) [2, 5, 4].

5 Implementation

Our Python package sccf approximately solves generic problems of the form (1), provided all of the functions $f_i$ can be represented as valid cvxpy expressions and constraints. It is available at:

https://www.github.com/cvxgrp/sccf.

We provide a method sccf.minimum, which can be applied to a cvxpy Expression and a scalar to create a sccf.MinExpression. The user then forms an objective as a sum of sccf.MinExpressions, passes this objective and (possibly) constraints to a sccf.Problem object, and then calls the solve method, which implements algorithm 3.1. We take advantage of the fact that the only parameter that changes between successive subproblems is $\lambda$ by caching the canonicalization procedure [1]. Here is an example of using sccf to solve a clipped least squares problem:

import cvxpy as cp
import sccf

# Problem data; get_data is a placeholder returning an m x n matrix A
# and an m-vector b.
A, b = get_data(m, n)

x = cp.Variable(n)
objective = 0.0
for i in range(m):
    # Clipped squared residual: min{(a_i^T x - b_i)^2, 1}.
    objective += sccf.minimum(cp.square(A[i] @ x - b[i]), 1.0)
objective += 0.01 * cp.sum_squares(x)  # small ridge regularizer

prob = sccf.Problem(objective)
prob.solve()

6 Examples

All experiments were conducted on a single core of an Intel i7-8700K CPU clocked at 3.7 GHz.

6.1 Clipped regression

In this example we compare clipped regression (§2.1) with standard linear regression and Huber regression [7] (a well-known technique for robust regression) on a one-dimensional dataset with outliers. We generated 20 data points $(x_i, y_i)$ from a noisy linear model, and introduced outliers by flipping the sign of $y_i$ for 5 randomly chosen data points.

The problems all have the form

$$\text{minimize} \quad \sum_{i=1}^{20} \phi(\theta_1 x_i + \theta_2 - y_i), \qquad (14)$$

with variables $\theta_1, \theta_2 \in \mathbf{R}$, where $\phi : \mathbf{R} \to \mathbf{R}$ is a penalty function. In clipped regression, $\phi(u) = \min\{u^2, \alpha\}$ for a clip value $\alpha > 0$. In linear regression, $\phi(u) = u^2$. In Huber regression,

$$\phi(u) = \begin{cases} u^2 & |u| \leq M \\ M(2|u| - M) & |u| > M, \end{cases}$$

for a threshold $M > 0$.

Figure 1: Clipped regression, linear regression, and Huber regression on a one-dimensional dataset with outliers. The outliers affect the linear regression and Huber regression models, while the clipped regression model appears to be minimally affected.

Let $\theta^\star$ be the clipped regression model; we deem points where the loss is clipped, i.e., where $(\theta_1^\star x_i + \theta_2^\star - y_i)^2 \geq \alpha$, as outliers and the remaining points as inliers. In figure 1 we visualize the data points and the resulting models, along with the outliers/inliers identified by the clipped regression model. In this figure, the clipped regression model clearly outperforms the linear and Huber regression models, since it is able to fully ignore the outliers. Algorithm 3.1 terminated in 0.13 seconds and took 8 iterations on this instance.

Lower bound.

The relaxed version of the perspective formulation (12) can be used to efficiently find a lower bound on the objective value for the clipped version of (14). The objective value of (14) for clipped regression was 1.147, while the lower bound we calculated was 0.533, meaning our approximate solution is suboptimal by at most 0.614.

In figure 2 we plot the clipped objective (14) for various values of $\theta$; note that the function is highly nonconvex and that $\theta^\star$ is the (global) solution. We also plot the objective of the perspective relaxation as a function of $\theta$, found by partially minimizing (12) over $z$ and $\lambda$; note that this function is convex and a surprisingly good approximation of the true convex envelope. The minimum of the perspective relaxation and the true minimum are quite close, leading us to believe that the solution of the perspective relaxation can be a good initialization for the heuristic methods.

Figure 2: The clipped regression loss and its perspective relaxation.

6.2 Clipped logistic regression

In this example we apply clipped logistic regression (§2.1) to a dataset with outliers. We generated data by sampling 1000 data points $(x_i, y_i)$ from a mixture of two Gaussian distributions. We randomly partitioned the data into 100 training data points and 900 test data points, and introduced outliers by flipping the sign of $y_i$ for 20 random training data points.

We (approximately) solved the clipped logistic regression problem

$$\text{minimize} \quad \sum_{i=1}^{100} \min\left\{ \log\left(1 + e^{-y_i(\theta^T x_i + b)}\right), \alpha \right\},$$

with variables $\theta$ and $b$, for various values of the clip value $\alpha$. We also solved the problem with $\alpha = \infty$, i.e., the standard logistic regression problem. Over the values of $\alpha$ we tried, algorithm 3.1 took, on average, 6.37 seconds and terminated in 9.64 iterations.

Figure 3: Test accuracy of clipped logistic regression (solid), test accuracy of standard logistic regression (gray), and fraction of detected outliers (dot-dashed) for varying clip values $\alpha$. Note that the fraction of detected outliers goes down as $\alpha$ goes up. For an intermediate range of $\alpha$, the test accuracy of clipped logistic regression is higher than that of standard logistic regression; clipped logistic regression converges to standard logistic regression as $\alpha \to \infty$.
Figure 4: A plot of the entries of $\lambda$ over the course of algorithm 3.1 for the clipped logistic regression example. Note that at some iterations, the gradient of the loss with respect to a particular $\lambda_i$ changes sign, causing $\lambda_i$ to be updated in the opposite direction.

Figure 3 displays the test accuracy and the fraction of detected outliers over the range of clip values $\alpha$ we considered. Figure 4 shows the trajectory of the entries of $\lambda$ during the execution of algorithm 3.1 for the $\alpha$ with the highest test accuracy, while figure 5 plots the histogram of the logistic loss across the data points for this same $\alpha$.

Figure 5: Left: histogram of log logistic loss for each data point in standard logistic regression; right: histogram of log logistic loss for each data point in clipped logistic regression. Note that standard logistic regression attempts to make the loss small for all data points, while its clipped counterpart allows the loss to be high for some of the data points.

6.3 Lane changing

In this example, we consider a control problem in which a vehicle traveling down a road at a fixed speed must avoid obstacles, stay in one of two lanes, and provide a comfortable ride. We let $p_t$ denote the lateral position of the vehicle at time $t$, for $t = 1, \ldots, T$ ($T$ is the time horizon). The obstacle avoidance constraints are given as vectors $l, u \in \mathbf{R}^T$ that represent lower and upper bounds on the lateral position at each time, i.e., $l_t \leq p_t \leq u_t$.

We can split the objective into the sum of two functions, described below.

  • Lane cost. Suppose the two lanes are centered at $c_1$ and $c_2$. The lane cost is given by

    $$\mathcal{L}(p) = \sum_{t=1}^{T} \left( \min\{(p_t - c_1)^2, \alpha\} + \min\{(p_t - c_2)^2, \alpha\} \right),$$

    for a clip value $\alpha > 0$. The lane cost incentivizes the vehicle to be in the center of one of the two lanes; it is evidently a sum of clipped convex functions.

  • Comfort cost. The comfort cost is given by

    $$\mathcal{C}(p) = \gamma_1 \|Dp\|_2^2 + \gamma_2 \|D^2 p\|_2^2 + \gamma_3 \|D^3 p\|_2^2,$$

    where $D$ is the (first-order) difference operator and $\gamma_1, \gamma_2, \gamma_3 > 0$ are weights to be chosen. The comfort cost is a weighted sum of the squared lateral velocity, acceleration, and jerk.

To find the optimal lateral trajectory we solve the problem

$$\begin{array}{ll} \text{minimize} & \mathcal{L}(p) + \mathcal{C}(p) \\ \text{subject to} & l \leq p \leq u \\ & p_1 = p_{\mathrm{start}}, \quad p_T = p_{\mathrm{end}}, \end{array} \qquad (15)$$

with variable $p \in \mathbf{R}^T$, where $p_{\mathrm{start}}$ and $p_{\mathrm{end}}$ are the given starting and ending lateral positions of the trajectory.
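The following sketch (ours) constructs an instance of (15) with the sccf interface of §5; the horizon, lane centers, weights, clip value, and bounds are hypothetical stand-ins, not the values used in the paper's experiment.

import cvxpy as cp
import numpy as np
import sccf

T = 100                  # horizon (hypothetical)
c1, c2 = 0.0, 1.0        # lane centers (hypothetical)
alpha = 0.1              # lane-cost clip value (hypothetical)
gamma = [1.0, 1.0, 1.0]  # comfort weights (hypothetical)

p = cp.Variable(T)       # lateral position trajectory

# Lane cost: two clipped quadratics per time step, one per lane center.
objective = 0.0
for t in range(T):
    objective += sccf.minimum(cp.square(p[t] - c1), alpha)
    objective += sccf.minimum(cp.square(p[t] - c2), alpha)

# Comfort cost: weighted squared lateral velocity, acceleration, and jerk.
for k, g in enumerate(gamma):
    objective += g * cp.sum_squares(cp.diff(p, k=k + 1))

# Endpoints and (hypothetical) obstacle bounds l <= p <= u.
l, u = -0.25 * np.ones(T), 1.25 * np.ones(T)
constraints = [p[0] == c1, p[T - 1] == c2, p >= l, p <= u]

prob = sccf.Problem(objective, constraints)
prob.solve()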

Figure 6: Trajectory of a vehicle looking to avoid obstacles (represented by boxes) while optimizing for comfort and lane position.

Numerical example.

In figure 6 we show the trajectory resulting from an approximate solution of (15) on an instance with three obstacles. For this example, algorithm 3.1 terminated in 1.2 seconds and took 4 iterations. We are able to find a comfortable trajectory that avoids the obstacles and spends as little time as possible between the lanes.

Lower bound.

Using the relaxed version of the perspective formulation (12), we can compute a lower bound on the objective value of the clipped control problem (15). We found a lower bound value of around 103.55, while the approximate solution we found had an objective value of 119.07, indicating that our approximate solution is no more than 15% suboptimal.

Acknowledgments

S. Barratt is supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518.

References

  • [1] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and Z. Kolter (2019) Differentiable convex optimization layers. In Advances in Neural Information Processing Systems.
  • [2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3 (1), pp. 1–122.
  • [3] S. Boyd and L. Vandenberghe (2004) Convex Optimization. Cambridge University Press.
  • [4] D. Gabay and B. Mercier (1976) A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications 2 (1), pp. 17–40.
  • [5] R. Glowinski and A. Marroco (1975) Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis 9 (R2), pp. 41–76.
  • [6] M. Grant and S. Boyd (2008) Graph implementations for nonsmooth convex programs. In Recent Advances in Learning and Control, pp. 95–110.
  • [7] P. Huber (1973) Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics 1 (5), pp. 799–821.
  • [8] P. Huber and E. Ronchetti (2009) Robust Statistics. John Wiley & Sons.
  • [9] G. Lan, C. Hou, and D. Yi (2016) Robust feature selection via simultaneous capped ℓ2-norm and ℓ2,1-norm minimization. In IEEE Intl. Conf. on Big Data Analysis (ICBDA), pp. 1–5.
  • [10] T. Lipp and S. Boyd (2016) Variations and extension of the convex–concave procedure. Optimization and Engineering 17 (2), pp. 263–287.
  • [11] T. Liu and H. Jiang (2019) Minimizing sum of truncated convex functions and its applications. Journal of Computational and Graphical Statistics 28 (1), pp. 1–10.
  • [12] N. Moehle and S. Boyd (2015) A perspective-based convex relaxation for switched-affine optimal control. Systems & Control Letters 86, pp. 34–40.
  • [13] C. Ong and L. An (2013) Learning sparse classifiers with difference of convex functions algorithms. Optimization Methods and Software 28 (4), pp. 830–854.
  • [14] J. Portilla, A. Tristan-Vega, and I. Selesnick (2015) Efficient and robust image restoration using multiple-feature L2-relaxed sparse analysis priors. IEEE Transactions on Image Processing 24 (12), pp. 5046–5059.
  • [15] T. Rockafellar (1970) Convex Analysis. Princeton University Press.
  • [16] A. Safari (2014) An e–E–insensitive support vector regression machine. Computational Statistics 29 (6), pp. 1447–1468.
  • [17] Y. She and A. Owen (2011) Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association 106 (494), pp. 626–639.
  • [18] Q. Sun, S. Xiang, and J. Ye (2013) Robust principal component analysis via capped norms. In Proc. Intl. Conf. on Knowledge Discovery and Data Mining, pp. 311–319.
  • [19] S. Suzumura, K. Ogawa, M. Sugiyama, and I. Takeuchi (2014) Outlier path: a homotopy algorithm for robust SVM. In Intl. Conf. on Machine Learning, pp. 1098–1106.
  • [20] P. Tao and L. An (1997) Convex analysis approach to DC programming: theory, algorithms and applications. Acta Mathematica Vietnamica 22 (1), pp. 289–355.
  • [21] P. Torr and A. Zisserman (1998) Robust computation and parametrization of multiple view relations. In Intl. Conf. on Computer Vision, pp. 727–732.
  • [22] G. Xu, B. Hu, and J. Principe (2016) Robust C-loss kernel classifiers. IEEE Transactions on Neural Networks and Learning Systems 29 (3), pp. 510–522.
  • [23] Y. Yu, M. Yang, L. Xu, M. White, and D. Schuurmans (2010) Relaxed clipping: a global training method for robust regression and classification. In Advances in Neural Information Processing Systems, pp. 2532–2540.
  • [24] A. Yuille and A. Rangarajan (2003) The concave–convex procedure. Neural Computation 15 (4), pp. 915–936.
  • [25] T. Zhang (2009) Multi-stage convex relaxation for learning with sparse regularization. In Advances in Neural Information Processing Systems, pp. 1929–1936.
  • [26] T. Zhang (2010) Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research 11, pp. 1081–1107.

Appendix A Difference of convex formulation

In this section we make the observation that (1) can be expressed as a difference-of-convex (DC) programming problem.

Let $g_i(x) = \max\{f_i(x) - \alpha_i, 0\}$. This (convex) function measures how far $f_i(x)$ is above the clip value $\alpha_i$. We can express the $i$th term in the sum as

$$\min\{f_i(x), \alpha_i\} = f_i(x) - g_i(x),$$

since when $f_i(x) \leq \alpha_i$ we have $f_i(x) - g_i(x) = f_i(x)$, and when $f_i(x) > \alpha_i$ we have $f_i(x) - g_i(x) = f_i(x) - (f_i(x) - \alpha_i) = \alpha_i$. Since $\sum_{i=1}^N f_i$ and $\sum_{i=1}^N g_i$ are both convex, (1) can be expressed as the DC programming problem

$$\text{minimize} \quad \sum_{i=1}^{N} f_i(x) - \sum_{i=1}^{N} g_i(x), \qquad (16)$$

with variable $x \in \mathbf{R}^n$. We can then apply well-known algorithms like the convex-concave procedure [20, 24] to (approximately) solve (16).
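As an illustration (ours), the convex-concave procedure for (16) replaces the concave part $-\sum_i g_i$ with its linearization at the current iterate and solves the resulting convex problem. For clipped least-squares terms $f_i(x) = (a_i^T x - b_i)^2$, a subgradient of $g_i$ at $x^k$ is zero when $f_i(x^k) \leq \alpha_i$ and $\nabla f_i(x^k) = 2(a_i^T x^k - b_i) a_i$ otherwise, giving the following sketch with a fixed iteration count:

import cvxpy as cp
import numpy as np

def ccp_clipped_lsq(A, b, alpha, iters=20):
    """Convex-concave procedure for (16) with f_i(x) = (a_i^T x - b_i)^2."""
    N, n = A.shape
    xk = np.zeros(n)  # starting point (our choice)
    x = cp.Variable(n)
    for _ in range(iters):
        r = A @ xk - b
        active = r ** 2 > alpha  # terms with g_i(x^k) > 0
        # Subgradient of sum_i g_i at x^k.
        grad = 2 * A[active].T @ r[active] if active.any() else np.zeros(n)
        # Convex subproblem: sum_i f_i(x) minus the linearized concave part
        # (constants dropped).
        obj = cp.sum(cp.square(A @ x - b)) - grad @ x
        cp.Problem(cp.Minimize(obj)).solve()
        xk = x.value
    return xk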

Appendix B Minimal convex extension

If we replace each $f_i$ with any function $\tilde{f}_i$ satisfying $\tilde{f}_i(x) = f_i(x)$ when $f_i(x) \leq \alpha_i$ (and $\tilde{f}_i(x) \geq \alpha_i$ otherwise), we get an equivalent problem. One such $\tilde{f}_i$ is the minimal convex extension of $f_i$ restricted to its $\alpha_i$-sublevel set, which is given by

$$\tilde{f}_i(x) = \sup\left\{ f_i(z) + g^T(x - z) \;:\; f_i(z) \leq \alpha_i, \; g \in \partial f_i(z) \right\}.$$

In general, the minimal convex extension of a function is hard to compute, but it can be represented analytically in some (important) special cases. For example, if $f_i(x) = x^2$ and $\alpha_i = M^2$, the minimal convex extension is the Huber penalty function,

$$\tilde{f}_i(x) = \begin{cases} x^2 & |x| \leq M \\ M(2|x| - M) & |x| > M. \end{cases}$$

Using the minimal convex extension leads to an equivalent problem but, depending on the algorithm, replacing $f_i$ with $\tilde{f}_i$ can lead to better numerical performance.
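A quick numerical check (ours) of the example above: the Huber penalty with threshold $M$ agrees with $x^2$ on $\{x : x^2 \leq M^2\}$ and upper-bounds the clipped square $\min\{x^2, M^2\}$ everywhere else.

import numpy as np

def huber(x, M=1.0):
    # x^2 inside [-M, M]; linear extension M(2|x| - M) outside.
    return np.where(np.abs(x) <= M, x ** 2, M * (2 * np.abs(x) - M))

M = 1.0
xs = np.linspace(-3, 3, 601)
inside = np.abs(xs) <= M
assert np.allclose(huber(xs, M)[inside], xs[inside] ** 2)
assert np.all(huber(xs, M) >= np.minimum(xs ** 2, M ** 2))
print("Huber extension checks passed.")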