 # Deep Inverse Optimization

Given a set of observations generated by an optimization process, the goal of inverse optimization is to determine likely parameters of that process. We cast inverse optimization as a form of deep learning. Our method, called deep inverse optimization, is to unroll an iterative optimization process and then use backpropagation to learn parameters that generate the observations. We demonstrate that by backpropagating through the interior point algorithm we can learn the coefficients determining the cost vector and the constraints, independently or jointly, for both non-parametric and parametric linear programs, starting from one or multiple observations. With this approach, inverse optimization can leverage concepts and algorithms from deep learning.


## 1 Introduction

The potential for synergy between optimization and machine learning is well-recognized, with recent examples including [8, 18, 26]. Our work uses machine learning for inverse optimization. Consider a parametric linear optimization problem, PLP:

$$\begin{array}{llr}
\underset{x}{\text{minimize}} & c(u,w)'x & \quad(1)\\
\text{subject to} & A(u,w)\,x \le b(u,w),
\end{array}$$

where $c(u,w)$, $A(u,w)$, and $b(u,w)$ are all functions of features $u$ and weights $w$. Let $x^*(u,w)$ be an optimal solution to PLP. Given a set of observed optimal solutions $\{x_n\}$ for observed conditions $\{u_n\}$, the goal of inverse optimization (IO) is to determine values of the optimization process parameters $w$ that generated the observed optimal solutions. Applications of IO range from medicine (e.g., imputing the importance of treatment sub-objectives from clinically-approved radiotherapy plans) to energy (e.g., predicting the behaviour of price-responsive customers).
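To make the forward problem concrete, the sketch below solves one small instance of PLP for a given $(u, w)$. The particular affine dependence of $c$ and $b$ on $u$ is a made-up illustration, not the parameterization used in the experiments:

```python
import numpy as np
from scipy.optimize import linprog

def forward(u, w):
    """Solve the forward LP for one feature value u and weight vector w.

    The way c and b depend on (u, w) here is a hypothetical example of a
    PLP parameterization; any differentiable dependence would do.
    """
    c = np.array([np.cos(w[0] + w[1] * u), np.sin(w[0] + w[1] * u)])
    A = np.array([[-1.0, 0.0],
                  [0.0, -1.0],
                  [1.0, 1.0]])
    b = np.array([0.0, 0.0, 1.0 + 0.1 * u])
    res = linprog(c, A_ub=A, b_ub=b, bounds=(None, None))
    return res.x

x_opt = forward(0.5, np.array([0.1, 0.2]))
```

For this instance both cost components are positive, so the optimum sits at the vertex of the feasible region closest to the origin; changing $w$ rotates the cost vector and can move the optimum to a different vertex, which is exactly the sensitivity IO must reason about.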

Fundamentally, IO problems are learning problems: each $u_n$ is a feature vector and each $x_n$ is its corresponding target; the goal is to learn model parameters $w$ that minimize some loss function. In this paper, we cast inverse optimization as a form of deep learning. Our method, called deep inverse optimization, is to unroll an iterative optimization process and then use backpropagation to learn model parameters that generate the observations/targets.

Figure 1 shows the actual result of applying our deep IO method to three inverse optimization learning tasks. The top panel illustrates the non-parametric, single-point variant of model (1) (the case when exactly one target $x_{\text{tru}}$ is given), a classical problem in IO (see [1, 12]). In Figure 1 (i), only $c$ needs to be learned: starting from an initial cost vector $c_{\text{ini}}$, our method finds a $c_{\text{lrn}}$ which makes $x_{\text{tru}}$ an optimal solution of the LP by minimizing the loss. In Figure 1 (ii), starting from $c_{\text{ini}}$, $A_{\text{ini}}$ and $b_{\text{ini}}$, our approach finds $c_{\text{lrn}}$, $A_{\text{lrn}}$ and $b_{\text{lrn}}$ which make $x_{\text{tru}}$ an optimal solution of the learned LP, again through minimizing the loss.

Figure 1 (iii) shows learning for the parametric problem instance

$$\begin{array}{llr}
\underset{x}{\text{minimize}} & \cos(w_0+w_1u)\,x_1+\sin(w_0+w_1u)\,x_2 & \quad(2)\\
\text{subject to} & -x_1 \le 0.2\,w_0u, & \\
& -x_2 \le -0.2\,w_1u, & \\
& w_0x_1+\left(1+\tfrac{1}{3}w_1u\right)x_2 \le w_0+0.1u. &
\end{array}$$

Starting from $w_{\text{ini}}$ with a loss (mean squared error) of 0.45, our method is able to find a $w_{\text{lrn}}$ with a loss of zero, thereby making the observed targets optimal solutions of (2) for the observed values $u_n$. Given newly observed $u$ values, the learned model in this example would predict correct decisions. In other words, the learned model generalizes well.

The contributions of this paper are as follows. We propose a general framework for inverse optimization based on deep learning. This framework is applicable to learning coefficients of the objective function and constraints, individually or jointly; minimizing a general loss function; learning from a single or multiple observations; and solving both non-parametric and parametric problems. As a proof of concept, we demonstrate that our method obtains effectively zero loss on many randomly generated linear programs for all three types of learning tasks shown in Figure 1, and always improves the loss significantly. Such a numerical study on randomly generated non-parametric and parametric linear programs with multiple learnable parameters has not previously been published for any IO method in the literature. Finally, to the best of our knowledge, we are the first to use unrolling and backpropagation for constrained inverse optimization.

We explain how our approach differs from methods in inverse optimization and machine learning in Section 2. We present our deep IO framework in Section 3 and our experimental results in Section 4. Section 5 discusses both the generality and the limitations of our work, and Section 6 concludes the paper.

## 2 Related Work

The goal of our paper is to develop a general-purpose IO approach that is applicable to problems for which theoretical guarantees or efficient exact optimization approaches are difficult or impossible to develop. Naturally, such a general-purpose approach will not be the method of choice for all classes of IO problems. In particular, for non-parametric linear programs, closed-form solutions for learning the cost vector $c$ (Figure 1 (i)) and for learning the constraint coefficients have been derived by Chan et al. [12, 14] and Chan and Kaw, respectively. However, learning objective and constraint coefficients jointly (Figure 1 (ii)) has, to date, received little attention. To the best of our knowledge, this task has been investigated only by Troutt et al. [36, 37], who referred to it as linear system identification, using a maximum likelihood approach. However, their approach was limited to two dimensions or required the coefficients to be non-negative.

In the parametric optimization setting, Keshavarz et al. develop an optimization model that encodes the KKT optimality conditions for imputing the objective function coefficients of a convex optimization problem. Aswani et al. focus on the same problem under the assumption of noisy measurements, developing a bilevel problem and two algorithms which are shown to maintain statistical consistency. Saez-Gallego and Morales address the case of learning the objective and right-hand side jointly in a parametric setting where the cost vector is assumed to be an affine function of a regressor. The general case of learning the weights of a parametric linear optimization problem (1), where $c$, $A$ and $b$ are functions of $(u, w)$ (Figure 1 (iii)), has not been addressed in the literature.

Recent work in machine learning [4, 5, 16] views inverse optimization through the lens of online learning, where new observations appear over time rather than as one batch. Our approach may be applicable in online settings, but we focus on generality in the batch setting and do not investigate real-time cases.

Methodologically, our unrolling strategy is similar to that of Maclaurin et al., who directly optimize the hyperparameters of a neural network training procedure with gradient descent. Conceptually, the closest papers to our work are by Amos and Kolter and by Donti, Amos and Kolter. However, these papers are written independently of the inverse optimization literature. Amos and Kolter present the OptNet framework, which integrates a quadratic optimization layer in a deep neural network. The gradients for updating the coefficients of the optimization problem are derived through implicit differentiation. This approach involves taking matrix differentials of the KKT conditions for the optimization problem in question, while our strategy is based on allowing a deep learning framework to unroll an existing optimization procedure. Their method has efficiency advantages, while our unrolling approach is easily applicable, including to processes for which the KKT conditions may not hold or are difficult to implicitly differentiate. We include a more in-depth discussion in Section 5.

## 3 Deep Learning Framework for Inverse Optimization

The problems studied in inverse optimization are learning problems: given features $u_n$ and corresponding targets $x_n$, the goal is to learn the parameters of a forward optimization model that generates the $x_n$ as its optimal solutions. A complementary view is that inverse optimization is a learning technique specialized to the case when the observed data comes from an optimization process. Given this perspective on inverse optimization, and motivated by the success of deep learning on a variety of learning tasks in recent years, this paper develops a deep learning framework for inverse optimization problems.

Deep learning is a set of techniques for training the parameters of a sequence of transformations (layers) chained together. The more intermediate layers, the ‘deeper’ the architecture. We refer the reader to the textbook by Goodfellow, Bengio and Courville for additional details about deep learning. The parameters of the intermediate layers can be trained/learned through backpropagation, an automatic differentiation technique that computes the gradient of an output with respect to its input through the layers of a neural network, starting from the final layer all the way to the initial one. This method efficiently computes an update to the weights of the model. Importantly, current machine learning libraries such as PyTorch provide built-in backpropagation capabilities that allow for wider use of deep learning. Thus, our deep inverse optimization framework iterates between solving the forward optimization problem using an iterative optimization algorithm and backpropagating through the steps (layers) of that algorithm to improve the estimates of the learnable parameters (weights) of the forward process.

Our approach, shown in Algorithm 1, takes the pairs $(u_n, x_n)$, $n = 1, \dots, N$, as input, and starts by initializing the weights $w$. For each $n$, the forward optimization problem (FO) is solved with the current weights (line 5), and the loss between the resulting optimal solution and $x_n$ is computed (line 6). The gradient of the loss function with respect to $w$ is computed by backpropagation through the layers of the forward process. In line 9, line search is used to determine the step size $\eta$ for updating the weights: $\eta$ is reduced by half whenever infeasibility or unboundedness is encountered, until either a value is found that leads to a loss reduction or $\eta$ falls below a minimum threshold, in which case early termination of the algorithm is triggered. Finally, in line 10, the weights are updated using the average gradient, the step size $\eta$, and $\lambda$, a vector representing the component-wise learning rates for $w$.
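The outer learning loop of Algorithm 1 can be sketched as follows. The generic `solve_forward`, the squared-error loss, and the fixed step size are stand-ins for the paper's unrolled IPM solve, per-instance loss, and line-searched $\eta$; the toy forward process at the bottom exists only so the sketch is runnable end-to-end:

```python
import torch

def deep_io(data, solve_forward, w_init, lr, max_steps=200):
    # data: list of (u_n, x_n) feature/target pairs.
    # solve_forward(u, w): a differentiable (unrolled) forward optimizer.
    w = w_init.clone().requires_grad_(True)
    for _ in range(max_steps):
        loss = sum(((solve_forward(u, w) - x_obs) ** 2).sum()
                   for u, x_obs in data) / len(data)
        grad, = torch.autograd.grad(loss, w)   # backprop through the unroll
        with torch.no_grad():                  # plain gradient step; the
            w -= lr * grad                     # line search on eta is omitted
    return w.detach()

# Toy forward process whose optimum is available in closed form:
# argmin_x (x - w*u)^2 is simply w*u, so the true w = 2 is recoverable.
toy_forward = lambda u, w: w * u
data = [(torch.tensor(u), torch.tensor(2.0 * u)) for u in (1.0, 2.0, 3.0)]
w_lrn = deep_io(data, toy_forward, torch.tensor(0.5), lr=0.05)
```

In the paper's setting, `solve_forward` is itself the unrolled interior point solve, so the gradient flows through every Newton step of the forward optimization.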

Importantly, our framework is applicable in the context of any differentiable, iterative forward optimization procedure. In principle, parameter gradients are automatically computable even with non-linear constraints or non-linear objectives, so long as they can be expressed through standard differentiable primitives. Our particular implementation uses the barrier interior point method (IPM), as described by Boyd and Vandenberghe, as our forward optimization solver. The IPM forward process is illustrated in Figure 2 (i): the central path taken by IPM is shown for the current $A$, $b$ and $c$, which define both the current feasible region and the current objective. As shown in Figure 2 (ii), backpropagation starts from the computation of the loss between a (near) optimal forward optimization solution and the target, and proceeds backward through all the steps of IPM, i.e., back to the starting point of IPM and to the forward instance parameters, finally yielding the gradient of the loss with respect to $w$. In practice, backpropagating all the way back to the starting point may not be necessary for computing accurate gradients; see Section 5.
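A minimal differentiable log-barrier solver in the spirit of the one used here can be written directly in PyTorch. Because every operation below, Newton solves included, is a standard differentiable primitive, autograd can backpropagate through the entire solve; the termination rule and step-size safeguard are simplifications, not the paper's exact implementation:

```python
import torch

def barrier_ipm(c, A, b, x0, t0=1.0, mu=10.0, t_max=1e6, newton_iters=20):
    """Minimize c'x subject to Ax <= b via the log-barrier method.

    Every step is a differentiable PyTorch op, so gradients of the
    returned solution w.r.t. c, A, b flow through the unrolled solve.
    """
    x, t = x0, t0
    while t < t_max:
        for _ in range(newton_iters):
            s = b - A @ x                          # slacks (must stay > 0)
            grad = t * c + A.t() @ (1.0 / s)       # grad of t*c'x + barrier
            H = A.t() @ torch.diag(1.0 / s ** 2) @ A
            dx = torch.linalg.solve(H, -grad)      # Newton direction
            alpha = 1.0                            # backtrack to stay feasible
            while alpha > 1e-10 and torch.any(b - A @ (x + alpha * dx) <= 0):
                alpha *= 0.5
            x = x + alpha * dx
        t *= mu                                    # sharpen the barrier
    return x

c = torch.tensor([1.0, 1.0], requires_grad=True)
A = torch.tensor([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = torch.tensor([0.0, 0.0, 1.0])
x_star = barrier_ipm(c, A, b, x0=torch.tensor([0.3, 0.3]))
```

Here the LP's optimum is the vertex at the origin, and calling `torch.autograd.grad` on any function of `x_star` with respect to `c`, `A`, or `b` yields sensitivities of the solution without any hand-derived formulas.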

The framework requires setting three main hyperparameters: $w_{\text{ini}}$, the initial weight vector; max_steps, the total number of steps allotted to the training; and $\lambda$, the learning rates for the different components of $w$. The number of additional hyperparameters depends on the forward optimization process.

## 4 Experimental Results

In this section, we demonstrate the application of our framework on randomly-generated LPs for the three types of problems shown in Figure 1: learning $c$ in the non-parametric case; learning $c$, $A$ and $b$ together in the non-parametric case; and learning $w$ in the parametric case.

#### 4.0.1 Implementation

Our framework is implemented in Python, using PyTorch version 0.4.1 and its built-in backpropagation capabilities. All numerical operations are carried out with PyTorch tensors and standard PyTorch primitives, including the matrix inversion at the heart of the Newton step.

#### 4.0.2 Hyperparameters

We cap learning at max_steps total steps in all experiments. Four additional hyperparameters are set in each experiment:

• $\epsilon$, which controls the precision and termination of IPM;

• $t_0$: the initial value of the barrier IPM sharpness parameter $t$;

• $\mu$: the factor by which $t$ is increased along the IPM central path;

• $\lambda$: the vector of per-parameter learning rates, which in some experiments is broken down into $\lambda_c$ and $\lambda_{Ab}$.

In all experiments, the hyperparameter $\epsilon$ is either held constant or decayed exponentially from a loose to a tight tolerance during learning. The decay is a form of graduated optimization, and tends to help performance when using the MSE loss.

#### 4.0.3 Baseline LPs

To generate problem instances, we first create a set of baseline LPs with $d$ variables and $m$ constraints by sampling random points and constructing their convex hull via the scipy.spatial.ConvexHull package. We generate 50 LP instances for each of six problem sizes, combining two choices of dimension $d$ with three choices of constraint count $m$. Our experiments focus on inequality constraints. We observed that our method can work for equality-constrained instances, but we did not systematically evaluate equality constraints and we leave that for future work.

### 4.1 Non-Parametric

We first demonstrate the performance of our method for learning $c$ only, and for learning $c$, $A$ and $b$ jointly, on the single-point variant of model (1), i.e., when a single optimal target $x_{\text{tru}}$ is given, a classical problem in IO. We use two loss functions, absolute duality gap (ADG) and squared error (SE), defined as follows:

$$\begin{array}{lllr}
\text{ADG} &=& \left|c_{\text{lrn}}'\,(x_{\text{tru}}-x_{\text{lrn}})\right|, & \quad(3)\\
\text{SE} &=& \left\|x_{\text{tru}}-x_{\text{lrn}}\right\|_2^2, & \quad(4)
\end{array}$$

the first of which is a classical performance metric in IO and the second is a standard metric in machine learning.
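Both losses are one-liners; the sketch below assumes `c_lrn`, `x_tru`, and `x_lrn` are NumPy vectors:

```python
import numpy as np

def adg(c_lrn, x_tru, x_lrn):
    # Absolute duality gap: |c_lrn' x_tru - c_lrn' x_lrn|, eq. (3).
    return abs(float(c_lrn @ (x_tru - x_lrn)))

def se(x_tru, x_lrn):
    # Squared error: ||x_tru - x_lrn||_2^2, eq. (4).
    return float(np.sum((x_tru - x_lrn) ** 2))

c = np.array([1.0, 0.0])
gap = adg(c, np.array([1.0, 1.0]), np.array([1.0, 0.0]))  # difference ⟂ c
err = se(np.array([1.0, 1.0]), np.array([1.0, 0.0]))
```

The first call illustrates a point elaborated in Section 5: ADG can be zero even when the two solutions differ, as long as their difference is orthogonal to $c_{\text{lrn}}$, whereas SE is zero only when the solutions coincide.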

#### 4.1.1 Learning c only

To complete instance generation for this experiment, we randomly select one vertex of the convex hull to be the target $x_{\text{tru}}$ for each of the 50 baseline LP instances and for each of the six $(d, m)$ combinations.

Initialization is done by randomly sampling each component of $c_{\text{ini}}$. We implement a randomized grid search by sampling 20 random combinations from three hyperparameter sets (for $\epsilon$, $t_0$ and $\lambda$). As in other applications of deep learning, it is not clear which hyperparameters will work best for a particular problem instance. For each instance we run our algorithm with the same 20 hyperparameter combinations, reporting the best final error values.

Figure 3 (i) shows the results of this experiment for ADG and SE loss. In both cases, our method is able to reliably learn $c$: in fact, for all instances, the final error falls below a small threshold, while the majority of initial errors are far above it. There is no clear pattern in the performance of the method as $d$ and $m$ change for ADG; for SE, the final loss is slightly bigger for higher $d$.

#### 4.1.2 Learning c, A, b jointly

Our approach to instance generation here is to start with each baseline LP and generate a strictly feasible or infeasible target within some reasonable proximity of an existing vertex. The algorithm is then forced to learn new constraints that generate the target, which is not an optimum for the initial LP. To make this task more challenging, we also perturb $c$ so that it is not initialized too close to the optimal direction.

For each of the 50 baseline LP feasible regions, we generate a cost vector $c_{\text{tru}}$ and compute its optimal solution $x^*$. To generate an infeasible target, we perturb $x^*$ with random noise. We similarly generate a challenging $c_{\text{ini}}$ by corrupting $c_{\text{tru}}$ with noise. To generate a strictly feasible target near $x^*$, we move $x^*$ toward a uniformly random point within the feasible region, generated as a Dirichlet-weighted combination of all vertices; this method was used because adding noise in 10 dimensions almost always results in an infeasible target.
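The Dirichlet-weighted combination mentioned above can be sketched as follows, where `vertices` is assumed to be an array with one hull vertex per row:

```python
import numpy as np

def random_interior_point(vertices, rng=np.random.default_rng(0)):
    # Convex combination of all vertices with Dirichlet(1,...,1) weights:
    # the weights are non-negative and sum to 1, so the point lies in the
    # convex hull (strictly inside when every weight is positive).
    lam = rng.dirichlet(np.ones(len(vertices)))
    return lam @ vertices

square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
p = random_interior_point(square)
```

This construction works in any dimension, which is what makes it preferable to additive noise when a strictly feasible point is needed in 10 dimensions.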

In summary, we generate new LP instances with the same feasible regions as the baseline LPs but with a corrupted $c$, plus one feasible and one infeasible target each. The goal is to demonstrate the ability of our algorithm to detect the change and also move the constraints and objective so that the feasible/infeasible target becomes a vertex optimum. For each of the six problem sizes, we randomly split the 50 instances into two subsets, one with feasible and the other with infeasible targets. For ADG loss we hold $\epsilon$ constant, and for SE we use the $\epsilon$-decay strategy. In practice, this decay strategy is similar to putting emphasis on learning $c$ in the initial iterations and ending with emphasis on constraint learning.

The values of the hyperparameters $\lambda_c$ and $\lambda_{Ab}$ are independently selected and concatenated into one learning rate vector $\lambda$. We generate 20 different hyperparameter combinations. We run our algorithm on each instance with all hyperparameter combinations and record the value of the best trial.

Figure 3 (ii) shows the results of this experiment for ADG and SE loss. In both cases, our method learns model parameters that drive the median loss close to zero. For ADG, our method performs equally well for all problem sizes, and there is not much difference in the final loss between feasible and infeasible targets. For SE, however, the final loss is larger for higher $d$ but decreases as $m$ increases. Furthermore, there is a visible difference in the performance of the method on feasible and infeasible points for 10-dimensional instances: learning from infeasible targets becomes a more difficult task.

### 4.2 Parametric

Several aspects of the experiment for parametric LPs differ from the non-parametric case. First, we train by minimizing $\mathrm{MSE}(w)$, defined as

$$\mathrm{MSE}(w) \;=\; \frac{1}{N}\sum_{n=1}^{N}\left\|x(u_n,w_{\text{tru}})-x(u_n,w)\right\|_2^2. \qquad(5)$$

We chose the mean of the SE loss instead of the mean of the ADG loss for the parametric experiments because it is zero only if the targets are all feasible, which is not necessarily required for ADG to be zero. This makes the SE loss more difficult from a learning point of view, but also leads to a more intuitive notion of success. See Section 5 for discussion. In the parametric case, we also assess how well the learned PLP generalizes by evaluating its MSE on a held-out test set.

To generate parametric problem instances, we again started from the baseline LP feasible regions. To generate a true PLP, we used six weights to define linear functions of $u$ for all elements of $c$, all elements of $b$, and one random element in each row of $A$. For example, for 2-dimensional problems with four constraints, our instances have the following form:

$$\begin{array}{llr}
\underset{x}{\text{minimize}} & (c_1+w_1+w_2u)\,x_1+(c_2+w_1+w_2u)\,x_2 & \quad(6)\\
\text{subject to} & \begin{bmatrix}
a_{11} & a_{12}+w_3+w_4u\\
a_{21} & a_{22}+w_3+w_4u\\
a_{31}+w_3+w_4u & a_{32}\\
a_{41} & a_{42}+w_3+w_4u
\end{bmatrix} x \le \begin{bmatrix}
b_1+w_5+w_6u\\
b_2+w_5+w_6u\\
b_3+w_5+w_6u\\
b_4+w_5+w_6u
\end{bmatrix} &
\end{array}$$

Specifically, the “true PLP” instances are generated by sampling the weights so that, when $u = 0$, the feasible region of the true PLP matches the baseline LP. For each true PLP, we find a range of $u$ over which the resulting PLP remains bounded and feasible. To find this ‘safe’ range, we evaluate the PLP at increasingly large values of $u$ and try to solve the corresponding LP, expanding the range if successful. For each true PLP, we generate 20 equally spaced training points spanning this range, and sample 20 test points uniformly from the same range. We then initialize learning from a corrupted PLP by perturbing each element of $w_{\text{tru}}$ with random noise.
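As an illustration, a PLP of the form (6) can be assembled as follows; the baseline coefficients `c0`, `A0`, `b0` and the perturbed column indices `cols` are assumptions standing in for one sampled instance:

```python
import numpy as np

def build_plp(u, w, c0, A0, b0, cols):
    # Six weights define linear functions of u for all of c, all of b,
    # and one chosen element per row of A, as in instance form (6).
    c = c0 + w[0] + w[1] * u
    A = A0.copy()
    for i, j in enumerate(cols):      # cols[i]: perturbed column of row i
        A[i, j] += w[2] + w[3] * u
    b = b0 + w[4] + w[5] * u
    return c, A, b

c0 = np.array([1.0, -1.0])
A0 = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [1.0, 1.0]])
b0 = np.ones(4)
c, A, b = build_plp(0.0, np.zeros(6), c0, A0, b0, cols=[1, 1, 0, 1])
```

With all six weights zero, the PLP reduces to the baseline LP, consistent with the property that the feasible region coincides with the baseline at $u = 0$.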

The hyperparameters $\epsilon$, $t_0$ and $\lambda_c$ are sampled as before, and $\lambda_{Ab}$ is then chosen to be a factor times $\lambda_c$, i.e., a relative learning rate. Here, $\lambda_c$ and $\lambda_{Ab}$ control the learning rates of the parameters within $w$ that determine $c$ and $(A, b)$, respectively. In total, we generate 20 different hyperparameter combinations. We run our algorithm on each instance with all hyperparameter combinations and record the best final error value. A constant value of $\epsilon$ is used.

We demonstrate the performance of our method on learning parametric LPs of the form shown in (6) for the same problem sizes as before. In Figure 4, we report two metrics evaluated on the training set, namely the initial and final training MSE, and one metric for the test set, the final test MSE. Figure 4 (iii) shows an example instance from the training set. We see that, overall, our deep learning method works well on 2-dimensional problems, with the training and testing errors both being much smaller than the initial error. In the vast majority of cases the test error is also comparable to the training error, though there are a few cases where it is worse, which indicates a failure to generalize well. For 10-dimensional instances, the algorithm significantly improves the MSE over the initial MSE, but in most cases fails to drive the loss to zero, either due to local minima or slow convergence. Again, performance on the test set is similar to that on the training set.

## 5 Discussion

The conceptual message that we wish to reinforce is that inverse optimization should be viewed as a form of deep learning, and that unrolling gives easy access to the gradients of any parameter used directly or indirectly in the forward optimization process. There are many aspects to this view that merit further exploration. What kind of forward optimization processes can be inversely optimized this way? Which ideas and algorithms from the deep learning community will help? Are there aspects of IO that make gradient-based learning more challenging than in deep learning at large? Conclusive answers are beyond the scope of this paper, but we discuss these and other questions below.

Generality and applicability.  As a proof of concept, this paper uses linear programming for the forward problems and the barrier IPM as the forward optimization process. In principle, the framework is applicable to any forward process to which automatic differentiation can be applied. This observation does not mean that ours is the best approach for a specialized IO problem, such as learning from a single point or from multiple points within the same feasible region, but it provides a new strategy.

The practical message of our paper is that, when faced with novel classes or novel parameterizations of IO problems, the unrolling strategy provides convenient access to a suite of general-purpose gradient-based algorithms for solving the IO problem at hand. This strategy is made especially easy by deep learning libraries that support dynamic ‘computation graphs’ such as PyTorch. Researchers working within this framework can rapidly apply IO to many differentiable forward optimization processes, without having to derive the algorithm for each case. Automatic differentiation and backpropagation have enabled a new level of productivity for deep learning research, and they may do the same for inverse optimization research. Applying deep inverse optimization does not require expertise in deep learning itself.

We chose IPM as a forward process because the inner Newton step is differentiable and because we expected the barrier sharpness (temperature) parameter to have a stabilizing effect on the gradient. For non-differentiable optimization processes, it may still be possible to develop differentiable versions. In deep learning, many advances have been made by developing differentiable versions of traditionally discrete operations, such as memory addressing or sampling from a discrete distribution. We believe the scope of differentiable forward optimization processes may similarly be expanded over time.

Limitations and possible improvements.  Deep IO inherits the limitations of most gradient-based methods. If learning is initialized to the right “basin of attraction”, it can proceed to a global optimum. Even then, the choice of learning algorithm may be crucial. When implemented within a steepest descent framework, as we have here, the learning procedure can get trapped in local minima or exhibit very slow convergence. Such effects are why most instances in Figure 4 (ii) failed to achieve zero loss.

In deep learning with neural networks, poor local minima become exponentially rare as the dimension of the learning problem increases [15, 33]. A typical strategy for training neural networks is therefore to over-parameterize (use a high search dimension) and then use regularization to avoid over-fitting to the data. In deep IO, natural parameterizations of the forward process may not permit an increase in dimension, or there may not be enough observations for regularization to compensate, so local minima remain a potential obstacle. We believe training and regularization methods specialized to low-dimensional learning problems, such as that of Sahoo et al., may be applicable here.

We expect that other techniques from deep learning, and from gradient-based optimization in general, will translate to deep IO. For example, optimization techniques with second-order aspects such as momentum and L-BFGS are readily available in deep learning frameworks. Other deep learning ‘tricks’ may be applicable to stabilizing deep IO. For example, we observe that, when $c$ is normal to a constraint, the gradient with respect to $c$ can suddenly grow very large. We stabilized this behaviour with line search, but a similar ‘exploding gradient’ phenomenon exists when training deep recurrent neural networks, and gradient clipping is a popular way to stabilize training. A detailed investigation of applicable deep learning techniques is outside the scope of this paper.

Deep IO may be more successful when the loss with respect to the forward process can be annealed or ‘smoothed’ in a manner akin to graduated non-convexity. Our $\epsilon$-decay strategy is an example of this, as discussed below.

Finally, it may be possible to develop hybrid approaches, combining gradient-based learning with closed-form solutions or combinatorial algorithms.

Loss function and metric of success.  One advantage of the deep inverse optimization approach is that it can accommodate various loss functions, or combinations of loss functions, without special development or analysis. For example, one could substitute other $p$-norms, or losses that are robust to outliers, and the gradient will be automatically available. This flexibility may be valuable. Special loss functions have been important in machine learning, especially for structured output problems. The decision variables of optimization processes are likewise a form of structured output.

In this study we chose two classical loss functions: absolute duality gap and squared error. The behaviour of our algorithm varied depending on the loss function used. Looking at Figure 3 (ii), it appears that deep IO performs better with ADG loss than with SE loss when learning $c$, $A$ and $b$ jointly. However, this performance is due to the theoretical property that ADG can be zero even when the observed target point is arbitrarily infeasible. With ADG, all the IO solver needs to do is adjust the parameters so that $c_{\text{lrn}}$ is orthogonal to $x_{\text{tru}} - x_{\text{lrn}}$, which in no way requires the learned model to be capable of generating $x_{\text{tru}}$ as an optimum. In other words, ADG is meaningful mainly when the true feasible region is known, as in Figure 3 (i). When the true region is unknown, SE prioritizes solutions that directly generate the observations, and may therefore be a more meaningful loss function. That is why we used it for our parametric experiments depicted in Figure 4.

Figure 5: Loss surfaces for the feasible region and target shown in Figure 1 (i).

Minimizing the SE loss also appears to be more challenging for steepest descent. To get a sense of the characteristics of ADG versus SE as $c$ varies, consider Figure 5, which depicts the loss for the IO problem in Figure 1 (i) using both high-precision and low-precision settings of $\epsilon$ for IPM. Because the ADG loss depends directly on $c$, the loss varies smoothly even while the corresponding optimum stays fixed. The SE loss, in contrast, is piece-wise constant; an infinitesimal perturbation of $c$ will almost never change the SE loss in the limit of exact forward optimization. Note that the gradients derived by implicit differentiation indicate a zero gradient almost everywhere in the linear case, which would mean $c$ cannot be learned by gradient descent. IPM can learn nonetheless because the barrier sharpness parameter smooths the loss, especially at low values. The precision parameter $\epsilon$ limits the maximal sharpness during forward optimization, and so the gradient is not zero in practice, especially when the precision is low. Notice that the SE loss surface becomes qualitatively smoother at low precision, whereas ADG is not fundamentally changed. Also notice that when $c$ is normal to a constraint (when the optimal point is about to transition from one vertex to another) the gradient explodes even when the problem is smoothed.

Computational efficiency.  Our paper is conceptual and focuses on flexibility and the likelihood of success, rather than computational efficiency. Many applications of IO are not real-time, and so we expect methods with running times on the order of seconds or minutes to be of practical use. Still, we believe the framework can be both flexible and fast.

Deep learning frameworks are GPU-accelerated and scale well with the size of an individual forward problem, so large instances are not a concern. A bigger issue for GPUs is solving many small or moderate instances efficiently. Amos and Kolter developed a batch-mode GPU forward solver to address this.

What is more concerning for the unrolling strategy is that forward optimization processes can be very deep, with hundreds or thousands of iterations. Backpropagation requires keeping all the intermediate values of the forward pass resident in memory, for later use in the backward pass. The computational cost of backpropagation is comparable to that of the forward process, so skipping the backward pass offers no asymptotic advantage. Although memory usage was small in our instances, if memory usage is linear in depth, then at some depth the unrolling strategy will cease to be practical compared to Amos and Kolter’s implicit differentiation approach. However, we observed that for IPM most of the gradient contribution comes from the final ten Newton steps before termination. In other words, the gradient vanishes with depth, which means it can be well-approximated in practice with truncated backpropagation through time, which uses a small constant pool of memory regardless of depth.
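In PyTorch, this truncation amounts to detaching the iterate a fixed number of steps before termination, so only the tail of the unrolled solve stays on the autograd tape. The fixed-point iteration below is a toy stand-in for the Newton steps of IPM:

```python
import torch

def truncated_unroll(x, w, steps=100, keep_last=10):
    # Only the final `keep_last` iterations are recorded for backprop;
    # everything earlier is cut from the graph with detach().
    for k in range(steps):
        if k == steps - keep_last:
            x = x.detach()
        x = x - 0.1 * (x - w)   # toy iteration converging to w
    return x

w = torch.tensor(3.0, requires_grad=True)
x_final = truncated_unroll(torch.tensor(0.0), w)
grad_w, = torch.autograd.grad(x_final, w)
```

For this toy iteration the truncated gradient is $1 - 0.9^{10} \approx 0.65$ instead of the full-unroll value $1 - 0.9^{100} \approx 1$, yet the memory held for the backward pass is bounded by `keep_last` regardless of `steps`; when the gradient contribution genuinely concentrates in the final iterations, as we observed for IPM, the approximation error is far smaller than in this example.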

In practice, we suggest that the unrolling approach is convenient during the development and exploration phase of IO research. Once an IO model is proven to work, it can potentially be made more efficient by deriving the implicit gradients and comparing them to the unrolled implementation as a reference. Still, more important than improving any of these constants is to adopt the asymptotically faster learning algorithms actively being developed in the deep learning community.

## 6 Conclusion

We developed a deep learning framework for inverse optimization based on backpropagation through an iterative forward optimization process. We illustrated the potential of this framework via an implementation in which the forward process is the interior point barrier method. Our results on non-parametric and parametric linear problems show promising performance. To the best of our knowledge, this paper is the first to explicitly connect deep learning and inverse optimization.