# Learning Linear Programs from Optimal Decisions

We propose a flexible gradient-based framework for learning linear programs from optimal decisions. Linear programs are often specified by hand, using prior knowledge of relevant costs and constraints. In some applications, linear programs must instead be learned from observations of optimal decisions. Learning from optimal decisions is a particularly challenging bi-level problem, and much of the related inverse optimization literature is dedicated to special cases. We tackle the general problem, learning all parameters jointly while allowing flexible parametrizations of costs, constraints, and loss functions. We also address challenges specific to learning linear programs, such as empty feasible regions and non-unique optimal decisions. Experiments show that our method successfully learns synthetic linear programs and minimum-cost multi-commodity flow instances for which previous methods are not directly applicable. We also provide a fast batch-mode PyTorch implementation of the homogeneous interior point algorithm, which supports gradients by implicit differentiation or backpropagation.

## Authors

• 2 publications
• 3 publications
• 4 publications
• ### Deep Inverse Optimization

Given a set of observations generated by an optimization process, the go...
12/03/2018 ∙ by Yingcong Tan, et al. ∙ 0

• ### Minimum Cost Flows, MDPs, and ℓ_1-Regression in Nearly Linear Time for Dense Instances

In this paper we provide new randomized algorithms with improved runtime...
01/14/2021 ∙ by Jan van den Brand, et al. ∙ 0

• ### Using Differentiable Programming for Flexible Statistical Modeling

Differentiable programming has recently received much interest as a para...
12/07/2020 ∙ by Maren Hackenberg, et al. ∙ 0

• ### An Online-Learning Approach to Inverse Optimization

In this paper, we demonstrate how to learn the objective function of a d...
10/30/2018 ∙ by Andreas Bärmann, et al. ∙ 0

• ### Using machine learning to make constraint solver implementation decisions

Programs to solve so-called constraint problems are complex pieces of so...
05/19/2010 ∙ by Lars Kotthoff, et al. ∙ 0

• ### Backprop-Q: Generalized Backpropagation for Stochastic Computation Graphs

In real-world scenarios, it is appealing to learn a model carrying out s...
07/25/2018 ∙ by Xiaoran Xu, et al. ∙ 0

• ### A Closer Look at Double Backpropagation

In recent years, an increasing number of neural network models have incl...
06/16/2019 ∙ by Christian Etmann, et al. ∙ 8

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In linear programming, the goal is to make a optimal decision under a linear objective and subject to linear constraints. Traditionally, a linear program is designed using knowledge of relevant costs and constraints. More recently, methodologies that are data-driven have emerged. For example, in the “predict-then-optimize” paradigm [Elmachtoub19], linear programs are learned from direct observations of previous costs or constraints.

Inverse optimization (IO) [Burton92, Troutt95, Ahuja01], in contrast, learns linear programs from observations of optimal decisions rather than of the costs or constraints themselves. The IO approach is particularly important when observations come from optimizing agents (e.g., experts [Chan14, Barmann17] or customers [Dong18]) who make near-optimal decisions with respect to their internal (unobserved) optimization models.

From a machine learning perspective, the IO setup is as follows: we are given feature vectors

representing conditions (e.g., time, prices, weather) and we observe the corresponding decision targets (e.g., quantities, actions) determined by an unknown optimization process, which in our case is assumed linear. We view IO as the problem of inferring a constrained optimization model that gives identical (or equivalent) decisions, and which generalizes to novel conditions . The family of candidate models is assumed parametrized by some vector .

Learning a constrained optimizer that makes the observations both feasible and optimal poses multiple challenges that have not been explicitly addressed. For instance, parameter setting in Figure 1 makes the observed decision optimal but not feasible, produces exactly the opposite result, and some values (black-hatched region in Figure 1) are not even admissible because they will result in empty feasible regions. Finding a parameter such as that is consistent with the observations can be difficult. We formulate the learning problem in a novel way, and tackle it with gradient-based methods despite the inherent bi-level nature of learning. Using gradients from backpropagation or implicit differentiation, we successfully learn linear program instances of various sizes as well as learning the costs and right-hand coefficients of a minimum-cost multi-commodity flow problem.

Parametric Linear Programs   In a linear program (LP), decision variables may vary, and the cost coefficients , inequality constraint coefficients , , and equality constraint coefficients , are all constant. In a parametric linear program (PLP), the coefficients (and therefore the optimal decisions) may depend on features . In order to infer a PLP from data, one may define a suitable hypothesis space parametrized by . We refer to this hypothesis space as the form of our forward optimization problem (FOP).

(LP) s.t.
(PLP) s.t.
(FOP) s.t.

A choice of hypothesis in (FOP) identifies a PLP, and a subsequent choice of conditions identifies an LP. The LP can then be solved to yield an optimal decision under the model. These predictions of optimal decisions can be compared to observations at training time, or can be used to anticipate optimal decisions under novel conditions at test time.

## 2 Related Work

Inverse optimization

IO has focused on developing optimization models for minimally adjusting a prior estimate of

to make a single feasible observation optimal [Ahuja01, Heuberger04] or for making minimally sub-optimal to (LP) without a prior [Chan14, Chan19]. Recent work [Babier19]

develops exact approaches for imputing non-parametric

given multiple potentially infeasible solutions to (LP), and to finding non-parametric and/or [Chan18c, Ghobadi20]. In the parametric setting, joint estimation of and via a maximum likelihood approach was developed by Troutt05, Troutt08 when only is a function of . Gallego17 jointly learn and which are affine functions of . Barmann17, Barmann20 and Dong18 study online versions of inverse linear and convex optimization, respectively, learning a sequence of cost functions where the feasible set for each observation are assumed to be fully-specified. tan2019dio proposed a gradient-based approach for learning cost and constraints of a PLP, by ‘unrolling’ a barrier interior point solver and backpropagating through it. Their formulation does not aim to avoid situations where a training target is infeasible, like the one shown in Figure 1 for .

In inverse convex optimization, the focus has been in imputing parametric cost functions while assuming that the feasible region is known for each  [Keshavarz11, Bertsimas15, Aswani18, Esfahani18], usually under assumptions of a convex set of admissible , the objective and/or constraints being convex in , and uniqueness of the optimal solution for every . Furthermore, since the feasible region is fixed for each , it is simply assumed to be non-empty and bounded, unlike for our work. Although our work focuses on linear programming, it is otherwise substantially more general, allowing for learning of all cost and constraint coefficients simultaneously with no convexity assumptions related to , no restrictions on the existence of multiple optima, and explicit handling of empty or unbounded feasible regions.

introduces the concept of directed regression, where the goal is to fit a linear regression model while minimizing the decision loss, calculated with respect to an unconstrained quadratic optimization model.

Donti2017

use a neural network approach to minimize a task loss which is calculated as a function of the optimal decisions in the context of stochastic programming.

Elmachtoub19 propose the “Smart Predict-then-Optimize” framework in which the goal is to predict the cost coefficients of a linear program with a fixed feasible region given past observations of features and true costs, i.e., given . Note that knowing in this case implies we can solve for , so our framework can in principle be applied in their setting but not vice versa. Our framework is still amenable to more ‘direct’ data-driven prior knowledge: if in addition to we have partial or complete observations of or of constraint coefficients, regressing to these targets can easily be incorporated into our overall learning objective.

Structured prediction   In structured output prediction [Taskar05, Bakir07, Nowozin14, Daume15], each prediction is for an objective and known output structure . In our work the structure is also learned, parametrized as , and the objective is linear . In structured prediction the loss is typically a function of and a target , whereas in our setting it is important to consider a parametric loss .

Differentiating through an optimization   Our work involves differentiating through an LP. bengio2000gradient

proposed gradient-based tuning of neural network hyperparameters and, in a special case, backpropagating through the Cholesky decomposition computed during training (suggested by Léo Bottou).

Stoyanov11 proposed backpropagating through a truncated loopy belief propagation procedure. domke2012generic, Domke13 proposed automatic differentiation through truncated optimization procedures more generally, and maclaurin2015gradient proposed a similar approach for hyperparameter search. The continuity and differentiability of the optimal solution set of a quadratic program has been extensively studied [lee2006quadratic]. Amos2017 recently proposed integrating a quadratic optimization layer in a deep neural network, and used implicit differentiation to derive a procedure for computing parameter gradients. As part of our work we specialize their approach, providing an expression for LPs. Even more general is recent work on differentiating through convex cone programs [agrawal2019differentiating], submodular optimization [djolonga2017differentiable], and arbitrary constrained optimization [gould2019deep]. There are also versatile perturbation-based differentiation techniques [papandreou2011perturb, berthet2020learning].

## 3 Methodology

Here we introduce our new bi-level formulation and methodology for learning parametric linear programs. Unlike previous approaches (e.g. Aswani18), we do not transform the problem to a single-level formulation, and so we do not require simplifying assumptions. We propose a technique for tackling our bi-level formulation with gradient-based non-linear programming methods.

### 3.1 Inverse Optimization as PLP Model Fitting

Let denote the training set. A loss function penalizes discrepancy between prediction and target under conditions for the PLP hypothesis identified by . Note that if is optimal under conditions , then must also be feasible. We therefore propose the following bi-level formulation of the inverse linear optimization problem (ILOP):

 minimizew∈W 1N∑Ni=1ℓ(x∗i,xobsi,ui,w)+r(w) (ILOP) subject to A(ui,w)xobsi≤b(ui,w),G(ui,w)xobsi=h(ui,w), i=1,…,N (1a) i=1,…,N (1b)

where simply denotes an optional regularization term such as and denotes additional problem-specific prior knowledge, if applicable (similar constraints are standard in the IO literature [Keshavarz11, Chan19]). The ‘inner’ problem (1b) generates predictions by solving independent LPs. The ‘outer’ problem tries to make these predictions consistent with the targets while also satisfying target feasibility (1a).

Difficulties may arise, in principle and in practice. An inner LP may be infeasible or unbounded for certain , making undefined. Even if all produce feasible and bounded LPs, an algorithm for solving (ILOP) may still attempt to query . The outer problem as a whole may be subject to local minima due to non-convex objective and/or constraints, depending on the problem-specific parametrizations. We propose gradient-based techniques for the outer problem (Section 3.2), but may not exist or may be non-unique at certain and (Section 3.3).

Nonetheless, we find that tackling this formulation leads to practical algorithms. To the best of our knowledge, (ILOP) is the most general formulation of inverse linear parametric programming. It subsumes the non-parametric cases that have received much interest in the IO literature.

Choice of loss function   The IO literature considers decision error, which penalizes difference in decision variables, and objective error, which penalizes difference in optimal objective value [Babier19]. A fundamental issue with decision error, such as squared decision error (SDE) , is that when is non-unique the loss is also not unique; this issue was also a motivation for the “Smart Predict-then-Optimize” paper [Elmachtoub19]. An objective error, such as absolute objective error (AOE) , is unique even if is not. We evaluate AOE using imputed cost during training; this usually requires at least some prior knowledge to avoid trivial cost vectors, as in Keshavarz11.

Target feasibility   Constraints (1a) explicitly enforce target feasibility in any learned PLP. The importance of these constraints can be understood through Figure 1, where hypothesis achieves since and

are on the same hyperplane, despite

being infeasible. Chan19 show that if the feasible region is bounded then for any infeasible there exists a cost vector achieving .

Unbounded or infeasible subproblems   Despite (1a), an algorithm for solving (ILOP) may query a for which an LP in (1b) is itself infeasible and/or unbounded, in which case a finite is not defined. We can extend (ILOP) to explicitly account for these special cases (by penalizing a measure of infeasibility [murty2000infeasibility], and penalizing unbounded directions when detected) but in our experiments simply evaluating the (large) loss for an arbitrary returned by our interior point solver worked nearly as well at avoiding such regions of , so we opt to keep the formulation simple.

Noisy observations   Formulation (ILOP) can be extended to handle measurement noise. For example, individually penalized non-negative slack variables can be added to the right-hand sides of (1a) as in a soft-margin SVM [cortes1995support]. Alternatively, a norm-penalized group of slack variables can be added to each on the left-hand side of (1a), softening targets in decision space. We leave investigation of noisy data and model-misspecification as future work.

### 3.2 Learning Linear Programs with Sequential Quadratic Programming

We treat (ILOP) as a non-linear programming (NLP) problem, making as few assumptions as possible. We focus on sequential quadratic programming (SQP), which aims to solve NLP problems iteratively. Given current iterate , SQP determines a search direction and then selects the next iterate via line search on . Direction is the solution to a quadratic program.

 minimizew f(w) minimizeδ ∇f(wk)Tδ+δTBkδ subject to g(w)≤0(NLP) subject to ∇g(wk)Tδ+g(wk)≤0(SQP) h(w)=0 ∇h(wk)Tδ+h(wk)=0

Each instance of subproblem (SQP) requires evaluating constraints111NLP constraint vector is not the same as FOP right-hand side , despite same symbol. and their gradients at , as well as the gradient of the objective. Matrix approximates the Hessian of the Lagrange function for (NLP), where is typically determined from the gradients by a BFGS-like update. Our experiments use an efficient variant called sequential least squares programming (SLSQP) [schittkowski1982nonlinear2, kraft1988software] which exploits a stable factorization of .

The NLP formulation of (ILOP) has inequality and equality constraints from (1a):

 g(w)=[A(ui,w)xobsi−b(ui,w)]M1i=1, h(w)=[G(ui,w)xobsi−h(ui,w)]M2i=1.

plus any constraints needed to enforce . The NLP constraint residuals and their gradients can be directly evaluated. Evaluating requires solving each LP in (1b). Finally, evaluating requires evaluating vector-Jacobian product for each , which requires differentiating through the LP optimization that produced from and . That is exactly what we do, and this approach allows us to tackle (ILOP) directly in its bi-level form, using powerful gradient-based NLP optimizers like SQP as the ‘outer’ solver. Section 3.3 compares methods for the differentiating through LP optimization.

Redundant NLP constraints   When PLP model parameters have fixed dimension, the NLP formulation of (ILOP) can involve many redundant constraints, roughly in proportion to . Indeed, if and the equality constraints may appear to over-determine , treating (NLP) as a feasibility problem; but, due to redundancy is not uniquely determined. The ease or difficulty of removing redundant constraints from (NLP) depends on the domain-specific parametrizations of PLP constraints and . Equality constraints that are affinely-dependent on can be eliminated from (NLP) by a simple pseudoinverse technique, resulting in a lower-dimensional problem; this also handles the case where (NLP) is not strictly feasible in (either due to noisy observations or model misspecification) by automatically searching only among that exactly minimize the sum of squared residuals . If equality constraints are polynomially-dependent on , we can eliminate redundancy by Gröbner basis techniques [cox2013ideals] although, unlike the affine case, it may not be possible or beneficial to reparametrize-out the new non-redundant basis constraints from the NLP. Redundant inequality constraints can be either trivial or costly to identify [telgen1983identifying], but are not problematic. See Appendix for details.

Benefit over gradient-free methods   Evaluating is expensive in our NLP because it requires solving linear programs. To understand why access to is important in this scenario, it helps to contrast SQP with a well-known gradient-free NLP optimizer such as COBYLA [powell1994direct]. For -dimensional NLP, COBYLA maintains samples of and uses them as a finite-difference approximation to where is the current iterate (best sample). The next iterate is computed by optimizing over a trust region centered at . COBYLA recycles past samples to effectively estimate ‘coarse’ gradients, whereas SQP uses gradients directly. Figure 2 shows SLSQP and COBYLA running on the example from Figure 1.

### 3.3 Computing Loss Function Gradients

If, at a particular point , each corresponding vector-Jacobian product exists, is unique, and can be computed, then we can construct (SQP) at each step. For convenience, we assume that are expressed in terms of within an automatic differentiation framework such as PyTorch, so all that remains is to compute Jacobians at each as an intermediate step at the outset of backpropagation. We consider three approaches:

1. [itemsep=-.05em,topsep=-.18em,leftmargin=1.75cm]

2. backpropagate through the steps of the homogeneous interior point algorithm for LPs,

3. specialize the implicit differentiation procedure of Amos2017 to LPs, and

4. evaluate gradients directly, in closed form (for objective error only).

We implemented a batch PyTorch version of the homogeneous interior point algorithm [andersen2000mosek, xu1996simplified] developed for the MOSEK optimization suite and currently the default linear programming solver in SciPy [2020SciPy]. Our implementation is also efficient in the backward pass, for example re-using the decomposition222Cholesky decomposition is also supported and re-used, but we use decomposition in experiments. from each Newton step.

For implicit differentiation we follow Amos2017 by forming the system of linear equations that result from differentiating the KKT conditions and then inverting that system to compute the needed vector-Jacobian products. For LPs this system can be poorly conditioned, especially at strict tolerances on the LP solver, but in practice it provides useful gradients.

For direct evaluation (in the case of objective error), we use Theorem 1. When

is AOE loss, by chain rule we can multiply each quantity by

to get the needed Jacobians.

###### Theorem 1.

Let be an optimal solution to (LP) and let be an optimal solution to the associated dual linear program. If is non-degenerate then the objective error is differentiable and the total derivatives333In slight abuse of notation, we ignore leading singleton dimension of . are

 ∂z∂c =(xobs−x∗)T ∂z∂A =λ∗x∗T ∂z∂b =−λ∗T ∂z∂G =ν∗x∗T ∂z∂h =−ν∗T.

Gradients and for the right-hand sides are already well-known as shadow prices. If is degenerate then the relationship between shadow prices and dual variables breaks down, resulting in two-sided shadow prices [Strum69, Aucamp82].

We use degeneracy in the sense of tijssen1998balinski, where a point on the relative interior of the optimal face need not be degenerate, even if there exists a degenerate vertex on the optimal face. This matters when is non-unique because interior point methods typically converge to the analytical center of the relative interior of the optimal face [Zhang94]. Tijssen and Sierskma also give relations between degeneracy of and uniqueness of , which we apply in Corollary 1. When the gradients are non-unique, this corresponds to the subdifferentiable case.

###### Corollary 1.

In Theorem 1, both and are unique, is unique if and only if is unique, and both and are unique if and only if is unique or .

## 4 Experiments

We evaluate our approach by learning a range of synthetic LPs and parametric instances of minimum-cost multi-commodity flow. Use of synthetic instances is common in IO (e.g., Ahuja01, Keshavarz11, Dong18) and there are no community-established and readily-available benchmarks, especially for more general formulations. Our experimental study considers instances not directly addressable by previous IO work, either because we learn all coefficients jointly or because the parametrization results in non-convex NLP.

We compare three versions444For completeness we also evaluated finite-differences () which, unsurprisingly, was not competitive. of our gradient-based method (, , ) with two gradient-free methods: random search (RS) and COBYLA. The main observation is that the gradient-based methods perform similarly and become superior to gradient-free methods as the dimension  of parametrization increases. We find that including a black-box baseline like COBYLA is important for assessing the practical difficulty of an IO instance (and encourage future papers to do so) because such methods work reasonably well in low-dimensional problems. A second observation is that generalization to testing conditions is difficult because the discontinuous nature of LP decision space creates an underfitting phenomenon. This may explain why many previous works in IO require a surprising amount of training data for so few model parameters (see end of Section 4

). A third observation is that there are instances for which no method succeeds at minimizing training error 100% of the time. Our method can therefore be viewed as a way to significantly boost the probability of successful training, when combined with simple global optimization strategies such as multi-start.

Experiments used PyTorch v1.6 nightly build, the COBYLA and SLSQP wrappers from SciPy v1.4.1, and were run on an Intel Core i7 with 16GB RAM. (We do not use GPUs, though our PyTorch interior point solver inherits GPU acceleration.) We do not regularize nor have any other hyperparameters.

Learning linear programs   We used the LP generator of tan2019dio, modifying it to create a more challenging variety of feasible regions; their code did not perform competitively in terms of runtime or success rate on these harder instances, and cannot be applied to AOE loss. Fig. 3 shows the task of learning (, , ) with a dimensional parametrization , a dimensional decision space , and 20 training observations. RS fails; COBYLA ‘succeeds’ on  25% of instances; SQP succeeds on 60-75%, which is substantially better. The success curve of slightly lags those of and due to the overhead of backpropagating through the steps of the interior point solver. See Appendix for five additional problem sizes, where overall the conclusions are the same. Surprisingly, works slightly better than and in problems with higher . We observe similar performance on instances with equality constraints, where and also need to be learned (see Appendix). Note that each RS trial returns the best of (typically) thousands of settings evaluated during the time budget, all sampled uniformly from the same from which the ‘true’ synthetic PLP was sampled. Most random (and thus initial) points do not satisfy (1a).

Learning directly, so that comprises all LP coefficients, results in a high-dimensional NLP problem (which is why, to date, the IO literature has focused on special cases of this problem, either with a single [Chan18, Chan19] or fewer coefficients to learn [Ghobadi20]). For example, an instance with has adjustable parameters. , and consistently achieve zero AOE training loss, while RS and COBYLA consistently fail to make learning progress given the same time budget (see Appendix).

Learning minimum-cost multi-commodity flow problems   Fig. 4 shows a visualization of our experiment on the Nguyen-Dupuis graph [Nguyen84]. We learn a periodic arc cost and an affine arc capacity , based on global feature (time of day) and arc-specific features (length) and (toll price). To avoid trivial solutions, we set . Results on 100 instances are shown in Fig. 5. The SQP methods outperform RS and COBYLA in training and testing loss. From an IO perspective the fact that we are jointly learning costs and capacities in a non-convex NLP formulation is already quite general. Again, for higher-dimensional parametrizations, we can expect the advantage of gradient-based methods to get stronger.

We report both the mean and median loss over the testing points in each trial. The difference in mean and median testing error is due to the presence of a few ‘outliers’ among the test set errors. Fig.

6 shows the nature of this failure to generalize: the decision map of a PLP has discontinuities, so the training data can easily under-specify the set of learned models that can achieve zero training loss, similar to the scenario that motivates max-margin learning in SVMs. It is not clear what forms of regularization will reliably improve generalization in IO. Fig. 6 also suggests that training points which closely straddle discontinuities are much more ‘valuable’ from a learning perspective.

## 5 Conclusion

In this paper, we propose a novel bi-level formulation and gradient-based framework for learning linear programs from optimal decisions. The methodology learns all parameters jointly while allowing flexible parametrizations of costs, constraints, and loss functions—a generalization of the problems typically addressed in the inverse linear optimization literature.

Our work facilitates a strong class of inductive priors, namely parametric linear programs, to be imposed on a hypothesis space for learning. A major motivation for ours and for similar works is that, when the inductive prior is suited to the problem, we can learn a much better (and more interpretable) model, from far less data, than by applying general-purpose machine learning methods. In settings spanning economics, commerce, and healthcare, data on decisions is expensive to obtain and to collect, so we hope that our approach will help to build better models and to make better decisions.

## Appendix

### Appendix A: Forward Optimization Problem for Figure 1

Forward optimization problem for Figure 1. The FOP formulation used is shown in (2) below.

 minimizex1,x2 cos(w1+w2u)x1+sin(w1+w2u)x2 (2) subject to (1+w2u)x1≥w1 (1+w1)x2≥w2u x1+x2≤1+w1+w2u

For a fixed and weights it is an LP. The observation was generated using with true parameters .

For illustrative clarity, the panels in Figure 1 depicting the specific feasible regions for are slightly adjusted and stylized from the actual PLP (2), but are qualitatively representative.

### Appendix B: Redundancy Among Target-Feasibility Constraints

Redundant constraints in (1a) are not problematic in principle. Still, removing redundant constraints may help overall performance, either in terms of speed or numerical stability of the ‘outer’ solver. Here we discuss strategies for automatically removing redundant constraints, depending on assumptions. In this section, when we use or it should be understood to represent some target or .

Constraints that are equivalent.   There may exist indices and for which the corresponding constraints and are identical or equivalent. For example, when a constraint is independent of this often results in identical training targets and that produce identical constraints. The situation for equality constraints is similar.

Constraints independent of .   If an individual constraint is independent of then either:

1. [itemsep=-.05em,topsep=-.18em,leftmargin=0.7cm]

2. for all so the constraint can be omitted; or,

3. for some so the (ILOP) formulation is infeasible due to model misspecification, either in structural assumptions, or assumptions about noise.

The same follows for any equality constraint that is independent of . For example, in our minimum-cost multi-commodity flow experiments, the flow conservation constraints (equality) are independent of and so are omitted from (1a) in the corresponding ILOP formulation.

Constraints affinely-dependent in .   Constraints may be affinely-dependent on parameters . For example, this is a common assumption in robust optimization [zhen2018adjustable]. Let and represent the constraints that are affinely dependent on . We can write

 A(u,w)=A0(u)+K∑k=1wkAk(u) and b(u,w)=b0(u)+K∑k=1wkbk(u)

for some matrix-valued functions and vector-valued functions . It is easy to show that we can then rewrite the constraints as where

 ~A(u,x) =[A1(u)x−b1(u)⋯AK(u)x−bK(u)] ~b(u,x) =b0(u)−A0(u)x.

Similarly if are affine in we can rewrite them as . If we apply these functions across all training samples , and stack their coefficients as

then the corresponding ILOP constraints (1a) reduce to a set of linear ‘outer’ constraints and where . These reformulated constraint matrices are the system within which we eliminate redundancy in the affinely-dependent case, continued below.

Equality constraints affinely-dependent in .   We can eliminate affinely-dependent equality constraint sets by reparametrizing the ILOP search over a lower-dimensional space; this is what we do for the experiments with equality constraints shown in Figure 8, although the conclusions do not change with or without this reparametrization. To reparametrize the ILOP problem, compute a Moore-Penrose pseudoinverse to get a direct parametrization of constrained vector in terms of an unconstrained vector :

 w(w′)=~G+~h+(I−~G+~G)w′. (3)

By reparametrizing (ILOP) in terms of we guarantee is satisfied and can drop equality constraints from (1a) entirely. There are three practical issues with (3):

1. [itemsep=-.05em,topsep=-.18em,leftmargin=0.7cm]

2. Constrained vector only has degrees of freedom, so we would like to re-parametrize over a lower-dimensional .

3. To search over we need to specify and such that is equivalent to .

4. Given initial we need a corresponding to initialize our search.

To address the first issue, we can let the final components of in (3) be zero, which corresponds to using a lower-dimensional . As shorthand let matrix be

 P ≡(IK×K−~G+~G)IK×K′=IK×K′−(~G+~G)1:K,1:K′

where denotes as in torch.eye(K, K’) and denotes the first columns of matrix . Then we have where the full dimension of matches the degrees of freedom in subject to and we have for any choice of .

To address the second issue, simplifying gives inequality constraints with and .

To address the third issue we must solve for in the linear system . Since the solution exists and is unique.

Consider also the effect of this reparametrization when is an infeasible system, for example due to noisy observations or misspecified constraints. In that case searching over automatically restricts the search to that satisfy in a least squares sense, akin to adding an infinitely-weighted term to the ILOP objective.

Inequality constraints affinely-dependent in .   After transforming affinely-dependent inequality constraints to , detecting redundancy among these constraints can be as hard as solving an LP [telgen1983identifying]. Generally, inequality constraint is redundant with respect to if and only if the optimal value of the following LP is non-negative:

 minimizew bj−aTjw (4) subject to A{j′≠j}w≤b{j′≠j}

Here is the row of and is all the rows of except the . If the optimal value to (4) is non-negative then it says “we tried to violate the constraint, but the other constraints prevented it, and so the constraint must be redundant.” However, telgen1983identifying reviews much more efficient methods of identifying redundant linear inequality constraints, by analysis of basic basic variables in a simplex tableau. zhen2018adjustable proposed a ‘redundant constraint identification’ (RCI) procedure proposed by that is directly analogous to (4

) along with another heuristic RCI procedure.

Constraints polynomially-dependent in .   Similar to the affinely-dependent case, when the coefficients of constraints and are polynomially-dependent on , we can rewrite the constraints in terms of . Redundancy among equality constraints of the resulting system can be simplified by computing a minimal Gröbner basis [cox2013ideals], for example by Buchberger’s algorithm which is a generalization of Gaussian elimination; see the paper by lim2012groebner for a review of Gröbner basis techniques applicable over a real field. Redundancy among inequality constraints for nonlinear programming has been studied [caron2009redundancy, obuchowska1995minimal]. Simplifying polynomial systems of equalities and inequalities is a subject of semialgebraic geometry and involves generalizations of Fourier-Motzkin elimination. Details are beyond the scope of this manuscript.

### Appendix C: Proofs of Theorem 1 and Corollary 1

###### Proof of Theorem 1.

The dual linear program associated with (LP) is

 maximizeλ,ν bTλ+hTν (DP) subject to ATλ+GTν=c λ≤0,

where are the associated dual variables for the primal inequality and equality constraints, respectively.

Since is optimal to (LP) and are optimal to (DP), then satisfy the KKT conditions (written specialized to the particular LP form we use):

 Ax ≤b (KKT) Gx =h ATλ+GTν =c λ ≤0 D(λ)(Ax−b) =0

where is the diagonal matrix having on the diagonal. The first two constraints correspond to primal feasibility, the next two to dual feasibility and the last one specifies complementary slackness. From here forward it should be understood that satisfy KKT even when not emphasized by .

As in the paper by Amos2017, implicitly differentiating the equality constraints in (KKT) gives

 Gdx =dh−dGx (DKKT) ATdλ+GTdν =dc−dATλ−dGTν D(λ)Adx+D(Ax−b)dλ =D(λ)(db−dAx)

where are parameter differentials and are solution differentials, all having the same dimensions as the variables they correspond to. Because (KKT) is a second-order system, (DKKT) is a system of linear equations. Because the system is linear, a partial derivative such as can be determined (if it exists) by setting and all other parameter differentials to , then solving the system for solution differential , as shown by Amos2017.

We can assume (KKT) is feasible in . In each case of the main proof it will be important to characterize conditions under which (DKKT) is then feasible in . This is because, if (DKKT) is feasible in at least , then by substitution we have

and this substitution is what gives the total derivatives their form. In (5) the substitution holds because feasible in (KKT) implies in (DKKT), where is the th row of . Whenever is feasible in (DKKT) we have for any , where is the th row of differential .

Note that (5) holds even if (DKKT) is not feasible in and/or . In other words, it does not require the KKT point to be differentiable with respect to and/or .

Given a KKT point let be a partition of inequality indices where

 I ={i:λ∗i<0,Aix∗=bi} J ={i:λ∗i=0,Aix∗

and the corresponding submatrices of are . Then (DKKT) in matrix form is

 (6)

The pattern of the proof in each case will be to characterize feasibility of (6) in and then apply (5) for the result.

Evaluating .   Consider . To evaluate the term, set and all other parameter differentials to . Then the right-hand side of (6) becomes

 (7)

where denotes the vector with for component and elsewhere. System (7) is feasible in (not necessarily unique) so we can apply (5) to get . The result for then follows from .

Evaluating .   Consider . Set and all other parameter differentials to . Then the right-hand side of (6) becomes

 (8)

Since is non-degenerate in the sense of tijssen1998balinski, then there are at most active constraints (including equality constraints) and the rows of are also linearly independent. Since active constraints are linearly independent, system (8) is feasible in across all . We can therefore apply (5) to get . The result for then follows from .

Evaluating .   Consider . Set and all other parameter differentials to . For the right-hand side of (6) becomes

 (9)

Since is non-degenerate, then system (9) is feasible in for all by identical reasoning as for . For the right-hand side of (6) is zero and so the system is feasible in . System (9) is therefore feasible in across all . We can therefore apply (5) to get . The result for then follows from .

Evaluating .   Consider . Set and all other parameter differentials to . Then the right-hand side of (6) becomes

 (10)

Since is non-degenerate, then (10) is feasible in for all and by same reasoning as . Applying (5) gives