1 Introduction
In linear programming, the goal is to make an optimal decision under a linear objective and subject to linear constraints. Traditionally, a linear program is designed using knowledge of the relevant costs and constraints. More recently, data-driven methodologies have emerged. For example, in the “predict-then-optimize” paradigm [Elmachtoub19], linear programs are learned from direct observations of previous costs or constraints.
Inverse optimization (IO) [Burton92, Troutt95, Ahuja01], in contrast, learns linear programs from observations of optimal decisions rather than of the costs or constraints themselves. The IO approach is particularly important when observations come from optimizing agents (e.g., experts [Chan14, Barmann17] or customers [Dong18]) who make near-optimal decisions with respect to their internal (unobserved) optimization models.
From a machine learning perspective, the IO setup is as follows: we are given feature vectors representing conditions (e.g., time, prices, weather) and we observe the corresponding decision targets (e.g., quantities, actions) determined by an unknown optimization process, which in our case is assumed linear. We view IO as the problem of inferring a constrained optimization model that gives identical (or equivalent) decisions, and which generalizes to novel conditions. The family of candidate models is assumed to be parametrized by some vector.

Learning a constrained optimizer that makes the observations both feasible and optimal poses multiple challenges that have not been explicitly addressed. For instance, one parameter setting in Figure 1 makes the observed decision optimal but not feasible, another produces exactly the opposite result, and some values (black-hatched region in Figure 1) are not even admissible because they result in empty feasible regions. Finding a parameter that is consistent with the observations can be difficult. We formulate the learning problem in a novel way, and tackle it with gradient-based methods despite the inherent bilevel nature of learning. Using gradients from backpropagation or implicit differentiation, we successfully learn linear program instances of various sizes, as well as the costs and right-hand coefficients of a minimum-cost multicommodity flow problem.
Parametric Linear Programs In a linear program (LP), the decision variables may vary, while the cost coefficients, inequality constraint coefficients, and equality constraint coefficients are all held constant. In a parametric linear program (PLP), the coefficients (and therefore the optimal decisions) may depend on features. In order to infer a PLP from data, one may define a suitable hypothesis space parametrized by some vector. We refer to this hypothesis space as the form of our forward optimization problem (FOP).
A choice of hypothesis in (FOP) identifies a PLP, and a subsequent choice of conditions identifies an LP. The LP can then be solved to yield an optimal decision under the model. These predictions of optimal decisions can be compared to observations at training time, or can be used to anticipate optimal decisions under novel conditions at test time.
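To make the forward direction concrete, the following sketch maps conditions and parameters to LP coefficients and solves the resulting LP with SciPy. The parametrization, the function name `forward_lp`, and all numeric values are invented for illustration; they are not the paper's actual hypothesis class.

```python
import numpy as np
from scipy.optimize import linprog

def forward_lp(u, w):
    """Hypothetical FOP: map conditions u and parameters w to LP
    coefficients, then solve  min_x c^T x  s.t.  A x <= b,  x >= 0."""
    c = np.array([w[0] * u, w[1]])       # cost depends on the conditions
    A_ub = np.array([[1.0, 1.0],
                     [w[2], 1.0]])       # one learned constraint row
    b_ub = np.array([1.0, w[3]])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * 2, method="highs")
    return res.x if res.success else None

# Predicted optimal decision under conditions u = 0.5
x_star = forward_lp(0.5, w=np.array([-1.0, -1.0, 2.0, 1.5]))
```

At training time such predictions are compared to observed decisions; at test time the same map anticipates decisions under novel conditions.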
2 Related Work
Inverse optimization
IO has focused on developing optimization models for minimally adjusting a prior estimate of the cost to make a single feasible observation optimal [Ahuja01, Heuberger04], or for making it minimally suboptimal to (LP) without a prior [Chan14, Chan19]. Recent work [Babier19] develops exact approaches for imputing nonparametric costs given multiple potentially infeasible solutions to (LP), and for finding nonparametric constraints [Chan18c, Ghobadi20]. In the parametric setting, joint estimation of costs and constraints via a maximum likelihood approach was developed by Troutt05, Troutt08 when only the cost is a function of the features. Gallego17 jointly learn costs and constraints that are affine functions of the features. Barmann17, Barmann20 and Dong18 study online versions of inverse linear and convex optimization, respectively, learning a sequence of cost functions where the feasible set for each observation is assumed to be fully specified. tan2019dio proposed a gradient-based approach for learning the costs and constraints of a PLP by ‘unrolling’ a barrier interior point solver and backpropagating through it. Their formulation does not aim to avoid situations where a training target is infeasible, like the one shown in Figure 1.

In inverse convex optimization, the focus has been on imputing parametric cost functions while assuming that the feasible region is known for each observation [Keshavarz11, Bertsimas15, Aswani18, Esfahani18], usually under assumptions of a convex set of admissible parameters, the objective and/or constraints being convex in the parameters, and uniqueness of the optimal solution for every observation. Furthermore, since the feasible region is fixed for each observation, it is simply assumed to be nonempty and bounded, unlike in our work. Although our work focuses on linear programming, it is otherwise substantially more general, allowing all cost and constraint coefficients to be learned simultaneously, with no convexity assumptions on the parametrization, no restrictions on the existence of multiple optima, and explicit handling of empty or unbounded feasible regions.
Optimization task-based learning Kao09 introduces the concept of directed regression, where the goal is to fit a linear regression model while minimizing the decision loss, calculated with respect to an unconstrained quadratic optimization model.
Donti2017 use a neural network approach to minimize a task loss that is calculated as a function of the optimal decisions in the context of stochastic programming.
Elmachtoub19 propose the “Smart Predict-then-Optimize” framework, in which the goal is to predict the cost coefficients of a linear program with a fixed feasible region given past observations of features and true costs. Note that knowing the true costs implies we can solve for the optimal decisions, so our framework can in principle be applied in their setting, but not vice versa. Our framework is still amenable to more ‘direct’ data-driven prior knowledge: if, in addition to decisions, we have partial or complete observations of costs or of constraint coefficients, regressing to these targets can easily be incorporated into our overall learning objective.

Structured prediction In structured output prediction [Taskar05, Bakir07, Nowozin14, Daume15], each prediction is made with respect to an objective and a known output structure. In our work the structure is also learned, and the objective is linear. In structured prediction the loss is typically a function of the prediction and a target, whereas in our setting it is important to consider a parametric loss.
Differentiating through an optimization Our work involves differentiating through an LP. bengio2000gradient proposed gradient-based tuning of neural network hyperparameters and, in a special case, backpropagating through the Cholesky decomposition computed during training (suggested by Léon Bottou).
Stoyanov11 proposed backpropagating through a truncated loopy belief propagation procedure. domke2012generic, Domke13 proposed automatic differentiation through truncated optimization procedures more generally, and maclaurin2015gradient proposed a similar approach for hyperparameter search. The continuity and differentiability of the optimal solution set of a quadratic program have been extensively studied [lee2006quadratic]. Amos2017 recently proposed integrating a quadratic optimization layer into a deep neural network, and used implicit differentiation to derive a procedure for computing parameter gradients. As part of our work we specialize their approach, providing an expression for LPs. Even more general is recent work on differentiating through convex cone programs [agrawal2019differentiating], submodular optimization [djolonga2017differentiable], and arbitrary constrained optimization [gould2019deep]. There are also versatile perturbation-based differentiation techniques [papandreou2011perturb, berthet2020learning].

3 Methodology
Here we introduce our new bilevel formulation and methodology for learning parametric linear programs. Unlike previous approaches (e.g., Aswani18), we do not transform the problem to a single-level formulation, and so we do not require simplifying assumptions. We propose a technique for tackling our bilevel formulation with gradient-based nonlinear programming methods.
3.1 Inverse Optimization as PLP Model Fitting
Let denote the training set. A loss function penalizes discrepancy between prediction and target under conditions for the PLP hypothesis identified by . Note that if is optimal under conditions , then must also be feasible.
We therefore propose the following bilevel formulation of the inverse linear optimization problem (ILOP):
(ILOP)  
subject to  (1a)  
(1b) 
where simply denotes an optional regularization term and denotes additional problem-specific prior knowledge, if applicable (similar constraints are standard in the IO literature [Keshavarz11, Chan19]). The ‘inner’ problem (1b) generates predictions by solving independent LPs. The ‘outer’ problem tries to make these predictions consistent with the targets while also satisfying target feasibility (1a).
Difficulties may arise, in principle and in practice. An inner LP may be infeasible or unbounded for certain , making undefined. Even if all produce feasible and bounded LPs, an algorithm for solving (ILOP) may still attempt to query . The outer problem as a whole may be subject to local minima due to nonconvex objective and/or constraints, depending on the problemspecific parametrizations. We propose gradientbased techniques for the outer problem (Section 3.2), but may not exist or may be nonunique at certain and (Section 3.3).
Nonetheless, we find that tackling this formulation leads to practical algorithms. To the best of our knowledge, (ILOP) is the most general formulation of inverse linear parametric programming. It subsumes the nonparametric cases that have received much interest in the IO literature.
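A minimal sketch of the bilevel structure follows. The `hypothesis` callable (mapping conditions and parameters to LP coefficients), the box parametrization, and the failure penalty are all illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
from scipy.optimize import linprog

def inner_predictions(w, conditions, hypothesis):
    """Solve one LP per training condition; returns predicted decisions."""
    preds = []
    for u in conditions:
        c, A_ub, b_ub = hypothesis(u, w)      # hypothesis maps (u, w) -> LP
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * len(c), method="highs")
        preds.append(res.x if res.success else None)
    return preds

def outer_objective(w, conditions, targets, hypothesis):
    """Sum of losses over training pairs; large penalty if an LP fails."""
    total = 0.0
    for u, x_t, x_p in zip(conditions, targets,
                           inner_predictions(w, conditions, hypothesis)):
        if x_p is None:                       # infeasible/unbounded inner LP
            total += 1e6
        else:
            c, _, _ = hypothesis(u, w)
            total += abs(c @ x_t - c @ x_p)   # absolute objective error (AOE)
    return total

def box_hypothesis(u, w):
    # Toy parametrization: learn only the cost over a fixed unit box.
    A_ub = np.array([[1., 0.], [0., 1.], [-1., 0.], [0., -1.]])
    b_ub = np.array([1., 1., 0., 0.])
    return np.array([w[0], w[1]]), A_ub, b_ub

# Zero loss at the 'true' parameters for a target that is optimal there.
loss_true = outer_objective(np.array([-1., -1.]), [0.0],
                            [np.array([1., 1.])], box_hypothesis)
```

An outer solver then adjusts the parameters of `outer_objective` subject to the target-feasibility constraints (1a).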
Choice of loss function The IO literature considers decision error, which penalizes difference in decision variables, and objective error, which penalizes difference in optimal objective value [Babier19]. A fundamental issue with decision error, such as squared decision error (SDE) , is that when is nonunique the loss is also not unique; this issue was also a motivation for the “Smart PredictthenOptimize” paper [Elmachtoub19]. An objective error, such as absolute objective error (AOE) , is unique even if is not. We evaluate AOE using imputed cost during training; this usually requires at least some prior knowledge to avoid trivial cost vectors, as in Keshavarz11.
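The non-uniqueness issue can be reproduced on a toy LP whose optimal face contains multiple optima. The helper names `sde` and `aoe` are illustrative, not from the paper.

```python
import numpy as np

def sde(x_pred, x_target):
    """Squared decision error: depends on which optimum the solver returns."""
    return float(np.sum((x_pred - x_target) ** 2))

def aoe(c, x_pred, x_target):
    """Absolute objective error under imputed cost c: identical across
    alternative optima, since all optima share one objective value."""
    return abs(float(c @ x_pred - c @ x_target))

# Toy LP with a face of optima: min -x1 - x2  s.t.  x1 + x2 <= 1, x >= 0.
c = np.array([-1.0, -1.0])
target = np.array([1.0, 0.0])   # observed decision (an optimal vertex)
x_a = np.array([1.0, 0.0])      # one optimum a solver might return
x_b = np.array([0.0, 1.0])      # another optimum a solver might return
```

Here `sde` changes depending on which optimal vertex the solver happens to return, while `aoe` is zero for both.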
Target feasibility Constraints (1a) explicitly enforce target feasibility in any learned PLP. The importance of these constraints can be understood through Figure 1, where a hypothesis achieves zero loss because the prediction and the target lie on the same hyperplane, despite the target being infeasible. Chan19 show that if the feasible region is bounded, then for any infeasible target there exists a cost vector achieving zero loss.

Unbounded or infeasible subproblems Despite (1a), an algorithm for solving (ILOP) may query a parameter for which an LP in (1b) is itself infeasible and/or unbounded, in which case a finite loss is not defined. We can extend (ILOP) to explicitly account for these special cases (by penalizing a measure of infeasibility [murty2000infeasibility], and penalizing unbounded directions when detected), but in our experiments simply evaluating the (large) loss for an arbitrary point returned by our interior point solver worked nearly as well at avoiding such regions of parameter space, so we opt to keep the formulation simple.
Noisy observations Formulation (ILOP) can be extended to handle measurement noise. For example, individually penalized nonnegative slack variables can be added to the right-hand sides of (1a), as in a soft-margin SVM [cortes1995support]. Alternatively, a norm-penalized group of slack variables can be added to each target on the left-hand side of (1a), softening the targets in decision space. We leave investigation of noisy data and model misspecification as future work.
3.2 Learning Linear Programs with Sequential Quadratic Programming
We treat (ILOP) as a nonlinear programming (NLP) problem, making as few assumptions as possible. We focus on sequential quadratic programming (SQP), which aims to solve NLP problems iteratively. Given current iterate , SQP determines a search direction and then selects the next iterate via line search on . Direction is the solution to a quadratic program.
Each instance of subproblem (SQP) requires evaluating the constraints and their gradients at the current iterate, as well as the gradient of the objective. (Footnote 1: the NLP constraint vector is not the same as the FOP right-hand side, despite the same symbol.) The matrix approximates the Hessian of the Lagrangian of (NLP), and is typically determined from the gradients by a BFGS-like update. Our experiments use an efficient variant called sequential least squares programming (SLSQP) [schittkowski1982nonlinear2, kraft1988software], which exploits a stable factorization of the Hessian approximation.
The NLP formulation of (ILOP) has inequality and equality constraints from (1a):
plus any constraints needed to enforce the prior knowledge. The NLP constraint residuals and their gradients can be directly evaluated. Evaluating the loss requires solving each LP in (1b). Finally, evaluating the loss gradient requires evaluating a vector-Jacobian product for each training example, which requires differentiating through the LP optimization that produced the prediction. That is exactly what we do, and this approach allows us to tackle (ILOP) directly in its bilevel form, using powerful gradient-based NLP optimizers like SQP as the ‘outer’ solver. Section 3.3 compares methods for differentiating through LP optimization.
Redundant NLP constraints When the PLP model parameters have fixed dimension, the NLP formulation of (ILOP) can involve many redundant constraints, roughly in proportion to the number of training examples. Indeed, with enough training data the equality constraints may appear to overdetermine the parameters, treating (NLP) as a feasibility problem; but, due to redundancy, the parameters are not uniquely determined. The ease or difficulty of removing redundant constraints from (NLP) depends on the domain-specific parametrizations of the PLP constraints. Equality constraints that are affinely dependent on the parameters can be eliminated from (NLP) by a simple pseudoinverse technique, resulting in a lower-dimensional problem; this also handles the case where (NLP) is not strictly feasible (either due to noisy observations or model misspecification) by automatically searching only among parameters that exactly minimize the sum of squared residuals. If the equality constraints are polynomially dependent on the parameters, we can eliminate redundancy by Gröbner basis techniques [cox2013ideals] although, unlike the affine case, it may not be possible or beneficial to reparametrize out the new nonredundant basis constraints from the NLP. Redundant inequality constraints can be either trivial or costly to identify [telgen1983identifying], but are not problematic. See Appendix for details.
Benefit over gradient-free methods Evaluating the loss is expensive in our NLP because it requires solving linear programs. To understand why access to gradients is important in this scenario, it helps to contrast SQP with a well-known gradient-free NLP optimizer such as COBYLA [powell1994direct]. COBYLA maintains a set of samples, in number proportional to the dimension of the NLP, and uses them as a finite-difference approximation to the gradient at the current iterate (best sample). The next iterate is computed by optimizing over a trust region centered at the current iterate. COBYLA recycles past samples to effectively estimate ‘coarse’ gradients, whereas SQP uses gradients directly. Figure 2 shows SLSQP and COBYLA running on the example from Figure 1.
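The contrast can be reproduced on a toy stand-in for the outer NLP. The quadratic objective and constraints below are invented for illustration; SLSQP consumes the exact gradient, while COBYLA builds linear approximations from samples.

```python
import numpy as np
from scipy.optimize import minimize

# Toy constrained problem standing in for the outer NLP:
#   min (w1 - 1)^2 + (w2 - 2)^2   s.t.   w1 + w2 <= 2,  w1 >= 0.
fun = lambda w: (w[0] - 1) ** 2 + (w[1] - 2) ** 2
jac = lambda w: np.array([2 * (w[0] - 1), 2 * (w[1] - 2)])
cons = [{"type": "ineq", "fun": lambda w: 2 - w[0] - w[1]},
        {"type": "ineq", "fun": lambda w: w[0]}]

# SLSQP uses the exact gradient; COBYLA approximates it from samples.
res_slsqp = minimize(fun, x0=[0., 0.], jac=jac, constraints=cons,
                     method="SLSQP")
res_cobyla = minimize(fun, x0=[0., 0.], constraints=cons, method="COBYLA")
```

Both converge near the constrained optimum (0.5, 1.5) on this easy problem; the gap between the two approaches grows with the dimension of the parametrization.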
3.3 Computing Loss Function Gradients
If, at a particular point, each corresponding vector-Jacobian product exists, is unique, and can be computed, then we can construct (SQP) at each step. For convenience, we assume the loss terms are expressed within an automatic differentiation framework such as PyTorch, so all that remains is to compute the Jacobians at each solution as an intermediate step at the outset of backpropagation. We consider three approaches:


backpropagate through the steps of the homogeneous interior point algorithm for LPs,

specialize the implicit differentiation procedure of Amos2017 to LPs, and

evaluate gradients directly, in closed form (for objective error only).
We implemented a batch PyTorch version of the homogeneous interior point algorithm [andersen2000mosek, xu1996simplified] developed for the MOSEK optimization suite and currently the default linear programming solver in SciPy [2020SciPy]. Our implementation is also efficient in the backward pass, for example reusing the matrix decomposition from each Newton step. (Footnote 2: Cholesky decomposition is also supported and reused, but we use a different decomposition in our experiments.)
For implicit differentiation we follow Amos2017, forming the system of linear equations that results from differentiating the KKT conditions and then inverting that system to compute the needed vector-Jacobian products. For LPs this system can be poorly conditioned, especially at strict tolerances on the LP solver, but in practice it provides useful gradients.
For direct evaluation (in the case of objective error), we use Theorem 1. When the loss is AOE, by the chain rule we can multiply each quantity from Theorem 1 by the appropriate outer derivative to get the needed Jacobians.

Theorem 1.
Let be an optimal solution to (LP) and let be an optimal solution to the associated dual linear program. If the solution is nondegenerate, then the objective error is differentiable and the total derivatives are as follows. (Footnote 3: In a slight abuse of notation, we ignore a leading singleton dimension.)
The gradients with respect to the right-hand sides are already well-known as shadow prices. If the solution is degenerate, then the relationship between shadow prices and dual variables breaks down, resulting in two-sided shadow prices [Strum69, Aucamp82].
We use degeneracy in the sense of tijssen1998balinski, where a point on the relative interior of the optimal face need not be degenerate, even if there exists a degenerate vertex on the optimal face. This matters when the optimum is nonunique, because interior point methods typically converge to the analytical center of the relative interior of the optimal face [Zhang94]. Tijssen and Sierksma also give relations between degeneracy of the primal solution and uniqueness of the dual, which we apply in Corollary 1. When the gradients are nonunique, this corresponds to the subdifferentiable case.
Corollary 1.
In Theorem 1, both and are unique, is unique if and only if is unique, and both and are unique if and only if is unique or .
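The shadow-price part of Theorem 1 can be checked numerically on an illustrative LP (not one from the paper). In SciPy's convention, the reported constraint marginals are exactly the sensitivities of the optimal value to the right-hand sides, so at a nondegenerate optimum they should match a finite-difference estimate.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative LP: min c^T x  s.t.  A x <= b,  x >= 0.
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([3.0, 2.0, 2.0])

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None)] * 2, method="highs")
z_star = res.fun                      # optimal objective value
marginals = res.ineqlin.marginals     # dz*/db reported by the solver

# Finite-difference check of dz*/db_0 against the reported shadow price.
eps = 1e-4
b_pert = b.copy()
b_pert[0] += eps
z_pert = linprog(c, A_ub=A, b_ub=b_pert, bounds=[(0, None)] * 2,
                 method="highs").fun
fd = (z_pert - z_star) / eps          # matches marginals[0] when nondegenerate
```

At this optimum the first constraint is active and nondegenerate, so the finite difference agrees with the dual-based derivative; near a degenerate solution the two sides of the shadow price would differ.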
4 Experiments
We evaluate our approach by learning a range of synthetic LPs and parametric instances of minimum-cost multicommodity flow. The use of synthetic instances is common in IO (e.g., Ahuja01, Keshavarz11, Dong18), and there are no community-established and readily available benchmarks, especially for more general formulations. Our experimental study considers instances not directly addressable by previous IO work, either because we learn all coefficients jointly or because the parametrization results in a nonconvex NLP.
We compare three versions of our gradient-based method with two gradient-free methods: random search (RS) and COBYLA. (Footnote 4: For completeness we also evaluated finite differences which, unsurprisingly, were not competitive.) The main observation is that the gradient-based methods perform similarly to one another and become superior to the gradient-free methods as the dimension of the parametrization increases. We find that including a black-box baseline like COBYLA is important for assessing the practical difficulty of an IO instance (and encourage future papers to do so) because such methods work reasonably well on low-dimensional problems. A second observation is that generalization to testing conditions is difficult because the discontinuous nature of LP decision space creates an underfitting phenomenon. This may explain why many previous works in IO require a surprising amount of training data for so few model parameters (see end of Section 4). A third observation is that there are instances for which no method succeeds at minimizing training error 100% of the time. Our method can therefore be viewed as a way to significantly boost the probability of successful training, when combined with simple global optimization strategies such as multistart.
Experiments used the PyTorch v1.6 nightly build and the COBYLA and SLSQP wrappers from SciPy v1.4.1, and were run on an Intel Core i7 with 16GB RAM. (We do not use GPUs, though our PyTorch interior point solver inherits GPU acceleration.) We do not regularize, nor do we have any other hyperparameters.
Learning linear programs We used the LP generator of tan2019dio, modifying it to create a more challenging variety of feasible regions; their code did not perform competitively in terms of runtime or success rate on these harder instances, and cannot be applied to AOE loss. Fig. 3 shows the task of learning the LP coefficients under a fixed-dimensional parametrization and decision space, with 20 training observations. RS fails; COBYLA ‘succeeds’ on 25% of instances; SQP succeeds on 60-75%, which is substantially better. The success curve of the backpropagation variant slightly lags those of the other two due to the overhead of backpropagating through the steps of the interior point solver. See Appendix for five additional problem sizes, where overall the conclusions are the same. Surprisingly, one variant works slightly better than the others on problems of higher dimension. We observe similar performance on instances with equality constraints, where the equality coefficients also need to be learned (see Appendix). Note that each RS trial returns the best of (typically) thousands of parameter settings evaluated during the time budget, all sampled uniformly from the same distribution from which the ‘true’ synthetic PLP was sampled. Most random (and thus initial) points do not satisfy (1a).
Learning all coefficients directly, so that the parameter vector comprises all LP coefficients, results in a high-dimensional NLP problem (which is why, to date, the IO literature has focused on special cases of this problem, either with a single coefficient group [Chan18, Chan19] or fewer coefficients to learn [Ghobadi20]). For example, even a modest instance has a large number of adjustable parameters. The gradient-based methods consistently achieve zero AOE training loss, while RS and COBYLA consistently fail to make learning progress given the same time budget (see Appendix).
Learning minimum-cost multicommodity flow problems Fig. 4 shows a visualization of our experiment on the Nguyen-Dupuis graph [Nguyen84]. We learn a periodic arc cost and an affine arc capacity, based on a global feature (time of day) and arc-specific features (length and toll price). To avoid trivial solutions, we fix part of the parametrization. Results on 100 instances are shown in Fig. 5. The SQP methods outperform RS and COBYLA in training and testing loss. From an IO perspective, the fact that we are jointly learning costs and capacities in a nonconvex NLP formulation is already quite general. Again, for higher-dimensional parametrizations, we can expect the advantage of gradient-based methods to grow stronger.
We report both the mean and median loss over the testing points in each trial. The difference between mean and median testing error is due to the presence of a few ‘outliers’ among the test set errors. Fig. 6 shows the nature of this failure to generalize: the decision map of a PLP has discontinuities, so the training data can easily underspecify the set of learned models that achieve zero training loss, similar to the scenario that motivates max-margin learning in SVMs. It is not clear what forms of regularization will reliably improve generalization in IO. Fig. 6 also suggests that training points which closely straddle discontinuities are much more ‘valuable’ from a learning perspective.

5 Conclusion
In this paper, we propose a novel bilevel formulation and gradientbased framework for learning linear programs from optimal decisions. The methodology learns all parameters jointly while allowing flexible parametrizations of costs, constraints, and loss functions—a generalization of the problems typically addressed in the inverse linear optimization literature.
Our work allows a strong class of inductive priors, namely parametric linear programs, to be imposed on the hypothesis space for learning. A major motivation for ours and similar works is that, when the inductive prior suits the problem, we can learn a much better (and more interpretable) model from far less data than by applying general-purpose machine learning methods. In settings spanning economics, commerce, and healthcare, data on decisions are expensive to obtain, so we hope that our approach will help to build better models and make better decisions.
References
Appendix
Appendix A: Forward Optimization Problem for Figure 1
Appendix B: Redundancy Among TargetFeasibility Constraints
Redundant constraints in (1a) are not problematic in principle. Still, removing redundant constraints may help overall performance, either in terms of speed or numerical stability of the ‘outer’ solver. Here we discuss strategies for automatically removing redundant constraints, depending on the assumptions. In this section, the decision symbols should be understood to represent some training target.
Constraints that are equivalent. There may exist indices and for which the corresponding constraints and are identical or equivalent. For example, when a constraint is independent of this often results in identical training targets and that produce identical constraints. The situation for equality constraints is similar.
Constraints independent of . If an individual constraint is independent of then either:


for all so the constraint can be omitted; or,

for some so the (ILOP) formulation is infeasible due to model misspecification, either in structural assumptions or in assumptions about noise.
The same follows for any equality constraint that is independent of . For example, in our minimumcost multicommodity flow experiments, the flow conservation constraints (equality) are independent of and so are omitted from (1a) in the corresponding ILOP formulation.
Constraints affinely dependent on . Constraints may be affinely dependent on the parameters. For example, this is a common assumption in robust optimization [zhen2018adjustable]. Let and represent the constraints that are affinely dependent on the parameters. We can write
and 
for some matrix-valued functions and vector-valued functions . It is easy to show that we can then rewrite the constraints as where
Similarly if are affine in we can rewrite them as . If we apply these functions across all training samples , and stack their coefficients as
then the corresponding ILOP constraints (1a) reduce to a set of linear ‘outer’ constraints and where . These reformulated constraint matrices are the system within which we eliminate redundancy in the affinely dependent case, continued below.
Equality constraints affinely dependent on . We can eliminate affinely dependent equality constraint sets by reparametrizing the ILOP search over a lower-dimensional space; this is what we do for the experiments with equality constraints shown in Figure 8, although the conclusions do not change with or without this reparametrization. To reparametrize the ILOP problem, compute a Moore-Penrose pseudoinverse to get a direct parametrization of the constrained vector in terms of an unconstrained vector :
(3) 
By reparametrizing (ILOP) in terms of we guarantee is satisfied and can drop equality constraints from (1a) entirely. There are three practical issues with (3):


Constrained vector only has degrees of freedom, so we would like to reparametrize over a lower-dimensional .

To search over we need to specify and such that is equivalent to .

Given initial we need a corresponding to initialize our search.
To address the first issue, we can let the final components of in (3) be zero, which corresponds to using a lower-dimensional . As shorthand let matrix be
where denotes as in torch.eye(K, K’) and denotes the first columns of matrix . Then we have where the full dimension of matches the degrees of freedom in subject to and we have for any choice of .
To address the second issue, simplifying gives inequality constraints with and .
To address the third issue we must solve for in the linear system . Since the solution exists and is unique.
Consider also the effect of this reparametrization when the equality system is infeasible, for example due to noisy observations or misspecified constraints. In that case, searching over the unconstrained vector automatically restricts the search to parameters that satisfy the equalities in a least-squares sense, akin to adding an infinitely-weighted term to the ILOP objective.
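A sketch of the pseudoinverse reparametrization in NumPy. The small system `G w = h` stands in for the stacked affine equality constraints; in the paper these come from collecting (1a) over the training set.

```python
import numpy as np

# Stacked affine equality constraints on the parameters: G w = h.
G = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
h = np.array([1.0, 2.0])

w0 = np.linalg.pinv(G) @ h            # least-squares particular solution
# Null-space basis via SVD: the columns of N satisfy G @ N = 0.
_, s, Vt = np.linalg.svd(G)
rank = int(np.sum(s > 1e-10))
N = Vt[rank:].T

def lift(v):
    """Unconstrained v -> w satisfying G w = h (exactly, if consistent)."""
    return w0 + N @ v
```

The outer solver can then search over the lower-dimensional `v` with the equality constraints dropped; when `G w = h` is inconsistent, `lift` automatically restricts the search to least-squares solutions.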
Inequality constraints affinely dependent on . After transforming affinely dependent inequality constraints to the reduced form, detecting redundancy among these constraints can be as hard as solving an LP [telgen1983identifying]. Generally, an inequality constraint is redundant with respect to the others if and only if the optimal value of the following LP is nonnegative:
(4)  
Here one row of the system is the constraint being tested, and the remaining rows are all the others. If the optimal value of (4) is nonnegative, then intuitively “we tried to violate the constraint, but the other constraints prevented it, and so the constraint must be redundant.” However, telgen1983identifying reviews much more efficient methods of identifying redundant linear inequality constraints by analysis of basic variables in a simplex tableau. zhen2018adjustable proposed a ‘redundant constraint identification’ (RCI) procedure that is directly analogous to (4), along with another heuristic RCI procedure.
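A direct implementation of this test, phrased equivalently as "maximize the tested row subject to the remaining rows"; the helper name `is_redundant` and the toy constraint system are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def is_redundant(A, b, i, bounds=(None, None)):
    """Check whether row i of A x <= b is implied by the remaining rows:
    maximize a_i^T x subject to the other constraints; if the maximum is
    still <= b_i, row i can never be violated, hence it is redundant.
    (Equivalent to the LP above having a nonnegative optimal value.)"""
    mask = np.arange(len(b)) != i
    res = linprog(-A[i], A_ub=A[mask], b_ub=b[mask],
                  bounds=[bounds] * A.shape[1], method="highs")
    if not res.success:            # unbounded: the row can be violated
        return False
    return -res.fun <= b[i] + 1e-9

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 1.0, 3.0])   # x1 <= 1 and x2 <= 1 imply x1 + x2 <= 3
```

Each redundancy check costs one LP solve, which is why the cheaper simplex-tableau heuristics cited above are attractive at scale.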
Constraints polynomially dependent on . Similar to the affinely dependent case, when the coefficients of the constraints are polynomially dependent on the parameters, we can rewrite the constraints in terms of them. Redundancy among equality constraints of the resulting system can be eliminated by computing a minimal Gröbner basis [cox2013ideals], for example by Buchberger’s algorithm, which is a generalization of Gaussian elimination; see the paper by lim2012groebner for a review of Gröbner basis techniques applicable over a real field. Redundancy among inequality constraints for nonlinear programming has also been studied [caron2009redundancy, obuchowska1995minimal]. Simplifying polynomial systems of equalities and inequalities is a subject of semialgebraic geometry and involves generalizations of Fourier-Motzkin elimination. Details are beyond the scope of this manuscript.
Appendix C: Proofs of Theorem 1 and Corollary 1
Proof of Theorem 1.
The dual linear program associated with (LP) is
(DP)  
where are the associated dual variables for the primal inequality and equality constraints, respectively.
Since is optimal to (LP) and are optimal to (DP), then satisfy the KKT conditions (written specialized to the particular LP form we use):
(KKT)  
where is the diagonal matrix having on the diagonal. The first two constraints correspond to primal feasibility, the next two to dual feasibility and the last one specifies complementary slackness. From here forward it should be understood that satisfy KKT even when not emphasized by .
As in the paper by Amos2017, implicitly differentiating the equality constraints in (KKT) gives
(DKKT)  
where the parameter differentials and solution differentials all have the same dimensions as the variables they correspond to. Because (KKT) is a second-order system, (DKKT) is a system of linear equations. Because the system is linear, a partial derivative can be determined (if it exists) by setting the corresponding parameter differential to one and all other parameter differentials to zero, then solving the system for the solution differential, as shown by Amos2017.
We can assume (KKT) is feasible in . In each case of the main proof it will be important to characterize conditions under which (DKKT) is then feasible in . This is because, if (DKKT) is feasible in at least , then by substitution we have
(5)  
and this substitution is what gives the total derivatives their form. In (5) the substitution holds because feasible in (KKT) implies in (DKKT), where is the corresponding row of . Whenever is feasible in (DKKT) we have for any , where is the corresponding row of differential .
Note that (5) holds even if (DKKT) is not feasible in and/or . In other words, it does not require the KKT point to be differentiable with respect to and/or .
Given a KKT point let be a partition of inequality indices where
and the corresponding submatrices of are . Then (DKKT) in matrix form is
(6) 
The pattern of the proof in each case will be to characterize feasibility of (6) in and then apply (5) for the result.
Evaluating . Consider . To evaluate the term, set and all other parameter differentials to . Then the right-hand side of (6) becomes
(7) 
where denotes the vector with for component and elsewhere. System (7) is feasible in (not necessarily unique) so we can apply (5) to get . The result for then follows from .
Evaluating . Consider . Set and all other parameter differentials to . Then the right-hand side of (6) becomes
(8) 
Since is nondegenerate in the sense of tijssen1998balinski, there are at most active constraints (including equality constraints) and the rows of are also linearly independent. Since the active constraints are linearly independent, system (8) is feasible in across all . We can therefore apply (5) to get . The result for then follows from .
Evaluating . Consider . Set and all other parameter differentials to . For the right-hand side of (6) becomes
(9) 
Since is nondegenerate, system (9) is feasible in for all by reasoning identical to that for . For the remaining case the right-hand side of (6) is zero, and so the system is feasible in . System (9) is therefore feasible in across all . We can therefore apply (5) to get . The result for then follows from .
Evaluating . Consider . Set and all other parameter differentials to . Then the right-hand side of (6) becomes
(10) 