Differentiable Convex Optimization Layers

10/28/2019 · by Akshay Agrawal, et al.

Recent work has shown how to embed differentiable optimization problems (that is, problems whose solutions can be backpropagated through) as layers within deep learning architectures. This method provides a useful inductive bias for certain problems, but existing software for differentiable optimization layers is rigid and difficult to apply to new settings. In this paper, we propose an approach to differentiating through disciplined convex programs, a subclass of convex optimization problems used by domain-specific languages (DSLs) for convex optimization. We introduce disciplined parametrized programming, a subset of disciplined convex programming, and we show that every disciplined parametrized program can be represented as the composition of an affine map from parameters to problem data, a solver, and an affine map from the solver's solution to a solution of the original problem (a new form we refer to as affine-solver-affine form). We then demonstrate how to efficiently differentiate through each of these components, allowing for end-to-end analytical differentiation through the entire convex program. We implement our methodology in version 1.1 of CVXPY, a popular Python-embedded DSL for convex optimization, and additionally implement differentiable layers for disciplined convex programs in PyTorch and TensorFlow 2.0. Our implementation significantly lowers the barrier to using convex optimization problems in differentiable programs. We present applications in linear machine learning models and in stochastic control, and we show that our layer is competitive (in execution time) compared to specialized differentiable solvers from past work.


1 Introduction

Recent work has shown how to differentiate through specific subclasses of convex optimization problems, which can be viewed as functions mapping problem data to solutions (amos2017optnet; djolonga2017differentiable; barratt2018differentiability; agrawal2019derivative; amos2019differentiable). These layers have found several applications (geng2019coercing; amos2017optnet; donti2017task; de2018end; amos2018differentiable; ling2018game; wilder2018melding; lee2019meta; barratt2019least; barratt2019fitting), but many applications remain relatively unexplored (see, e.g., (amos2019differentiable, §8)).

While convex optimization layers can provide useful inductive bias in end-to-end models, their adoption has been slowed by how difficult they are to use. Existing layers (e.g., (amos2017optnet; agrawal2019derivative)) require users to transform their problems into rigid canonical forms by hand. This process is tedious, error-prone, and time-consuming, and often requires familiarity with convex analysis. Domain-specific languages (DSLs) for convex optimization abstract away the process of converting problems to canonical forms, letting users specify problems in a natural syntax; programs are then lowered to canonical forms and supplied to numerical solvers behind the scenes (agrawal2018rewriting). DSLs enable rapid prototyping and make convex optimization accessible to scientists and engineers who are not necessarily experts in optimization.

The point of this paper is to do what DSLs have done for convex optimization, but for differentiable convex optimization layers. In this work, we show how to efficiently differentiate through disciplined convex programs (grant2006disciplined). This is a large class of convex optimization problems that can be parsed and solved by most DSLs for convex optimization, including CVX (grant2014cvx), CVXPY (diamond2016cvxpy; agrawal2018rewriting), Convex.jl (udell2014cvxjl), and CVXR (fu2017cvxr). Concretely, we introduce disciplined parametrized programming (DPP), a grammar for producing parametrized disciplined convex programs. Given a program produced by DPP, we show how to obtain an affine map from parameters to problem data, and an affine map from a solution of the canonicalized problem to a solution of the original problem. We refer to this representation of a problem — i.e., the composition of an affine map from parameters to problem data, a solver, and an affine map to retrieve a solution — as affine-solver-affine (ASA) form.

Our contributions are three-fold:

1. We introduce DPP, a new grammar for parametrized convex optimization problems, and ASA form, which ensures that the mapping from problem parameters to problem data is affine. DPP and ASA form make it possible to differentiate through DSLs for convex optimization, without explicitly backpropagating through the operations of the canonicalizer. We present DPP and ASA form in §4.

2. We implement the DPP grammar and a reduction from parametrized programs to ASA form in CVXPY 1.1. We also implement differentiable convex optimization layers in PyTorch (paszke2017automatic) and TensorFlow 2.0 (agrawal2019tensorflow). Our software substantially lowers the barrier to using convex optimization layers in differentiable programs and neural networks (§5).

3. We present applications to sensitivity analysis for linear machine learning models, and to learning control-Lyapunov policies for stochastic control (§6). We also show that for quadratic programs (QPs), our layer's runtime is competitive with OptNet's specialized solver qpth (amos2017optnet) (§7).

2 Related work

DSLs for convex optimization.

DSLs for convex optimization allow users to specify convex optimization problems in a natural way that follows the math. At the foundation of these languages is a ruleset from convex analysis known as disciplined convex programming (DCP) (grant2006disciplined). A mathematical program written using DCP is called a disciplined convex program, and all such programs are convex. Disciplined convex programs can be canonicalized to cone programs by expanding each nonlinear function into its graph implementation (grant2008graph). DPP can be seen as a subset of DCP that mildly restricts the way parameters (symbolic constants) can be used; a similar grammar is described in (chu2013code). The techniques used in this paper to canonicalize parametrized programs are similar to the methods used by code generators for optimization problems, such as CVXGEN (mattingley2012cvxgen), which targets QPs, and QCML, which targets second-order cone programs (SOCPs) (chu2013code; chu2017qcml).

Differentiation of optimization problems.

Convex optimization problems do not in general admit closed-form solutions. It is nonetheless possible to differentiate through convex optimization problems by implicitly differentiating their optimality conditions (when certain regularity conditions are satisfied) (fiacco1968; robinson1980; amos2017optnet). Recently, methods were developed to differentiate through convex cone programs in (busseti2018solution; agrawal2019derivative) and (amos2019differentiable, §7.3). Because every convex program can be cast as a cone program, these methods are general. The software released alongside (agrawal2019derivative), however, requires users to express their problems in conic form. Expressing a convex optimization problem in conic form requires a working knowledge of convex analysis. Our work abstracts away conic form, letting the user differentiate through high-level descriptions of convex optimization problems; we canonicalize these descriptions to cone programs on the user’s behalf. This makes it possible to rapidly experiment with new families of differentiable programs, induced by different kinds of convex optimization problems.

Because we differentiate through a cone program by implicitly differentiating its solution map, our method can be paired with any algorithm for solving convex cone programs. In contrast, methods that differentiate through every step of an optimization procedure must be customized for each algorithm (e.g., (domke2012generic; diamond2017unrolled; mardani2018neural)). Moreover, such methods only approximate the derivative, whereas we compute it analytically (when it exists).

3 Background

Convex optimization problems.

A parametrized convex optimization problem can be represented as

    minimize    f_0(x, θ)
    subject to  f_i(x, θ) ≤ 0,  i = 1, ..., m,        (1)
                g_i(x, θ) = 0,  i = 1, ..., r,

where x ∈ R^n is the optimization variable and θ ∈ R^p is the parameter vector (boyd2004convex, §4.2). The functions f_0, ..., f_m are convex, and the functions g_1, ..., g_r are affine. A solution to (1) is any vector x* that minimizes the objective function, among all choices that satisfy the constraints. The problem (1) can be viewed as a (possibly multi-valued) function that maps a parameter to solutions. In this paper, we consider the case when this solution map is single-valued, and we denote it by S. The function S : R^p → R^n maps a parameter θ to a solution x* = S(θ). From the perspective of end-to-end learning, θ (or parameters it depends on) is learned in order to minimize some scalar function of S(θ). In this paper, we show how to obtain the derivative of S with respect to θ, when (1) is a DPP-compliant program (and when the derivative exists).

We focus on convex optimization because it is a powerful modeling tool, with applications in control (boyd1994linear; bertsekas2005dynamic; todorov2012mujoco), finance (markowitz1952portfolio; boyd2017multi), energy management (moehle2018dynamicnot), supply chain (bertsimas2004robust; ben2005retailer), physics (kanno2011nonsmooth; angeris2019), computational geometry (van2000computational), aeronautics (hoburg2014geometric), and circuit design (hershenson2001opamp; boyd2005digital), among other fields.

Disciplined convex programming.

DCP is a grammar for constructing convex optimization problems (grant2006disciplined; grant2008graph). It consists of functions, or atoms, and a single rule for composing them. An atom is a function with known curvature (affine, convex, or concave) and per-argument monotonicities. The composition rule is based on the following theorem from convex analysis. Suppose h : R^k → R is convex, nondecreasing in the arguments indexed by a set I_1 ⊆ {1, ..., k}, and nonincreasing in the arguments indexed by a set I_2. Suppose also that the functions f_i : R^n → R are convex for i ∈ I_1, concave for i ∈ I_2, and affine for i ∉ I_1 ∪ I_2. Then the composition h(f_1(x), ..., f_k(x)) is convex. DCP allows atoms to be composed so long as the composition satisfies this composition theorem. Every disciplined convex program is a convex optimization problem, but the converse is not true. This is not a limitation in practice, because atom libraries are extensible (i.e., the class corresponding to DCP is parametrized by which atoms are implemented). In this paper, we consider problems of the form (1) in which the functions f_i and g_i are constructed using DPP, a version of DCP that performs parameter-dependent curvature analysis (see §4.1).
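DSLs check this rule mechanically. The snippet below is a small illustration using CVXPY's public curvature interface (a sketch for intuition; it is not an example from the paper):

import cvxpy as cp

x = cp.Variable()
# exp is convex and nondecreasing, and abs is convex, so the composition
# theorem certifies convexity of the whole expression.
print(cp.exp(cp.abs(x)).curvature)  # CONVEX
# sqrt is concave and nondecreasing, but its argument here is convex, so
# the composition rule does not apply and the expression is rejected.
print(cp.sqrt(cp.abs(x)).is_dcp())  # False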

Cone programs.

A (convex) cone program is an optimization problem of the form

    minimize    cᵀx
    subject to  b − Ax ∈ K,        (2)

where x ∈ R^n is the variable (there are several other equivalent forms for cone programs). The set K ⊆ R^m is a nonempty, closed, convex cone, and the problem data are A ∈ R^{m×n}, b ∈ R^m, and c ∈ R^n. In this paper we assume that (2) has a unique solution.

Our method for differentiating through disciplined convex programs requires calling a solver (an algorithm for solving an optimization problem) in the forward pass. We focus on the special case in which the solver is a conic solver. A conic solver targets convex cone programs, implementing a function s mapping the problem data (A, b, c) to a solution x*.

DCP-based DSLs for convex optimization can canonicalize disciplined convex programs to equivalent cone programs, producing the problem data A, b, and c (agrawal2018rewriting); these data depend on the parameter θ and the canonicalization procedure. The data are supplied to a conic solver to obtain a solution; there are many high-quality implementations of conic solvers (e.g., (odonoghue2017scs; mosek; domahidi2013ecos)).

4 Differentiating through disciplined convex programs

We consider a disciplined convex program with variable x ∈ R^n, parametrized by θ ∈ R^p; its solution map can be viewed as a function S : R^p → R^n that maps parameters to the solution (see §3). In this section we describe the form of S and how to evaluate the adjoint of its derivative, allowing us to backpropagate through parametrized disciplined convex programs. (We use the notation Df(x) to denote the derivative of a function f evaluated at x, and Dᵀf(x) to denote the adjoint of the derivative at x.) We consider the special case of canonicalizing a disciplined convex program to a cone program. With little extra effort, our method can be extended to other targets.

We express S as the composition R ∘ s ∘ C; the canonicalizer C maps parameters θ to cone problem data (A, b, c), the cone solver s solves the cone problem, furnishing a solution x̃, and the retriever R maps x̃ to a solution x* of the original problem. A problem is in ASA form if C and R are affine.

By the chain rule, the adjoint of the derivative of a disciplined convex program is

    DᵀS(θ) = DᵀC(θ) ∘ Dᵀs(A, b, c) ∘ DᵀR(x̃).

The remainder of this section proceeds as follows. In §4.1, we present DPP, a ruleset for constructing disciplined convex programs reducible to ASA form. In §4.2, we describe the canonicalization procedure and show how to represent C as a sparse matrix. In §4.3, we review how to differentiate through cone programs, and in §4.4, we describe the form of R.

4.1 Disciplined parametrized programming

DPP is a grammar for producing parametrized disciplined convex programs from a set of functions, or atoms, with known curvature (constant, affine, convex, or concave) and per-argument monotonicities. A program produced using DPP is called a disciplined parametrized program. Like DCP, DPP is based on the well-known composition theorem for convex functions, and it guarantees that every function appearing in a disciplined parametrized program is affine, convex, or concave. Unlike DCP, DPP also guarantees that the produced program can be reduced to ASA form.

A disciplined parametrized program is an optimization problem of the form

    minimize    f_0(x, θ)
    subject to  f_i(x, θ) ≤ f̃_i(x, θ),  i = 1, ..., m,        (3)
                g_i(x, θ) = g̃_i(x, θ),  i = 1, ..., r,

where x ∈ R^n is a variable, θ ∈ R^p is a parameter, the f_i are convex, the f̃_i are concave, the g_i and g̃_i are affine, and the expressions are constructed using DPP. An expression can be thought of as a tree, where the nodes are atoms and the leaves are variables, constants, or parameters. A parameter is a symbolic constant with known properties such as sign but unknown numeric value. An expression is said to be parameter-affine if it does not have variables among its leaves and is affine in its parameters; an expression is parameter-free if it is not parametrized, and variable-free if it does not have variables.

Every DPP program is also DCP, but the converse is not true. DPP generates programs reducible to ASA form by introducing two restrictions on expressions involving parameters:

  1. In DCP, we classify the curvature of each subexpression appearing in the problem description as convex, concave, affine, or constant. All parameters are classified as constant. In DPP, parameters are classified as affine, just like variables.

  2. In DCP, the product atom φ(x, y) = xy is affine if x or y is a constant (i.e., variable-free). Under DPP, the product is affine when at least one of the following is true:

    • x or y is constant (i.e., both parameter-free and variable-free);

    • one of the expressions is parameter-affine and the other is parameter-free.

The DPP specification can (and may in the future) be extended to handle several other combinations of expressions and parameters.
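To make the product rule concrete, the following check distinguishes the two cases (a sketch; it assumes the expression-level is_dpp method available in CVXPY 1.1):

import cvxpy as cp

x = cp.Variable()
a = cp.Parameter()
b = cp.Parameter()

print((a * x).is_dpp())  # True: a is parameter-affine and x is parameter-free
print((a * b).is_dpp())  # False: both factors are parametrized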

Example.

Consider the program

    minimize    ‖Fx − g‖₂ + λ‖x‖₂
    subject to  x ≥ 0,        (4)

with variable x ∈ R^n and parameters F ∈ R^{m×n}, g ∈ R^m, and λ ∈ R₊. If ‖·‖₂, the product, negation, and the sum are atoms, then this problem is DPP-compliant:

  • Fx is affine because the product atom is affine (F is parameter-affine and x is parameter-free) and F and x are affine;

  • Fx − g is affine because Fx and −g are affine and the sum of affine expressions is affine;

  • ‖Fx − g‖₂ is convex because the norm atom is convex and convex composed with affine is convex;

  • λ‖x‖₂ is convex because the product is affine (λ is parameter-affine, ‖x‖₂ is parameter-free), it is increasing in ‖x‖₂ (because λ is nonnegative), and ‖x‖₂ is convex;

  • the objective is convex because the sum of convex expressions is convex.

Non-DPP transformations of parameters.

It is often possible to re-express non-DPP expressions in DPP-compliant ways. Consider the following examples, in which the p_i are parameters:

  • The expression p₁p₂ is not DPP because both of its arguments are parametrized. It can be rewritten in a DPP-compliant way by introducing a variable s, replacing p₂ with s, and adding the constraint s = p₂.

  • Let e be an expression. The quotient e/p₁ is not DPP, but it can be rewritten as e p₂, where p₂ is a new parameter representing 1/p₁.

  • An expression of the form h(f(p₁)), where the atom h is concave and increasing but the inner expression f(p₁) is convex, is not DPP. It can be rewritten as h(p₂), where p₂ is a new parameter representing f(p₁).

  • If P is a parameter representing a (symmetric) positive semidefinite matrix and x is a variable, the quadratic form xᵀPx is not DPP. It can be rewritten as ‖P^{1/2}x‖₂², where P^{1/2} is a new parameter representing the matrix square root of P (see the sketch below).
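For instance, the last rewriting can be expressed in CVXPY as follows (a sketch; P_sqrt is a parameter holding the square root P^{1/2}, which must be computed outside the DSL):

import cvxpy as cp

n = 3
x = cp.Variable(n)
P_sqrt = cp.Parameter((n, n))      # carries P^{1/2}, computed outside CVXPY
quad = cp.sum_squares(P_sqrt @ x)  # equals x^T P x and is DPP-compliant
print(quad.is_dpp())               # True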

4.2 Canonicalization

The canonicalization of a disciplined parametrized program to ASA form is similar to the canonicalization of a disciplined convex program to a cone program. All nonlinear atoms are expanded into their graph implementations (grant2008graph), generating affine expressions of variables. The resulting expressions are also affine in the problem parameters due to the DPP rules. Because these expressions represent the problem data for the cone program, the function from parameters to problem data is affine.

As an example, the DPP program (4) can be canonicalized to the cone program

    minimize    t₀ + λt₁
    subject to  (t₀, Fx − g) ∈ Q^{m+1}        (5)
                (t₁, x) ∈ Q^{n+1}
                x ∈ R^n₊,

where (x, t₀, t₁) ∈ R^{n+2} is the variable, Q^k is the k-dimensional second-order cone, and R^n₊ is the nonnegative orthant. When rewritten in the standard form (2), with the variable ordered as (x, t₀, t₁), this problem has data

    A = −[ 0ᵀ  1  0
           F   0  0
           0ᵀ  0  1
           I   0  0
           I   0  0 ],    b = (0, −g, 0, 0, 0),    c = (0, 1, λ),

with the row blocks of A and b delimited according to the cone boundaries (the first two row blocks correspond to Q^{m+1}, the next two to Q^{n+1}, and the last to R^n₊). In this case, the parameters F, g, and λ are just negated and copied into the problem data.

The canonicalization map.

The full canonicalization procedure (which includes expanding graph implementations) only runs the first time the problem is canonicalized. When the same problem is canonicalized in the future (e.g., with new parameter values), the problem data can be obtained by multiplying a sparse matrix representing C by the parameter vector (θ, 1) (and reshaping); the adjoint of the derivative can be computed by just transposing the matrix. The naïve alternative — expanding graph implementations and extracting new problem data every time parameters are updated (and differentiating through this algorithm in the backward pass) — is much slower (see §7). The following lemma tells us that C can be represented as a sparse matrix.

Lemma 1.

The canonicalizer map for a disciplined parametrized program can be represented with a sparse matrix T_c ∈ R^{(n+1)×(p+1)} and a sparse tensor T_A ∈ R^{m×(n+1)×(p+1)}, where m is the dimension of the constraints. Letting θ̃ ∈ R^{p+1} denote the concatenation of θ and the scalar offset 1, the problem data can be obtained as c̃ = T_c θ̃ and [A b] = T_A θ̃.

The proof is given in Appendix A.
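In code, the cached canonicalizer map amounts to the following sketch (illustrative names, not the CVXPY internals): after the first canonicalization produces a sparse matrix C_mat, mapping new parameter values to problem data is a single sparse matrix-vector product, and the backward pass is a product with the transpose.

import numpy as np
import scipy.sparse as sp

def apply_canonicalizer(C_mat: sp.csr_matrix, theta: np.ndarray) -> np.ndarray:
    # Affine map from parameters to stacked cone problem data: C [theta; 1].
    return C_mat @ np.append(theta, 1.0)

def apply_canonicalizer_adjoint(C_mat: sp.csr_matrix, d_data: np.ndarray) -> np.ndarray:
    # Adjoint of the derivative: multiply by the transpose, drop the offset.
    return (C_mat.T @ d_data)[:-1]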

4.3 Derivative of a conic solver

By applying the implicit function theorem (fiacco1968; dontchev2009implicit) to the optimality conditions of a cone program, it is possible to compute its derivative Ds(A, b, c). To compute the adjoint Dᵀs(A, b, c), we follow the methods presented in (agrawal2019derivative) and (amos2019differentiable, §7.3). Our calculations are given in Appendix B.

If the cone program is not differentiable at a solution, we compute a heuristic quantity, as is common practice in automatic differentiation (griewank2008evaluating, §14). In particular, at non-differentiable points, a linear system that arises in the computation of the derivative might fail to be invertible. When this happens, we compute a least-squares solution to the system instead. See Appendix B for details.

4.4 Solution retrieval

The cone program obtained by canonicalizing a DPP-compliant problem uses the variable x̃ = (x, s), where s collects the slack variables introduced during canonicalization. If x̃* = (x*, s*) is optimal for the cone program, then x* is optimal for the original problem (up to reshaping and scaling by a constant). As such, a solution to the original problem can be obtained by slicing, i.e., R(x̃*) = x*. This map is evidently linear.

5 Implementation

We have implemented DPP and the reduction to ASA form in version 1.1 of CVXPY, a Python-embedded DSL for convex optimization (diamond2016cvxpy; agrawal2018rewriting); our implementation extends CVXCanon, an open-source library that reduces affine expression trees to matrices (miller2015). We have also implemented differentiable convex optimization layers in PyTorch and TensorFlow 2.0. These layers implement the forward and backward maps described in §4; they also efficiently support batched inputs (see §7).

We use the diffcp package (agrawal2019derivative) to obtain derivatives of cone programs. We modified this package for performance: we ported much of it from Python to C++, added an option to compute the derivative using a dense direct solve, and made the forward and backward passes amenable to parallelization.

Our implementation of DPP and ASA form, coupled with our PyTorch and TensorFlow layers, makes our software the first DSL for differentiable convex optimization layers. Our software is open-source. CVXPY and our layers are available at

https://www.cvxpy.org,  https://www.github.com/cvxgrp/cvxpylayers.

Example.

Below is an example of how to specify the problem (4) using CVXPY 1.1.

import cvxpy as cp

m, n = 20, 10
x = cp.Variable((n, 1))
F = cp.Parameter((m, n))
g = cp.Parameter((m, 1))
lambd = cp.Parameter((1, 1), nonneg=True)
objective_fn = cp.norm(F @ x - g) + lambd * cp.norm(x)
constraints = [x >= 0]
problem = cp.Problem(cp.Minimize(objective_fn), constraints)
assert problem.is_dpp()

The below code shows how to use our PyTorch layer to solve and backpropagate through problem (the code for our TensorFlow layer is almost identical; see Appendix D).

import torch
from cvxpylayers.torch import CvxpyLayer

F_t = torch.randn(m, n, requires_grad=True)
g_t = torch.randn(m, 1, requires_grad=True)
lambd_t = torch.rand(1, 1, requires_grad=True)
layer = CvxpyLayer(
    problem, parameters=[F, g, lambd], variables=[x])
x_star, = layer(F_t, g_t, lambd_t)
x_star.sum().backward()

Constructing layer canonicalizes problem to extract C and R, as described in §4.2. Calling layer applies the solution map from §4, returning a solution x_star to the problem. The final line computes the gradient of the sum of the entries of x_star with respect to F_t, g_t, and lambd_t.
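Because the layer behaves like any other PyTorch function, it can be dropped into a training loop. Below is a minimal sketch (the loss and the optimizer settings are illustrative, not from the paper):

import torch

opt = torch.optim.SGD([F_t, g_t, lambd_t], lr=1e-2)
for _ in range(10):
    opt.zero_grad()
    x_star, = layer(F_t, g_t, lambd_t)
    x_star.sum().backward()      # stand-in for a task-specific loss
    opt.step()
    with torch.no_grad():
        lambd_t.clamp_(min=0.0)  # keep lambd in its declared nonneg domain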

6 Examples

In this section, we present two applications of differentiable convex optimization, meant to be suggestive of possible use cases for our layer. We give more examples in Appendix E.

Figure 1: Gradients (black lines) of the logistic test loss with respect to the training data.

Figure 2: Per-iteration cost while learning an ADP policy for stochastic control.

6.1 Data poisoning attack

We are given training data (x_i, y_i), i = 1, ..., N, where x_i ∈ R^n are feature vectors and y_i ∈ {0, 1} are the labels. Suppose we fit a model with parameters θ for this classification problem by solving

    minimize  (1/N) Σ_{i=1}^N ℓ(θ; x_i, y_i) + r(θ),        (6)

where the loss function ℓ is convex in θ and r is a convex regularizer. We hope that the test loss L^test(θ) = (1/M) Σ_{i=1}^M ℓ(θ; x̃_i, ỹ_i) is small, where (x̃_i, ỹ_i), i = 1, ..., M, is our test set.

Assume that our training data is subject to a data poisoning attack (biggio2018wild; jagielski2018manipulating), before it is supplied to us. The adversary has full knowledge of our modeling choice, meaning that they know the form of (6), and seeks to perturb the data to maximally increase our loss on the test set, to which they also have access. The adversary is permitted to apply an additive perturbation δ_i ∈ R^n to each of the training points x_i, with the perturbations satisfying a norm bound ‖δ_i‖ ≤ ε.

Let θ* be optimal for (6). The gradient of the test loss with respect to a training data point, ∇_{x_i} L^test(θ*), gives the direction in which the point should be moved to achieve the greatest increase in test loss. Hence, one reasonable adversarial policy is to set each δ_i to a norm-ε step in this gradient direction. The quantity Σ_i ∇_{x_i} L^test(θ*)ᵀ δ_i is the predicted increase in our test loss due to the poisoning.
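With our layer, this gradient is one backward pass away. A sketch of the computation (it assumes the fit_logreg layer from Appendix C, a torch tensor X_t of training features, a test_loss function, and a budget eps; these are stand-in names, not fixed by the paper):

import torch

X_t.requires_grad_(True)             # training features, shape (N, n)
beta_star, b_star = fit_logreg(X_t)  # solve (6) in the forward pass
loss = test_loss(beta_star, b_star)  # scalar test loss at the fitted model
loss.backward()                      # X_t.grad holds the poisoning directions
deltas = eps * X_t.grad / X_t.grad.norm()  # norm-bounded attack step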

Numerical example.

We consider 30 training points and 30 test points in R², and we fit a logistic model with elastic-net regularization. This problem can be written using DPP, with the training data as parameters (see Appendix C for the code). We used our convex optimization layer to fit this model and obtain the gradient of the test loss with respect to the training data. Figure 1 visualizes the results. The orange (o) and blue (+) points are training data, belonging to different classes. The red line (dashed) is the hyperplane learned by fitting the model, while the blue line (solid) is the hyperplane that minimizes the test loss. The gradients are visualized as black lines, attached to the data points. Moving the points in the gradient directions torques the learned hyperplane away from the optimal hyperplane for the test set.

6.2 Convex approximate dynamic programming

We consider a stochastic control problem of the form

    minimize    E [ Σ_{t=0}^{T−1} x_tᵀ Q x_t + u_tᵀ R u_t ]
    subject to  x_{t+1} = A x_t + B u_t + w_t        (7)
                u_t = φ(x_t) ∈ U,

where x_t ∈ R^n is the state, φ : R^n → U is the policy, U ⊆ R^m is a convex set representing the allowed set of controls, and w_t is a (random, i.i.d.) disturbance. Here the variable is the policy φ, and the expectation is taken over disturbances and the initial state x_0. If U is not an affine set, then this problem is in general very difficult to solve (kalman1964linear; barratt2018stochastic).

ADP policy.

A common heuristic for solving (7) is approximate dynamic programming (ADP), which parametrizes the policy φ and replaces the minimization over functions with a minimization over parameters. In this example, we take U to be the unit ball and we represent φ as a quadratic control-Lyapunov policy (wang2010fast). Evaluating φ(x) corresponds to solving the SOCP

    minimize    (1/2) ‖P^{1/2} u‖₂² + xᵀ P_{21} u + qᵀ u
    subject to  u ∈ U,        (8)

with variable u and parameters P^{1/2}, P_{21}, q, and x (the current state). We can run stochastic gradient descent (SGD) on P^{1/2}, P_{21}, and q to approximately solve (7), which requires differentiating through (8). Note that if the controls were unconstrained, (7) could be solved exactly, via linear quadratic regulator (LQR) theory (kalman1964linear). The policy (8) can be written using DPP (see Appendix C for the code, and the training sketch below).
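A sketch of this procedure, with illustrative dynamics, costs, and hyperparameters (policy is the CvxpyLayer constructed in Appendix C; none of the numbers below are from the paper's experiment):

import torch

n, m, T = 4, 2, 10                         # illustrative dimensions, horizon
A = 0.9 * torch.eye(n)                     # hypothetical linear dynamics
B = torch.randn(n, m) / m
Qc, Rc = torch.eye(n), 0.1 * torch.eye(m)  # stage-cost matrices

P_sqrt = torch.eye(m, requires_grad=True)
P_21 = torch.zeros(n, m, requires_grad=True)
q = torch.zeros(m, 1, requires_grad=True)

opt = torch.optim.SGD([P_sqrt, P_21, q], lr=0.1)
for _ in range(20):
    x, cost = torch.randn(n, 1), 0.0
    for _ in range(T):                     # simulate one trajectory
        u, = policy(x, P_sqrt, P_21, q)    # evaluate the policy: solve (8)
        cost = cost + x.T @ Qc @ x + u.T @ Rc @ u
        x = A @ x + B @ u + 0.1 * torch.randn(n, 1)  # i.i.d. disturbance
    opt.zero_grad()
    cost.sum().backward()                  # differentiate through T SOCP solves
    opt.step()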

Numerical example.

Figure 2 plots the estimated average cost at each iteration of gradient descent for a numerical example, where the cost at each iteration is estimated by simulating a batch of trajectories over a fixed time horizon. We initialize our policy's parameters with the LQR solution, ignoring the constraint on the controls. This method decreased the average cost by roughly 40%.

7 Evaluation

Our implementation substantially lowers the barrier to using convex optimization layers. Here, we show that our implementation substantially reduces canonicalization time. Additionally, for dense problems, our implementation is competitive (in execution time) with a specialized solver for QPs; for sparse problems, our implementation is much faster.

Canonicalization.

                  Logistic regression    Stochastic control
    CVXPY 1.0.23      18.9 ± 1.75            12.5 ± 0.72
    CVXPY 1.1          1.49 ± 0.02            1.39 ± 0.02

Table 1: Time (ms) to canonicalize examples, mean ± standard deviation across 10 runs.

Table 1 reports the time it takes to canonicalize the logistic regression and stochastic control problems from §6, comparing CVXPY version 1.0.23 with CVXPY 1.1. Each canonicalization was performed on a single core of an unloaded Intel i7-8700K processor. We report the average time and standard deviation across 10 runs, excluding a warm-up run. Our extension achieves an order-of-magnitude speed-up on average, since computing the problem data via a sparse matrix multiply is much more efficient than going through the DSL.

Comparison to specialized layers.

(a) Dense QP, batch size of 128.
(b) Sparse QP, batch size of 32.
Figure 3: Comparison of our PyTorch CvxpyLayer to qpth, over 10 trials. For cvxpylayers, we separate out the canonicalization and solution retrieval times, to allow for a fair comparison.

We have implemented a batched solver and backward pass for our differentiable CVXPY layer that makes it competitive with the batched QP layer qpth from (amos2017optnet). Figure 3 compares the runtimes of our PyTorch CvxpyLayer and qpth on a dense and sparse QP. The sparse problem is too large for qpth to run in GPU mode. The QPs have the form

    minimize    (1/2) xᵀQx + qᵀx
    subject to  Ax = b,  Gx ≤ h,        (9)

with variable x ∈ R^n, and problem data Q ∈ S^n₊, q ∈ R^n, A ∈ R^{m×n}, b ∈ R^m, G ∈ R^{p×n}, and h ∈ R^p. The dense QP and the sparse QP differ in their dimensions; in the sparse QP, the matrices Q, A, and G each have 1% nonzeros (see Appendix E for the code). We ran this experiment on a machine with a 6-core Intel i7-8700K CPU, 32 GB of memory, and an Nvidia GeForce 1080 TI GPU with 11 GB of memory.

Our implementation is competitive with qpth for the dense QP, even on the GPU, and roughly 5 times faster for the sparse QP. Our backward pass for the dense QP uses our extension to diffcp; we explicitly materialize the derivatives of the cone projections and use a direct solve. Our backward pass for the sparse QP uses sparse operations and LSQR (paige1982lsqr), significantly outperforming qpth (which cannot exploit sparsity). Our layer runs on the CPU, and implements batching via Python multi-threading, with a parallel for loop over the examples in the batch for both the forward and backward passes. We used 12 threads for our experiments.

8 Discussion

Other solvers.

Solvers that are specialized to subclasses of convex programs are often faster than more general conic solvers. For example, one might use OSQP (stellato2017osqp) to solve QPs, or gradient-based methods like L-BFGS (liu1989limited) or SAGA (defazio2014saga) for empirical risk minimization. Because CVXPY lets developers add specialized solvers as additional back-ends, our implementation of DPP and ASA form can be easily extended to other problem classes. We plan to interface QP solvers in future work.

Nonconvex problems.

It is possible to differentiate through nonconvex problems, either analytically (fiacco1983nlp; pirnay2012optimal; amos2018differentiable) or by unrolling SGD (domke2012generic; belanger2017end; metz2016unrolled; goodfellow2013multi; stoyanov2011empirical; brakel2013training; finn2017model). Because convex programs can typically be solved efficiently and to high accuracy, it is preferable to use convex optimization layers over nonconvex optimization layers when possible. This is especially true in the setting of low-latency inference. The use of differentiable nonconvex programs in end-to-end learning pipelines, discussed in (gould2019deep), is an interesting direction for future research.

Acknowledgments

We gratefully acknowledge discussions with Eric Chu, who designed and implemented a code generator for SOCPs (chu2013code; chu2017qcml), Nicholas Moehle, who designed and implemented a basic version of a code generator for convex optimization in unpublished work, and Brendan O’Donoghue. We also would like to thank the anonymous reviewers, who provided us with useful suggestions that improved the paper. S. Barratt is supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518.

Appendix A The canonicalization map

In this appendix, we provide a proof of Lemma 1. We compute T_c and T_A via a reduction on the affine expression trees that represent the canonicalized problem. Let f be the root node, with arguments (descendants) e_1, ..., e_k. Then we obtain tensors T_1, ..., T_k representing the (linear) action of f on each argument. We recurse on each subtree and obtain tensors R_1, ..., R_k. Due to the DPP rules, for each i, at least one of T_i and R_i is parameter-free. We define a product operation ψ that, in either case, contracts T_i with R_i while carrying through the parameter dimension of whichever tensor has one. The tree rooted at f then evaluates to Σ_{i=1}^k ψ(T_i, R_i). (A code sketch of the matrix case of this recursion follows the base-case list below.)

The base case of the recursion corresponds to the tensors produced when a variable, parameter, or constant node is evaluated. (These are the leaf nodes of an affine expression tree.)

  • A variable leaf produces a tensor T in which T_{ij} = 1 if entry i of the leaf maps to entry j of the vector containing all variables, and 0 otherwise.

  • A parameter leaf produces a tensor T in which T_{ij} = 1 if entry i of the leaf maps to entry j of the vector containing all parameters, and 0 otherwise.

  • A constant leaf with value c produces a tensor T in which the entries of the slot corresponding to the constant offset (the trailing 1 in θ̃) are T_i = c_i.
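The matrix (parameter-free) case of this recursion can be sketched as follows; the Node class and names are illustrative, not the CVXCanon implementation:

import numpy as np

class Node:
    """An affine expression tree node: per-argument linear maps and children."""
    def __init__(self, arg_maps, args=()):
        self.arg_maps = arg_maps  # linear action of the atom on each argument
        self.args = args          # subtrees; empty for leaves

def reduce_tree(node):
    # A leaf directly stores its selector matrix into the stacked variables.
    if not node.args:
        return node.arg_maps[0]
    # An interior node sums its per-argument actions applied to each subtree.
    return sum(T @ reduce_tree(arg) for T, arg in zip(node.arg_maps, node.args))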

Appendix B Derivative of a cone program

In this appendix, we show how to differentiate through a cone program. We first present some preliminaries.

Primal-dual form of a cone program.

A (convex) cone program is given by the primal-dual pair

    (P)  minimize    cᵀx            (D)  maximize    −bᵀy
         subject to  Ax + s = b          subject to  Aᵀy + c = 0        (10)
                     s ∈ K                           y ∈ K*.

Here x ∈ R^n is the primal variable, y ∈ R^m is the dual variable, and s ∈ R^m is the primal slack variable. The set K ⊆ R^m is a nonempty, closed, convex cone with dual cone K* ⊆ R^m. We call (x, y, s) a solution of the primal-dual cone program (10) if it satisfies the KKT conditions

    Ax + s = b,   Aᵀy + c = 0,   s ∈ K,   y ∈ K*,   sᵀy = 0.

Every convex optimization problem can be reformulated as a convex cone program.

Homogeneous self-dual embedding.

The homogeneous self-dual embedding reduces the process of solving (10) to finding a zero of a certain residual map (ye1994nl). Letting N = n + m + 1, the embedding uses the variable z ∈ R^N, which we partition as z = (u, v, w) ∈ R^n × R^m × R. The normalized residual map introduced in (busseti2018solution) is the function N, defined by

    N(z, θ) = ((Q(θ) − I)Π + I)(z / |w|),

where Π denotes projection onto R^n × K* × R₊, and Q(θ) is the skew-symmetric matrix

    Q(θ) = [  0    Aᵀ   c
             −A    0    b
             −cᵀ  −bᵀ   0 ].        (11)

If N(z, θ) = 0 and w > 0, we can use z to construct a solution of the primal-dual pair (10) as

    (x, y, s) = (u/w, Π_{K*}(v)/w, (Π_{K*}(v) − v)/w),        (12)

where Π_{K*}(v) denotes the projection of v onto K*. From here onward, we assume that w = 1. (If this is not the case, we can scale z such that it is the case.)

Differentiation.

A conic solver is a numerical algorithm for solving (10). We can view a conic solver as a function s mapping the problem data θ = (A, b, c) to a solution (x, y, s). (We assume that the cone K is fixed.) In this section we derive expressions for the derivative of s, assuming that s is in fact differentiable. Interlaced with our derivations, we describe how to numerically evaluate the adjoint of the derivative map, which is necessary for backpropagation.

Following agrawal2019derivative and (amos2019differentiable, Section 7), we can express s as the composition φ ∘ z ∘ Q, where

  • Q maps the problem data (A, b, c) to the matrix Q(θ), given by (11),

  • z maps Q(θ) to a solution of the homogeneous self-dual embedding, which we can implicitly differentiate, and

  • φ maps z to the primal-dual pair (x, y, s), given by (12).

To backpropagate through s, we need to compute the adjoint of the derivative of s at θ applied to the vector (dx, dy, ds), or

    dθ = Dᵀs(θ)(dx, dy, ds) = DᵀQ(θ)( Dᵀz(Q)( Dᵀφ(z)(dx, dy, ds) ) ).

Since our layer only outputs the primal solution x, we can simplify the calculation by taking dy = 0 and ds = 0. By (12),

    dz = Dᵀφ(z)(dx, 0, 0) = (dx, 0, −xᵀdx).

We can compute dQ = Dᵀz(Q)(dz) by implicitly differentiating the normalized residual map: since N(z(Q), Q) = 0,

    Dz(Q) = −(D_z N(z, θ))^{−1} D_Q N(z, θ).        (13)

This gives

    dQ = −g Π(z)ᵀ,

where g = M^{−T} dz and M = (Q − I) DΠ(z) + I. Computing g via a direct method (i.e., materializing M, factorizing it, and back-solving) can be impractical when M is large. Instead, one might use a Krylov method like LSQR (paige1982lsqr) to solve

    Mᵀg = dz,        (14)

which only requires multiplication by M and Mᵀ. Instead of computing dQ as an outer product, we only obtain its nonzero entries. Finally, partitioning dQ as

    dQ = [ dQ₁₁  dQ₁₂  dQ₁₃
           dQ₂₁  dQ₂₂  dQ₂₃
           dQ₃₁  dQ₃₂  dQ₃₃ ],

we obtain

    dA = dQ₁₂ᵀ − dQ₂₁,    db = dQ₂₃ − dQ₃₂ᵀ,    dc = dQ₁₃ − dQ₃₁ᵀ.

Non-differentiability.

To implicitly differentiate the solution map of the embedding, we assumed that the matrix M was invertible. When M is not invertible, we approximate dQ as −gΠ(z)ᵀ, where g is instead a least-squares solution to (14).
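In code, the invertible and non-invertible cases can be handled uniformly by a least-squares solver. A sketch using SciPy's LSQR (illustrative, not the diffcp internals):

import numpy as np
from scipy.sparse.linalg import lsqr

def backward_solve(M, dz: np.ndarray) -> np.ndarray:
    # Solve M^T g = dz; when M is singular, LSQR returns a least-squares
    # solution, which is the heuristic quantity described above.
    return lsqr(M.T, dz)[0]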

Appendix C Examples

This appendix includes code for the examples presented in §6.

Logistic regression.

The code for the logistic regression problem is below:

import cvxpy as cp
from cvxpylayers.torch import CvxpyLayer

# n: feature dimension, N: number of training points, and Y: the (N, 1)
# matrix of labels, all assumed defined in the surrounding context.
beta = cp.Variable((n, 1))
b = cp.Variable((1, 1))
X = cp.Parameter((N, n))

log_likelihood = (1. / N) * cp.sum(
    cp.multiply(Y, X @ beta + b) - cp.logistic(X @ beta + b)
)
regularization = -0.1 * cp.norm(beta, 1) - 0.1 * cp.sum_squares(beta)

prob = cp.Problem(cp.Maximize(log_likelihood + regularization))
fit_logreg = CvxpyLayer(prob, parameters=[X], variables=[beta, b])

Stochastic control.

The code for the stochastic control problem (7) is below:

import cvxpy as cp
from cvxpylayers.torch import CvxpyLayer

# n: state dimension and m: control dimension, assumed defined in context.
x_cvxpy = cp.Parameter((n, 1))
P_sqrt_cvxpy = cp.Parameter((m, m))
P_21_cvxpy = cp.Parameter((n, m))
q_cvxpy = cp.Parameter((m, 1))

u_cvxpy = cp.Variable((m, 1))
y_cvxpy = cp.Variable((n, 1))

objective = .5 * cp.sum_squares(P_sqrt_cvxpy @ u_cvxpy) + x_cvxpy.T @ y_cvxpy + q_cvxpy.T @ u_cvxpy
prob = cp.Problem(cp.Minimize(objective),
  [cp.norm(u_cvxpy) <= .5, y_cvxpy == P_21_cvxpy @ u_cvxpy])

policy = CvxpyLayer(prob,
  parameters=[x_cvxpy, P_sqrt_cvxpy, P_21_cvxpy, q_cvxpy],
  variables=[u_cvxpy])

Appendix D TensorFlow layer

In §5, we showed how to implement the problem (4) using our PyTorch layer. The below code shows how to implement the same problem using our TensorFlow 2.0 layer.

import tensorflow as tf
from cvxpylayers.tensorflow import CvxpyLayer

F_t = tf.Variable(tf.random.normal(F.shape))
g_t = tf.Variable(tf.random.normal(g.shape))
lambd_t = tf.Variable(tf.random.uniform(lambd.shape))  # nonneg initialization,
                                                       # matching lambd's attribute
layer = CvxpyLayer(problem, parameters=[F, g, lambd], variables=[x])
with tf.GradientTape() as tape:
  x_star, = layer(F_t, g_t, lambd_t)
dF, dg, dlambd = tape.gradient(x_star, [F_t, g_t, lambd_t])

Appendix E Additional examples

In this appendix we provide additional examples of constructing differentiable convex optimization layers using our implementation. We present the implementation of common neural network layers, even though analytic solutions exist for some of these operations. These layers can be modified in simple ways such that they do not have analytical solutions. In the problems below, the optimization variable is y ∈ R^n (unless stated otherwise). We also show how prior work on differentiable convex optimization layers such as OptNet (amos2017optnet) is captured by our framework.

The ReLU, defined by f(x) = max(x, 0) (elementwise), can be interpreted as projecting a point x ∈ R^n onto the nonnegative orthant:

    f(x) = argmin_y  ‖x − y‖₂²   subject to  y ≥ 0.

We can implement this layer with:

x = cp.Parameter(n)
y = cp.Variable(n)
obj = cp.Minimize(cp.sum_squares(x - y))
cons = [y >= 0]
prob = cp.Problem(obj, cons)
layer = CvxpyLayer(prob, parameters=[x], variables=[y])

The sigmoid or logistic function, defined by f(x)_i = 1/(1 + e^{−x_i}), can be interpreted as projecting a point x ∈ R^n onto the interior of the unit hypercube:

    f(x) = argmin_y  −xᵀy − Σ_i H_b(y_i),

where H_b(p) = −p log p − (1 − p) log(1 − p) is the binary entropy function. This is proved, e.g., in (amos2019differentiable, Section 2.4). We can implement this layer with:

x = cp.Parameter(n)
y = cp.Variable(n)
obj = cp.Minimize(-x @ y - cp.sum(cp.entr(y) + cp.entr(1. - y)))
prob = cp.Problem(obj)
layer = CvxpyLayer(prob, parameters=[x], variables=[y])

The softmax, defined by f(x)_i = e^{x_i} / Σ_j e^{x_j}, can be interpreted as projecting a point x ∈ R^d onto the interior of the probability simplex:

    f(x) = argmin_y  −xᵀy − H(y)   subject to  1ᵀy = 1,

where H(y) = −Σ_i y_i log y_i is the entropy function. This is proved, e.g., in (amos2019differentiable, Section 2.4). We can implement this layer with:

x = cp.Parameter(d)
y = cp.Variable(d)
obj = cp.Minimize(-x @ y - cp.sum(cp.entr(y)))
cons = [cp.sum(y) == 1.]
prob = cp.Problem(obj, cons)
layer = CvxpyLayer(prob, parameters=[x], variables=[y])

The sparsemax (martins2016softmax) does a Euclidean projection onto the simplex:

    f(x) = argmin_y  ‖x − y‖₂²   subject to  1ᵀy = 1,  0 ≤ y ≤ 1.

We can implement this layer with:

x = cp.Parameter(n)
y = cp.Variable(n)
obj = cp.sum_squares(x - y)
cons = [cp.sum(y) == 1, 0. <= y, y <= 1.]
prob = cp.Problem(cp.Minimize(obj), cons)
layer = CvxpyLayer(prob, [x], [y])

The constrained softmax (martins2017learning) solves the optimization problem

    f(x; u) = argmin_y  −xᵀy − H(y)   subject to  1ᵀy = 1,  y ≤ u,

where u is a vector of upper bounds. We can implement this layer with:

x = cp.Parameter(n)
y = cp.Variable(n)
obj = -x @ y - cp.sum(cp.entr(y))
cons = [cp.sum(y) == 1., y <= u]  # u: constant vector of upper bounds
prob = cp.Problem(cp.Minimize(obj), cons)
layer = CvxpyLayer(prob, [x], [y])

The constrained sparsemax (malaviya2018sparse) solves the optimization problem

    f(x; u) = argmin_y  ‖x − y‖₂²   subject to  1ᵀy = 1,  0 ≤ y ≤ u.

We can implement this layer with:

x = cp.Parameter(n)
y = cp.Variable(n)
obj = cp.sum_squares(x - y)
cons = [cp.sum(y) == 1., 0. <= y, y <= u]  # u: constant vector of upper bounds
prob = cp.Problem(cp.Minimize(obj), cons)
layer = CvxpyLayer(prob, [x], [y])

The Limited Multi-Label (LML) layer (amos2019lml) solves the optimization problem

    f(x; k) = argmin_y  −xᵀy − Σ_i H_b(y_i)   subject to  1ᵀy = k,

where k is the number of labels to select. We can implement this layer with:

x = cp.Parameter(n)
y = cp.Variable(n)
obj = -x @ y - cp.sum(cp.entr(y)) - cp.sum(cp.entr(1. - y))
cons = [cp.sum(y) == k]  # k: number of labels to select
prob = cp.Problem(cp.Minimize(obj), cons)
layer = CvxpyLayer(prob, [x], [y])

The OptNet QP.

We can re-implement the OptNet QP layer (amos2017optnet) in a few lines of code. The OptNet layer is a solution to a convex quadratic program of the form

    minimize    (1/2) xᵀQx + qᵀx
    subject to  Ax = b,  Gx ≤ h,

where x ∈ R^n is the optimization variable, and the problem data are Q ∈ S^n₊ (which is positive semidefinite), q ∈ R^n, A ∈ R^{m×n}, b ∈ R^m, G ∈ R^{p×n}, and h ∈ R^p. We can implement this with:

Q_sqrt = cp.Parameter((n, n))
q = cp.Parameter(n)
A = cp.Parameter((m, n))
b = cp.Parameter(m)
G = cp.Parameter((p, n))
h = cp.Parameter(p)
x = cp.Variable(n)
obj = cp.Minimize(0.5 * cp.sum_squares(Q_sqrt @ x) + q.T @ x)
cons = [A @ x == b, G @ x <= h]
prob = cp.Problem(obj, cons)
layer = CvxpyLayer(prob, parameters=[Q_sqrt, q, A, b, G, h], variables=[x])

Note that we take the matrix square root of Q in PyTorch, outside the CVXPY layer, to get the derivative with respect to Q. DPP does not allow the quadratic form atom to be parametrized, as discussed in §4.1.