Deep learning models (LeCun2015) have demonstrated remarkable success in tasks that require the exploitation of subtle correlations, such as computer vision (Krizhevsky2012) and sequence learning (Sutskever2014). Humans typically have strong prior knowledge about a task, e.g., based on symmetry, geometry, or physics. Learning such a priori assumptions in a purely data-driven manner is inefficient and, in some situations, may not be feasible at all. While certain prior knowledge has been successfully imposed – for example, translational symmetry through convolutional architectures (LeCun1998) – incorporating more general modeling assumptions into the training of deep networks remains an open challenge. Recently, generative neural networks have advanced significantly (Goodfellow2014; Kingma2014). With such models, controlling the generative process beyond a data-driven, black-box approach is particularly important.
In this paper, we present a method to impose prior knowledge in the form of linear inequality constraints on the activations of deep learning models. We directly impose these constraints through a suitable parameterization of the feasible set. This has several advantages:
The constraints are hard-constraints in the sense that they are satisfied at any point during training.
There is no manual trade-off between constraint satisfaction and data representation.
The proposed method can easily be applied to constrain not only the network output, but also any intermediate activations.
In summary, the main contribution of our method is a reparameterization that incorporates linear inequality hard-constraints on neural network activations. The model can be optimized by standard variants of stochastic gradient descent. As an application in generative modeling, we demonstrate that our method is able to produce authentic samples from a variational autoencoder while satisfying the imposed constraints.
2 Related Work
Various works have introduced methods to impose some type of hard constraint on neural network activations. This differs from a classical constrained optimization problem (Nocedal2006) in that the constraints are on the image of a parameterized function rather than optimization variables, i.e., neural network parameters.
Marquez-Neila2017 formulated generic differentiable equality constraints as soft constraints and employed a Lagrangian approach to train their model. While this is a principled approach to constrained optimization, it does not scale well to practical deep neural network models with their vast number of parameters. To make their method computationally tractable, a subset of the constraints is selected at each training step. In addition, these constraints are locally linearized; thus, there is no guarantee that this subset will be satisfied after a parameter update.
Pathak2015 proposed an optimization scheme that alternates between optimizing the deep learning model and fitting a constrained distribution to these intermediate models. They deal with a classification task, and the fitting step is in the Kullback-Leibler sense. However, this method involves solving a (convex) optimization subproblem at each training step. Furthermore, the overall convergence path depends on how the alternating optimization steps are combined, which introduces an additional hyperparameter that must be tuned.
OptNet, an approach to solve a generic quadratic program as a differentiable network layer, was proposed by Amos2017. OptNet backpropagates through the first-order optimality conditions of the quadratic program, and linear inequality constraints can be enforced as a special case. The formulation is flexible; however, it scales cubically with the number of variables and constraints. Thus, it becomes prohibitively expensive to train large-scale deep learning models.
Finally, several works have proposed handcrafted solutions for specific applications, such as skeleton prediction (Zhou2016) and prediction of rigid body motion (Byravan2017). In contrast, to avoid laborious architecture design, we argue for the value of generically modeling constraint classes. In practice, this makes constraint methods more accessible for a broader class of problems.
In this work, we tackle the problem of imposing linear inequality constraints on neural network activations. Rather than solving an optimization subproblem during training, we split this task into a feasibility step at initialization and an optimality step during training. At initialization, we compute a suitable parameterization of the constraint set, and during training we use the neural network training algorithm to find a good solution within this feasible set. Conceptually, compared to an unconstrained model, we trade off computational cost at initialization to obtain a model that can be trained with nearly no overhead. The proposed method is implemented as a neural network layer that is specified by a set of linear inequalities and whose output parameterizes the feasible set.
3 Linear Inequality Constraints for Deep Learning Models
We consider a generic \(L\)-layer neural network with model parameters \(\theta\) for inputs \(x\) as follows:
\[ f_\theta(x) = \sigma(g_L(\sigma(g_{L-1}(\cdots \sigma(g_1(x)) \cdots)))), \tag{1} \]
where the \(g_i\) are affine functions, e.g., a fully-connected or convolutional layer, and \(\sigma\) is an elementwise non-linearity. (Formally, \(\sigma\) maps between different spaces for different layers and may also be a different element-wise non-linearity for each layer. We omit such details in favor of notational simplicity.) The training inputs are known, and a loss is minimized as a function of the network parameters \(\theta\). A typical loss for a classification task is the cross entropy between the network output and the empirical target distribution, while the mean-squared error is commonly used for a regression task. The proposed method can be applied to constrain any linear activations \(g_i(\cdot)\) or non-linear activations \(\sigma(g_i(\cdot))\). In many applications, one would like to constrain the output \(f_\theta(x)\).
Here, we want to force neural network activations \(y \in \mathbb{R}^d\) to satisfy a set of \(m\) linear inequality constraints in \(d\) dimensions, i.e., to be constrained within the convex polyhedron
\[ \mathcal{P} = \{ y \in \mathbb{R}^d : Ay \le b \}, \qquad A \in \mathbb{R}^{m \times d},\; b \in \mathbb{R}^m. \tag{2} \]
We seek a parameterization of the constraint set (2) that works well with end-to-end gradient-based training. A suitable description of the convex polyhedron is obtained by the decomposition theorem for polyhedra.
Theorem 1 (Decomposition of polyhedra, Minkowski-Weyl).
A set \(\mathcal{P} \subseteq \mathbb{R}^d\) is a convex polyhedron of the form (2) if and only if
\[ \mathcal{P} = \Big\{ \textstyle\sum_{k=1}^{K} \lambda_k v_k + \sum_{l=1}^{M} \mu_l r_l \;:\; \lambda_k \ge 0,\ \sum_{k=1}^{K} \lambda_k = 1,\ \mu_l \ge 0 \Big\} \tag{3} \]
for finitely many vertices \(v_1, \dots, v_K \in \mathbb{R}^d\) and rays \(r_1, \dots, r_M \in \mathbb{R}^d\).
Furthermore, \(\mathcal{P}\) is of the form (2) with \(b = 0\) if and only if
\[ \mathcal{P} = \Big\{ \textstyle\sum_{l=1}^{M} \mu_l r_l \;:\; \mu_l \ge 0 \Big\} \tag{4} \]
for finitely many rays \(r_1, \dots, r_M \in \mathbb{R}^d\).
Such a polyhedron is shown in Figure 2. If the polyhedron is bounded (as in the figure), it is fully described by the convex hull of its vertices. If it is unbounded (not shown), it has an additional conic contribution. Each polyhedral set of the form (2) can be expressed as
\[ \mathcal{C} = \{ (y, y_{d+1}) \in \mathbb{R}^{d+1} : Ay - b\,y_{d+1} \le 0,\; y_{d+1} \ge 0 \} \tag{5} \]
in the sense that if \((y, 1) \in \mathcal{C}\), then \(y \in \mathcal{P}\). We refer to this form as the homogeneous formulation of the problem. In other words, with one additional constraint, every convex polyhedron can be lifted by one dimension to a polyhedral cone. In this description, the vertices of \(\mathcal{P}\) can be considered as endpoints of rays of \(\mathcal{C}\). The homogeneous problem is often formulated with the equality constraint \(y_{d+1} = 1\); however, the relaxed inequality formulation is numerically advantageous (Section 3.2).
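To make the lifting concrete, the following minimal sketch (a hypothetical 2D triangle as feasible set; NumPy only) verifies that a point \(y \in \mathcal{P}\) lifted to \((y, 1)\) lies in the homogeneous cone, and that the cone is closed under non-negative scaling:

```python
import numpy as np

# Hypothetical 2D example: the triangle P = {y : A y <= b}.
A = np.array([[-1.0, 0.0],    # y1 >= 0
              [0.0, -1.0],    # y2 >= 0
              [1.0, 1.0]])    # y1 + y2 <= 1
b = np.array([0.0, 0.0, 1.0])

# Homogeneous lifting: C = {(y, t) : A y - b t <= 0, t >= 0}.
A_tilde = np.hstack([A, -b[:, None]])             # rows [A, -b]
A_tilde = np.vstack([A_tilde, [0.0, 0.0, -1.0]])  # extra row enforces t >= 0

def in_cone(p, tol=1e-9):
    return bool(np.all(A_tilde @ p <= tol))

y = np.array([0.25, 0.25])   # a point in P
p = np.append(y, 1.0)        # lifted point (y, 1)
print(in_cone(p))            # True: (y, 1) in C iff y in P
print(in_cone(3.0 * p))      # True: cones are closed under scaling by alpha >= 0
```

The second check is exactly the property used later for the rescaling step in Section 3.2.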
The theorem states that an intersection of half-spaces (half-space or H-representation) can be written as the Minkowski sum of a convex combination of the polyhedron’s vertices and a conical combination of some rays (vertex or V-representation). One can switch algorithmically between these two viewpoints via the double description method (Motzkin1953; Fukuda1996), which we discuss in the following. Thus, the H-representation, which is natural when modeling inequality constraints, can be transformed into the V-representation, which can be incorporated into gradient-based neural network training.
3.1 Double Description Method
The double description method converts between the half-space and vertex representations of a system of linear inequalities. It was originally proposed by Motzkin1953 and further refined by Fukuda1996. (In our experiments, we use pycddlib, which is a Python wrapper of Fukuda's cddlib.) Here, we are only interested in the conversion from H-representation to V-representation in homogeneous form (5), i.e., in computing rays \(r_1, \dots, r_M\) such that
\[ \{ \tilde{y} \in \mathbb{R}^{d+1} : \tilde{A}\tilde{y} \le 0 \} = \Big\{ \textstyle\sum_{l=1}^{M} \mu_l r_l \;:\; \mu_l \ge 0 \Big\}, \]
where the rows of \(\tilde{A}\) collect the homogeneous inequalities of (5).
The core algorithm proceeds as follows. Let the rows of \(\tilde{A}\) define a set of homogeneous inequalities \(\tilde{A}\tilde{y} \le 0\), and let \(R\) be the matrix whose columns are the rays of the corresponding cone. Then \((\tilde{A}, R)\) form a double description pair. The algorithm iteratively builds a double description pair \((\tilde{A}_{K+1}, R_{K+1})\) from \((\tilde{A}_K, R_K)\) in the following manner. The rows of \(\tilde{A}_K\) represent a \(K\)-subset of the rows of \(\tilde{A}\) and thus define a convex polyhedron \(\mathcal{C}_K\) associated with \(R_K\). A single row \(\tilde{a}\) is added to \(\tilde{A}_K\): if, for two columns \(r_i, r_j\) of \(R_K\), the set \(\operatorname{cone}(r_i, r_j)\) intersects the hyperplane \(\{\tilde{y} : \tilde{a}^\top \tilde{y} = 0\}\) and this intersection is a face (\(F\) is a face of the convex set \(C\) if for all \(x, y \in C\) with \(\tfrac{1}{2}(x + y) \in F\) it holds that \(x, y \in F\)) of \(\mathcal{C}_{K+1}\), then this intersection point is added to \(R_{K+1}\). Existing rays that are cut off by the additional hyperplane are removed. The result is the double description pair \((\tilde{A}_{K+1}, R_{K+1})\). This procedure is shown in Figure 2.
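For intuition, the H-to-V conversion can also be carried out by brute force on small bounded problems. The following sketch (not the double description method itself, and exponential in general) enumerates vertices by solving every \(d \times d\) subsystem of the inequalities with equality and keeping the feasible solutions:

```python
import itertools
import numpy as np

def enumerate_vertices(A, b, tol=1e-9):
    """Brute-force H->V conversion for a *bounded* polyhedron {y : A y <= b}.

    Not the double description method: every d x d subsystem is solved with
    equality and the feasible solutions are kept. Exponential in general,
    but fine for tiny illustrative problems."""
    m, d = A.shape
    vertices = []
    for idx in itertools.combinations(range(m), d):
        A_sub, b_sub = A[list(idx)], b[list(idx)]
        try:
            v = np.linalg.solve(A_sub, b_sub)
        except np.linalg.LinAlgError:
            continue  # the chosen hyperplanes are parallel / degenerate
        if np.all(A @ v <= b + tol):
            vertices.append(v)
    return np.unique(np.round(vertices, 9), axis=0)

# Unit square [0, 1]^2 in H-representation.
A = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
b = np.array([0.0, 1.0, 0.0, 1.0])
print(enumerate_vertices(A, b))  # the four corners of the square
```

In practice, as noted above, we use pycddlib for this step; the brute-force variant only illustrates what the V-representation contains.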
Adding a hyperplane might drastically increase the number of rays in intermediate representations, which, in turn, contribute combinatorially in the subsequent iteration. In fact, there exist worst-case polyhedra for which the algorithm has exponential run time as a function of the number of inequalities and the input dimension, as well as the number of rays (Dyer1983; Bremner1999). Under certain assumptions, more efficient bounds are known. A convex polyhedron \(\mathcal{P} \subseteq \mathbb{R}^d\) is degenerate if there exists \(y \in \mathcal{P}\) that fulfills more than \(d\) of the inequalities with equality; otherwise, \(\mathcal{P}\) is nondegenerate. For nondegenerate polyhedra, the problem can be solved with time complexity polynomial in the number \(|R|\) of rays in the final V-representation and the number \(m\) of constraints (Avis1992). However, \(|R|\) and \(m\) may depend unfavorably on the dimension \(d\). An extreme example is the unit box \([0,1]^d\), where \(m = 2d\) and \(|R| = 2^d\); thus, the algorithm has exponential run time in the dimension \(d\). Overall, one can expect the algorithm to be efficient only for problems with a reasonably small number of inequalities \(m\) and dimension \(d\).
3.2 Integration in Neural Network Architectures
We parameterize the homogeneous form (5) of the problem via a neural network layer. This layer takes as input some (latent) representation of the data, which is mapped to activations satisfying the desired hard constraints. The algorithm is provided with the H-representation of the linear inequality constraints, i.e., \(A\) and \(b\) specifying the feasible set (5). At initialization, we convert this to the V-representation via the double description method (Section 3.1). This corresponds to computing the set of rays that represent the polyhedral cone. During training, the neural network training algorithm is used to optimize within the feasible set. There are two critical aspects to this procedure. First, as outlined in Section 3.1, the run-time complexity of the double description method may be prohibitive. Conceptually, the proposed approach allows for significant compute time at initialization to obtain an algorithm that is very efficient at training time. Second, we must ensure that the mapping from the latent representation to the conical combination parameters integrates well with the training algorithm. We assume that the model is trained with gradient-based backpropagation, as is common for current deep learning applications. The constraint layer comprises an affine mapping (a fully-connected layer with biases) followed by the element-wise absolute value function, which ensures the non-negativity required of the conical combination parameters. To ensure the constraint on the last coordinate \(y_{d+1}\), we rescale the entire output vector by \(1/y_{d+1}\) whenever the constraint is violated. This is a valid operation on a cone, i.e., if \(\tilde{y} \in \mathcal{C}\), then \(\alpha\tilde{y} \in \mathcal{C}\) for \(\alpha \ge 0\). Therefore, our original constraints are not violated.
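A minimal forward-pass sketch of such a constraint layer (NumPy; a hypothetical triangle constraint set whose rays are assumed precomputed, whereas the actual implementation would obtain them from the double description method):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rays of the homogeneous cone for the triangle {y : y >= 0, y1 + y2 <= 1}:
# its vertices (0,0), (1,0), (0,1), lifted to endpoints of rays (v, 1).
R = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]]).T   # columns are rays in R^{d+1}

W = rng.normal(size=(3, 8))          # affine map from an 8-dim latent code
c = rng.normal(size=3)

def constraint_layer(z):
    """Map a latent code z to a point of the cone spanned by the rays R."""
    mu = np.abs(W @ z + c)           # |.| ensures conical coefficients mu >= 0
    x = R @ mu                       # conical combination: x lies in the cone
    if x[-1] > 1.0:                  # rescale if the last coordinate violates
        x = x / x[-1]                # its constraint (a valid operation on a cone)
    return x

x = constraint_layer(rng.normal(size=8))
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
# Homogeneous constraints A y - b t <= 0 and the bound on t hold by construction.
print(np.all(A @ x[:2] - b * x[-1] <= 1e-9), x[-1] <= 1.0 + 1e-9)
```

Every operation here is differentiable almost everywhere, so the same forward pass trains with standard backpropagation.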
Several choices must be made to obtain a working backpropagation algorithm. Since adding a constraint to the network maps Euclidean space to a smaller subset, such a formulation is intuitively susceptible to a vanishing gradient problem: there must be points that are distant in Euclidean space but are mapped to relatively close points in the feasible set. Thus, moving far in Euclidean space induces only a small change in the feasible set. A one-dimensional example of such a mapping is the sigmoid, which exhibits vanishing gradients for large positive and negative inputs. We build on this intuition to design our constraint layer. We do not work with the non-homogeneous formulation (2) directly. Instead, we lift the problem to the homogeneous formulation (5). In fact, a natural way to enforce that the convex combination parameters \(\lambda\) of the decomposition (3) lie on the probability simplex \(\Delta = \{\lambda \in \mathbb{R}^K : \lambda_k \ge 0,\ \sum_k \lambda_k = 1\}\) is via a softmax mapping, as follows:
\[ \lambda_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}. \]
This function has vanishing gradients when one entry \(z_k\) is significantly greater than the other vector entries. In this case, the softmax maps close to a vertex of the probability simplex. This is undesirable behavior for our task and leads to slow convergence, which we also observed numerically. In the homogeneous formulation, we must enforce an additional constraint on the last coordinate. We enforce an inequality constraint rather than the equality constraint \(y_{d+1} = 1\), which would be an equally valid description of the problem. However, with the inequality, the optimization algorithm is less constrained, and we only need to interfere by rescaling when the constraint is violated. Finally, we selected the absolute value function to enforce non-negativity of the conical combination parameters. In theory, any function with non-negative range would fulfill this requirement; however, care must be taken not to interfere with backpropagated gradients. For example, the ReLU function \(\max(0, z)\), which is commonly used as a non-linearity in neural networks, has zero gradient for \(z < 0\). This implies that conical combination parameters that are zero cannot become non-zero during optimization. The absolute value function interferes least with the backpropagated gradient in the sense that it preserves the magnitude of the backpropagated signal and at most changes its sign.
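The saturation of the softmax near a simplex vertex can be read off its Jacobian; a small numerical illustration (using a simplified three-logit setting chosen for this sketch):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shifted for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)   # d softmax_i / d z_j

# One dominant logit: the softmax output sits near a simplex vertex ...
z = np.array([10.0, 0.0, 0.0])
J = softmax_jacobian(z)
print(np.abs(J).max())   # ... and all Jacobian entries are tiny: vanishing gradient

# By contrast, |.| passes gradients of magnitude exactly 1 (for z != 0),
# at most flipping the sign: d|z|/dz = sign(z).
print(np.abs(np.sign(z[z != 0])))    # array of ones
```

This contrast is the motivation for combining the homogeneous lifting with the absolute value function instead of a simplex-constrained softmax.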
3.3 Combining Modeling and Domain Constraints
Domain constraints are often formulated as box constraints, \(y \in [l, u]^d\), such as a pixel intensity domain in computer vision applications. As indicated in Section 3.1, box constraints are particularly ill-suited to conversion with the double description method because the number of vertices is exponential in the dimension. Therefore, in this paper, we distinguish between modeling constraints and domain constraints, and we only convert the former into V-representation while enforcing the latter through a joint projection step at test time. These two types of constraints differ conceptually: while the modeling constraints may conflict with the data-driven task, the domain constraints are in line with fitting the data. Consequently, the joint projection after successful training can be expected to result in only a small correction.
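One possible sketch of such a test-time step uses alternating projections between the modeling halfspaces and the box domain. Note that this is an assumption-laden simplification: it converges to *a* feasible point of the intersection, not necessarily the exact joint orthogonal projection.

```python
import numpy as np

def project_halfspace(y, a, beta):
    """Orthogonal projection onto the halfspace {y : <a, y> <= beta}."""
    viol = a @ y - beta
    return y - (max(viol, 0.0) / (a @ a)) * a

def feasible_point(y, A, b, lo, hi, iters=200):
    """Alternating projections onto modeling halfspaces and the box [lo, hi]^d.

    This yields a point in the intersection (for compatible constraints),
    not necessarily the exact joint projection; it is only a sketch of the
    test-time correction step."""
    for _ in range(iters):
        for a, beta in zip(A, b):
            y = project_halfspace(y, a, beta)
        y = np.clip(y, lo, hi)       # box / domain constraint
    return y

A = np.array([[1.0, 1.0]])           # modeling constraint y1 + y2 <= 0.5
b = np.array([0.5])
y = feasible_point(np.array([2.0, 2.0]), A, b, lo=-1.0, hi=1.0)
print(np.all(A @ y <= b + 1e-6), np.all((y >= -1.0) & (y <= 1.0)))
```

Since the trained output already nearly satisfies both constraint types, such a correction is expected to move the point only slightly.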
4 Numerical Results
We demonstrate the proposed constraint method in two different settings.
In an initial experiment, we project the input data onto a constraint set.
Here, the result can be compared to the optimal solution of the convex optimization problem.
The purpose of this simple setup is to show that, despite the non-convexity of the parameterization, the proposed method does not interfere with the optimizability of the problem.
In a second experiment, consistent with our motivation to modify the output of generative models, we constrain the output of a variational autoencoder, and show samples drawn from this constrained model.
We used the MNIST dataset (Mnist) for both experiments, split into training, validation, and test samples.
We used PyTorch (Paszke2017) for our implementation, and all experiments were performed on a single Nvidia Titan X GPU. (Our implementation will be publicly available.)
4.1 Orthogonal Projection onto a Constraint Set
Using a simple toy problem, we demonstrate that the proposed algorithm can find good solutions to a constrained learning problem. For given linear inequalities specified in H-representation, we solve the following problem:
\[ \min_{y \in \mathbb{R}^d} \; \|y - x\|_2^2 \quad \text{s.t.} \quad Ay \le b, \]
where \(x\) is an MNIST image. Here, the problem is convex; therefore, the global optimum can be readily computed and compared to the performance on a held-out validation set. We impose a checkerboard constraint, where neighboring tiles are constrained to lie, on average, either below or above the median intensity of the image domain. The pixel intensity domain in our experiments is symmetric about zero; thus, the tiles' average intensity is constrained to be negative or positive, respectively. In this setting, we can expect that training an unconstrained network with a subsequent projection onto the constraint set at test time will yield good results. Here, let \(\Pi\) be the orthogonal projection onto the constraint set and let the mean-squared error be the loss. Since \(\Pi\) is non-expansive, i.e., Lipschitz continuous with Lipschitz constant 1, for the output \(\hat{y}\) of an unconstrained model,
\[ \|\Pi(\hat{y}) - x\| \le \|\Pi(\hat{y}) - \Pi(x)\| + \|\Pi(x) - x\| \le \|\hat{y} - x\| + \|\Pi(x) - x\|, \]
where, by definition, the term \(\|\Pi(x) - x\|\) is the optimal value of problem (4.1). The training algorithm fits \(\hat{y}\) to \(x\); therefore, projecting the unconstrained output onto the constraint set will yield an objective value close to the optimal value of the constrained optimization problem.
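This non-expansiveness bound can be sanity-checked numerically; a minimal sketch with a single hypothetical halfspace constraint, whose orthogonal projection has a closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
a, beta = np.array([1.0, 1.0]), 0.0   # constraint set {y : y1 + y2 <= 0}

def proj(y):
    """Orthogonal projection onto the halfspace {y : <a, y> <= beta}."""
    return y - (max(a @ y - beta, 0.0) / (a @ a)) * a

# Check ||proj(y_hat) - x|| <= ||y_hat - x|| + ||proj(x) - x|| on random pairs.
for _ in range(1000):
    x = rng.normal(size=2)            # "data" point
    y_hat = rng.normal(size=2)        # unconstrained model output
    lhs = np.linalg.norm(proj(y_hat) - x)
    rhs = np.linalg.norm(y_hat - x) + np.linalg.norm(proj(x) - x)
    assert lhs <= rhs + 1e-12
print("bound holds on 1000 random trials")
```

The same argument applies to any closed convex constraint set, since orthogonal projections onto convex sets are always non-expansive.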
We provide results obtained using the constraint parameterization algorithm and the projection of an unconstrained solution onto the constraint set. To have a comparable number of parameters for these models, we use a single fully-connected layer in both cases. For the unconstrained model, the layer maps directly to the output image; for the constrained model, it maps to the conical combination parameters of the rays that represent the constraint set in V-representation. Both models were optimized using the Adam optimizer (Kingma2015) with the same learning rate. Note that we are dealing with a mean-squared loss; thus, we expect the test-time projection method to work well (Section 3). Figure LABEL:fig:convergence_mnist_projection shows that the validation objective for both algorithms converges to the average optimum over the validation set. As expected for a more constrained optimization procedure, convergence with the constraint parameterization method is slower. Figure LABEL:fig:samples_mnist_projection shows a test set sample and the output of the constraint parameterization network.