1 Introduction
Deep neural networks (DNNs) are capable of learning highlycomplex relationships between input data and the expected output. This permits training and validation of large models in robotics and medicine (djeumou_l4dc2022; kushner2020conformance; ode6_drone_landing), enabling designers to comfortably achieve small approximation errors. But the caveat that comes with this flexibility, is lack of generalization when pushed outside of the training distribution. We refer to the experiments in conformal_predictions. One of the instances it covers corresponds to that of Newton’s first law. The neural network dynamics model of a car should predict that, given zero throttle and when at rest, the car should continue to remain at rest. The model trained on real trajectory data in autorally failed to conform to this simple property. A very similar situation happens in the case of the glucoseinsulin dynamics model for an artificial pancreas, a device for patients with type1 diabetes. This property has been studied in kushner2020conformance, where it was found that deep neural network model could easily produce predictions which can be fatal for the patient.
However these challenges are much less prevalent in models which are typically informed by different scientific disciplines. Examples of this include models based on mechanical properties of robotic systems (vehicle), aerodynamic properties of drag and lift (all_about_quads), physiological models of the human body (DallaManModel; sanjian_jim_model_based) and alike. The advantage of using models (rather than atomic constraints) is that they encompass of wider range of desirable properties quite naturally. In robotics it is common to find such highfidelity physicsengine based simulators (carla; pybullet; todorov2012mujoco). In medical applications, examples include artificial pancreas simulators (DallaManModel; sanjian_jim_model_based). Unfortunately in practice, such models can be of blackbox nature. Allowing only samples to be observed. Our goal is to use such models to inviscate a deep neural network into conformal behavior.
In this work, we propose a method which guarantees the satisfaction of natural constraints by constructing a wrapper for the DNN based on symbolic information. This is achieved through a novel neural gas based partitioning technique and estimation of model’s output ranges. Such a guarantee does not come for free, but shows up as slightly higher approximation error. This is due to conservative estimates involved when dealing with blackbox models. Our contributions can be listed as: 1) A novel memorybased method to constrain neural network dynamics models with guarantees. 2) A theoretical guarantee that our memorybased constraining method guarantees conformance with only bounded increase in approximation error. 3) Results on three case studies demonstrating that we outperform augmented Lagrangian methods for constraint satisfaction by a few orders of magnitude.
2 Related Work
Enforcing constraints on neural networks: Imposing constraints on deep neural networks has been studied from various perspectives (djeumou_l4dc2022; constraints_neurips; constraints2; constraints3; constraints4; constraints5; constraints6; constraints7; constraints8). These include constraints of symmetry and contact forces for dynamical systems in djeumou_l4dc2022, suitable constraints for specific Lagrangian or Hamiltonian neural networks in constraints_neurips, human pose constraints in constraints2, path norm constraints on resnets in constraints3
, partial differential equation (PDE) constraints for inverse design in
constraints4, FockerPlanck constraints for fusion in constraints5, fairness constraints in constraints6, language label constraints in constraints7, and segmentation constraints in constraints8. All of these methods rely on the augmented Lagrangian method to train constrained neural networks. Solving the dual problem, i.e. converging to a stationary point for the minmax optimization is challenging with neural networks and nonconvex constraints (constraints2). Further, the process is datahungry and generalizes poorly in outofdistribution data (conformal_predictions; constraints2; constraints3). Our focus in this work is to leverage the benefits of the augmented Lagrangian approach (its flexible loss function) but constrain the neural network by design, and with a guarantee, to remain within desirable output bounds computed using models that encode all desired constraints. In the process, we obtain several orders of magnitude reduction in constraint loss and learn with very few gradient steps.
Physics informed neural networks for dynamics models: Although our focus is on enforcing constraints, we also briefly discuss related ideas in physicsinformed neural networks (pinn; constraints2; lu2021deepxde; physics_prior; lagrangian_nn; hamiltonian_nn). Physicsinformed architectures for dynamical systems in particular have been explored via specific Neural ODE structures for a class of systems (ode1_hamiltonian; ode2; ode3_lagrangian_mechanics; ode4; ode5; ode6_drone_landing)
or via a broader Neural ODE structure for a class of vector fields
(djeumou_l4dc2022), all towards learning continuoustime dynamics for robotics applications. Our constraining framework can be applied around any such Neural ODE. But moreover, our constraints can include blackbox models and scale quickly to any state and action space unlike NeuralODEs which are restricted to systems with rigorous mathematical models (pinn). Further, to present a general solution, we make no assumption on the architecture and to extend to applications beyond dynamics models in robotics (such as medicine, computing systems, and operations research), we learn discretetime dynamics models in our experiments rather than continuoustime dynamics models.3 Problem Formulation
Consider a discrete time nonlinear dynamical system , where , and is the state of the system, and is the control input. Thus, is the possibly unknown discrete time nonlinear map that captures the system dynamics. We assume access to a dataset drawn from distribution , such that . Usually, the goal is to estimate with a function , where is potentially the parameters of a neural network. Typically, the goal of an algorithm which estimates is usually to reduce approximation error on the training dataset . In addition to this, sometimes it is desirable that the estimated model satisfies physicsinformed constraints (lagrangian_nn). Next, we define a few relevant concepts.
[Model Constraint] Assume a model , and a parameter . Then the model constraint is True iff . Where, . Here we assume to be Lipschitz continuous with constant . We state our problem next.
Problem Statement 3.1 (Constrained Neural Network)
Find a function , which minimizes the approximation error on dataset , while satisfying the constraints given by . That is find , .
4 Overall Approach
To restate, we want our estimated model to approximate our training data while respecting the constraint imposed by the model . We use the following intuition in our approach: if restricted to a small enough input region the output of the model can be underapproximated by a set . If we can ensure that the predictions of stay within this interval then we can bound the difference between and , as being proportional to the size of the inputregion , which improves with finer partitioning of the input space.
Thus, to summarize our approach, we first partition an input space into small enough input regions and for each subregion, we estimate an interval underapproximation for the values of which can satisfy . Next, we train our function approximator to respect these interval constraints in each such subregion. This is accomplished using a constraining operator on . In Section 6 we explain a method for computing these sound underapproximations of . Next, in Section 7, we explain the constraining operator and bound the approximation error incurred due to this operator.
5 Preliminaries
We define the idea of a neural gas (growing_neural_gas; incrementally_growing_neural_gas; neural_gas). From a given set of points embedded in a metric space, a growing neural gas algorithm has the ability to learn important topological relations in the form of a graph of prototypical points. It uses a simple Hebblike learning rule to construct this graph. [Neural Gas] Neural Gas , is composed of the following two components,

A set of the nodes of a network. Each node is called a memory in this paper.

A set of edges among pairs of nodes, which inform about the topological structure of the data. The edges are unweighted.
The edges in preserve the neighborhood relations among the data, and is useful in achieving a Voronoilike partitioning of the data manifold. The graphical structure of a neural gas makes it much more appealing to algorithmically resolve neighborhood relations. For a given node , let us denote as the set of neighbors of according to . For most practical purposes in a control setting, the spaces and are embedded in Euclidean spaces , and respectively, where . Where, is the dimension of control input. Let be the cardinality of : . Then, we can define the Voronoi polyhedron (voronoi), around a given point in the following fashion : [Voronoi Polyhedron] For a point , the Voronoi polyhedron can be defined using the Euclidean distance function as,
In practice, constructing the Voronoi polyhedron can be achieved in the following way. Given points which are neighbors and , it is possible to compute a line segment which connects them. Let us denote the perpendicular bisector of as the linear inequality . For any point which is in the same side of as the inequality holds. The reverse is true for the half space constraint . This gives us an algorithm to compute . Thus, given a set of nodes the Voronoi tessellation induces a splitting of the space into a set of disjoint sets . We drop the subscript for the rest of the paper. Our guarantees of constraint satisfaction is over the union of these subsets.
6 Approximating Model Constraints
Assume a (relatively small) subset , and denote the th output of the model at input . We wish to compute the interval . Assume that , , and . Where is the interval bound on values of in . Then . Now, in practice it is hard to precisely compute the interval for blackbox models . Meaning that we would resort to estimating the min and max of using sampling based techniques. There exists a stochastic optimization algorithm to estimate the true maxima of a Lipschitz function on a bounded domain (Mladineo1991StochasticMO). Here we follow a simple sampling based rendition to estimate . We denote as the list of numbers from . Next, we note the following lemma.
Let be an Lipschitz continuous function on a closed and compact set , and and be its estimated lower and upper bounds. Then, , .
Proof :
The proof can be found in the Appendix.
With , let and be the estimated minima and maxima of . Thus, if , , then . Now, across all dimensions , let then, . Assume , to be the largest partition induced by the neural gas , then setting ensures satisfaction of model constraint in Definition 3. This bound can be made much tighter in practice if the model is known in an analytical form. Allowing tight computations of its limits possible using techniques like interval arithmetic and Taylor models (rino)
So, given a set and using neural gas , we have a partitioning of . Let us denote this set of partitions of as . Also, for each subset we can compute range estimate , which respects the constraint . In the following discussions, let us refer to this constraint map as . Where, is a dimensional interval in . For a subset in , returns the appropriate output range.
7 Function Approximation Error
In this section we define a constraining operator on a function, and analyze the error encountered in the process. The goal of a constraining operator is to threshold the values of the function to be within certain desirable limits. Assume an interval , and value , then we define a projection in the following fashion along each dimension , ; ; and otherwise.
[Constraining Operator] A constraining operator parameterized by the partition set  , modifies functions to respect the corresponding interval constraints. For a function , it can be defined in the following fashion,
Hence, the constraining operator ensures that our estimated model which attempts to approximate the true function , also respects the constraint . Even though we assume that , our approximation error in building the map can affect the model approximation error . This however as we show only leads to a bounded cost in approximation error. Which can be reduced by adopting finer partitions in , that is increasing the nodes in the neural gas .
[Approximation Error] Assume real and continuous functions , , if , then , where is some constant.
Proof : Assume a generic input , and for some . Additionally, let be the interval constraint imposed by on using the map . Since the sets and are embedded in the real spaces and respectively, we can analyze the error incurred along each dimension. Also, we drop the subscript and denote the constraining operator as simply since the partition remains fixed for the remainder of the results.
refers to the element of .
Let us pick a dimension , we define the lower correction set . Intuitively, this is the set of points in , which need a correction due to underflow. Let us denote the difference function as ,
(1) 
We can similarly define the upper correction set and the difference function as for and for anywhere in .
Now the following is true, for : . This is simply due to the bound respected by . Due to very similar reasons the following is true as well : . Next, we wish to bound the following quantity: . The difference between the constrained function and ground truth. Then,
The first equality is simply because can be expressed as a union of the following disjoint sets .Therefore, we can write the following,
Note, the R.H.S of the above equation is negative. Then using the bound on the upper limit of absolute values, we get the following,
(2)  
Thus, we can bound in the following fashion, for :
Now, setting , where is the global Lipschitz constant of , we can write,
8 Training a Constrained Neural Network Dynamics Model
We detail our Algorithm in this section. The inputs to the algorithm are a state transitions dataset (containing (state, control) and (next state) pairs  ), model , and an augmented dataset . consists of only inputs to the model , and is an unlabelled dataset sampled throughout, and with particular emphasis on relevant regions in .
Algorithm 1 First (Lines ), we use the unsupervised neural gas algorithm (neural_gas; growing_neural_gas) to obtain the neural gas graph . We utilize these memories and edges, to create partitions of the input space as voronoi cells with memories at their center. In each voronoi cell, we sample points, propagate them through the model and obtain the upper and lower limits along each dimension of the output space (lines 46). This creates the constraint map . Using this, we can find the lower and upper bounds of each point in and (lines 711). First, we locate the corresponding voronoi cell, and then use the bounds computed in Line . Finally, we can train the constrained neural network (denoted ) as follows,
(3) 
where is a parameterized function which maps from the input space to output space, and
is the sigmoid function. Equation
3 is but one realization of the constraining operator discussed in Section 7. Our loss function is the augmented Lagrangian loss (constraints4) itself and is given below where .(4) 
We can then train the neural network by backpropagating through the constrained neural network (lines 1216). We enhance gradient feedback under constrained outputs with an exponential schedule on the lower and upper bounds (line 13). We also intermittently update the slack variables through a schedule or as a gradient ascent step on the value of the constraint (lines 1719).
9 Experiments
Overview and baseline: We perform simulated experiments on three case studies. We create a dataset from highfidelity simulators that can closely represent reality in each case study. These are depicted in Figure 2. Our baseline is the augmented Lagrangian method which utilizes the loss function in equation 4 but uses a standard parameterization rather than the constrained model given in equation 3. The augmented Lagrangian method lacks guarantees on constraint satisfaction with deep neural networks and nonconvex constraints. We observe that augmented Lagrangian in fact fails to achieve conformance on indistribution transitions in the test set.
Case Study 1: CARLA – Conformance of a vehicle model to unicycle dynamics with emphasis on atrest condition. In the first case study, we collect trajectories of x position, y position, heading, velocity, yaw rate from the CARLA simulator (carla; codit) on a variety of terrains and environments (See Figure 2(a)) for our dataset. With previous work conformal_predictions having demonstrated the difficulty of learning a dynamics model that predicts no change in state when a vehicle is at rest, we uniformly sample atrest data for the augmenting dataset . Unicycle dynamics (vehicle; checkpointing_unicycle) are chosen as the model . This implicitly encodes the atrest condition. We have 15,000 training points, 2000 test points in each of and . We select 500, 1000 and 2500 memories to observe the performance with increasing partitions in the training distribution.
We observe, in Figure 3, that the approximation loss for constrained methods is either similar to or slightly higher than the Vanilla and augmented Lagrangian. This is expected in light of Theorem 7. The average constrained loss and max constrained loss on the augmenting dataset are significantly improved, by 4 and 3 orders of magnitude respectively for our method in comparison to Vanilla and augmented Lagrangian. Moreover, with increasing memories, the constraint loss, both average and maximum on , improve consistently. We also notice that constrained training is highly dataefficient, learning in less than 300 gradient steps unlike the 12000 required by the Augmented Lagrangian. In Figure 4, we analyze each of the models’ predictions starting from the origin at rest, and given zero control inputs for 20 timesteps. We clearly observe that both Vanilla and augmented Lagrangian models predict large drift to the topleft Constrained models, on the other hand, accurately predict little to no movement.
Case Study 2: Artificial Pancreas (AP) – Conformance of AP models to ARMAX model that encodes glucoseinsulin constraints. We collect traces of glucose, insulin and meal quantities for a patient with the UVA/Padova simulator (See Figure 2(c)) (DallaManModel) to create the dataset. The states consist of a 30 elements– 10 historical values of glucose, insulin and meals respectively. The model is expected to predict the glucose 5 steps in the future. Each timestep spans minutes. The intial value of glucose and carbohydrates are randomly chosen in respectively. We also uniformly sample the state space with emphasis on low glucose initial values in and low carbohydrates to create the dataset. We have 18,750 training points, 2500 test points in each of and Moreover, for our model , we train a constrained ARMAX model such that any increase in insulin, will reduce glucose. This is accomplished by constraining insulin weights to be negative in the ARMAX model. In Figure 5, we observe that approximation loss on is similar across all methods with a slight advantage in the favour of our constrained training. Yet, constrained neural networks outperform vanilla and Lagrangian by an order of magnitude in conforming to the ARMAX model on the and datasets.
Method  Max.  Avg. 

violation  violation  
Vanilla  3.8356  1.315 
Aug. Lagrangian  3.8072  1.245 
Constrained (1k)  0.9157  0.0092 
Constrained (1.5k)  0.2047  0.0027 
Constrained (2k)  0.1775  0.0026 
The deltamonotonicity property of such models in (kushner2020conformance), refers to the following  everything else remaining fixed, increasing insulin should lead to reduction in blood glucose prediction. In order to test this property we increase the insulin value in each input trace of test set by a random amount in and observe the prediction. We report this in Table 1. We observe that vanilla and Lagrangian models violate the constraint by a large margin, whereas constrained models increase the prediction by nearly zero amount.
Case Study 3: PyBullet Drones – Conformance of drone models to quadrotor dynamics with emphasis on hover. We collect circular flight trajectories of drones (See Figure 2(b)) with aerodynamics effects (drag, downwash, ground effect) included in the Pybullet Drones environment (gympybulletdrones) to create the dataset. The states consist of 20 items – x, y, z positions and velocities; roll, pitch, yaw and their rates; quaternions, and rpms of each of the four motors. The controls consist of 4 rpm commands. Our model is given by the quadrotor dynamics (all_about_quads; quads). For emphasis on hover, we uniformly sample states across the state distribution and uniformly sample controls for balancing gravity (and hence hovering inplace) to create the dataset. We have 15,000 training points, 2000 test points in each of and . We vary the number of memories from 800, 1000, to 2000. Similar to CARLA, we see (in Figure 5) that approximation loss on is similar across all methods but there is upto a 6 orderofmagnitude decrease in the average and maximum constraint loss on with our constrained training algorithm. We also observe a rather large increase in performance from 1000 to 2000 memories. We also plot the average constraint loss on for all case studies in the Appendix.
10 Conclusion
We demonstrate how DNN training can be constrained using symbolic information which enforces adherence to natural laws. We report experiments on three case studies where our method achieves many fold reductions in constraint loss when compared to the augmented Lagrangian. In future work, we plan to create safety constrained policies.
References
Appendix
10.1 Proof of Lemma 4
Let be an Lipschitz continuous function on a closed and compact set , and and be its estimated lower and upper bounds. Then, , . Proof : Note that is a real and continuous function on the connected set in the metric space . Since, there exists points and which map to and respectively, then by Theorem Rudin, for any there exists such that . Then we can write the following : . This completes the proof.
10.2 Additional Plots
10.3 Additional Details of Experiments
For the CARLA vehicle and PyBullet Drones models, we use a two layer MLP with 1024 neurons in each layer. In the UVA/Padova Artificial Pancreas case study, we use a three layer neural network with 20 neurons in each layer. We utilize the Adam optimizer in all case studies and choose a learning rate with grid search in
. We also utilize training batch sizes of 64 for both and datasets. Further, for the CARLA and Drones case studies, we set to 0. For Artificial Pancrease, we used .