1. Introduction
Endtoend learning is attractive for the realization of autonomous cyberphysical systems, thanks to the appeal of control systems based on a pure datadriven architecture. By taking advantage of the current advances in the field of reinforcement learning, several works in the literature showed how a well trained deep NN that is capable of controlling cyberphysical systems to achieve certain tasks
(bojarski2016end, ). Nevertheless, the current stateoftheart practices of designing these deep NNbased controllers is based on heuristics and handpicked hyperparameters (e.g., number of layers, number of neurons per layer, training parameters, training algorithm) without an underlying theory that guides their design. In this paper, we focus on the fundamental question of how to systematically choose the NN architecture (number of layers and number of neurons per layer) such that we guarantee the correctness of the chosen NN architecture.
In this paper, we will confine our attention to the statefeedback Model Predictive Control (MPC) of a Linear TimeInvariant (LTI) system with quadratic cost and under input and output constraints (see Section 2 for the specific MPC formulation). Importantly, this MPC control problem is known to have a solution that is Continuous and PiecewiseLinear (CPWL) ^{1}^{1}1Although these functions are in fact continuous, piecewiseaffine, the literature on the subject refer to them as piecewise “linear” functions, and hence we will conform to that standard. in the current system state (BemporadExplicitLinearQuadratic2002, ). This property renders optimal MPC controllers compatible with a ReLU NN implementation, as any ReLU NN defines a CPWL function of its inputs. For this reason, several recent papers focus on how to approximate an optimal MPC controller using a ReLU NN (ChenApproximatingExplicitModel2018, ).
However, unlike other work on the subject, AReN seeks to use knowledge of the underlying control problem to guide the design of datatrained NN controllers. One of the outstanding problems with datadriven approaches is that the architecture for the NN is chosen either according to heuristics or else via a computationally expensive iteration scheme that involves adapting the architecture iteratively and retraining the NN. Besides being computationally taxing, neither of these provide any assurances that the resultant architecture is sufficient to adequately control the underlying system, either in terms of performance or stability. In the context of controlling an LTI system, then, AReN partially addresses these shortcomings: AReN is a computationally pragmatic algorithm that returns a ReLU NN architecture that is at least sufficient to implement the optimal MPC controller described before. That is given an LTI system with quadratic cost and input/output constraints, AReN determines a ReLU NN architecture – both its structure and its size – with the guarantee that there exists an assignment of the weights such that the resultant NN exactly implements the optimal MPC controller. This architecture can then be trained on data to obtain the final controller, only now with the assurance that the training algorithm can choose the optimal MPC controller among all of the possible NN weight assignments available to it.
The algorithm we propose depends on two observations:

First, that any CPWL function may be translated into a ReLU NN with an architecture determined wholly by the number of linear regions in the function; this comes from a careful interpretation of the recent results in (AroraUnderstandingDeepNeural2016, )
, which are in turn based on the hinginghyperplane characterization of CPWL functions in
(ShuningWangGeneralizationHingingHyperplanes2005, ) and the lattice characterization of CPWL functions in (KahlertGeneralizedCanonicalPiecewiselinear1990, ). 
Second, that there is a computationally efficient way to overapproximate the number of linear regions in the optimal MPC controller without solving for the optimal controller explicitly. This involves converting the stateregion specification equation for the optimal controller into a single, stateindependent collection of linearinequality feasibility problems – at the expense of overcounting the number of affine regions that might be present in the optimal MPC controller. This requires an algorithmic solution rather than a closed form one, but the resultant algorithm is computationally efficient enough to treat much larger problems than are possible when the explicit optimal controller is sought.
Together these observations almost completely specify an algorithm that provides the architectural guidance we claim.
Related work: The idea of training neural networks to mimic the behavior of model predictive controllers can be traced back to the late 1990s where neural networks trained to imitate MPC controllers were used to navigate autonomous robots in the presence of obstacles (see for example (ortega1996mobile, ), and the references within) and to stabilize highly nonlinear systems (cavagnari1999neural, ). With the recent advances in both the fields of NN and MPC, several recent works have explored the idea of imitating the behavior of MPC controllers (hertneck2018learning, ; aakesson2006neural, ; pereira2018mpc, ; amos2018differentiable, ; claviere2019trajectory, ). The focus of all this work was to blindly mimic a base MPC controller without exploiting the internal structure of the MPC controller to design the NN structure systematically. The closest to our work are the results reported in (karg2018efficient, ; ChenApproximatingExplicitModel2018, ). In this line of work, the authors were motivated by the fact that both explicit state MPC and ReLU NNs are CPWL functions, and they studied how to compare the performance of trained NN and explicit state MPC controllers. Different than the results reported in (karg2018efficient, ; ChenApproximatingExplicitModel2018, ), we focus, in this paper, on how to bridge the insights of explicit MPC to provide a systematic way to design a ReLU NN architecture with correctness guarantees.
Another related line of work is the problem of Automatic Machine Learning (AutoML) and in particular the problem of hyperparameter (number of layers, number of neurons per layer, and learning algorithm parameters) optimization and tuning in deep NN, in general, and in deep reinforcement learning, in particular (see for example
(pedregosa2016hyperparameter, ; bergstra2012random, ; paul2019fast, ; baker2016designing, ; quanming2018taking, ) and the references within). In this line of work, an iterative and exhaustive search through a manually specified subset of the hyperparameter space is performed. Such a search procedure is typically followed by the evaluation of some performance metric that is used to select the best hyperparameters. Unlike the results reported in this line of work, AReN does not iterate over several designs to choose one. Instead, AReN directly generates an NN architecture that is guaranteed to control the underlying physical system adequately.2. Problem Formulation
2.1. Dynamical Model and Neural Network Controller
We consider a discretetime Linear, TimeInvariant (LTI) dynamical system of the form:
(1) 
where
is the state vector at time
, is the control vector, and is the output vector. The matrices , , and represents the system dynamics and the output map and have appropriate dimensions. Furthermore, we consider controlling (1) with a state feedback neural network controller :(2) 
while fulfilling the constraints:
(3) 
at all time instances where and are constant vectors of appropriate dimension with and (where is taken elementwise).
In particular, we consider a (layer) Rectified Linear Unit Neural Network (ReLU NN) that is specified by composing layer functions (or just layers). A layer with inputs and outputs is specified by a realvalued matrix of weights, , and a realvalued matrix of biases, , as follows:
(4) 
where the function is taken elementwise, and for brevity. Thus, a layer ReLU NN function as above is specified by layer functions whose input and output dimensions are composable: that is they satisfy . Specifically:
(5) 
When we wish to make the dependence on parameters explicit, we will index a ReLU function by a list of matrices ^{2}^{2}2That is is not the concatenation of the into a single large matrix, so it preserves information about the sizes of the constituent .. Also, it is common to allow the final layer function to omit the function altogether, and we will be explicit about this when it is the case.
Note that specifying the number of layers and the dimensions of the associated matrices specifies the architecture of the ReLU NN. Therefore, we will use:
(6) 
to denote the architecture of the ReLU NN . Note that our definition is general enough since it allows the layers to be of different sizes, as long as for .
2.2. Neural Network Architecture Specification
We are interested in finding an architecture for the such that it is guaranteed to have enough parameters to exactly mimic the inputoutput behavior of some base controller . Due to the popularity of using model predictive control schemes as a base controller (ortega1996mobile, ; cavagnari1999neural, ; hertneck2018learning, ; aakesson2006neural, ; pereira2018mpc, ; amos2018differentiable, ; claviere2019trajectory, ), we consider finitehorizon rollout Model Predictive Control (MPC) scheme as the base controller that the ReLU NN is trying to mimic its behavior.
Finitehorizon rollout MPC maps the current state, , to the first control input obtained from the solution to an optimal control problem over a finite time horizon with the first control actions chosen openloop and the remaining control actions determined by an apriorispecified constantgain state feedback. Since this control scheme involves solving an optimal control problem at each time (with initial state ), we will use the notation to denote the “predicted state” at time from the initial state supplied to the MPC controller (the same notation as in (BemporadExplicitLinearQuadratic2002, )). In particular, for fixed matrices , and , we define the cost function:
(7) 
as a function of an control variables matrix:
(8) 
Then the MPC control law is specified as:
(9) 
where
(10) 
subject to the constraints:
(11)  
(12)  
(13)  
(14)  
(15)  
(16) 
The matrix is typically chosen to reflect the quadratic costtogo (using matrices and ) resulting from the feedback control applied from timestep onwards (i.e. is the solution to the appropriate algebraic Ricatti equation). We will henceforth consider only this scenario, since this is the most common one; furthermore, since it doesn’t benefit from , we will henceforth assume that . For future reference, this problem then has:
(17)  decision variables; and  
(18)  inequality constraints. 
2.3. Main Problem
We are now in a position to state the problem that we will consider in this paper.
Problem 1 ().
Given system matrices , and (as in (1)); performance matrices (cost function matrices) , (as in (7)); constantgain feedback matrix (as in (16)); and integer horizon , choose a ReLU NN architecture , such that there exists a realvalue assignment for the elements in that renders:
(19) 
for all in some compact subset of .
3. Framework
As we have noted before, it is known that is a CPWL function (BemporadExplicitLinearQuadratic2002, ). However, CPWL functions are usually specified using both linear functions and regions as in this example:
(20) 
This specification is difficult to implement using ReLU NNs, though, because the structure of a ReLU neural networks intertwines the implementation of the linear functions and their active regions.
Fortunately, there are several representations of CPWL functions that avoid the explicit specification of regions by “encoding” them into the composition of nonlinear functions with linear ones. Recent work (AroraUnderstandingDeepNeural2016, , Theorem 2.1) considered one such representation based on hinging hyperplanes (ShuningWangGeneralizationHingingHyperplanes2005, ), and showed that this representation can be translated easily into a ReLU neural network implementation, whenever the CPWL function is known explicitly.
Given the computational cost of computing explicitly, the chief difficulty in Problem 1 thus lies in inferring the neural network architecture without access to the explicit MPC controller . Unfortunately, the hinging hyperplane representation employed in (AroraUnderstandingDeepNeural2016, , Theorem 2.1) cannot be easily used in this circumstance (for more about why this particular implementation is unsuitable when is not explicitly known, see also Section 6.)
However, every CPWL function also has a (twolevel) lattice representation (TarelaRegionConfigurationsRealizability1999, ) ^{3}^{3}3The lattice representation is in fact an intermediary representation used to construct the hinging hyperplane representation; see (ShuningWangGeneralizationHingingHyperplanes2005, ).: unlike the particular hinging hyperplane representation mentioned above, we will show that the lattice representation can be used to solve Problem 1 without explicitly solving for . In particular, the lattice representation of a CPWL function has two properties that facilitate this:

It has a structure that is amenable to implementation with a ReLU NN (by mechanisms similar to those used in (AroraUnderstandingDeepNeural2016, )); and

It is described purely in terms of the local linear functions and the number of uniqueorder regions (exact definitions of these terms are given in the next subsection) in the CPWL function, both of which we can efficiently overapproximate for .
Thus, a description of the lattice representation largely explains how to solve Problem 1; we follow this discussion by connecting it to a toplevel description of our algorithm.
3.1. The TwoLevel Lattice Representation of a CPWL Function
To understand the lattice representation of a CPWL, we first need the following definition. Throughout this subsection we will assume that is a CPWL function. All the subsequent discussion can be generalized directly to the case when .
Definition 0 (Local Linear Function).
Let be a CPWL function. Then is a local linear function of if there exists an open set such that for all :
(21) 
The set of all local linear functions will be denoted .
The CPWL function in (20) consists of the two local linear functions and , for example.
The lattice representation is based on the following idea: Consider the set of distinct local linear functions of namely along with the natural projections of this set and . It follows from the fact that the is continuous PWL function that at least two local linear functions intersect for each on the boundary between linear regions. Therefore, the ordering of the set by on the projection induces at least two different orderings of the projection (see Figure 1 for an example). It is a profound observation nevertheless, because it means that the relative ordering of the values can be used to decide which of the local linear function is “active” at a particular . This is illustrated in Figure 1; see also a similar figure in (TarelaRegionConfigurationsRealizability1999, , Figure 1).
This also suggests that we make the following definition, which allows us to talk about regions in the domain of over which the order of the local linear functions is the same.
Definition 0 (UniqueOrder Region (rephrasing of (TarelaRegionConfigurationsRealizability1999, , Definition 2.3))).
Let be a CPWL function with distinct local linear functions ; that is for all , for some . Then a uniqueorder region of is a region from the hyperplane arrangement in defined by those hyperplanes that are nonempty. In particular, for all in a uniqueorder region , for some permutation of of .
We are now in a position to describe the twolevel lattice representation of a CPWL function.
Theorem 3 (TwoLevel Lattice Forms From UniqueOrder Regions (TarelaRegionConfigurationsRealizability1999, , Theorem 4.1)).
Let be as in Definition 2 with the number of uniqueorder regions of in . Then there exists at most subsets , such that:
(22) 
3.2. Structure of the Main Algorithm
Having described in detail the lattice representation of a CPWL, we return to the specific claims we made about how it structures our solution to Problem 1.
We first note that the form (22) is well suited to implementation with a ReLU neural network: it is comprised of linear functions and / functions, so many observations from (AroraUnderstandingDeepNeural2016, ) apply to (22) as well. In particular, the twoargument maximum function can be implemented directly with a ReLU using the wellknown identity
(23) 
and the following ReLU implementations of its constituent expressions (AroraUnderstandingDeepNeural2016, ):
(24)  
(25) 
Thus can be implemented by a NN where:
(26) 
This implementation is illustrated in Figure 2. Using the variant of the identity (23), namely:
(27) 
leads to a similar ReLU implementation of the twoargument minimum function. In the previous notation, the architectures of these and networks are the same, i.e. = =
with no activation function on the last layer.
This implementation further suggests a natural way to implement the multielement (resp. ) operation with a ReLU network (AroraUnderstandingDeepNeural2016, ). Such an operation can be implemented by deploying the twoelement (resp. ) networks in a “divideandconquer” fashion: the elements of the set to be maximized (resp. minimized) are fed pairwise into a layer of twoelement (resp. ) networks; the output of that first (resp. ) layer is fed pairwise into a subsequent layer of twoelement (resp. ) networks, and so on and so forth until there is only one output. Note that this approach can also be used on sets whose cardinality is not a power of two while maintaining a ReLU structure of the neural network : the same value can be directed to multiple inputs as necessary. This structure is illustrated in Figure 3 for a network that computes the maximum of five realvalued inputs. Following this example, an input (or ) network (resp. ) is represented by a parameter list (resp. ) which has architecture:
(28) 
where means multiply every element in by ; nested lists are “flattened” as appropriate; and there is no activation function on the final layer.
Now, given these multielement / networks, the remaining structure for a ReLU network implementation of the lattice form (22) is clear: we need a neural network architecture capable of (i) implementing ’s local linear functions and (ii) handling the selection of the subsets . The implementation of the local linear functions is straightforward using a fully connected hidden layer. The selection can be handled by routing – and replicating, as needed – the output of those linear functions to a network. Since we do not know the exact size of the subsets in (22), and hence the number of input ports for each network, we must use networks with as many pairwise input ports as there are local linear functions. Then, for subsets of size less than , the architecture replicates some local functions multiple times to different input ports of the same network to achieve the correct output. As discussed before, such replication does not affect the correctness of the architecture. Moreover, there will be one such network for each uniqueorder region for a total of . This replicating and routing of signals can be accomplished by an auxiliary fully connected linear layer with inputs and outputs. Since the purpose of this layer is to allow the weights to only select subsets of the local linear functions, this layer should have the property that all of the weights are either zero or one, and each output of the layer should select exactly one input. That is the weight matrix for this layer should have a exactly one in each row with all of the other weights set to . For a realvalued CPWL function , this overall architecture is depicted in Figure 4. The selection and routing layer is depicted in red. The notation reflects the routing and (one possible) replication of values of the local linear functions, and it is defined as follows:
(29) 
(That is contains one copy of each of , and as many additional copies of as necessary to have total elements. In particular, .)
Finally, we note that it is straightforward to extend this architecture to vectorvalued functions like . The structure of (Section 2.1) means that a scalar pairwise (or ) network can trivially compute the elementwise minimum between two input vectors by simply allowing more inputs and applying the weights from Figure 2 in an elementwise (diagonal) fashion. The result is an architecture that looks exactly like the one in Figure 4, only with the number of outputs multiplying the size of most signals.
The structure of the above described ReLU implementation is general enough to implement any CPWL function with local linear functions and unique order regions. We state this as a theorem below.
Theorem 4 ().
Let by CPWL with distinct local linear functions and unique order regions. Then there is a parameter list with:
(30) 
such that there exist an assignments for that renders:
where means multiply every element in by , and where nested lists are “flattened” as needed. The final layers of the layer and the layer lack activation functions.
Proof.
The proof is constructive: the discussion above explains the construction, which is based on (TarelaRegionConfigurationsRealizability1999, , Theorem 4.1). ∎
Corollary 0 ().
Any CPWL controller (such as ) can be implemented by a ReLU network with architecture as described in Theorem 4, where is the number local linear functions of and is the number of uniqueorder regions of .
Note that in many cases it is hard to exactly know the parameters and exactly. The next result show that our correctness claims in Theorem 2 can be extended when an upper bound and is used to design the neural network architecture as explained in the next result.
Theorem 6 ().
Proof.
In order to implement the same function with a larger network, the extra linearlayer neurons can simply duplicate calculations carried out by neurons in the smaller network. For example, the extra neurons in the the first linear layer can duplicate the calculation of , and the extra neurons in the second linear layer can duplicate the calculation of the subset of . This will not change the output of the and layers. ∎
Note: when “embedding” a smaller network, , into a larger one, , it is incorrect to set the extra parameters in to zero, as this could affect the output of the and networks!
Thus, to use this framework to obtain an architecture that is capable of implementing , one needs to simply upperbound the number of local linear functions in (ultimately without solving the actual MPC problem) and upperbound the number of unique order regions in . This is precisely the AReN algorithm, as specified in Algorithm 1. The constituent functions EstimateRegionCount and EstimateUniqueOrder are described in detail in the subsequent sections Section 4 and Section 5, respectively. The implementation of the function InferArchitecture follows directly from the (constructive) discussion in this section.
4. Approximating the number of linear regions in the MPC controller
In this section, we will discuss our implementation of the function EstimateRegionCount from Algorithm 1. A natural means to approximate to the number of local linear functions of is to approximate the number of maximal linear regions in .
Definition 0 (Linear Region of ).
A linear region of is a subset of over which for some linear (affine) . A maximal linear region is a linear region that is strictly contained in no other linear regions. Two linear regions are said to be distinct if they correspond to different linear functions, .
Thus, the maximal linear regions of are in onetoone correspondence with the local linear functions in (Definition 1), so an upper bound on the number of maximal linear regions in is an upper bound on its number of local linear function, which in turn will provide an overapproximation of that can be used to generate a NN architecture.
To upper bound the number of maximal linear regions effectively, we need to consider in detail some specifics about how the piecewise linear property arises in the solution for . Ultimately, is piecewise linear because we have posed a problem for which (i) the gradient of the Lagrangian (34) is linear in both the Lagrange multipliers and the decision variable; and (ii) the dependence on the initial state is linear. Linearity is important in both (i) and (ii) because we are really not solving one optimization problem but a family of them: one for each initial state . Thus, the linearity of the Lagrangian together with the linearity of the inequality constraints in leads to an equation (a necessary optimality condition) that is linear in both the Lagrange multipliers and the initial state : hence the piecewiselinear controller .
Moreover, the distinct linear regions of – i.e. those with distinct linear functions – arise out of a particular aspect of the aforementioned linear equations. In particular, the Lagrange multipliers, , and the initial state, , appear together in a linear equation that has different solutions – and hence creates different linear regions for – based on which of the inequality constraints are active (at a particular optimizer) (BemporadExplicitLinearQuadratic2002, , Theorem 2). Since the linear regions obtained in this way partition the domain of (see also Proposition 3 below), this suggests that we can overapproximate the number of linear regions in by counting all of the possible constraints that can be active at the same time. Indeed, this is more or less how EstimateRegionCount
arrives at an estimate for
, although we do not simply overapproximate with .4.1. The Optimal MPC Controller
As preparation for the rest of the section, we begin by summarizing some further details regarding the solution of from (BemporadExplicitLinearQuadratic2002, ). In particular, the optimization problem specified by (7)(16) can be simplified by directly substituting the dynamics constraint (14) to get:
(32)  
with appropriately defined matrices , , , and of dimensions , , , and , respectively (where and are defined in (17) and (18), respectively). Then, completing the square by means of the change of variables provides the following, simplified quadratic program (BemporadExplicitLinearQuadratic2002, ):
(33) 
Note that this change of variables is only valid when is invertible, but this is assumed in (BemporadExplicitLinearQuadratic2002, ), and it will be assumed throughout.
A solution to the optimization problem in (33) can be easily formulated using the KKT (necessary) optimality conditions (LuenbergerOptimizationVectorSpace1997, ). This results in the following system of equations:
(34)  
(35)  
(36)  
(37) 
That is for a (local) minimizer to the optimization problem (33), there exists nonnegative Lagrange multipliers that solve the above system. Moreover, solving (34) for the minimizer, , demonstrates that such a minimizer has a particular structure. Indeed, under the assumption that is invertible, we may solve (34) for and substituted it into (37) to obtain:
(38) 
Now given the relevance of active constraints in (34)  (37) to(BemporadExplicitLinearQuadratic2002, , Theorem 2), we introduce the following notation.
Definition 0 (Selector Matrix).
Let , and define the auxiliary set (and associated vector):
Then the selector matrix, is defined as:
(39) 
where is the column of the identity matrix.
The selector matrix can thus be used to describe equality in (38) for the active constraints by removing inequalities associated with the inactive constraints. In particular, given a set of active constraints specified by a subset , the Lagrange multipliers for the active constraints, , will satisfy the following:
(40) 
In particular, (40) is a linear equation in and that ultimately specifies the region in space over which a single affine function characterizes . Indeed, back substituting solutions of (40) into (37) and specifies a set of linear inequalities in . These equivalently define an intersection of halfspaces in that characterize a (convex) linear region of (BemporadExplicitLinearQuadratic2002, , Theorem 2). Moreover, since every possible solution to the optimization problem admits some set of active constraints, these convex linear regions partition the state space.
For our purposes, (40) is the relevant consequence of this discussion, since it most readily suggests how to overapproximate the number of linear regions of . We describe how to do this in the next subsection.
4.2. Overapproximating the Number of Maximal Linear Regions
From the previous discussion, the problem of finding all possible sets of active constraints for (34)  (37) is a significant amount of the work in solving for the optimal controller . However, we are content not solving for exactly: thus, we only need to simplify (40) in such a way that we obtain a new equation with all of the same solutions plus some spurious ones (keep in mind that (40) is an equation in , too).
Before we begin counting regions, we need to state the following proposition, which is trivial given the observations in (BemporadExplicitLinearQuadratic2002, ).
Proposition 0 ().
Let be a maximal linear region for controller . Then there exists a finite collection of sets with the following property:

for every , there exists an and Lagrange multipliers such that:
(41)
In particular, any maximal linear region of can be partitioned into convex linear regions.
Proof.
Since we are considering a maximal linear region of , the quadratic program (33) is feasible for every by definition. Consequently, there is for every such , a unique solution ^{4}^{4}4This is because is positive definite (BemporadExplicitLinearQuadratic2002, , pp. 9)., and so by necessity, there is some set of constraints that is active at . Moreover, by the KKT necessary conditions, there exists Lagrange multipliers that satisfy (41). This proves the existential assertions for related to (41).
However, we have implicitly defined a function that associates to each a subset in :
(42) 
We define to be the range of act, so that equivalence modulo act partitions into disjoint regions. Finally, by (BemporadExplicitLinearQuadratic2002, , Theorem 2) and the discussion about degeneracy on (BemporadExplicitLinearQuadratic2002, , pp. 9), each of these regions is necessarily convex. ∎
Proposition 3 gives us a hint about how to overapproximate the number of maximal linear regions: in particular, we will simplify equation (41) in such a way that we still find solutions for each at the expense of including solutions for . This gives us our first main counting theorem:
Theorem 4 ().
Let such that for every there exists a vector such that:
(43) 
has a solution , . Then upper bounds the number of maximal linear regions for .
Proof.
If we show that for every maximal linear region , the , then the conclusion will follow.
Theorem 4 is significant because Farkas’ lemma (LuenbergerOptimizationVectorSpace1997, ) tells us how to describe the solutions of (43) using a linear inequality feasibility problem. In particular, (43) has a nontrivial solution if and only if the problem is feasible (and it can’t have only the trivial solution for nontrivial ). This reasoning is included in the proof of the subsequent Theorem, which connects the bound from Theorem 4 to the number of feasible “subproblems” defined by . First, we introduce the following definition.
Definition 0 (Nontrivially Feasible).
Let be a matrix and let . Then is a nontrivially feasible subset of if there exists a , such that . Such a set is maximal if adding any other row makes it infeasible.
Now we state the main theorem in this section.
Theorem 6 ().
Let be the set of maximal nontrivially feasible subsets of (see Definition 5). Then the number of maximal linear regions in is bounded above by:
(44) 
Proof.
This follows from Farkas’ lemma and Theorem 4.
In particular, let and . Now, by Farkas’ lemma,
(45) 
In particular, can have a nonnegative solution if and only if is a nontrivially feasible subset of .
Thus, by Theorem 4, we conclude that implies is a nontrivially feasible subset of , and hence, that the number of maximal linear regions is bounded by the number of nontrivially feasible subsets of . The conclusion of the theorem thus follows because every nontrivially feasible subset is a subset of some maximally nontrivially feasible subset. ∎
4.3. Implementing EstimateRegionCount
To find maximal nontrivially feasible subsets of , we start by introducing one Boolean variable for each column of the matrix . The first nontrivially feasible subset can then be found by solving the following problem:
(46)  
(47) 
where denotes the th column of the matrix . Such optimization problems can be solved efficiently using Satisfiability Modulo Convex programming (SMC) solvers (shoukry2018smc, ; shoukry2017smc, ). SMC solvers first use a pseudoBoolean Satisfiability (SAT) solver to find a valuation of the Boolean variables that maximizes the objective function (46
); this is then followed by a linear programming (LP) solver that finds solutions to the constraints in (
47). Indeed, the SAT solver may return an assignment for the Boolean variables for which the corresponding LP problem is infeasible. In such a case, we use the LP solver to search for a set of Irreducibly Infeasible Set (IIS) that explains the reason behind such infeasibility. This IIS is then encoded into a constraint that prevents the SAT solver from returning any assignment that can lead to the same IIS. We iterate between the SAT solver and the LP solver until one Boolean assignment is found for which the corresponding LP problem is feasible. It follows from maximizing the objective function that this set of active constraints is guaranteed to be a maximal nontrivially feasible subset of .Once a maximal nontrivially feasible subset of is found, we can add a blocking Boolean constraints to the SAT solver, thus preventing the SAT solver from producing any subsets of this maximal nontrivially feasible set. We continue this process until the SAT solver can not find any more feasible assignments to the Boolean variables, at which point our algorithm terminates and returns all of the nontrivially feasible subsets it has obtained. This discussion is summarized in Algorithm 2 whose correctness follows from the correctness of SMC solvers (shoukry2018smc, ; shoukry2017smc, ).
5. Approximating the Number of UniqueOrder Regions in the MPC Controller
In this section, we discuss our implementation of EstimateUniqueOrder from Algorithm 1. Unlike our implementation of EstimateRegionCount, which we could base of aspects of the MPC problem, the implementation of this function merely exploits a general bound on the number of possible regions in an arrangement of hyperplanes.
In particular, we noted in Definition 2 that the uniqueorder regions created by a set of local linear functions correspond to the regions in the hyperplane arrangement specified by nonempty hyperplanes of the form , each of which is a hyperplane in dimension of (when ).
There seems to be a wellknown – but rarely stated – upper bound on the number of regions that can be formed by a hyperplane arrangement of hyperplanes in dimension . The few places where it is stated (e.g (SerraBoundingCountingLinear2018, , Lemma 4)) seem to ambiguously quote Zaslavsky’s Theorem (StanleyIntroductionHyperplaneArrangements, , Thoerem 2.5) in their proofs. Thus, we state the bound, and sketch a proof.
Theorem 1 ().
Let be an arrangement of hyperplanes in dimension . Then the number of regions created by this arrangement, is bounded by:
(48) 
(with equality if and only if is in general position (StanleyIntroductionHyperplaneArrangements, , pp. 4)).
Proof.
First, we note that the bound holds with equality for arrangements in general position (defined on (StanleyIntroductionHyperplaneArrangements, , pp. 4)); this is from (StanleyIntroductionHyperplaneArrangements, , Proposition 2.4), a consequence of Zaslavsky’s theorem (StanleyIntroductionHyperplaneArrangements, , Theorem 2.5). Thus, the claim holds if every other arrangement has fewer regions than an arrangement in general position with the same number of hyperplanes.
This is indeed the case, but it helps to have a little bit of terminology first. In particular, we introduce the general formula for the number of regions in a hyperplane arrangement, , in terms of a triple of hyperplane arrangements (StanleyIntroductionHyperplaneArrangements, , pp. 13), namely (StanleyIntroductionHyperplaneArrangements, , Lemma 2.1):
(49) 
Such a triple is formed by choosing a distinguished hyperplane , and defining as and as the arrangement of hyperplanes . Note that characterizes the regions in that are split by .
From here, we will only provide a brief proof sketch. The proof proceeds by induction: first on the number of hyperplanes in , and then on by induction on the dimension, . For , the result can be shown for arrangements of size using (49), and noting that if and only if intersects all the other hyperplanes exactly once. This, together with the induction assumption, shows can satisfy the claim with equality only if is in general position. For , the proof proceeds similarly, using (49) to invoke the conclusion for as necessary. ∎
Thus, our implementation of EstimateUniqueOrderCount simply computes and returns the value in (48). In the worst case, this estimate is : this occurs for example when . But for , this bound clearly grows more slowly than exponentially in . This is extremely helpful in keeping the size of the second linear layer in Figure 4 of a reasonable size.
We conclude this section by noting that the result in Theorem 1 may be used to state Theorem 6 independently of entirely. In particular, we have the following theorem.
Theorem 2 ().
Let be a CPWL function, and let be an upperbound on the number of local linear functions in . Then for there exists a parameter list with as in (30) such that:
(50) 
6. Discussion: Hinging Hyperplane Implementations
For comparison, we will make some remarks about the hinging hyperplane representation used in (AroraUnderstandingDeepNeural2016, , Theorem 2.1).
Theorem 1 (Hinging Hyperplane Representation (ShuningWangGeneralizationHingingHyperplanes2005, )).
Let be a CPWL function. Then there exists a finite integer and nonnegative integers , each less than , such that
(51) 
for some collection of , , each of which is an affine function of its argument , and each constant . (Each beyond the first one is referred to as a “hinge”. That is is the number of “hinges” in the summand.)
The problem with creating a NN architecture from Theorem 1 is that its proof provides only an existential assertion – by means of explicit construction – and that construction relies heavily on particular knowledge of the actual local linear functions in the CPWL (ShuningWangGeneralizationHingingHyperplanes2005, , Theorem 1). Thus, the number in (51) – which is essentially the only architectural parameter in the ReLU – has a complicated dependence on the particular CPWL function (see (ShuningWangGeneralizationHingingHyperplanes2005, , Corollary 3) as it appears in the proof of (ShuningWangGeneralizationHingingHyperplanes2005, , Theorem 1)). This even makes it difficult to “replay” the proof of Theorem 1 and upperbound the size of the existential assertions in each step as necessary; there are naive upper bounds for each step, but they lead to an extravagant number of operations (exponentially many, in fact).
Moreover, a further complication is that hinging hyperplane representations need not even be unique. For example, consider : for this function, works because . But the same function could also be implemented as
(52) 
which has and five different linear functions . However, this clearly suggests an alternate approach to upperbounding the steps in Theorem 1: create an architecture derived from the CPWL with the largest minimal hinginghyperplane representation. We conjecture that such a maximin hinging hyperplane form would in fact lead to fewer units than the lattice representation used in AReN. Unfortunately, as far as we are aware, there exists no such minimal characterization in the literature (as a function of the number of local linear functions, say), so we leave consideration of this problem to future work.
7. Numerical Results
The function EstimateRegionCount (Algorithm 2) is the bottleneck in Algorithm 1. Therefore, we chose to benchmark Algorithm 2 to gage the overall performance of our proposed framework. We implemented EstimateRegionCount using SAT solver Z3 and convex solver CPLEX using their respective Python interfaces. We tested our implementation on singleinput,singleoutput MPC problems in two contexts: (1) with a varying number of states; and (2) with a varying prediction horizon . The computer used had an Intel Core i7 2.9GHz processor and 16 GB of memory.
Figure 5 (top) shows the performance of Algorithm 2 as a function of the number of plant states, , with all other parameters held constant. The estimated number of local linear functions N_est output by EstimateRegionCount is plotted on one axis; the maximum number of linear functions needed, , is also shown for reference. The other axis shows the execution time for each problem in seconds. It follows from Theorem 6 that the number of plant states doesn’t change the number of constraints and hence does not contribute to the complexity of Algorithm 2. Note that Algorithm 2 reported a number of local linear functions that is one order of magnitude less than the maximum number of linear functions needed, while taking less than 1.5 minutes of execution time.
Figure 5 (bottom) shows the performance of our algorithm (in semilog scale) as a function of the number of constraints, , with all other parameters held constant (). The estimated number of linear functions output by EstimateRegionCount is plotted on one axis; the maximum number of linear functions needed, , is also shown for reference. The other axis shows the execution time for each problem in seconds. Again, we notice an order of magnitude difference between the reported number of local linear functions versus the maximum number of linear functions needed, . Indeed, the execution time is affected by increasing the number of constraint, nevertheless, Algorithm 2 terminates in less than 1.5 hours for a system with more than 300,000 maximal linear regions.
References
 [1] Bernt M Akesson and Hannu T Toivonen. A neural network model predictive controller. Journal of Process Control, 16(9):937–946, 2006.
 [2] Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J Zico Kolter. Differentiable mpc for endtoend planning and control. In Advances in Neural Information Processing Systems, pages 8289–8300, 2018.
 [3] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding Deep Neural Networks with Rectified Linear Units. 2016.
 [4] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv:1611.02167, 2016.
 [5] Alberto Bemporad, Manfred Morari, Vivek Dua, and Efstratios N. Pistikopoulos. The explicit linear quadratic regulator for constrained systems. Automatica, 38(1):3–20, 2002.
 [6] James Bergstra and Yoshua Bengio. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
 [7] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for selfdriving cars. arXiv:1604.07316, 2016.
 [8] L Cavagnari, Lalo Magni, and Riccardo Scattolini. Neural network implementation of nonlinear recedinghorizon control. Neural computing & applications, 8(1):86–92, 1999.
 [9] S. Chen, K. Saulnier, N. Atanasov, D. D. Lee, V. Kumar, G. J. Pappas, and M. Morari. Approximating Explicit Model Predictive Control Using Constrained Neural Networks. In 2018 Annual American Control Conference (ACC), pages 1520–1527, 2018.
 [10] Arthur Claviere, Souradeep Dutta, and Sriram Sankaranarayanan. Trajectory tracking control for robotic vehicles using counterexample guided training of neural networks. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, pages 680–688, 2019.
 [11] Michael Hertneck, Johannes Köhler, Sebastian Trimpe, and Frank Allgöwer. Learning an approximate model predictive controller with guarantees. IEEE Control Systems Letters, 2(3):543–548, 2018.
 [12] C. Kahlert and L. O. Chua. A generalized canonical piecewiselinear representation. IEEE Transactions on Circuits and Systems, 37(3):373–383, 1990.
 [13] Benjamin Karg and Sergio Lucia. Efficient representation and approximation of model predictive control laws via deep learning. arXiv:1806.10644, 2018.
 [14] David G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, 1997.
 [15] J Gomez Ortega and EF Camacho. Mobile robot navigation in a partially structured static environment, using neural predictive control. Control Engineering Practice, 4(12):1669–1679, 1996.
 [16] Supratik Paul, Vitaly Kurin, and Shimon Whiteson. Fast efficient hyperparameter tuning for policy gradients. arXiv:1902.06583, 2019.
 [17] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. arXiv:1602.02355, 2016.
 [18] Marcus Pereira, David D Fan, Gabriel Nakajima An, and Evangelos Theodorou. Mpcinspired neural network policies for sequential decision making. arXiv:1802.05803, 2018.
 [19] Yao Quanming, Wang Mengshuo, Jair Escalante Hugo, Guyon Isabelle, Hu YiQi, Li YuFeng, Tu WeiWei, Yang Qiang, and Yu Yang. Taking human out of learning applications: A survey on automated machine learning. arXiv:1810.13306, 2018.
 [20] Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and Counting Linear Regions of Deep Neural Networks. arXiv:1711.02114, 2018.
 [21] Yasser Shoukry, Pierluigi Nuzzo, Alberto L SangiovanniVincentelli, Sanjit A Seshia, George J Pappas, and Paulo Tabuada. SMC: Satisfiability Modulo Convex optimization. In Proceedings of the 20th International Conference on Hybrid Systems: Computation and Control (HSCC), pages 19–28. ACM, 2017.
 [22] Yasser Shoukry, Pierluigi Nuzzo, Alberto L SangiovanniVincentelli, Sanjit A Seshia, George J Pappas, and Paulo Tabuada. Smc: Satisfiability modulo convex programming. Proceedings of the IEEE, 106(9):1655–1679, 2018.
 [23] Shuning Wang and Xusheng Sun. Generalization of hinging hyperplanes. IEEE Transactions on Information Theory, 51(12):4425–4431, 2005.
 [24] Richard P Stanley. An Introduction to Hyperplane Arrangements. page 90.
 [25] J. M. Tarela and M. V. Martínez. Region configurations for realizability of lattice PiecewiseLinear models. Mathematical and Computer Modeling, 30(11):17–27, 1999.
Comments
There are no comments yet.