I Introduction and Background
The question of how best to generate autonomous control policies for mechanical systems is an important problem in robotics. Research in this field can be traced back to early work on optimal control by Pontryagin [20] and Bellman [5]. Since then, significant progress has been made in both the theory and application of autonomous control techniques [22, 24]. However, challenges remain in developing strategies that are valid without a priori knowledge of the system dynamics.
One possible solution is to use model-free policy generation techniques. These methods require no explicit model of the system dynamics and have been shown to be effective in numerous domains [14, 15]. However, model-free techniques often require massive amounts of data and are therefore difficult to evaluate on real-world robotic systems. An alternative is to learn an explicit model of the system dynamics that can be incorporated into an optimal control algorithm. Model-based control methods are more data-efficient and often easier to apply in real-world scenarios [3, 6]. However, many optimal control algorithms require some notion of derivatives to compute a control policy [2, 16, 18, 25].
Computing the required derivatives can prove challenging with complex modeling techniques like deep neural networks. Additionally, these black-box methods make it difficult to analyze the underlying dynamics of the system. There are, of course, alternative modeling techniques [1, 3, 8, 11, 13, 15]; however, there remains a desire to incorporate modern deep neural networks into the optimization loop due to their ability to model challenging dynamic features (e.g., contacts) and to scale to high-dimensional tasks [12, 19, 27]. In this work, we provide a method that combines the expressive power of neural network models with gradient-based optimal control algorithms. Our solution is based on a neural network architecture that enforces a linear structure in the state and control space, making it easier to analyze and to incorporate into model-based control.
II Structured Neural Networks for Model Predictive Control
In this section, we define our structured neural network architecture and then detail how the learned models can be integrated into model-based control algorithms.
II-A Structured Neural Network Architecture
Our neural network architecture is composed of two parallel subnetworks (see Figure 1). The architecture of the first subnetwork (the A-subnet) can be defined by any number of layers and parameters, and is only constrained such that the final layer must have n × n parameters, where n is the dimension of the system's state space. Similarly, the second subnetwork (the B-subnet) is only constrained such that the final layer must have n × m parameters, where m is the dimension of the system's control space. The network then combines (1) the dot product of the output of the A-subnet and the state x with (2) the dot product of the output of the B-subnet and the control u, through an element-wise add operation. This architecture describes a single, global model of the form ẋ = A(x, u) · x + B(x, u) · u, which is trained with standard gradient-based techniques and can be evaluated and linearized anywhere in the state space. Here, the A-subnet represents the linearization of the dynamics model with respect to the state variables (i.e., ∂f/∂x), and the B-subnet represents the linearization of the dynamics model with respect to the control variables (i.e., ∂f/∂u).
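To make the architecture concrete, the following is a minimal NumPy sketch of the forward pass. The hidden-layer sizes and the choice to feed the concatenated state and control into both subnets are illustrative assumptions, not the exact configuration used in our experiments.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Initialize the weights of a small fully connected network."""
    return [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]

def forward(layers, z):
    """Forward pass with tanh hidden activations and a linear output layer."""
    for W in layers[:-1]:
        z = np.tanh(W @ z)
    return layers[-1] @ z

n, m = 4, 2  # state and control dimensions (e.g., the two-link arm)
rng = np.random.default_rng(0)

# A-subnet: final layer emits n*n values, reshaped into the matrix A(x, u)
A_net = init_mlp([n + m, 32, n * n], rng)
# B-subnet: final layer emits n*m values, reshaped into the matrix B(x, u)
B_net = init_mlp([n + m, 32, n * m], rng)

def f(x, u):
    """Structured dynamics model: xdot = A(x, u) @ x + B(x, u) @ u."""
    z = np.concatenate([x, u])
    A = forward(A_net, z).reshape(n, n)  # reused as the Jacobian df/dx in control
    B = forward(B_net, z).reshape(n, m)  # reused as the Jacobian df/du
    return A @ x + B @ u

xdot = f(np.ones(n), np.ones(m))
```

Because A and B are explicit intermediate outputs, a single forward pass yields both the state prediction and the matrices later treated as Jacobians.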
II-B Integration with Model-based Control
Given a learned dynamics model, one can compute autonomous control policies through data-driven methods or through integration with optimal control algorithms [2, 16, 25]. On the optimal control side, researchers have mostly explored sampling-based optimization methods, for example computing control trajectories with a random shooting method [19] or with model predictive path integral control [27]. Sampling-based methods are appealing in this domain because the solution does not depend on computing potentially costly gradients with respect to the state and control variables. However, the solution does require generating a large number of samples to cover a sufficient portion of the action space. The challenge, then, is to balance the number of samples generated at each time-step with the rate of the control loop, which becomes increasingly difficult as the dimensionality of the action space grows.
In contrast with sampling-based methods, gradient-based optimization techniques provide an efficient means of computing control trajectories. Additionally, these methods provide sensitivity information in the form of time-varying Jacobians. However, integrating neural network models with these optimization techniques can prove difficult because it is unclear a priori how to compute the necessary Jacobians (∂f/∂x and ∂f/∂u). By enforcing a linear structure on the neural network architecture (as described in Section II-A), we can efficiently predict the evolution of the dynamic system as well as the required Jacobians. Then, to generate an autonomous policy, we solve the following optimal control problem
    minimize over u_{1:T}:  Σ_{t=1}^{T−1} l(x_t, u_t) + l_T(x_T)
    subject to:  ẋ = f(x, u),  u_t ∈ U,  x_t ∈ X,
where f is the learned, structured system dynamics, l and l_T are the running and terminal costs, and U and X are the sets of valid control and state values. The solution of this problem is the control sequence that minimizes the cost.
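To illustrate why readily available Jacobians matter, the sketch below solves a toy instance of this problem by plain gradient descent on the control sequence, using an adjoint (backward) pass built from the Jacobians A and B. A fixed linear model stands in for the learned network, and the horizon, cost, and step size are arbitrary choices for the example.

```python
import numpy as np

# Stand-in for the learned model: constant A and B here; with the structured
# network, A and B would be re-evaluated (state-dependent) at every step.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])   # discrete double-integrator-like dynamics
B = np.array([[0.0],
              [0.1]])
goal = np.array([1.0, 0.0])
T = 30                       # planning horizon

def rollout(x0, U):
    """Forward pass: integrate the model along the control sequence."""
    X = [x0]
    for t in range(T):
        X.append(A @ X[-1] + B @ U[t])
    return X

def control_gradient(X, U):
    """Adjoint (backward) pass for the terminal cost ||x_T - goal||^2.
    The Jacobians A and B come directly from the subnet outputs."""
    lam = 2.0 * (X[-1] - goal)   # gradient of the cost w.r.t. x_T
    G = np.zeros_like(U)
    for t in reversed(range(T)):
        G[t] = B.T @ lam
        lam = A.T @ lam
    return G

x0 = np.zeros(2)
U = np.zeros((T, 1))
for _ in range(500):             # plain gradient descent on the controls
    U -= 0.1 * control_gradient(rollout(x0, U), U)

x_final = rollout(x0, U)[-1]     # approaches the goal state
```

Each optimization step costs one forward rollout and one backward sweep, with no sampling of the action space.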
III Experimental Validation
We validate the efficacy of our approach through experimentation on three standard control domains. Our first experimental environment is OpenAI Gym's implementation of the continuous mountain car problem [7] (Figure 1(a)). The mountain car is defined by a two dimensional state space (the car's position and velocity) and a one dimensional control space (the force applied to the car). The second experimental environment is an implementation of the classic cart-pole swing-up problem written from scratch (Figure 1(b)). The cart-pole is defined by a four dimensional state space (the cart's position and velocity, and the pole's angle and angular velocity) and a one dimensional control space (the force applied to the cart). The final experimental environment is a two-link arm written in the Bullet physics engine and described in a related CMU course (Figure 1(c)). The two-link arm exists in a four dimensional state space (the two joint angles and their angular velocities) and is controlled with a two dimensional signal (the torque applied at each joint). All three environments are defined with continuous-valued state and control spaces.
III-A Model Learning Details
In this section, we describe our data collection method and the training procedure.
III-A1 Data collection
We collect data through observation of trajectories produced by the system under control inputs sampled uniformly at random. The data is collected in tuples of (x_t, u_t, ẋ_t), where ẋ_t is computed by finite differences as (x_{t+1} − x_t)/Δt and Δt is the timestep. For each environment, we collect 500 trajectories, which are terminated either at 500 steps or when the system violates environment boundary or safety conditions.
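A sketch of this collection procedure, with a toy damped point mass standing in for the actual simulation environments (the dynamics, bounds, and trajectory counts here are illustrative, not the values used in our experiments):

```python
import numpy as np

def step(x, u, dt=0.02):
    """Toy stand-in dynamics (damped point mass); the real environments are
    mountain car, cart-pole, and the two-link arm."""
    pos, vel = x
    vel = vel + dt * (u - 0.1 * vel)
    pos = pos + dt * vel
    return np.array([pos, vel])

def collect(n_traj=50, max_steps=100, dt=0.02, seed=0):
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(n_traj):
        x = rng.uniform(-0.1, 0.1, size=2)
        for _ in range(max_steps):
            u = rng.uniform(-1.0, 1.0)    # controls sampled uniformly at random
            x_next = step(x, u, dt)
            xdot = (x_next - x) / dt      # finite-difference state derivative
            data.append((x, np.array([u]), xdot))
            x = x_next
            if abs(x[0]) > 2.0:           # boundary / safety termination
                break
    return data

dataset = collect()
```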
III-A2 Training the model
Given a dataset of N tuples (x_t, u_t, ẋ_t), we train the dynamics model by minimizing the following mean squared error function
    L(θ) = (1/N) Σ_{t=1}^{N} ‖ẋ_t − f_θ(x_t, u_t)‖²,
where f_θ(x_t, u_t) = A(x_t, u_t) · x_t + B(x_t, u_t) · u_t is the output of the structured network with parameters θ.
We use the Adam optimizer with a learning rate of 0.001. Half the data is used for training and half for validation. We find that no data preprocessing is necessary.
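The sketch below illustrates this training setup on synthetic data, simplified in two ways: a plain linear map A x + B u replaces the full structured network, and plain gradient descent on the mean squared error replaces Adam.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 2, 1, 2000
# Synthetic (x, u, xdot) tuples from an unknown linear system (data stand-in)
A_true = np.array([[0.0, 1.0], [-1.0, -0.2]])
B_true = np.array([[0.0], [1.0]])
X = rng.uniform(-1, 1, (N, n))
U = rng.uniform(-1, 1, (N, m))
Xdot = X @ A_true.T + U @ B_true.T

# Half the data for training, half for validation
Xtr, Utr, Ytr = X[:N // 2], U[:N // 2], Xdot[:N // 2]
Xva, Uva, Yva = X[N // 2:], U[N // 2:], Xdot[N // 2:]

A = np.zeros((n, n))
B = np.zeros((n, m))
lr = 0.05
for _ in range(2000):                       # gradient descent on the MSE loss
    E = Xtr @ A.T + Utr @ B.T - Ytr         # prediction error on xdot
    A -= lr * 2.0 / len(Xtr) * (E.T @ Xtr)  # d/dA of mean ||error||^2
    B -= lr * 2.0 / len(Utr) * (E.T @ Utr)  # d/dB of mean ||error||^2

val_mse = np.mean((Xva @ A.T + Uva @ B.T - Yva) ** 2)
```

With noiseless linear data the validation error drives toward zero, matching the observation that no preprocessing is needed for this loss.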
III-B Results
Our evaluation consists of state plots that demonstrate that our neural network architecture can be used to solve model-based control problems. Each example solution depicts the initial state of the system (the start of the state trajectory, chosen at random), the time-varying state produced by our model-based control algorithm (red and blue), and the goal state (black). In Figures 1(d), 1(e), and 1(f), we present a single solution for each experimental environment; however, we note that our algorithm produced successful control trajectories (with respect to the desired goal state) from a variety of initial conditions. Additionally, our approach was able to successfully generate control trajectories that reached arbitrary goal states in the two-link arm environment.
IV Discussion
These results suggest that our structured neural network can be used to learn a global model of the system dynamics while simultaneously enforcing linearization constraints that make it possible to recover time-varying derivatives without additional computation. In contrast to approximation methods (e.g., numerical differentiation) and symbolic methods (e.g., automatic differentiation), our approach can be thought of as a prediction method for computing the required time-varying derivatives. Related work in this area includes the transformation network proposed in [26], which directly predicts the parameters of an A and B matrix in a latent space. In contrast, our approach does not explicitly learn the parameters of a matrix; instead, we learn nonlinear mappings (the A-subnet and B-subnet) that we treat as linearizations of the global model by virtue of the structure of our network. This allows us to learn a global model of the system dynamics while simultaneously enforcing linearization constraints. A related call for the use of structure in neural networks has been explored in model-free policy generation: in [23], researchers describe a network architecture that combines linear and nonlinear policies into a single control model. In our work, we instead enforce structure that mimics linear time-varying systems and incorporate these models into optimal control algorithms.
IV-A Why We Think This Works
In this work, we address the bottleneck associated with computing gradients of the system model through the application of a structured neural network that explicitly encodes linearization constraints and therefore reduces the computational complexity of recovering the required Jacobians. However, without further study, it is not clear whether the learned A- and B-subnetworks actually approximate the required time-varying derivatives. Experimental evidence suggests that the vectors represented by these networks are, at a minimum, pointing in the direction of the gradient. This claim is based on the fact that (1) our model-based control algorithm produces successful policies in a variety of control domains, and (2) when we incorporate the learned system model into an MPC algorithm, we treat the output of the subnetworks as first-order derivatives of the system dynamics.
IV-B Open Questions
We now pose a number of open questions that we plan to address in future work. In particular, we are interested in exploring how our structured neural network model compares with alternative methods of computing time-varying derivatives. One such alternative is to use a finite differences method for numerical differentiation. From a practical standpoint, we note that this method is prone to round-off errors and is computationally expensive in an iterative, receding-horizon framework. Another alternative is to use automatic differentiation [4]. This approach has been shown to work well; however, it requires well-formed expression graphs and derivatives computed at compile-time to work efficiently enough for online optimization [10]. In future work, we plan to compare and contrast these methods in high-dimensional control spaces.
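For reference, the finite-differences alternative looks like the following sketch (central differences on a toy pendulum-like model standing in for any black-box dynamics). Each Jacobian costs 2n extra model evaluations, and this cost is paid at every state along the horizon in a receding-horizon loop.

```python
import numpy as np

def f(x, u):
    """Example nonlinear dynamics (pendulum-like), a stand-in for any model."""
    return np.array([x[1], -np.sin(x[0]) - 0.1 * x[1] + u[0]])

def numerical_jacobian(f, x, u, eps=1e-6):
    """Central finite differences: 2 * dim(x) model evaluations per Jacobian."""
    n = len(x)
    J = np.zeros((n, n))
    for i in range(n):
        d = np.zeros(n)
        d[i] = eps
        J[:, i] = (f(x + d, u) - f(x - d, u)) / (2 * eps)
    return J

x, u = np.array([0.3, 0.0]), np.array([0.0])
J = numerical_jacobian(f, x, u)   # approximates df/dx at (x, u)
```

The step size eps trades truncation error against round-off error, which is the numerical sensitivity noted above.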
V Conclusion
In this work, we propose a structured neural network that can be used to solve model-based control problems. The architecture makes it easy to integrate the learned models with gradient-based optimal control algorithms and simplifies the interpretation of a system model parameterized by a deep neural network. This idea is in line with other recent calls for the simplification of data-driven control strategies [17, 21].
- Abraham et al.  Ian Abraham, Gerardo De La Torre, and Todd Murphey. Model-Based Control Using Koopman Operators. In Robotics: Science and Systems, 2017.
- Ansari and Murphey  Alex Ansari and Todd D Murphey. Sequential Action Control: Closed-Form Optimal Control for Nonlinear Systems. IEEE Transactions on Robotics, 32:1196 – 1214, Oct. 2016.
- Atkeson and Santamaria  Christopher G Atkeson and Juan Carlos Santamaria. A Comparison of Direct and Model-based Reinforcement Learning. In International Conference on Robotics and Automation, volume 4, pages 3557–3564. IEEE, 1997.
- Baydin et al.  Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic Differentiation in Machine Learning: A Survey. Journal of Machine Learning Research, 18(153):1–43, 2018.
- Bellman  Richard Bellman. Dynamic Programming. Princeton University Press, 1957.
- Broad et al.  Alexander Broad, Todd Murphey, and Brenna Argall. Learning Models for Shared Control of Human-Machine Systems with Unknown Dynamics. In Robotics: Science and Systems, 2017.
- Brockman et al.  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv:1606.01540, 2016.
- Deisenroth and Rasmussen  Marc Deisenroth and Carl E Rasmussen. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In International Conference on Machine Learning, pages 465–472, 2011.
- Deisenroth et al.  Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A Survey on Policy Search for Robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
- Giftthaler et al.  Markus Giftthaler, Michael Neunert, Markus Stäuble, Marco Frigerio, Claudio Semini, and Jonas Buchli. Automatic Differentiation of Rigid Body Dynamics for Optimal Control and Estimation. Advanced Robotics, 31(22):1225–1237, 2017.
- Gu et al.  Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous Deep Q-Learning with Model-based Acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
- Heess et al.  Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning Continuous Control Policies by Stochastic Value Gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
- Khansari-Zadeh and Billard  S Mohammad Khansari-Zadeh and Aude Billard. Learning Stable Nonlinear Dynamical Systems with Gaussian Mixture Models. IEEE Transactions on Robotics, 27(5):943–957, 2011.
- Kober and Peters  Jens Kober and Jan R Peters. Policy Search for Motor Primitives in Robotics. In Advances in Neural Information Processing Systems, pages 849–856, 2009.
- Levine and Abbeel  Sergey Levine and Pieter Abbeel. Learning Neural Network Policies with Guided Policy Search Under Unknown Dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.
- Li and Todorov  Weiwei Li and Emanuel Todorov. Iterative Linear Quadratic Regulator Design for Nonlinear Biological Movement Systems. In International Conference on Informatics in Control, Automation and Robotics, pages 222–229, 2004.
- Mania et al.  Horia Mania, Aurelia Guy, and Benjamin Recht. Simple Random Search Provides a Competitive Approach to Reinforcement Learning. arXiv:1803.07055, 2018.
- Mayne  David Q Mayne. Differential Dynamic Programming–A Unified Approach to the Optimization of Dynamic Systems. In Control and Dynamic Systems, volume 10, pages 179–254. Elsevier, 1973.
- Nagabandi et al.  Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural Network Dynamics for Model-based Deep Reinforcement Learning with Model-free Fine-Tuning. arXiv:1708.02596, 2017.
- Pontryagin  Lev Semenovich Pontryagin. The Mathematical Theory of Optimal Processes. CRC Press, 1987.
- Rajeswaran et al.  Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards Generalization and Simplicity in Continuous Control. In Advances in Neural Information Processing Systems, pages 6553–6564, 2017.
- Sontag  Eduardo D Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems, volume 6. Springer Science & Business Media, 2013.
- Srouji et al.  Mario Srouji, Jian Zhang, and Ruslan Salakhutdinov. Structured Control Nets for Deep Reinforcement Learning. International Conference on Machine Learning, 2018.
- Sutton and Barto  Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 2. MIT press Cambridge, 2017.
- Tassa et al.  Yuval Tassa, Nicolas Mansard, and Emo Todorov. Control-Limited Differential Dynamic Programming. In International Conference on Robotics and Automation, pages 1168–1175. IEEE, 2014.
- Watter et al.  Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.
- Williams et al.  Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and Evangelos A Theodorou. Information Theoretic MPC for Model-based Reinforcement Learning. In International Conference on Robotics and Automation, pages 1714–1721. IEEE, 2017.