1 Introduction
In many sequential learning tasks, we must iteratively collect new data and then learn from them, creating a dependency between learning and experiment design. For example, we can learn from previous experimental outcomes to choose the next experiment so as to optimize future outcomes martinez2014bayesopt . This iterative interaction between learning and experiment design poses challenges, most notably how the updated model should inform new data collection. The challenge is further complicated when operating within a dynamical system, as both learning and experiment design must be integrated with dynamics modeling and controller design. We are further interested in settings that require safety and stability guarantees for the closed-loop controller.
Motivating Application. Consider the motivating problem of safely landing a drone at fast landing speeds (e.g., beyond a human expert’s piloting abilities). We typically only have partial knowledge of the underlying dynamics and thus need to collect data to learn a better dynamics model. However, collecting relevant training data poses safety risks, as one needs to guarantee that landing quickly will not crash the drone before such data have been collected. We typically start with a nominal controller designed based on a nominal dynamics model that has high uncertainty when flying close to the ground at high speeds. Figure 1 describes the setting, where the goal is to eventually learn to execute the orange trajectory without being overconfident and executing the green trajectory instead (which crashes into the ground); the initial nominal controller may only be able to execute the blue trajectory. The goal then is to iteratively (i.e., episodically) execute increasingly aggressive trajectories, while certifying (with high probability) the safety of each trajectory to be executed.
To date, the most popular methods for safe exploration in dynamical systems are based on Gaussian processes (GPs) rasmussen2003gaussian , mainly due to their straightforward uncertainty quantification mechanism akametalu2014reachability ; berkenkamp2016safe ; berkenkamp2017safe ; fisac2018general ; khalil . However, GPs are sensitive to model selection. Safety constraints can be violated if the model (i.e., the kernel) is misspecified or the hyperparameters are not well tuned. For instance, when equipped with a kernel that is overly optimistic in its ability to extrapolate, GP-based exploration can choose overly aggressive behaviors, which is highly undesirable for safety-critical tasks.
Our Contributions. In this paper, we propose a deep robust regression approach for safe exploration in model-based control. We take the perspective that each episode of exploration can be viewed as a data shift problem, i.e., the “test” data in the proposed exploratory trajectory comes from a shifted distribution compared to the current training set. Our approach learns to quantify uncertainty under such covariate shift, which we then use to learn robust dynamics models that quantify the uncertainty of entire trajectories for safe exploration. In the drone example above, we would aim to gradually increase the landing speed by choosing controls for executing higher-speed trajectories, while simultaneously reducing the uncertainty of the dynamics model.
Our approach builds upon robust linear regression under covariate shift chen2016robust , which we extend to training deep neural networks. The resulting method learns to directly predict an uncertainty bound by minimizing a relative loss in terms of a base distribution using a minimax approach, where the adversarial player is a worst-case competing probabilistic estimator from source data samples. We analyze learning performance from both generalization and data perturbation perspectives, which show a relation between learning errors on the target data and source prediction variances. We utilize spectrally normalized neural networks bartlett2017spectrally ; shi2018neural , which guarantee certain stability properties, and derive corresponding bounds in the robust regression framework. Our results on robust regression are of independent interest beyond our application to control.
We integrate our robust regression analysis to address exploration in model-based control, and derive safety and stability bounds for control performance when learning robust dynamics models. We propose a novel safe exploration algorithm that guarantees safety during control deployment, and also derive convergence guarantees to the optimal dynamics estimator and controller. We empirically show that our approach outperforms conventional GP-based safe exploration with much less tuning effort in two scenarios: (a) inverted pendulum trajectory tracking; and (b) fast drone landing using an aerodynamics simulation based on real-world flight data shi2018neural .
2 Robust Regression
2.1 Robust regression under covariate shifts
Covariate shift refers to distribution shift caused only by the input variables, while the conditional output distribution remains unchanged. This assumption is valid in many cases, especially in dynamical systems. In our motivating safe landing example, there is a universal “true” aerodynamics model, but we typically only observe training data from a small part of the state space. We now briefly recap the original robust regression model. The method robustly minimizes a relative loss function, defined as the difference in conditional log-loss between an estimator and a baseline conditional distribution on the target data distribution. This loss essentially measures the amount of expected “surprise” in modeling the true data distribution that comes from the estimator instead of the baseline. We constrain the true conditional probability to satisfy certain statistical properties of the source data distribution, expressed through a vector of statistics measured from the source data. We then seek the regression model that is robust to the “most surprising” distribution that can arise from covariate shift. The solution of this problem has a parametric form, with parameters obtained by maximum conditional log-likelihood estimation with respect to the target distribution.
2.2 Robust regression with deep neural networks
To apply the ideas in Section 2.1, we observe that one can directly compute the gradient of the original problem without regularization, since it is only associated with source training samples.
By leveraging structural properties of the linear minimax problem, this sidesteps the need to explicitly evaluate the objective function, which we cannot do without ground truth on the target distribution. However, it is a nonstandard gradient computation in deep learning. We provide a derivation that is straightforward to implement using deep learning packages. We also present new theoretical results in Section 2.3.
If we utilize a quadratic feature function on the last layer of the network and incorporate additional parameters, we obtain a Gaussian distribution and the following form of mean and variance of the predicted output variable:
(1)  
(2) 
where is the mean and variance of the base distribution , is the output of the deep neural networks, the corresponding quadratic feature function is , and is the density ratio of a data point . Note that we regard the density ratio as being estimated independently of this framework. Then the gradient for the additional parameters becomes:
(3)  
(4) 
We then backpropagate the gradients to the variables in the neural network. Algorithm 1 shows a gradient descent version for learning both the network parameters and the additional parameters.
Figure 2 shows an intermediate step of using robust regression to explore a stateless space. Intuitively, the method produces less certain predictions where the training data distribution provides less support. The density ratio and base distribution determine the uncertainty, and uncertainty is only reduced where there are training data samples. Moreover, when the mean estimator is inaccurate, the variance estimator is large, compensating for the possible prediction error with larger uncertainty. We defer a more detailed discussion of the stateless case to Appendix A.2.6.
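The intuition above can be illustrated with a minimal sketch. It estimates a source-to-target density ratio with a one-dimensional Gaussian kernel density estimate (the experiments in Section 4 use kernel density estimation for this purpose) and feeds it into a precision-weighted variance. The direction of the ratio, the variance formula, and all names and numbers (`gaussian_kde`, `predictive_variance`, `theta_y`, `base_var`, the sample points) are assumptions made for illustration; they are not a transcription of Eqs. 1 and 2.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def density_ratio(x, source_pdf, target_pdf, floor=1e-12):
    """Source-to-target density ratio at x (the floor avoids division by zero)."""
    return source_pdf(x) / max(target_pdf(x), floor)

def predictive_variance(ratio, theta_y, base_var):
    """ASSUMED illustrative form: precision = 2 * ratio * theta_y + 1 / base_var."""
    return 1.0 / (2.0 * ratio * theta_y + 1.0 / base_var)

# Hypothetical data: states already visited vs. states on a proposed trajectory.
source = [-1.0, -0.8, -0.5, -0.2, 0.0, 0.1, 0.3]
target = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
p_s = gaussian_kde(source, bandwidth=0.3)
p_t = gaussian_kde(target, bandwidth=0.3)

# x = 0.0 is well covered by the source data; x = 3.0 is far from it.
var_near = predictive_variance(density_ratio(0.0, p_s, p_t), theta_y=1.0, base_var=4.0)
var_far = predictive_variance(density_ratio(3.0, p_s, p_t), theta_y=1.0, base_var=4.0)
print(var_near, var_far)
```

Under these assumptions, the variance at the well-covered point shrinks, while the variance at the uncovered point falls back to the base distribution variance, which matches the qualitative behavior described for Figure 2.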
2.3 Learning performance analysis
The learning performance of our deep robust regression approach can be analyzed from two perspectives: generalization error under distribution shift and error bound under data perturbation based on Lipschitz continuity. We first establish a general form of the bounds and then derive concrete versions for both linear and deep predictors. The proofs are in the appendix.
Theorem 1.
[Generic generalization and perturbation bounds] Assume is a training set with i.i.d. data sampled from , is a regression function class satisfying , is the Rademacher complexity on , is the upper bound of true density ratio , , the weight estimation , base distribution variance is , and is the upperbound of all among the dimensions of . When learning a on , the following generalization error bound holds with probability at least ,
If we assume that target data samples ’s stay in a ball with diameter from the source data , , the true function is Lipschitz continuous with constant , and the robust regression mean estimator is also Lipschitz continuous with constant , then,
(5) 
We can further upper bound the Rademacher complexity if we know the function class.
Corollary 1.
[The linear case] If is linear function class with , i.e. we only use linear features for , and , the following holds with probability at least ,
The corresponding perturbation bound for a linear function class is
(6) 
For deep robust regression, we utilize spectrally normalized deep neural networks bartlett2017spectrally and upper bound the Rademacher complexity using the bounded spectral complexity bartlett2017spectrally .
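As background for the spectral norm bounds used in the next corollary, the following is a minimal sketch of spectral normalization for a single weight matrix: the largest singular value is estimated by power iteration and the matrix is rescaled so that its spectral norm equals a chosen target. The helper names, the fixed start vector, and the per-layer target of 1.0 are assumptions for illustration; the training procedure used in the paper may differ.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def spectral_norm(W, iters=100):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    Wt = transpose(W)
    # Fixed, non-uniform start vector; in degenerate cases it can be orthogonal
    # to the top singular direction, so practical code uses a random start.
    v = [1.0 / (i + 1) for i in range(len(W[0]))]
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in u]
    return math.sqrt(sum(x * x for x in matvec(W, v)))

def spectrally_normalize(W, target=1.0):
    """Rescale W so that its spectral norm is approximately `target`."""
    sigma = spectral_norm(W)
    if sigma == 0.0:
        return [row[:] for row in W]
    return [[target * w / sigma for w in row] for row in W]

W = [[1.0, 2.0], [3.0, 4.0]]
W_sn = spectrally_normalize(W)
print(spectral_norm(W), spectral_norm(W_sn))
```

With Lipschitz nonlinearities, bounding each layer's spectral norm bounds the Lipschitz constant of the whole network by the product of the per-layer bounds, which is the quantity the perturbation bound relies on.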
Corollary 2.
[The neural network case] Let neural networks use fixed nonlinearities , which are Lipschitz and . Let reference matrices be given, as well as spectral norm bounds , and norm bounds . If , for robust regression using network with weight matrices and maximum dimension of each layer is at most obey , , and , the following holds with probability at least ,
(7) 
where is the spectral complexity of networks , . The corresponding perturbation bound for spectrally normalized deep neural networks is
(8) 
3 Control with a Robust Regression Dynamics Estimator
We propose to learn the nonlinear dynamics using deep robust regression under covariate shift, which is used to safely explore and collect data to improve model accuracy and derive improved control policies. (Footnote 1: Robust regression is also applicable to the standard experimental design setting, e.g., Bayesian optimization, and we conduct an empirical evaluation of that setting in Appendix A.2.6.) In order to connect learning to control, we need to adapt the general analysis of robust regression to the control context, i.e., to analyze entire trajectory behaviors. Assuming that the target data are i.i.d. samples from a static target distribution can be too conservative, since it requires robustness on a large target distribution consisting of all possible unseen states. Instead, we propose to consider only one trajectory as the target data distribution at a time, and to use robust dynamics regression to establish safety and stability guarantees. We then enlarge the set of safe trajectories episodically by collecting training data along the executed trajectory and updating the dynamics model accordingly. We now introduce the dynamics model and controller design. Note that all the norms in this section are the 2-norm. All proofs are in the appendix.
3.1 A mixed model for robotic dynamics
Consider the following mixed model for continuous robotic (drone & pendulum in our experiments) dynamics:
(9) 
with generalized coordinates (and their first & second time derivatives, & ), control input , inertia matrix , centrifugal and Coriolis terms , gravitational forces , actuation matrix and some unknown residual dynamics . Note that the matrix is chosen to make skew-symmetric from the relationship between the Riemannian metric and Christoffel symbols. Here is general, which potentially captures both parametric and nonparametric unmodelled terms.
Definition of safety in trajectory tracking.
The state vector is denoted as , . The main objective for a robotic system is to track some time-varying desired trajectory . Simultaneously, we want to guarantee safety: , with high probability, where is some safety set. It is obvious that . However, because we do not know a priori, the tracking error may be large such that . Two one-dimensional examples follow.
Example 1 (inverted pendulum with external wind).
In addition to the classical pendulum model, we consider some unknown external wind. Dynamics can be described as , where is external torque generated by the unknown wind. Our control goal is to track , and the safety set is .
Example 2 (drone landing with ground effect)
For this example, we consider drone landing with unknown ground effect. Dynamics is , where is the thrust coefficient. The control goal is smooth and quick landing, i.e., quickly driving . The safety set is , i.e., the drone cannot hit the ground with high velocity.
3.2 Model based nonlinear control
We design a nonlinear controller, which can leverage robust regression of in Eq. 9. Define the reference trajectory as , where , and the composite variable as , where is positive definite. The control objective is to drive to or a small error ball in the presence of bounded uncertainty. With the robust estimation of , , we propose the following nonlinear controller:
(10) 
where is a positive definite matrix, and denotes the Moore-Penrose pseudoinverse.
With the control law Eq. 10, we will have the following closed-loop dynamics:
(11) 
where is the approximation error between and .
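To make the role of the approximation error concrete, here is a minimal scalar sketch. It assumes the closed-loop dynamics reduce, in one dimension and with no Coriolis term, to `mass * ds/dt + gain * s = disturbance(t)`, where `s` is the composite variable and `disturbance` stands in for the approximation error. This scalar form, the Euler integration, and all numerical values are assumptions for illustration rather than the paper's Eq. 11.

```python
def simulate_composite_error(s0, mass, gain, disturbance, dt=1e-3, horizon=10.0):
    """Euler-integrate mass * ds/dt + gain * s = disturbance(t); return the path of s."""
    steps = int(round(horizon / dt))
    s = s0
    path = [s0]
    for i in range(steps):
        ds = (disturbance(i * dt) - gain * s) / mass
        s += dt * ds
        path.append(s)
    return path

eps_max = 0.5   # hypothetical bound on the dynamics approximation error
mass = 1.0      # hypothetical inertia
gain = 2.0      # hypothetical feedback gain

# Worst case for this scalar system: the error sits at its bound the whole time.
path = simulate_composite_error(1.0, mass, gain, lambda t: eps_max)

# For the scalar system, |s| converges exponentially to a ball of this radius.
radius = eps_max / gain
print(path[-1], radius)
```

In this toy setting, a larger gain or a smaller error bound shrinks the ball, which is the mechanism that lets a better learned dynamics model translate into tighter tracking.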
3.3 Stability analysis and trajectory bound
To connect learning with control, we set to correspond with the bounds in Section 2.3. The first option is to connect to Theorem 1, where the target data is only a single trajectory that deviates from those in the training data, which means W is further bounded. The second option is to use a perturbation bound, where . We omit restating the bounds here and refer to Appendix A.1.1, but emphasize that is upper bounded with when we define target data in a specific set and use robust regression for learning dynamics. We show that (the Euclidean distance between the desired trajectory and the real trajectory) is bounded when the error of the dynamics estimation is bounded. Again, recall that is our state, and is the desired trajectory.
Theorem 2.
Suppose is in some compact set , and . Then will exponentially converge to the following ball: , where
(12) 
3.4 Safe exploration algorithm
Theorem 2 indicates that if we can design a compact set and find the corresponding maximum error bound on it, we can use it to decide whether a trajectory in this set is safe or not by checking whether its worst-case possible tracking trajectory is in the safety set .
In practice, we design a pool of desired trajectories and use the current predictor of the dynamics to find the worst-case possible tracking trajectory for each of the desired ones. Note that the worst-case tracking trajectories can be computed by generating a “tube” using the Euclidean distance bound in Theorem 2. We then eliminate the unsafe ones and choose the most “aggressive” one in terms of the primary objective function for the next iteration. Instead of evaluating the actual upper bound, we use for measuring as an approximation, since it is guaranteed that the error is within with high probability as long as the prediction is a Gaussian distribution and the true function is drawn from the same distribution. We defer a detailed discussion to Appendix A.1.2. Algorithm 2 describes this safe exploration procedure.
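The selection step described above can be sketched as follows for one-dimensional states. The sketch filters a pool of desired trajectories by checking a tube around each one against the safety set and then picks the highest-reward survivor. The tube radius used here, a multiple `k` of a made-up predictive standard deviation, is only a stand-in for the bound in Theorem 2, and all names (`tube_is_safe`, `select_trajectory`, `gaussian_coverage`) and numbers are hypothetical.

```python
import math

def gaussian_coverage(k):
    """Probability that a Gaussian variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

def tube_is_safe(desired, radius, is_safe_state):
    """Check both edges of a 1-D tube; sufficient when the safety set is an interval."""
    return all(is_safe_state(x - radius) and is_safe_state(x + radius) for x in desired)

def select_trajectory(pool, tube_radius, reward, is_safe_state):
    """Return the highest-reward trajectory whose tube stays safe, or None."""
    safe = [traj for traj in pool if tube_is_safe(traj, tube_radius(traj), is_safe_state)]
    return max(safe, key=reward) if safe else None

# Hypothetical pool: half-sine swings with increasing amplitude.
amplitudes = [0.2, 0.5, 0.8, 1.0]
pool = [[a * math.sin(math.pi * i / 10) for i in range(11)] for a in amplitudes]

k = 2.0  # number of predictive standard deviations used for the tube

def tube_radius(traj):
    # ASSUMED uncertainty model: predictive std grows with how far we swing.
    sigma = 0.1 * max(abs(x) for x in traj)
    return k * sigma

best = select_trajectory(
    pool,
    tube_radius,
    reward=lambda traj: max(traj),           # "most aggressive" = largest angle
    is_safe_state=lambda x: abs(x) <= 1.0,   # hypothetical safety set
)
print(max(best), gaussian_coverage(k))
```

With these made-up numbers, the largest amplitude is rejected because its tube leaves the safety set, and the next largest one is selected; as the model's uncertainty shrinks with more data, larger amplitudes would pass the same check.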
3.5 Convergence analysis
We show that, using Algorithm 2, we are able to reach optimality in terms of learning the dynamics model, i.e., we converge to the optimal predictor in the function class.
Theorem 3.
If there exists an optimal predictor in function class , , with mean and variance estimates , the sequence of estimator from robust regression in each step consist of and , for any , the output of Algorithm 2 converge to , with as the smallest integer that satisfies the following with at least probability :
(13) 
Given the convergence of the dynamics model, we can prove the optimal desired trajectory is collected and tracked with good control performance at the end.
Corollary 3.
If there exist an optimal trajectory parameter for the controller to track safely and obtain minimal cost function among all the safe trajectories when the estimated dynamic model satisfies for all , is collected by Algorithm 2, as well as being tracked with for all , with in Theorem 2, where is the tracking trajectory using as dynamics estimation.
4 Experiments
[Figure 4: panels (a) to (h).]
We conduct experiments in simulation on the inverted pendulum and drone landing examples discussed in Section 3.1. We use kernel density estimation to estimate density ratios. We demonstrate that our approach can reliably and safely converge to optimal behavior. We also compare with a Gaussian process (GP) version of Algorithm 2. In general, we find that it is difficult to tune kernel parameters, and all GP models underperform compared to our approach.
Example 1 (inverted pendulum with external wind).
Recall that the safety set is in the pendulum case, and the final control goal is to track . Therefore our desired trajectory pool is . The ground truth of the wind comes from a quadratic air drag model. We use the angle upper bound in the trajectory as the reward function for choosing the “most aggressive” trajectories. Figure 4 demonstrates the exploration process with selected desired trajectories, worst-case tracking trajectories under the current dynamics model, tracking trajectories with the ground-truth unknown dynamics model, and actual tracking trajectories. Here we use base distribution to start with and . As shown in Figure 4 (a) to (c), the algorithm selects small to guarantee safety at the beginning, and is gradually able to select larger values and track them with small error.
Example 2 (drone landing with ground effect)
Recall that the safety set is , which means the drone cannot hit the ground with high velocity. Our desired trajectory pool is , which means the drone smoothly moves from to the desired height . If , the drone lands successfully. Note that greater means faster landing. We use smaller landing time as the reward function that determines the next “aggressive” trajectory. The ground truth of the aerodynamics in landing comes from a dynamics simulator that is trained in shi2018neural , where is a four-layer ReLU neural network trained on real flight data. Here we use base distribution for robust regression and . Results in Figure 4(e) to (g) demonstrate that, because of the lack of aerodynamics , and big may not be safe at the beginning. Starting from conservative desired trajectories, safe exploration using robust regression is able to track more aggressive desired trajectories with and big while staying safe.
Comparison with GPs. We examine here drone landing time, and defer examining the simpler pendulum setting to the appendix. We compare against five GP models with a wide range of kernel parameters, including both optimistic and conservative ones in terms of prediction uncertainty, obtained by setting different bandwidths in the RBF kernel. Figure 5 shows that our approach outperforms all GP models. Modeling the ground effect is notoriously challenging shi2018neural , and the GP suffers from model misspecification. In contrast, our approach can fit general nonlinear function estimators such as deep neural networks adaptively to the available data, which leads to a more flexible inductive bias, a better fit to the data, and better uncertainty quantification. Additional results for the drone landing setting as well as the inverted pendulum are in Appendix A.5. The supplemental material also contains a video demonstrating the results.
5 Related Work
Safe Exploration. Safe exploration methods commonly use Gaussian processes (GPs) to quantify uncertainty sui2015safe ; sui2018stagewise ; kirschner2019adaptive ; akametalu2014reachability ; berkenkamp2016safe ; turchetta2016safe ; wachi2018safe ; berkenkamp2017safe ; fisac2018general ; khalil . These methods are related to bandit algorithms bubeck2012regret and typically employ upper confidence bounds auer2002using to balance exploration versus exploitation srinivas2009gaussian . As discussed above, GP-based approaches can be sensitive to model selection. One could blend GP-based modeling with general function approximators (such as deep learning) berkenkamp2017safe ; cheng2019end , but the resulting optimization-based control problem can be challenging to solve. Other approaches either require a safety model to be specified upfront alshiekh2018safe , are restricted to relatively simple models moldovan2012safe , have no convergence guarantees during learning taylor2019episodic , or have no guarantees at all garcia2012safe . Our work also shares some similarity with snoek2015scalable ; brookes2019conditioning , which use deep neural networks for sampling-based optimization; however, those approaches have no guarantees and so are unsuitable for safe exploration.
Distribution Shift. The study of data shift has seen increasing interest in recent years, owing to the widespread practical issue that real test distributions rarely match the training distribution. Our work is stylistically similar to liu2015shift ; chen2016robust ; liu2014robust ; liu2017robust , which also frame uncertainty quantification through the lens of covariate shift, although ours is the first to extend to deep neural networks with rigorous guarantees. More broadly, dealing with domain shift is a fundamental challenge in deep learning, as highlighted by the vulnerability of deep networks to adversarial inputs goodfellow2014explaining and the implied lack of robustness. Beyond robust estimation, the typical approaches are to either regularize srivastava2014dropout ; wager2013dropout ; simile ; bartlett2017spectrally ; miyato2018spectral ; shi2018neural ; benjamin2019measuring ; corerl or synthesize an augmented dataset that anticipates the domain shift prest2012learning ; zheng2016improving ; stewart2017label . We take the former approach by employing spectral normalization bartlett2017spectrally ; shi2018neural in conjunction with robust estimation.
6 Conclusion & Future Work
In this paper, we propose an algorithmic framework for safe exploration in model-based control. To quantify uncertainty, we develop a robust deep regression method for dynamics estimation. Using robust regression, we explicitly deal with data shifts during episodic learning, and in particular can quantify uncertainty over entire trajectories. We prove generalization and perturbation bounds for robust regression, and show how to integrate them with control to derive safety bounds in terms of stability. These bounds explicitly translate the error in dynamics learning into the tracking error in control. From this, we design a safe exploration algorithm based on a finite pool of desired trajectories. We prove that the proposed safe exploration algorithm converges to the optimal dynamics estimator in its function class, as well as to the optimal controller for tracking optimal desired trajectories.
There are many avenues for future work. For instance, our safety criterion was relatively simple, and one can consider employing more sophisticated criteria that require more sophisticated certification approaches such as reachability analysis fisac2018general . Our theoretical analysis can also be improved, since tighter safety guarantees can lead to dramatically improved performance. Directions to explore include incorporating other regularization techniques, relaxing the assumption of Gaussian observation noise, and incorporating more sophisticated density estimation techniques. Another interesting direction is to incorporate deep kernel learning wilson2016deep for safe exploration, which does make stronger assumptions (it uses a Gaussian process model) but might alleviate some issues with kernel parameter tuning. Finally, our deep robust regression approach is of independent interest beyond model-based control, and can be incorporated in other applications as well.
References

(1) Ruben Martinez-Cantin. Bayesopt: A bayesian optimization library for nonlinear optimization, experimental design and bandits. The Journal of Machine Learning Research, 15(1):3735–3739, 2014.
(2) Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
(3) Anayo K Akametalu, Jaime F Fisac, Jeremy H Gillula, Shahab Kaynama, Melanie N Zeilinger, and Claire J Tomlin. Reachability-based safe learning with gaussian processes. In 53rd IEEE Conference on Decision and Control, pages 1424–1431. IEEE, 2014.
(4) Felix Berkenkamp, Angela P Schoellig, and Andreas Krause. Safe controller optimization for quadrotors with gaussian processes. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 491–496. IEEE, 2016.
(5) Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pages 908–918, 2017.
(6) Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 2018.
(7) Hassan Khalil and Jessy Grizzle. Nonlinear systems. Prentice hall, 2002.
(8) Xiangli Chen, Mathew Monfort, Anqi Liu, and Brian D Ziebart. Robust covariate shift regression. In Artificial Intelligence and Statistics, pages 1270–1279, 2016.
(9) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
(10) Guanya Shi, Xichen Shi, Michael O’Connell, Rose Yu, Kamyar Azizzadenesheli, Animashree Anandkumar, Yisong Yue, and Soon-Jo Chung. Neural lander: Stable drone landing control using learned dynamics. International Conference on Robotics and Automation (ICRA), 2019.
(11) Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with gaussian processes. In International Conference on Machine Learning, pages 997–1005, 2015.
(12) Yanan Sui, Vincent Zhuang, Joel W Burdick, and Yisong Yue. Stagewise safe bayesian optimization with gaussian processes. In International Conference on Machine Learning (ICML), 2018.
(13) Johannes Kirschner, Mojmír Mutnỳ, Nicole Hiller, Rasmus Ischebeck, and Andreas Krause. Adaptive and safe bayesian optimization in high dimensions via one-dimensional subspaces. In International Conference on Machine Learning (ICML), 2019.
(14) Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite markov decision processes with gaussian processes. In Advances in Neural Information Processing Systems, pages 4312–4320, 2016.
(15) Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. Safe exploration and optimization of constrained mdps using gaussian processes. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
(16) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
(17) Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
(18) Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.
(19) Richard Cheng, Gábor Orosz, Richard M Murray, and Joel W Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Conference on Artificial Intelligence (AAAI), 2019.
(20) Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
(21) Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in markov decision processes. In International Conference on Machine Learning (ICML), 2012.
(22) Andrew J Taylor, Victor D Dorobantu, Hoang M Le, Yisong Yue, and Aaron D Ames. Episodic learning with control lyapunov functions for uncertain robotic systems. arXiv preprint arXiv:1903.01577, 2019.
(23) Javier Garcia and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45:515–564, 2012.
(24) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180, 2015.
(25) David H Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In International Conference on Machine Learning (ICML), 2019.
(26) Anqi Liu, Lev Reyzin, and Brian D Ziebart. Shift-pessimistic active learning using robust bias-aware prediction. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
(27) Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in neural information processing systems, pages 37–45, 2014.
(28) Anqi Liu and Brian D Ziebart. Robust covariate shift prediction with general losses and feature views. arXiv preprint arXiv:1712.10043, 2017.
(29) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
(30) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
(31) Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. In Advances in neural information processing systems, pages 351–359, 2013.
(32) Hoang M. Le, Andrew Kang, Yisong Yue, and Peter Carr. Smooth imitation learning for online sequence prediction. In International Conference on Machine Learning (ICML), 2016.
(33) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
(34) Ari S Benjamin, David Rolnick, and Konrad Kording. Measuring and regularizing networks in function space. In International Conference on Learning Representations (ICLR), 2019.
(35) Richard Cheng, Abhinav Verma, Gabor Orosz, Swarat Chaudhuri, Yisong Yue, and Joel Burdick. Control regularization for reduced variance reinforcement learning. In International Conference on Machine Learning (ICML), 2019.
(36) Alessandro Prest, Christian Leistner, Javier Civera, Cordelia Schmid, and Vittorio Ferrari. Learning object class detectors from weakly annotated video. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3282–3289. IEEE, 2012.
(37) Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4488, 2016.
(38) Russell Stewart and Stefano Ermon. Label-free supervision of neural networks with physics and domain knowledge. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
(39) Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
Appendix A Appendix
A.1 Additional Theoretical Results
A.1.1 Improved Bounds for Control
As explained in the paper, we can further improve the learning bounds in the control context when we choose the target data strategically. In Theorem 1, is the upper bound of the true density ratio of the two distributions, which can potentially be very large when the target data is very different from the source. However, in practice we can choose our next trajectory to be one that does not deviate too much from the source data, thereby further constraining and also in Theorem 1. We can rewrite the theorem as:
Theorem 4.
[Improved Generalization and perturbation bounds in general cases] Assume is a training set with i.i.d. data sampled from , is the function class of mean estimator in robust regression, it satisfies , is the Rademacher complexity on , is the upper bound of true density ratio , , the weight estimation , base distribution variance is , is the upperbound of all among the dimensions of , we have the generalization error bound on hold with probability ,
If we assume target data samples ’s stay in a ball with diameter from the source data , the true function is Lipschitz continuous with constant and the robust regression mean estimator is also Lipschitz continuous with constant ,
(14) 
Note that in the generalization bound, we can further improve the bound if we know the method for estimating the density ratio and further relate the overall learning performance to the density ratio estimation. Here, we simply treat as a value that is given to us beforehand.
A.1.2 High Probability Bounds for Gaussian Distribution
In Algorithm 2, we use as our approximation of the learning error from the robust regression instead of measuring the actual learning upper bound, which is hard to evaluate. Here we give the justification.
If the prediction from robust regression is , assuming the true function is drawn from the same distribution, we have . Also, for a unit normal distribution , we have . Therefore, for data , holds with probability greater than . Therefore, we can choose in practice, and each choice corresponds to a different probability in the bounds.
A.2 Proof of Theoretical Results
A.2.1 Proof of Theorem 1
Proof.
We first prove the generalization bound using the standard Rademacher complexity for regression problems:
(15) 
where , is the Rademacher complexity on the function class of mean estimate Eq. 1, and the variance term is the empirical variance of the robust regression model and follows Eq. 2. This is a datadependent bound that relies on training samples.
We next prove the perturbation bounds. Assuming stays in a ball with diameter from the source training data , , the true function is Lipschitz continuous with constant and the mean function of our learned estimator is also Lipschitz continuous with constant , then we have
(16) 
If we have an upper bound for the parameter and the weight estimation , we have
Therefore, the generalization bound and perturbation bounds can be written as
(18) 
(19) 
∎
A.2.2 Proof of Corollary 1
Proof.
If the function class is linear, assuming , we can further bound the Rademacher complexity of when is a linear model. Indeed, the mean estimate has the form , where and . If is large enough to make dominate , is approximately . When is very small,