Robust Regression for Safe Exploration in Control

06/13/2019 · Anqi Liu et al.

We study the problem of safe learning and exploration in sequential control problems. The goal is to safely collect data samples from an operating environment to learn an optimal controller. A central challenge in this setting is how to quantify uncertainty in order to choose provably-safe actions that allow us to collect useful data and reduce uncertainty, thereby achieving both improved safety and optimality. To address this challenge, we present a deep robust regression model that is trained to directly predict the uncertainty bounds for safe exploration. We then show how to integrate our robust regression approach with model-based control methods by learning a dynamic model with robustness bounds. We derive generalization bounds under domain shifts for learning and connect them with safety and stability bounds in control. We demonstrate empirically that our robust regression approach can outperform conventional Gaussian process (GP) based safe exploration in settings where it is difficult to specify a good GP prior.


1 Introduction

In many sequential learning tasks, we must iteratively collect new data and then learn from it, creating a dependency between learning and experiment design. For example, we can learn from previous experimental outcomes to choose the next experiment that optimizes future outcomes martinez2014bayesopt . This iterative interaction between learning and experiment design poses challenges, most notably how the updated model should inform new data collection. The challenge is further complicated when operating within a dynamical system, as both learning and experiment design must be integrated with dynamics modeling and controller design. We are particularly interested in settings that require safety and stability guarantees for the closed-loop controller.


Figure 1: Illustration of Motivation

Motivating Application. Consider the problem of safely landing a drone at fast landing speeds (e.g., beyond a human expert's piloting abilities). We typically have only partial knowledge of the underlying dynamics and thus need to collect data to learn a better dynamics model. However, collecting relevant training data poses safety risks: one needs to guarantee that landing quickly will not crash the drone before such data has been collected. We typically start with a nominal controller designed from a nominal dynamics model that has high uncertainty when flying close to the ground at high speeds. Figure 1 depicts the setting: the goal is to eventually learn to execute the orange trajectory without being overconfident and executing the green trajectory instead (which crashes into the ground); the initial nominal controller may only be able to execute the blue trajectory. The goal then is to iteratively (i.e., episodically) execute increasingly aggressive trajectories, while certifying (with high probability) the safety of each trajectory before it is executed.

To date, the most popular methods for safe exploration in dynamical systems are based on Gaussian processes (GPs) rasmussen2003gaussian , mainly due to their straightforward uncertainty quantification mechanism akametalu2014reachability ; berkenkamp2016safe ; berkenkamp2017safe ; fisac2018general ; khalil . However, GPs are sensitive to model selection. Safety constraints can be violated if the model (i.e., the kernel) is mis-specified or the hyperparameters are not well-tuned. For instance, when equipped with a kernel that is overly optimistic about its ability to extrapolate, GP-based exploration can choose overly aggressive behaviors, which is highly undesirable for safety-critical tasks.

Our Contributions. In this paper, we propose a deep robust regression approach for safe exploration in model-based control. We take the perspective that each episode of exploration can be viewed as a data shift problem, i.e., the "test" data along the proposed exploratory trajectory come from a shifted distribution compared to the current training set. Our approach learns to quantify uncertainty under such covariate shift, which we then use to learn robust dynamics models and quantify uncertainty over entire trajectories for safe exploration. In the drone example above, we would aim to gradually increase the landing speed by choosing controls that execute higher-speed trajectories, while simultaneously reducing the uncertainty of the dynamics model.

Our approach builds upon robust linear regression under covariate shift chen2016robust , which we extend to training deep neural networks. The resulting method learns to directly predict an uncertainty bound by minimizing a loss defined relative to a base distribution using a minimax approach, where the adversarial player is a worst-case competing probabilistic estimator constrained by source data samples. We analyze learning performance from both generalization and data-perturbation perspectives, which reveals a relation between learning errors on the target data and prediction variances on the source. We utilize spectrally normalized neural networks bartlett2017spectrally ; shi2018neural , which guarantee certain stability properties, and derive corresponding bounds in the robust regression framework. Our results on robust regression are of independent interest beyond our application to control.

We integrate our robust regression analysis to address exploration in model-based control, and derive safety and stability bounds on control performance when learning robust dynamics models. We propose a novel safe exploration algorithm that guarantees safety during control deployment, and also derive guarantees of convergence to the optimal dynamics estimator and controller. We empirically show that our approach outperforms conventional GP-based safe exploration with much less tuning effort in two scenarios: (a) inverted pendulum trajectory tracking; and (b) fast drone landing using an aerodynamics simulation based on real-world flight data shi2018neural .

2 Robust Regression

2.1 Robust regression under covariate shifts

Covariate shift refers to distribution shift caused only by the input distribution $P(x)$, while the conditional output distribution $P(y|x)$ is unchanged. This assumption is valid in many cases, especially in dynamical systems. In our motivating safe-landing example, there is a universal "true" aerodynamics model, but we typically only observe training data from a small part of the state space. We now briefly recap the original robust regression model. The method robustly minimizes a relative loss, defined as the difference in conditional log-loss between an estimator $\hat{P}(y|x)$ and a baseline conditional distribution $P_0(y|x)$ on the target distribution $P_{\mathrm{trg}}(x)$. This loss essentially measures the amount of expected "surprise" in modeling the true data distribution that comes from using $\hat{P}$ instead of $P_0$:

$$\min_{\hat{P}(y|x)} \max_{\check{P}(y|x)} \; \mathbb{E}_{x \sim P_{\mathrm{trg}},\, y \sim \check{P}(y|x)} \big[ -\log \hat{P}(y|x) + \log P_0(y|x) \big].$$

We constrain the adversary's conditional probability $\check{P}(y|x)$, which plays the role of the true conditional distribution, to satisfy certain statistical properties of the source data distribution $P_{\mathrm{src}}(x)$: $\mathbb{E}_{x \sim P_{\mathrm{src}},\, y \sim \check{P}(y|x)}[\Phi(x,y)] = \tilde{c}$, where $\tilde{c}$ is a vector of statistics measured from the source data. We then seek the regression model that is robust to the "most surprising" distribution that can arise from covariate shift. The solution of this problem has the parametric form

$$\hat{P}(y|x) \propto P_0(y|x)\, \exp\!\Big( \tfrac{P_{\mathrm{src}}(x)}{P_{\mathrm{trg}}(x)}\, \theta^\top \Phi(x,y) \Big),$$

with parameters $\theta$ obtained by maximum conditional log-likelihood estimation with respect to the target distribution.

2.2 Robust regression with deep neural networks

To apply the ideas in Section 2.1 to deep networks, we observe that one can directly compute the gradient of the original problem, since it involves only the source training samples and requires no regularization.

  Input: Source data points $\{(x_i, y_i)\}_{i=1}^n$, density ratios $\{r(x_i)\}_{i=1}^n$, DNN $\varphi$ with initialization, DNN SGD optimizer, learning rate $\eta$, regularizer, epoch number $T$.
  $\theta_1, \theta_2 \leftarrow$ random initialization, epoch $\leftarrow 0$
  While epoch $< T$
    For each mini-batch
      Evaluate $\varphi(x_i)$
      Compute $\hat{\mu}(x_i)$ with Eq. 1;
      Compute $\hat{\sigma}^2(x_i)$ with Eq. 2.
      Compute $\nabla_{\theta_2}$ with Eq. 4;
      Compute $\nabla_{\theta_1}$ with Eq. 3.
      $\theta_1 \leftarrow \theta_1 + \eta \nabla_{\theta_1}$;
      $\theta_2 \leftarrow \theta_2 + \eta \nabla_{\theta_2}$;
      Back-propagate the gradients through the network.
    epoch $\leftarrow$ epoch $+ 1$
  Output: Trained NN $\varphi$ and $\theta_1, \theta_2$
Algorithm 1 Stochastic Gradient Descent for Deep Robust Regression under Covariate Shift

This sidesteps the need to explicitly evaluate the objective function (which we cannot do without ground truth on the target distribution) by leveraging structural properties of the linear minimax problem. It is, however, a non-standard gradient computation in deep learning, so we provide a derivation that is straightforward to implement using deep learning packages. We also present new theoretical results in Section 2.3.

If we utilize a quadratic feature function on the last layer of the network and incorporate additional parameters $\theta_1, \theta_2$, we obtain a Gaussian distribution with the following mean and variance for the predicted output variable:

$$\hat{\mu}(x) = \hat{\sigma}^2(x) \Big( \frac{\mu_0(x)}{\sigma_0^2(x)} + r(x)\, \theta_1^\top \varphi(x) \Big), \tag{1}$$
$$\hat{\sigma}^2(x) = \Big( \frac{1}{\sigma_0^2(x)} - 2\, r(x)\, \theta_2 \Big)^{-1}, \tag{2}$$

where $\mu_0(x)$ and $\sigma_0^2(x)$ are the mean and variance of the base distribution $P_0(y|x)$, $\varphi(x)$ is the output of the deep neural network, the corresponding quadratic feature function is $\Phi(x,y) = [\varphi(x)\, y;\; y^2]$, and $r(x) = P_{\mathrm{src}}(x)/P_{\mathrm{trg}}(x)$ is the density ratio of a data point. Note that we regard $r(x)$ as a density ratio estimated independently of this framework. The gradients for the additional parameters then take a moment-matching form:

$$\nabla_{\theta_1} = \frac{1}{n} \sum_{i=1}^n \varphi(x_i)\big( y_i - \hat{\mu}(x_i) \big), \tag{3}$$
$$\nabla_{\theta_2} = \frac{1}{n} \sum_{i=1}^n \big( y_i^2 - \hat{\mu}(x_i)^2 - \hat{\sigma}^2(x_i) \big). \tag{4}$$

We then back-propagate these gradients to the variables in the NN. Algorithm 1 shows a stochastic gradient descent procedure for learning both the parameters in the NN and $\theta_1, \theta_2$.
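To make the computation concrete, below is a minimal NumPy sketch of Algorithm 1 for scalar outputs, with a fixed feature matrix standing in for the DNN output $\varphi(x)$. It assumes the reconstructed forms of Eqs. 1-4 above (natural-parameter combination with the base distribution, and moment-matching gradients), so it should be read as an illustration rather than the authors' implementation.

```python
import numpy as np

def predict(Phi, r, theta1, theta2, mu0=0.0, sigma0_sq=1.0):
    """Predictive mean/variance per the reconstructed Eqs. 1-2."""
    inv_var = 1.0 / sigma0_sq - 2.0 * r * theta2   # theta2 < 0 keeps this positive
    var = 1.0 / inv_var
    mean = var * (mu0 / sigma0_sq + r * (Phi @ theta1))
    return mean, var

def train(Phi, y, r, lr=1e-2, epochs=2000):
    """Gradient ascent on theta1, theta2 using only source samples (Eqs. 3-4)."""
    n, d = Phi.shape
    theta1, theta2 = np.zeros(d), -1e-3            # negative init keeps variance finite
    for _ in range(epochs):
        mean, var = predict(Phi, r, theta1, theta2)
        g1 = Phi.T @ (y - mean) / n                # moment match on y * phi(x)
        g2 = np.mean(y**2 - mean**2 - var)         # moment match on y^2
        theta1 += lr * g1
        theta2 = min(theta2 + lr * g2, -1e-6)      # clamp to keep variance positive
    return theta1, theta2

rng = np.random.default_rng(0)
Phi = rng.uniform(-1, 1, size=(200, 3))            # stand-in for DNN features phi(x)
y = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)
r = np.ones(200)                                   # density ratio P_src / P_trg
theta1, theta2 = train(Phi, y, r)
print(predict(Phi[:3], r[:3], theta1, theta2))
```

Note that away from the training data the density ratio $r(x)$ approaches zero, and the prediction in `predict` reverts to the base distribution, which is exactly the behavior discussed around Figure 2.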

Figure 2: Robust Regression with 3 data points

Figure 2 shows an intermediate step of using robust regression to explore a stateless space. Intuitively, the method produces less certain predictions where there is less training-data support. The density ratio and base distribution determine the uncertainty, and uncertainty is only reduced where training data samples exist. Moreover, when the mean estimator is inaccurate, the variance estimate grows to compensate for the possible prediction error. We refer the reader to Appendix A.2.6 for a more detailed discussion of the stateless case.

2.3 Learning performance analysis

The learning performance of our deep robust regression approach can be analyzed from two perspectives: generalization error under distribution shift and error bound under data perturbation based on Lipschitz continuity. We first establish a general form of the bounds and then derive concrete versions for both linear and deep predictors. The proofs are in the appendix.

Theorem 1.

[Generic generalization and perturbation bounds] Assume $D = \{(x_i, y_i)\}_{i=1}^n$ is a training set with i.i.d. data sampled from $P_{\mathrm{src}}(x)P(y|x)$, $\mathcal{F}$ is a regression function class satisfying $\sup_{f \in \mathcal{F}} \|f\|_\infty \le C$, $\mathfrak{R}_n(\mathcal{F})$ is the Rademacher complexity of $\mathcal{F}$ on $D$, $B$ is an upper bound of the true density ratio $r^*(x) = P_{\mathrm{src}}(x)/P_{\mathrm{trg}}(x) \le B$, the weight estimates satisfy $r(x_i) \le B$, the base distribution variance is $\sigma_0^2$, and $\hat{\sigma}_{\max}^2$ is the upper bound of $\hat{\sigma}^2(x_i)$ among the dimensions of $y$. When learning $\hat{\mu} \in \mathcal{F}$ on $D$, the following generalization error bound holds with probability at least $1 - \delta$:

$$\mathbb{E}_{x \sim P_{\mathrm{trg}}}\big[ (\hat{\mu}(x) - y)^2 \big] \le \frac{1}{n} \sum_{i=1}^n r(x_i) \big( \hat{\mu}(x_i) - y_i \big)^2 + 2 B\, \mathfrak{R}_n(\mathcal{F}) + 3 B C^2 \sqrt{\frac{\log(2/\delta)}{2n}}.$$

If we assume that target data samples $x_{\mathrm{trg}}$ stay in a ball of diameter $\Delta$ around the source data $x_{\mathrm{src}}$, the true function $f$ is Lipschitz continuous with constant $L$, and the robust regression mean estimator $\hat{\mu}$ is also Lipschitz continuous with constant $L_{\hat{\mu}}$, then

$$|f(x_{\mathrm{trg}}) - \hat{\mu}(x_{\mathrm{trg}})| \le |f(x_{\mathrm{src}}) - \hat{\mu}(x_{\mathrm{src}})| + (L + L_{\hat{\mu}})\, \Delta. \tag{5}$$

We can further upper bound the Rademacher complexity if we know the function class.

Corollary 1.

[The linear case] If $\mathcal{F}$ is a linear function class with $\|w\| \le W$, i.e. we only use linear features $\varphi(x) = x$ for $\hat{\mu}$, and $\|x\| \le X$, the following holds with probability at least $1 - \delta$:

$$\mathbb{E}_{x \sim P_{\mathrm{trg}}}\big[ (\hat{\mu}(x) - y)^2 \big] \le \frac{1}{n} \sum_{i=1}^n r(x_i) \big( \hat{\mu}(x_i) - y_i \big)^2 + \frac{2 B X W}{\sqrt{n}} + 3 B C^2 \sqrt{\frac{\log(2/\delta)}{2n}}.$$

The corresponding perturbation bound for a linear function class is

$$|f(x_{\mathrm{trg}}) - \hat{\mu}(x_{\mathrm{trg}})| \le |f(x_{\mathrm{src}}) - \hat{\mu}(x_{\mathrm{src}})| + (L + W)\, \Delta. \tag{6}$$

For deep robust regression, we utilize spectrally normalized deep neural networks bartlett2017spectrally and upper bound the Rademacher complexity using the bounded spectral complexity bartlett2017spectrally .

Corollary 2.

[The neural network case] Let the neural network use fixed nonlinearities $(\sigma_1, \dots, \sigma_L)$, where $\sigma_i$ is $\rho_i$-Lipschitz and $\sigma_i(0) = 0$. Let reference matrices $(M_1, \dots, M_L)$ be given, as well as spectral norm bounds $(s_1, \dots, s_L)$ and matrix norm bounds $(b_1, \dots, b_L)$. For robust regression using a network $\varphi$ with weight matrices $(A_1, \dots, A_L)$, where the maximum dimension of each layer is at most $W$ and the matrices obey $\|A_i\|_\sigma \le s_i$ and $\|A_i^\top - M_i^\top\|_{2,1} \le b_i$, the following holds with probability at least $1 - \delta$:

$$\mathbb{E}_{x \sim P_{\mathrm{trg}}}\big[ (\hat{\mu}(x) - y)^2 \big] \le \frac{1}{n} \sum_{i=1}^n r(x_i) \big( \hat{\mu}(x_i) - y_i \big)^2 + \tilde{O}\Big( \frac{B\, \bar{X}\, R_A}{n} \Big) + 3 B C^2 \sqrt{\frac{\log(2/\delta)}{2n}}, \tag{7}$$

where $R_A$ is the spectral complexity of the network, $R_A = \big( \prod_{i=1}^L \rho_i s_i \big) \big( \sum_{i=1}^L b_i^{2/3} / s_i^{2/3} \big)^{3/2}$, and $\bar{X} = \sqrt{\sum_i \|x_i\|_2^2}$. The corresponding perturbation bound for spectrally normalized deep neural networks is

$$|f(x_{\mathrm{trg}}) - \hat{\mu}(x_{\mathrm{trg}})| \le |f(x_{\mathrm{src}}) - \hat{\mu}(x_{\mathrm{src}})| + \Big( L + \prod_{i=1}^L \rho_i s_i \Big) \Delta. \tag{8}$$

3 Control with a Robust Regression Dynamics Estimator

We propose to learn the non-linear dynamics using deep robust regression under covariate shift, and to use the learned model to safely explore, collect data that improve model accuracy, and derive improved control policies. (Robust regression is also applicable to the standard experimental design setting, e.g., Bayesian optimization; we conduct an empirical evaluation of that setting in Appendix A.2.6.) In order to connect learning to control, we need to adapt the general analysis of robust regression to the control context, i.e., to analyze entire trajectory behaviors. Instead of assuming the target data are i.i.d. samples from a static target distribution, which can be too conservative since it requires robustness over a large target distribution consisting of all possible unseen states, we propose to consider one trajectory at a time as the target data distribution, and to use robust dynamics regression to establish safety and stability guarantees. We then enlarge the set of safe trajectories episodically by collecting training data along the executed trajectory and updating the dynamics model accordingly. We now introduce the dynamics model and controller design. Note that all norms in this section are 2-norms. All proofs are in the appendix.

3.1 A mixed model for robotic dynamics

Consider the following mixed model for continuous robotic dynamics (drone and pendulum in our experiments):

$$M(q)\ddot{q} + C(q, \dot{q})\dot{q} + g(q) = B u + f(q, \dot{q}, u), \tag{9}$$

with generalized coordinates $q$ (and their first and second time derivatives $\dot{q}$ and $\ddot{q}$), control input $u$, inertia matrix $M(q)$, centrifugal and Coriolis terms $C(q, \dot{q})$, gravitational forces $g(q)$, actuation matrix $B$, and some unknown residual dynamics $f(q, \dot{q}, u)$. Note that the matrix $C$ is chosen to make $\dot{M} - 2C$ skew-symmetric, following the relationship between the Riemannian metric and Christoffel symbols. Here $f$ is general, potentially capturing both parametric and nonparametric unmodelled terms.

Definition of safety in trajectory tracking.

The state vector is denoted $x = [q; \dot{q}]$. The main objective for a robotic system is to track some time-varying desired trajectory $x_d(t)$. Simultaneously, we want to guarantee safety: $x(t) \in \mathcal{S}$ with high probability, where $\mathcal{S}$ is some safety set. Obviously, the desired trajectory $x_d(t)$ must lie in $\mathcal{S}$. However, because we do not know $f$ a priori, the tracking error may be large enough that $x(t) \notin \mathcal{S}$. Two one-dimensional examples follow.

Figure 3: Illustration of two examples
Example 1 (inverted pendulum with external wind).

In addition to the classical pendulum model, we consider an unknown external wind. The dynamics can be described as $m l^2 \ddot{\theta} - m g l \sin\theta = u + \epsilon(\theta, \dot{\theta})$, where $\epsilon$ is the external torque generated by the unknown wind. Our control goal is to track a desired angle trajectory $\theta_d(t)$, and the safety set is $\mathcal{S} = \{\theta : |\theta| \le \theta_{\max}\}$.

Example 2 (drone landing with ground effect)

For this example, we consider drone landing with an unknown ground effect. The dynamics are $m \ddot{z} = c_T u - m g + f(z, \dot{z})$, where $c_T$ is the thrust coefficient and $f$ is the unknown ground-effect force. The control goal is smooth and quick landing, i.e., quickly driving $z \to 0$. The safety set is $\mathcal{S} = \{(z, \dot{z}) : z = 0 \Rightarrow |\dot{z}| \le \dot{z}_{\max}\}$, i.e., the drone cannot hit the ground at high velocity.
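For concreteness, the following is a minimal simulation sketch of the two examples. The masses, coefficients, and the stand-in residual models (quadratic drag for the wind torque, exponential lift for the ground effect) are illustrative assumptions, not the paper's trained simulator.

```python
import numpy as np

def pendulum_step(theta, omega, u, dt=0.01, m=1.0, l=1.0, g=9.81):
    """One Euler step of the pendulum with an unknown wind torque eps.
    Assumed form, matching the example: m l^2 theta'' = m g l sin(theta) + u + eps."""
    eps = -0.3 * omega * abs(omega)                # stand-in quadratic air-drag torque
    alpha = (m * g * l * np.sin(theta) + u + eps) / (m * l**2)
    return theta + dt * omega, omega + dt * alpha

def drone_step(z, v, u, dt=0.01, m=1.5, g=9.81, cT=1.0):
    """One Euler step of 1-D drone altitude (z measured upward).
    Assumed form: m z'' = cT u - m g + f(z, v), with a stand-in ground-effect lift."""
    f = 0.2 * cT * u * np.exp(-max(z, 0.0) / 0.5)  # extra lift near the ground
    a = (cT * u - m * g + f) / m
    return z + dt * v, v + dt * a

# Near the ground, the same hover thrust produces extra upward acceleration:
print(drone_step(0.05, -0.3, 1.5 * 9.81))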

3.2 Model based nonlinear control

We design a nonlinear controller that can leverage the robust regression of $f$ in Eq. 9. Define the reference trajectory as $\dot{q}_r = \dot{q}_d - \Lambda \tilde{q}$, where $\tilde{q} = q - q_d$, and the composite variable as $s = \dot{q} - \dot{q}_r = \dot{\tilde{q}} + \Lambda \tilde{q}$, where $\Lambda$ is positive definite. The control objective is to drive $s$ to $0$ or a small error ball in the presence of bounded uncertainty. With the robust estimate $\hat{f}$ of $f$, we propose the following nonlinear controller:

$$u = B^{+}\big( M(q)\ddot{q}_r + C(q, \dot{q})\dot{q}_r + g(q) - \hat{f} - K s \big), \tag{10}$$

where $K$ is a positive definite matrix, and $B^{+}$ denotes the Moore-Penrose pseudoinverse of $B$.

With the control law in Eq. 10, we have the following closed-loop dynamics:

$$M(q)\dot{s} + \big( C(q, \dot{q}) + K \big) s = \epsilon(q, \dot{q}, u), \tag{11}$$

where $\epsilon = f - \hat{f}$ is the approximation error between $f$ and $\hat{f}$.
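As an illustration, here is a minimal sketch of the controller in Eq. 10 specialized to the scalar pendulum example, assuming scalar gains (so $\Lambda$ and $K$ reduce to positive numbers `lam` and `k`) and using the learned robust-regression mean as `f_hat` in place of the unknown residual:

```python
import numpy as np

def tracking_control(q, dq, q_d, dq_d, ddq_d, f_hat,
                     m=1.0, l=1.0, g=9.81, lam=2.0, k=5.0):
    """Eq. 10 specialized to the scalar pendulum (B = 1, so B^+ = 1).
    Reference velocity dq_r = dq_d - lam * (q - q_d); composite s = dq - dq_r."""
    dq_r = dq_d - lam * (q - q_d)
    ddq_r = ddq_d - lam * (dq - dq_d)
    s = dq - dq_r
    # u = M ddq_r - m g l sin(q) - f_hat - K s, with M = m l^2 and the gravity
    # term entering with the sign of the pendulum dynamics used above.
    return m * l**2 * ddq_r - m * g * l * np.sin(q) - f_hat(q, dq) - k * s

# e.g. with the robust-regression mean standing in as f_hat:
u = tracking_control(0.1, 0.0, 0.0, 0.0, 0.0, f_hat=lambda q, dq: 0.0)
```

Substituting this $u$ into the example pendulum dynamics recovers the closed-loop form of Eq. 11 with $M = m l^2$ and $C = 0$.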

3.3 Stability analysis and trajectory bound

To connect learning with control, we set $\epsilon$ to correspond with the bounds in Section 2.3. The first option is to connect to Theorem 1, where the target data consist of only a single trajectory that deviates from those in the training data, which means the density ratio bound is further constrained. The second option is to use the perturbation bound, where the error grows with the distance from the source data. We omit rewriting the bounds here and refer to Appendix A.1.1, but emphasize that $\|\epsilon\|$ is upper bounded with high probability when we define the target data on a specific set and use robust regression for learning the dynamics. We show that $\|x - x_d\|$ (the Euclidean distance between the desired trajectory and the real trajectory) is bounded when the error of the dynamics estimate is bounded. Again, recall that $x$ is our state and $x_d$ is the desired trajectory.

Theorem 2.

Suppose $x$ is in some compact set $\mathcal{X}$, and $\sup_{x \in \mathcal{X}} \|\epsilon\| = \epsilon_m$. Then $s$ will exponentially converge to the following ball: $\{s : \|s\| \le b\}$, where

$$b = \frac{\epsilon_m}{\lambda_{\min}(K)}. \tag{12}$$
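For intuition, the following reconstructs the standard Lyapunov argument behind Theorem 2 using the composite variable $s$ (a sketch; the exact constants are derived in the appendix):

```latex
\begin{align*}
V &= \tfrac{1}{2}\, s^\top M(q)\, s, \\
\dot V &= s^\top M \dot s + \tfrac{1}{2}\, s^\top \dot M s
       = s^\top\!\big(\epsilon - (C + K)\,s\big) + \tfrac{1}{2}\, s^\top \dot M s \\
      &= s^\top \epsilon - s^\top K s
        \quad\text{(by skew-symmetry of } \dot M - 2C\text{)} \\
      &\le -\lambda_{\min}(K)\,\|s\|^2 + \epsilon_m \|s\|,
\end{align*}
```

so $\dot{V} < 0$ whenever $\|s\| > \epsilon_m / \lambda_{\min}(K)$, which gives exponential convergence of $s$ to the stated ball; the bound on $\|x - x_d\|$ then follows from $s = \dot{\tilde{q}} + \Lambda \tilde{q}$.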
  Input: Pool of desired trajectories $\{x_d^{(a)}\}$ with parameter $a$, cost function $c(\cdot)$, robust regression model of dynamics $\hat{f}$, controller $u$ (Eq. 10), safety set $\mathcal{S}$, base distribution $P_0$, parameter $\lambda$
  Dynamics model $\hat{f}_0 \leftarrow$ nominal model
  Training set $\leftarrow \emptyset$, $k \leftarrow 0$
  While not converged
    Safe trajectory list $\leftarrow \emptyset$
    For each desired trajectory $x_d^{(a)}$ in the pool
      Predict $\hat{\mu}(x)$, $\hat{\sigma}(x)$ along $x_d^{(a)}$
      $\epsilon_m \leftarrow \lambda \max_x \hat{\sigma}(x)$;
      If the worst-case tracking trajectory (Theorem 2) lies in $\mathcal{S}$
        Add $x_d^{(a)}$ to the safe trajectory list
    Track the lowest-cost safe trajectory $x_d^{(a_k)}$ to collect data using controller $u$
    Add the collected data to the training set
    Train dynamics model $\hat{f}_{k+1}$, $k \leftarrow k + 1$
  Output: dynamics model $\hat{f}_k$, last desired trajectory $x_d^{(a_k)}$ and actual trajectory $x$
Algorithm 2 Safe Exploration for Control using Robust Dynamics Estimation

3.4 Safe exploration algorithm

Theorem 2 indicates that if we can design a compact set and find the corresponding maximum error bound $\epsilon_m$ on it, we can decide whether a trajectory in this set is safe by checking whether its worst-case possible tracking trajectory lies in the safety set $\mathcal{S}$.

In practice, we design a pool of desired trajectories and use the current predictor of the dynamics to find the worst-case possible tracking trajectory for each desired one. Note that the worst-case tracking trajectories can be computed by generating a "tube" using the Euclidean distance bound in Theorem 2. We then eliminate the unsafe trajectories and choose the most "aggressive" one in terms of the primary objective function for the next iteration. Instead of evaluating the actual upper bound, we use $\lambda \hat{\sigma}$ as an approximation of $\epsilon_m$, since the error is guaranteed to be within $\lambda \hat{\sigma}$ with high probability as long as the prediction is a Gaussian distribution and the true function is drawn from the same distribution. We refer to Appendix A.1.2 for a detailed discussion. Algorithm 2 describes this safe exploration procedure.
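A minimal sketch of the per-episode trajectory-selection step is below; the helper names (`dyn_model`, `tube_radius_fn`, `in_safety_set`) are hypothetical placeholders for the components described above.

```python
def select_trajectory(pool, dyn_model, lam, tube_radius_fn, in_safety_set, cost):
    """Pick the lowest-cost desired trajectory whose worst-case tube stays safe.

    pool           : list of candidate desired trajectories (sequences of states)
    dyn_model      : maps a state to (mean, std) of the residual dynamics
    tube_radius_fn : maps the max dynamics-error bound eps_m to a tracking-error
                     radius, per Theorem 2 (e.g. eps_m / lambda_min(K), scaled
                     through Lambda)
    """
    safe = []
    for traj in pool:
        # Worst-case dynamics error along the trajectory: lam * sigma_hat.
        eps_m = max(lam * dyn_model(x)[1] for x in traj)
        radius = tube_radius_fn(eps_m)
        # Certify the trajectory only if every point of its tube is safe.
        if all(in_safety_set(x, radius) for x in traj):
            safe.append(traj)
    if not safe:
        return None                       # fall back to the nominal controller
    return min(safe, key=cost)            # most "aggressive" = lowest cost
```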

3.5 Convergence analysis

We show that, using Algorithm 2, we are able to reach optimality in learning the dynamics model, i.e., converge to the optimal predictor in the function class.

Theorem 3.

If there exists an optimal predictor $f^*$ in function class $\mathcal{F}$, with mean and variance estimates $\mu^*$ and $\sigma^{*2}$, and the sequence of estimators from robust regression at each step consists of $\hat{\mu}_k$ and $\hat{\sigma}_k^2$, then for any $\epsilon > 0$ the output of Algorithm 2 converges to $f^*$, with $T$ the smallest integer that satisfies the following with probability at least $1 - \delta$:

(13)

Given the convergence of the dynamics model, we can prove that the optimal desired trajectory is eventually collected and tracked with good control performance.

Corollary 3.

If there exists an optimal trajectory parameter $a^*$ that the controller can track safely and that obtains the minimal cost among all safe trajectories once the estimated dynamics model satisfies $\|\hat{f} - f\| \le \epsilon$ for all $x$, then $x_d^{(a^*)}$ is collected by Algorithm 2 and tracked with $\|x - x_d^{(a^*)}\| \le b$ for all $t$, with $b$ as in Theorem 2, where $x$ is the tracking trajectory using $\hat{f}$ as the dynamics estimate.

4 Experiments

Figure 4: Top Row. Results on the pendulum task: (a) the first, (b) the third, and (c) the fifth iteration phase portrait of angle and angular velocity, dashed line shows the worst-case possible trajectory in tracking, according to Theorem 2; heatmap shows the prediction of unknown dynamics (the wind); (d) the unknown dynamics ground truth. Bottom Row. Results on the drone landing task: (e) the first, (f) the fifth, and (g) the tenth iteration phase portrait with height and velocity; heatmap shows the prediction of unknown dynamics (the ground effect); (h) the ground effect ground truth.

We conduct simulation experiments on the inverted pendulum and drone landing examples discussed in Section 3.1. We use kernel density estimation to estimate density ratios. We demonstrate that our approach can reliably and safely converge to optimal behavior. We also compare with a Gaussian process (GP) version of Algorithm 2. In general, we find that it is difficult to tune the kernel parameters, and all GP models underperform compared to our approach.
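Since density ratios are estimated with kernel density estimation, here is a minimal sketch of that step using scikit-learn's `KernelDensity` (the bandwidth and the clipping threshold are assumptions; in practice they should be validated on held-out data):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def density_ratio(X_query, X_src, X_trg, bandwidth=0.2):
    """Estimate r(x) = P_src(x) / P_trg(x) at X_query via two Gaussian KDEs."""
    kde_src = KernelDensity(bandwidth=bandwidth).fit(X_src)
    kde_trg = KernelDensity(bandwidth=bandwidth).fit(X_trg)
    log_r = kde_src.score_samples(X_query) - kde_trg.score_samples(X_query)
    return np.exp(np.clip(log_r, -10.0, 10.0))   # clip to avoid extreme weights

rng = np.random.default_rng(0)
X_src = rng.normal(0.0, 1.0, size=(500, 2))   # states visited so far
X_trg = rng.normal(0.8, 1.0, size=(500, 2))   # states along the proposed trajectory
r = density_ratio(X_src, X_src, X_trg)        # weights at the training points
print(r.mean(), r.min(), r.max())
```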

Example 1 (inverted pendulum with external wind).

Recall that the safety set in the pendulum case is $\mathcal{S} = \{\theta : |\theta| \le \theta_{\max}\}$, and the final control goal is to track a desired angle trajectory $\theta_d(t)$. Our desired trajectory pool therefore consists of angle trajectories of increasing amplitude $a$. The ground-truth wind comes from a quadratic air-drag model. We use the angle upper bound of a trajectory as the reward function for choosing the "most aggressive" trajectory. Figure 4 demonstrates the exploration process with the selected desired trajectories, the worst-case tracking trajectory under the current dynamics model, the tracking trajectories under the ground-truth unknown dynamics, and the actual tracking trajectories. Here we use a Gaussian base distribution to start with. As shown in Figure 4 (a) to (c), the algorithm selects small amplitudes $a$ to guarantee safety at the beginning, and gradually is able to select larger values and track them with small error.

Example 2 (drone landing with ground effect)

Recall that the safety set is $\mathcal{S} = \{(z, \dot{z}) : z = 0 \Rightarrow |\dot{z}| \le \dot{z}_{\max}\}$, which means the drone cannot hit the ground at high velocity. Our desired trajectory pool consists of trajectories in which the drone smoothly moves from its initial height to a desired height; if the desired height reaches the ground, the drone lands successfully. Note that a greater descent speed means faster landing. We use a smaller landing time as the reward function that determines the next "aggressive" trajectory. The ground-truth aerodynamics in landing come from a dynamics simulator trained in shi2018neural , where the residual dynamics $f$ is a four-layer ReLU neural network trained on real flight data. Here we use a Gaussian base distribution for robust regression. Results in Figure 4 (e) to (g) demonstrate that, because of the unknown aerodynamics $f$, aggressive descents may not be safe at the beginning. Starting from conservative desired trajectories, safe exploration using robust regression is able to track more aggressive desired trajectories with higher descent speeds while staying safe.

Figure 5: Comparison with GPs

Comparison with GPs. We examine drone landing time here, and defer the simpler pendulum setting to the appendix. We compare against five GP models with a wide range of kernel parameters, including ones that are optimistic and ones that are conservative about their prediction uncertainty, obtained by setting different bandwidths in the RBF kernel. Figure 5 shows that our approach outperforms all GP models. Modeling the ground effect is notoriously challenging shi2018neural , and the GPs suffer from model mis-specification. In contrast, our approach can fit general non-linear function estimators such as deep neural networks adaptively to the available data, which leads to a more flexible inductive bias and better fitting of both the data and the uncertainty. Additional results for the drone landing setting as well as the inverted pendulum are in Appendix A.5. The supplemental material also contains a video demoing the results.

5 Related Work

Safe Exploration. Safe exploration methods commonly use Gaussian processes (GPs) to quantify uncertainty sui2015safe ; sui2018stagewise ; kirschner2019adaptive ; akametalu2014reachability ; berkenkamp2016safe ; turchetta2016safe ; wachi2018safe ; berkenkamp2017safe ; fisac2018general ; khalil . These methods are related to bandit algorithms bubeck2012regret and typically employ upper confidence bounds  auer2002using to balance exploration versus exploitation srinivas2009gaussian . As discussed above, GP-based approaches can be sensitive to model selection. One could blend GP-based modeling with general function approximations (such as deep learning) berkenkamp2017safe ; cheng2019end , but the resulting optimization-based control problem can be challenging to solve. Other approaches either require having a safety model pre-specified upfront alshiekh2018safe , are restricted to relatively simple models moldovan2012safe , have no convergence guarantees during learning taylor2019episodic , or have no guarantees at all garcia2012safe . Our work also shares some similarity with snoek2015scalable ; brookes2019conditioning , which use deep neural networks for sampling-based optimization; however, those approaches have no guarantees and so are unsuitable for safe exploration.

Distribution Shift. The study of data shift has seen increasing interest in recent years, owing to the widespread practical issue that real test distributions rarely match the training distribution. Our work is stylistically similar to liu2015shift ; chen2016robust ; liu2014robust ; liu2017robust , which also frame uncertainty quantification through the lens of covariate shift, although ours is the first to extend to deep neural networks with rigorous guarantees. More broadly, dealing with domain shift is a fundamental challenge in deep learning, as highlighted by their vulnerability to adversarial inputs goodfellow2014explaining , and the implied lack of robustness. Beyond robust estimation, the typical approaches are to either regularize srivastava2014dropout ; wager2013dropout ; simile ; bartlett2017spectrally ; miyato2018spectral ; shi2018neural ; benjamin2019measuring ; corerl or synthesize an augmented dataset that anticipates the domain shift prest2012learning ; zheng2016improving ; stewart2017label . We take the former approach by employing spectral normalization bartlett2017spectrally ; shi2018neural in conjunction with robust estimation.

6 Conclusion & Future Work

In this paper, we propose an algorithmic framework for safe exploration in model-based control. To quantify uncertainty, we develop a robust deep regression method for dynamics estimation. Using robust regression, we explicitly deal with data shift during episodic learning, and in particular can quantify uncertainty over entire trajectories. We prove generalization and perturbation bounds for robust regression, and show how to integrate them with control to derive safety bounds in terms of stability. These bounds explicitly translate the error in dynamics learning into the tracking error in control. From this, we design a safe exploration algorithm based on a finite pool of desired trajectories. We prove that the proposed safe exploration algorithm converges to the optimal dynamics estimator in its function class, as well as the optimal controller for tracking the optimal desired trajectory.

There are many avenues for future work. For instance, our safety criterion was relatively simple, and one could employ richer criteria that require more sophisticated certification approaches such as reachability analysis fisac2018general . Our theoretical analysis can also be improved, since tighter safety guarantees can lead to dramatically improved performance. Directions to explore include incorporating other regularization techniques, relaxing the Gaussian observation noise assumption, and incorporating more sophisticated density estimation techniques. Another interesting direction is to incorporate deep kernel learning wilson2016deep for safe exploration, which makes stronger assumptions (it uses a Gaussian process model) but might alleviate some issues with kernel parameter tuning. Finally, our deep robust regression approach is of independent interest beyond model-based control, and can be incorporated in other applications as well.

References

  • (1) Ruben Martinez-Cantin. Bayesopt: A bayesian optimization library for nonlinear optimization, experimental design and bandits. The Journal of Machine Learning Research, 15(1):3735–3739, 2014.
  • (2) Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
  • (3) Anayo K Akametalu, Jaime F Fisac, Jeremy H Gillula, Shahab Kaynama, Melanie N Zeilinger, and Claire J Tomlin. Reachability-based safe learning with gaussian processes. In 53rd IEEE Conference on Decision and Control, pages 1424–1431. IEEE, 2014.
  • (4) Felix Berkenkamp, Angela P Schoellig, and Andreas Krause. Safe controller optimization for quadrotors with gaussian processes. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 491–496. IEEE, 2016.
  • (5) Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.
  • (6) Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 2018.
  • (7) Hassan Khalil and Jessy Grizzle. Nonlinear systems. Prentice hall, 2002.
  • (8) Xiangli Chen, Mathew Monfort, Anqi Liu, and Brian D Ziebart. Robust covariate shift regression. In Artificial Intelligence and Statistics, pages 1270–1279, 2016.
  • (9) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
  • (10) Guanya Shi, Xichen Shi, Michael O’Connell, Rose Yu, Kamyar Azizzadenesheli, Animashree Anandkumar, Yisong Yue, and Soon-Jo Chung. Neural lander: Stable drone landing control using learned dynamics. International Conference on Robotics and Automation (ICRA), 2019.
  • (11) Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with gaussian processes. In International Conference on Machine Learning, pages 997–1005, 2015.
  • (12) Yanan Sui, Vincent Zhuang, Joel W Burdick, and Yisong Yue. Stagewise safe bayesian optimization with gaussian processes. In International Conference on Machine Learning (ICML), 2018.
  • (13) Johannes Kirschner, Mojmír Mutnỳ, Nicole Hiller, Rasmus Ischebeck, and Andreas Krause. Adaptive and safe bayesian optimization in high dimensions via one-dimensional subspaces. In International Conference on Machine Learning (ICML), 2019.
  • (14) Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite markov decision processes with gaussian processes. In Advances in Neural Information Processing Systems, pages 4312–4320, 2016.
  • (15) Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. Safe exploration and optimization of constrained mdps using gaussian processes. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • (16) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • (17) Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • (18) Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.
  • (19) Richard Cheng, Gábor Orosz, Richard M Murray, and Joel W Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Conference on Artificial Intelligence (AAAI), 2019.
  • (20) Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • (21) Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in markov decision processes. In International Conference on Machine Learning (ICML), 2012.
  • (22) Andrew J Taylor, Victor D Dorobantu, Hoang M Le, Yisong Yue, and Aaron D Ames. Episodic learning with control lyapunov functions for uncertain robotic systems. arXiv preprint arXiv:1903.01577, 2019.
  • (23) Javier Garcia and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45:515–564, 2012.
  • (24) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180, 2015.
  • (25) David H Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In International Conference on Machine Learning (ICML), 2019.
  • (26) Anqi Liu, Lev Reyzin, and Brian D Ziebart. Shift-pessimistic active learning using robust bias-aware prediction. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • (27) Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in neural information processing systems, pages 37–45, 2014.
  • (28) Anqi Liu and Brian D Ziebart. Robust covariate shift prediction with general losses and feature views. arXiv preprint arXiv:1712.10043, 2017.
  • (29) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • (30) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • (31) Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. In Advances in neural information processing systems, pages 351–359, 2013.
  • (32) Hoang M. Le, Andrew Kang, Yisong Yue, and Peter Carr. Smooth imitation learning for online sequence prediction. In International Conference on Machine Learning (ICML), 2016.
  • (33) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
  • (34) Ari S Benjamin, David Rolnick, and Konrad Kording. Measuring and regularizing networks in function space. In International Conference on Learning Representations (ICLR), 2019.
  • (35) Richard Cheng, Abhinav Verma, Gabor Orosz, Swarat Chaudhuri, Yisong Yue, and Joel Burdick. Control regularization for reduced variance reinforcement learning. In International Conference on Machine Learning (ICML), 2019.
  • (36) Alessandro Prest, Christian Leistner, Javier Civera, Cordelia Schmid, and Vittorio Ferrari. Learning object class detectors from weakly annotated video. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3282–3289. IEEE, 2012.
  • (37) Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the ieee conference on computer vision and pattern recognition, pages 4480–4488, 2016.
  • (38) Russell Stewart and Stefano Ermon. Label-free supervision of neural networks with physics and domain knowledge. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • (39) Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.

Appendix A

A.1 Additional Theoretical Results

A.1.1 Improved Bounds for Control

As explained in the paper, we can further improve the learning bounds in the control context when we choose the target data strategically. In Theorem 1, $B$ is the upper bound of the true density ratio between the two distributions, which can potentially be very large when the target data are very different from the source. However, in practice we can choose our next trajectory to be one that does not deviate too much from the source data, thereby further constraining $B$ and the ball diameter $\Delta$ in Theorem 1. We can rewrite the theorem as:

Theorem 4.

[Improved generalization and perturbation bounds in general cases] Assume $D = \{(x_i, y_i)\}_{i=1}^n$ is a training set with i.i.d. data sampled from $P_{\mathrm{src}}(x)P(y|x)$, $\mathcal{F}$ is the function class of the mean estimator in robust regression satisfying $\sup_{f \in \mathcal{F}} \|f\|_\infty \le C$, $\mathfrak{R}_n(\mathcal{F})$ is the Rademacher complexity of $\mathcal{F}$ on $D$, the true density ratio on the chosen trajectory is bounded by $B' \le B$, the weight estimates satisfy $r(x_i) \le B'$, the base distribution variance is $\sigma_0^2$, and $\hat{\sigma}_{\max}^2$ is the upper bound of $\hat{\sigma}^2(x_i)$ among the dimensions of $y$. Then the generalization error bound of Theorem 1 holds with probability at least $1 - \delta$ with $B$ replaced by the smaller $B'$.

If we assume the target data samples $x_{\mathrm{trg}}$ stay in a ball of diameter $\Delta' \le \Delta$ around the source data, the true function $f$ is Lipschitz continuous with constant $L$, and the robust regression mean estimator is also Lipschitz continuous with constant $L_{\hat{\mu}}$, then

$$|f(x_{\mathrm{trg}}) - \hat{\mu}(x_{\mathrm{trg}})| \le |f(x_{\mathrm{src}}) - \hat{\mu}(x_{\mathrm{src}})| + (L + L_{\hat{\mu}})\, \Delta'. \tag{14}$$

Note that we could further improve the generalization bound if we knew the method used for estimating the density ratio, and thereby relate the overall learning performance to the density ratio estimation. Here, we simply treat the ratio bound $B$ as a value that is given to us beforehand.

A.1.2 High-Probability Bounds for the Gaussian Distribution

In Algorithm 2, we use $\lambda \hat{\sigma}$ as our approximation of the learning error from robust regression instead of measuring the actual learning upper bound, which is hard to evaluate. Here we give the justification.

If the prediction from robust regression is $\mathcal{N}(\hat{\mu}(x), \hat{\sigma}^2(x))$, and we assume the true function is drawn from the same distribution, then $(f(x) - \hat{\mu}(x))/\hat{\sigma}(x)$ is a unit normal random variable. For a unit normal $Z \sim \mathcal{N}(0, 1)$, we have $\Pr(|Z| \ge \lambda) \le 2 e^{-\lambda^2/2}$. Therefore, for a data point $x$, $|f(x) - \hat{\mu}(x)| \le \lambda \hat{\sigma}(x)$ holds with probability greater than $1 - 2 e^{-\lambda^2/2}$. We can therefore choose $\lambda$ in practice, and it corresponds to different probabilities in the bounds.
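Written out as a chain (assuming the standard Gaussian tail bound):

```latex
\begin{align*}
f(x) \sim \mathcal{N}\big(\hat\mu(x), \hat\sigma^2(x)\big)
&\;\Longrightarrow\;
Z = \frac{f(x) - \hat\mu(x)}{\hat\sigma(x)} \sim \mathcal{N}(0, 1), \\
\Pr\big(|Z| \ge \lambda\big) \le 2 e^{-\lambda^2/2}
&\;\Longrightarrow\;
\Pr\big(|f(x) - \hat\mu(x)| \le \lambda \hat\sigma(x)\big) \ge 1 - 2 e^{-\lambda^2/2}.
\end{align*}
```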

A.2 Proofs of Theoretical Results

A.2.1 Proof of Theorem 1

Proof.

We first prove the generalization bound using the standard Rademacher complexity argument for regression problems:

$$\mathbb{E}_{x \sim P_{\mathrm{trg}}}\big[ (\hat{\mu}(x) - y)^2 \big] \le \frac{1}{n} \sum_{i=1}^n r(x_i) \big( \hat{\mu}(x_i) - y_i \big)^2 + 2 B\, \mathfrak{R}_n(\mathcal{F}) + 3 B C^2 \sqrt{\frac{\log(2/\delta)}{2n}}, \tag{15}$$

where $\mathfrak{R}_n(\mathcal{F})$ is the Rademacher complexity of the function class of the mean estimate in Eq. 1, and the variance term is the empirical variance of the robust regression model following Eq. 2. This is a data-dependent bound that relies on the training samples.

We next prove the perturbation bound. Assuming $x_{\mathrm{trg}}$ stays in a ball of diameter $\Delta$ around the source training data $x_{\mathrm{src}}$, the true function $f$ is Lipschitz continuous with constant $L$, and the mean function $\hat{\mu}$ of our learned estimator is Lipschitz continuous with constant $L_{\hat{\mu}}$, then by the triangle inequality we have

$$|f(x_{\mathrm{trg}}) - \hat{\mu}(x_{\mathrm{trg}})| \le |f(x_{\mathrm{trg}}) - f(x_{\mathrm{src}})| + |f(x_{\mathrm{src}}) - \hat{\mu}(x_{\mathrm{src}})| + |\hat{\mu}(x_{\mathrm{src}}) - \hat{\mu}(x_{\mathrm{trg}})|. \tag{16}$$

If we have an upper bound for the density ratio and the weight estimation, applying the Lipschitz conditions to the first and last terms above yields the stated forms of the generalization and perturbation bounds:

$$\mathbb{E}_{x \sim P_{\mathrm{trg}}}\big[ (\hat{\mu}(x) - y)^2 \big] \le \frac{1}{n} \sum_{i=1}^n r(x_i) \big( \hat{\mu}(x_i) - y_i \big)^2 + 2 B\, \mathfrak{R}_n(\mathcal{F}) + 3 B C^2 \sqrt{\frac{\log(2/\delta)}{2n}}, \tag{18}$$
$$|f(x_{\mathrm{trg}}) - \hat{\mu}(x_{\mathrm{trg}})| \le |f(x_{\mathrm{src}}) - \hat{\mu}(x_{\mathrm{src}})| + (L + L_{\hat{\mu}})\, \Delta. \tag{19}$$

A.2.2 Proof of Corollary 1

Proof.

If the function class is linear, assuming $\|w\| \le W$ and $\|x\| \le X$, we can further bound the Rademacher complexity of $\mathcal{F}$ when $\hat{\mu}$ is a linear model. Indeed, the mean estimate has the form of Eq. 1 with $\varphi(x) = x$. If the density ratio $r(x)$ is large enough to make the learned term dominate the base term, $\hat{\mu}(x)$ is approximately linear in $x$. When $r(x)$ is very small,