1 Introduction
Deep reinforcement learning has enjoyed great empirical success in various domains, including robotics, personalized recommendations, bidding, advertising and games (Levine et al., 2018; Zheng et al., 2018; Zhao et al., 2018; Silver et al., 2017; Jin et al., 2018). At the backbone of its success is the superior approximation power of deep neural networks, which parameterize complex policy, value, or state-action value functions. However, the high complexity of deep neural networks makes the search space of the learning algorithm prohibitively large. Training therefore often requires a significant amount of data and suffers from numerous difficulties such as overfitting and training instability (Thrun & Schwartz, 1993; Boyan & Moore, 1995; Zhang et al., 2016).
Reducing the size of the search space while maintaining the network’s performance requires special treatment. While one can simply switch to a smaller network, substantial empirical evidence shows that a small network often leads to performance degradation and training difficulties. It is commonly believed that training a sufficiently large network (also known as overparameterization) with suitable regularization (e.g., dropout (Srivastava et al., 2014), orthogonality parameter constraints (Huang et al., 2018)) is the most effective way to adaptively constrain the search space while maintaining the performance benefits of a large network.
For reinforcement learning problems, entropy regularization is one commonly adopted regularization, which is believed to facilitate exploration in the learning process. Yet in the presence of high uncertainty in the environment and large noise, such regularization might yield poor performance. More recently, Pinto et al. (2017) propose robust adversarial reinforcement learning (RARL), which aims to perform well under uncertainty by training the agent to be robust against an adversarially perturbed environment. However, besides yielding only a marginal performance gain, RARL must learn an additional adversarial policy, which makes its updates computationally expensive and less sample efficient than traditional learning algorithms. Cheng et al. (2019), on the other hand, propose a control regularization that enforces the behavior of the deep policy to be similar to a policy prior, yet designing a good prior often requires a significant amount of domain knowledge.
Different from previous works, we propose a new training framework, Smoothness Regularized Reinforcement Learning, for training reinforcement learning algorithms. By promoting smoothness, we effectively reduce the size of the search space when learning the policy network and achieve state-of-the-art sample efficiency. Our goal of promoting smoothness in the policy is motivated by the fact that natural environments with continuous state spaces often have smooth transitions from state to state, which favors a smooth policy: similar states lead to similar actions. As a concrete example, for the MuJoCo environment (Todorov et al., 2012), which is a system governed by physical laws, the optimal policy can be described by a set of differential equations with certain smoothness properties.
Promoting smoothness is particularly important for deep RL, since deep neural networks can be extremely nonsmooth due to their high complexity. It has been observed that small changes in a neural network’s input can result in significant changes in its output. Such nonsmoothness has drawn significant attention in other domains (e.g., image recognition, information security) that use neural networks as part of the decision process (Goodfellow et al., 2014; Kurakin et al., 2016). To train a smooth neural network, we need to employ many hacks in the training process. In the supervised learning setting with i.i.d. data, these hacks include but are not limited to batch normalization (Ioffe & Szegedy, 2015), layer normalization (Ba et al., 2016) and orthogonal regularization (Huang et al., 2018). However, most existing hacks do not work well in the RL setting, where the training data has complex dependencies. As one significant consequence, current reinforcement learning algorithms often lead to undesirable nonsmooth policies.
Our proposed training framework uses a smoothness-inducing regularization to encourage the output of the policy (decision) to not change much when a small perturbation is injected into the input of the policy (observed state). The framework is motivated by local shift sensitivity in the robust statistics literature (Hampel, 1974), which can also be considered a measure of the local Lipschitz constant of the policy. We highlight that the framework is highly flexible and can be readily adopted into various reinforcement learning algorithms. As concrete examples, we apply it to the TRPO algorithm (Schulman et al., 2015), which is an on-policy method, where the regularizer directly penalizes nonsmoothness of the policy. In addition, we apply it to the DDPG algorithm (Lillicrap et al., 2015), which is an off-policy method, where the regularizer penalizes nonsmoothness of either the policy or the state-action value function (also known as the Q-function), from which we can further induce a smooth policy.
Our proposed smoothness-inducing regularizer is related to several existing works (Miyato et al., 2018; Zhang et al., 2019; Hendrycks et al., 2019; Xie et al., 2019; Jiang et al., 2019). These works consider similar regularization techniques, but target other applications with different motivations, e.g., semi-supervised learning, unsupervised domain adaptation and harnessing adversarial examples in image classification.
The rest of the paper is organized as follows: Section 2 introduces the related background; Section 3 introduces our proposed smoothness regularized reinforcement learning framework in detail; Section 4 presents numerical experiments on various MuJoCo environments to demonstrate the superior performance of the framework.
Notations: We let denote the radius ball measured in metric centered at point . We use to denote the identity matrix in dimensional Euclidean space.
2 Background
We consider a Markov Decision Process , in which an agent interacts with an environment in discrete time steps. We let denote the continuous state space, denote the action space, denote the transition kernel, denote the reward function, denote the initial distribution and denote the discount factor. An agent’s behavior is defined by a policy, either stochastic or deterministic. A stochastic policy maps a state to a probability distribution over the action space . A deterministic policy maps a state directly to an action . At each time step, the agent observes its state , takes action , and receives reward . The agent then transitions into the next state with probability given by the transition kernel . The goal of the agent is to find a policy that maximizes the expected discounted reward with discount factor :
One way to solve the above problem is to use classical policy gradient algorithms, which estimate the gradient of the expected reward through trajectory samples and update the parameters of the policy by following the estimated gradient. Policy gradient algorithms are known to suffer from high variance of the estimated gradient, which often leads to aggressive updates and unstable training. To address this issue, numerous variants have been proposed. Below we briefly review two popular ones used in practice.
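As a minimal sketch of how the expected discounted reward can be estimated from trajectory samples, the Python snippet below averages discounted returns over sampled reward sequences. The function names and the list-of-rewards trajectory format are assumptions made for illustration and are not taken from the paper's implementation.

```python
import numpy as np


def discounted_return(rewards, gamma):
    """Discounted sum of rewards along one sampled trajectory."""
    total = 0.0
    # Accumulate backwards so each step needs a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_expected_return(trajectories, gamma):
    """Monte Carlo estimate: average discounted return over trajectories."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return float(np.mean(returns))
```

Because the estimate is an average over a finite number of sampled trajectories, it is itself a random quantity, which is one source of the gradient variance mentioned above.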
2.1 Trust Region Policy Optimization (TRPO)
TRPO iteratively improves a parameterized policy by solving a trust region type optimization problem. Before we describe the algorithm in detail, we need several definitions in place. The value function and the state-action value function are defined by:
with . The advantage function and the discounted state visitation distribution (unnormalized) are defined by:
At the th iteration of TRPO, the policy is updated by solving:
(1) 
where is a tuning parameter for controlling the size of the trust region, and denotes the Kullback-Leibler divergence between two distributions over support . For each update, the algorithm: (i) samples trajectories using the current policy ; (ii) approximates for each state-action pair by taking the discounted sum of future rewards along the trajectory; (iii) replaces the expectation in (1) and by sample approximation, then solves (1) with the conjugate gradient algorithm.
2.2 Deep Deterministic Policy Gradient (DDPG)
DDPG uses the actor-critic architecture, where the agent learns a parameterized state-action value function (also known as the critic) to update the parameterized deterministic policy (also known as the actor).
DDPG uses a replay buffer, which is also used in Deep Q-Network (Mnih et al., 2013). The replay buffer is a finite-sized cache. Transitions are sampled from the environment according to the policy, and the tuple is stored in the replay buffer. When the replay buffer is full, the oldest samples are discarded. At each time step, and are updated by sampling a minibatch uniformly from the buffer.
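The replay buffer described above can be sketched in a few lines of Python. The class name `ReplayBuffer`, its method names, and the `(state, action, reward, next_state)` layout of a stored transition are assumptions for illustration and are not taken from any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Finite-sized cache of transitions with uniform minibatch sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item once full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)
```

Sampling uniformly from a large buffer breaks up the temporal correlation between consecutive transitions, at the cost of reusing data generated by older policies.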
Update of state-action value function. The update of the state-action value function network depends on the deterministic Bellman equation:
(2) 
The expectation depends only on the environment. This means that unlike TRPO, DDPG is an off-policy method, which can use transitions generated from a different stochastic behavior policy denoted as (see Lillicrap et al. (2015) for details). At the th iteration, we update the critic by minimizing the associated mean squared Bellman error of transitions sampled from the replay buffer. Specifically, letting and be a pair of target networks, we set , and then update the critic network:
After both critic and actor networks are updated, we update the target networks by slowly tracking the critic and actor networks:
with .
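A common way to realize such slow tracking, and the one used in Lillicrap et al. (2015), is an exponential moving average of the network parameters. The sketch below assumes parameters are stored as a list of NumPy arrays; the names `soft_update` and `tau` are illustrative.

```python
import numpy as np


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [
        tau * w_online + (1.0 - tau) * w_target
        for w_target, w_online in zip(target_params, online_params)
    ]
```

With a small `tau`, the target networks change slowly, so the regression target used for the critic stays nearly fixed between consecutive updates.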
Update of policy. The policy network is updated by maximizing the value function using policy gradient:
(3) 
Similar to updating the critic, we use the minibatch sampled from the replay buffer to compute the approximate gradient of and perform the update:
3 Method
In this section, we present the smoothness-inducing regularizer in its general form and describe its intuition in detail. We also apply the proposed regularizer to popular reinforcement learning algorithms to demonstrate its adaptability.
3.1 Learning Policy with
We first focus on directly learning a smooth policy with the proposed regularizer. We assume that the state space is continuous, i.e., .
For a fixed state and a policy , the framework encourages the outputs and to be similar, where state is obtained by injecting a small perturbation into state . We assume the perturbation set is an radius ball measured in metric , which is often chosen to be the distance: . To measure the discrepancy between the outputs of a policy, we adopt a suitable metric function denoted by . The nonsmoothness of policy at state is defined in an adversarial manner:
To obtain a smooth policy , we encourage smoothness at each state of the entire trajectory. We achieve this by taking the expectation with respect to the state visitation distribution induced by the policy, and our smoothness-inducing regularizer is defined by:
(4) 
For a stochastic policy , we set the metric to be the Jeffrey’s divergence, and the regularizer takes the form
(5) 
where the Jeffrey’s divergence for two distributions is defined by:
(6) 
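For Gaussian policies with diagonal covariance, a symmetrized KL divergence has a closed form. The sketch below assumes the convention that averages the two KL directions with weights 1/2 (some references omit this factor), and the function names are illustrative rather than taken from the paper's code.

```python
import numpy as np


def kl_diag_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) for Gaussians with diagonal covariance (sigma = std. dev.)."""
    mu_p, sigma_p, mu_q, sigma_q = (
        np.asarray(x, dtype=float) for x in (mu_p, sigma_p, mu_q, sigma_q)
    )
    per_dim = (
        np.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )
    return float(np.sum(per_dim))


def jeffreys_diag_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Symmetrized KL, assuming equal weights of 1/2 on each direction."""
    forward = kl_diag_gaussian(mu_p, sigma_p, mu_q, sigma_q)
    backward = kl_diag_gaussian(mu_q, sigma_q, mu_p, sigma_p)
    return 0.5 * forward + 0.5 * backward
```

When both distributions share the same standard deviation, the two KL directions coincide and the divergence reduces to a scaled squared distance between the means, which is why a symmetric divergence is a natural smoothness penalty for a Gaussian policy.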
For a deterministic policy , we set the metric to be the squared norm of the difference:
(7) 
The smoothness-inducing adversarial regularizer essentially measures the local Lipschitz continuity of policy under the metric . More precisely, we encourage the output (decision) of to not change much if we inject a small perturbation bounded in metric into the state (see Figure 1). Therefore, by adding the regularizer (4) to the policy update, we can encourage to be smooth within the neighborhoods of all states on all possible trajectories with respect to the sampling policy. Such a smoothness-inducing property is particularly helpful for preventing overfitting and for improving sample efficiency and overall training stability.
TRPO with (TRPO-SR). We now apply the proposed smoothness-inducing regularizer to the TRPO algorithm, which is itself an on-policy algorithm. Since TRPO uses a stochastic policy, we use the Jeffrey’s divergence to penalize the discrepancy between decisions for the regular state and the adversarially perturbed state, as suggested in (5).
Specifically, TRPO with the smoothness-inducing regularizer updates the policy by solving the following subproblem at the th iteration:
s.t.  (8) 
3.2 Learning Q-function with Smoothness-inducing Regularization
The proposed smoothness-inducing regularizer can also be used to learn a smooth Q-function, which can in turn be used to generate a smooth policy.
We measure the nonsmoothness of a Q-function at state-action pair by the squared difference of the state-action value between the normal state and the adversarially perturbed state:
To enforce smoothness at every state-action pair, we take the expectation with respect to the entire trajectory, and the smoothness-inducing regularizer takes the form
where denotes the behavior policy for sampling in the off-policy training setting.
DDPG with . We now apply the proposed smoothness-inducing regularizer to the DDPG algorithm, which is itself an off-policy algorithm. Since DDPG uses two networks, the actor network and the critic network, we propose two variants of DDPG, where the regularizer is applied to either the actor or the critic network.
Regularizing the Actor Network (DDPG-SR-A). We can directly penalize the nonsmoothness of the actor network to promote a smooth policy in DDPG. Since DDPG uses a deterministic policy , when updating the actor network, we penalize the squared difference as suggested in (7) and minimize the following objective:
The policy gradient can be written as:
with for .
Regularizing the Critic Network (DDPG-SR-C). Since DDPG simultaneously learns a Q-function (critic network) to update the policy (actor network), inducing smoothness in the critic network can also help us generate a smooth policy. By incorporating the proposed regularizer for penalizing the Q-function, we obtain the following update for inducing a smooth Q-function in DDPG:
where is the minibatch sampled from the replay buffer.
3.3 Solving the Min-max Problem
Adding the smoothness-inducing regularizer to the policy/Q-function update often involves solving a min-max problem. Though the inner max problem is not concave, a simple stochastic gradient algorithm has been shown to solve it efficiently in practice. Below we describe how to perform the update of TRPO-SR, including solving the corresponding min-max problem. The details are summarized in Algorithm 1. We leave the detailed description of DDPG-SR-A and DDPG-SR-C to the appendix.
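To make the inner maximization concrete, the sketch below runs projected gradient ascent on the squared-difference metric for a toy deterministic policy. It is an illustration under stated assumptions, not Algorithm 1: the policy is a hypothetical single-layer `tanh` map whose gradient is written by hand, the perturbation set is assumed to be an l-infinity ball (so projection is a clip), and a random start is used because the gradient vanishes at the unperturbed state.

```python
import numpy as np


def policy(W, b, s):
    """Toy deterministic policy pi(s) = tanh(W s + b) (hypothetical stand-in)."""
    return np.tanh(W @ s + b)


def nonsmoothness(W, b, s, s_tilde):
    """Squared difference between actions at s and at the perturbed state."""
    diff = policy(W, b, s_tilde) - policy(W, b, s)
    return float(np.sum(diff ** 2))


def nonsmoothness_grad(W, b, s, s_tilde):
    """Gradient of the squared difference with respect to s_tilde."""
    a_tilde = policy(W, b, s_tilde)
    diff = a_tilde - policy(W, b, s)
    return 2.0 * W.T @ ((1.0 - a_tilde ** 2) * diff)


def adversarial_state(W, b, s, eps, step_size, n_steps, rng):
    """Approximate the inner max with projected gradient ascent."""
    # Random start inside the ball: the gradient is zero at s_tilde == s.
    s_tilde = s + rng.uniform(-eps, eps, size=s.shape)
    for _ in range(n_steps):
        s_tilde = s_tilde + step_size * nonsmoothness_grad(W, b, s, s_tilde)
        # Projection onto the assumed l-infinity ball around s.
        s_tilde = np.clip(s_tilde, s - eps, s + eps)
    return s_tilde


def smoothness_regularizer(W, b, states, eps, step_size, n_steps, rng):
    """Average adversarial nonsmoothness over a batch of sampled states."""
    values = [
        nonsmoothness(W, b, s, adversarial_state(W, b, s, eps, step_size, n_steps, rng))
        for s in states
    ]
    return float(np.mean(values))
```

In a full min-max update, the perturbed states returned by the inner loop would be held fixed while the policy parameters take a gradient step on the regularized objective.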
4 Experiment
We apply the proposed training framework to two popular reinforcement learning algorithms: TRPO and DDPG. Both algorithms have become standard routines for solving large-scale control tasks and the building blocks of many state-of-the-art reinforcement learning algorithms. For TRPO, we directly learn a smooth policy; for DDPG, we promote smoothness in either the actor (policy) or the critic (Q-function).
4.1 Implementation
Our implementation of the training framework is based on the open source toolkit garage (garage contributors, 2019). We test our algorithms on OpenAI gym (Brockman et al., 2016) control environments with the MuJoCo (Todorov et al., 2012) physics simulator. For all tasks, we use a network of hidden layers, each containing neurons, to parameterize the policy and the Q-function. For fair comparison, except for the hyperparameters related to the smooth regularizer, we keep all the hyperparameters the same as in the original implementation of garage. We use grid search to select the hyperparameters (perturbation strength , regularization coefficient ) of the smoothness-inducing regularizer. We set the search range to be . To solve the inner maximization problem in the update, we run steps of projected gradient ascent, with step size set as . For each algorithm and each environment, we train 10 policies with different initializations for 500 iterations (1K environment steps for each iteration).
Below we briefly describe the environments we use to evaluate our algorithms (see also Figure 2).
Swimmer. The swimmer is a planar robot with a single torso, links and actuated joints in a viscous container. The dimensional state space includes positions and velocities of sliders, angles and angular velocities of hinges. The dimensional action space includes the torque of each actuated joint.
HalfCheetah. The halfcheetah is a planar biped robot with rigid links, including two legs and a torso, along with actuated joints. The dimensional state space includes positions and velocities of sliders, angles and angular velocities of hinges. The dimensional action space includes the torque of each actuated joint.
Walker2D. The walker is a planar biped robot consisting of links, corresponding to two legs and a torso, along with actuated joints. The dimensional state space includes positions and velocities of sliders, angles and angular velocities of hinges. The dimensional action space includes the torque of each actuated joint.
Hopper. The hopper is a planar monopod robot with rigid links, corresponding to the torso, upper leg, lower leg, and foot, along with 3 actuated joints. The 11-dimensional state space includes positions and velocities of sliders, angles and angular velocities of hinges. The dimensional action space includes the torque of each actuated joint.
Ant. The ant is a quadruped robot consisting of links, corresponding to legs and a torso, along with actuated joints. The dimensional state space includes positions and velocities of sliders, angles and angular velocities of hinges. The dimensional action space includes the torque of each actuated joint.
4.2 Evaluating the Learned Policies
TRPO with (TRPO-SR). We use a Gaussian policy in our implementation. Specifically, for a given state , the action follows a Gaussian distribution , where is also a learnable parameter. Then the smoothness-inducing regularizer (5) takes the form:
Figure 4 shows the mean and variance of the cumulative reward (over policies) for policies trained by TRPO-SR and TRPO for Swimmer, HalfCheetah, Hopper and Ant. For all four tasks, TRPO-SR learns a better policy in terms of the mean cumulative reward. In addition, TRPO-SR enjoys a smaller variance of the cumulative reward with respect to different initializations. These two observations confirm that our smoothness-inducing regularization improves sample efficiency as well as training stability.
We further show that the advantage of our proposed training framework goes beyond improving the mean cumulative reward. To show this, we run the algorithm with different initializations, sort the cumulative rewards of learned policies and plot the percentiles in Figure 6.

For all four tasks, TRPO-SR uniformly outperforms the baseline TRPO.

For the Swimmer and HalfCheetah tasks, TRPO-SR significantly improves the worst case performance compared to TRPO, and has similar best case performance.

For the Walker and Ant tasks, TRPO-SR significantly improves the best case performance compared to TRPO.
Our empirical results provide strong evidence that the proposed framework not only improves the average reward, but also makes the training process significantly more robust to failure cases compared to the baseline method.
DDPG with . We repeat the same evaluations for applying the proposed framework to DDPG (DDPG-SR-A and DDPG-SR-C). Figure 4 shows the mean and variance of the cumulative reward for policies trained by DDPG-SR-A and DDPG-SR-C in the HalfCheetah, Hopper, Walker2D and Ant environments. For all four tasks, DDPG-SR learns a better policy in terms of mean reward. For the Ant task, DDPG-SR-A shows superior training stability, as it is the only algorithm without a drastic decay in the initial training stage. In addition, DDPG-SR-C shows competitive performance compared to DDPG-SR-A, and significantly outperforms DDPG-SR-A and DDPG on the HalfCheetah task. This shows that instead of directly learning a smooth policy, we can learn a smooth Q-function and obtain similar performance benefits.
Figure 6 plots percentiles of the cumulative reward of learned policies using DDPG and DDPG-SR. Similar to TRPO-SR, both DDPG-SR-A and DDPG-SR-C uniformly outperform the baseline DDPG for all the reward percentiles. DDPG-SR is able to significantly improve the worst case performance, while maintaining competitive best case performance compared to DDPG.
4.3 Robustness with Disturbance
We demonstrate that even though the training framework does not explicitly target robustness, the trained policy is still able to achieve robustness against both stochastic and adversarial measurement error, which is a classical setting considered in partially observable Markov decision processes (POMDP) (Monahan, 1982). To show this, we evaluate the robustness of the proposed training framework in the Swimmer and HalfCheetah environments. We evaluate the trained policy with two types of disturbances in the test environment: for a given state , we add to it either (i) a random disturbance, which is sampled uniformly from , or (ii) an adversarial disturbance, which is generated by solving:
using steps of projected gradient ascent. For all evaluations, we use disturbance set . For each policy and disturbed environment, we perform stochastic rollouts to evaluate the policy and plot the cumulative reward of the policy.
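The random-disturbance evaluation can be sketched as follows. The snippet assumes an l-infinity disturbance set, a hypothetical `step_fn(state, action)` that returns the next state and the reward, and an undiscounted cumulative reward; none of these names come from the paper's code. The policy only sees the disturbed state, while the environment evolves from the true state.

```python
import numpy as np


def evaluate_under_disturbance(policy_fn, step_fn, init_state, horizon, eps, rng):
    """Cumulative reward of one rollout with noisy state measurements."""
    state = np.asarray(init_state, dtype=float)
    total_reward = 0.0
    for _ in range(horizon):
        # Measurement error: uniform noise in an assumed l-infinity ball.
        observed = state + rng.uniform(-eps, eps, size=state.shape)
        action = policy_fn(observed)
        # The true (undisturbed) state drives the transition.
        state, reward = step_fn(state, action)
        total_reward += reward
    return total_reward
```

Sweeping `eps` over a range of values and averaging over several rollouts gives a reward-versus-disturbance-strength curve; an adversarial variant would replace the uniform noise with a perturbation found by projected gradient ascent.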
To evaluate the robustness of TRPO with , we run both baseline TRPO and TRPO-SR in the Swimmer environment. Figure 8 plots the cumulative reward against the disturbance strength (). We see that for both random and adversarial disturbances, increasing the strength of the disturbance decreases the cumulative reward of the learned policies. On the other hand, we see that TRPO-SR clearly achieves improved robustness against perturbations, as its reward declines much more slowly than that of the baseline TRPO.
To evaluate the robustness of DDPG with , we run baseline DDPG, DDPG-SR-A and DDPG-SR-C in the HalfCheetah environment. Figure 8 plots the cumulative reward against the disturbance strength (). We see that incorporating the proposed smoothness-inducing regularizer into either the actor or the critic network improves the robustness of the learned policy against state disturbances.
4.4 Sensitivity to Hyperparameters
The proposed smoothness-inducing regularizer involves setting two hyperparameters: the coefficient of the regularizer and the disturbance strength . We vary the choices of and plot the heatmap of cumulative reward for each configuration in Figure 9. In principle, and both control the strength of the regularization in a similar way: large and both increase the strength of the regularization, advocating more smoothness, while small and both decrease the strength of the regularization, favoring cumulative reward over smoothness. We observe similar behavior in Figure 9: a relatively small and large shows similar performance to a relatively large and small .
5 Conclusion
We develop a novel regularization-based training framework to learn a smooth policy in reinforcement learning. The proposed regularizer encourages the learned policy to produce similar decisions for similar states. It can be applied to induce smoothness either in the policy directly or in the Q-function, and thus enjoys great applicability. We demonstrate the effectiveness of the framework by applying it to two popular reinforcement learning algorithms, TRPO and DDPG. Our empirical results show that the framework improves the sample efficiency and training stability of current algorithms. In addition, the induced smoothness in the learned policy also improves robustness against both random and adversarial perturbations to the state.
References
 Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 Boyan & Moore (1995) Boyan, J. A. and Moore, A. W. Generalization in reinforcement learning: Safely approximating the value function. In Advances in neural information processing systems, pp. 369–376, 1995.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 Cheng et al. (2019) Cheng, R., Verma, A., Orosz, G., Chaudhuri, S., Yue, Y., and Burdick, J. W. Control regularization for reduced variance reinforcement learning. arXiv preprint arXiv:1905.05380, 2019.
 garage contributors (2019) garage contributors, T. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage, 2019.
 Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 Hampel (1974) Hampel, F. R. The influence curve and its role in robust estimation. Journal of the american statistical association, 69(346):383–393, 1974.
 Hampel et al. (2011) Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.
 Hendrycks et al. (2019) Hendrycks, D., Mazeika, M., Kadavath, S., and Song, D. Using self-supervised learning can improve model robustness and uncertainty. arXiv preprint arXiv:1906.12340, 2019.
 Huang et al. (2018) Huang, L., Liu, X., Lang, B., Yu, A. W., Wang, Y., and Li, B. Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Huber (2004) Huber, P. J. Robust statistics, volume 523. John Wiley & Sons, 2004.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Jiang et al. (2019) Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Zhao, T. Smart: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. arXiv preprint arXiv:1911.03437, 2019.
 Jin et al. (2018) Jin, J., Song, C., Li, H., Gai, K., Wang, J., and Zhang, W. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2193–2201, 2018.
 Kurakin et al. (2016) Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
 Levine et al. (2018) Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., and Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Miyato et al. (2018) Miyato, T., Maeda, S.-i., Ishii, S., and Koyama, M. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 2018.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Monahan (1982) Monahan, G. E. State of the art - a survey of partially observable markov decision processes: theory, models, and algorithms. Management science, 28(1):1–16, 1982.
 Pinto et al. (2017) Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2817–2826. JMLR.org, 2017.
 Qin et al. (2019) Qin, C., Martens, J., Gowal, S., Krishnan, D., Dvijotham, K., Fawzi, A., De, S., Stanforth, R., and Kohli, P. Adversarial robustness through local linearization. In Advances in Neural Information Processing Systems, pp. 13824–13833, 2019.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897, 2015.
 Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
 Thrun & Schwartz (1993) Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum, 1993.
 Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.
 Xie et al. (2019) Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019.
 Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
 Zhang et al. (2019) Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled tradeoff between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.
 Zhao et al. (2018) Zhao, J., Qiu, G., Guan, Z., Zhao, W., and He, X. Deep reinforcement learning for sponsored search real-time bidding. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1021–1030, 2018.
 Zheng et al. (2018) Zheng, G., Zhang, F., Zheng, Z., Xiang, Y., Yuan, N. J., Xie, X., and Li, Z. Drn: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, pp. 167–176, 2018.
Appendix A Appendix
We present two variants of DDPG with the proposed smoothness-inducing regularizer. The first variant, DDPG-SR-A, directly learns a smooth policy with a regularizer that measures the nonsmoothness in the actor network (policy). The second variant, DDPG-SR-C, learns a smooth Q-function with a regularizer that measures the nonsmoothness in the critic network (Q-function). We present the details of DDPG-SR-A and DDPG-SR-C in Algorithm 2 and Algorithm 3, respectively.