As robotic and machine learning technologies develop remarkably, the tasks required of intelligent robots become more complex: e.g. physical human-robot interaction [19, 11], work on disaster sites [10, 5], and manipulation of various objects [24, 17]. In most cases, these complex tasks have no accurate analytical model. To resolve this difficulty, deep reinforcement learning (DRL) has received a lot of attention as an alternative to classic model-based control. Using deep neural networks (DNNs) as nonlinear function approximators, DRL can learn a complicated policy and/or value function in a model-free manner [18, 6], or learn a complicated world model for planning the optimal policy in a model-based manner [3, 20].
Since DNNs are nonlinear, and since model-free DRL in particular must generate pseudo-supervised signals by itself, DRL tends to be unstable. Techniques to stabilize learning have been actively proposed, such as the design of regularization [24, 6, 15] and the introduction of models that make learning conservative [21, 13]. Among them, the target network is one of the current standard techniques in DRL [18, 12]. After generating it as a copy of the main network to be learned by DRL, an update rule is given to make it slowly match the main network, either at regular intervals or asymptotically. In this case, the pseudo-supervised signals generated from the target network are more stable than those generated from the main network, which greatly contributes to the overall stability of DRL.
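As a minimal illustration of this idea (using plain Python dictionaries of NumPy arrays as stand-ins for actual DNN parameters), the target network starts as a frozen copy of the main network, so the signals it generates stay fixed until an update rule is explicitly applied:

```python
import copy
import numpy as np

# Hypothetical parameter sets standing in for DNN weights.
main = {"w": np.array([1.0, 2.0])}
target = copy.deepcopy(main)  # target network created as a copy of the main network

main["w"] += 0.5  # the main network is trained and drifts away...

# ...but the target network (and hence its pseudo-supervised signals)
# remains unchanged until an explicit update rule is applied.
print(target["w"])  # [1. 2.]
```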
The challenge in using the target network is its update rule. It has been reported that too slow an update stagnates the whole learning process, while too fast an update reverts to instability of the pseudo-supervised signals. A new update rule, T-soft update, has been proposed to mitigate the latter problem. This method provides a mechanism to limit the amount of updates when the main network deviates unnaturally from the target network, which can be regarded as noise. Such noise robustness makes it possible to stabilize the whole learning process even with a high update rate by appropriately ignoring the unstable behaviors of the main network.
However, the noise robustness of T-soft update is specified as a hyperparameter, which must be set to an appropriate value depending on the task to be solved. In addition, its simplified implementation for detecting noise deteriorates the noise robustness, so a more sophisticated implementation with adaptive noise robustness is desired. As another concern, when the update of the target network is restricted, as T-soft update does, the target network may not asymptotically match the main network. A new constraint is needed to avoid the situation where the main network deviates from the target network.
Hence, this paper proposes two methods to resolve each of the above two issues: i) an adaptive and sophisticated implementation of T-soft update; and ii) an appropriate consolidation of the main network to the target network. Specifically, for issue i), a new update rule, the so-called adaptive T-soft (AT-soft) update, is developed based on the recently proposed AdaTerm formulation, which is an adaptively noise-robust stochastic gradient descent method. This allows us to sophisticate the simplified implementation of T-soft update and improve the noise robustness, which can adapt to the input patterns. For issue ii), a new consolidation is designed so that the main network is regularized toward the target network when AT-soft update restricts the updates of the target network. By implementing it with interpolation, the parameters in the main network that deviate significantly from those of the target network are updated to a larger extent. With this consolidation, the proposed method is the so-called consolidated AT-soft (CAT-soft) update.
To verify CAT-soft update, typical benchmarks implemented in Pybullet are tried using the latest DRL algorithms [15, 22]. It is shown that, even though the learning rate is larger than the standard value for DRL, the task performance is improved by CAT-soft update and more stable learning can be achieved. In addition, the developed consolidation successfully suppresses the divergence between the main and target networks.
II-A Reinforcement learning
First of all, the basic problem statement of DRL is introduced with an actor-critic algorithm, which can handle continuous action spaces and is a natural choice as one of the basic algorithms for robot control. Note that the proposed method can be applied to other algorithms with the target network.
In DRL, an agent interacts with an unknown environment modeled as a Markov decision process (MDP) with the current state $s_t$, the agent's action $a_t$, the next state $s_{t+1}$, and the reward from the environment $r_t$. Specifically, the environment implicitly has its initial randomness $s_0 \sim p_0(s_0)$ and its state transition probability $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$. Since the agent can act on the environment's state transition through $a_t$, the goal is to find the optimal policy to reach the desired state. To this end, $a_t$ is sampled from a state-dependent trainable policy, $\pi(a_t \mid s_t; \theta)$, with its parameters set $\theta$ (a.k.a. weights and biases of DNNs in DRL). The outcome of the interaction between the agent and the environment is evaluated as the reward $r_t$.
By repeating the above process, the agent gains the sum of rewards over the future (the so-called return), $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, with discount factor $\gamma \in [0, 1)$. The main purpose of DRL is to maximize $R_t$ by optimizing $\pi$ (i.e. $\theta$). However, $R_t$ cannot be gained directly due to its future information, hence its expected value is inferred as a trainable (state) value function, $V(s_t; \phi)$, with its parameters set $\phi$. Finally, DRL optimizes $\pi$ to maximize $V$ while increasing the accuracy of $V$.
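The return can be computed by a backward recursion over a reward sequence; a small sketch (the truncation to a finite horizon is an assumption for illustration):

```python
def discounted_return(rewards, gamma=0.99):
    """Backward recursion R_t = r_t + gamma * R_{t+1} over a finite horizon."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 1 + 0.5 * (1 + 0.5 * 1) = 1.75
discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```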
To learn $V$, a temporal difference (TD) error method is widely used as follows:
$$\mathcal{L}_V(\phi) = \frac{1}{2} \left( y - V(s_t; \phi) \right)^2, \qquad y = r_t + \gamma V(s_{t+1}; \bar{\phi})$$
where $y$ denotes the pseudo-supervised signal generated from the target network with the parameters set $\bar{\phi}$ (see later). By minimizing $\mathcal{L}_V$, $V$ can be optimized to correctly infer the value over states.
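A sketch of this computation with scalar values (the function name and arguments are illustrative; in practice the value estimates come from the main and target networks):

```python
def td_error(r, v_next_target, v_main, gamma=0.99, done=False):
    """TD error: the pseudo-supervised signal y bootstraps from the target network."""
    y = r + (0.0 if done else gamma * v_next_target)  # pseudo-supervised signal
    return y - v_main

delta = td_error(r=1.0, v_next_target=2.0, v_main=2.5, gamma=0.99)
```

Because `v_next_target` is read from the slowly moving target network, `y` is stationary between target updates, which is exactly what stabilizes the regression.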
To learn $\pi$, a policy-gradient method is applied as follows:
$$\mathcal{L}_\pi(\theta) = - \rho \, \delta, \qquad \rho = \frac{\pi(a_t \mid s_t; \theta)}{b(a_t \mid s_t; \bar{\theta})}$$
where $a_t$ is sampled from the alternative policy $b$, which is often given by the target network with the parameters set $\bar{\theta}$. The sampler change is allowed by importance sampling, and the likelihood ratio $\rho$ is introduced in the above loss function with the TD error $\delta$. By minimizing $\mathcal{L}_\pi$, $\pi$ can be optimized to reach states with higher value.
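One common surrogate for such an importance-weighted loss can be sketched as follows (this is a generic form, not necessarily the exact loss in the paper; the ratio is computed from log-probabilities for numerical stability):

```python
import numpy as np

def importance_weighted_loss(log_pi_main, log_pi_behavior, td_delta):
    """Surrogate policy loss -rho * delta, where rho is the likelihood ratio
    between the main policy and the behavior (target-network) policy."""
    rho = np.exp(log_pi_main - log_pi_behavior)  # importance weight
    return -rho * td_delta

# When both policies agree (rho = 1), the loss reduces to -delta.
loss = importance_weighted_loss(log_pi_main=-1.0, log_pi_behavior=-1.0, td_delta=0.5)
```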
II-B Target network with T-soft update
The target network with its parameters set $\bar{\theta}$ is briefly introduced together with the latest update rule, T-soft update. First, in the initialization phase of the main network, the target network is also created as a copy with $\bar{\theta} = \theta$. The copied $\bar{\theta}$ is given independently of $\theta$, and is not updated through the minimization problems of eqs. (1) and (3). Therefore, the pseudo-supervised signal has the same value for the same input, which greatly contributes to the stability of learning by making the minimization problem stationary.
However, in practice, if $\bar{\theta}$ is fixed at its initial value, the correct pseudo-supervised signals cannot be generated and the task is never accomplished. Thus, $\bar{\theta}$ must be updated slowly towards $\theta$ as in alternating optimization. When the target network was first introduced, a technique called hard update was employed, where $\theta$ was updated a certain number of times and then copied again as $\bar{\theta} = \theta$. Afterwards, the soft update shown in the following equation was proposed to make $\bar{\theta}$ asymptotically match $\theta$ more smoothly.
$$\bar{\theta} \leftarrow (1 - \tau) \bar{\theta} + \tau \theta$$
where $\tau \in (0, 1]$ denotes the update rate.
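The soft update amounts to one line of interpolation per parameter tensor; a minimal sketch with dictionaries of NumPy arrays standing in for network parameters:

```python
import numpy as np

def soft_update(target, main, tau=0.1):
    """Soft update: target <- (1 - tau) * target + tau * main, per tensor."""
    for k in target:
        target[k] = (1.0 - tau) * target[k] + tau * main[k]
    return target

target = {"w": np.zeros(4)}
main = {"w": np.ones(4)}
soft_update(target, main, tau=0.5)  # target["w"] becomes [0.5, 0.5, 0.5, 0.5]
```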
The above update rule is given as an exponential moving average, and all new inputs are treated equivalently. As a result, even when $\theta$ is incorrectly updated, its adverse effect as noise is reflected in $\bar{\theta}$. This effect is more pronounced when $\tau$ is large, but mitigating it by reducing $\tau$ causes a reduction in learning speed.
To tackle this problem, T-soft update, which is robust to noise even with a relatively large $\tau$, has recently been proposed. It regards the exponential moving average as the update of the location parameter of a normal distribution, and derives a new update rule by replacing it with the student-t distribution, which is more robust to noise with specified degrees of freedom. T-soft update is described in Alg. 1. Note that its internal states (the scale and the update-amount statistics) must be updated as well.
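The core idea can be sketched as follows; this is a schematic simplification (the variable names, the exact weighting, and the internal-state updates here do not reproduce Alg. 1, only the student-t-style down-weighting of large deviations):

```python
import numpy as np

def t_soft_like_update(target, main, sigma2, tau=0.1, nu=1.0, eps=1e-8):
    """Schematic noise-robust update: the effective rate shrinks when the
    mean-square deviation of a subset is large relative to its running scale."""
    for k in target:
        d2 = float(np.mean((main[k] - target[k]) ** 2))
        w = (nu + 1.0) / (nu + d2 / (sigma2[k] + eps))   # student-t-style weight
        tau_k = tau * min(w, 1.0)                        # suppressed for outliers
        target[k] = target[k] + tau_k * (main[k] - target[k])
        sigma2[k] = (1.0 - tau_k) * sigma2[k] + tau_k * d2  # internal scale state
    return target, sigma2

target = {"w": np.zeros(2)}
main = {"w": np.full(2, 10.0)}  # sudden large jump: treated as noise
sigma2 = {"w": 1.0}
t_soft_like_update(target, main, sigma2, tau=0.1, nu=1.0)
```

With a normal-sized deviation the update proceeds at roughly the rate `tau`, while a sudden large jump of the main network is mostly ignored.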
With mathematical explanations, the issues of T-soft update are summarized again below.
Since the degrees of freedom must be specified as a constant in advance, they must be tuned for each task to provide the appropriate noise robustness that maximizes performance.
The larger the deviation of the main network from the target network is, the more the update is suppressed. However, the simple calculation of this deviation as a mean square error makes it easier to hide the noise within the $i$-th subset.
If the update ratio is frequently close to zero (i.e. no update), there is a risk that $\bar{\theta}$ will not asymptotically match $\theta$.
III-A Adaptive T-soft update
The first two of the three issues mentioned above are resolved by deriving a new update rule, the so-called AT-soft update. To develop AT-soft update, the formulation of AdaTerm, which is a kind of stochastic gradient descent method with adaptive noise robustness, is applied. In this method, by assuming that the gradient is generated from a student-t distribution, its location, scale, and degrees-of-freedom parameters, which are utilized for updating the network, can be estimated by approximate maximum likelihood estimation. Here, instead of the gradient, the parameters of the main network are considered to be the stochastic variable generated from a student-t distribution, and its location is mapped to the parameters of the target network. With this assumption, AT-soft update obtains noise robustness as in the conventional T-soft update. In addition, the degrees of freedom can be estimated at the same time in this formulation, so that the noise robustness can be automatically adjusted according to the faced task.
Specifically, the $i$-th subset of $\theta$ (e.g. a weight matrix in each layer), $\theta_i$ with $d_i$ the number of dimensions, is assumed to be generated from a $d_i$-dimensional diagonal student-t distribution with three kinds of sample statistics: a location parameter $\mu_i$; a scale parameter $\sigma_i$; and degrees of freedom $\nu_i$. With these, its density can be described as below.
$$p(\theta_i) = \frac{\Gamma\left(\frac{\nu_i + d_i}{2}\right)}{\Gamma\left(\frac{\nu_i}{2}\right) (\nu_i \pi)^{d_i/2} \prod_{j=1}^{d_i} \sigma_{i,j}} \left( 1 + \frac{1}{\nu_i} \sum_{j=1}^{d_i} \frac{(\theta_{i,j} - \mu_{i,j})^2}{\sigma_{i,j}^2} \right)^{-\frac{\nu_i + d_i}{2}}$$
where $\Gamma$ denotes the gamma function. Note that the conventional T-soft update simplifies this model as a one-dimensional student-t distribution for the mean of $\theta_i$, but here we treat it as a $d_i$-dimensional distribution with slightly increased computational cost.
With this assumption, following the derivation of AdaTerm, $\mu_i$, $\sigma_i$, and $\nu_i$ are optimally inferred to maximize the approximated log-likelihood. The important variable in the derivation is the deviation term $D_i$, which indicates the deviation of $\theta_i$ from $\mu_i$, and is calculated as follows:
That is, since $D_i$ represents the pseudo-distance of $\theta_i$ from $\mu_i$, the larger $D_i$ is, the closer the update weight $w_i$ is to zero. In addition, the smaller $\nu_i$ is, the more sensitive $w_i$ is to fluctuations in $D_i$, leading to higher noise robustness. Using $w_i$, the weight used only for updating $\nu_i$ can be derived as follows:
These are used to calculate the update ratios of the sample statistics.
where $\tau$ denotes the basic update ratio given as a hyperparameter. To keep each ratio within its valid range, upper bounds on the ratios are employed.
where the bound is given by the negative logarithm of the tiny (smallest positive) number of float32.
The update amounts for $\mu_i$, $\sigma_i$, and $\nu_i$ are respectively given as follows:
where $\epsilon$ denotes a small value for stabilizing the computation, and the lower bound of $\nu_i$ (i.e. the maximum noise robustness) is given as a hyperparameter.
Using the update ratios and the update amounts obtained above, $\mu_i$, $\sigma_i$, and $\nu_i$ can be updated.
As a result, AT-soft update enables updating the parameters set of the target network adaptively (i.e. depending on the deviation of $\theta$ from $\bar{\theta}$), while automatically tuning the noise robustness represented by $\nu_i$ and $\sigma_i$.
III-B Consolidation from main to target networks
However, if the suppression of updates continues, $\theta$ will gradually deviate from $\bar{\theta}$, and the target network will no longer be able to generate appropriate pseudo-supervised signals since the underlying assumption is broken. In such a case, parts of $\theta$ would be updated through the minimization of eqs. (1)–(3) in a wrong direction, and would keep being judged as outliers. Hence, to stop this fruitless judgement and restart the appropriate updates, reverting $\theta$ to $\bar{\theta}$ and holding $\theta \simeq \bar{\theta}$ would be a natural and effective way. To this end, a heuristic consolidation is designed as below.
Specifically, the update ratio from the main to the target network is designed to be larger when the update ratio of the target network is smaller (i.e. when $\theta$ deviates from $\bar{\theta}$).
where the coefficient adjusts the strength of this consolidation, which should be the same as or weaker than the update speed of the target network.
Next, since consolidating all of $\theta_i$ would interfere with learning, the consolidated subset of outliers should be extracted. A simple and popular way is to use the $q$-th quantile with $q \in (0, 1)$. Since the components that contribute to making the update ratio small are those with large deviations, the subset is defined as follows:
Thus, the following update formula consolidates the extracted subset to the corresponding subset of the target network.
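A sketch of this quantile-based consolidation (the selection rule, quantile, and strength here are illustrative assumptions, not the exact published formulas):

```python
import numpy as np

def consolidate(main, target, strength=0.5, q=0.9):
    """Pull only the most-deviating parameters of each subset back toward
    the target network by interpolation; the rest are left untouched."""
    for k in main:
        dev2 = (main[k] - target[k]) ** 2
        mask = dev2 >= np.quantile(dev2, q)  # outlier subset above the q-th quantile
        main[k] = np.where(
            mask, (1.0 - strength) * main[k] + strength * target[k], main[k]
        )
    return main

main = {"w": np.array([0.1, 0.2, 10.0])}
target = {"w": np.zeros(3)}
consolidate(main, target, strength=0.5, q=0.9)  # only the outlier 10.0 is pulled back
```

Because the interpolation strength is shared, the parameters that deviate most are moved the furthest, matching the intent described above.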
A rough sketch of this consolidation is shown in Fig. 1. Although loss-function-based consolidations, as proposed in the context of continual learning [9, 25], would also be possible, a more convenient implementation with lower computational cost was employed here.
The pseudo-code for the consolidated adaptive T-soft (CAT-soft) update is summarized in Alg. 2. Note that, although the basic update ratio must be specified as a new hyperparameter in (C)AT-soft update, the one specified in T-soft update is already conservatively set for noise, and we can inherit it. Therefore, the additional hyperparameters to be tuned are the consolidation strength and the quantile. The quantile can be given so that a few parameters in the $i$-th subset are consolidated without interfering with learning, and the consolidation strength can be given as the inverse of the number of parameters to be consolidated. In other words, we can decide whether to set the quantile closer to one and consolidate fewer parameters tightly, or make it smaller and consolidate more parameters slightly.
TABLE I: Hyperparameters for the implementation: the number of neurons for each layer, and the settings for AdaTerm, PPO-RPE, PER, and L2C2.
For the statistical verification of the proposed method, the following simulations are conducted. As simulation environments, Pybullet with OpenAI Gym is employed. From it, InvertedDoublePendulumBulletEnv-v0 (DoublePendulum), HopperBulletEnv-v0 (Hopper), and AntBulletEnv-v0 (Ant) are chosen as tasks. To make the tasks harder, the observations from them are perturbed by white noise. With 18 random seeds, each task is tried by each method. After training, the learned policy is run 100 times to evaluate the sum of rewards as a score (larger is better).
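The observation noise can be injected with a thin wrapper; a sketch assuming the classic Gym step API and a hypothetical noise scale:

```python
import numpy as np

class NoisyObservationWrapper:
    """Adds white Gaussian noise to every observation (the scale is an assumption)."""

    def __init__(self, env, scale=0.01, seed=None):
        self.env = env
        self.scale = scale
        self.rng = np.random.default_rng(seed)

    def _noisy(self, obs):
        # White noise: zero-mean Gaussian, independent per component.
        return obs + self.rng.normal(0.0, self.scale, size=np.shape(obs))

    def reset(self, **kwargs):
        return self._noisy(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._noisy(obs), reward, done, info
```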
The implementation of the base network architecture and DRL algorithm is basically the same as in the literature. However, it is notable that the stochastic policy function is modeled by a student-t distribution for conservative learning and efficient exploration, instead of a normal distribution. Hyperparameters for the implementation are summarized in Table I. Note that the learning difficulty is higher because the specified learning rate is higher than one suitable for DRL, revealing the effectiveness of the target network for stabilizing learning.
The following three methods are compared.
Here, the quantile is designed so that only one parameter in each subset is consolidated, as the simplest implementation. Correspondingly, the consolidation strength is given as the inverse of the (maximum) number of consolidated parameters. The update rate is set smaller than the one in the literature, but this is to counteract the negative effects of the high learning rate set above.
TABLE II: Scores (mean ± standard deviation) of 100 runs after learning.

| Method   | DoublePendulum  | Hopper         | Ant            |
|----------|-----------------|----------------|----------------|
| T-soft   | 6427.1 ± 3357.8 | 1852.8 ± 900.9 | 2683.8 ± 249.3 |
| AT-soft  | 6379.7 ± 3299.7 | 1662.7 ± 897.4 | 2764.1 ± 265.5 |
| CAT-soft | 7129.2 ± 2946.0 | 1971.2 ± 812.9 | 2760.0 ± 312.2 |
The learning behaviors are depicted in Fig. 2. As pointed out, the deviation under (C)AT-soft updates was larger than that under the conventional T-soft update, since (C)AT-soft update has better outlier and noise detection performance and the target network update is more easily suppressed. This was pronounced in the early stage of training, when the noise robustness is high and the update of the main network is unstable. However, CAT-soft update suppressed the deviation in the early stage of training. As learning progressed, CAT-soft update converged to roughly the same level of deviation as AT-soft update, because the consolidation was relaxed along with the weakened noise robustness.
The scores of 100 runs after learning are summarized in Table II. AT-soft update slightly increased the performance of T-soft update on Ant, but decreased it on Hopper. In contrast, CAT-soft update outperformed T-soft update in all tasks.
As a demonstration, a simulation closer to a real robot experiment, MinitaurBulletDuckEnv-v0 (Minitaur) in Pybullet, is tried. The task is to move a duck placed on top of a Ghost Minitaur, a quadruped robot developed by Ghost Robotics. Since this duck is not fixed, careful locomotion is required, and its states (e.g. position) are unobserved, making this task a partially observable MDP (POMDP). Note that the default setting for the Minitaur tasks is unrealistic, as pointed out in the literature. Therefore, it was modified as shown in Table III (arguments not listed are left at their defaults).
T-soft and CAT-soft updates are compared under the same conditions as in the above simulations. The learning curves of the scores over 8 trials and the test results of the trained policies are depicted in Fig. 3. The best behaviors in the tests can be found in the attached video. As can be seen from Fig. 3, only the proposed CAT-soft update was able to acquire successful behaviors for the task (walking without dropping the duck). Thus, it is suggested that CAT-soft update can contribute to the success of the task by steadily improving the learning performance even for more practical tasks.
This paper proposed a new update rule for the target network, CAT-soft update, which stabilizes DRL. In order to adaptively adjust the noise robustness, an update rule inspired by the recently developed AdaTerm was derived. In addition, a heuristic consolidation from the main to the target network was developed to suppress the deviation between them, which may occur when updates are continuously limited due to noise. The developed CAT-soft update was tested on DRL benchmark tasks, and succeeded in improving and stabilizing the learning performance over the conventional T-soft update.
Strictly speaking, the deviation between the target and main networks should be measured in terms of their outputs, not their parameters. A new consolidation and a noise-robust update based on the output space are expected to contribute to further performance improvements. These efforts to stabilize DRL will lead to its practical use in the near future.
This work was supported by JSPS KAKENHI, Grant-in-Aid for Scientific Research (B), Grant Number JP20H04265.
- (2003) Convergence of alternating optimization. Neural, Parallel & Scientific Computations 11 (4), pp. 351–368.
- (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
- (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765.
- (2016) PyBullet, a Python module for physics simulation for games, robotics and machine learning. GitHub repository.
- (2019) The current state and future outlook of rescue robotics. Journal of Field Robotics 36 (7), pp. 1171–1191.
- (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870.
- (2022) AdaTerm: adaptive T-distribution estimated robust moments towards noise-robust stochastic gradient optimizer. arXiv preprint arXiv:2201.06714.
- (2019) DeepMellow: removing the need for a target network in deep Q-learning. In International Joint Conference on Artificial Intelligence, pp. 2733–2739.
- (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526.
- (2015) Selection algorithm for locomotion based on the evaluation of falling risk. IEEE Transactions on Robotics 31 (3), pp. 750–765.
- (2021) Whole-body multicontact haptic human–humanoid interaction based on leader–follower switching: a robot dance of the "box step". Advanced Intelligent Systems, pp. 2100038.
- (2021) T-soft update of target network for deep reinforcement learning. Neural Networks.
- (2019) Student-t policy in reinforcement learning to acquire global optimum of robot control. Applied Intelligence 49 (12), pp. 4335–4347.
- (2021) Optimistic reinforcement learning by forward Kullback–Leibler divergence optimization. arXiv preprint arXiv:2105.12991.
- (2021) Proximal policy optimization with relative Pearson divergence. In IEEE International Conference on Robotics and Automation, pp. 8416–8421.
- (2022) L2C2: locally Lipschitz continuous constraint towards stable and smooth reinforcement learning. arXiv preprint arXiv:2202.07152.
- (2021) A review of robot learning for manipulation: challenges, representations, and algorithms. Journal of Machine Learning Research 22 (30), pp. 1–82.
- (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- (2015) Optimized assistive human–robot interaction using reinforcement learning. IEEE Transactions on Cybernetics 46 (3), pp. 655–667.
- (2020) PlaNet of the Bayesians: reconsidering and improving deep planning network by incorporating Bayesian inference. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5611–5618.
- (2016) Deep exploration via bootstrapped DQN. Advances in Neural Information Processing Systems 29, pp. 4026–4034.
- (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952.
- (2018) Reinforcement learning: an introduction. MIT Press.
- (2019) Deep reinforcement learning with smooth policy update: application to robotic cloth manipulation. Robotics and Autonomous Systems 112, pp. 72–83.
- (2017) Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995.