Autonomous Underwater Vehicles (AUVs) have become vital assets in search and recovery, exploration, surveillance, monitoring, and military applications. For large AUVs in deep-water applications, the strength and variation of external wave and current disturbances are negligible, owing to the vehicles' considerable size and thrust capabilities. Small AUVs, however, are required for some shallow-water applications, such as bridge pile inspection, where the disturbances arising from turbulent flows may frequently exceed the AUVs' thrust capabilities. These unknown disturbances inevitably bring adverse effects and may even destabilize the robots [3, 4]. This paper therefore studies an optimal control problem for robots subject to excessive time-varying disturbances, and presents an observer-integrated RL solution. The same problem arises in many other applications, e.g., aerial quadrotors performing surveillance in windy conditions and manipulators operating with constantly varying loads. In the case of actuator failure, a robot's control capability may likewise fall below the external disturbances.
Reinforcement Learning (RL) is a trial-and-error method that does not require an explicit system model, and can naturally adapt to noise and uncertainties in the real system. With recent advances in deep neural networks, RL is now able to solve practical problems. However, excessive disturbances can no longer be appropriately regarded as noise, since the AUV's state transition is heavily affected by the external disturbances, thus violating the assumption of a Markov Decision Process (MDP). Considering the time-varying characteristics of the current and wave disturbances, if future disturbance forces can be predicted, RL may be able to generate optimal controls.
Classical Disturbance Observers (DOBs) and related methods have been researched and applied across various industrial sectors over the last four decades. The main objective of a DOB is to deduce the unknown disturbances from measurable variables, without additional sensors. A control action can then be taken, based on the disturbance estimate, to compensate for the influence of the disturbances; this is called Disturbance-Observer-Based Control (DOBC).
However, the classical DOB has three limitations when solving our problem. First, a DOB normally needs a sufficiently accurate system model to estimate the disturbances, which is difficult to obtain for underwater robots due to hydrodynamic effects. In this case, external disturbances and model uncertainties are lumped together, and the DOB estimates the lumped disturbances, which may degrade the prediction. Second, the classical DOB is only capable of dealing with slowly time-varying disturbances, as evidenced by its proof of convergence, which assumes invariance of the disturbance signals; current and wave disturbances usually change rapidly, which is beyond the capability of the classical DOB. Third, even with a sufficiently accurate estimate of the disturbances at the current time step, the optimal control solution remains unreachable due to the neglect of the time-series correlation of the disturbance signals: excessive disturbances cannot be well rejected through one-step feedback alone. For optimal overall performance, the AUV behaviour needs to be optimized over a future time horizon using a sequence of disturbance estimates.
This paper proposes a novel RL approach called DOB-Net, which enables integrated learning of disturbance dynamics and an optimal controller, for current and wave disturbance rejection control of AUVs in shallow and turbulent water, as shown in Fig. 1. The DOB-Net consists of a disturbance dynamics observer network and a controller network. The observer network is built and enhanced via RNNs by imitating the classical DOB mechanism. This network is more flexible than the classical one, since it encodes the estimation and prediction of the external disturbances in the RNN hidden state, instead of only providing the current estimated value of the lumped disturbances. The observer function is also more robust to model uncertainties and to the rapidly time-varying characteristics of the external disturbances. Based on the encoded disturbance prediction, the controller network is able to actively reject the unknown disturbances. The observer and the controller are jointly learned within policy optimization by Advantage Actor Critic. This integrated learning may achieve an optimized representation of observer outputs, compared with traditional hand-designed features. The policy is trained using simulated sinusoidal wave disturbances, and evaluated using both simulated and collected disturbances, the latter gathered from an Inertial Measurement Unit (IMU) onboard an AUV in a water tank at the University of Technology Sydney. During training, the amplitude, period and phase of the sine-wave disturbances are randomly sampled in each episode.
The remainder of this paper is organized as follows. Section II presents related work, and Section III introduces the problem formulation. Section IV provides a detailed description of the DOB-Net. Section V presents validation procedures and result analysis. Potential future improvements are discussed in the last section.
II Related Work
II-A Feedback and Predictive Control
In the early development of disturbance rejection control, feedback control strategies were used to suppress unknown disturbances. Examples of feedback controllers include robust control, adaptive control [10, 11], optimal control, sliding mode control, etc. Subsequently, disturbance estimation and attenuation methods that add a feedforward compensation term [14, 8] have been proposed and practiced, such as the DOB and the Extended State Observer (ESO). However, these methods often assume that the system deals with bounded disturbances that are small enough, and thus fail to guarantee stability under control constraints when meeting strong disturbances.
To this end, Model Predictive Control (MPC) is often applied due to its constraint handling capacity. The method can achieve approximately optimal control performance even under practical constraints, since it optimizes plant behaviour over a certain time horizon, sometimes even sacrificing instant performance for better overall performance. However, MPC requires a sufficiently accurate prediction model of the system to optimize future behaviour; such a model is often difficult to obtain when unknown time-varying disturbances exist, since these disturbances are jointly determined by fluid conditions, robot morphology, and varying robot states and controls. Thus, researchers have developed a compound control scheme consisting of a feedforward compensation part based on the classical DOB and a feedback regulation part based on MPC (DOB-MPC) [18, 14]. The DOB provides a disturbance estimate, an Auto-Regressive Moving Average (ARMA) model is used to predict future disturbances based on past ones, and MPC is then employed based on the given system dynamics and this disturbance model. However, such a separated modeling and control optimization process might not be able to produce models and controls that jointly optimize robot performance, as evidenced in [19, 20]. In contrast, the DOB-Net seeks a joint optimization of observer and controller.
II-B Classical RL
RL has drawn a lot of attention in finding optimal controllers for systems that are difficult to model accurately. Recently, deep RL algorithms based on Q-learning, policy gradients [22, 23], and actor-critic methods [24, 25] have been shown to learn very complex skills in high-dimensional state and action spaces, including simulated robotic locomotion, driving, video game playing, and navigation. RL generally considers stochastic systems of the form

$s_{t+1} = f(s_t, a_t) + w_t \qquad (1)$

with state variables $s_t$, control signal $a_t$ and i.i.d. system noise $w_t$ marginalized over time, where $w_t \sim \mathcal{N}(0, \Sigma_w)$. In our case, however, the current and wave disturbances should be regarded as functions of time instead of random noise, refer to (2), due to their large amplitudes and time-varying characteristics, as evidenced in Section V.
II-C History Window Approach
When using RL to deal with external disturbances, the problem cannot be defined as an MDP, since the robot state transition depends not only on the current state and action, but also heavily on the unknown disturbances. History window approaches attempt to resolve the hidden state by making the selected action depend not only on the current state, but also on a fixed number of the most recent states and actions. Wang et al. applied this approach to handle the external disturbances of an AUV by characterizing the disturbed AUV dynamic system as a multi-order Markov chain, and assuming that the unobserved time-varying disturbances and their prediction over the next planning horizon are encoded in the AUV state-action history of fixed length $h$. Thus, the resultant trained policy takes a fixed-length state-action history along with the current state as inputs to generate optimal controls. However, it is difficult to determine an optimal length of the state-action history when using this history window approach.
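The history-window observation described above can be sketched by concatenating the most recent state-action pairs with the current state. This is a minimal illustration, not the authors' implementation; the function name, the zero-padding of short histories, and the 6-state/3-action dimensions are our assumptions.

```python
import numpy as np

def make_history_obs(history, state, H, state_dim, action_dim):
    """Stack the last H (state, action) pairs with the current state.

    `history` is an ordered sequence of (state, action) tuples (oldest
    first); when fewer than H pairs are available, the observation is
    zero-padded so its size stays fixed, which a fixed-input policy needs.
    """
    pad = H - len(history)
    obs = [np.zeros(state_dim + action_dim) for _ in range(pad)]
    for s, a in history:
        obs.append(np.concatenate([s, a]))
    obs.append(state)                      # current state goes last
    return np.concatenate(obs)
```

The resulting vector has size `H * (state_dim + action_dim) + state_dim`, which grows linearly with the window length `H`, one practical drawback of this approach.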
II-D Recurrent Policy
A popular approach to handling partial observability is to use RNNs to represent policies [29, 30]. The idea is that the RNN can retain information from states further back in time and incorporate it into predicting better value functions, and thus perform better on tasks that require long-term planning. Particularly, at each time step $t$, an RNN policy takes as input the state $s_t$ and the hidden vector $h_{t-1}$, then produces the action $a_t$ and the next hidden state $h_t$. The hidden vector is then fed back into the network at the next time step. Since $h_t$ depends on $h_{t-1}$, the action $a_t$ is a function of all of the previous states. These policies are able to solve tasks that require memory by processing sequences of states. However, most of them consider only state histories, which, for example, have been used for estimating velocities when training video game players; this is not sufficient to observe disturbance dynamics.
III Problem Formulation
III-A System Description
Our 6-Degree-Of-Freedom (DOF) AUV is designed to be sufficiently stable in orientation even under strong disturbances, thanks to its large restoring forces. Thus, to simplify the problem, we only consider the control of the vehicle's position. The AUV can be modeled as a floating rigid body with external disturbances, represented by
$M\ddot{\boldsymbol{\eta}} + C(\dot{\boldsymbol{\eta}})\dot{\boldsymbol{\eta}} + D(\dot{\boldsymbol{\eta}})\dot{\boldsymbol{\eta}} + g(\boldsymbol{\eta}) = \boldsymbol{\tau}_c + \boldsymbol{\tau}_d \qquad (2)$

where $M$ is the inertia matrix, $C(\dot{\boldsymbol{\eta}})$ is the matrix of Coriolis and centripetal terms, $D(\dot{\boldsymbol{\eta}})$ is the matrix of drag forces, $g(\boldsymbol{\eta})$ is the vector of gravity and buoyancy forces, $\boldsymbol{\eta}$, $\dot{\boldsymbol{\eta}}$ and $\ddot{\boldsymbol{\eta}}$ represent the displacements, velocities and accelerations of the AUV, $\boldsymbol{\tau}_c$ represents the control forces, and $\boldsymbol{\tau}_d$ is the vector of time-varying disturbance forces. The variation of $\boldsymbol{\tau}_d$ over time, from the past to the future, constitutes the disturbance dynamics, which is exactly what the observer network tries to capture. The AUV dynamic model is assumed to have fixed parameters, but neither the model nor its parameters are known. In our case, we assume that the magnitudes of the disturbances may exceed the AUV control limits, but are constrained within reasonable ranges, ensuring the controller is able to stabilize the AUV in a sufficiently small region.
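A minimal simulation step for this model can be sketched as follows. This is an illustrative sketch only: it assumes a diagonal inertia matrix, linear drag, neutral buoyancy, and symmetric control limits `u_max`, and it omits the Coriolis term; none of these values come from the paper.

```python
import numpy as np

def auv_step(eta, nu, tau_c, tau_d, dt=0.05,
             M=None, D=None, g=None, u_max=1.0):
    """One semi-implicit Euler step of the simplified position-only model
    M*acc + D*nu + g = tau_c + tau_d (Coriolis omitted for brevity).

    The control tau_c is clipped to the limits, while the disturbance
    tau_d is NOT, reflecting the excessive-disturbance setting.
    """
    n = len(eta)
    M = np.eye(n) * 40.0 if M is None else M   # assumed diagonal inertia
    D = np.eye(n) * 5.0 if D is None else D    # assumed linear drag
    g = np.zeros(n) if g is None else g        # neutral buoyancy assumed
    tau_c = np.clip(tau_c, -u_max, u_max)      # saturated actuators
    acc = np.linalg.solve(M, tau_c + tau_d - D @ nu - g)
    nu_next = nu + dt * acc
    eta_next = eta + dt * nu_next              # semi-implicit Euler
    return eta_next, nu_next
```

Because `tau_d` may exceed `u_max`, no instantaneous control can cancel the net force, which is why the controller must exploit the disturbance's temporal structure.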
III-B Problem Definition
In RL, the goal is to learn a policy $\pi$ that chooses actions $a_t$ at each time step in response to the current state $s_t$, such that the total expected sum of discounted rewards is maximized over all time. The state $s_t$ of the robot consists of the position $\boldsymbol{\eta}$ as well as the corresponding velocities $\dot{\boldsymbol{\eta}}$. The action $a_t$ comprises the control forces $\boldsymbol{\tau}_c$. At each time step, the system transitions from $s_t$ to $s_{t+1}$ in response to the chosen action and the transition dynamics function $f$, collecting a reward $r_t$ according to the reward function

$r_t = -\left(s_t^{\top} Q s_t + a_t^{\top} R a_t\right)$

where $Q$ and $R$ represent weight matrices. The discounted sum of future rewards is then defined as $R_t = \sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'}$, where $\gamma \in (0, 1]$ is a discount factor that prioritizes near-term rewards over distant rewards.
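The quadratic reward and the discounted return above can be computed directly; this is a small sketch with the symbols `Q`, `R` and `gamma` as stated in the text (the quadratic form of the reward is our reconstruction of the garbled equation).

```python
import numpy as np

def reward(s, a, Q, R):
    """Quadratic cost turned into a reward: r = -(s^T Q s + a^T R a)."""
    return -(s @ Q @ s + a @ R @ a)

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t, accumulated backwards for simplicity."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G
```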
IV DOB-Net
Underwater disturbances present great challenges for stabilization control due to their excessive amplitudes as well as their rapidly time-varying characteristics. In this section, a classical DOB is first compared with a GRU; the results show some similarities in the structure of processing hidden information. An enhanced observer network for excessive time-varying disturbances is then designed using GRUs, encoding the disturbance dynamics into the GRU hidden state. A controller network is subsequently built upon this encoding to generate optimal controls.
IV-A Classical DOB
The basic idea of the classical DOB is to estimate the current disturbance forces based on the robot state and the executed controls. For an affine nonlinear system $\dot{x} = f(x) + g_1(x)u + g_2(x)d$, its formulation is proposed as

$\dot{z} = -l(x)g_2(x)z - l(x)\left(g_2(x)p(x) + f(x) + g_1(x)u\right)$
$\hat{d} = z + p(x)$

where $\hat{d}$ is the estimated disturbance, $z$ is the internal state of the nonlinear observer, and $p(x)$ is the nonlinear function to be designed. The DOB gain $l(x)$ is determined by the following nonlinear function:

$l(x) = \frac{\partial p(x)}{\partial x}$
It has been shown that the DOB is globally asymptotically stable by choosing $p(x)$ such that the estimation error $e_d = d - \hat{d}$ follows the dynamics

$\dot{e}_d = -l(x)g_2(x)e_d$

More specifically, the exponential convergence rate can be specified through the choice of $l(x)$. The convergence and the performance of the DOB have been established for slowly time-varying disturbances and for disturbances with bounded rates. A discrete version of the DOB is also provided (illustrated in Fig. 2).
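For concreteness, the discrete DOB update can be sketched for a 1-DOF system $m\dot{v} = u + d$ with the design choice $p(v) = Lmv$, so the gain is the constant $Lm$ and the error dynamics reduce to $\dot{e}_d = -Le_d$. The 1-DOF reduction and the specific choice of $p$ are our illustrative assumptions.

```python
def dob_step(z, v, u, m, L, dt):
    """One discrete step of the nonlinear DOB for m*dv/dt = u + d.

    Internal state z; the estimate d_hat = z + L*m*v converges to a
    constant disturbance d exponentially at rate L, illustrating why
    the classical DOB suits only slowly time-varying disturbances.
    """
    d_hat = z + L * m * v            # current disturbance estimate
    z = z + dt * (-L * (d_hat + u))  # observer internal-state update
    return z, d_hat
```

Running this against a constant disturbance shows the estimate settling at the true value, whereas a disturbance varying faster than `1/L` would leave a persistent lag.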
IV-B Gated Recurrent Unit (GRU)
The GRU updates its hidden state at each time step according to

$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right)$
$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right)$
$\tilde{h}_t = \tanh\left(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h\right)$
$h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t$

where $x_t$ is the input vector, $h_t$ is the output (hidden state) vector, $z_t$ is the update gate vector, $r_t$ is the reset gate vector, $W$, $U$ and $b$ are the weight matrices and bias vectors, $\sigma$ is the sigmoid function, and $\circ$ denotes the Hadamard product.
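These gate equations map directly to a few lines of numpy; a minimal sketch of a single GRU step (the dict-of-matrices parameter layout is our convention, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, W, U, b):
    """One GRU step following the update/reset-gate equations above.

    W, U, b are dicts keyed 'z' (update), 'r' (reset), 'h' (candidate),
    holding weight matrices and bias vectors.
    """
    z = sigmoid(W['z'] @ x + U['z'] @ h + b['z'])          # update gate
    r = sigmoid(W['r'] @ x + U['r'] @ h + b['r'])          # reset gate
    h_tilde = np.tanh(W['h'] @ x + U['h'] @ (r * h) + b['h'])
    return (1.0 - z) * h + z * h_tilde                     # new hidden state
```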
DOB-Net: The DOB-Net is constructed on the classical actor-critic architecture. The observer network consists of two GRUs with two fully connected layers between them, in order to imitate and enhance the function of the classical DOB. As described in Fig. 2 and Fig. 3, the DOB and the GRU have a similar architecture, especially the part in the red box. The internal state $z$ of the DOB acts as the hidden state, similar to the role of $h_t$ in the GRU, preserving hidden information for use at the next time step. To equip the GRU with the capability of observing disturbances, we first employ a GRU to process the same inputs as the DOB, namely the current state $s_t$ and the last action $a_{t-1}$. Besides the hidden state, the DOB also outputs the estimated disturbances $\hat{d}$, which is a function of both the input state and the hidden state. Thus, we add fully connected layers after the first GRU to provide a better embedding of the disturbance estimate.
After that, the embedding of the estimated disturbances can be further fed into another GRU, in order to encode a sequence of disturbances over a period of time up to the current time step. The embedding of this disturbance sequence is intended to represent the disturbance dynamics. It can then be combined with the current state $s_t$ to form the actual inputs of the controller network. One design parameter of the DOB-Net is the dimension of the embedding of the estimated disturbances. In this paper, 3 dimensions (the size of the disturbance vector) and 64 dimensions (the size of the RNN hidden state) are chosen. These two choices will be compared in simulation. This comparison shows the flexibility gained by building the observer from GRUs.
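The data flow just described (first GRU over the state and last action, a fully connected embedding of the disturbance estimate, a second GRU, then a controller head on the encoding plus the current state) can be sketched with untrained toy weights. Class names, layer sizes and initialization are our assumptions; the sketch only illustrates the wiring, not the trained network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyGRU:
    """Minimal untrained GRU, only to illustrate the DOB-Net data flow."""
    def __init__(self, n_in, n_h, rng):
        self.p = [rng.normal(0, 0.1, (n_h, n_in + n_h)) for _ in range(3)]
    def step(self, x, h):
        xh = np.concatenate([x, h])
        z, r = sigmoid(self.p[0] @ xh), sigmoid(self.p[1] @ xh)
        h_tilde = np.tanh(self.p[2] @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

class DOBNetSketch:
    """Observer (GRU -> fc embedding of size n_e -> GRU) feeding a
    controller head; n_e = 3 or 64 are the two variants compared."""
    def __init__(self, n_s=6, n_a=3, n_h=64, n_e=64, seed=0):
        rng = np.random.default_rng(seed)
        self.gru1 = TinyGRU(n_s + n_a, n_h, rng)
        self.fc_e = rng.normal(0, 0.1, (n_e, n_h))         # disturbance embedding
        self.gru2 = TinyGRU(n_e, n_h, rng)
        self.fc_pi = rng.normal(0, 0.1, (n_a, n_h + n_s))  # controller head
        self.h1, self.h2 = np.zeros(n_h), np.zeros(n_h)
    def act(self, s, a_prev):
        self.h1 = self.gru1.step(np.concatenate([s, a_prev]), self.h1)
        e = np.tanh(self.fc_e @ self.h1)                   # estimate embedding
        self.h2 = self.gru2.step(e, self.h2)               # dynamics encoding
        return np.tanh(self.fc_pi @ np.concatenate([self.h2, s]))
```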
Reduced DOB-Net: Instead of imitating the full design of the classical DOB, a simplified structure with only one GRU might also work, owing to the RNN's powerful capacity for processing time-series data. As depicted in Fig. 5, the disturbance dynamics observer network is simply a GRU taking $(s_t, a_{t-1})$ as inputs, while all other parts remain the same. The standard and reduced versions of the DOB-Net are compared in Section V.
Training: Advantage Actor Critic (A2C) is a conceptually simple and lightweight framework for deep RL that uses synchronous gradient descent to optimize deep neural network controllers. The algorithm synchronously executes multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time step the parallel agents will be experiencing a variety of different states.
Our algorithm is developed in the A2C style. Pseudocode of the DOB-Net is shown in Algorithm 1. Each thread interacts with its own copy of the environment. The disturbances also differ across threads, each being randomly sampled. We found that this setting helps accelerate the convergence of learning and improves performance, through comparison with using the same disturbances across all threads during numerical simulations. The algorithm operates in the forward view by explicitly computing $n$-step returns. To compute a single update for the policy and the value function, the algorithm first samples and performs actions using its exploration policy for up to $n$ steps or until a terminal state is reached. It then computes gradients for $n$-step updates. Each $n$-step update uses the longest possible $n$-step return, resulting in a one-step update for the last state, a two-step update for the second-to-last state, and so on, for a total of up to $n$ updates. The accumulated updates are applied in a single gradient step.
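The forward-view $n$-step targets described above can be sketched as follows: the last state in a rollout gets a 1-step target bootstrapped from the critic, the first gets up to an $n$-step target (the function name and the advantage definition `returns - values` are our conventions).

```python
import numpy as np

def n_step_targets(rewards, values, bootstrap, gamma=0.99):
    """Forward-view n-step returns for an A2C-style rollout.

    `rewards` are r_t over the rollout, `values` the critic's V(s_t),
    and `bootstrap` is V(s) at the state following the rollout (0 at a
    terminal state). Accumulating backwards gives each state the
    longest possible n-step return.
    """
    n = len(rewards)
    returns = np.empty(n)
    R = bootstrap
    for t in reversed(range(n)):
        R = rewards[t] + gamma * R
        returns[t] = R
    advantages = returns - np.asarray(values)
    return returns, advantages
```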
V Simulation Experiments
V-A Simulation Setup
A position regulation task is simulated to test our approaches. The simulated AUV is modeled as a rigid body of fixed mass and size. Only positional motion and control are considered; thus, the AUV has a 6-dimensional state space and a 3-dimensional action space, and the control forces are bounded by fixed limits. Each training episode contains 200 steps with 0.05 s per step. In each episode, the robot starts at a random position with a random velocity, and it is controlled to reach a target position and stay within a region (referred to as the converged region) thereafter.
In these experiments, the algorithms are trained using simulated disturbances, and tested using both simulated and collected disturbances. The simulated disturbances take the form of sinusoidal waves with periods ranging from 2 s to 4 s and phases ranging from 0 to $2\pi$ rad. According to the problem setting, the amplitudes of the disturbances exceed the AUV control limits; two amplitude ranges are tested, namely 100-120% and 130-150% of the AUV control limits. One example of the simulated disturbances used in a test case is given in Fig. 6 (a), with amplitudes between 130-150% of the AUV control limits; each curve represents the disturbance in one direction (X, Y or Z). Our purpose is to enable the trained policy to deal with unknown time-varying disturbances, so the amplitudes, periods, and phases are randomly sampled from these distributions in each training episode. To further validate the efficacy of the proposed algorithms, we also collected current and wave disturbance data in a water tank using a wave generator, as shown in Fig. 6 (b). The data is collected through an onboard IMU of an unactuated AUV; the measured linear accelerations are mapped to forces, which are taken as the disturbance forces. We note that the amplitudes of the collected disturbances are time-varying, and are not constrained within the amplitude ranges seen during training (100-120% and 130-150% of the control limits). Also, from the frequency spectrum (Fig. 7), we can see that each simulated disturbance has only one frequency; they are periodic signals. The collected disturbances are better described as a superposition of multiple sinusoidal waves, and are obviously more complex and challenging for the controller.
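The per-episode disturbance sampling described above can be sketched as one random sinusoid per axis, with the amplitude drawn as a fraction of the control limit, the period from [2, 4] s and the phase from [0, 2π). The function name and default episode length are our assumptions; the sampled ranges follow the text.

```python
import numpy as np

def sample_disturbance(u_max, steps=200, dt=0.05,
                       amp_range=(1.0, 1.2), rng=None):
    """Sample a (steps x 3) disturbance trajectory for one episode.

    Each axis gets an independent sinusoid: amplitude in
    amp_range * u_max (e.g. 100-120% of the control limit),
    period in [2, 4] s, phase in [0, 2*pi).
    """
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(steps) * dt
    tau_d = np.empty((steps, 3))
    for axis in range(3):
        A = u_max * rng.uniform(*amp_range)   # exceeds the control limit
        T = rng.uniform(2.0, 4.0)             # period in seconds
        phi = rng.uniform(0.0, 2 * np.pi)     # random phase
        tau_d[:, axis] = A * np.sin(2 * np.pi * t / T + phi)
    return tau_d
```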
Eight different methods for disturbance rejection control are tested and compared:
(a) Trajectory Optimization
(b) Robust Integral of the Sign Error (RISE) Control 
(c) Advantage Actor Critic (A2C)
(d) Recurrent A2C (RA2C)
(e) History Window A2C (HWA2C)
(f) Reduced DOB-Net
(g) DOB-Net (3-dimensional disturbance embedding)
(h) DOB-Net (64-dimensional disturbance embedding)
Notice that, among these methods, trajectory optimization assumes full knowledge of the disturbances over the whole episode, while all other algorithms deal with unknown disturbances. The comparison is therefore not entirely fair; trajectory optimization is used only to provide an ideal-case performance reference. RISE control is a traditional feedback controller, HWA2C applies the history window approach within the A2C framework, and RA2C employs an RNN to deal with state-only inputs. In the remainder of this section, we first evaluate the training process of the different algorithms, then test and compare their control performance using either simulated or collected disturbances.
V-B Training Results
Fig. 8 shows that considering history information, whether of states only or of states and actions, significantly improves the RL algorithm. Under small disturbances, the different usages of history information achieve nearly the same reward. When the disturbances become larger, there is a noticeable gain from using additional action inputs for the recurrent policy. Also, using a recurrent network instead of the history window approach achieves a higher reward, which may be due to its more efficient use of the history information. For the DOB-Net algorithms, we notice that the reduced version performs worse than the standard DOB-Net, and that the larger embedding size (64 dimensions) of the disturbance estimate gives a higher reward.
V-C Test Results on Simulated Disturbances
The training reward is not sufficient to compare the performance of the different algorithms; we are also interested in the state distribution and bounded response (i.e., converged region) of the AUV disturbed by flows. Thus, we further test and compare these well-trained algorithms, still on the position regulation task with randomly sampled parameters of the simulated disturbances. As shown in Fig. 9 and Fig. 10, we compare the distribution of the distance from the target among the different algorithms, in the first half (steps 1-100) and second half (steps 101-200) of each episode. The distance ranges of the AUV in the second stage are smaller than those in the first stage, because the AUV first observes the disturbance dynamics and then tries to stabilize itself. This also demonstrates that all these algorithms stabilize the AUV to a certain extent.
Again, note that trajectory optimization provides an optimal solution in the case where the disturbance values over the entire episode are known. Our goal is to narrow the gap between our algorithm and this ideal-case optimal solution. Focusing on the second stage of the episode, it is clear that both the history window policy and the recurrent policy perform better than the standard A2C policy and the RISE control, which means the history information does improve disturbance rejection capability. The recurrent policies considering both state and action inputs perform even better than the history window policy, showing that recurrent networks utilize history information more efficiently than naively stacking multiple past state-action pairs into the observation space. Also, considering actions as additional inputs besides states yields better performance.
Among the three implementations of the DOB-Net, the DOB-Net with the 64-dimensional embedding achieves the best control performance. We believe enlarging the embedding size of the disturbance estimate provides a better representation of the disturbance dynamics, whereas compressing this embedding from a 64-dimensional variable to a 3-dimensional variable may cause a loss of information. However, even with the best RL algorithms mentioned so far, the control performance still has a large gap from the trajectory optimization solution, leaving room for further improvement.
In addition, stronger disturbances obviously lead to worse control performance. But we also found that a larger amplitude range of disturbances brings the DOB-Net's performance closer to that of the optimal method (the ratio of the difference between the two medians over the median of trajectory optimization is 635.49% and 95.84% for small and large disturbances, respectively). This phenomenon might be because the optimal controls for different disturbance patterns with large amplitudes tend to be more similar to one another than those with small amplitudes, making it easier for RL to learn a near-optimal control policy under larger disturbance amplitudes.
The 3D trajectories of these control approaches are compared in Fig. 11, using the simulated disturbances from Fig. 6. The red ball represents the maximum distance of the AUV from the target during the last 50 steps, i.e., the converged region. Based on this region, we can see that it is difficult for the AUV to achieve a satisfactory bounded response using either the traditional feedback controller (RISE) or the classical RL policy (A2C). The proposed DOB-Net algorithms significantly narrow the converged region, and the DOB-Net with the 64-dimensional embedding achieves the best results among all the RL methods. Using the DOB-Net, the AUV can quickly navigate to the target and stabilize itself within a small distance from the target thereafter, which demonstrates the effectiveness of the DOB-Net. However, there is still an obvious gap between the DOB-Net and the optimal trajectory.
V-D Test Results on Collected Disturbances
Beyond the simulated disturbance dynamics, we also use the collected current and wave disturbances (shown in Fig. 6 (b)) for testing. Note that the collected data is used only for testing; no retraining is performed at this stage. The algorithm performance follows the same order as in the simulated case, but all methods perform worse. This is because the amplitudes of the collected disturbances do not always fall within the 100-120% or 130-150% ranges of the control limits; there are outliers, leading to a wider range of amplitudes than in the simulated disturbances. Our algorithm might not be capable of handling these outliers optimally. This also gives rise to another research question: how to deal with disturbances with a wider range of parameters, which may require transfer learning techniques.
Fig. 13 shows the 3D trajectories of RISE controller and the DOB-Net when applying the collected disturbances, the results do prove the effectiveness of the DOB-Net, but the performance is worse compared with using simulated disturbances, due to the more complex and diverse dynamics of the collected disturbances.
VI Conclusion & Future Work
This paper proposes an observer-integrated RL approach called DOB-Net for mobile robot control problems under unknown, excessive, time-varying disturbances. A disturbance dynamics observer network employing RNNs is used to imitate and enhance the function of the classical DOB, producing an embedding of the disturbance estimation and prediction. A controller network takes the observer outputs as well as the current state as inputs to generate optimal controls. Multiple control and RL algorithms have been tested and compared on position regulation tasks using both simulated and collected disturbances; the results demonstrate that the proposed DOB-Net significantly improves disturbance rejection capability compared to existing methods.
Currently, the test disturbances are collected in a water tank using a wave generator; we plan to obtain disturbance data from open-water environments with natural currents and waves for further testing. We have also noticed that the performance of the DOB-Net is worse on the collected disturbances, due to their more complex and diverse dynamics. An interesting direction for future work is to investigate transfer learning for dealing with real-world current and wave disturbances when simulated data and only a small amount of collected data are available. In addition, deployment of this method on real-world robotic systems requires further investigation, where the low sample efficiency of generic model-free RL may be a problem. Model-based approaches may be necessary to overcome the constraints of real-time sample collection in the real world.
-  G. Griffiths, Technology and applications of autonomous underwater vehicles. CRC Press, 2002, vol. 2.
-  J. Woolfrey, D. Liu, and M. Carmichael, “Kinematic control of an autonomous underwater vehicle-manipulator system (auvms) using autoregressive prediction of vehicle motion and model predictive control,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 4591–4596.
-  L.-L. Xie and L. Guo, “How much uncertainty can be dealt with by feedback?” IEEE Transactions on Automatic Control, vol. 45, no. 12, pp. 2203–2217, 2000.
-  Z. Gao, “On the centrality of disturbance rejection in automatic control,” ISA transactions, vol. 53, no. 4, pp. 850–857, 2014.
-  S. Waslander and C. Wang, “Wind disturbance estimation and rejection for quadrotor position control,” in AIAA Infotech@ Aerospace Conference and AIAA Unmanned… Unlimited Conference, 2009, p. 1983.
-  R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
-  W.-H. Chen, D. J. Ballance, P. J. Gawthrop, and J. O’Reilly, “A nonlinear disturbance observer for robotic manipulators,” IEEE Transactions on industrial Electronics, vol. 47, no. 4, pp. 932–938, 2000.
-  W.-H. Chen, J. Yang, L. Guo, and S. Li, “Disturbance-observer-based control and related methods—an overview,” IEEE Transactions on Industrial Electronics, vol. 63, no. 2, pp. 1083–1095, 2016.
-  S. Skogestad and I. Postlethwaite, Multivariable feedback control: analysis and design. Wiley New York, 2007, vol. 2.
-  W. Lu and D. Liu, “Active task design in adaptive control of redundant robotic systems,” in Australasian Conference on Robotics and Automation. ARAA, 2017.
-  ——, “A frequency-limited adaptive controller for underwater vehicle-manipulator systems under large wave disturbances,” in 2018 13th World Congress on Intelligent Control and Automation (WCICA). IEEE, 2018, pp. 246–251.
-  D. P. Bertsekas, Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 1995, vol. 1, no. 2.
-  C. Edwards and S. Spurgeon, Sliding mode control: theory and applications. Crc Press, 1998.
-  J. Yang, S. Li, X. Chen, and Q. Li, “Disturbance rejection of ball mill grinding circuits using dob and mpc,” Powder Technology, vol. 198, no. 2, pp. 219–228, 2010.
-  J. Han, “The extended state observer of a class of uncertain systems,” Control and decision, vol. 10, no. 1, pp. 85–88, 1995.
-  H. Gao and Y. Cai, “Nonlinear disturbance observer-based model predictive control for a generic hypersonic vehicle,” Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering, vol. 230, no. 1, pp. 3–12, 2016.
-  E. F. Camacho and C. B. Alba, Model predictive control. Springer Science & Business Media, 2013.
-  U. Maeder and M. Morari, “Offset-free reference tracking with model predictive control,” Automatica, vol. 46, no. 9, pp. 1469–1476, 2010.
-  S. Brahmbhatt and J. Hays, “Deepnav: Learning to navigate large cities,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3087–3096.
-  P. Karkus, D. Hsu, and W. S. Lee, “Particle filter networks: End-to-end probabilistic localization from visual observations,” arXiv preprint arXiv:1805.08975, 2018.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
-  J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
-  S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, “Q-prop: Sample-efficient policy gradient with an off-policy critic,” arXiv preprint arXiv:1611.02247, 2016.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
-  S. Sæmundsson, K. Hofmann, and M. P. Deisenroth, “Meta reinforcement learning with latent variable gaussian processes,” arXiv preprint arXiv:1803.07551, 2018.
-  L.-J. Lin and T. M. Mitchell, “Reinforcement learning with hidden states,” From animals to animats, vol. 2, pp. 271–280, 1993.
-  T. Wang, W. Lu, and D. Liu, “Excessive disturbance rejection control of autonomous underwater vehicle using reinforcement learning,” in Australasian Conference on Robotics and Automation, 2018.
-  M. Hausknecht and P. Stone, “Deep recurrent q-learning for partially observable mdps,” in 2015 AAAI Fall Symposium Series, 2015.
-  D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber, “Solving deep memory pomdps with recurrent policy gradients,” in International Conference on Artificial Neural Networks. Springer, 2007, pp. 697–706.
-  A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning,” in Robotics and Automation (ICRA), 2018 IEEE International Conference on. IEEE, 2018, pp. 7579–7586.
-  S. Li, J. Yang, W.-H. Chen, and X. Chen, Disturbance observer-based control: methods and applications. CRC press, 2016.
-  K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
-  N. Fischer, D. Hughes, P. Walters, E. M. Schwartz, and W. E. Dixon, “Nonlinear rise-based control of an autonomous underwater vehicle,” IEEE Transactions on Robotics, vol. 30, no. 4, pp. 845–852, 2014.
-  S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.