I Introduction and Related Work
Acting in unstructured environments can be challenging, especially when the environment is dynamic and involves continuous control states. We study the goal-directed action decision-making problem where a robot's actions can be disturbed by environmental disturbances such as ocean waves or air turbulence. To be concrete, consider a scenario where an underwater vehicle navigates across an area of ocean over a period of a few weeks to reach a goal location. Underwater vehicles such as the autonomous gliders currently in use can travel long distances, but move at speeds comparable to, or slower than, typical ocean currents [25, 21]. Moreover, the disturbances caused by ocean eddies are often too complex to model. This is because navigation with underwater (or, generically, aquatic) vehicles usually involves long-term and long-distance missions, during which the ocean currents can change significantly, causing spatially and temporally varying disturbances. The ocean currents are not only complex in their patterns, but also strong in their tidal forces, and can easily perturb the underwater vehicle's motion, causing significantly uncertain action outcomes. In general, such non-static and diverse disturbances are a reflection of the unstructured natural environment, and it is often very difficult to accurately formulate the complex disturbance dynamics with mathematical models. Fortunately, many disturbances caused by nature are seasonal and recurring, can be observed, and the observation data is available over some time horizons. For example, we can obtain the forecast, nowcast, and hindcast of the weather, including wind (air turbulence) information, from related observatories. Similarly, ocean current information can also be obtained, and such data allows us to train the robot to learn to interact with the ocean currents, which are spatiotemporal. Learning spatiotemporal features has been well investigated.
For instance, the latent spatiotemporal features between two images can be retrieved through convolutional and recurrent learning frameworks [23, 24, 7]. Unfortunately, most existing spatiotemporal deep learning algorithms do not involve agent decision-making mechanisms.
Recently, studies on deep reinforcement learning have revealed a great potential for addressing complex decision problems such as robotic control [8, 5] and computer and board game playing [14, 20, 16]. There are certain similarities between our marine robot's decision-making and these game playing scenarios, if one regards the agent's interacting platform/environment here as nature instead of a game. However, one critical challenge that prevents robots from using deep learning is the lack of sufficient training data [17]. Indeed, using robots to collect training data can be extremely costly (e.g., to obtain one set of marine data using onboard sensors, it is not uncommon for a marine vehicle to take a few days and traverse hundreds of miles). Also, modeling a vast area of the environment can be computationally expensive. Fortunately, a complex-patterned disturbance can usually be characterized by local patches, where a single patch may possess a particular disturbance pattern (e.g., a vortex/ring pattern, a river-flow-like meandering pattern, or a constant waving pattern) [15], and the total number of such basic patterns is enumerable. Therefore, we are motivated to train the vehicle to learn those local patches/patterns offline, so that during a real-time mission, if the disturbance is a mixture of a subset of the learned patterns, the vehicle can take advantage of what it has learned to cope with it easily, thus reducing the computation time for online action prediction and control. We use the iterative linear quadratic regulator [10] to model the vehicle dynamics and control, and use the policy gradient framework [9] to train the network. We tested our method in simulations with dynamic disturbances both artificially created and drawn from a history of ocean current data, and our extensive evaluation results show that the trained robot achieves satisfactory performance.
II Technical Approach
We use the deep reinforcement learning framework to model our decision-making problem. Specifically, we use s and a to denote the robot's state and action, respectively. The input of the deep network is the disturbance information, which is typically a vector field. Our goal is to obtain a stochastic policy π(a|s; θ) parameterized by θ that maximizes the discounted cumulative reward R = Σ_{t=0}^{T} γ^t r_t, where T is a horizon term specifying the maximum number of time steps, r_t is the reward at time t, and γ is a discounting constant between 0 and 1 that ensures the convergence of the sum. In this study, a deep recurrent neural network is used to approximate the optimal action-value function Q*(s, a). Therefore, the policy parameters θ are the weights of the neural network. More details of the basic model can be found in [14].

II-A Network Design
Since the ocean current data over a period is available, we build our neural network with an input that integrates both the ocean (environmental) and the vehicle's states. The environmental state here is a vector field representing the ocean currents (their strengths and directions). The structure of the neural network is shown in Fig. 2. Specifically, the input consists of two components: the environment and vehicle states. The environmental component has three channels, where the first two channels convey the x-axis and y-axis information of the disturbance vector field, and the third describes the goal and obstacles. We assume that each grid cell of the input map in the third channel has three types: it can be occupied by an obstacle (we set its value to -1), be free/empty for the robot to transit to (with value 0), or be occupied by the robot (with value 1). The vehicle state component of the input is a vector that includes the vehicle's velocity and its direction towards the goal. Note that we do not include the robot's position in the input, because we want the robot to be sensitive only to environmental dynamics, not to specific (static) locations. The design of the internal hidden layers is depicted in Fig. 2. The front 3 recurrent layers process the environment information, while the vehicle states are incorporated from the first fully connected (FC) layer onward. The reason for such a design is that the whole net can be regarded as two correlated and eventually connected subnets: one subnet characterizes the features of the disturbances, analogous to image classification; the other is a decision component for choosing the best action strategy. In addition, such separation of the inputs reduces the number of parameters so that the training process can be accelerated. The structure of each recurrent layer is depicted in Fig. 3.
For each recurrent layer, we unfold it into 3 time steps, and therefore a sequential input consisting of 3 time steps is required by each layer. Each recurrent layer generates 3 outputs, which are used by the following recurrent layer. Note that for the last recurrent layer (i.e., layer 3), only the last output (the output generated at time step 3) is passed to the FC layer for further processing. After each recurrent layer a max-pool is applied. The vehicle states first pass through 2 FC layers in their subnet, and are then combined with the environmental component output from recurrent layer 3 as the input to the successive FC Layer 1. Between FC Layers 1 and 2 there is a dropout layer to avoid overfitting. The Softmax layer is used to normalize the outputs into a probability distribution that can be used for sampling future actions. Additionally, the loss function is calculated using this probability distribution as well as the actual rewards.

II-B Motion with Environmental Dynamics
We consider the robot's motion on a two-dimensional ocean surface, and assume the robot's state is s = (x, y, θ)^T, which includes the vehicle's position and orientation in the world frame, respectively. Since the behavior of the vehicle in the 2D environment is similar to that of a ground mobile robot, we opt to use a Dubins car model to simulate its motion. Similar settings can also be found in [12, 13, 4, 21, 6]. The dynamics can be written as:
ẋ = v cos θ,   ẏ = v sin θ,   θ̇ = ω,    (1)
where v and ω are the vehicle's linear and angular speeds, respectively. One challenge here is that the vehicle dynamics are not only nonlinear, but also time-varying due to the non-negligible spatiotemporal ocean disturbances d(x, y, t). Here d is assumed to be a deterministic function with differentiable components, and for simplicity its arguments are often omitted. Since the ocean environment is usually open, with little or no obstacles on or near the surface, we assume that the robot's linear speed is slowly time-varying and can be viewed as a constant during the planning horizon, and that the control input is the angular speed u = ω. Therefore, the motion dynamics that integrate the external time-varying disturbance become
ẋ = v cos θ + d_x(x, y, t),   ẏ = v sin θ + d_y(x, y, t),   θ̇ = u,    (2)
where v is the magnitude of the robot velocity and u is the control variable. The magnitude of the ocean disturbances is assumed to be similar to or smaller than that of the robot velocity, so that the robot can reach the destination through the control policy. Such a nonlinear control problem can be solved using the iterative Linear Quadratic Regulator (iLQR) [10, 22]. However, similar to LQR, the iLQR essentially assumes that the system dynamics are known, so that the solution to the nonlinear case can be approximated iteratively until convergence. In our case, the motion dynamics of Eq. (2) largely depend on the external disturbance d, which is time-varying and hard to model accurately. In our work, we approximate such complex spatiotemporal dynamics by assuming that, within the near-future horizon and around the local area (patch), the external disturbance is known (approximable and predictable). Specific formulations can be found in the Appendix. We also compare our deep learning framework with the baseline iLQR strategy in the experiments.
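To make the discretized motion model concrete, the following sketch integrates the disturbed dynamics of Eq. (2) with a simple Euler step. The vortex-like disturbance field and all function names here are our own illustrative assumptions, not part of the paper:

```python
import numpy as np

def disturbance(x, y, t):
    """Hypothetical spatiotemporal disturbance d(x, y, t): a gentle
    vortex around the origin, standing in for real ocean-current data."""
    return np.array([-0.3 * y, 0.3 * x])

def step(state, u, t, v=1.0, dt=0.1):
    """One Euler step of Eq. (2): x' = v cos(th) + d_x, y' = v sin(th) + d_y,
    th' = u, with constant linear speed v and angular-speed control u."""
    x, y, th = state
    dx, dy = disturbance(x, y, t)
    return np.array([
        x + dt * (v * np.cos(th) + dx),
        y + dt * (v * np.sin(th) + dy),
        th + dt * u,
    ])

s1 = step(np.array([1.0, 0.0, 0.0]), u=0.5, t=0.0)  # one step from (1, 0, 0)
```

In the simulator, `disturbance` would be replaced by (interpolated) ocean-current data rather than this analytic field.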
II-C Loss Function and Reward
We employ the policy gradient framework for the solution. With the stochastic policy π(a|s; θ) and the Q-value Q^π(s, a) for the state-action pair (s, a), where θ denotes the parameters to be learned, the policy gradient of the loss function L(θ) can be defined as follows:

∇_θ L(θ) = −E[ ∇_θ log π(a|s; θ) Q^π(s, a) ].    (3)
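For intuition, the gradient in Eq. (3) can be estimated from samples in the REINFORCE manner. The linear softmax policy and the toy sizes below are our own assumptions for illustration, not the paper's network:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def grad_log_pi_times_q(theta, s, a, q):
    """Sample estimate of grad_theta log pi(a|s; theta) * Q(s, a) for a
    linear softmax policy pi(a|s; theta) = softmax(theta @ s)."""
    probs = softmax(theta @ s)
    g = -np.outer(probs, s)  # -p_b * s for every action row b
    g[a] += s                # indicator term for the action actually taken
    return q * g

theta = np.zeros((3, 4))  # 3 actions, 4-dimensional state (toy sizes)
s = np.array([1.0, 0.5, -0.2, 0.3])
g = grad_log_pi_times_q(theta, s, a=1, q=2.0)
```

Averaging such terms over sampled trajectories (with guided samples reweighted by importance sampling, as in [9]) gives the update direction for θ.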
To improve the sampling efficiency and accelerate convergence, we adopt an importance sampling strategy using guided samples [9]. With the objective of reaching the designated goal, our rewarding mechanism is related to minimizing the cost from start to goal. The main idea is to reinforce the correct actions that lead to reaching the goal quickly with a large positive value, and to penalize undesired actions (e.g., those that take a long time or even fail to reach the goal) with small positive or negative values. Formally, we define the reward of each trial/episode as:
R = R_g + λ R_p,    (4)

where

R_g = −d_min,    (5)

R_p = −(1/T) Σ_{i=1}^{T} d_i,    (6)
where d_i denotes the distance from the i-th step position to the goal, and d_min is the minimum of such distances along the whole path (if the robot fails to reach the goal, d_min is a nonzero value). The term R_g in Eq. (5) evaluates the state with respect to the goal state, whereas the term R_p in Eq. (6) summarizes an evaluation over the entire path. The coefficient λ is an empirical value that scales between R_g and R_p so that they contribute about the same to the total reward R.
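As a sketch, such a reward can be computed from a recorded path as follows. Since the exact constants of Eqs. (4)-(6) are not recoverable from the text, the goal bonus of 100 and the scale λ = 0.1 are our own placeholder values:

```python
import numpy as np

def episode_reward(path, goal, reached, lam=0.1):
    """Reward of one trial: a goal term based on the closest approach
    d_min (large positive if the goal is reached), plus a scaled term
    summarizing distance-to-goal over the entire path."""
    d = np.linalg.norm(np.asarray(path, dtype=float) - np.asarray(goal, dtype=float), axis=1)
    r_goal = 100.0 if reached else -d.min()  # state term w.r.t. the goal
    r_path = -d.mean()                       # whole-path term
    return r_goal + lam * r_path

r = episode_reward([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], goal=(2.0, 0.0), reached=True)
```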
II-D Offline Training and Online Decision-Making
We train the robot by setting different start and goal positions in the disturbance field, and the experience replay [14, 18] mechanism is employed to avoid overfitting. Specifically, we define an experience as a 3-tuple consisting of a sequence of three consecutive states (i.e., the current state and the two prior states), an action a, and a reward r. The idea is to store the experiences obtained in the past in a dataset. Then, during the reinforcement learning update process, a mini-batch of experiences is sampled from the dataset each time for training. The process of training is described in Algorithm 1, which can be summarized in four steps.

1) Following the current (learned) action policy, sample actions and finish a trial path, i.e., an episode.

2) Upon completion of each episode, obtain the corresponding rewards (a list) according to whether the goal is reached, and assign the rewards to the actions taken on that path.

3) Add all these experiences to the dataset. If the dataset exceeds its maximum capacity, erase as many of the oldest entries as needed to satisfy the capacity.

4) Sample a mini-batch of experiences from the dataset. This batch should include the most recent path. Then shuffle this batch of data and feed it into the neural network for training. If the current round number is less than the maximum number of training rounds, go back to step 1.
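The steps above can be sketched around a small replay dataset. The class and method names are ours, and the capacity and batch sizes are placeholders:

```python
import random
from collections import deque

class ReplayDataset:
    """Capped experience dataset: each experience is a 3-tuple
    (state_sequence, action, reward); the oldest entries are erased
    automatically once the capacity is exceeded."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add_episode(self, experiences):
        self.data.extend(experiences)

    def sample(self, batch_size, latest_path):
        # the mini-batch always includes the most recent path,
        # topped up with random draws from the whole dataset
        batch = list(latest_path)
        n_extra = max(0, min(batch_size - len(batch), len(self.data)))
        batch.extend(random.sample(list(self.data), n_extra))
        random.shuffle(batch)
        return batch

dataset = ReplayDataset(capacity=10000)
episode = [(("s0", "s1", "s2"), 1, 0.5), (("s1", "s2", "s3"), 0, -0.1)]
dataset.add_episode(episode)
batch = dataset.sample(batch_size=4, latest_path=episode)
```

The `deque(maxlen=...)` gives step 3's eviction of the oldest experiences for free.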
With the offline-trained results, online decision-making is straightforward: only one forward propagation of the network, with small computational effort, is needed. This also allows us to handle continuous motion and unknown states.
III Results
We validated the method in the scenario of marine robot goal-driven decision-making, where the ocean disturbances vary both spatially and temporally. An underwater glider simulator written in C++ was built to test the proposed approach. We assume the glider glides near the surface, and that the ocean currents do not vary within a small depth under the surface area. The underwater glider is modeled with the simplified dynamics described in Section II-B. The simulation environment was thus constructed as a two-dimensional ocean surface, and the spatiotemporal ocean currents are external disturbances for the robot, represented as a vector field, with each vector describing the water flow velocity captured at a specific moment in a specific location. Specifically, each disturbance vector at location (x, y) and time t is denoted as d(x, y, t) = (d_x, d_y), where d_x denotes the easting velocity component (along the latitude axis) and d_y denotes the northing component (along the longitude axis). The simulator is able to read and process ocean current data from the Regional Ocean Model System (ROMS) [19], where the ocean current data is labelled with latitude (corresponding to y), longitude (corresponding to x), the current easting (corresponding to d_x) and northing (corresponding to d_y) components, as well as the time stamp. The collected historical ocean data is used to train our agent.

III-A Network Training
We use TensorFlow [1] to build and train the network described in Fig. 2. In our experiments, the input vector field map is , and the size of the dataset for action replay is set to 10000. The learning rate is , the coefficient λ of Eq. (4) is set to , and the batch size used for each iteration is 500. In addition, we set the length of each episode to 300 steps. Fig. 4 demonstrates a path learned by the network under a vortex-like disturbance field, and Fig. 5 shows the features extracted from the internal layers of the network at some point on that path. Fig. 5(a) illustrates the features of the disturbance vector field. Specifically, the first two channels of Fig. 5(a) are the x- and y-components of the vector field, and the greyscale color represents the strength of the disturbance. The third channel of Fig. 5(a) is a pixel map that contains the goal point (white dot) and obstacle information (black borders); the other grey grid cells denote free space. Fig. 5(b) shows a mixed view of the features, with the three channels colored in red, green, and blue, respectively. The picture depicts a local centripetal pattern with the center located near the upper left corner. Fig. 5(c) shows the outputs of recurrent layer 3 (the last output, which is used by the FC layer), from which we can observe that the hidden layers extract local features such as the edges of fields and the directions of currents. The mostly dark feature maps correspond to other patterns.

III-B Evaluations
We implemented two methods: one belongs to the control paradigm, where we use the basic iLQR to compute the control inputs; the other is the deep reinforcement learning (DRL) framework that employs the guided policy mechanism, where the policy is guided by (and combined with) the iLQR solving process [9].
III-B1 Artificial Disturbances
We first investigate the method using artificially generated disturbances. We tested different vector fields, including vortex, meandering, uniformly spinning (waving), and centripetal patterns. For different trials, we specify different start and goal locations for the robot, and the goal-reaching rate is calculated as the number of successes divided by the total number of simulations. The results in Table I show that, within the given time limits, both the iLQR and DRL methods lead to a good success rate; in particular, the DRL performs better in complex environments such as the vortex field. In contrast, the iLQR framework has a slightly better performance in relatively mild environments where the disturbance has slowly-changing dynamics, such as the meandering disturbance field. We then test the average time costs, as shown in Table II. The results reveal that the trials using iLQR tend to consume less time than those of the DRL method. This can be attributed to the "idealized" artificial disturbances with known, simple, and accurate patterns, which can be handled precisely by the traditional control methodology.

Table I: Goal-reaching success rates under artificial disturbances

Pattern      Method  Num of trials  Num of successes  Success rate
Vortex       DRL     50             48                0.96
Vortex       iLQR    50             46                0.92
Meander      DRL     50             49                0.98
Meander      iLQR    50             50                1.00
Spin         DRL     50             49                0.98
Spin         iLQR    50             48                0.96
Centripetal  DRL     50             49                0.98
Centripetal  iLQR    50             48                0.96
Table II: Average time costs under artificial disturbances

Pattern      Method  Num of trials  Average time cost
Vortex       DRL     50             20.549
Vortex       iLQR    50             14.811
Meander      DRL     50             16.926
Meander      iLQR    50             15.367
Spin         DRL     50             17.667
Spin         iLQR    50             17.803
Centripetal  DRL     50             20.220
Centripetal  iLQR    50             14.792
III-B2 Ocean Data Disturbances
In this part of the evaluation, we use ocean current data obtained from the California Regional Ocean Modeling System (ROMS) [19]. The ocean data along the coast near Los Angeles is released every 6 hours, and a window of 30 days of data is maintained and retrievable [2]. An example of the ocean current surface is visualized in Fig. 6, which also demonstrates a robot's path from executing our training result. Because the original ROMS ocean dataset covers a vast area, and in practice it would require several days for the robot to travel through the whole space, during which a lot of variation and uncertainty may occur, we opt to focus on smaller local areas and randomly crop such subareas to evaluate our training results. Fig. 8(a) illustrates a synthesized view of multiple trials employing the DRL training result in a selected area of ocean data. The remaining panels of Fig. 8 illustrate different paths during training in various disturbance areas. Similar to the evaluation process for the artificial disturbances, we examined the aforementioned performance measures, i.e., the success rate and time cost, under the ocean disturbances. We also examined the total number of steps of each path, where each step represents the robot transiting from one grid cell to a new one (the resolution of the grid map is the same as that of the vector field map). Table III shows the results (the robot speed is not scaled to the real map), from which we can see that the DRL takes less time and fewer steps on average than the iLQR. Fig. 7 provides a better view for comparing the time and step costs, where the statistical variations can be closely examined. In detail, the standard deviation of DRL is much smaller than that of iLQR. This characteristic can also be observed in Fig. 9, where the sawtooth curves of iLQR imply dramatic changes across trials.
Our experimental evaluations indicate that the iLQR works well for those environments that can be well described and accurately formulated. In contrast, the DRL framework is particularly capable of handling complex and (partially) unstructured spatiotemporal environments that cannot be precisely modelled.
Table III: Results under ROMS ocean data disturbances

Area    Method  Num of trials  Success rate  Average time cost  Average num of steps
Area 1  DRL     15             1.00          12.833             41.30
Area 1  iLQR    15             1.00          16.375             47.27
Area 2  DRL     15             1.00          11.142             39.55
Area 2  iLQR    15             1.00          15.530             45.33
Area 3  DRL     15             1.00          11.383             40.50
Area 3  iLQR    15             1.00          13.186             41.13
IV Conclusions
In this paper we investigated applying the deep reinforcement learning framework to robotic learning and acting in partially structured environments. We used the scenario of marine vehicle decision-making under spatiotemporal disturbances to demonstrate and validate the framework. We showed that the deep network characterizes the local features of varying disturbances well. By training the robot under artificial and real ocean disturbances, our simulation results indicate that the robot is able to successfully and efficiently act in complex and partially structured environments.

Appendix

This appendix discusses the solution of the dynamic model (2). We define the state of the robot as x = (x, y, θ)^T, and write the model dynamics in discrete time as
x_{k+1} = f(x_k, u_k) = x_k + Δt F(x_k, u_k),    (7)
where Δt is the length of the discrete time step, F(·) is the right-hand side of (2), and k = 0, 1, …, N−1. The initial state x_0 of the robot is known. We further assume that the total loss function to be minimized is of the form
J = Σ_{k=0}^{N−1} l(x_k, u_k) + l_f(x_N),    (8)
where l and l_f are in quadratic form, i.e.,
l(x_k, u_k) = (c/2) u_k^2,    (8a)

l_f(x_N) = (1/2) ‖p_N − p_g‖^2,    (8b)

where

p_k = (x_k, y_k)^T,    (8c)
c is a constant parameter, p_g is the goal position, and p_N is the position of the final state. It is typically difficult to search for solutions when the dynamic model is nonlinear. To make the problem tractable, the iterative LQR (iLQR) algorithm is employed [22]. The essential idea is iterative incremental improvement. Starting from a sequence of states x̄_0, …, x̄_N and control variables ū_0, …, ū_{N−1}, we approximate the nonlinear dynamic model (7) by a linear one, and then apply the LQR algorithm (with the nonlinear dynamics (7) in the backward pass) to obtain the next control sequence and state sequence. By repeating this procedure until convergence, one can obtain the final state and control sequences. Now assume the iLQR proceeds to the i-th iteration with state sequence x̄ and control ū. The dynamic model can be linearly approximated as
δx_{k+1} = A_k δx_k + B_k δu_k,    (9)
where δx_k = x_k − x̄_k, δu_k = u_k − ū_k, A_k = ∂f/∂x evaluated at (x̄_k, ū_k), and B_k = ∂f/∂u evaluated at (x̄_k, ū_k).
Note that the total loss function becomes a function J(δx, δu) of the deviations, where x̄ and ū are viewed as constants. Specifically, the total loss function can be expressed by the second-order expansion as
J = Σ_{k=0}^{N−1} [ l(x̄_k, ū_k) + l_x^T δx_k + l_u^T δu_k + (1/2) δx_k^T l_xx δx_k + (1/2) δu_k^T l_uu δu_k ] + l_f(x̄_N) + l_{f,x}^T δx_N + (1/2) δx_N^T l_{f,xx} δx_N,    (10)
where l_x, l_u, l_xx, l_uu, l_{f,x}, and l_{f,xx} denote the first- and second-order derivatives of l and l_f evaluated along (x̄, ū).
Due to the form of the loss functions (8a)-(8c), there are no cross quadratic terms between δx and δu. All of the above equalities in (10) are exact. If, however, the loss function is nonlinear, the last equation, with the necessary cross terms, is considered a quadratic approximation. We then search for δx and δu to minimize J. The procedure is essentially the standard dynamic programming for the LQR problem [11, 3]. During the backward iteration k = N−1, …, 0, the feedback gain matrix K_k is computed to get the optimal control update δu_k = k_k + K_k δx_k,
where k_k = −Q_uu^{−1} Q_u and K_k = −Q_uu^{−1} Q_ux.
The related matrices in the above equations can be obtained through the following functions:

Q_x = l_x + A_k^T V'_x,   Q_u = l_u + B_k^T V'_x,
Q_xx = l_xx + A_k^T V'_xx A_k,   Q_ux = B_k^T V'_xx A_k,   Q_uu = l_uu + B_k^T V'_xx B_k,

and

V_x = Q_x + K_k^T Q_uu k_k + K_k^T Q_u + Q_ux^T k_k,
V_xx = Q_xx + K_k^T Q_uu K_k + K_k^T Q_ux + Q_ux^T K_k,

where we denote the joint dynamics matrix F_k = [A_k  B_k], and V'_x, V'_xx are the derivatives of the value function at step k+1.
After the backward iteration, there is a forward iteration to compute the updated controls by

u_k = ū_k + k_k + K_k (x_k − x̄_k),

and to update the states x_k by the dynamic model (7) with the initial state x_0. This completes one iteration of the iLQR framework.
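The backward/forward procedure above can be condensed into a short sketch. The code below is our own minimal, unregularized iLQR with finite-difference Jacobians and a toy double-integrator test system, not the paper's implementation; a control cost of the form (c/2) u^2 and a quadratic terminal cost are assumed:

```python
import numpy as np

def ilqr(f, x0, us, x_goal, Qf, R, iters=5):
    """Minimal iLQR: linearize f around the current trajectory (central
    finite differences stand in for the analytic Jacobians A_k, B_k),
    run the LQR backward pass, then roll the nonlinear dynamics forward
    with the updated feedback controls."""
    def jac(g, z, eps=1e-5):
        J = np.zeros((len(g(z)), len(z)))
        for i in range(len(z)):
            dz = np.zeros(len(z))
            dz[i] = eps
            J[:, i] = (g(z + dz) - g(z - dz)) / (2 * eps)
        return J

    for _ in range(iters):
        xs = [x0]                      # nominal rollout under current controls
        for u in us:
            xs.append(f(xs[-1], u))
        Vx = Qf @ (xs[-1] - x_goal)    # terminal value derivatives
        Vxx = Qf.copy()
        ks, Ks = [], []
        for k in reversed(range(len(us))):
            A = jac(lambda x: f(x, us[k]), xs[k])
            B = jac(lambda u: f(xs[k], u), us[k])
            Qx, Qu = A.T @ Vx, R @ us[k] + B.T @ Vx
            Qxx = A.T @ Vxx @ A
            Quu = R + B.T @ Vxx @ B
            Qux = B.T @ Vxx @ A
            kff = -np.linalg.solve(Quu, Qu)
            K = -np.linalg.solve(Quu, Qux)
            ks.append(kff)
            Ks.append(K)
            Vx = Qx + K.T @ Quu @ kff + K.T @ Qu + Qux.T @ kff
            Vxx = Qxx + K.T @ Quu @ K + K.T @ Qux + Qux.T @ K
        ks.reverse()
        Ks.reverse()
        x, new_us = x0, []             # forward pass on the nonlinear model
        for k, u in enumerate(us):
            u_new = u + ks[k] + Ks[k] @ (x - xs[k])
            new_us.append(u_new)
            x = f(x, u_new)
        us = new_us
    return us

# toy double integrator (hypothetical, not the glider model of Eq. (2))
f = lambda x, u: np.array([x[0] + 0.1 * x[1], x[1] + 0.1 * u[0]])
x0 = np.array([0.0, 0.0])
us_opt = ilqr(f, x0, [np.array([0.0])] * 20, np.array([1.0, 0.0]),
              Qf=100.0 * np.eye(2), R=0.01 * np.eye(1))
x_final = x0
for u in us_opt:
    x_final = f(x_final, u)
```

For the linear toy system the procedure converges in one iteration; for the disturbed nonlinear dynamics, the repeated linearize/backward/forward loop is what the iLQR baseline in the experiments relies on.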
References

[1]
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,
A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving,
M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg,
D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,
B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan,
F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and
X. Zheng.
TensorFlow: Largescale machine learning on heterogeneous systems, 2015.
Software available from tensorflow.org.  [2] Y. Chao. Regional ocean model system. http://www.sccoos.org/data/roms3km/.
[3] K. Fragkiadaki. Lecture notes on deep reinforcement learning and control, Spring 2017. URL: https://katefvision.github.io/. Last visited on 02/27/2018.
[4] C. Goerzen, Z. Kong, and B. Mettler. A survey of motion planning algorithms from the perspective of autonomous UAV guidance. Journal of Intelligent and Robotic Systems, 57(1-4):65, 2010.
[5] S. Gu, T. Lillicrap, R. E. Turner, Z. Ghahramani, B. Schölkopf, and S. Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems 30, pages 3849–3858. Curran Associates, Inc., 2017.
 [6] K. Hvamb. Motion planning algorithms for marine vehicles. Master’s thesis, NTNU, 2015.

[7] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308–5317, 2016.
[8] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[9] S. Levine and V. Koltun. Guided policy search. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1–9, 2013.
 [10] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pages 222–229, 2004.
 [11] D. Liberzon. Calculus of variations and optimal control theory: a concise introduction. Princeton University Press, 2011.
[12] N. Mahmoudian and C. Woolsey. Underwater glider motion control. In 47th IEEE Conference on Decision and Control (CDC 2008), pages 552–557. IEEE, 2008.
 [13] A. Manzanilla, M. Garcia, R. Lozano, and S. Salazar. Design and control of an autonomous underwater vehicle (auvumi). In Marine Robotics and Applications, pages 87–100. Springer, 2018.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [15] L.Y. Oey, T. Ezer, and H.C. Lee. Loop current, rings and related circulation in the gulf of mexico: A review of numerical models and future challenges. Circulation in the Gulf of Mexico: Observations and models, pages 31–56, 2005.
[16] A. Oroojlooyjadid, M. Nazari, L. V. Snyder, and M. Takác. A deep Q-network for the beer game with partial information. CoRR, abs/1708.05924, 2017.

[17] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 3406–3413. IEEE, 2016.
[18] M. Riedmiller. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In ECML, volume 3720, pages 317–328. Springer, 2005.
[19] A. F. Shchepetkin and J. C. McWilliams. The regional oceanic modeling system (ROMS): a split-explicit, free-surface, topography-following-coordinate oceanic model. Ocean Modelling, 9(4):347–404, 2005.
 [20] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, Oct. 2017.
 [21] R. N. Smith, M. Schwager, S. L. Smith, B. H. Jones, D. Rus, and G. S. Sukhatme. Persistent ocean monitoring with underwater gliders: Adapting sampling resolution. Journal of Field Robotics, 28(5):714 – 741, 2011.
[22] Y. Tassa, N. Mansard, and E. Todorov. Control-limited differential dynamic programming. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 1168–1175. IEEE, 2014.
 [23] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatiotemporal features. In European conference on computer vision, pages 140–153. Springer, 2010.
 [24] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4489–4497. IEEE, 2015.
[25] R. B. Wynn, V. A. Huvenne, T. P. L. Bas, B. J. Murton, D. P. Connelly, B. J. Bett, H. A. Ruhl, K. J. Morris, J. Peakall, D. R. Parsons, E. J. Sumner, S. E. Darby, R. M. Dorrell, and J. E. Hunt. Autonomous underwater vehicles (AUVs): Their past, present and future contributions to the advancement of marine geoscience. Marine Geology, 352:451–468, 2014.