Learning Partially Structured Environmental Dynamics for Marine Robotic Navigation

We investigate the scenario in which a robot needs to reach a designated goal by taking a sequence of appropriate actions in a non-static, partially structured environment. One application example is controlling a marine vehicle moving in the ocean. The ocean environment is dynamic, and oftentimes the ocean waves result in strong disturbances that can perturb the vehicle's motion. Modeling such a dynamic environment is non-trivial, and integrating such a model into robotic motion control is particularly difficult. Fortunately, the ocean currents usually form local patterns (e.g., vortices), and thus the environment is partially structured. Historically observed data can be used to train the robot to learn to interact with the ocean tidal disturbances. In this paper we propose a method that applies the deep reinforcement learning framework to learn such partially structured, complex disturbances. Our results show that, by training the robot under artificial and real ocean disturbances, the robot is able to successfully act in complex spatiotemporal environments.

I Introduction and Related Work

Acting in unstructured environments can be challenging, especially when the environment is dynamic and involves continuous control states. We study the goal-directed action decision-making problem where a robot's actions can be disturbed by environmental disturbances such as ocean waves or air turbulence. To be more concrete, consider a scenario where an underwater vehicle navigates across an area of ocean over a period of a few weeks to reach a goal location. Underwater vehicles such as the autonomous gliders currently in use can travel long distances but move at speeds comparable to, or slower than, typical ocean currents [25, 21]. Moreover, the disturbances caused by ocean eddies are often too complex to model. This is because underwater (or, more generically, aquatic) vehicles are usually deployed on long-term and long-distance missions, during which the ocean currents can change significantly, causing spatially and temporally varying disturbances. The ocean currents are not only complex in pattern, but also exert strong tidal forces that can easily perturb the underwater vehicle's motion, causing significantly uncertain action outcomes. In general, such non-static and diverse disturbances are a reflection of the unstructured natural environment, and oftentimes it is very difficult to accurately formulate the complex disturbance dynamics using mathematical models. Fortunately, many disturbances caused by nature are seasonal, recurring, and observable, and the observation data is available for some time horizons. For example, we can obtain the forecast, nowcast, and hindcast of the weather, including wind (air turbulence) information, from related observatories. Similarly, ocean current information can also be obtained, and using such data allows us to train the robot to learn to interact with the ocean currents, which are spatiotemporal. Learning spatiotemporal features has been well investigated. For instance, the latent spatiotemporal features between two images can be retrieved through convolutional and recurrent learning frameworks [23, 24, 7]. Unfortunately, most existing spatiotemporal deep learning algorithms do not involve agent decision-making mechanisms.

Fig. 1: Ocean currents consist of local patterns (source: NASA). Red box: uniform pattern. Blue box: vortex. Yellow box: meandering

Recently, studies on deep reinforcement learning have revealed a great potential for addressing complex decision problems such as robotic control [8, 5] and computer and board game playing [14, 20, 16]. We find that there are certain similarities between our marine robot's decision-making and the game-playing scenarios, if one regards the agent's interacting platform/environment as nature instead of a game. However, one critical challenge that prevents robots from using deep learning is the lack of sufficient training data [17]. Indeed, using robots to collect training data can be extremely costly (e.g., in order to get one set of marine data using onboard sensors, it is not uncommon that a marine vehicle needs to take a few days and traverse hundreds of miles). Also, modeling a vast area of environment can be computationally expensive. Fortunately, a complex-patterned disturbance usually can be characterized by local patches, where a single patch may possess a particular disturbance pattern (e.g., a vortex/ring pattern, a river-flow-like meandering pattern, or a constant waving pattern) [15], and the total number of such basic patterns is enumerable. Therefore, we are motivated to train the vehicle to learn those local patches/patterns offline, so that during the real-time mission, if the disturbance is a mixture of a subset of those learned patterns, the vehicle can take advantage of what it has learned to cope with the disturbance easily, thus reducing the computation time for online action prediction and control. We use the iterative linear quadratic regulator [10] to model the vehicle dynamics and control, and use the policy gradient framework [9] to train the network. We tested our method in simulations with both artificially created dynamic disturbances and disturbances from a history of ocean current data, and our extensive evaluation results show that the trained robot achieves satisfactory performance.

II Technical Approach

We use the deep reinforcement learning framework to model our decision-making problem. Specifically, we use $s$ and $a$ to denote the robot's state and action, respectively. The input of the deep network is the disturbance information, which is typically a vector field. Our goal is to obtain a stochastic policy $\pi_{\theta}(a|s)$ parameterized by $\theta$ that maximizes the discounted, cumulative reward $R = \sum_{t=0}^{T} \gamma^{t} r_{t}$, where $T$ is a horizon term specifying the maximum number of time steps, $r_{t}$ is the reward at time $t$, and $\gamma$ is a discounting constant between 0 and 1 that ensures the convergence of the sum. In this study, a deep recurrent neural network is used to approximate the optimal action-value function $Q^{*}(s, a)$. Therefore, the policy parameters $\theta$ are the weights of the neural network. More details of the basic model can be found in [14].

II-A Network Design

Since ocean current data over a period of time is available, we build our neural network with an input that integrates both the ocean (environmental) state and the vehicle's state. The environmental state here is a vector field representing the ocean currents (their strengths and directions). The structure of the neural network is shown in Fig. 2. Specifically, the input consists of two components: the environment and vehicle states. The environmental component has three channels, where the first two channels convey the $x$-axis and $y$-axis components of the disturbance vector field, and the third describes the goal and obstacles. We assume that each grid cell of the input map in the third channel has one of three types: it can be occupied by an obstacle (value -1), be free/empty for the robot to transit to (value 0), or be occupied by the robot (value 1). The vehicle-state component of the input is a vector that includes the vehicle's velocity and its direction towards the goal. Note that we do not include the robot's position in the input because we want the robot to be sensitive only to environmental dynamics and not to specific (static) locations.

The design of the internal hidden layers is depicted in Fig. 2. The front 3 recurrent layers process the environment information, while the vehicle states are merged in at the first fully connected (FC) layer. The reason for such a design is that the whole net can be regarded as two correlated and eventually connected sub-nets: one sub-net is used to characterize features of the disturbances, analogous to image classification; the other sub-net is a decision component for choosing the best action strategy. In addition, such separation of the inputs reduces the number of parameters so that the training process can be accelerated. The structure of each recurrent layer is depicted in Fig. 3. For each recurrent layer, we unfold it into 3 time steps, and therefore a sequential input consisting of 3 time steps is required by each layer. Each recurrent layer generates 3 outputs, which are used by the following recurrent layer. Note that, for the last recurrent layer (i.e., layer 3), only the last output (the output generated at time step 3) is passed to the FC layer for further processing. After each recurrent layer a max-pooling operation is applied. The vehicle states first pass through 2 FC layers in their sub-net, and are then combined with the environmental output from recurrent layer 3 as the input to the successive FC layer 1. Between FC layers 1 and 2 there is a drop-out layer to avoid overfitting. The softmax layer is used to normalize the outputs into a probability distribution that can be used for sampling future actions. Additionally, the loss function is calculated using this probability distribution as well as the actual rewards.

Fig. 2: Neural network structure
Fig. 3: The structure of each recurrent layer: one weight matrix acts on the input data and another on the feedback (recurrent) data. All inner outputs are used by the next recurrent layer, and only the final inner output (inner output 3) of the last recurrent layer is exported to the next FC layer.
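To make the two-branch design concrete, the following is a minimal tf.keras sketch of such a network. The grid size, layer widths, and number of discretized actions are illustrative assumptions (the paper does not fix them here); the convolutional-recurrent layers, per-layer max-pooling, late fusion of the vehicle state, dropout, and softmax output follow the structure described above.

import tensorflow as tf
from tensorflow.keras import layers, Model

T, H, W = 3, 32, 32       # 3 unfolded time steps; 32x32 grid is an assumed size
NUM_ACTIONS = 8           # assumed number of discretized heading actions

# Environment sub-net: 3-channel vector-field maps over T time steps.
env_in = layers.Input(shape=(T, H, W, 3), name="env_field")
x = layers.ConvLSTM2D(16, 3, padding="same", return_sequences=True, name="recurrent1")(env_in)
x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True, name="recurrent2")(x)
x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
# Last recurrent layer: only its final output is passed on.
x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=False, name="recurrent3")(x)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)

# Vehicle sub-net: velocity and direction-to-goal through 2 FC layers.
veh_in = layers.Input(shape=(3,), name="vehicle_state")
v = layers.Dense(32, activation="relu")(veh_in)
v = layers.Dense(32, activation="relu")(v)

# Fusion: FC 1 -> dropout -> FC 2 -> softmax over actions.
h = layers.Concatenate()([x, v])
h = layers.Dense(128, activation="relu", name="fc1")(h)
h = layers.Dropout(0.5)(h)
logits = layers.Dense(NUM_ACTIONS, name="fc2")(h)
policy = layers.Softmax(name="policy")(logits)

model = Model(inputs=[env_in, veh_in], outputs=policy)

The model outputs a probability distribution over the discretized actions, from which future actions can be sampled, matching the softmax output described above.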

II-B Motion with Environmental Dynamics

We consider the robot's motion on a two-dimensional ocean surface, and define the robot's state as $\mathbf{x} = (x, y, \theta)^{T}$, which includes the vehicle's position $(x, y)$ and orientation $\theta$ in the world frame, respectively. Since the behavior of the vehicle in the 2D environment is similar to that of a ground mobile robot, we opt to use a Dubins car model to simulate its motion. Similar settings can also be found in [12, 13, 4, 21, 6]. The dynamics can be written as:

$\dot{x} = v\cos\theta, \quad \dot{y} = v\sin\theta, \quad \dot{\theta} = \omega$   (1)

where $v$ and $\omega$ are the vehicle's linear and angular speeds, respectively. One challenge here is that the vehicle dynamics is not only nonlinear, but also time-varying due to the non-negligible spatiotemporal ocean disturbance $\mathbf{d}(x, y, t) = (d_{x}, d_{y})^{T}$. Here $\mathbf{d}$ is assumed to be a deterministic function with differentiable components, and no explicit notation for its arguments is used, for simplicity, when the context is clear. Since the ocean environment is usually open with little or no obstacles on or near the surface, we assume that the robot's linear speed is slowly time-varying and can be viewed as a constant during the planning horizon, and take the control input to be the angular speed $u = \omega$. Therefore, the motion dynamics that integrate the external time-varying disturbance become

$\dot{x} = v\cos\theta + d_{x}(x, y, t), \quad \dot{y} = v\sin\theta + d_{y}(x, y, t), \quad \dot{\theta} = u$   (2)

where $v$ is the magnitude of the robot velocity and $u$ is the control variable. The magnitude of the ocean disturbance is assumed to be similar to or smaller than that of the robot velocity, such that the robot can reach the destination through the control policy. Such a nonlinear control problem can be solved using the iterative Linear Quadratic Regulator (iLQR) [10, 22]. However, similar to LQR, iLQR basically assumes that the system dynamics are known, so that the solution to the nonlinear case can be approximated iteratively until convergence. In our case, the motion dynamics of Eq. (2) largely depend on the external disturbance, which is time-varying and also hard to model accurately. In our work, we approximate such complex spatiotemporal dynamics and assume that, in the near-future horizon and around the local area (patch), the external disturbance is known (approximable and predictable). Specific formulations can be found in the Appendix. We also compare our deep learning framework with the baseline iLQR strategy in the experiments.
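As an illustration of how the disturbed Dubins dynamics in Eq. (2) can be simulated, the sketch below performs a simple forward-Euler rollout. The disturbance function, time step, and speed values are placeholders for illustration, not the paper's settings.

import numpy as np

def disturbance(px, py, t):
    # Placeholder spatiotemporal disturbance d(x, y, t); here a simple vortex
    # around the origin. In practice this is interpolated from the current field.
    return np.array([-0.3 * py, 0.3 * px])

def step(state, u, t, v=1.0, dt=0.1):
    # state = (x, y, theta); u is the angular-speed control; v is the (constant)
    # linear speed. Forward-Euler integration of Eq. (2).
    x, y, theta = state
    d = disturbance(x, y, t)
    x_next = x + (v * np.cos(theta) + d[0]) * dt
    y_next = y + (v * np.sin(theta) + d[1]) * dt
    theta_next = theta + u * dt
    return np.array([x_next, y_next, theta_next])

# Example rollout with a constant turn rate.
state, t = np.array([0.0, 0.0, 0.0]), 0.0
for k in range(50):
    state = step(state, u=0.2, t=t)
    t += 0.1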

II-C Loss Function and Reward

We employ the policy gradient framework for the solution. With the stochastic policy $\pi_{\theta}(a|s)$ and the Q-value $Q^{\pi_{\theta}}(s, a)$ for the state-action pair $(s, a)$, where $\theta$ denotes the parameters to be learned, the policy gradient of the loss function $J(\theta)$ can be defined as follows:

$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\big[\nabla_{\theta} \log \pi_{\theta}(a|s)\, Q^{\pi_{\theta}}(s, a)\big]$   (3)

To improve the sampling efficiency and accelerate convergence, we adopt the importance sampling strategy using guided samples [9]. With the objective of reaching the designated goal, our reward mechanism is designed to minimize the cost from start to goal. The main idea is to reinforce, with a large positive value, those correct actions that lead to reaching the goal quickly, and to penalize undesired actions (e.g., those that take a long time or even fail to reach the goal) with small positive or negative values. Formally, we define the reward of each trial/episode as

$R = r_{1} + c\, r_{2}$   (4)

where

(5)
(6)

where $d_{i}$ denotes the distance from the $i$-th step position to the goal, and $d_{\min}$ is the minimum of such distances along the whole path (if the robot fails to reach the goal, $d_{\min}$ is a non-zero value). The term $r_{1}$ in Eq. (5) evaluates the state with respect to the goal state, whereas the term $r_{2}$ in Eq. (6) summarizes an evaluation over the entire path. The coefficient $c$ is an empirical value used to scale between $r_{1}$ and $r_{2}$ so that they contribute about the same to the total reward $R$.
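As a rough illustration of how the policy-gradient loss in Eq. (3) can be evaluated from sampled trajectories, the snippet below computes a REINFORCE-style surrogate loss from per-step log-probabilities and discounted returns. The return computation and variable names are illustrative assumptions; the guided samples and importance weights of [9] used in the paper are omitted here.

import tensorflow as tf

def discounted_returns(rewards, gamma=0.99):
    # Compute the discounted cumulative reward for each step of an episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def policy_gradient_loss(action_probs, actions, returns):
    # action_probs: [T, NUM_ACTIONS] softmax outputs of the network
    # actions:      [T] indices of the actions actually taken
    # returns:      [T] discounted returns (used in place of Q(s, a))
    idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
    log_probs = tf.math.log(tf.gather_nd(action_probs, idx) + 1e-8)
    # Minimizing this loss ascends the policy-gradient objective in Eq. (3).
    return -tf.reduce_mean(log_probs * tf.convert_to_tensor(returns, tf.float32))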

II-D Offline Training and Online Decision-Making

We train the robot by setting different starting and goal positions in the disturbance field, and the experience replay [14, 18] mechanism is employed to avoid over-fitting. Specifically, we define an experience as a 3-tuple consisting of a sequence of three consecutive states (i.e., the current state and the two prior states), an action, and a reward. The idea is to store the experiences obtained in the past into a dataset. Then, during the reinforcement learning update process, a mini-batch of experiences is sampled from the dataset each time for training. The process of training is described in Algorithm 1 and can be summarized in the following four steps; a code sketch of the corresponding training loop follows the algorithm listing.

  1. Following the current (learned) action policy, sample actions and finish a trial path, i.e., an episode.

  2. Upon completion of each episode, obtain the corresponding rewards (a list) according to whether the goal is reached, and assign the rewards to the actions taken on that path.

  3. Add all these experiences into the dataset. If the dataset exceeds its maximum capacity, erase the oldest experiences until the capacity constraint is satisfied.

  4. Sample a mini-batch of experiences from the dataset. This batch should include the most recent path. Then shuffle this batch of data and feed it into the neural network for training. If the current round number is less than the maximum number of training rounds, go back to step 1.

  
  while the current round number is less than the maximum number of training rounds do
     Sample actions following the current policy and generate an episode (path).
     Obtain the reward of this episode according to Eq. (4).
     for all steps of the episode do
        Form an experience from the three most recent states, the action taken, and its assigned reward.
        Append the experience to the current batch.
     end for
     Pad the batch up to the batch size with experiences sampled from the dataset.
     Store the new experiences into the dataset (erase the oldest ones if the capacity is exceeded).
     Shuffle the batch.
     Feed the batch into the neural network.
     Perform back-propagation to update the network weights.
  end while
Algorithm 1 Training
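Below is a minimal sketch of the replay-based training loop of Algorithm 1, assuming the two-input model sketched in Section II-A and a simulate_episode helper (assumed, not from the paper) that rolls out the current policy and returns per-step (env_input, vehicle_input, action, reward) experiences. The dataset capacity of 10,000 and batch size of 500 follow Section III-A; the optimizer choice is an assumption.

import random
import numpy as np
import tensorflow as tf
from collections import deque

dataset = deque(maxlen=10000)          # experience replay dataset (capacity from Sec. III-A)
optimizer = tf.keras.optimizers.Adam() # learning rate not reproduced here
BATCH_SIZE = 500

def train_round(model, simulate_episode):
    # Step 1: roll out the current policy for one episode.
    episode = simulate_episode(model)

    # Steps 2-3: store the new experiences; deque(maxlen=...) evicts the oldest ones.
    dataset.extend(episode)

    # Step 4: build a batch that includes the most recent path, padded from the dataset.
    batch = list(episode)
    pad = max(0, BATCH_SIZE - len(batch))
    batch += random.sample(list(dataset), min(pad, len(dataset)))
    random.shuffle(batch)

    env_x, veh_x, acts, rews = map(np.array, zip(*batch))
    with tf.GradientTape() as tape:
        probs = model([env_x, veh_x], training=True)
        # Policy-gradient surrogate loss (cf. the loss sketch in Sec. II-C).
        idx = tf.stack([tf.range(len(acts)), tf.constant(acts, tf.int32)], axis=1)
        log_p = tf.math.log(tf.gather_nd(probs, idx) + 1e-8)
        loss = -tf.reduce_mean(log_p * tf.constant(rews, tf.float32))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)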

With the offline trained results, the decision-making is straightforward: only one forward propagation of the network with small computational effort is needed. This also allows us to handle continuous motion and unknown states.

III Results

We validated the method in the scenario of marine robot goal-driven decision-making, where the ocean disturbances vary both spatially and temporally. An underwater glider simulator written in C++ was built in order to test the proposed approach. We assume the glider glides near the surface and that the ocean currents do not vary within a small depth below the surface. The underwater glider is modeled with the simplified dynamics described in Section II-B. Thus, the simulation environment was constructed as a two-dimensional ocean surface, and the spatiotemporal ocean currents are external disturbances for the robot, represented as a vector field in which each vector is the water flow velocity captured at a specific moment in a specific location. Specifically, each disturbance vector at location $(x, y)$ and time $t$ is denoted as $\mathbf{d}(x, y, t) = (d_{x}, d_{y})^{T}$, where $d_{x}$ denotes the easting velocity component (along the latitude axis) and $d_{y}$ denotes the northing component (along the longitude axis). The simulator is able to read and process ocean current data from the Regional Ocean Model System (ROMS) [19], where the ocean current data is labelled with latitude (corresponding to $y$), longitude (corresponding to $x$), the current easting and northing components (corresponding to $d_{x}$ and $d_{y}$), as well as the time stamp. The collected historical ocean data is used to train our agent.
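For concreteness, the snippet below shows one way to assemble the three-channel network input of Section II-A from a cropped current field. The array names, the grid indexing, and the assumption that the cropped easting/northing components are already available as NumPy arrays are illustrative; reading the actual ROMS files is not shown, and the exact goal encoding is not specified in the text.

import numpy as np

def build_env_channels(d_x, d_y, goal_rc, robot_rc, obstacle_mask=None):
    # d_x, d_y: HxW easting/northing disturbance components for one time step.
    # goal_rc, robot_rc: (row, col) grid cells of the goal and the robot.
    # Third channel: -1 for obstacles, 0 for free cells, +1 for the robot,
    # with the goal cell also marked (illustrative encoding).
    occ = np.zeros_like(d_x, dtype=np.float32)
    if obstacle_mask is not None:
        occ[obstacle_mask] = -1.0
    occ[goal_rc] = 1.0
    occ[robot_rc] = 1.0
    return np.stack([d_x.astype(np.float32), d_y.astype(np.float32), occ], axis=-1)

def build_env_input(fields, goal_rc, robot_rc):
    # fields: list of the three most recent (d_x, d_y) pairs, oldest first,
    # matching the 3 unfolded time steps of the recurrent layers.
    return np.stack([build_env_channels(dx, dy, goal_rc, robot_rc)
                     for dx, dy in fields], axis=0)   # shape (3, H, W, 3)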

Fig. 4: Demonstration of a learned path under artificially generated disturbances. The center of this vortex-like vector field is translating back and forth along the diagonal direction of the simulation environment. Color represents strength of the disturbance.
(a) Input
(b) Mix-input
(c) Recurrent Layer 3
Fig. 5: Illustration of disturbance features captured by hidden layers

III-A Network Training

We use TensorFlow [1] to build and train the network described in Fig. 2. In our experiments, the input vector field map is , and the size of the dataset for experience replay is set to 10,000. The learning rate is , the coefficient $c$ of Eq. (4) is set to , and the batch size used for each iteration is 500. In addition, we cap the length of each episode at 300 steps. Fig. 4 demonstrates a path learned by the network under a vortex-like disturbance field, and Fig. 5 shows the features extracted from internal layers of the network at some point on that path. Fig. 5(a) illustrates the features of the disturbance vector field. Specifically, the first two channels of Fig. 5(a) are the $x$ and $y$ components of the vector field, and the grey-scale color represents the strength of the disturbance. The third channel of Fig. 5(a) is a pixel map that contains the goal point (white dot) and the obstacle information (black borders); the other grey cells denote free space. Fig. 5(b) shows a mixed view of the features, with the three channels colored in red, green, and blue, respectively. The picture depicts a local centripetal pattern whose center is located near the upper left corner. Fig. 5(c) shows the outputs of recurrent layer 3 (the last output, which is used by the FC layer), from which we can observe that the hidden layers extract local features such as the edges of the field and the directions of the currents. The feature maps that appear dark correspond to other patterns.
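The intermediate feature maps in Fig. 5 can be inspected by reading out hidden-layer activations. A small sketch, assuming the model, layer names, and the env_input/vehicle_state arrays from the sketches given earlier:

import tensorflow as tf

# Expose the last recurrent layer's output ("recurrent3" in the earlier sketch)
# and run it on one (env_input, vehicle_state) sample.
feature_model = tf.keras.Model(inputs=model.inputs,
                               outputs=model.get_layer("recurrent3").output)
feature_maps = feature_model.predict([env_input[None], vehicle_state[None]])
print(feature_maps.shape)   # (1, H', W', channels): one map per hidden channel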

III-B Evaluations

We implemented two methods: one belongs to the control paradigm, where we use the basic iLQR to compute the control inputs; the other is the deep reinforcement learning (DRL) framework that employs the guided policy mechanism, where the policy is guided by (and combined with) the iLQR solving process [9].

III-B1 Artificial Disturbances

We first investigate the method using artificially generated disturbances. We tested different vector fields including vortex, meandering, uniformly spinning (waving), and centripetal patterns. For different trials, we specify different start and goal locations for the robot, and the goal-reaching rate is calculated as the number of successes divided by the total number of simulations. The results in Table I show that, within the given time limits, both the iLQR and DRL methods lead to a good success rate; in particular, the DRL performs better in complex environments such as the vortex field. In contrast, the iLQR framework has a slightly better performance in relatively mild environments where the disturbance has slowly changing dynamics, such as the meandering disturbance field. Next, we examine the average time costs, as shown in Table II. The results reveal that the trials using iLQR tend to consume less time than those using the DRL method. This can be due to the "idealized" artificial disturbances with known, simple, and accurate patterns, which can be precisely handled by the traditional control methodology.

Disturbance pattern   Method   Num of trials   Num of success   Success rate
Vortex                DRL      50              48               0.96
Vortex                iLQR     50              46               0.92
Meander               DRL      50              49               0.98
Meander               iLQR     50              50               1.00
Spin                  DRL      50              49               0.98
Spin                  iLQR     50              48               0.96
Centripetal           DRL      50              49               0.98
Centripetal           iLQR     50              48               0.96
TABLE I: Simulation with artificially generated disturbances
Pattern       Method   Num of trials   Average time cost
Vortex        DRL      50              20.549
Vortex        iLQR     50              14.811
Meander       DRL      50              16.926
Meander       iLQR     50              15.367
Spin          DRL      50              17.667
Spin          iLQR     50              17.803
Centripetal   DRL      50              20.220
Centripetal   iLQR     50              14.792
TABLE II: Average time cost under artificial disturbances
Fig. 6: Demonstration of the ocean currents and a path of the robot
(a) Time Cost
(b) Step Cost
Fig. 7: Average cost under ocean disturbances
(a) DRL
Fig. 8: Illustrations of robot’s motion trajectories. (a) Typical paths learned from DRL; (b) Examples of paths under different spatiotemporal disturbance patterns.

III-B2 Ocean Data Disturbances

In this part of the evaluation, we use ocean current data obtained from the California Regional Ocean Modeling System (ROMS) [19]. The ocean data along the coast near Los Angeles is released every 6 hours, and a window of 30 days of data is maintained and retrievable [2]. An example of the ocean current surface can be visualized in Fig. 6, which also demonstrates a robot's path from executing our training result. Because the original ROMS ocean dataset covers a vast area, and in practice it would require several days for the robot to travel through the whole space, during which a lot of variation and uncertainty may occur, we opt to focus on smaller local areas and randomly crop such sub-areas to evaluate our training results. Fig. 8(a) illustrates a synthesized view of multiple trials employing the DRL training result in a selected area of ocean data, and Fig. 8(b) illustrates different paths during training in various disturbance areas. Similar to the evaluation process for the artificial disturbances, we examined the aforementioned performance measures, i.e., the success rate and time cost, under the ocean disturbances. We also examined the total number of steps of each path, where each step represents the robot transiting from one grid cell to a new one (the resolution of the grid map is the same as that of the vector field map). Table III shows the results (the robot speed is not scaled to the real map), from which we can see that the DRL takes less time and fewer steps than the iLQR on average. Fig. 7 provides a better view for comparing the time and step costs, where the statistical variations can be closely examined; in particular, the standard deviation of the DRL results is much smaller than that of the iLQR. This characteristic can also be observed in Fig. 9, where the saw-tooth curves of the iLQR imply dramatic changes across different trials. Our experimental evaluations indicate that the iLQR works well for environments that can be well described and accurately formulated. In contrast, the DRL framework is particularly capable of handling complex and (partially) unstructured spatiotemporal environments that cannot be precisely modelled.

Area     Method   Num of trials   Success rate   Average time cost   Average step cost
Area 1   DRL      15              1.00           12.833              41.30
Area 1   iLQR     15              1.00           16.375              47.27
Area 2   DRL      15              1.00           11.142              39.55
Area 2   iLQR     15              1.00           15.530              45.33
Area 3   DRL      15              1.00           11.383              40.50
Area 3   iLQR     15              1.00           13.186              41.13
TABLE III: Average time cost under ocean disturbances
Fig. 9: Time and step costs under different ocean disturbances in three example areas. Panels (a)-(c) and (d)-(f) correspond to Example Areas 1-3.

IV Conclusions

In this paper we investigate applying the deep reinforcement learning framework to robotic learning and acting in partially structured environments. We use the scenario of marine vehicle decision-making under spatiotemporal disturbances to demonstrate and validate the framework. We show that the deep network characterizes the local features of varying disturbances well. By training the robot under artificial and real ocean disturbances, our simulation results indicate that the robot is able to successfully and efficiently act in complex and partially structured environments.

Appendix

This appendix discusses the solutions to the dynamic model (2). We define the state of the robot as $\mathbf{x} = (x, y, \theta)^{T}$ and write the model dynamics in discrete time as

$\mathbf{x}_{k+1} = \mathbf{x}_{k} + \mathbf{f}(\mathbf{x}_{k}, u_{k}, t_{k})\, \delta t \triangleq \mathbf{F}(\mathbf{x}_{k}, u_{k})$   (7)

where $\delta t$ is the length of the discrete time step and $k = 0, 1, \ldots, N-1$. The initial state $\mathbf{x}_{0}$ of the robot is known. We further assume that the total loss function to be minimized is of the form

$L = \ell_{f}(\mathbf{x}_{N}) + \sum_{k=0}^{N-1} \ell(\mathbf{x}_{k}, u_{k})$   (8)

where $\ell_{f}$ and $\ell$ are in quadratic form, i.e.,

(8a)
(8b)
(8c)

where the quadratic terms involve a constant weighting parameter and the position of the final (goal) state. It is typically difficult to search for solutions when the dynamic model is of a nonlinear form. To make the problem tractable, the iterative LQR (iLQR) algorithm is employed [22]. The essential idea is iterative incremental improvement. Starting from a sequence of states $\{\mathbf{x}_{k}\}$ and control variables $\{u_{k}\}$, we approximate the nonlinear dynamic model (7) by a linear one, and then apply the LQR algorithm (with the nonlinear dynamics (7) in the backward pass) to obtain the next control sequence and state sequence. By repeating this procedure until convergence, one can obtain the final state and control sequences. Now assume the iLQR proceeds to the $n$-th iteration with the state sequence $\{\mathbf{x}_{k}^{n}\}$ and controls $\{u_{k}^{n}\}$. The dynamic model can be linearly approximated as

$\delta\mathbf{x}_{k+1} = \mathbf{A}_{k}\, \delta\mathbf{x}_{k} + \mathbf{B}_{k}\, \delta u_{k}$   (9)

where $\delta\mathbf{x}_{k} = \mathbf{x}_{k} - \mathbf{x}_{k}^{n}$, $\delta u_{k} = u_{k} - u_{k}^{n}$, and $\mathbf{A}_{k}$, $\mathbf{B}_{k}$ are the Jacobians of $\mathbf{F}$ with respect to $\mathbf{x}$ and $u$ evaluated at $(\mathbf{x}_{k}^{n}, u_{k}^{n})$.

Note that the total loss function becomes a function of $\{\delta\mathbf{x}_{k}\}$ and $\{\delta u_{k}\}$, where $\{\mathbf{x}_{k}^{n}\}$ and $\{u_{k}^{n}\}$ are viewed as constants. Specifically, the total loss function can be expressed by the second-order expansion as

(10)

where the coefficients are the first- and second-order derivatives of $\ell$ and $\ell_{f}$ evaluated along $(\mathbf{x}_{k}^{n}, u_{k}^{n})$.

Due to the form of the loss functions (8a)-(8c), there are no cross quadratic terms between $\delta\mathbf{x}$ and $\delta u$, and all the above equalities in (10) are exact. If, however, the loss function were nonlinear, the last equation, with the necessary cross terms included, would be treated as a quadratic approximation. We then search for $\{\delta\mathbf{x}_{k}\}$ and $\{\delta u_{k}\}$ that minimize the total loss $L$. The procedure is essentially the standard dynamic programming solution of the LQR problem [11, 3]. During the backward iteration $k = N-1, \ldots, 0$, the feedback gain matrix $\mathbf{K}_{k}$ (together with a feedforward term $\mathbf{k}_{k}$) is computed to obtain the optimal control update

$\delta u_{k}^{*} = \mathbf{k}_{k} + \mathbf{K}_{k}\, \delta\mathbf{x}_{k}$

where $\mathbf{k}_{k}$ and $\mathbf{K}_{k}$ follow from the standard LQR backward recursions. The related matrices in the above equations are obtained by propagating the quadratic value function (its gradient and Hessian, i.e., the value matrices) backward in time through the joint dynamic matrix of the linearized system (9) and the expanded loss (10). After the backward iteration, there is a forward iteration that computes the updated control sequence $u_{k} = u_{k}^{n} + \delta u_{k}^{*}$ and updates the state sequence by the dynamic model (7) starting from the known initial state. This completes the $n$-th iteration of the iLQR framework.
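For reference, below is a compact sketch of one iLQR iteration for the discretized model (7), using finite-difference Jacobians and a quadratic goal-reaching terminal cost with a quadratic control-effort running cost. The specific cost weights, the absence of regularization and line search, and the function names are illustrative assumptions, not the paper's exact settings (cf. [10, 22]).

import numpy as np

def ilqr_iteration(F, xs, us, p_goal, rho=0.1, w_goal=10.0, eps=1e-4):
    # One iLQR iteration for x_{k+1} = F(x_k, u_k), with x = (x, y, theta) and a
    # scalar control u. Linearize around the nominal (xs, us), run the LQR
    # backward pass, then roll the nonlinear dynamics forward.
    # xs: (N+1, 3) nominal states; us: (N,) nominal controls; p_goal: (2,) goal position.
    N, n = len(us), xs.shape[1]

    def jacobians(x, u):
        # Finite-difference Jacobians A = dF/dx (n x n), B = dF/du (n x 1).
        A = np.zeros((n, n)); B = np.zeros((n, 1))
        for i in range(n):
            d = np.zeros(n); d[i] = eps
            A[:, i] = (F(x + d, u) - F(x - d, u)) / (2 * eps)
        B[:, 0] = (F(x, u + eps) - F(x, u - eps)) / (2 * eps)
        return A, B

    # Terminal cost w_goal * ||p_N - p_goal||^2: gradient and Hessian w.r.t. x.
    Vx = np.zeros((n, 1)); Vxx = np.zeros((n, n))
    Vx[:2, 0] = 2.0 * w_goal * (xs[N, :2] - p_goal)
    Vxx[:2, :2] = 2.0 * w_goal * np.eye(2)

    ks, Ks = np.zeros(N), np.zeros((N, n))
    for k in reversed(range(N)):                      # backward pass
        A, B = jacobians(xs[k], us[k])
        Qx  = A.T @ Vx                                # running cost rho*u^2 has no x-term
        Qu  = 2.0 * rho * us[k] + (B.T @ Vx)[0, 0]
        Qxx = A.T @ Vxx @ A
        Quu = 2.0 * rho + (B.T @ Vxx @ B)[0, 0]
        Qux = B.T @ Vxx @ A                           # (1, n)
        ks[k] = -Qu / Quu
        Ks[k] = -Qux[0] / Quu
        Kc = Ks[k].reshape(n, 1)                      # column form for the value updates
        Vx  = Qx + Kc * Quu * ks[k] + Kc * Qu + Qux.T * ks[k]
        Vxx = Qxx + Quu * np.outer(Ks[k], Ks[k]) + Kc @ Qux + Qux.T @ Kc.T

    xs_new = np.zeros_like(xs); xs_new[0] = xs[0]
    us_new = np.zeros_like(us)
    for k in range(N):                                # forward pass with dynamics (7)
        dx = xs_new[k] - xs[k]
        us_new[k] = us[k] + ks[k] + Ks[k] @ dx
        xs_new[k + 1] = F(xs_new[k], us_new[k])
    return xs_new, us_new

Here F would be a discrete version of the dynamics in Section II-B, e.g. F = lambda x, u: step(x, u, t=0.0) with the step function from the earlier sketch, under the local assumption that the disturbance is known over the planning horizon.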

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [2] Y. Chao. Regional ocean model system. http://www.sccoos.org/data/roms-3km/.
  • [3] K. Fragkiadaki. Lecture notes on deep reinforcement learning and control, Spring 2017. URL: https://katefvision.github.io/. Last visited on 02/27/2018.
  • [4] C. Goerzen, Z. Kong, and B. Mettler. A survey of motion planning algorithms from the perspective of autonomous uav guidance. Journal of Intelligent and Robotic Systems, 57(1-4):65, 2010.
  • [5] S. Gu, T. Lillicrap, R. E. Turner, Z. Ghahramani, B. Schölkopf, and S. Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems 30, pages 3849–3858. Curran Associates, Inc., 2017.
  • [6] K. Hvamb. Motion planning algorithms for marine vehicles. Master’s thesis, NTNU, 2015.
  • [7] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308–5317, 2016.
  • [8] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • [9] S. Levine and V. Koltun. Guided policy search. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1–9, 2013.
  • [10] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pages 222–229, 2004.
  • [11] D. Liberzon. Calculus of variations and optimal control theory: a concise introduction. Princeton University Press, 2011.
  • [12] N. Mahmoudian and C. Woolsey. Underwater glider motion control. In Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pages 552–557. IEEE, 2008.
  • [13] A. Manzanilla, M. Garcia, R. Lozano, and S. Salazar. Design and control of an autonomous underwater vehicle (auv-umi). In Marine Robotics and Applications, pages 87–100. Springer, 2018.
  • [14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [15] L.-Y. Oey, T. Ezer, and H.-C. Lee. Loop current, rings and related circulation in the gulf of mexico: A review of numerical models and future challenges. Circulation in the Gulf of Mexico: Observations and models, pages 31–56, 2005.
  • [16] A. Oroojlooyjadid, M. Nazari, L. V. Snyder, and M. Takác. A deep q-network for the beer game with partial information. CoRR, abs/1708.05924, 2017.
  • [17] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 3406–3413. IEEE, 2016.
  • [18] M. Riedmiller. Neural fitted q iteration-first experiences with a data efficient neural reinforcement learning method. In ECML, volume 3720, pages 317–328. Springer, 2005.
  • [19] A. F. Shchepetkin and J. C. McWilliams. The regional oceanic modeling system (ROMS): a split-explicit, free-surface, topography-following-coordinate oceanic model. Ocean Modelling, 9(4):347–404, 2005.
  • [20] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, Oct. 2017.
  • [21] R. N. Smith, M. Schwager, S. L. Smith, B. H. Jones, D. Rus, and G. S. Sukhatme. Persistent ocean monitoring with underwater gliders: Adapting sampling resolution. Journal of Field Robotics, 28(5):714 – 741, 2011.
  • [22] Y. Tassa, N. Mansard, and E. Todorov. Control-limited differential dynamic programming. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 1168–1175. IEEE, 2014.
  • [23] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In European conference on computer vision, pages 140–153. Springer, 2010.
  • [24] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4489–4497. IEEE, 2015.
  • [25] R. B. Wynn, V. A. Huvenne, T. P. L. Bas, B. J. Murton, D. P. Connelly, B. J. Bett, H. A. Ruhl, K. J. Morris, J. Peakall, D. R. Parsons, E. J. Sumner, S. E. Darby, R. M. Dorrell, and J. E. Hunt. Autonomous underwater vehicles (auvs): Their past, present and future contributions to the advancement of marine geoscience. Marine Geology, 352:451 – 468, 2014.