Training an Interactive Humanoid Robot Using Multimodal Deep Reinforcement Learning

11/26/2016 ∙ by Heriberto Cuayáhuitl, et al. ∙ University of Lincoln

Training robots to perceive, act and communicate using multiple modalities still represents a challenging problem, particularly if robots are expected to learn efficiently from small sets of example interactions. We describe a learning approach as a step in this direction, where we teach a humanoid robot how to play the game of noughts and crosses. Given that multiple multimodal skills can be trained for playing this game, we focus our attention on training the robot to perceive the game and to interact in it. Our multimodal deep reinforcement learning agent perceives multimodal features and exhibits verbal and non-verbal actions while playing. Experimental results using simulations show that the robot can learn to win or draw up to 98% of the games. A pilot test of the proposed multimodal system for the targeted game---integrating speech, vision and gestures---reports that reasonable and fluent interactions can be achieved using the proposed approach.


1 Introduction

Interactive humanoid robots that perceive, act, communicate and learn simultaneously are not only interesting for demonstrating robot capabilities, but they have the potential of being used to study embodied human intelligence. This paper describes a small step in these directions, where we equip a humanoid robot (Baxter, http://www.rethinkrobotics.com/baxter/) with multiple input and output modalities in order to play the game of noughts and crosses, also known as ‘tic-tac-toe’—see Figure 1. These modalities allow the robot to listen to human commands, see human gazing and human drawings in the targeted game, gaze at the human player in focus, talk to human players, draw noughts/crosses, and learn from examples—all of them asynchronously. The latter capability (learning) requires very efficient forms of training to approach human-like behaviour. While previous works have studied robot learning from real interactions with the environment Levine2016 ; Kobber2013 ; Zhang2015 (mostly without verbal abilities), training an autonomous agent to learn even simple behaviours can take large amounts of experience mnih-dqn-2015 . Other previous works have addressed multimodal deep learning, but in non-conversational settings WermterWEPEP04 ; NgiamKKNLN11 ; SrivastavaS14 .

While those previous works learn from raw pixels, our learning approach is semi-decoupled into two tasks: (1) learning to perceive and (2) learning to interact. It is semi-decoupled because the task of learning to interact uses multimodal perception, which requires learning from multimodal features rather than unimodal ones. Because these two tasks represent high-dimensional systems, the former uses deep supervised learning, and the latter uses deep reinforcement learning. The advantage of using this two-stage approach is that multimodal learning can be achieved more efficiently. Let us assume that our robot has to be trained to play using grids of different sizes. While the raw pixel approach would have to re-learn its behaviour for new grid sizes, our approach only re-learns to perceive and reuses its learnt behaviour to interact. Similarly, assuming that our robot has to be trained to interact using increasing amounts of verbal features (e.g. words), our approach would have to re-learn its verbal behaviour but not its vision-based perception. These two examples illustrate the benefits of our proposed approach, which can be seen as an indirect form of transfer learning. Due to the complexity of behaving in unseen environments, in this paper we use grids of one size to illustrate our multimodal deep reinforcement learning approach—see example interaction in Figure 1.

   Rob: Hello!
   Rob: I am Baxter.
   Rob: Would you like to play a game with me?
   Usr: Yes, let’s go for it.
   Rob: Nice. Let me start.
   Rob: I take this one ===
   Rob: Your turn.
   Usr: I pick this ===
   Rob: I take this one ===
   Rob: Your turn.
   Usr: I do this ===
   Rob: I take this one ===
   Rob: Yes, I won.
   Rob: Good bye!

Figure 1: Example human-humanoid interaction while playing the game of noughts and crosses. The commands in square brackets represent physical actions with handwriting. Example video at https://youtu.be/25jdV8FN4ic

2 Learning Approach for Physical Human-Humanoid Interaction

Our approach uses two independent but related learning tasks. First, learning to perceive in order to predict what is going on in the environment (game moves in our case). Second, learning to interact in order to decide what to do or say next. In this way, humanoid robots can learn to interact from words and game moves (a compact set of multimodal features) rather than from raw speech and pixels. Although our approach implies more efficient learning due to the more compact environment states, the latter state representation (raw multimodal features) remains to be investigated in future work.

Figure 2: (Left) raw input images. (Right) Grayscale images used for game move recognition.
Figure 3: Architecture of the deep supervised learner for game move recognition

2.1 Learning to Perceive with Deep Supervised Learning

We use the camera on the right arm of the robot to perceive symbols in the game grid. Rather than using a single image, we use multiple images (one per location in the grid, 9 in our case) to detect new drawings used to generate game moves. The robot continuously takes images and splits them into 9 images of 40x40 pixels, as shown in Figure 2.

Figure 4: Seed training examples for training the deep supervised learner

Given a data set of the form $D=\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ are matrices of pixel-based features and $y_i$ are class labels, the task is to map images to labels. In our case, the images have 40x40 pixels, and the labels are {‘circle’, ‘cross’, ‘nothing’}. We use a deep supervised classifier to induce a function $f: X \rightarrow Y$ out of a space of functions, where $X$ are images and $Y$ are the labels. The labelling process is defined as $y^{*}=\arg\max_{y \in Y} f(x,y)$, where $f$ is a scoring function using features learnt by a convolutional neural network LeCun2015 . To train this classifier we use a set of seed labelled images (see Figure 4) to generate more images, but with the drawings (circles and crosses) in randomly assigned locations. For example, an image with a cross in the middle can generate more images by shifting the cross to the left or right, and up or down.
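As an illustration of this augmentation step, the following sketch (a minimal example assuming plain NumPy arrays, a white background, and hypothetical shift offsets) generates additional training images by translating a seed drawing within its 40x40 cell:

```python
import numpy as np

def shift_image(image, dx, dy, fill=1.0):
    """Translate a 40x40 grayscale cell by (dx, dy) pixels, padding with background."""
    shifted = np.full_like(image, fill)
    h, w = image.shape
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = image[src_y, src_x]
    return shifted

def augment(seed_image, label, offsets=range(-5, 6, 2)):
    """Generate shifted copies of a seed image, all keeping the same label."""
    return [(shift_image(seed_image, dx, dy), label)
            for dx in offsets for dy in offsets]
```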

Our convolutional neural net used the architecture shown in Figure 3 with the following layers: an input layer of 40x40 pixels, a convolutional layer with 8 filters, ReLU, a pooling layer of size 2x2 with stride 2, a convolutional layer with 16 filters, ReLU, a pooling layer of size 3x3 with stride 3, and an output layer using a Support Vector Machine (SVM) with 3 labels.
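A minimal sketch of this architecture in PyTorch is shown below. The layer sizes follow the description above; the multi-class hinge loss stands in for the SVM output layer, and hyperparameters not stated in the text (kernel sizes, learning rate, optimiser) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GameMoveNet(nn.Module):
    """CNN for classifying 40x40 grid cells as circle, cross, or nothing."""
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, padding=2),   # 8 filters (kernel size assumed)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # 40x40 -> 20x20
            nn.Conv2d(8, 16, kernel_size=5, padding=2),   # 16 filters (kernel size assumed)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3),        # 20x20 -> 6x6
        )
        self.classifier = nn.Linear(16 * 6 * 6, num_classes)

    def forward(self, x):                                 # x: (batch, 1, 40, 40)
        return self.classifier(self.features(x).flatten(1))

model = GameMoveNet()
# Multi-class hinge loss approximates the SVM output layer described above.
criterion = nn.MultiMarginLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```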

This classifier is applied continuously during user turns, every 100 milliseconds, to detect activity (drawings of human players) in each location of the game grid. In addition, and to reduce noise, we accept a new drawing only if it has been recognised at least 3 times in a row. In this way, our perception component can output game moves in the format ===. (The drawn symbol is inferred from who starts the game and with what symbol. For example, if the robot starts the game, we assume that the robot draws circles and the user draws crosses.)
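The sketch below (a simplified illustration, not the authors' code) shows how such per-cell classification with a 3-in-a-row debounce could be implemented, assuming the 9 cells have been cropped from a preprocessed grayscale image and `classify_cell` wraps the trained classifier above:

```python
from collections import deque

class GameMoveDetector:
    """Detects new drawings in the 3x3 grid, requiring 3 consistent recognitions."""
    def __init__(self, classify_cell, history=3):
        self.classify_cell = classify_cell          # maps a 40x40 cell image to a label
        self.recent = [deque(maxlen=history) for _ in range(9)]
        self.confirmed = ["nothing"] * 9            # last confirmed symbol per location

    def update(self, cells):
        """Process the 9 cropped cell images of one camera frame (called every 100 ms)."""
        moves = []
        for loc, cell in enumerate(cells):
            label = self.classify_cell(cell)
            self.recent[loc].append(label)
            stable = (len(self.recent[loc]) == self.recent[loc].maxlen
                      and len(set(self.recent[loc])) == 1)
            if stable and label != "nothing" and label != self.confirmed[loc]:
                self.confirmed[loc] = label
                moves.append((loc, label))          # a newly detected game move
        return moves
```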

2.2 Learning to Interact with Deep Reinforcement Learning

The visual perceptions above, plus speech-based perceptions (words with confidence scores), are given as input to a reinforcement learning agent that induces its behaviour from interaction with the environment, where situations are mapped to actions by maximising a long-term reward signal Sutton_Barto-1998 ; Szepesvari:2010 . An RL agent is typically characterised by: (i) a finite set of states $S$; (ii) a finite set of actions $A$; (iii) a state transition function $T(s,a,s')$ that specifies the next state $s'$ given the current state $s$ and action $a$; (iv) a reward function $R(s,a,s')$ that specifies the reward given to the agent for choosing action $a$ when the environment makes a transition from state $s$ to state $s'$; and (v) a policy $\pi: S \rightarrow A$ that defines a mapping from states to actions. The goal of an RL agent is to find an optimal policy by maximising its cumulative discounted reward, defined as

$$Q^{*}(s,a) = \max_{\pi} \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots \mid s_t=s, a_t=a, \pi \right],$$

where the function $Q^{*}$ represents the maximum sum of rewards discounted by factor $\gamma$ at each time step. While RL agents take exploratory actions with probability $\epsilon$ during training, they select the best action at test time, i.e. $\pi^{*}(s)=\arg\max_{a \in A} Q^{*}(s,a)$.

To induce the $Q$ function above, our agent approximates $Q^{*}$ using a multilayer neural network as in mnih-dqn-2015 . The $Q$ function is parameterised as $Q(s,a;\theta_i)$, where $\theta_i$ are the parameters (weights) of the neural net at iteration $i$. Furthermore, training a deep RL agent requires a dataset of experiences $D=\{e_1,\dots,e_N\}$ (also referred to as an ‘experience replay memory’), where every experience is described as a tuple $e_t=(s_t, a_t, r_t, s_{t+1})$. Inducing the $Q$ function consists in applying Q-learning updates over minibatches of experiences $(s,a,r,s') \sim U(D)$ drawn uniformly at random from the full dataset $D$. A Q-learning update at iteration $i$ is thus defined according to the loss function

$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[ \left( r + \gamma \max_{a'} Q(s',a';\hat{\theta}_i) - Q(s,a;\theta_i) \right)^{2} \right],$$

where $\theta_i$ are the parameters of the neural net at iteration $i$, and $\hat{\theta}_i$ are the target parameters of the neural net at iteration $i$. The latter are only updated every $C$ steps. This process is implemented in the learning algorithm Deep Q-Learning with Experience Replay described in mnih-atari-2013 .
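A compact sketch of this update in PyTorch is given below (a generic DQN step under the definitions above, not the authors' implementation; the network `q_net`, its copy `target_net`, and the replay memory are assumed to exist):

```python
import random
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, replay_memory,
               batch_size=32, gamma=0.7):
    """One Q-learning update over a minibatch drawn uniformly from the replay memory."""
    batch = random.sample(replay_memory, batch_size)
    s, a, r, s_next, done = map(torch.stack, zip(*batch))

    # Q(s, a; theta_i) for the actions actually taken.
    q_values = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

    # Target: r + gamma * max_a' Q(s', a'; theta_hat), with no bootstrap at terminal states.
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * q_next * (1.0 - done.float())

    loss = F.mse_loss(q_values, target)   # squared TD error, as in the loss above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# The target network is refreshed every C steps:
# target_net.load_state_dict(q_net.state_dict())
```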

Figure 5: (Left) Physical game-based environment, (Right) deep reinforcement learning agent for human-humanoid interaction—see text for details

State Space

The state space of our learning agent includes 57 features that describe the game moves and the words observed in the last system and user turns. While words derived from system responses are treated as binary variables (i.e. word present or absent), the words derived from noisy user responses can be seen as continuous variables by taking confidence scores into account. Since we use a single variable per word, user features override system ones in case of overlaps. In contrast to word-based features, game move features do take into account the context or history of each game.
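The sketch below illustrates how such a state vector could be assembled (an illustrative reconstruction; the exact vocabulary, feature ordering, and symbol encoding are not specified in the text and are assumptions here):

```python
import numpy as np

def build_state(vocabulary, grid_size, system_words, user_words, game_moves):
    """Build a fixed-length state vector from word features and game-move features.

    system_words: words in the last system turn (binary evidence)
    user_words:   {word: asr_confidence} from the last user turn (overrides system words)
    game_moves:   {grid_location: symbol} accumulated over the whole game
    """
    word_features = np.zeros(len(vocabulary))
    for i, word in enumerate(vocabulary):
        if word in system_words:
            word_features[i] = 1.0
        if word in user_words:                      # user evidence overrides system evidence
            word_features[i] = user_words[word]     # confidence score in [0, 1]

    # One feature per grid cell: 0 = empty, 0.5 = circle, 1 = cross (encoding assumed).
    move_features = np.zeros(grid_size)
    for loc, symbol in game_moves.items():
        move_features[loc] = 0.5 if symbol == "circle" else 1.0

    return np.concatenate([word_features, move_features])
```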

Action Space

The action space includes 18 dialogue acts in the domain of noughts & crosses (actions: GameMove(gridloc=$loc) x 9, Provide(feedback=draw), Provide(feedback=loose), Provide(feedback=win), Provide(name), Reply(playGame=yes), Request(playGame), Request(userGameMove), Salutation(closing), Salutation(greeting)). Rather than learning using all actions in every state, the action set was restricted to the most likely actions in each state, with probabilities derived from a Naive Bayes classifier trained on the example dialogues. In addition, if a physical action was included in the action set, we included all valid physical actions to allow the agent to explore different game moves.
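As an illustration, the restricted action set could be computed roughly as follows (a sketch assuming a scikit-learn-style Naive Bayes model and a hypothetical top-k selection criterion, which the text does not specify):

```python
import numpy as np

def candidate_actions(nb_model, state, valid_game_moves, top_k=5):
    """Restrict the action set to the most likely actions in this state.

    nb_model:         Naive Bayes classifier trained on example dialogues
                      (predict_proba over the 18 dialogue acts)
    valid_game_moves: physical actions whose grid location is still empty
    """
    probs = nb_model.predict_proba(state.reshape(1, -1))[0]
    ranked = np.argsort(probs)[::-1][:top_k]
    actions = {nb_model.classes_[i] for i in ranked}

    # If any physical action is among the candidates, allow all valid game moves
    # so the agent can explore different moves.
    if any(a.startswith("GameMove") for a in actions):
        actions.update(valid_game_moves)
    return sorted(actions)
```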

State Transition Function

This function is based on a numerical vector representing the last system and user word-based responses, and the game history. The latter means that we kept the game move features so as to describe the state of the whole game rather than resetting them at every turn. The system word features are straightforward: 0 if absent and 1 if present. The user word features correspond to the confidence level [0..1] of noisy user responses.

Reward Function

It is motivated by the fact that dialogues should be human-like and game-based. It is defined as $R(s,a)=(BR \times w) + P(a|s) - DL$, where $BR$ is a bonus reward with the following values: 5 if the agent is about to win, 1 if it is about to draw, and 0 otherwise; $w$ is a weight over the bonus reward ($BR$), we used $w=0.5$; $P(a|s)$ is a data-like probability of having observed action $a$ in state $s$ in the seed example dialogues; and $DL$ is a small per-turn penalty used to encourage efficient interactions, we used $DL=0.1$. The $P(a|s)$ scores are derived from the same statistical classifier above, which allows us to do statistical inference over actions given states. In addition, this function provided the following rewards at the end of the interaction (dialogue): 0 for losing the game, and 5 otherwise.
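A direct transcription of this reward into code might look as follows (a sketch; `about_to_win`, `about_to_draw`, and the action-probability lookup are assumed helper functions):

```python
def reward(state, action, prob_action_given_state,
           about_to_win, about_to_draw, w=0.5, dl=0.1):
    """Turn-level reward: weighted game bonus + data-like action probability - length penalty."""
    if about_to_win(state, action):
        bonus = 5.0
    elif about_to_draw(state, action):
        bonus = 1.0
    else:
        bonus = 0.0
    return (bonus * w) + prob_action_given_state(action, state) - dl

def final_reward(game_lost):
    """End-of-dialogue reward: 0 for losing the game, 5 otherwise."""
    return 0.0 if game_lost else 5.0
```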

Model Architecture

It consists of a fully-connected multilayer neural network with 57 nodes in the input layer, 60 nodes in each of the first and second hidden layers, and 18 nodes (the action set) in the output layer. The hidden layers use ReLU (Rectified Linear Unit) activation functions, see NairH10 for details. Other learning parameters include the following: experience replay size=100K, discount factor=0.7, minimum epsilon=0.01, learning rate=0.001, and batch size=32.
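A sketch of this Q-network and its hyperparameters in PyTorch is shown below; it plugs into the update sketch from Section 2.2. The optimiser type is not stated in the paper and is an assumption.

```python
import copy
import torch.nn as nn
import torch.optim as optim

# 57 multimodal state features -> two hidden layers of 60 ReLU units -> 18 action values.
q_net = nn.Sequential(
    nn.Linear(57, 60), nn.ReLU(),
    nn.Linear(60, 60), nn.ReLU(),
    nn.Linear(60, 18),
)
target_net = copy.deepcopy(q_net)   # refreshed from q_net every C steps

optimizer = optim.Adam(q_net.parameters(), lr=0.001)   # optimiser choice is an assumption

REPLAY_SIZE = 100_000   # experience replay memory size
GAMMA = 0.7             # discount factor
MIN_EPSILON = 0.01      # minimum exploration rate
BATCH_SIZE = 32
```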

  

Figure 6: (Left) Components of the integrated system of the humanoid robot playing noughts and crosses; apart from the head movements, which used imitation, all components were orchestrated by a deep reinforcement learning interaction manager. (Right) 3D head tracking is used to observe changes in head orientation based on patterns detected in the signals produced by Fanelli2013 .

3 Experiments and Results

In this section we apply the approach and learning agent above to a humanoid robot that learns to play the game of noughts and crosses.

3.1 Integrated System

Our humanoid robot was equipped with multiple modalities—including speech, touch and vision—to play the targeted game. To do that we used both off-the-shelf components and components built specifically for our ROS-based integrated system ROS . These components run concurrently, via multi-threading, and are explained as follows.

Speech Recognition

This component runs the Google Speech Recogniser on an Android app with a touch-to-speak mechanism. It communicates the speech recognition results (also referred to as ‘N-best lists’) to our ROS-based integrated system via Bluetooth. These speech-based perceptions are used as features in the state space of the deep reinforcement learning agent.

Game Move Recognition

This component runs the vision-based perception subsystem described in Section 2.1, see Figure 2 for an illustration. Briefly, it carries out the following steps: it takes RGB images from a predefined initial location of the right arm (for consistent perceptions), converts them to grayscale, removes the grid, splits each image into 9 images (one per location in the game grid), predicts the label of each cell using the SVM classifier with learnt features (labels: circle, cross, nothing), and generates a game move when a new label is observed at least 3 times in a row (to avoid noise). While this component could be used to recognise all system and user game moves, it was used to recognise user game moves only. These vision-based perceptions are used as features in the state space of the deep reinforcement learning agent.

Head Move Recognition

Head tracking is used to detect changes in the orientation of the human player’s head (e.g. left, right, up, down, and centre). To do that we extract patterns from depth-based sensory data using a Kinect sensor and the algorithm described in Fanelli2013 —see Figure 6 (Right). Using those patterns, we track a basic set of movements (left, right, up, down, centre) with a threshold-based approach. This allowed the robot to know where the user is looking in order to imitate head movements and give the impression that the robot is following the gaze of human players.
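A threshold-based mapping of the estimated head pose to these five categories could look roughly like this (a sketch; the angle thresholds and the yaw/pitch sign convention are assumptions, since the text does not specify them):

```python
def head_direction(yaw_deg, pitch_deg, yaw_thresh=15.0, pitch_thresh=10.0):
    """Map a head-pose estimate (from the 3D head tracker) to a coarse direction."""
    if yaw_deg > yaw_thresh:
        return "left"      # player looking to their left (convention assumed)
    if yaw_deg < -yaw_thresh:
        return "right"
    if pitch_deg > pitch_thresh:
        return "up"
    if pitch_deg < -pitch_thresh:
        return "down"
    return "centre"
```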

Speech Synthesis

The verbalisations (in English), which correspond to translations from high-level actions derived from the interaction manager to words, used a template-based approach and an off-the-shelf speech synthesizer (http://mary.dfki.de/). The spoken verbalisations were synchronised with the face of the robot—a video played on the robot’s head, which moved its eyes and mouth while speaking. Rather than using a static robot face, the video and speech started and ended simultaneously to give the impression of a synchronised talking face.

Arm Movement Generation

This component receives commands from the interaction manager for drawing symbols in the game grid. Given that we assumed a static and fixed-sized game grid, the task of deciding what and where to draw was simplified—though future work should assume dynamic game grids. Arm movements (at predefined speeds) started as soon as a command was received, and the component notified the interaction manager once the command had been executed. In this way, a subsequent verbalisation would have to wait until the drawing was done and the arm was back at its initial position.

Head Movement Generation

This component takes the head movements of human players (from the head move recogniser) as input in order to imitate them. Its outputs correspond to head movements to the left, right, up, down, and centre. For example, if a human player turns their head to the left and then looks at the robot, the robot does the same but from its own spatial perspective (e.g. human turns left = robot turns right). This gives the impression that the robot is actually paying attention to the human player, and also gives more liveliness to the robot.
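The mirroring rule described above can be written as a trivial lookup (a sketch of the mapping only):

```python
# Mirror the player's head direction into the robot's own frame of reference.
MIRRORED_DIRECTION = {
    "left": "right",
    "right": "left",
    "up": "up",
    "down": "down",
    "centre": "centre",
}

def robot_head_target(player_direction):
    return MIRRORED_DIRECTION[player_direction]
```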

  

Figure 7: Learning curves of the deep supervised/reinforcement learners for multimodal interaction: (left) classification accuracy, (middle) average reward, and (right) win/draw rate.

Interaction Manager

The interaction manager, based on the publicly available SimpleDS tool (https://github.com/cuayahuitl/SimpleDS) Cuayahuitl16 , orchestrates the components above by continuously receiving speech-based and vision-based perceptions from the environment, and deciding what to do next and when. Regarding what to do next, it chooses actions based on the learning agent described in Section 2.2. While half of such actions are verbal-only actions, the other half are multimodal actions. For example, a multimodal action corresponds to “I take this one ===”, where the square brackets represent a physical action (drawing a circle or cross at the given location). The policy evaluated below assumed that the robot starts playing with a default symbol (circle). Training policies with a larger repertoire of multimodal actions is an interesting future research direction.

3.2 Experimental Results

While the integrated system is able to produce reasonable and fluent interactions with human players (as can be observed in this video: https://youtu.be/25jdV8FN4ic), we focus our evaluation on learning to perceive and learning to interact from simulated interactions. Nonetheless, the integrated system has been used in preliminary but successful demonstrations. We tried the system with seven independent human users, and all reported successful interactions, which they enjoyed. A comprehensive evaluation with multiple human players is left as future work.

Deep Supervised Learner

The task of this learner is to classify 40x40 grayscale images into three labels (circle, cross, nothing) using the SVM classifier with learnt features described in Section 2.1. While the classifier used a set of seed images (as shown in Figure 4), it additionally used generated images obtained by positioning the drawings at different locations of each image. Figure 7 (left) shows a learning curve of classification accuracy given an increasing amount of training examples, where training on 10K images required only 6 minutes, as illustrated in Figure 7 (left, green dotted line), using a contemporary desktop computer (Core i7 at 3.4 GHz). During testing, and based on 1000 images randomly generated from the seed examples, the classifier produced a classification accuracy of 99.9%. This is an indication that our vision-based perception component was accurate enough to classify human handwriting for the targeted game. Pilot tests with human subjects report that this component can detect handwriting from different human players in a way that allows them to play the game of noughts and crosses.

Deep Reinforcement Learner

The task of this learner is to select high-level multimodal actions (18 in total) based on speech-based and vision-based perceptions, using the deep reinforcement learning agent described in Section 2.2. Rather than using raw pixels, this learner takes words and game moves as features. In addition, rather than training from real human-humanoid interactions, we used user simulations that provide semi-randomly generated user responses. The system actions, system responses and user responses are derived from example dialogues provided to the interaction manager. We used a set of 10 seed example dialogues such as the one shown in Figure 1. The user simulator is semi-random because responses already chosen were not allowed; for example, drawing a symbol in the middle of the grid was only allowed if that location was empty. The goal of the agent was to induce its behaviour based on human-like behaviour (similar to the example dialogues), and to win as often as possible.

Figure 7 (middle) shows a learning curve of average reward according to an increasing amount of experiences, where one action is selected in each experience. A reasonable policy was found after 3 hours of training using the same desktop computer as the previous learner. In addition, Figure 7 (right) shows a learning curve of win/draw rate in relation to the number of games played. These learning curves show that the proposed agent achieved successful learning from multimodal input features and multimodal actions—with no other information about the game apart from the given rewards.

4 Related Work and Limitations

The recent developments in machine learning, specifically in the area of deep learning, are enabling the development of more ambitious intelligent interactive systems. For example, previous interactive systems would require a substantial amount of effort in feature selection. Now, deep supervised learners can be trained from learnt features LeCun2015 , and deep reinforcement learners can be trained to induce their features and policy jointly mnih-dqn-2015 ; MnihBMGLHSK16 . Despite these advances, applying deep learning to interactive robots is far from trivial. For example, the robot described in Zhang2015 was trained to carry out target reaching from pixels, which was successful in simulation but failed when tested in the real environment. The behaviours of other robots have been induced with reasonable success—though they usually do not target multiple modalities as in this paper. In addition, previous works teaching agents to play noughts and crosses assume perfectly drawn grids and symbols without any multimodal inputs and outputs Boyan92modularneural ; Siegel2001 ; SteegDW15 . Our robot assumes perception from imperfect human drawings, speech-based responses, and gestures. Training robots to perceive, act and communicate using multiple modalities is important for bringing them to end users Mavridis2015ras ; HC2015aisb . Our humanoid robot runs, to our knowledge, the first system that learns to perceive, act and communicate using deep (reinforcement) learning. This system can be considered data efficient because it used only 108 seed images and 10 seed dialogues to bootstrap the learning environment and agent.

Although our robot system is reasonably advanced, it has a number of limitations that suggest interesting research avenues. First, our robot assumes a fixed-size grid, and grids of different sizes and shapes (as drawn by humans) remain to be explored. This would represent a major upgrade because this direction addresses joint perception and gestures (arm movements): the former requires more robust perception due to unknown drawings, and the latter requires the robot to draw symbols in non-predefined locations and of different sizes. Second, our robot uses push-to-talk speech-based interaction. It would be interesting to compare this approach with a microphone array so human players can have more natural interactions. Third, our robot used a small vocabulary (about 40 different words), and larger vocabularies remain to be explored as well, especially if we would like a chatty robot rather than a robot that plays the game using the same verbalisations over and over again. Fourth, and in a similar vein, our robot uses predefined templates for language generation, and including a situated language generator (e.g. DethlefsC15 ) in the training process would contribute towards more natural interactions. Fifth, the robot learns its behaviour offline and uses its best behaviour while playing with human users (without further learning). It would be interesting to incorporate other forms of learning to train or retrain the robot as it collects data from real interactions Argall2009 ; Kobber2013 . Unsupervised, semi-supervised and active learning could be useful here to learn from unlabelled examples, or from labelled ones in a more efficient way than standard supervised learning. Sixth, it would be interesting to integrate more complex versions of noughts and crosses and other games using deep learning CuayahuitlKL15 , which represents a niche for investigating more efficient learning: the more complex the human-humanoid interactions, the more features and actions, which will be reflected in slow or infeasible learning. Last but not least, evaluations with end users remain to be carried out, especially with unknown human players in public spaces.

5 Concluding Remarks

Robot systems that interact with their environment by perceiving, acting, communicating, and learning often face the challenge of how to bring these different abilities together. We describe a general approach for training a robot to interact with the world using multiple modalities. Rather than training the robot directly from raw pixels only, the proposed approach splits the overall learning task into two stages: learning to perceive and learning to interact. We tested our approach by training the Baxter humanoid robot to play the game of noughts and crosses. Our experimental results using simulations report that learning to perceive achieved 99.9% classification accuracy, and learning to interact achieved a win/draw rate of 98%. A pilot test reported reasonable interactions using the proposed approach in a multimodal integrated system. In addition, our system proved to be data-efficient in terms of the amount of data used to induce the simulated environment (108 seed images and 10 seed dialogues), which is relevant for robot systems trainable from example demonstrations with end users—as pointed out by Vollmer2016 . To our knowledge, this is the first deep (reinforcement) learning system that learns to perceive, act and communicate. Although functional for demonstration purposes, it requires a number of improvements before it can be released in the wild. The previous section outlines exciting future directions in interactive intelligent humanoids.

References

  • [1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robot. Auton. Syst., 57(5):469–483, May 2009.
  • [2] J. A. Boyan. Modular neural networks for learning context-dependent game strategies. Master’s thesis, 1992.
  • [3] H. Cuayáhuitl. Robot learning from verbal interaction: A brief survey. In 4th International Symposium on New Frontiers in HRI, 2015.
  • [4] H. Cuayáhuitl. SimpleDS: A simple deep reinforcement learning dialogue system. CoRR, abs/1601.04574, 2016.
  • [5] H. Cuayáhuitl, S. Keizer, and O. Lemon. Strategic dialogue management via deep reinforcement learning. CoRR, abs/1511.08099, 2015.
  • [6] N. Dethlefs and H. Cuayáhuitl. Hierarchical reinforcement learning for situated natural language generation. Natural Language Engineering, 21(3):391–435, 2015.
  • [7] G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. Gool. Random forests for real time 3d face analysis. Int. J. Comput. Vision, 101(3):437–458, Feb. 2013.
  • [8] J. Kober, J. A. D. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. Intl Journal of Robotics Research, 2013.
  • [9] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 05 2015.
  • [10] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(1):1334–1373, Jan. 2016.
  • [11] N. Mavridis. A review of verbal and non-verbal human–robot interactive communication. Robot. Auton. Sys., 63, Part 1:22 – 35, 2015.
  • [12] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783, 2016.
  • [13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop. 2013.
  • [14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015.
  • [15] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
  • [16] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In International Conference on Machine Learning ICML, 2011.
  • [17] M. Quigley, K. Conley, B. P. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng. ROS: an open-source robot operating system. In ICRA Workshop on Open Source Software, 2009.
  • [18] S. Siegel. Training an artificial neural network to play tic-tac-toe. Technical report, University of Wisconsin-Madison, 2001.
  • [19] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep boltzmann machines. Journal of Machine Learning Research, 15(1), 2014.
  • [20] M. V. D. Steeg, M. M. Drugan, and M. Wiering. Temporal difference learning for the game tic-tac-toe 3d: Applying structure to neural networks. In IEEE Symp. S. on Comp. Intelligence SSCI, 2015.
  • [21] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • [22] C. Szepesvári. Algorithms for Reinforcement Learning. Morgan and Claypool Publishers, 2010.
  • [23] A.-L. Vollmer, B. Wrede, K. J. Rohlfing, and P.-Y. Oudeyer. Pragmatic frames for teaching and learning in human–robot interaction: Review and challenges. Frontiers in Neurorobotics, 10:10, 2016.
  • [24] S. Wermter, C. Weber, M. Elshaw, C. Panchev, H. R. Erwin, and F. Pulvermüller. Towards multimodal neural robot learning. Robotics and Autonomous Systems, 47(2-3), 2004.
  • [25] F. Zhang, J. Leitner, M. Milford, B. Upcroft, and P. I. Corke. Towards vision-based deep reinforcement learning for robotic motion control. CoRR, abs/1511.03791, 2015.