In many robotic tasks, the manipulator is required to contact and follow a surface stably. For example, when robots perform tasks such as buffing, grinding, polishing and painting, the manipulator end-effector must maintain good contact with the surface of the object and follow that surface precisely. This is known as the surface following or surface tracking problem. The same capability is important when robots explore environments that are unknown or inaccurately modelled. Similar problems include contour following and surface exploration. The former focuses on exploring the shape of objects, while the latter aims to understand the physical properties of unknown objects through exploration, such as surface roughness, object shape and compliance. In this paper, we focus on the surface following problem, although the methodologies developed can be extended to the other two problems.
To achieve surface following, a robot manipulator needs to maintain constant and uniform contact with the object surface. This requires the robot to sense and recognize its contact state with the surface and to adjust the manipulator in real time according to surface variations. The surface following task therefore involves robot sensing, learning and control, which makes it a complicated problem. A common strategy is to learn the desired trajectory and then control the robot motion according to it [1, 3]. The obvious drawback is that the trajectory has to be relearned for every object with a new surface. Camera vision has also been used to sense the object surface, so that visual servoing can be applied to surface following. However, vision cannot provide detailed information about physical properties such as surface roughness and compliance, which can greatly affect the contact between the manipulator and the object surface. To recover this information, haptic feedback can be used to perceive these physical properties and achieve more consistent surface following.
Existing strategies for surface following with haptic feedback can be summarized in two steps: handcrafted features, such as the surface normal, surface tangent and contact force, are first extracted from the haptic data; the extracted features are then used to control the end-effector movements. Because errors accumulate across the two steps, this strategy places high demands on the accuracy of the sensors, the robot hardware and the control algorithms in order to achieve good performance. To overcome this challenge, we propose a novel end-to-end learning strategy for surface following that directly maps raw tactile data to end-effector control policies using deep reinforcement learning. We aim to keep the contact area between the robot end-effector and the object surface within a fixed range while following the surface, in a way that adapts to various object surfaces. A GelSight tactile sensor provides the tactile feedback that guides the motion of the manipulator end-effector along different object surfaces. As shown in Fig. 1, the robot performs surface following using the learned policies with tactile feedback from the GelSight sensor mounted on a robot manipulator.
The rest of the paper is organized as follows: Section II reviews related work; Section III introduces the proposed end-to-end surface following algorithm; Section IV details the experiment setup; Section V presents the experimental results and analysis; finally, Section VI concludes the paper and points out future directions.
II Related Works
II-A Control Strategy for Surface Following
There are mainly two kinds of control strategy for surface following in the literature. One is based on trajectory control along a modelled surface; the other is based on motion control using feedback obtained while interacting with the environment. In the trajectory control based strategy, the surface of the object is modelled first and the trajectory of the end-point is then planned according to the model. The surface following performance of this strategy depends on the accuracy of the trajectory and on the error between the actual and modelled surfaces [1, 3]. As the object surface needs to be modelled in advance to predict the trajectory, this strategy cannot be used to follow an unknown object. The motion control based strategy focuses on controlling the robot motion to maintain the desired contact status using real-time sensor information obtained by interacting with the object surface. In existing work of this kind, sensors are used to obtain the surface normal and contact force, and the surface tangent is then computed to guide the motion of the robot end-effector during surface following. The limitation of this strategy lies in its high demand on the accuracy of sensing, learning and control, as it depends strongly on the accuracy of the end-effector positions and the surface normal. In addition, using a force index to reflect the desired contact status limits adaptability to different surfaces; for example, the desired force has to be adjusted according to the features of the surface to avoid the stick-slip phenomenon. In our work, we use the motion control based strategy. Unlike existing works, we use contact area rather than contact force to indicate the contact status, and we guide the robot motion by directly mapping the sensor information to robot actions without computing the surface tangent. Furthermore, to the best of the authors’ knowledge, this work is the first to directly map tactile data to robot motion for surface following, which avoids the accumulated error present in other works.
II-B Sensors used for Surface Following
Visual cameras can be used to estimate the position of the target object, which is then used in combined vision/force control for interaction with a stiff uncalibrated environment in surface following tasks. In addition to the surface following task, visual cameras have also been used to measure the planar contour in contour following tasks [10, 5]. As the camera is located away from the object surface, its view can be occluded by the robot manipulator, which limits its application in practice. As a form of contact sensing, force sensors have been used in surface following since the 1990s [11, 12]. In recent years, a 6-axis force/torque sensor, covered with a deformable rubber skin, was integrated onto the fingertip of a robotic finger to perform surface following of a computer mouse. An improved intrinsic contact sensing algorithm was proposed, which can provide an accurate estimate of the contact location on the deformable finger skin even at high friction forces. However, the control quality can deteriorate due to inaccurate estimation of the surface normal under high friction and due to errors in the finger position estimate caused by accumulated sensing errors of the finger joint angles. In addition, a force sensor can only provide information about a single contact point at a time, resulting in limited sensing ability. Compared to force sensors, tactile sensors can provide distributed multi-point tactile information about the contact [13, 14, 15], localise the contact, and predict the shape and pose of the object in hand. In one study, an optical-fiber-based tactile array sensor was developed and a simple contour tracing task was performed, but no learning was involved. In another, two dynamical matrix analog pressure sensors were mounted on a robot gripper to provide tactile feedback for a task of gently scraping a surface with a spatula. The deviation between the actual tactile data and the desired tactile trajectory was used to correct the robot movements. This tactile feedback was added to the system through perceptual coupling, and its parameters were optimized using reinforcement learning. In that work, the sensor had no direct contact with the surface, so detailed contact information could not be acquired. In addition, due to its large size (16 cm × 16 cm) and low spatial resolution (20 mm), the sensor is not suitable for mounting on a robot end-effector for surface following. In this paper, we use a high-resolution GelSight tactile sensor [20, 21] to provide tactile feedback. With the detailed contact information in the tactile images, the motion of the manipulator end-effector can be guided to follow different object surfaces.
As previously mentioned, we achieve end-to-end learning for surface following by directly mapping raw sensor data to robot actions, drawing on the deep reinforcement learning approach developed by Google DeepMind. The mapping is based on a newly proposed policy in the form of an artificial agent, termed the surface following deep Q-network (SFDQN), which combines reinforcement learning with a deep neural network. The deep Q-learning algorithm originally used to train such an agent was applied to virtual game environments, and it is difficult to apply directly to real robot learning. For example, in the early training stage the SFDQN outputs a succession of unreasonable actions, which may damage the sensor and the robot. To overcome this problem, we divide the standard training procedure into two steps: 1) generate the training and testing datasets on the real robot using a designed behavior policy, and 2) train the SFDQN offline on a computer using the obtained datasets. A novel index, the contact rate, is proposed to characterize the contact area, and it is also used in the reward function when training the SFDQN. In this section, we describe the methods in detail, including the definition of the contact rate and the image processing, the elements of reinforcement learning, the designed behavior policy and generated datasets, and the model architecture and training algorithm for the SFDQN.
III-A Contact Rate and Image Processing
We observed through experiments that, in the initial contact stage, the changed region of the GelSight sensor image increases monotonically as the contact between the sensor and the object surface increases. This indicates that keeping the contact area within a certain range is a good way to achieve good surface following performance. An intuitive way to characterize the contact area is to compute the ratio of non-zero pixels in the subtraction image between the contact and non-contact images, which we call the contact rate.
In our experiments, the GelSight sensor image is used not only for calculating the contact rate but also as part of the input of the SFDQN. To make this efficient, we process the image so as to (1) remove superfluous information to speed up the data flow, and (2) extract contact status information to calculate the reward for deep Q-learning.
The GelSight sensor outputs color images. Before feeding these images into the SFDQN, we first remove the color information and resize the image to a lower resolution (as shown in Fig. 2(a)). After that, we store the non-contact image (taken when the sensor is not in contact with the object surface) as the image background (as shown in Fig. 2(b)). By subtracting the background from the contact image, we obtain the subtraction image. As the subtraction image is sensitive to noise, we use a threshold filter to remove noisy pixels. The outline of the contact area then becomes visible (as shown in Fig. 2(c)). Finally, we use Eq. (1) to calculate the contact rate.
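The processing chain described above (background subtraction, threshold filtering, pixel counting) can be illustrated with a minimal sketch. It is not the authors' implementation: images are assumed to be grayscale already and are stored as nested lists, the threshold of 10 is a hypothetical value, and the function returns a plain fraction between 0 and 1, whereas the exact scaling used in Eq. (1) is not reproduced here.

```python
def contact_rate(contact, background, threshold=10):
    """Fraction of pixels that differ from the background by more than `threshold`."""
    changed = 0
    total = 0
    for contact_row, background_row in zip(contact, background):
        for pixel, reference in zip(contact_row, background_row):
            total += 1
            # Threshold filter: small differences are treated as sensor noise.
            if abs(pixel - reference) > threshold:
                changed += 1
    return changed / total


# 4x4 grayscale toy images (assumed values, not real GelSight data).
background = [[100] * 4 for _ in range(4)]
contact = [row[:] for row in background]
for r, c in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    contact[r][c] = 160   # pressed region
contact[0][0] = 104       # noise below the threshold

print(contact_rate(contact, background))  # 0.25
```

In a real pipeline the same steps would more likely be carried out with array operations from an image library; the loop form is used here only to keep the sketch dependency-free.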
III-B State, Action and Reward
To formulate surface following as a reinforcement learning (RL) problem, the state should include information about the robot end-effector, the target surface, and their relative position and velocity. The state of the robot arm, i.e., the angular position and velocity of each joint, can be read directly. Information about the surface and the relative position can be obtained from the GelSight sensor feedback. We therefore combine the sensor image and the joint values as the RL state, which is also the input of the SFDQN.
The RL actions should correspond to the actual movements performed by the robot. We define actions in terms of joint motion: each joint can move forward, move backward or remain still. For $n$ joints, the number of actions is therefore $3^n$. These actions form the output of the SFDQN. In this paper, we use 2 joints to simplify the action design. The 3rd and 4th joints are programmed to execute the actions as shown in Fig. 3, which yields 9 actions as shown in Fig. 4. Additionally, the angular shift of a single action (0.2 rad by default) can be adjusted by an auxiliary program.
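As an illustration of how the discrete action set grows with the number of joints, the following sketch enumerates every combination of backward, still and forward moves. The function name and the ordering of the resulting actions are assumptions made for this example; the numbering of actions in Fig. 4 may differ.

```python
from itertools import product

STEP = 0.2          # default angular shift of a single action, in rad
MOVES = (-1, 0, 1)  # backward, remain still, forward


def build_actions(n_joints, step=STEP):
    """Return all joint-angle increments: one tuple per action, one entry per joint."""
    return [tuple(move * step for move in combo)
            for combo in product(MOVES, repeat=n_joints)]


actions = build_actions(2)  # e.g. the 3rd and 4th joints of the arm
print(len(actions))         # 9
```

Each tuple gives the angular increment to apply to the controlled joints, so the "remain still" action is the tuple of zeros.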
The reward is calculated from the contact rate using Eq. (2). If the sensor-surface contact after the action is executed is in the desired contact status (contact rate within the desired range), the agent receives a reward of 10; otherwise it receives no reward.
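The reward rule can be written as a short function. This is a sketch under stated assumptions: the bounds are passed in as parameters, the range 20 to 40 used in the example is the one reported in Section IV-C, and treating the bounds as inclusive is an assumption.

```python
def reward(contact_rate, low, high, bonus=10):
    """Return `bonus` when the contact rate lies in [low, high], else 0."""
    return bonus if low <= contact_rate <= high else 0


print(reward(30, low=20, high=40))  # 10
print(reward(55, low=20, high=40))  # 0
```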
III-C Behavior Policy and Generated Datasets
Generating the dataset is a necessary and important task prior to training the SFDQN. Given a state $s$, the standard deep Q-learning algorithm selects an action $a$ using an $\epsilon$-greedy strategy: with probability $\epsilon$ it selects a random action $a$, and otherwise it selects the action with the maximum value according to the SFDQN output. In our problem, an ideal training dataset would be generated in a real surface following scenario. However, creating such a dataset requires the GelSight sensor surface to rub against the object surface continuously. In the early training stage, the actions selected according to the SFDQN output are unreasonable and may damage the reflective membrane of the GelSight sensor or even the robot. Instead of pursuing a perfect dataset, we design an independent behavior policy to generate as many contact statuses as possible, which is presented in Algorithm 1.
At each time step, the behavior policy generates an action $a$ for the current state $s$. The designed behavior policy contains two kinds of rules. One is the complete random rule, which randomly selects an action from all the actions to enrich the diversity of the dataset. The other is the partial random rule, which randomly selects an action from a subset of actions that can drive the robot towards the desired contact status. The latter increases the ratio of positive rewards in the dataset, which benefits the training of the SFDQN. It also avoids accumulating actions that lead to a high contact rate, which can damage the GelSight sensor.
To implement the partial random rule, we classify the actions into 3 subsets according to their effect on the contact rate: ’increase’, ’decrease’ and ’unchange’. The classification can easily be done by tests, or even by analysis when it is not complicated, as shown in Fig. 4. The rule is as follows: if the current contact rate is greater than or equal to the median value of the desired contact status range, we randomly select an action from the subsets other than the ’increase’ subset; otherwise, we select an action from the subsets other than the ’decrease’ subset.
In our experiment, the behavior policy uses the 2 rules alternately. It generates actions for 5 states using the partial random rule and for the following 5 states using the complete random rule, and then switches back to the partial random rule. This is repeated until we have enough data for training the SFDQN. During the process, we record the state $s$, the generated action $a$, the new state $s'$ after $a$ is executed, and the reward $r$, which is calculated from the contact rate that Eq. (1) yields for the GelSight sensor image. The generated dataset is composed of units in the form of $(s, a, r, s')$.
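The two rules and their alternation in blocks of 5 states can be sketched as follows. This is not Algorithm 1 itself, only an illustration of the rules described in the text; the assignment of action indices to the three subsets in `SUBSETS` is hypothetical, since the real classification is given in Fig. 4.

```python
import random

# Hypothetical classification of the 9 action indices by their effect on the contact rate.
SUBSETS = {"increase": [0, 1, 2], "unchange": [3, 4, 5], "decrease": [6, 7, 8]}


def behavior_policy(step, contact_rate, subsets, low, high, block=5, rng=random):
    """Alternate `block` partially random steps with `block` completely random steps."""
    all_actions = subsets["increase"] + subsets["unchange"] + subsets["decrease"]
    if (step // block) % 2 == 1:
        # Complete random rule: any action, to enrich the diversity of the dataset.
        return rng.choice(all_actions)
    # Partial random rule: exclude actions that push the contact rate the wrong way.
    median = (low + high) / 2
    if contact_rate >= median:
        allowed = subsets["unchange"] + subsets["decrease"]
    else:
        allowed = subsets["increase"] + subsets["unchange"]
    return rng.choice(allowed)


rng = random.Random(0)
for step in range(10):
    print(step, behavior_policy(step, contact_rate=35, subsets=SUBSETS,
                                low=20, high=40, rng=rng))
```

With a contact rate of 35 and a desired range of 20 to 40, steps 0 to 4 never return an action from the 'increase' subset, while steps 5 to 9 may return any of the 9 actions.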
III-D Model architecture and training algorithm for SFDQN
We use an artificial agent, the SFDQN, to map the current information (state) to the robot motion (action), drawing on the ability of end-to-end reinforcement learning to learn policies directly from high-dimensional sensory inputs. The goal of the SFDQN is to find an optimal policy that offers the best action for a given state, so as to maximize the discounted future reward. As the GelSight sensor used to collect information from the object surface generates sequences of high-resolution images as feedback, we design the SFDQN as a deep convolutional neural network, which is especially good at extracting information from raw images, to approximate the optimal action-value (also known as $Q$) function

$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s,\ a_t = a,\ \pi \right], \tag{3}$$

which is the maximum sum of rewards $r_t$ discounted by $\gamma$ at each time step $t$, achievable after taking an action $a$ under state $s$ according to a behavior policy $\pi$. As the $Q$ function obeys the Bellman equation, Eq. (3) can be rewritten as

$$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right], \tag{4}$$

where $r$ denotes the reward after taking action $a$, and $s'$ and $a'$ denote the state and action in the time step next to $s$ and $a$, respectively. For training the Q-network, we define the loss function as follows:

$$L_i(\theta_i) = \mathbb{E}_{(s, a, r, s')}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2 \right], \tag{5}$$

where $\gamma$ is a discount factor, $\theta_i$ denotes the weights of the Q-network at iteration $i$, and $\theta_i^-$ denotes the weights of another Q-network used to compute the target at iteration $i$.
Fig. 5 shows the Q-network used to parameterize an approximate value function $Q(s, a; \theta)$. Given a state as input, the Q-network outputs the $Q$ value for each valid action. The preprocessed GelSight sensor image (as described in Section III-A) is the main input to the Q-network and is followed by several convolutional layers that extract feature information. The position and velocity values of the robot joints are fed to the Q-network as an auxiliary input, followed by a fully connected (dense) layer. The data flows from the two inputs are then merged by a concatenation layer. After 2 further dense layers, the Q-network outputs the $Q$ value for each action.
To obtain the best approximation of the optimal $Q$ function, we use the deep Q-learning algorithm to train the Q-network for the SFDQN. We apply experience replay and a separate target network to overcome the instability associated with neural-network function approximators. The former randomly samples units from the dataset as training data at each time step, which prevents correlated input to the Q-network. The latter uses another Q-network $\hat{Q}$, whose weights $\theta^-$ are updated only at a fixed interval of steps, to calculate the Q-learning target $r + \gamma \max_{a'} \hat{Q}(s', a'; \theta^-)$, where $r$ is the reward, $\gamma$ the discount factor, and $s'$ and $a'$ the next state and action. The reason is that, if the Q-learning target were calculated by the main Q-network $Q$, every weight update would also change the distribution of the label data, which could make the learning unstable. The detailed algorithm is presented in Algorithm 2.
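The interplay between random sampling from a fixed dataset and a periodically synchronized target can be illustrated with a small tabular stand-in. This sketch is not Algorithm 2: a lookup table replaces the convolutional Q-network, and the discount factor, learning rate and toy dataset are assumptions chosen for the example. Only the structure (sample a batch, compute targets from the frozen copy, update the main table, synchronize at a fixed interval) mirrors the description above.

```python
import copy
import random


def train_offline(dataset, n_states, n_actions, steps, batch_size,
                  sync_interval, gamma=0.9, lr=0.1, seed=0):
    """Offline Q-learning on (s, a, r, s_next) units with a frozen target table."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]  # main Q table (stand-in for the network)
    q_target = copy.deepcopy(q)                       # target copy, updated only on sync
    for step in range(1, steps + 1):
        batch = rng.sample(dataset, batch_size)       # random sampling breaks correlations
        for s, a, r, s_next in batch:
            y = r + gamma * max(q_target[s_next])     # Q-learning target from the frozen copy
            q[s][a] += lr * (y - q[s][a])             # gradient step on the squared TD error
        if step % sync_interval == 0:
            q_target = copy.deepcopy(q)               # synchronize target with main table
    return q


# Toy dataset: action 1 always yields reward 10 and leads to state 1,
# action 0 yields no reward and leads to state 0.
toy_dataset = [(0, 0, 0, 0), (0, 1, 10, 1), (1, 0, 0, 0), (1, 1, 10, 1)]
q_table = train_offline(toy_dataset, n_states=2, n_actions=2,
                        steps=2000, batch_size=2, sync_interval=50)
print(q_table[0][1] > q_table[0][0])  # True
```

In the experiments the corresponding settings are those listed in Section IV-C; the small values used here only keep the toy run short.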
IV Experiment Setup
The experiment was conducted on a KUKA youBot platform with a GelSight tactile sensor as the end-effector. To connect and control the KUKA youBot and the GelSight tactile sensor and to perform the neural network computation, several ROS nodes were designed to handle the data streams in the learning and testing stages. The OpenCV library is used for processing the GelSight sensor images. The neural network computation is GPU accelerated.
Finally, when the learned model is ready to be assessed, we fix its weights and use it as a surface following action generator.
IV-A Components of the experiment platform
The youBot is a mobile manipulator platform (KUKA youBot platform: http://www.youbot-store.com/) developed for basic-level robotics education, cognitive-manipulation research and industry-oriented application development. The two main components of the youBot are the 5-DOF arm and the mobile platform. youBot operation commands can be assigned to 9 joints, 5 located on the arm and 4 on the mobile platform. As a result, the youBot arm and the mobile platform can each carry out an independent task simultaneously. The original end-effector, a gripper, is replaced with the GelSight sensor. The youBot has an internal computer, but its computation power is relatively weak, so it is necessary to run the deep Q-learning program on an external computer.
The key innovation of the GelSight sensor is its inward-facing reflective membrane, which can adapt to various textures. A group of LED units is fixed inside the sensor as the illuminator. The camera is located at the center of the bottom of the sensor. Unlike traditional tactile sensors, the GelSight sensor generates a stream of high-resolution images of a target object. More details of the sensor can be found in [20, 21].
IV-B ROS nodes
The Robot Operating System (ROS, http://www.ros.org/) provides a collection of tools specialized for robot tasks. ROS manages a complex robot manipulation task by turning its subtasks into ROS nodes that communicate with each other. In this experiment, nodes with the following functions were created: (1) read in keyboard commands; (2) read in raw GelSight sensor data and process it; (3) manage action commands and transfer them to the youBot driver; (4) record and read observation data units; (5) create and train the deep neural network using the Q-learning rule; (6) read in GelSight sensor data and generate actions under the behavior policy; (7) read in GelSight sensor data and generate actions from the SFDQN.
IV-C SFDQN setup
In our experiment, we set a contact rate in the range of 20 to 40 as the desired contact status.
In the training process, we use two SFDQN models. One model includes 2 convolutional (CONV) layers to process the image input. The first CONV layer has 8 filters of size 4×4, and the second CONV layer has 16 filters of size 3×3. The other model is much deeper: it uses 10 CONV layers, with five 4×4 layers followed by five 3×3 layers. The fully connected layers use default settings.
The training of the SFDQN is carried out using Algorithm 2. The dataset contains 12,000 units. The number of training steps is set to 20,000. The number of units sampled at each step is set to 10. The target Q-network is synchronized with the main Q-network every 500 steps. The weights of the SFDQN are recorded every 100 steps.
V Experimental Results and Analysis
In this section, we analyze the performance of the proposed surface following approach, including the performance of the trained deep neural network model, the behavior policy used to generate various state-action pairs, and the image processing method.
V-A The Performance of Behavior Policy
As mentioned in Section III, we use a behavior policy to create various observation units as training data.
Its advantages are clear: it is easy to design and theoretically safe. Moreover, the two rules that make up the behavior policy both generate random actions, so a sufficiently large dataset should cover most of the possible state transitions near the sensor-surface contact point.
However, this policy has certain defects. First, it has to be manually adjusted to fit the actions to different arm poses; for example, if the youBot arm is moved to the opposite side of the mobile platform, the effect of carrying out action 8 is also inverted. Second, the policy is static, so it does not evolve as a neural network does.
In the actual test, the dataset generated by this policy did not contain high-contact states (contact rate > 300), which means the policy is safe enough to avoid risky situations.
Fig. 6 shows the distribution of actions generated by the behavior policy. The behavior policy produces an even action distribution, which helps to cover the state-action space more evenly and thus enhances exploration in the training of the SFDQN.
V-B The Performance of SFDQN
A trained SFDQN is expected to extract contact information from the GelSight sensor input and then generate actions that lead to the desired contact rate.
This ability is evaluated by checking whether the action given by the SFDQN is in the list of correct actions, where the correct actions are determined by the contact rate of the input state. For example, when the input state has a low contact rate, actions that increase the contact rate are correct; when the contact rate is already in the desired region, actions that have a trivial effect on the contact rate are considered correct.
We feed every state in a test dataset to the SFDQN and compute the proportion of correct actions it generates. This proportion, also referred to as precision, indicates the ability of a given SFDQN model. During training, we also save the weights of the SFDQN periodically and evaluate all saved models using the above method, which gives the learning curve of an SFDQN with a particular structure.
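The evaluation procedure can be sketched as follows. The term "correct" is used here for the actions accepted by the evaluation; the text specifies them for low and in-range contact rates, and the high-contact case (accepting only actions that decrease the contact rate) is an assumption made by symmetry. The subset assignment, the stand-in predictor and the test values are all hypothetical.

```python
# Hypothetical classification of the 9 action indices by their effect on the contact rate.
SUBSETS = {"increase": [0, 1, 2], "unchange": [3, 4, 5], "decrease": [6, 7, 8]}


def correct_actions(contact_rate, subsets, low, high):
    """Actions accepted as correct for a state with the given contact rate."""
    if contact_rate < low:
        return subsets["increase"]
    if contact_rate > high:
        return subsets["decrease"]  # assumed by symmetry, not stated in the text
    return subsets["unchange"]


def precision(contact_rates, predict, subsets, low, high):
    """Proportion of test states for which the predicted action is a correct one."""
    hits = sum(1 for c in contact_rates
               if predict(c) in correct_actions(c, subsets, low, high))
    return hits / len(contact_rates)


# Stand-in predictor that always pushes harder (action 0); a trained SFDQN would
# take the full state (image and joint values) rather than the contact rate.
always_increase = lambda contact_rate: 0

print(precision([10, 30, 50, 30], always_increase, SUBSETS, low=20, high=40))  # 0.25
```

Evaluating each saved set of weights with such a function and plotting the result against the training step yields a learning curve of the kind described above.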
Fig. 7 shows the learning curves of two SFDQN models with 2 and 10 CONV layers, respectively. The horizontal axis shows the time step and the vertical axis shows the precision of action prediction. The initial learning rate is set to 0.0001. Both models reached their performance peak after training for 5,000 time steps, and the deeper model tends to produce more stable action predictions than the shallow one.
V-C Real Surface Test
After full training, we tested the neural network model as a surface following action generator on a real wood surface (Fig. 1(b)). In the test, we moved the youBot alongside the wood surface manually, while the neural network model received the changing contact rate and sent the predicted action to the youBot arm. The whole system was able to perform the surface following action properly, although the GelSight sensor occasionally left the test surface. We also tested the system by touching the sensor with a finger while moving the finger up and down. This showed that the surface following action has an upper bound on velocity: the arm cannot adapt to a fast-changing surface. The reason is that an action currently updates the joint angular position, so the maximum speed is determined by the number of predicted actions per second. It is possible to define an action that updates the joint velocity instead, but velocity commands tend to have an unstable delay on the youBot platform compared to angular position commands, which could make the training process more challenging.
As shown in Fig. 2, the proposed image processing works well with the GelSight sensor, and the outline of the target object is clearly displayed. In addition, the method we used to calculate the contact rate showed stable performance and is very sensitive to slight and medium contact, especially in the central area of the sensor. The main drawback of this method is that the calculation of the contact rate relies on an independent background image. As the reward is defined in terms of the contact rate, the whole training process is affected if the background is not correct. Once the reflective membrane on the GelSight sensor is relocated, we have to retrain the neural network to make it functional again. Another problem arises when the GelSight sensor is used on complex surface shapes, e.g., a surface with a steep slope. The edge area of the sensor is less sensitive than the center area, so contact at the edge tends to be underestimated.
VI Conclusion and Future Works
In this paper, we propose a novel surface following approach based on the deep Q-learning algorithm using a GelSight sensor. We built an experiment platform with a GelSight tactile sensor and a KUKA youBot and ran a set of experiments to check the performance of the proposed solution. In conclusion, our proposed solution has reached more than 80% of the theoretical maximum performance. Future research can proceed as follows: (1) analyze different definitions of the RL elements, e.g., take a sequence of GelSight sensor images as the RL state, map the input state to a continuous action space, or change the action command type from position update to velocity update; (2) extend the current solution to other surface following problems, e.g., contour following and surface exploration.
Dr Jing Wang was supported by the China Scholarship Council for a one-year academic visit at the University of Liverpool.
-  S. Demey and J. De Schutter, “Enhancing surface following with invariant differential part models,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 668–673, 1994.
-  R. Araújo, U. Nunes, and A. T. de Almeida, “3d surface-tracking with a robot manipulator,” Journal of Intelligent and Robotic Systems, vol. 15, no. 4, pp. 401–417, 1996.
-  H. Koch, A. Konig, K. Kleinmann, A. Weigl-Seitz, and J. Suchy, “Predictive robotic contour following using laser-camera-triangulation,” in Proc. IEEE/ASME Int. Conf. Adv. Intell. Mecha. (AIM), pp. 422–427, 2011.
-  J. Back, J. Bimbo, Y. Noh, L. Seneviratne, K. Althoefer, and H. Liu, “Control a contact sensing finger for surface haptic exploration,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 2736–2741, 2014.
-  W.-C. Chang, “Cartesian-based planar contour following with automatic hybrid force and visual feedback,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), vol. 3, pp. 3062–3067, 2004.
-  M. K. Johnson and E. H. Adelson, “Retrographic sensing for the measurement of surface texture and shape,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1070–1077, 2009.
-  H. Liu, K. C. Nguyen, V. Perdereau, J. Bimbo, J. Back, M. Godden, L. D. Seneviratne, and K. Althoefer, “Finger contact sensing and the application in dexterous hand manipulation,” Autonomous Robots, vol. 39, no. 1, pp. 25–41, 2015.
-  T. Olsson, R. Johansson, and A. Robertsson, “Flexible force-vision control for surface following using multiple cameras,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), vol. 1, pp. 789–803, 2004.
-  M. Ohka, J. Takata, H. Kobayashi, H. Suzuki, N. Morisawa, and H. B. Yussof, “Object exploration and manipulation using a robotic finger equipped with an optical three-axis tactile sensor,” Robotica, vol. 27, no. 5, pp. 763–770, 2009.
-  J. Baeten and J. De Schutter, “Hybrid vision/force control at corners in planar robotic-contour following,” IEEE/ASME Trans. Mechatronics, vol. 7, no. 2, pp. 143–151, 2002.
-  D. Bossert, U.-L. Ly, and J. Vagners, “Experimental evaluation of a hybrid position and force surface following algorithm for unknown surfaces,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), vol. 3, pp. 2252–2257, 1996.
-  U. Nunes, P. Faia, and A. T. de Almeida, “Sensor-based 3-d autonomous contour-following control,” in Proc. IEEE/RSJ/GI Int. Conf. Intell. Robots Syst. (IROS), pp. 172–179, 1994.
-  S. Luo, J. Bimbo, R. Dahiya, and H. Liu, “Robotic tactile perception of object properties: A review,” Mechatronics, vol. 48, pp. 54–67, 2017.
-  S. Luo, W. Mou, K. Althoefer, and H. Liu, “Iterative closest labeled point for tactile object shape recognition,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 3137–3142, IEEE, 2016.
-  S. Luo, W. Mou, K. Althoefer, and H. Liu, “iCLAP: shape recognition by combining proprioception and touch sensing,” Autonomous Robots, vol. 43, no. 4, pp. 993–1004, 2019.
-  S. Luo, W. Mou, K. Althoefer, and H. Liu, “Localizing the object contact through matching tactile features with visual map,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), pp. 3903–3908, 2015.
-  S. Luo, W. Mou, K. Althoefer, and H. Liu, “Novel Tactile-SIFT descriptor for object shape recognition,” IEEE Sensors J., vol. 15, no. 9, pp. 5001–5009, 2015.
-  J. Bimbo, S. Luo, K. Althoefer, and H. Liu, “In-Hand Object Pose Estimation Using Covariance-Based Tactile To Geometry Matching,” IEEE Robot. Auto. Lett. (RA-L), vol. 1, no. 1, pp. 570–577, 2016.
-  Y. Chebotar, O. Kroemer, and J. Peters, “Learning robot tactile sensing for object manipulation,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 3368–3375, 2014.
-  S. Luo, W. Yuan, E. Adelson, A. Cohn, and R. Fuentes, “Vitac: Feature sharing between vision and tactile sensing for cloth texture recognition,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2018.
-  J.-T. Lee, D. Bollegala, and S. Luo, “‘Touching to see’ and ‘seeing to feel’: Robotic cross-modal sensory data generation for visual-tactile perception,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2019.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.