Deep Reinforcement Learning for Tactile Robotics: Learning to Type on a Braille Keyboard

08/06/2020 ∙ by Alex Church, et al. ∙ Google ∙ University of Bristol

Artificial touch would seem well-suited for Reinforcement Learning (RL), since both paradigms rely on interaction with an environment. Here we propose a new environment and set of tasks to encourage development of tactile reinforcement learning: learning to type on a braille keyboard. Four tasks are proposed, progressing in difficulty from arrow to alphabet keys and from discrete to continuous actions. A simulated counterpart is also constructed by sampling tactile data from the physical environment. Using state-of-the-art deep RL algorithms, we show that all of these tasks can be successfully learnt in simulation, and 3 out of 4 tasks can be learned on the real robot. A lack of sample efficiency currently makes the continuous alphabet task impractical on the robot. To the best of our knowledge, this work presents the first demonstration of successfully training deep RL agents in the real world using observations that exclusively consist of tactile images. To aid future research utilising this environment, the code for this project has been released along with designs of the braille keycaps for 3D printing and a guide for recreating the experiments. A brief video summary is also available at https://youtu.be/eNylCA2uE_E.


1 Introduction

Touch is the primary sense that humans use when interacting with their environment. Deep Reinforcement Learning (DRL) algorithms enable learning from interactions and support end-to-end learning from high-dimensional sensory input to low-level actions. However, most research has focused on vision, while a scarcity of data, sensors and problem domains relating to touch has relegated tactile research to a minor role. This trend continues even when DRL is applied to robots interacting physically with their environment, where most research uses proprioception or vision. Here we attempt to bridge the gap by positioning tactile DRL research in the context of a human-relevant task: learning to type on a braille keyboard.

The field of DRL has seen rapid progress, which in part is due to benchmark suites that allow new methods to be directly compared. These are mainly simulated environments, such as the Arcade Learning Environment [Bellemare2012TheAgents], continuous control environments [Duan2016BenchmarkingControl, Tassa2018DeepMindSuite] and simulated robot-focused environments [Plappert2018Multi-GoalResearch, Brockman2016OpenAIGym]. More recently, several robot benchmarking suites have been established [Ahn2019ROBEL:Robots, Kumar2019OffWorldResearch, Yang2019REPLAB:Learning, Mahmood2018SettingRobot, Mahmood2018BenchmarkingRobots] with the intention of moving the field towards physically-interactive environments.


Figure 1: Robot arm with tactile fingertip (panel a) pressing a 3D-printed UP-arrow key (panel b) resulting in a tactile image (panel c). Notice how the UP-key shape is visible in the spacing of the pins in the tactile imprint.

In the area of tactile robotics, there are currently no benchmarks for evaluating and comparing new methods. In this paper, we propose a challenging tactile robotics environment which we intend will serve as a tool for experimentation and fine-tuning of DRL algorithms. Furthermore, the environment challenges the capabilities of the tactile sensor in requiring sufficient sensitivity and spatial resolution and could be used to draw comparisons between different tactile sensors. The environment consists of a keyboard with 3D-printed braille keycaps, combined with a robotic arm equipped with a tactile sensor as its end effector. The primary task in this environment involves learning to type a key or sequence of keys, and has several useful attributes: it is goal driven, uses sparse rewards, contains both continuous and discrete action settings, is challenging in terms of exploration and requires a tactile sensor with both a high spatial resolution and high sensitivity. All of these aspects help to make the task representative of other robotics and tactile robotics challenges. The environment has also been designed to need minimal human intervention and be reasonably fast, necessary characteristics for real-world applications of DRL algorithms. In addition, the components required for creating this environment are either widely available or 3D printed, which allows easier adoption in the tactile robotics community.

This paper makes the following contributions. We define 4 tasks in the braille keyboard environment that progress in difficulty from arrow to alphabet keys and from discrete to continuous actions. In addition to the physical environment, we construct a simulation based on exhaustively sampling tactile data, and use this for initial validation of DRL, including agents trained with Deep Q Learning [Mnih2013PlayingLearning], Twin Delayed Deep Deterministic Policy Gradients (TD3) [Fujimoto2018AddressingMethods] and Soft Actor Critic (SAC) [Haarnoja2018SoftApplications]. We then demonstrate that for the majority of these tasks, DRL can be used to successfully train agents exclusively from tactile observations. However, a lack of sample efficiency for the most challenging task makes it impractical to train learning agents in the physical environment. This work thus leaves open the question of whether new or untested methods can improve this sample efficiency to a point where physical training is feasible, or whether alternative tactile sensors can represent the required tactile information in a concise enough form to allow for better sample efficiency in training.

Figure 2: Braille alphabet used for the tactile keyboard environment. The space bar is a blank character, which will cause deformation of the tactile sensor. Grey squares indicate positions where no deformation of the sensor will occur. A simpler task can be defined using only the 4 arrow keys.

2 Related Work

A common approach for applying DRL to robotics is to train on accurate simulations of the robot, then transfer those learned policies to the real world, bridging what is known as the ‘reality gap’ [James2018Sim-to-RealNetworks, OpenAI2018LearningManipulation]. However, a serious issue for this approach is that the physical properties of artificial tactile sensors are highly challenging to simulate, as evidenced by a lack of use in the field compared to e.g. simulated visual environments. This issue is compounded as the task complexity increases, particularly with dynamic environments. This necessitates solutions that are capable of being trained directly on a real robot to make progress on DRL for tactile sensing applied to physical interaction.

DRL has been applied to object manipulation, particularly with dexterous robotic hands. However, the most successful examples of learning in this area do not utilise tactile data [Rajeswaran2018LearningDemonstrations, OpenAI2018LearningManipulation, OpenAI2019SolvingHand], but instead deploy simulations to accelerate learning. In these cases, observations are often made up of joint angles and object positions/orientations; in practice, obtaining this information from real-world scenarios has required complicated visual systems. In their work, OpenAI state that they specifically avoid using sensors built into the physical hand, such as tactile sensors, as they are subject to “state-dependent noise that would have been difficult to model in the simulator” [OpenAI2018LearningManipulation].

Most prior research applying RL to tactile sensing has used the SynTouch Biotac, a taxel-based tactile sensor the size of a human fingertip. Van Hoof et al. [vanHoof2016StableData] demonstrated that tactile data can be used as direct input to a policy network, trained through RL, for a manipulation task on a physical robot. In [Chebotar2016Self-supervisedLearning] dimensional reduction is used to simplify the tactile input, resulting in successful regrasping of an object utilising tactile data. In [Huang2019LearningLearning], tactile and proprioceptive data are combined to achieve the task of gentle object manipulation, where exploration is improved by using tactile data to form both a penalty signal and as a target for intrinsic motivation.

Recent work has combined tactile image observations from the GelSight optical tactile sensor with proprioceptive information to use DRL to learn surface following [Lu2019SurfaceSensor]. Optical tactile sensors provide tactile information in a form that is well-matched to modern deep learning techniques honed on computer vision challenges. Given the recent successes of optical tactile sensors with deep learning [Lepora2018FromSensor, Yuan2017Shape-independentSensor, Calandra2017TheOutcomes, Calandra2018MoreTouch, Hogan2018TactileTransformations, Lepora2020IEEERAM], the combination of deep reinforcement learning and optical tactile sensing appears a promising avenue for future research.

Figure 3: Dimensions of the braille keycaps in mm, designed for Cherry MX mechanical switches.

3 Hardware

Figure 4: Visualisation of the discrete action (top) and continuous action (bottom) tasks at (a) the start, (b) the middle and (c) the end of training. The sensor is initialised in a random position with the task of pressing a randomly initialised goal key (green). Orange arrows indicate the actions taken at each step. Initially, the policy networks produce random actions that result in an incorrect key press (red). After some training, actions that lead to a correct key press are found but may be sub-optimal, often with cyclic movements. Eventually, policies should converge to a near-optimal path between the initial position and the goal key. For the discrete case, actions are given by a DQN agent trained for 0, 30 and 100 epochs. The continuous case is an illustrative depiction of successful training.

3.1 Custom biomimetic tactile fingertip

The BRL tactile fingertip (TacTip) [Ward-Cherrier2018TheMorphologies] is a low-cost, robust, 3D-printed optical tactile sensor based on the human fingertip. As the current task involves interpreting braille keycaps, we require a tactile sensor with high spatial resolution whose contact area can lie inside a standard keycap. Conventionally, the TacTip has 127 pins on a 40 mm-diameter tactile dome [Ward-Cherrier2018TheMorphologies], which is too large for this task. The sensor tip was therefore redesigned for this work to fit 331 smaller pins onto a smaller-diameter, shallower dome, arranged with sub-millimetre spacing between pins. An example tactile image obtained from this modified sensor is shown in Figure 1c, where adaptive thresholding is used to process the raw tactile image into a binary image that makes the deformation more apparent and mitigates any changes of lighting inside the sensor.

Following recent work with this tactile sensor [Lepora2018FromSensor], the pre-processed image captured by the sensor is passed directly into the machine learning algorithms. This removes the need for the pin detection and tracking algorithms that were necessary in previous work, and enables the advantages of convolutional neural networks to be applied to tactile data.

3.2 Braille Keyboard

For the tasks considered here, we chose a good-quality keyboard with a fairly stiff key switch: the DREVO Excalibur 84 Key Mechanical Keyboard with Cherry MX Black switches. These switches have a linear pressing profile and require a relatively large actuation force and several millimetres of travel before a key press registers. Furthermore, Cherry keys have good build quality, which improves consistency across keys. The keyboard was chosen to allow the tactile robot to make exploratory contacts with a key before pressing it. The dimensions of our 3D-printed keycaps are shown in Figure 3 and the full set of keys used throughout this work is shown in Figure 2.

3.3 Robotic Arm

For these experiments we use a Universal Robots UR5 robotic arm and its in-built inverse kinematics to allow for Cartesian control. This makes the environment relatively agnostic to the type of robotic arm used, with variation arising only from error accumulated due to imprecision in the inverse kinematics controller. Other than the robotic arm, the environment consists of components that are either 3D printed or widely available.

4 Benchmark Tasks

Within this tactile keyboard environment, we propose 4 tasks of progressive difficulty that have the same underlying principles:

  • Discrete actions: Arrow keys.

  • Discrete actions: Alphabet keys.

  • Continuous actions: Arrow keys.

  • Continuous actions: Alphabet keys.

The distinction between arrow and alphabet keys is shown in Figure 2. Note that the space key is included in the alphabet tasks and is represented by a blank character. Continuous actions constitute a positional change in the x and y axes and a tap depth in the z axis, where the sensor performs a downward movement and returns to a predefined height. The positional move and the tap are performed sequentially to reduce the potential for damage. An illustration of successful training of an RL agent over the alphabet keys is depicted in Figure 4 for both discrete and continuous actions.

The proposed task for this environment is to successfully press a goal key, which is randomly selected each episode from a subset of keys on a braille keyboard, and to do so without actuating any other keys. The start location is randomised for each episode. Hence, the agent must first determine its location, then navigate to the target position and finally press the target key. Positions of all the keys are learnt implicitly within the weights of a neural network. A positive reward of 1 is given for successfully pressing a goal key; an incorrect key press results in episode termination and a reward of 0. Successful learning of this task will result in an average episodic return close to 1, dependent on the exploration parameters of the RL agent.
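To make the episode structure concrete, the following is a minimal Gym-style sketch of the task logic just described (random goal and start position each episode, reward 1 only for the goal key, termination on any key actuation). The class and helper names are illustrative and are not the API of the released braille_RL code.

```python
import numpy as np


class BrailleEnvSketch:
    """Gym-style sketch of the braille typing task; not the released braille_RL API."""

    def __init__(self, keys):
        self.keys = list(keys)          # e.g. ['UP', 'DOWN', 'LEFT', 'RIGHT']
        self.goal = None

    def reset(self):
        # New episode: random goal key and random sensor start position.
        self.goal = np.random.choice(self.keys)
        self._move_sensor_to_random_key()
        return self._observation()

    def step(self, action):
        pressed_key = self._apply_action(action)       # move and/or tap the sensor
        if pressed_key is None:                        # no key actuated: episode continues
            return self._observation(), 0.0, False, {}
        # Any key actuation terminates the episode; only the goal key is rewarded.
        reward = 1.0 if pressed_key == self.goal else 0.0
        return self._observation(), reward, True, {}

    def _observation(self):
        tactile_image = self._read_tactile_image()     # binarised tactile image
        goal_one_hot = np.eye(len(self.keys))[self.keys.index(self.goal)]
        return {'tactile': tactile_image, 'goal': goal_one_hot}

    # Placeholders for robot/sensor I/O (hardware-specific, omitted here).
    def _apply_action(self, action): ...
    def _move_sensor_to_random_key(self): ...
    def _read_tactile_image(self): ...
```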

The arrow keys offer a natural way to define an easier task, since they are located away from the other keys, are fewer, and span a smaller area, which greatly reduces the size of the state space. An agent should therefore encounter more positive rewards during random exploration. Moreover, the tactile features on the arrow keys are more distinct, easing their interpretation from tactile data.

The alphabet (plus space) keys give a far more challenging task. As the state space is relatively large, learning over the full set of alphabet keys requires an approach such as Hindsight Experience Replay (HER) [Andrychowicz2017HindsightReplay] to improve data efficiency. Also, depending on the position of the sensor, some keys can be indistinguishable from each other, introducing perceptual aliasing. This aliasing makes continuous-action policies harder to learn, since with discrete actions the sensor can be arranged to always sit above the centre of a key.

Shared network
  Input dim         [100, 100, 1]
  Conv filters      [32, 64, 64]
  Kernel widths     [8, 4, 3]
  Strides           [4, 2, 1]
  Pooling           None
  Dense layers      [512,]
  Output dim        num actions
  Activation        ReLU
  Initialiser       variance scaling (scale=1.0)

Control
  Epoch steps       250
  Replay size
  Update freq       1
  No. updates       2
  Batch size        32
  Start steps       200
  Max ep len        25
  Optimizer         Adam

RL                  DQN       SAC                         TD3
  Discount (γ)      0.95      0.6 (disc) / 0.95 (cont)    0.95
  Polyak (τ)        0.005     0.995                       0.995
  LR
  Alpha LR          n/a                                   n/a

Explore             DQN       SAC                         TD3
  Initial           1.0       n/a                         1.0
  Final             0.1       n/a                         0.05
  Decay steps       2000      n/a                         n/a
  Target entropy    n/a       0.139 (disc) / -6 (cont)    n/a

Table 1: DRL and network hyperparameters.
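For reference, here is a sketch of the shared network from Table 1 (the Atari-style CNN). The choice of PyTorch and the way the goal/previous-action vector is appended after the convolutional layers are our assumptions for illustration, not the released implementation; the variance-scaling initialiser from Table 1 is omitted for brevity.

```python
import torch
import torch.nn as nn


class SharedNetwork(nn.Module):
    """Table 1 architecture: conv filters [32, 64, 64], kernels [8, 4, 3],
    strides [4, 2, 1], no pooling, one dense layer of 512 units, ReLU."""

    def __init__(self, num_actions, extra_dim=0):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened feature size for a 100x100 single-channel input.
        with torch.no_grad():
            conv_out = self.conv(torch.zeros(1, 1, 100, 100)).shape[1]
        # Goal and previous action are concatenated after the conv layers.
        self.head = nn.Sequential(
            nn.Linear(conv_out + extra_dim, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, tactile_image, extra=None):
        x = self.conv(tactile_image)
        if extra is not None:
            x = torch.cat([x, extra], dim=1)
        return self.head(x)
```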

5 Initial Assessment Using Supervised Deep Learning

As a first step, we check that the braille keys can be interpreted by the tactile sensor without exerting enough force to activate the button. To test this, we used supervised learning to classify all keys on the braille keyboard (shown in Figure 2). Overall, we used 100 samples per key for training and 50 samples per key for validation, resulting in 3100 training images and 1550 validation images.

To make the task more representative of how humans perform key presses, we introduced small random perturbations in the action. Each tactile image was collected at the bottom of a tapping motion, where the sensor was positioned above the centre of each key and moved downwards, finishing above the activation point for a key press. A random variation in tap depth was used to represent uncertainty in the key height, ranging from ‘barely touching’ to ‘nearly activating’ the button. Similar random horizontal (x, y) perturbations and a random orientation perturbation were applied to add further variation to the collected data.

Tactile images were cropped and resized down to 100×100 pixels (from the raw camera resolution), then adaptive thresholding was used to binarise the image; in addition to reducing image storage, this also helps accentuate tactile features and mitigate any issues from changes in internal lighting.
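A minimal sketch of this pre-processing pipeline using OpenCV; the crop window and thresholding parameters below are illustrative placeholders rather than the values used in the paper.

```python
import cv2
import numpy as np


def preprocess_tactile_image(raw_gray, crop=(0, 0, 480, 480), out_size=(100, 100)):
    """raw_gray: 8-bit single-channel image from the sensor camera."""
    x, y, w, h = crop
    img = raw_gray[y:y + h, x:x + w]                       # crop to the sensor tip
    img = cv2.resize(img, out_size, interpolation=cv2.INTER_AREA)
    # Adaptive thresholding binarises the image, accentuating pin deformation
    # and mitigating changes in lighting inside the sensor.
    img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 11, 2)
    return (img / 255.0).astype(np.float32)
```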

The network used in this task follows the architecture used for learning to play Atari games, originally proposed in [Mnih2015Human-levelLearning]. The same network architecture is used for all reinforcement learning tasks (details given in Table 1). For supervised learning, we perform data augmentation including random shifting and zooming of the images. We also use early stopping, a decaying learning rate, batch normalisation on the convolutional layers after the activation, and dropout on the fully connected layers.
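The augmentation and training schedule could be set up as in the sketch below. The torchvision/torch utilities, the stand-in classifier and the learning rate are our assumptions for illustration; the original implementation may use a different framework and the Table 1 CNN with batch normalisation and dropout added.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Random shifting and zooming of the binarised tactile images; applied to
# (1, 100, 100) float tensors (recent torchvision supports tensor inputs).
augment = transforms.RandomAffine(degrees=0, translate=(0.05, 0.05), scale=(0.9, 1.1))

# Stand-in classifier over 31 key classes (26 letters + space + 4 arrows).
classifier = nn.Sequential(
    nn.Flatten(), nn.Linear(100 * 100, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 31),
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)  # placeholder LR
# Decay the learning rate when validation loss plateaus; early stopping is done
# by tracking the best validation loss across epochs (training loop not shown).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)
```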

Near-perfect overall classification accuracy was achieved, demonstrating that the braille can be interpreted by the tactile sensor without activating the button, even when there is significant uncertainty about how the key is pressed.

6 Simulated Environment

Even though our focus in this work is DRL on a physical tactile robot, it is useful to have a simulated task that is similar to the physical environment yet runs much faster. While the policies trained in simulation may not be directly transferable to the physical environment, the parameters found should guide successful training of policies on the physical robot.

In the simulation of the discrete task, actions lead to locations on a virtual keyboard with a known label. The label is then used to retrieve a tactile image with the same label from the data collected for the initial assessment with supervised learning (Section 5). Thus the simulation closely matches the physical environment, where the label of each key may not be known but a tactile image observation can be collected that resembles the one obtained from the simulated environment. In practice, this simulated discrete environment is slightly harder than the physically interactive environment, because of the perturbations introduced during data collection that we chose not to introduce during reinforcement learning on the physical robot.

The simulation of the continuous task is more difficult to construct, because we do not have labels for every location on the keyboard. To approximate the physical environment, we collect tactile observations over a dense grid of locations spanning the keyboard, storing each observation together with its location (using regular intervals over the x and y dimensions and over the tap depth z). Along with the tactile image observations, we also record whether a key has been activated. During simulation, the position of the virtual sensor is known, which allows a tactile image to be sampled from the collected data with the minimum Euclidean distance from the virtual sensor position.
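A sketch of this nearest-neighbour lookup, assuming the sampled positions, images and key-activation flags are stored as arrays; the class and attribute names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree


class TactileLookup:
    """Return the pre-collected tactile image whose sampling location is
    closest (in Euclidean distance) to the virtual sensor position."""

    def __init__(self, positions, images, key_activated):
        # positions:     (N, 3) array of sampled (x, y, z) locations
        # images:        (N, H, W) array of binarised tactile images
        # key_activated: (N,) bool array, True if that sample pressed a key
        self.tree = cKDTree(positions)
        self.images = images
        self.key_activated = key_activated

    def observe(self, sensor_position):
        _, idx = self.tree.query(sensor_position)   # nearest stored sample
        return self.images[idx], self.key_activated[idx]
```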

In both cases, the simulated environment ignores effects such as error that accumulates over long sequences of robot arm moves, sensor drift, changing lighting conditions, movement of the keyboard and any other information that is not represented during the data collection stage. However, whilst these effects can occur on a real robot, the simulated tasks still offer useful information for hyper-parameter tuning and initial validation of RL algorithms.

Figure 5: Training curves showing the average episodic return for all tasks; discrete task results are shown on the top row and continuous task results on the bottom row. Each epoch corresponds to 250 steps in the environment. For the simulated tasks, results are averaged over 5 seeds; the physical tasks show results for 1 seed. Coloured dashed lines mark the first point at which episodic return reaches 95% of the maximum reached throughout training.

7 Tactile Reinforcement Learning

7.1 Reinforcement Learning Formulation

To define the problem, we represent the proposed tasks in a standard Markov Decision Process (MDP) formulation for reinforcement learning. An agent interacts with an environment at discrete time steps t, each consisting of a state s_t, action a_t, resulting reward r_t, resulting state s_{t+1} and terminal signal d_t. Actions are sampled from a policy distribution π(a_t | s_t) represented by a neural network.

States are represented as a combination of a tactile image observation from the sensor, the goal and the previous action. Goals are selected randomly per episode and represented as a one-hot array, where the ‘hot’ value indicates the class label of the target key. The previous action is required to avoid an agent becoming stuck in cyclic patterns when no tactile deformation is present on the sensor. Both the goal and the previous action are concatenated into the policy networks after any convolutional layers. The reward is sparse and binary: r = 1 when the correct button is actuated and r = 0 otherwise. Activation of any button, correct or incorrect, produces a terminal signal that resets the environment, with the sensor moving to a new random location on the keyboard.
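As an illustration, the non-image part of the state for the discrete tasks could be assembled as below (one-hot goal plus one-hot previous action); the exact encoding used in the released code may differ, and for continuous tasks the previous action would instead be the raw action vector.

```python
import numpy as np


def build_state(tactile_image, goal_idx, prev_action_idx, num_keys, num_actions):
    """Assemble the observation passed to the policy network."""
    goal = np.zeros(num_keys, dtype=np.float32)
    goal[goal_idx] = 1.0
    prev_action = np.zeros(num_actions, dtype=np.float32)
    if prev_action_idx is not None:          # None at the start of an episode
        prev_action[prev_action_idx] = 1.0
    # The image feeds the conv layers; goal and previous action are appended
    # to the flattened conv features inside the network (see Table 1 sketch).
    return {'image': tactile_image, 'extra': np.concatenate([goal, prev_action])}
```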

In the discrete tasks, actions are selected from the set {UP, DOWN, LEFT, RIGHT, PRESS}. For each movement action, the tactile sensor is re-positioned above the centre of a neighbouring key, where a tap action (which does not activate a key) is performed and a tactile image for the next state is gathered. The PRESS action moves the sensor down along the z-axis far enough to activate the key beneath it.

In the continuous tasks, each action is a vector of positional changes (dx, dy, dz), where dz corresponds to a tapping depth. For practical reasons, actions are bound to a finite range dependent on the task: the enforced x and y bounds differ between the arrow and alphabet tasks, and in both tasks dz is bound to a range that ensures safe key actuation.
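A common way to enforce such bounds, sketched here as an assumption rather than the paper's exact method, is to map a policy output in [-1, 1] to the task-specific range. The numerical bounds in the example are placeholders, not the values used in the paper.

```python
import numpy as np


def scale_action(raw_action, low, high):
    """Map a policy output in [-1, 1] to the task-specific bounds [low, high]."""
    raw_action = np.clip(raw_action, -1.0, 1.0)
    return low + 0.5 * (raw_action + 1.0) * (high - low)


# Example with placeholder bounds: dx, dy in [-10, 10] mm, tap depth dz in [2, 5] mm.
low = np.array([-10.0, -10.0, 2.0])
high = np.array([10.0, 10.0, 5.0])
action = scale_action(np.array([0.2, -0.7, 0.0]), low, high)
```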

7.2 Reinforcement Learning Algorithms

There are several popular DRL algorithms in common use [Mnih2013PlayingLearning, Schulman2015TrustOptimization, Schulman2017ProximalAlgorithms, Lillicrap2015ContinuousLearning, Fujimoto2018AddressingMethods, Haarnoja2018SoftActor, Abdolmaleki2018MaximumOptimisation]. However, there are two major problems holding back the application of DRL to physical robotics: poor sample efficiency and brittleness with respect to hyper-parameters. Generally, on-policy methods such as Trust Region Policy Optimisation (TRPO) and Proximal Policy Optimisation (PPO) sacrifice sample efficiency to gain stability and robustness to hyper-parameters, making them difficult to apply to physical robots. Since its introduction, Deep Q-Learning (DQN) has had various improvements to address these issues, some of which are amalgamated in RAINBOW [Hessel2017Rainbow:Learning]. Off-policy entropy-regularised reinforcement learning has also separately addressed these issues with Soft Actor Critic (SAC) [Haarnoja2018SoftActor] and Maximum a Posterior Policy Optimisation (MPO) [Abdolmaleki2018MaximumOptimisation]. Both of these offer good sample efficiency, robustness to hyper-parameter selection and work with either continuous or discrete action spaces. SAC has led to the most follow-up research and a modified version has demonstrated successful training when applied to physical robots [Haarnoja2018SoftApplications]. Twin Delayed Deep Deterministic Policy Gradients (TD3) [Fujimoto2018AddressingMethods] was developed concurrently to SAC and offers similar performance.

For discrete-action tasks, we use DQN with the double [Hasselt2016DeepQ-Learning] and dueling [Wang2015DuelingLearning] improvements, and SAC adapted for discrete actions. For continuous-action tasks, we use TD3 and SAC; since SAC is applicable to both types of action space, some hyper-parameters are transferable over all tasks.

Another barrier to the application of DRL to robotics is when environments require challenging exploration combined with sparse rewards, as convergence is then very slow. A common technique to alleviate this is to use dense, shaped rewards [Popov2017Data-efficientManipulation, Mahmood2018BenchmarkingRobots, Yang2019REPLAB:Learning, Haarnoja2018SoftApplications]. However, this can require domain-specific knowledge and bias learning algorithms towards sub-optimal policies. Hindsight Experience Replay (HER) helps address this problem by replaying episodes stored in a buffer whilst varying the goal from what was initially intended. Here we find HER significantly improves performance and sample efficiency for all considered tasks.
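As a concrete illustration, here is a minimal sketch of HER-style goal relabelling for this task (the 'final' strategy: treat the key that was actually pressed as if it had been the goal, turning a failed episode into a successful one for learning purposes). The transition format is hypothetical and the released code may implement this differently.

```python
import copy


def relabel_episode(episode):
    """episode: list of transitions, each a dict with keys 'obs', 'action',
    'reward', 'next_obs', 'done', 'goal' and, on the final transition,
    'achieved_key' (the key actually actuated, or None if no key was pressed)."""
    achieved = episode[-1].get('achieved_key')
    if achieved is None:                      # no key actuated: nothing to relabel
        return []
    relabelled = []
    for t in episode:
        new_t = copy.deepcopy(t)
        new_t['goal'] = achieved              # pretend the pressed key was the goal
        new_t['reward'] = 1.0 if new_t['done'] else 0.0
        # In practice the goal component embedded in 'obs'/'next_obs' must also
        # be replaced to match the relabelled goal.
        relabelled.append(new_t)
    return relabelled
```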

Several other adjustments were also made to optimise performance. For all algorithms except TD3, the ratio of optimisation steps per environment step was increased to 2 (Table 1) to improve efficiency, because environment steps are far more costly than optimisation steps; however, this hindered the stability of TD3. Additionally, for both DQN and TD3 we linearly decayed the exploration parameter over an initial number of exploration steps. For SAC and TD3, Polyak averaging was used for the target networks with a coefficient of 0.995, although we found that SAC could also learn well with hard target updates.
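Two of these adjustments are sketched below with PyTorch for concreteness; the 0.995 Polyak coefficient and the DQN exploration schedule (1.0 to 0.1 over 2000 steps) come from Table 1, while the function names and framework are ours.

```python
import torch


def polyak_update(target_net, online_net, coeff=0.995):
    """Soft target update: target <- coeff * target + (1 - coeff) * online."""
    with torch.no_grad():
        for tp, op in zip(target_net.parameters(), online_net.parameters()):
            tp.data.mul_(coeff).add_((1.0 - coeff) * op.data)


def linear_epsilon(step, start=1.0, end=0.1, decay_steps=2000):
    """Linearly decay the exploration parameter over the first decay_steps steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```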

8 Evaluation Metrics

To quantify the performance when training an agent on this benchmark, we introduce several evaluation metrics particular to this task. These can be used to compare results of this benchmark when using alternative algorithms or alternative sensors. The evaluation metrics introduced are:

  • Convergence Epoch: Measures the first epoch in which 95% of the maximum performance across the full training run is achieved, and serves as an indication of sample efficiency during training. This measure is subject to some noise and works best when results are averaged over multiple seeds or when test results are averaged over a relatively large number of episodes.

  • Typing Accuracy: Measures the accuracy when typing example sentences or sequences of keys. As a reward of 1 is given for correct key presses this metric is directly correlated with the average return.

  • Typing Efficiency: Measures the number of steps taken to press a key and gives an indication as to whether the policies learnt are near optimal.

When measuring typing accuracy and efficiency for the arrow task, all 24 permutations of {UP, DOWN, LEFT, RIGHT} are evaluated. For the alphabet task, we provide 10 randomly generated permutations of the alphabet. This tests that a mapping between all keys is stored implicitly within the weights of the policy neural networks. During evaluation, the sensor is initialised in a random starting position for each sequence and, unlike during training, the sensor position is not randomly reset after each key press (episode termination). Furthermore, during evaluation we use deterministic actions to avoid mispressed keys caused by unnecessary exploration. This procedure gives a more consistent value than the average episodic return, which allows better comparisons to be drawn.
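The evaluation loop could look like the sketch below: the agent types a fixed key sequence with deterministic actions, the sensor position is not reset between keys, and correct presses and steps are counted. The environment helpers for setting the goal and initial position are hypothetical.

```python
def evaluate_typing(env, policy, key_sequence, max_steps_per_key=25):
    """Return (typing accuracy, total steps taken) over one key sequence."""
    correct, total_steps = 0, 0
    obs = env.reset_to_random_position()      # hypothetical helper
    for target_key in key_sequence:
        env.set_goal(target_key)              # hypothetical helper
        for _ in range(max_steps_per_key):
            action = policy(obs, deterministic=True)
            obs, reward, done, _ = env.step(action)
            total_steps += 1
            if done:                          # a key was actuated
                correct += int(reward > 0)
                break
    return correct / len(key_sequence), total_steps
```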

9 Results

9.1 Results for Discrete Tasks

For the discrete arrow key and alphabet key tasks, we find that both DQN and Discrete SAC rapidly converge to accurate performance in simulation and on the physical robot. Testing over multiple seeds in simulation shows that this learning is stable and consistently converges to an average episodic reward of near 1. The training curves for the discrete tasks for both the arrow and alphabet key environments (Figure 5) show that the task can be learned in all cases.

For the arrow task, asymptotic performance is achieved within 15 epochs, which is approximately equal to 1 hour of training time on the physical robot. For the alphabet task, convergence takes longer, with asymptotic performance achieved within 60 epochs, or approximately 4 hours of training time on the physical robot. Similar sample efficiency and final performance are found for both discrete SAC and DQN across all trained agents. Whilst final performance appears slightly higher for discrete SAC, this can be explained by the target entropy causing lower levels of exploration as the agent nears convergence. Decaying the exploration parameter (ε) to a lower value during training for the DQN agents, or evaluating with deterministic actions, results in similar final performance to discrete SAC.

9.2 Results for Continuous Tasks

The continuous control tasks are far more challenging due to their larger state spaces. For the arrow task in simulation, we find SAC is still able to achieve asymptotic performance close to the maximum return within 50 epochs. TD3 also achieves similar performance within 100 epochs; however, its training is less stable and not robust to hyper-parameter changes. Due to this instability, we do not evaluate TD3 in the physical environment. The training curves for all continuous tasks that we were able to run to completion are shown in Figure 5 (bottom panels).

For the continuous arrow task, continuous control is not as well represented in simulation; this is shown by the left and middle panels of Figure 5 (bottom) not matching closely. On the physical robot, we find convergence to a slightly lower average episodic return after about 76 epochs, compared with 25 epochs in simulation.

The alphabet task again increases the task complexity. For a single-seeded run in simulation, both the continuous SAC and TD3 algorithms reach good final average episodic returns, with SAC outperforming TD3 (Table 2). This final performance takes significantly longer to achieve than in all other tasks, with convergence occurring around 872 epochs for SAC and 1246 epochs for TD3. This would correspond to approximately 60 hours of physical robot training time, which is currently not feasible given laboratory operating constraints.

9.3 Performance on Evaluation Metrics

Setting      Task         Algorithm   Steps   Accuracy   Convergence Epoch

Simulation   Disc Arrow   DQN         230     1.0        8
             Disc Arrow   SAC_DISC    234     1.0        8
             Disc Alpha   DQN         1722    0.981      45
             Disc Alpha   SAC_DISC    1649    0.940      39
             Cont Arrow   TD3         140     0.906      60
             Cont Arrow   SAC         133     1.0        25
             Cont Alpha   TD3         1169    0.811      1246
             Cont Alpha   SAC         1193    0.933      872

Physical     Disc Arrow   DQN         246     1.0        8
             Disc Arrow   SAC_DISC    246     1.0        6
             Disc Alpha   DQN         1722    0.992      24
             Disc Alpha   SAC_DISC    1649    0.985      32
             Cont Arrow   SAC         364     0.938      76

Table 2: Results for trained agent evaluation.

An overview of the results across all algorithms used within this work is given in Table 2. These can be used as a comparative point for future work using this benchmark. For discrete tasks, we find that high typing accuracy is possible across both simulated and physical environments. In evaluation, both SAC_DISC and DQN have comparable results for typing accuracy and typing efficiency. In simulation, continuous tasks display a large reduction in the number of steps taken to activate a goal key, due to the more efficient path across the keyboard that the sensor can take. However, this comes at the cost of a reduction in accuracy. These more-efficient policies are not observed in the physical task, likely due to more variety in tactile observations causing less certainty in the selected actions. In the simulated continuous alphabet task, TD3 gave high performance. However, these results were difficult to achieve consistently, making TD3 impractical for application to the physical robot.

10 Discussion

This work proposes a benchmarking tool aimed at developing the area of reinforcement learning for tactile robotics. Four tasks of varying difficulty are proposed, together with a representative simulated environment to allow for easier exploration of algorithmic adjustments. Evaluation metrics are also provided to allow for a quantitative comparison when using this benchmark with alternative algorithms or alternative sensors.

We evaluate the performance of several learning-based agents on the proposed benchmark tasks, both in simulation and on the physical robot. We demonstrate that successful learning is possible across all tasks within simulation and across 3 of 4 tasks on the physical robot. Currently, training the physical agent with continuous actions for the full alphabet task has not been achieved within the time allowed by operating constraints in our laboratory. Some example techniques that have not yet been implemented include using prioritised experience replay [schaul2015prioritized] to improve efficiency, and scheduling the ratio of optimisation steps per environment step throughout learning.

During development, we found some techniques were crucial to achieving successful learning on a physical robot. For example, HER gave sample-efficiency improvements of up to a factor of 10, along with boosting final performance. This was particularly evident on the alphabet tasks, where rewards are less frequently encountered under random exploration. Training a DQN agent on the simulated discrete alphabet task with HER achieved near-final performance within 50 epochs; without HER, reaching a similar level of performance took roughly an order of magnitude more epochs. Thus, HER was required for feasible learning of the tasks on a physical system. When using RL to solve physical robotics problems, designing the tasks to be goal-orientated can allow more general behaviour to be learnt whilst taking advantage of the benefits that HER provides.

We also found that learnt optimal policies were sensitive to factors other than the algorithm hyper-parameters. For example, when using large action ranges on the continuous arrow tasks, the policies tended towards large movements in the direction of a goal key with low dependence on the current tactile observation. This behaviour arises because a relatively high average return can be found with this method alone, which causes agents to become stuck in a local optimum. Reducing the action ranges to lower values minimised this effect because the sensor was less likely to jump directly to the correct key. Thus, the average return from following this sub-optimal policy was reduced, which ultimately resulted in better policies.

Another useful technique was to create a simulation that is partially representative of the final task. Even with recent advancements, DRL is notoriously sensitive and brittle with respect to hyper-parameters, and so a fast, simulated environment helped find parameter regions that allowed for successful and stable training. For example, TD3 was so sensitive that a bad starting seed could cause minimal learning of the task. If attempting to find stable hyper-parameter regions on the physical task, multiple-seeded runs took hours or days of lab operation time. Therefore, where possible, simulating a simplified version of the problem provides valuable information for the physical task.

The braille task is designed to be representative of a multitude of tactile robotics tasks in which it may not be practically feasible to create a simulated environment via exhaustive sampling. Therefore, a main aim of this study was to achieve training from scratch in the physical environment. That said, in some circumstances it would be interesting to explore the benefits of transfer learning from simulation to reality. For a preliminary investigation in simulation, we attempted to capture an important aspect of switching from simulation to the physical robot by artificially increasing the step size of the sampled data in the continuous alphabet task (mimicking that the sampled intervals are a discrete approximation of the continuous physical environment). We found that learning on the higher-density task could be accelerated by transferring policies trained on the lower-density task. Whilst this is not entirely aligned with the problem of transfer learning between simulation and reality, these preliminary results demonstrate the potential for large sample-efficiency improvements. However, there are multiple methods of transferring trained policies from simulation to reality, opening up an avenue of future work that uses this platform to examine these sim-to-real approaches.

Previous research on DRL and tactile data has either used taxel-based sensors [vanHoof2016StableData, Chebotar2016Self-supervisedLearning, wu2019mat, Huang2019LearningLearning], or has combined optical tactile images with proprioceptive information [Lu2019SurfaceSensor]. To the best of our knowledge, this work presents the first demonstration of successfully training DRL agents in the real world, with observations that exclusively comprise high-dimensional tactile images. Though this work has only presented an evaluation of the TacTip sensor, the proposed experiments could offer valuable comparisons between alternative tactile sensors, particularly in the context of applications using recent DRL techniques. It is possible that alternative tactile sensors could represent the tactile information in a more concise form that allows for more sample-efficient learning, which is the most limiting factor found during this work and an interesting topic for future investigation. To aid with future work, the code used for this project has been released along with designs of the braille keycaps for 3D printing and a guide for recreating experiments and evaluating trained agents (available at https://github.com/ac-93/braille_RL).

References