Hierarchical Policy Design for Sample-Efficient Learning of Robot Table Tennis Through Self-Play

11/30/2018 ∙ by Reza Mahjourian, et al.

Training robots with physical bodies requires developing new methods and action representations that allow the learning agents to explore the space of policies efficiently. This work studies sample-efficient learning of complex policies in the context of robot table tennis. It incorporates learning into a hierarchical control framework using a model-free strategy layer (which requires complex reasoning about opponents that is difficult to do in a model-based way), model-based prediction of external objects (which are difficult to control directly with analytic control methods, but governed by learnable and relatively simple laws of physics), and analytic controllers for the robot itself. Human demonstrations are used to train dynamics models, which together with the analytic controller allow any robot that is physically capable of playing table tennis to do so without training episodes. Self-play is used to train cooperative and adversarial strategies on top of model-based striking skills trained from human demonstrations. After only about 24000 strikes in self-play, the agent learns to best exploit the human dynamics models for longer cooperative games. Further experiments demonstrate that model-free variants of the policy can discover new strikes not demonstrated by humans and achieve higher performance at the expense of lower sample-efficiency. Experiments are carried out in a virtual reality environment using sensory observations that are obtainable in the real world. The high sample-efficiency demonstrated in the evaluations shows that the proposed method is suitable for learning directly on physical robots without transfer of models or policies from simulation.


1 Introduction

From ancient mythologies depicting artificial people to the modern science fiction writings of Karel Čapek and Isaac Asimov, there seems to be a clear image of what robots ought to be able to do. They are expected to operate in the world like human beings, to understand the world as humans do, and to be able to act in it with comparable dexterity and agility.

Just as today most households can have personal computers in the form of desktops, tablets, and phones, one can imagine a future where households can use the assistance of humanoid robots. Rather than being pre-programmed to do specific jobs like communicating with people, helping with kitchen work, or taking care of pets, these robots would be able to learn new skills by observing and interacting with humans. They can collectively share what they learn in different environments and use each other’s knowledge to best approach a new task. They already know their bodies well and are aware of their physical abilities. They are also aware of how the world and the common objects in it work. They just need to learn how to adapt to a new environment and a new task. If they need to learn a new skill by trying it out, they can do so efficiently. They can learn a lot from a few attempts and use reasoning and generalization to infer the best approach to complete the task without having to try it thousands of times.

This work takes a step in that direction by building a robotic table tennis agent that learns the dynamics of the game by observing human players, and learns to improve over the strategy demonstrated by humans using very few training episodes where the agent plays against itself in a self-play setup.

1.1 Motivation

The rate of progress in the creation of intelligent robots seems to have been slower than in other areas of artificial intelligence, like machine learning. That is because intelligent robotics requires not only human-like cognition, but also human-like movement and manipulation in the world. As of now, the most successful applications of robotics remain in the industrial domains, where the focus is on precision and repeatability. In those environments, the expected robot motion is known beforehand and there is no need to deviate from it. However, the general usability of robots depends on their ability to execute complex actions that require making multiple decisions through time.

Deep learning and reinforcement learning have been successful in solving interesting problems like object detection, playing Atari games, and playing board games like chess and Go. These advances have made it possible to approach human-level perception and cognition abilities. While perception problems can be learned in data centers using millions of data samples and training episodes, learning general robotic skills requires interacting with physical robot bodies and environments, which cannot be parallelized. Therefore, learning robotic agents need to be very efficient in how they use training samples.

This work explores sample-efficient learning of complex robotic skills in the context of table tennis. Playing robot table tennis games is a challenging task, as it requires understanding the physics of the robot and the game objects, planning to make contact with the ball, and reasoning about the opponent’s behavior.

1.2 Learning for Robot Table Tennis

There have been many examples where the application of deep learning to a problem has resulted in a superior approach with improved performance. For example, object classification and object detection tasks used to rely mainly on engineered SIFT features [11]. However, AlexNet [9] demonstrated end-to-end learning of object classification on ImageNet [7] using convolutional neural networks. Fig. 1 visualizes the convolutional filters learned in the first layer of the AlexNet neural network. These filters can be regarded as the learned equivalents of SIFT image features. In this domain, using the neural network to solve the task end-to-end allowed it to discover a suitable representation for image features that outperformed engineered features.

Figure 1: Convolutional Filters Learned in AlexNet [9] for Image Classification. The image shows 96 convolutional kernels of size 11x11x3 learned by the first convolutional layer in AlexNet. Deep learning is able to discover suitable features for the task of image classification. These learned features perform better than the engineered SIFT features.

As another example, for the tasks of speech recognition and language translation, end-to-end learning has replaced pipelines based on human-designed acoustic models and language models with neural networks that outperform the old approaches. In the classic pipelines, the vocabularies shared between the pipeline components were engineered and fixed, and the components were restricted to choosing their outputs from these hand-designed vocabularies. In the end-to-end approach, the network is free to learn and use an internal embedding for the speech data and the language data. This added freedom allowed deep learning to discover intermediate representations and features that are more suitable for solving the task.

Similarly, Mnih et al. [14] applied deep reinforcement learning to playing Atari games and demonstrated the ability of deep learning to discover a value network that can map raw pixels in the game to an expectation of future rewards.

These successes suggest that there is similar opportunity for applying deep learning to discover novel intelligent robotic behaviors. In the domain of table tennis, there is the potential for learning to discover:

  1. Better Strikes: Can the robot swing the paddle in new ways beyond what humans have tried in table tennis? In sports, one can observe leaps where a player tries a new technique and then very quickly it is adopted by other players. For example, in the early nineties handball players started using spinshots that would hit the floor past the goalkeeper and turn to go inside the goal. Can reinforcement learning discover new striking motions for hitting the table tennis ball?

  2. Better Game Strategy: There are established human strategies for playing adversarial table tennis games. Can reinforcement learning discover new overall game play strategies that are more effective in defeating a human opponent?

1.3 Challenges

General learning algorithms typically require millions of samples or training episodes to learn a task. Collecting samples for learning robotic tasks is costly, since each sample can cause wear and tear on the robot. The process is also time-consuming since interactions in the real world need to happen in real time and cannot be sped up by faster compute. In addition, robotic environments are often fragile and one cannot depend on agents learning automatically in unsupervised environments. Often, things break or objects get displaced requiring operator intervention to restore the setup to continue the training.

There is also an outer loop around the learning algorithms. Applying reinforcement learning is typically a trial-and-error process. The researchers usually develop new methods in an iterative manner by trying different approaches and hyperparameters. For every new instance of the problem, the learning algorithm is typically run from scratch.

The end-to-end learning approach based on producing and consuming more and more data is not suitable for robotics. It is possible to bring some scale to learning robotic tasks using parallel setups like arm farms. However, this amount of parallelism is not enough to overcome the physical limitations that come with learning in the real world.

Learning end-to-end policies poses another challenge, which is identifying the source of bugs or inefficiencies in one component of the implementation. In an end-to-end setup, the impact of a new change can only be observed by how it affects the overall performance of the system. Often learning algorithms are able to mask bugs by continuing to operate at a slightly reduced capacity or precision, thereby making it difficult to trace the root source of a problem after a few stages of development.

Some applications of deep learning to robotics can avoid some of the physical limitations by focusing on the perception part of the problem and ignoring learning motor skills. For example, object grasping can be approached as a regression problem, where the agent maps the input image to a grasp position and angle, which is then executed using a canned motion. However, when learning robotic skills it is very desirable for the learning algorithms to also discover novel motor skills. Learning algorithms may be able to discover new ways of handling objects that are more suitable for robots, and more effective with fewer degrees of freedom typically present in robot bodies.

A common approach to learning robotic tasks is sim2real: learning in simulation and then transferring the policy to work in the real world. With this method, learning can be done in the simulator. However, this approach requires solving a secondary problem, which is making the learned policy work with the real sensory observations and the control dynamics of the physical robot. Depending on the task, this transfer might not be any easier than learning the main problem.

Achieving breakthroughs in robotic learning most likely depends on discovering new learning approaches and new intermediate state and action representations that allow learning algorithms to spend a limited experimentation budget more strategically. Such intermediate state and action representations should be general enough to sufficiently capture the space of all policies. Yet, at the same time, they should be high-level enough to allow the learning agent to efficiently explore the space of all policies without having to try every combination.

1.4 Approach

The approach presented in this work incorporates learning into a hierarchical control framework for a robot playing table tennis by using a model-free strategy layer (which requires complex reasoning about opponents that is difficult to do in a model-based way), model-based prediction of external objects (which are difficult to control directly with analytic control methods, but governed by learnable and relatively simple laws of physics), and analytic controllers for the robot itself.

The key guiding principle in this work is to develop a learning approach that can discover general robotic behaviors for table tennis, yet is sample-efficient enough that it can be deployed in the real world without relying on transfer learning from simulators.

The following subsections provide an overview of the main components of the approach.

1.4.1 Virtual Reality Learning Environment

The method is developed in a Virtual Reality (VR) environment which allows for capturing the same sensory observations that would be available in a real-world table tennis environment.

Although the method proposed in this work can be combined with sim2real approaches by using the models and policies learned in the VR environment as a starting point for training real-world models and policies, the emphasis in this work is on developing learning methods that would be sample-efficient in the real world. To this end, the observation and action spaces in the simulator and virtual reality environments are chosen such that they have parallels in the real world. More specifically, the state space of the learning agents is limited to the low-dimensional ball motion state and paddle motion state that can be obtained in the real world with similar sensors to what is used in the VR environment. Similarly, the action space of the learning agents is target paddle motion states, which can be tracked and controlled precisely on physical robots.

The ball motion state includes the position and velocity of the ball. In the VR environment, the ball motion state is available from the underlying simulator. In the real world, a ball tracker [22, 4] can provide the position and velocity of the ball. Ball trackers usually track the ball velocity as well, since estimates of the current velocity of the ball can speed up the detection algorithm by limiting the search to a small region in the image and improve its accuracy by ruling out false positives. Detecting and tracking the location of a ball in a camera image is a relatively simple computer vision task. Ball tracking can be done with classic computer vision algorithms and does not require learning. The ball has a fixed geometry and a simple appearance in camera images, so a blob detection algorithm can identify it. Given detections from two or more cameras and the camera calibration parameters (intrinsics and extrinsics), the 3D location of the ball can be estimated. An advantage of using classic vision algorithms over deep neural networks is the higher computational speed, which is critical in a high-speed game like table tennis.
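
As an illustration of such a classic pipeline, the sketch below combines OpenCV blob detection with two-view triangulation to estimate the 3D ball position. It is a minimal sketch only; the detector thresholds and the camera projection matrices are assumed inputs, not parameters from this work:

    # Illustrative sketch: blob detection + two-view triangulation for ball tracking.
    # Detector thresholds and projection matrices are hypothetical placeholders.
    import cv2
    import numpy as np

    def detect_ball(gray_image):
        """Return the (u, v) pixel position of the most ball-like blob, or None."""
        params = cv2.SimpleBlobDetector_Params()
        params.filterByColor = False
        params.filterByCircularity = True
        params.minCircularity = 0.7          # the ball appears as a round blob
        params.filterByArea = True
        params.minArea = 20                  # reject small noise blobs
        detector = cv2.SimpleBlobDetector_create(params)
        keypoints = detector.detect(gray_image)
        if not keypoints:
            return None
        best = max(keypoints, key=lambda k: k.size)
        return np.array(best.pt)

    def triangulate(P1, P2, uv1, uv2):
        """Estimate the 3D ball position from two camera views.

        P1, P2 are 3x4 projection matrices (intrinsics and extrinsics combined).
        """
        pts4d = cv2.triangulatePoints(P1, P2, uv1.reshape(2, 1), uv2.reshape(2, 1))
        return (pts4d[:3] / pts4d[3]).ravel()   # convert from homogeneous coordinates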

The paddle motion state includes the paddle’s position, orientation, linear velocity, and angular velocity. When the paddle is attached to the robot, the paddle motion state can be obtained using forward kinematics. When learning from human games, the paddle motion state needs to be obtained from a motion tracking system. There are a variety of solutions that allow for tracking the paddle motion state with high accuracy. On the higher end, it is possible to use full-blown motion tracking systems to track marked and instrumented paddles. On the lower end, one can use off-the-shelf tracking devices like HTC Vive, which can provide position information with sub-millimeter accuracy and jitter. Fig. 2 shows two types of VR trackers that work with HTC Vive. In fact, this is the same hardware that is used for evaluations in this work when collecting human demonstrations in the VR environment. Since such trackers are bulky, the human players would be able to use only one side of the instrumented paddles. Lastly, a more custom tracking setup can use small IMU sensors attached to the paddles. Visual markers on the paddles can be used to correct for the sensory drift that is common with IMUs.

(a) A Vive tracker.
(b) A Vive tracker attached to a paddle.
Figure 2: Virtual Reality Trackers. The trackers allow the position and orientation of objects to be tracked with sub-millimeter accuracy. In the VR environment, these trackers make it possible to capture the paddle motions generated by human players. The same trackers, or any other motion tracking technology, can be used to track the motion of table tennis paddles in the real world. Photo credits: HTC.

As explained later in this section, the paddle motion state also represents the action space for the learning agents.

1.4.2 Using Low-Dimensional State

Rather than using raw pixels, the method proposed in this work grounds the models and policies in low-dimensional state, thereby reducing the dimensionality of the learning problems and improving sample efficiency. In addition, employing a separate component for extracting the low-dimensional ball motion state from visual inputs makes it possible to debug and fine-tune that component before integrating it into the implementation of the learning agent.

In contrast, using raw visual input as state requires carefully designing the learning environment to capture different lighting conditions, backgrounds, etc. Moreover, it would be more difficult to evaluate the ability of an end-to-end neural network in sufficiently deciphering the visual input and extracting the relevant state information.

1.4.3 Model-Based Learning

The game of table tennis, despite being a complex and fast game requiring great skill to play, has relatively simple physics compared to other tasks like robot locomotion or object grasping. In table tennis, most of the time, the ball is travelling in the air where it is only subject to gravity, drag, and Magnus forces due to its spin. The ball experiences short contacts with two types of objects: the table, and the player paddles. If the dynamics of the ball’s motion and contact are understood, it is possible to both predict the ball’s future states and to control for it by picking the right paddle motion to execute the desired contact.

The proposed method uses observations in the environment to train dynamics models that predict the future state of the ball due to its free motion and due to contact with the player paddle. Such dynamics models can inform the learning agents about the consequences of the actions they are exploring. In contrast, in end-to-end model-free learning approaches, the agents are required to implicitly learn how the environment works in order to best exploit it and increase their reward. By capturing the simple physics of table tennis in dynamics models the method allows the learning agents to focus on learning high-level behaviors, thereby improving sample efficiency.
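
To make the simplicity of these dynamics concrete, the following sketch forward-integrates the ball's free flight under gravity, drag, and a Magnus term. It is only an illustration; the drag and Magnus coefficients are placeholder values, not constants estimated in this work:

    # Illustrative sketch: integrating the ball's free flight under gravity, drag,
    # and a Magnus (spin) term. Coefficients are placeholders, not measured values.
    import numpy as np

    GRAVITY = np.array([0.0, 0.0, -9.81])    # m/s^2
    K_DRAG = 0.1                             # placeholder drag coefficient
    K_MAGNUS = 0.01                          # placeholder Magnus coefficient

    def predict_trajectory(position, velocity, spin, dt=0.002, steps=500):
        """Return predicted ball positions over `steps` timesteps of size `dt`."""
        p, v = np.array(position, float), np.array(velocity, float)
        w = np.array(spin, float)            # angular velocity vector of the ball
        trajectory = []
        for _ in range(steps):
            drag = -K_DRAG * np.linalg.norm(v) * v      # opposes motion, ~ |v| v
            magnus = K_MAGNUS * np.cross(w, v)          # perpendicular to spin and velocity
            a = GRAVITY + drag + magnus
            v = v + a * dt                              # semi-implicit Euler step
            p = p + v * dt
            trajectory.append(p.copy())
        return np.array(trajectory)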

1.4.4 Learning from Demonstrations

The observations that are needed to learn the ball motion and contact dynamics are readily available from human games. There is no need to use a robot to collect samples for training the dynamics models. Similarly, there is no need for kinesthetic teaching. The behavior of the ball and the contact forces between the ball and the table or the paddle are the same whether the paddle is carried by a robot or a human player. Contrast this with a task like locomotion. As the agent learns new gaits, it starts experiencing new joint states and new contacts with the ground, requiring any contact models to be adjusted. In table tennis, one can study the ball’s free motion and contact behavior just by observing human games in instrumented environments. While collecting robot samples is costly and time-consuming, human samples can be obtained easily and abundantly.

Players with intermediate skills may not be very precise with their movements, but they are able to put high speed and spin on the ball. Also, commercial table tennis ball launchers can produce balls with controlled velocity and spin. Observing a human player play against such a ball launcher can cover the entire space of ball motion and contact dynamics.

Intermediate players may fall short in their ability to move quickly, or to control the paddle precisely enough to execute their desired shot. That would pose a problem if we were to train policies based on the observed human actions. Instead, this method proposes to train dynamics models for the ball’s motion and contact from human demonstrations. The dynamics models are independent of the agent and stay valid as the learner’s policy changes. Also, getting the data for training them does not require operating any fragile robot setup. Just by observing human demonstrations, one can train models that can predict where the ball will go as a result of a given paddle motion at the time of contact. In the inverse direction, the dynamics models help predict how to hit a ball so that it gains the desired velocity and spin, or lands at a desired location.
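
As an illustration of how such a model could be fit, the sketch below trains a forward contact model with ordinary supervised learning on logged demonstration tuples. The feature layout and the choice of scikit-learn's MLPRegressor are assumptions for illustration, not the models used in this work:

    # Illustrative sketch: fitting a forward contact-dynamics model from logged
    # human demonstrations. Feature layout and model choice are assumptions.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def fit_contact_model(demonstrations):
        """demonstrations: list of (ball_state, paddle_state, landing_state) tuples,
        each recorded around a paddle-ball contact in human games."""
        X = np.array([np.concatenate([b, p]) for b, p, _ in demonstrations])
        y = np.array([landing for _, _, landing in demonstrations])
        model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
        model.fit(X, y)                      # forward model: (ball, paddle) -> landing
        return model

    # The inverse direction (choosing a paddle motion for a desired landing target)
    # can then be answered by searching over candidate paddle states and scoring
    # them with the forward model's predictions.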

1.4.5 General High-Level Actions

Instead of considering player actions at every timestep, like every 0.02 seconds in the simulator, the proposed method allows the agents to act in the environment using high-level actions with long time horizons. Below are the two main high-level actions and the observations behind their design:

  1. Ball landing targets: During each exchange between two players, each player’s game play can be represented by how they land the ball on the opponent’s side of the table. The effective action from each player is how they hit the ball with the paddle, and the impact of that action on the ball can be represented by the position, velocity, and spin of the ball at the moment of contact with the table. This observation makes it possible to allow high-level agents to pick ball landing targets as their desired actions.

  2. Target paddle motion states: During each strike, the impact of a player’s paddle on the ball depends only on the state of the paddle during the short period of time when the paddle and the ball are in contact. In other words, all the actions taken by the players up to the moment of contact are just in service to achieving a paddle motion state at the moment of contact with the ball.

Using such abstract actions simplifies the action space for the agents. Also, since the actions capture the agent’s behavior over multiple timesteps, they facilitate learning by eliminating the reward delay problem, where a learning agent needs to figure out which earlier action led to a reward received multiple timesteps into the future.
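
To make the two action representations concrete, here is a minimal sketch of how they could be encoded as data structures; the field names are hypothetical, not identifiers from this work:

    # Illustrative sketch of the two high-level action representations.
    # Field names are hypothetical, not identifiers from this work.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class BallLandingTarget:
        """High-level action: where and how fast the ball should land on the
        opponent's side of the table."""
        location: np.ndarray   # (x, y) landing point on the opponent's side
        speed: float           # desired ball speed at the moment of landing

    @dataclass
    class PaddleMotionTarget:
        """High-level action: the paddle motion state to realize at contact time."""
        position: np.ndarray          # 3D paddle position
        normal: np.ndarray            # paddle surface normal (orientation)
        linear_velocity: np.ndarray   # 3D linear velocity
        angular_velocity: np.ndarray  # 3D angular velocity
        contact_time: float           # when the paddle should reach this state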

1.4.6 Hierarchical Policy

Instead of approaching playing table tennis as a monolithic task, this method uses a hierarchical policy that decomposes table tennis into a hierarchy of subtasks. The hierarchical policy decouples the high-level skills from low-level skills in the game of table tennis, which makes it possible to implement each skill using a different method that is more suitable for it. Moreover, the hierarchy allows each skill to be developed, evaluated, and debugged in isolation. If necessary, the skills can be given perfect observations and perfect actions to fully evaluate their individual limits and errors. Such a setup allows for identifying and addressing inefficiencies in each component of the system before they are integrated and fine-tuned together as a whole.

In the task hierarchy, low-level skills like how to move the paddle to a desired position are implemented using analytic controllers that do not require learning. Mid-level skills like land-ball are implemented using dynamics models that are trained from human demonstrations with supervised learning. Lastly, the top-level strategy skill is trained with reinforcement learning, allowing the agent to discover novel behaviors.

In contrast, learning a task end-to-end may cause the model to relearn the primitive skills over and over in various states in the presence of changing inputs. In other words, an end-to-end approach needs to learn to properly generalize its behavior to invariant states. Doing so requires more training episodes.

As explained in Sec. 1.4.5, the action spaces used in the task hierarchy are chosen such that they do not reduce the generality of the policies. In other words, the hierarchical policy does not restrict the agent’s ability to explore the space of all possible game-play strategies and techniques. As explained later, model-based policies employing human data can be more sample-efficient, while model-free policies that directly pick paddle motion states as actions can exhibit more novel striking motions at the expense of lower sample-efficiency.

1.4.7 Analytic Paddle Control

Once the agent makes a decision about a desired paddle motion state to execute on the robot at a desired time, the desired paddle motion state can be achieved in a number of ways. For one, it could be learned as a separate task using supervised learning or reinforcement learning. However, this method develops an analytic paddle controller which uses the Reflexxes trajectory planning algorithm to execute any target paddle motion state from any starting state for the robot, provided the target paddle motion state is achievable given the motion constraints for the robot.

Employing this analytic controller removes the need for learning a paddle control skill and improves the sample-efficiency of the method.

1.4.8 Learning Strategy with Self-Play

The strategy skill allows the agent to make high-level decisions about its game plan without being concerned about how they are executed. The strategy skill is the only skill in the task hierarchy that requires exploration and uses reinforcement learning to train. By focusing the learning and exploration on this skill only, the proposed method allows the agent to discover interesting strategies, both in the model-based setup using the land-ball skill, and in the model-free setup using directly the paddle control skill.

1.5 Guide to the Reader

The remainder of this work is organized as follows. Sec. 2 describes the learning environment. Sec. 3 provides a more in-depth overview of the method than what is given in this section. Sec. 4 explains the hierarchical policy design and the subtasks in the task hierarchy. Sec. 5 describes how the learning environment is partitioned into a game space and a robot space for learning with higher sample-efficiency. Sec. 6 explains the dynamics models that allow the agents to predict the future states of the game, and evaluates the predictive ability of the trained dynamics models. Sec. 7 describes the analytic paddle controller that is responsible for executing high-level paddle actions, and describes the implementation of the positioning policy. Sec. 8 describes the implementation of the model-based striking policies and evaluates them against baseline model-free implementations that learn the striking skill from scratch. Sec. 9 uses self-play to train table tennis strategies in the context of cooperative and adversarial games. Sec. 10 provides a discussion of the work presented in this article and outlines steps for future work, including how the proposed method can handle vision and observation noise in continuous closed-loop control. Sec. 11 discusses related work. Sec. 12 concludes the article.

2 Simulation and Virtual Reality Environments

This section describes the simulation and virtual reality environment that is used for data collection, training, and evaluation of the table tennis robot agent. First, the simulator and the virtual reality environment are introduced. Then, the reinforcement learning environment and its state and action spaces are described.

2.1 The Simulator

Figure 3: Simulation Environment. Two WidowX arms are mounted on linear actuators that allow the arms to move sideways.

Fig. 3 illustrates the simulator’s setup. Two robot assemblies are at the same height as the table. The robot assembly consists of a linear actuator and a robot arm. The arm shown in the image is a WidowX arm with five joints. The original arm has a gripper, which has been removed in this setup and replaced with a fixed link holding a paddle. The arm is mounted on a linear actuator, which allows the robot to move sideways. This configuration has a wider reach compared to a stationary robot. The linear actuator is implemented by one prismatic joint. The arm and the linear actuator are treated as one robot assembly with six joints. Fusing the linear actuator and the arm together in a single assembly simplifies inverse and forward kinematics calculations.

A different version of the simulation environment contains one robot playing against a table tennis ball launcher. The ball launcher can shoot table tennis balls with controlled initial conditions (position and velocity). By varying the initial conditions of every episode, the ball launcher makes it possible to explore the space of game conditions for the learning agents. It is also used in evaluations.

The simulation environment is implemented on top of the PyBullet [5] physics engine. Simulation objects are defined by their geometries and their physics parameters including mass, coefficient of restitution (bounciness), friction coefficients, etc. The physics simulation in PyBullet is deterministic. So, there is no inherent noise in the simulation.

The physics are simulated at a fixed, high-frequency timestep. At each physics timestep, the object states and forces are recomputed and any collisions are recorded. Simulating physics at a high frequency increases the fidelity of the simulation and avoids glitches like missed collisions due to the fast motion of the table tennis ball.
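
The following is a minimal sketch of a PyBullet setup in this spirit; the URDF asset paths, timestep, and restitution values are placeholders, not the configuration used in this work:

    # Illustrative sketch of a PyBullet setup in this spirit. Asset paths, timestep,
    # and physics parameters are placeholders, not this work's configuration.
    import pybullet as p

    client = p.connect(p.DIRECT)                 # headless physics server
    p.setGravity(0, 0, -9.81)
    p.setTimeStep(1.0 / 1000.0)                  # small timestep to avoid missed collisions

    table = p.loadURDF("table.urdf")             # hypothetical asset paths
    ball = p.loadURDF("ball.urdf", basePosition=[0, -1.2, 1.0])
    p.changeDynamics(ball, -1, restitution=0.9)  # coefficient of restitution (bounciness)
    p.changeDynamics(table, -1, restitution=0.9)

    for _ in range(1000):
        p.stepSimulation()                       # advance physics by one timestep
        contacts = p.getContactPoints(ball, table)
        if contacts:
            pass                                 # record landing events here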

2.2 Virtual Reality (VR) Setup

The simulator described in Sec. 2.1 is connected to a virtual reality setup, allowing a human player to control a free-moving paddle. Using the VR setup makes it possible to create an immersive game environment where human demonstrations can be captured. The VR environment is a good proxy for capturing human demonstrations in the real world with instrumented paddles. In fact, the same trackers that are used in the VR setup can be used to instrument real table tennis paddles and track their motion. This setup for capturing the human demonstration data makes it more likely that the methodology and the results will transfer to the real world.

Figure 4: Virtual Reality Setup. A person is using the VR environment to play table tennis in the simulator.

The VR setup uses an HTC Vive headset, a controller (a.k.a. tracker), and two lighthouses. The components are shown in Fig. 4. The lighthouses continuously track the position and orientation of the player’s headset and the controller in the player’s hand. The HTC VR hardware uses active lighthouses and passive headsets and controllers. The lighthouses emit vertical and horizontal sweeping laser lights at a fixed frequency. The headset and the controller have an array of light-detecting sensors that fire whenever they receive the laser light. Since the configuration of the sensors on the headset and controller is known, the timing of the light-detection events reported by the different sensors contains enough information to determine the 3D position and orientation of each tracker. As long as a tracker is exposed to one of the two lighthouses and a sufficient number of its light sensors are visible to it, the device can be tracked with the same accuracy. So, if the paddle or the player hides the tracker from one of the lighthouses, it does not pose a problem. Fig. 2 shows two types of VR trackers that work with HTC Vive.

2.3 Learning Environment

The learning environment is implemented using the OpenAI Gym [3] API. The environment encapsulates the simulator and exposes the simulation object states as the environment state. At every timestep, the environment exposes the following information on the objects:

  • The ball motion state, which includes the ball’s 3D position and velocity vector;

  • The paddle motion state, which includes the paddle’s 3D position, orientation, linear velocity, and angular velocity;

  • The robot joint positions and velocities;

  • The most recent collision, and the ball’s position and velocity at the moment of collision.

Each learning agent defines its own observation space, which is a subset of the environment state. The action space of the environment includes the six robot joints. The simulator supports position and velocity control modes for actuating the robot assembly.

There are three frequencies operating in the environment. The simulator runs at a high frequency, allowing for smooth simulation of physics and control of the robot. The learning environment runs at a lower frequency: at every environment timestep, the environment state is updated based on the most recent physics state of the objects. Collisions that are detected between two environment timesteps are accumulated and reported together. The collisions contain no time information, so they appear to have happened at the end of the environment timestep. The high-level agents operate at an even lower frequency in the environment. They receive observations and choose actions only once during each ball exchange between the players. Running at a lower frequency makes learning easier for the agents. The high control frequency is appropriate for smooth control of the robot, but the agents do not need to make decisions at every simulation or environment timestep. The lower frequency shortens the reward delay between the time the agent makes a decision and when it observes the consequence.
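
A skeletal sketch of how such an environment could look under the OpenAI Gym API is shown below; the observation layout, bounds, and simulator interface are placeholders, not the exact spaces or implementation used in this work:

    # Skeletal sketch of a Gym-style environment in this spirit. Observation layout,
    # bounds, and the simulator interface are hypothetical placeholders.
    import gym
    import numpy as np
    from gym import spaces

    class TableTennisEnv(gym.Env):
        def __init__(self, simulator):
            self.sim = simulator
            # Example observation: ball position (3) + ball velocity (3).
            self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(6,))
            # Example action: targets for the six robot joints.
            self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(6,))

        def reset(self):
            self.sim.reset()
            return self._observe()

        def step(self, action):
            self.sim.apply_joint_targets(action)   # hypothetical simulator interface
            # Run several physics steps per environment step; collisions detected
            # in between are accumulated by the simulator.
            for _ in range(self.sim.physics_steps_per_env_step):
                self.sim.step_physics()
            reward, done = self.sim.score_exchange()
            return self._observe(), reward, done, {}

        def _observe(self):
            return np.concatenate([self.sim.ball_position(), self.sim.ball_velocity()])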

2.4 Conclusion

This section described the simulation and virtual reality environments that are used for simulating table tennis games. The next section provides an overview of the proposed method for learning to play table tennis with high sample efficiency.

3 Method Overview

This section gives an overview of the approach and its key components. It depicts a high-level picture of how the different components work together as part of the method. Sec. 3.1 discusses decomposing the task of playing table tennis into subtasks that can be learned or solved more efficiently. Sec. 3.2 describes decomposition of the environment to separate the problem of robot control from the table tennis game play. Sec. 3.3 discusses environment dynamics models that are trained from human demonstrations. Sec. 3.4 describes an analytic robot controller which can execute desired paddle motion states (pose and velocity). Sec. 3.5 discusses using self-play to learn high-level table tennis strategies for cooperative and adversarial games.

3.1 Task Decomposition

Robot table tennis is a complex task, and is therefore difficult for reinforcement learning agents to learn directly. The proposed method decomposes the task into a hierarchy of skills where higher-level skills depend on lower-level skills. The low-level skills are easy to learn. In turn, exposing the functionality of these low-level skills as primitives to higher-level skills makes those skills less complex and easy to learn as well.

The task hierarchy offers a high-level view of the game to the learning agents. Instead of continuously making decisions at every timestep, they make one decision per exchange. The agents decide only targets for the land-ball skill: given an incoming ball, find and execute a paddle trajectory to hit the ball in a way that it lands on the opponent’s side at a desired location and with a desired speed. Using land-ball as the action space for the game agents eliminates the reward delay problem, and consequently, the agents require fewer training episodes to learn. At the same time, the land-ball skill does not reduce the generality of policies. Any sequence of low-level actions can be reduced to, or represented by, the eventual landing state of the ball on the opponent’s side.

3.2 Environment Decomposition

In another dimension, the method decomposes the table tennis environment into two spaces: the game space and the robot space. The game space includes only the table, the ball, and the paddle; it deals only with the game of ping pong. The robot space includes only the robot and the paddle; it deals only with the physics of the robot and end-effector control. The only object shared between the two spaces is the table tennis paddle.

This separation makes it possible to study and model the physics of table tennis without any robot controllers or agent policies. In particular, it permits modelling the dynamics of the game just by observing humans playing table tennis with instrumented paddles. On the other hand, isolating the robot space makes it possible to focus on the problem of paddle control without any complications from the game of table tennis.

Moreover, decomposing the environment makes it easier to replace the robot with a different model, since there is no need to retrain the game models from scratch.

3.3 Dynamics Models

Learning starts with observing human games played with instrumented table tennis paddles. The human demonstrations are used to train dynamics models for the environment. These models mainly capture the physics of ball motion and paddle-ball contact dynamics. Once such dynamics models are trained, they can be used to predict the future trajectory of the ball. They can also be used to predict where a given ball will end up if it is hit with a given paddle pose and velocity.

In the other direction, such dynamics models can be used to decide how to hit an incoming ball to get a desired landing location and speed. In other words, they can be used to find the right paddle pose and velocity at the time of contact to achieve a desired land-ball target. The dynamics models reduce the land-ball task to a paddle control task with a particular pose and velocity target for the paddle. Since the paddle needs to hit the ball, the paddle target also includes a time component.

3.4 Analytic Robot Control

The task and environment decomposition simplify the control task in robot table tennis. Task decomposition reduces the game-play to accurate paddle control and environment decomposition allows the robot controller to focus only on accurate paddle control and not be concerned with the ball, table, or opponent.

Instead of relying on learning, the proposed method relies mainly on analytic control to execute paddle targets. The target paddle pose is translated to target joint positions using inverse kinematics. The target paddle velocity is likewise translated to target joint velocities using the end-effector Jacobian. The Reflexxes Motion Library is used to compute a time-optimal trajectory starting at the current joint positions and velocities and reaching the desired joint positions and velocities at the desired time of contact. Using an analytic controller instead of learning increases the sample-efficiency of the method. It also makes it possible to switch to a different robot without additional training.

In conclusion, the proposed method can produce table tennis agents following fixed policies with minimal experimentation on the robot itself. Experiments are needed only to calibrate the motion constraints of the robot, and to model imperfections in the underlying robot control stack, e.g., imperfections in the robot’s body and the PID controller.
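
The sketch below illustrates the first half of this pipeline, translating a target paddle pose and velocity into joint-space targets with PyBullet's kinematics utilities; the body and link indices are placeholders, and the Reflexxes trajectory-generation step itself is not shown:

    # Illustrative sketch: translating a target paddle pose and velocity into
    # joint-space targets. Body/link indices are placeholders; the subsequent
    # time-optimal trajectory generation (Reflexxes) is not shown here.
    import numpy as np
    import pybullet as p

    def paddle_target_to_joint_target(robot, paddle_link, target_pos, target_orn,
                                      target_lin_vel, target_ang_vel):
        # Inverse kinematics: joint positions that place the paddle at the target pose.
        q_target = p.calculateInverseKinematics(robot, paddle_link,
                                                target_pos, target_orn)
        # The Jacobian at the target configuration relates joint velocities to the
        # paddle's linear and angular velocities.
        zeros = [0.0] * len(q_target)
        jac_t, jac_r = p.calculateJacobian(robot, paddle_link, [0.0, 0.0, 0.0],
                                           list(q_target), zeros, zeros)
        J = np.vstack([np.array(jac_t), np.array(jac_r)])   # 6 x num_dofs
        twist = np.concatenate([target_lin_vel, target_ang_vel])
        dq_target = np.linalg.pinv(J) @ twist               # least-squares joint velocities
        return np.array(q_target), dq_target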

3.5 Learning Strategy with Self-Play

The proposed method uses a single strategy skill whose main job is to pick high-level targets for the land-ball skill. In a cooperative game where two agents try to keep a rally going for as long as possible, a good strategy might pick targets near the center of the opponent side of the table. In an adversarial game where the agents try to defeat the opponent, a good strategy might pick targets that make it difficult for the opponent to return the ball.

For tasks like land-ball and paddle control, it is possible to evaluate any given action and determine whether it accomplishes the task. However, when looking at the whole game of table tennis and the space of all possible strategies, it is not immediately clear what action is more likely to help an agent win a point. It is exactly this part of the policy that benefits the most from reinforcement learning and evolutionary strategies and their ability to discover novel solutions.

This skill is trained with a self-play setup involving two robots. In the first self-play level, the agent plays against a fixed policy. In every subsequent level, the agent plays against a frozen copy of its most recent policy.
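
A schematic sketch of this level-based self-play loop is shown below; the trainer and policy interfaces and the episode counts are hypothetical:

    # Schematic sketch of the level-based self-play loop described above.
    # The trainer/policy interfaces and episode counts are hypothetical.
    import copy

    def train_strategy_with_self_play(make_trainer, fixed_opponent, num_levels,
                                      episodes_per_level):
        trainer = make_trainer()            # e.g. a PPO trainer over the strategy skill
        opponent = fixed_opponent           # level 0: play against a fixed policy
        for level in range(num_levels):
            for _ in range(episodes_per_level):
                trainer.run_episode(opponent)   # collect experience and update policy
            # Next level: play against a frozen copy of the most recent policy.
            opponent = copy.deepcopy(trainer.current_policy())
            opponent.freeze()
        return trainer.current_policy()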

The strategy skill is the only skill in the hierarchy that requires every component to be in place. It requires all task controllers to be available. It also works across the decomposition in the environment, as it engages both the robot and game spaces. Despite these complex interactions, the strategy skill remains relatively simple, since its observation and action spaces are low-dimensional. Therefore, training the skill requires far fewer episodes compared to training an end-to-end agent.

A key challenge in learning with self-play is maintaining efficient exploration. Often, with self-play the learning agent converges to a narrow policy. Since the proposed method uses self-play only for training the strategy, the exploration problem remains confined to the strategy level. A failure to fully explore the space of all strategies does not reduce the coverage of the dynamics models over the space of game physics, since they are learned independently. Also, a failure in exploration does not affect the analytic robot controller. In contrast, in an end-to-end setup a failure in exploration at the highest level may restrict the predictive ability of the underlying components as well.

The strategy skill also accounts for the imperfections in the underlying skills. The land-ball skill is not perfect. It misses the target by some error distance based on predictive accuracy of the dynamics models and the precision of the robot controller. During training, the strategy skill implicitly observes the accuracy of the underlying skills through its reward signal and learns to choose better targets for them. One common problem with task decomposition is that errors can accumulate through the task hierarchy. However, since the strategy skill does not have an explicit target in the proposed method, it leaves room for the learning agent to compensate for the imperfections in the underlying skills.

3.6 Conclusion

To increase sample-efficiency in learning, the method decomposes both the task space and the environment space. The decomposition makes it possible to learn the skills one at a time and model each environment space in isolation. Therefore, decomposition increases the sample efficiency. At the same time, the decomposition is done in a manner not to reduce from the generality of the solutions. The following sections describe each component of the method in detail.

4 Task Decomposition

This section explains the different tasks that make up the control hierarchy in the proposed method. The section starts with an overview of the task hierarchy. It then lists example inputs and outputs for some of the skills. Then, it discusses each task/skill in more detail. A high-level overview of the skill implementations is also provided. The terms task, subtask, and skill are used interchangeably. Often, skill is used when learning is required and task is used when an algorithmic controller is used.

The main advantage of task decomposition is that it creates subtasks that can be implemented with different mechanisms. For subtasks that use some form of learning, decomposition improves sample-efficiency. Also, the training and debugging process is more manageable when focusing on one skill at a time. Decomposing the task makes it easier to develop and evaluate each component before it is integrated into the whole system.

4.1 Skill Hierarchy

The proposed skill hierarchy is shown in Fig. 5. The hierarchy consists of three levels of control (high-level, mid-level, and low-level), two modes (striking and waiting), and seven tasks. At the top, there is the strategy skill, which just picks targets for the skills in the middle layer. Each skill in the hierarchy depends on the skills below it and uses the lower-level skills to accomplish its task. Except for the strategy skill, all other skills have parameterized targets. The striking and positioning skills provide an abstraction over the agent’s plan for a single exchange with the opponent. The strategy skill, on the other hand, provides an abstraction over the agent’s entire game plan by picking different striking and positioning targets over the course of multiple exchanges as the game goes on.

Figure 5: The Skill Hierarchy. The hierarchy consists of three levels of control (high-level, mid-level, and low-level), two modes (striking and waiting), and seven tasks. At any point in the game, the agent is either in striking mode or waiting mode. When in striking mode, the agent strikes the ball using one of three different skills: land-ball, hit-ball, or directly with paddle control. Each variant of the policy uses only one of the three striking skills. Decomposing the robot table tennis task into smaller tasks makes learning easier and more sample-efficient. The green strategy node is learned with reinforcement learning. The purple nodes have algorithmic policy implementations using dynamics models trained from human demonstrations. The blue nodes are implemented by an analytic robot controller.

At any point during the game, the agent is either in striking mode or waiting mode. In striking mode, the agent uses a striking skill to hit the ball toward the opponent. The agent can strike the ball using land-ball, hit-ball, or directly using the paddle-control skill. Fig. 5 shows the three variants of the policy with alternative striking skills in the same image. In waiting mode, the agent picks a position for itself in anticipation of the opponent’s next shot. This position is specified by a desired paddle pose and executed using the positioning skill. The agent’s mode changes automatically based on the current state of the game. As soon as the opponent hits the ball, the agent enters striking mode. The agent stays in striking mode until it hits the ball back. At that moment, it enters waiting mode.

As discussed in subsequent sections, the strategy skill is implemented by a PPO policy trained through self-play (Sec. 9). The striking skills land-ball and hit-ball are implemented by algorithmic policies (Sec. 8) which employ game dynamics models trained from human demonstrations (Sec. 6). The other four skills in the hierarchy are implemented using analytic control (Sec. 7). The remainder of this section defines each skill in detail.

4.2 Strategy

The strategy skill allows the agent to make high-level decisions about its game plan without being concerned about how they are executed. Depending on the state of the game, the strategy skill either picks a land-ball target or a positioning target and passes that target to one of its underlying skills. More specifically, the strategy skill is described by the policy

\pi_{strategy}(b) \rightarrow \pi_{strike}(b, \cdot) \ \text{or} \ \pi_{position}(p, \sigma),    (1)

where \pi_{strategy} denotes the strategy policy, b denotes the current ball motion state, \pi_{strike} denotes one of the three striking policies, \pi_{position} denotes the positioning policy, p denotes a target position for the paddle, and \sigma denotes the sign of one component of the paddle normal vector, specifying a forehand or backhand pose. The positioning policy is discussed in more detail in Sec. 4.6. (In the current implementation, the strategy and striking skills receive only the current ball motion state as the observation. Experiments have shown that adding other observations, like the agent’s paddle position or the opponent’s paddle position, does not improve performance in the current implementation. In a setup where the agent can make more than one decision per exchange, such additional observations would likely be useful.)

The strategy skill separates the tactical dimension of game play from the technique needed to execute desired movements on the robot. By providing an abstraction over the game plan, it allows the agent to explore the space of game play strategies while reusing the technical know-how captured in the underlying skills. The actions selected by the strategy skill stay active during one ball exchange, which lasts for a relatively long time interval (about 70-100 environment timesteps). So, it shortens the reward delay between the time the agent makes a decision and when it observes the consequence.

4.3 Striking Skill

During an exchange, when the agent is expected to return the ball coming from the opponent, it should execute some action to make contact with the ball and have it land on the opponent’s side successfully. The proposed method offers three variants of the striking policy which achieve this goal with different approaches. Each learning agent uses only one variant of the hierarchical policy, so for any given agent only one of the three striking policies is available. The three variants differ in the number of dynamics models they use in their implementation (ranging from three to zero) and are used in experiments to evaluate the impact of dynamics models on learning sample efficiency. The striking skill is specified by one of the three policies

\pi_{land}(b, g), \quad \pi_{hit}(b, y_p, n, v, \omega), \quad \text{or} \quad \pi_{paddle}(m, t),    (2)

where \pi_{land} denotes the land-ball policy, \pi_{hit} denotes the hit-ball policy, \pi_{paddle} denotes the paddle-control policy, b denotes the current ball motion state, g denotes a ball landing target, y_p denotes the coordinate of the paddle along the length of the table at the time of contact (which maps to the desired distance from the net), n, v, \omega denote the normal, linear velocity, and angular velocity of the paddle, t denotes the time of contact between the paddle and the ball, and m denotes the full paddle motion state.

Each of the three striking skills has a different approach to returning the ball. The land-ball skill parameterizes the strike by a desired landing target for the ball on the opponent’s side of the table. Hit-ball parameterizes the strike by a desired paddle orientation and velocity at the time of contact with the ball. The hit-ball skill helps the agent make contact by automatically computing the required paddle position based on the predicted ball trajectory. Striking directly using the paddle control skill requires fully specifying a target paddle motion state at the time of contact. As described in Sec. 8, land-ball requires three dynamics models to work, while hit-ball requires only one dynamics model. The paddle-control skill does not use any trainable dynamics models. So the three alternative striking skills are used to evaluate the impact of using dynamics models on sample-efficiency of learning the different skills.

Sec. 4.4 and Sec. 4.5 describe the land-ball and hit-ball striking skills. Striking with direct paddle control is covered in Sec. 4.7.

4.4 Land-Ball

The objective of the land-ball skill is to execute the necessary actions to hit back the opponent’s ball in a way that it eventually lands at the desired target g. The target specifies a target location l and a target landing speed s over the opponent’s side of the table:

g = (l, s).    (3)

The target landing location l specifies the desired position of the ball at the moment of landing:

l = b^p_{t_l},    (4)

where b^p_{t_l} denotes the ball position at the moment of landing. Note that there is no constraint on the landing time; the subscript t_l here denotes some unspecified landing time. The target landing speed s specifies the desired magnitude of the ball’s velocity vector at the moment of landing.

In the current implementation the landing target does not specify a desired landing angle for the ball. Since the trajectory of the landing ball is constrained at two points (the point where the paddle strikes the ball and the landing location l), the landing speed specifies the landing angle to some extent, as faster balls tend to have lower landing angles. However, the implementation can be extended to also include a target landing angle. In environments where the ball’s spin affects its motion, topspin would increase the landing angle while backspin would decrease it. In such environments a desired landing angle can further constrain the desired ball motion.

An example land-ball target specifies a two-dimensional landing point on the opponent’s side of the table together with a landing speed. In a coordinate system where the center of the table is at the origin, the two-dimensional target specifies a point at some distance behind the net and to the left of the center divider on the opponent’s side. The z coordinate of the target is always equal to 0.78 m, which is the height of a standard table, 0.76 m, plus the radius of a table tennis ball, 0.02 m.

The land-ball skill is described by the policy

\pi_{land}(b, g) \rightarrow \pi_{paddle}(m_c, t_c),    (5)

where \pi_{paddle} denotes the paddle control policy, t_c denotes the planned paddle-ball contact time picked by the land-ball skill, and m_c denotes the desired paddle motion state at time t_c, also picked by the land-ball skill.

The implementation of the land-ball policy is described in detail in Sec. 8. Given the incoming ball’s motion state b, the land-ball policy predicts the ball’s future trajectory and plans to make contact with it at some time t_c. The land-ball policy chooses a target motion state for the paddle at the time of contact. To ensure contact with the ball, the target position for the paddle is always chosen to be equal to the predicted position of the ball at that time. The land-ball policy also chooses the paddle’s desired orientation, linear velocity, and angular velocity at the time of contact. To accomplish its requested goal, the land-ball policy should pick the paddle orientation and velocity such that hitting the ball with that paddle motion state sends the ball to the landing target g. The target contact time t_c and the target paddle motion state m_c computed by the land-ball skill are passed as a high-level action to the paddle control skill.

The land-ball skill provides a high-level abstraction over the agent’s game play during a single exchange with the opponent. This high-level abstraction does not cause a reduction in generality of behavior. Barring deceptive movements to hide the agent’s intention from the opponent, any sequence of paddle actions can be reduced to the resulting landing state for the ball. (A fully-specified landing state should capture the ball’s 3D velocity and spin at the moment of landing. Since the simulator used in this work does not model spin, the implementation uses a simplified representation for the landing state. However, the land-ball skill can be extended to accept more expressive landing targets.) In other words, the specification of the land-ball skill makes it possible to specify the agent’s behavior by its desired outcome. Learning to use the land-ball skill is easier for the agents, as its action space has fewer dimensions than a fully-specified paddle motion state, yet it can still specify complex behaviors. Furthermore, land-ball’s action space maps to the geometry of the world and allows for exploiting the symmetries and invariances that exist in the domain (Sec. 6.3).
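
The following schematic sketch mirrors the decision process described above; the trajectory-model and inverse-contact-model interfaces, as well as the contact-point heuristic, are hypothetical placeholders for the trained dynamics models described in Sec. 6 and Sec. 8:

    # Schematic sketch of the land-ball decision process described above.
    # Model interfaces and the contact-point heuristic are hypothetical placeholders.
    def choose_contact_point(predicted, contact_plane_y=-1.3):
        """Pick the first predicted sample past a (hypothetical) contact plane."""
        for t, ball in predicted:
            if ball.position[1] <= contact_plane_y:
                return t, ball
        return predicted[-1]

    def land_ball_policy(ball_state, landing_target,
                         trajectory_model, inverse_contact_model):
        # 1. Predict the incoming ball's future trajectory from its current state.
        predicted = trajectory_model.predict(ball_state)   # list of (time, ball_state)

        # 2. Plan to make contact at some point along the predicted trajectory.
        contact_time, ball_at_contact = choose_contact_point(predicted)

        # 3. The paddle position must match the predicted ball position at contact;
        #    the inverse model picks an orientation and velocities that send the
        #    ball to the requested landing target.
        paddle_orientation, paddle_velocities = inverse_contact_model.solve(
            ball_at_contact, landing_target)

        paddle_target = (ball_at_contact.position, paddle_orientation,
                         paddle_velocities)
        # 4. Hand off to the paddle-control skill as a high-level action.
        return paddle_target, contact_time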

4.5 Hit-Ball

The hit-ball skill is an alternative striking skill. Rather than specifying the strike by how the ball should make contact with the table, the hit-ball skill specifies the strike by how the paddle should make contact with the ball. Unlike land-ball, which aims for a particular point on the opponent’s side, hit-ball has no specific target for where the ball lands. The hit-ball skill is described by the policy

\pi_{hit}(b, y_p, n, v, \omega) \rightarrow \pi_{paddle}(m_c, t_c).    (6)

The hit-ball skill helps the agent make contact with the ball by computing the time of contact and the target paddle position based on its inputs and the predicted ball trajectory. It expects the agent to provide the other contact parameters, i.e., the orientation of the paddle encoded by the paddle normal n and its linear and angular velocities v and \omega. An example hit-ball target requests that contact be made with the ball at a specified distance in front of the net (i.e., at a fixed distance in front of the robot), with a given paddle normal and given linear and angular velocities.

The implementation of the hit-ball policy is described in detail in Sec. 8. It uses a model to predict the ball’s future trajectory. It then considers the intersection of the predicted ball trajectory with the desired contact plane, which is an input parameter to the hit-ball skill. The point in the predicted trajectory that is closest to this plane is picked as the desired contact point. This point determines the contact time and the full position of the paddle at the time of contact. The other attributes of the paddle’s motion state are given as inputs to the hit-ball skill. Together with the computed paddle position, they fully specify the paddle’s motion state m_c, which is passed as a high-level action to the paddle control skill.
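
A minimal sketch of this contact-point selection is shown below; the representation of the predicted trajectory as (time, position) samples and the choice of the y axis as the distance from the net are assumptions for illustration:

    # Sketch of the contact-point selection described above. The predicted
    # trajectory is a hypothetical list of (time, position) samples.
    import numpy as np

    def select_contact_point(predicted_trajectory, contact_plane_y):
        """Pick the predicted ball position closest to the requested contact plane.

        predicted_trajectory: list of (time, position) with position a 3-vector;
        contact_plane_y: requested paddle distance from the net along the table.
        """
        best_time, best_pos = min(
            predicted_trajectory,
            key=lambda sample: abs(sample[1][1] - contact_plane_y))
        # The paddle position at contact is the predicted ball position itself;
        # the caller supplies the paddle normal and velocities to complete the
        # paddle motion state.
        return best_time, np.asarray(best_pos)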

Unlike the land-ball skill, which can only execute strikes demonstrated by humans, the hit-ball skill can be used to execute arbitrary paddle strikes. So a strategy trained over the hit-ball skill can better explore the space of paddle strikes. At the same time, as the experiments show, learning with the hit-ball skill is less sample-efficient and requires more training time.

4.6 Positioning

The positioning skill is in effect in waiting mode – when the ball is moving toward the opponent and the agent is preparing to respond to the next ball from the opponent. The objective of this skill is to move the robot to the requested position as quickly as possible. Instead of specifying the requested position with the full robot pose, the skill accepts a target paddle position. This formulation makes the action space of the positioning skill independent of the robot and more likely to transfer to new configurations and new robots. The positioning skill is described by the policy

(7)

where denotes some paddle state that satisfies the requested paddle position and normal direction indicated by . Unlike the land-ball skill, which requests a specific time target for its desired paddle state, the positioning skill does not specify a time. In this case, the paddle skill is expected to achieve the desired paddle state as fast as possible. In other words, it is expected to find some minimum time and achieve the target paddle state at that time. An example positioning target can be:

The positioning skill maps the requested paddle position to some robot pose that satisfies it. In the current implementation, this skill requests a target velocity of zero for the joints at their target positions. However, a more complex implementation could arrange for non-zero joint velocities at the target to reduce the expected reaction time to the next ball coming from the opponent.
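A compact sketch of the positioning skill as described above; the paddle_control callable and its keyword arguments are hypothetical, and the zero-velocity, no-deadline request mirrors the behavior of the current implementation.

```python
def positioning(paddle_position, paddle_normal, paddle_control):
    """Waiting-mode skill: move the paddle to a ready position (sketch).

    No contact time is given, so the paddle-control skill is expected to
    reach the target as fast as possible, with zero velocity at the target.
    """
    return paddle_control(position=paddle_position,
                          normal=paddle_normal,
                          lin_velocity=(0.0, 0.0, 0.0),
                          ang_velocity=(0.0, 0.0, 0.0),
                          target_time=None)   # None = minimum-time trajectory
```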

4.7 Paddle Control

The objective of the paddle control skill is to bring the paddle from its current state to a desired target state at time . This skill is invoked both by the land-ball skill and the positioning skill. The paddle control skill is described by the policy

(8)

where denotes the trajectory planning skill, denotes the target joint positions, denotes the target joint velocities, denotes the current joint positions, denotes the current joint velocities. An example paddle target can be:


where denotes the target paddle position, denotes the target paddle surface normal, denotes the target paddle linear velocity, and denotes the target paddle angular velocity. The paddle’s angular velocity at the time of contact can be used to better control the spin on the ball. Although the PyBullet simulator does not support ball spin, controlling the angular velocity is useful in real environments.

The paddle control skill chooses some paddle orientation to satisfy the requested surface normal. An example paddle orientation can be:


where is a four-dimensional quaternion representing the paddle orientation and are unit vectors representing the three Cartesian axes.

As discussed in Sec. 7, the policy for the paddle control skill can be implemented analytically, thereby reducing the need for training and increasing the sample-efficiency of the method. The analytic controller uses inverse kinematics and the robot Jacobian to translate the target paddle motion state into a set of joint positions and velocities such that achieving those joint states brings the paddle to the target state. It then passes the target joint positions and velocities to the trajectory planning skill. An example target joint state for a 6-DOF robot assembly can be:


where the first DOF corresponds to the prismatic joint at the base of the robot assembly, and the next five DOFs correspond to the revolute joints in the arm.

The paddle control skill provides a high-level abstraction over robot control for table tennis. It allows higher-level policies to specify their desired robot actions by just specifying the impact of those actions on the paddle at a particular point in time. At the same time, this high-level abstraction does not cause a reduction in generality of behavior. Barring deceptive movements, any sequence of joint commands can be reduced to and represented by the final paddle motion state at the time of contact. It is only during the short contact time that the state of the robot affects its impact on the ball. Lastly, since the paddle control skill works with long-term actions lasting throughout one ball exchange, it allows the agent to observe the reward for its actions with little delay and learn more easily.

4.8 Joint Trajectory Planning

The trajectory planning skill is responsible for bringing the joints from their current positions and velocities and to their target positions and velocities and at the requested time , while observing the motion constraints on each joint. By doing so, it brings the paddle from its current state to the target state . The trajectory planning skill is used only by the paddle control skill and is described by the policy

(9)

where denotes the robot control skill, denote the planned joint positions, velocities, accelerations, and jerks at times , the motion constraints specify the minimum joint positions, velocities, accelerations, and jerks, and specify the maximum joint positions, velocities, accelerations, and jerks.

The trajectory planning task receives long-term targets in the future. In turn, it generates a series of step-by-step setpoints for the robot to follow on every timestep, starting with the current time and leading up to the target time. Each trajectory setpoint specifies target joint positions and velocities. In the implementation, setpoints are computed at 1 kHz, so for the example given in Sec. 4.7 the computed trajectory would contain 720 points over 0.72 s.

The motion constraints depend on the robot’s physical properties (mass, maximum motor torques, etc.) and would be different for each robot. The position limits depend on the configuration of the robot joints and links. The velocity and acceleration limits depend on the motors and their loads. Jerk limits are usually imposed to limit the wear and tear on the robot. The velocity and acceleration limits can be determined from the robot data sheets, or measured empirically. As explained in Sec. 7, trajectory planning is implemented using the Reflexxes Motion Library [10], which can evaluate the feasibility of trajectories and return the next step in under . Reflexxes assumes the motion constraints to be constant and independent of the current joint velocities. In reality, motors have limit profiles that vary based on their current load. Still, the limits can be averaged and tuned for the duration of typical motions in table tennis, which last about 0.5 s.
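The sketch below illustrates time-parameterized joint trajectory generation with limit checking. It is a simplified stand-in for Reflexxes, not the library itself: it fits a quintic polynomial per joint with zero boundary accelerations and ignores jerk limits, which Reflexxes additionally enforces, and it does not handle infeasible requests gracefully.

```python
import numpy as np

def quintic_trajectory(q0, dq0, qT, dqT, T, dt=0.001, v_max=None, a_max=None):
    """Per-joint quintic trajectory from (q0, dq0) to (qT, dqT) in exactly T seconds.

    Boundary accelerations are assumed to be zero.  Returns (times, positions,
    velocities) sampled every `dt`, or None if velocity/acceleration limits
    (given per joint or as scalars) are violated anywhere along the trajectory.
    """
    q0, dq0, qT, dqT = map(np.atleast_1d, (q0, dq0, qT, dqT))
    times = np.arange(0.0, T + 1e-9, dt)
    pos, vel, acc = [], [], []
    for j in range(len(q0)):
        # Solve for the cubic/quartic/quintic coefficients given the
        # boundary conditions at t=0 and t=T.
        A = np.array([[T**3,   T**4,    T**5],
                      [3*T**2, 4*T**3,  5*T**4],
                      [6*T,    12*T**2, 20*T**3]])
        b = np.array([qT[j] - q0[j] - dq0[j]*T, dqT[j] - dq0[j], 0.0])
        a3, a4, a5 = np.linalg.solve(A, b)
        t = times
        pos.append(q0[j] + dq0[j]*t + a3*t**3 + a4*t**4 + a5*t**5)
        vel.append(dq0[j] + 3*a3*t**2 + 4*a4*t**3 + 5*a5*t**4)
        acc.append(6*a3*t + 12*a4*t**2 + 20*a5*t**3)
    pos, vel, acc = map(np.stack, (pos, vel, acc))
    if v_max is not None and np.any(np.abs(vel) > np.atleast_1d(v_max)[:, None]):
        return None   # infeasible under the velocity limits
    if a_max is not None and np.any(np.abs(acc) > np.atleast_1d(a_max)[:, None]):
        return None   # infeasible under the acceleration limits
    return times, pos, vel
```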

Once the trajectory is computed, it is sent to the robot controller to execute.

4.9 Joint Control

The objective of the joint control task is to execute the necessary control actions on the robot to follow the next setpoint in the trajectory. Typically, this skill is already realized by an existing PID controller or an inverse dynamics controller on the robot; in that case, the joint control task simply represents that existing low-level controller. The task is described by the policy

(10)

where denotes the control action at time .

At each point in time, the controller observes the current joint states , and decides the joint actions or torques to best satisfy the requested next joint states at the next timestep .
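As a concrete example of this low-level layer, the following is a minimal per-joint PID step. Real robots usually ship with such a controller; the gains and the interpretation of the output as a torque command are placeholders.

```python
class JointPID:
    """Minimal per-joint PID sketch for tracking trajectory setpoints."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, q_desired, q_measured):
        # Position error against the next setpoint in the trajectory.
        error = q_desired - q_measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # Motor command (e.g. torque) to apply for this timestep.
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```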

4.10 Conclusion

The skill hierarchy breaks down the task of playing table tennis into a hierarchy of simpler subtasks that depend on each other. This breakdown makes it possible to focus on each skill individually and develop and evaluate them separately. It helps isolate the source of problems and reduces the effort needed for debugging. Also, learning subtasks can be done with fewer training samples than learning the whole task at once.

The skill definitions do not impose any restriction on how they should be implemented. In order to achieve higher sample-efficiency, the approach proposed in this work uses reinforcement learning and supervised learning strategically. The robot control skills and the positioning skill are implemented using an analytic robot controller that does not require learning. The striking skills are implemented by algorithmic policies that use dynamics models trained from human demonstrations using supervised learning. Human games provide observations on the physics of how the ball moves, and how it is affected by contact with the table or player paddles. Using the demonstrations to train dynamics models instead of policies has the advantage that it allows the agent to exhibit different game-play strategies beyond what is demonstrated by humans. Since the strategy requires complex reasoning about opponents and is difficult to learn in a model-based way, model-free reinforcement learning is used to learn the strategy skill.

The next section discusses how decomposing the environment simplifies the implementation of skills and improves the method’s sample efficiency. The skill implementations are discussed in future sections. Sec. 7 discusses an analytic controller that implements paddle control, joint trajectory planning, and joint control skills, plus the positioning skill. The striking skills land-ball and hit-ball are discussed in Sec. 8. Finally, Sec. 9 discusses how model-free reinforcement learning is employed to discover creative table tennis strategies in cooperative and adversarial games using the skills described in this section.

5 Environment Decomposition

The skill hierarchy offers a decomposition of the problem on the behavioral axis. On a different axis, the proposed method decomposes the robot table tennis environment into two spaces: the game space and the robot space. The environment decomposition is shown in Fig. 6. The game space is concerned with the physics of the table tennis game involving a table, ball, and paddles, independently of how the paddles are actuated. The robot space is concerned with the control of a robot that has a table tennis paddle attached to the end of its arm. The robot space does not deal with the table tennis ball or the table. The paddle is the only object shared between the robot space and the game space. Also, note that the game space and its constituent objects are exactly the same in robot games and human games, which makes it possible to learn the game dynamics from human demonstrations and use them for robot games.

Figure 6: Decomposition of the Robot Table Tennis Environment. The environment is divided into two spaces: the robot space and the game space. The robot space deals with the physics and control of the robot. The game space deals with the dynamics of table tennis. The only shared object between the two spaces is the table tennis paddle. The decomposition makes it possible to learn the dynamics of table tennis from human demonstrations without using a robot. On the other hand it simplifies the robot control problem to just accurately controlling the paddle.

5.1 Game Space

Fig. 7 shows the game space and visualizes some of the variables which are used in the definition of the skills in Sec. 4. The game space contains only the table, paddle, and ball.

Figure 7: The Game Space of the Environment.

5.2 Robot Space

The robot space deals with the control of the paddle attached to the robot’s arm. Fig. 8 shows the robot space. The only entities that exist in this space are the links and the joints of the robot assembly and the paddle. The robot space is not concerned with any of the game objects, including the ball, the table, or any opponent. The main task in the robot space is paddle control, which requires taking the paddle from its current state to a target state at time .

Figure 8: The Robot Space in the Environment.

5.3 Separating Physics of the Robot from the Physics of the Game

The key advantage of decomposing the environment is that it allows for separating the physics of table tennis from the physics of robot control. This decomposition makes it possible to create models of the interactions between the paddle, the ball, and the table independently of how the robot agent acts in and manipulates the environment.

Since the table tennis space does not include the robot, it is possible to train dynamics models for it just by observing human games or human practice sessions against a ball launcher. Such models capture the physics of table tennis and can be used to predict the future states of the game given past observations. As shown later in Sec. 8, these models are used to create a functioning table tennis agent without using any training episodes.

Furthermore, this decomposition simplifies the problem of robot control, as the robot space does not contain any elements of the game besides the paddle that is attached to the robot. When attached to the end of the robot arm, the paddle becomes the robot end-effector. Control of an arm end-effector is a well-studied problem in robotics, so the proposed method uses classic control methods to develop an analytic controller for moving the paddle. In contrast, when the robot is part of the table tennis environment, robot control becomes embedded in the agent’s behavior. In such a monolithic environment, the table tennis agent would need to solve the robot control problems as well. For example, if the agent controls the robot in joint space, it would need to implicitly solve the inverse kinematics problem to be able to interact with objects in the environment. However, inverse kinematics has an analytic solution and does not require training samples. Using an analytic controller where possible decreases the complexity of the agent and increases sample-efficiency, as no training samples are spent on learning the internal dynamics of the robot.

5.4 One Game, Different Robots

An advantage of decomposing the environment into the game and robot spaces is that the robot can be replaced with a completely different type of robot without affecting the dynamics models that are trained on the game space. All that is needed to get the new robot to play table tennis is to supply a new robot controller that knows how to drive the new robot’s end-effector properly. As long as the new controller can move the paddle attached to the robot as demanded, the new agent can continue to use the existing game dynamics models from the previous setup. The game strategy learned on an old robot might transfer to some extent to a new robot as well. However, since different robots have different reaches and different motion constraints, the strategy skill would need to be retrained to best exploit the physics of the new robot.

5.5 Reduction in Dimensionality

Another benefit to decomposing the environment is that it lowers the dimensionality of the inputs to the models and policies. In a monolithic environment, a typical choice for the observation space is to include the joint states in addition to the state of the world objects. Consider the problem of predicting the trajectory of the incoming ball. The future states of the ball only depend on the state of the ball and its impact with the table. Yet, in a monolithic environment, the joint states are always part of the observations. Similarly, the state of the robot paddle depends only on the joint actions, and is independent of the ball’s position. On the other hand, when learning to land the ball on the opponent’s table, the agent needs to line up the paddle with the ball. For this task, the agent needs to take into account both the state of the paddle and the state of the ball.

In a monolithic environment, more samples or more training episodes are required for the agent to learn when the dimensions are independent and when they interact with each other.

5.6 Interaction with Task Decomposition

In the proposed method, environment decomposition is used in combination with the task decomposition in the skill hierarchy. In the skill hierarchy in Fig. 5, the strategy and land-ball skills interact mainly with the game space of the environment. The positioning skill, the paddle control skill and the skills below it interact only with the robot space of the environment. However, the usefulness of decomposing the environment does not depend on using task decomposition as well. Even with an end-to-end policy which treats robot table tennis as one monolithic task, decomposition of the environment into game and robot spaces can facilitate learning.

In a model-based setup, decomposing the environment allows the agent to learn something about the robot and something about the game from each training episode. Regardless of how good the current policy is, each training episode provides the agent with a sample from which it can learn how the robot’s body responds to the chosen actions. At the same time, the episode provides the agent with observations showing the dynamics of interactions between objects in the table tennis environment.

As an example, consider the task of producing striking motions with the goal of making contact with the ball. In a monolithic environment, as the policy improves and the agent learns to hit the ball, the distribution of the inputs to the dynamics models may change considerably. In a decomposed environment, an agent observing the impact of its joint commands on the motion of the robot paddle will observe the same outcome regardless of whether the strike succeeds in making contact with the ball or not. Similarly, the agent observes the physics of the ball motion and the interaction between objects whether contact is made or not. The experience collected before the agent’s policy works well transfers fully to the stage after it learns to make contact.

5.7 Conclusion

Separating the environment and robot models creates a setup where the table tennis dynamics models can be trained independently from the robot’s control mechanism. This separation helps increase sample efficiency during training, as any shortcomings in the robot control stack would not limit the space of observations for the game dynamics models. Similarly, in the other direction, any shortcomings in the agent’s table tennis strategy would not hinder an exploration in the space of the robot’s end-effector behavior. The next section describes the dynamics models that make predictions over the game space of the environment.

6 Learning Dynamics Models from Human Demonstrations

This section describes the dynamics models that can make predictions over future states in the game space in the table tennis environment. Sec. 6.1 describes some of the design choices behind implementing dynamics models as neural networks. Sec. 6.2 describes the ball-trajectory and landing prediction models. Sec. 6.3 describes the normalization process for reducing the number of dimensions for the dynamics models and thereby reducing the number of samples needed for training them. Sec. 6.4 describes the process for collecting training samples in the VR environment. Sec. 6.5 evaluates the trained dynamics models.

6.1 Learning Dynamics with Neural Networks

This section discusses why in this work human demonstrations are used to train dynamics models and why the dynamics models are implemented as neural networks, as opposed to physics models or a combination of the two.

6.1.1 Learning Dynamics Instead of Policy

Human demonstrations can be used to teach a learning agent about optimal actions to choose. In other words, the demonstrations can be used to directly train a policy. In this work, the human demonstrations are used only to train dynamics models of the game space of the environment. This has the advantage that the dynamics models transfer to different policies. Also, not relying on the human policies allows the strategy agent to come up with novel approaches to playing table tennis different from what was tried by the humans.

6.1.2 Using Physics vs. Neural Networks

A good approach to implementing the dynamics models is to use a combination of physics models and neural networks. The physics models would include parameters for gravity, coefficients of friction, restitution, the drag and Magnus forces acting on the ball, etc. The values for these parameters can be obtained by computing a best fit over observations from the training data. Such physics models can produce an approximation over the future states of the objects. Then, neural networks can be trained to predict only a residual over the predictions from the physics models. Using neural networks in combination with physics models is expected to increase the sample efficiency.

However, since the proposed method is evaluated only in simulation, all dynamics models are implemented using neural networks and no physics models are used. The reason behind this choice is to keep the task challenging and avoid modeling the known physics implementation of the simulator. The physics simulations in Bullet are deterministic. If the right parameters are included in the physics models, their values could be discovered rather easily. By relying only on neural networks, the experiments evaluate whether the dynamics models are able to model physical interactions over long time horizons without using any priors on Newtonian physics.

6.2 Dynamics Models

The models are used to make two types of predictions:

  1. Future trajectory of the current ball: Once the opponent hits the ball, the learning agent predicts the future trajectory of the incoming ball and picks a desired point and time of contact for hitting the ball back.

  2. Landing location and speed resulting from a paddle action: The learning agent uses landing predictions to choose among potential paddle actions to best satisfy the requested landing target. The landing model predicts the eventual landing location/speed of the ball on the opponent side given the pre-contact ball and paddle motion states.

The following sections go over each dynamics model in detail.

6.2.1 Ball-Trajectory Prediction Model

The ball trajectory prediction model receives the estimate on the current state of the ball and predicts a sequence of ball positions and velocities at future timesteps. The model produces a full trajectory to give the agent multiple options for picking a contact point for hitting the ball with the paddle. The model is described by the function

(11)

where denotes the ball trajectory prediction model, denotes the estimate on the current ball motion state, and denote predictions on the future ball motion state over the next timesteps.

Figure 9: Ball-Trajectory Prediction Model. Given an estimate on the current position and velocity of the ball, predicts the future position and velocity of the ball over multiple timesteps. The sequence model has two LSTM layers followed by a fully-connected layer. The figure shows the model unrolled through time for better visualization. At all timesteps, the model receives the same input , which represents the current ball position and velocity.

The model’s network architecture is shown in Fig. 9. Although it is a sequence model, it receives the same input at every timestep. The model outputs form a complete trajectory for the ball’s future position and velocity over multiple timesteps.

Given the training data, the model learns to predict the physical forces that operate on the ball. In the simulator, the free motion of the ball in the air is affected by its velocity, gravity, and a drag force that is proportional to the velocity of the ball. The Bullet simulator does not simulate ball spin or the Magnus forces that would cause the ball’s trajectory to curve. The motion of the ball after contact with the table is affected by the friction coefficients (lateral, rolling, spinning) between the ball and table, and the restitution coefficients of the ball and table. In order to predict the future states of the ball, the neural network must learn to implicitly model these physical forces. The ball trajectory may include arcs after its contact with the table. The training data allows the model to learn about the location and geometry of the table and predict whether the ball will collide with the table and how its trajectory is affected after contact.
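The following PyTorch sketch mirrors the architecture in Fig. 9: two LSTM layers followed by a fully-connected layer, with the same input repeated at every timestep. The layer sizes, prediction horizon, and framework choice are assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class BallTrajectoryModel(nn.Module):
    """Sketch of the ball-trajectory prediction network (sizes are assumptions)."""

    def __init__(self, state_dim=6, hidden_dim=128, horizon=30):
        super().__init__()
        self.horizon = horizon
        self.lstm = nn.LSTM(state_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, state_dim)

    def forward(self, ball_state):                       # (batch, state_dim)
        # Feed the same (normalized) ball motion state at every future timestep.
        x = ball_state.unsqueeze(1).repeat(1, self.horizon, 1)
        h, _ = self.lstm(x)                               # (batch, horizon, hidden_dim)
        return self.head(h)                               # (batch, horizon, state_dim)

# Training would minimize, e.g., the mean squared error against recorded trajectories:
# loss = nn.functional.mse_loss(model(initial_states), observed_future_states)
```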

6.2.2 Landing-Prediction Model

The landing-prediction model allows the agent to make predictions about the eventual landing location and speed of the ball given pre-contact ball and paddle motion states. Suppose the land-ball agent has predicted a trajectory for the incoming ball and has picked a particular point as a desired contact point for hitting the ball with the paddle. Given a candidate paddle motion state at time of contact , the landing-prediction model predicts where and with what speed the ball will land on the table if the paddle motion state is achieved.

The landing-prediction model is described by the function

(12)

where denotes the landing model, and denotes the ball’s predicted position and speed at landing time.

Figure 10: Forward Landing-Prediction Model. It is a feed-forward network with two fully-connected hidden layers. Given pre-contact ball and paddle motion states, the model predicts the eventual landing location and speed of the ball when it hits the opponent side. Such a prediction can inform the agent about the outcome of available actions.

The model’s architecture is shown in Fig. 10. It is a feed-forward network, which produces its output in one step. Since the land-ball policy is only concerned with the eventual landing location and speed of the ball, a feed-forward network is preferred: it is faster at inference time and easier to train.

The training data for this model can be limited to the ball trajectories that actually hit the table after contact with the paddle. Alternatively, the notion of landing location can be extended to include positions off the table. In that case, the landing location is picked to be the last location for the ball before it falls below the plane of the table’s surface.

The landing-prediction model is used by the land-ball skill to search in the space of candidate paddle actions and select one that is more likely to achieve the desired landing target. In other words, given a set of potential paddle actions , the land-ball policy selects the action whose predicted landing location and speed is closest to the requested target .
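A sketch of this search is shown below, assuming landing_model is a callable wrapping the trained forward network and candidate_actions is an iterable of candidate paddle motion states (for example, perturbations of demonstrated strikes or outputs of the inverse landing model); the distance metric over landing position and speed is illustrative.

```python
import numpy as np

def select_paddle_action(ball_state, candidate_actions, landing_target, landing_model):
    """Pick the candidate paddle action whose predicted landing is closest
    to the requested landing target (illustrative sketch)."""
    def landing_error(paddle_state):
        predicted = landing_model(ball_state, paddle_state)   # position and speed
        return np.linalg.norm(np.asarray(predicted) - np.asarray(landing_target))

    return min(candidate_actions, key=landing_error)
```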

6.2.3 Inverse Landing-Prediction Model

The search process for selecting paddle actions is computationally expensive. The inverse landing-prediction model addresses this issue by directly predicting the paddle action given a desired landing target.

The landing-prediction model is a forward model, as it predicts the future state of the environment given observations from the current state and an action. It is possible to train an inverse landing model from the same data that is used to train the forward landing-prediction model. The inverse landing model is described by the function

(13)

where denotes the inverse landing model.

Figure 11: Inverse Landing-Prediction Model. It is a feed-forward network with two fully-connected hidden layers. Given pre-contact ball motion state and a desired landing target, the model predicts the paddle motion state right before contact. The predicted paddle motion state can be used as an action for the paddle control skill.

The model’s architecture is shown in Fig. 11. Given a pre-contact ball motion state and a desired landing location and speed, the inverse landing model predicts the pre-contact paddle motion state that was in effect. The predicted paddle motion state can be used as an action for the paddle control skill.

Ignoring the noise in the environment and imperfections in paddle control, the forward landing-prediction model should have a single answer. That is, if the pre-contact ball and paddle motion states are accurately known, there is only one possible outcome for the landing location and speed of the ball. The same statement does not hold for the inverse landing model. In general, there might be more than one way to hit a given ball back toward a given target. However, it is still useful to build an inverse landing model by training a neural network on all observed trajectories from human data. Such a model would predict the mean of all observed paddle actions taken by humans. Due to the nonlinearities in the action dimensions, it is possible that the mean action would not be a good solution to the inverse landing problem. However, it can serve as a starting point for a search using the forward landing model.

The inverse landing model is more complex than the forward landing model, since the paddle motion state has more dimensions than the landing target . In the proposed setup, is 12-dimensional, while has up to four dimensions. Note that some dimensions do not need to be predicted. For example, the paddle position can be decided directly based on the predicted position of the ball at the moment of contact, and the height of the landing target is equal to the height of the table plus the radius of the ball.

6.3 Domain Invariances and Data Normalization

This section describes the normalizing transformations that are applied to the collected training data when training the dynamics models discussed in Sec. 6.2.

Normalizing the data reduces the dimensionality of the inputs and outputs to the models and improves sample efficiency. Consider a trajectory containing a number of observations on the position and velocities of the ball and the paddle. In the table tennis domain, the following invariances hold:

  • Translation invariance across the horizontal plane: Shifting the horizontal coordinates of the ball and paddle positions by a constant amount for all the points in the trajectory produces a valid trajectory. This transformation does not affect the object orientations or velocities.

  • Rotation invariance around the vertical axis: Rotating all object poses and velocity vectors around any vertical line produces a valid trajectory.

  • Rotation invariance around the paddle normal: Rotating the paddle poses around the surface normals produces a valid trajectory. The force exerted by the paddle on the ball depends on the contact angle between the ball and the paddle and on the object velocities before contact, but it is not affected by the placement of the paddle handle. Rotating the paddle handle around the paddle surface normal does not change the contact dynamics.

  • Inversion invariance for paddle normal : Inverting all three elements of the paddle normals produces a valid trajectory. Flipping the paddle normal amounts to changing a forehand to a backhand. As long as the paddle position and velocity stay the same, a forehand and backhand contact have the same impact on the ball.

These invariances can be exploited when training the dynamics models from observation trajectories in two ways:

  1. Data Augmentation: For any collected trajectory, random perturbations based on the explained invariances can be consistently applied to all points to generate augmented training trajectories.

  2. Data Normalization: The collected trajectories can be normalized to remove the redundant dimensions.

Data augmentation has the advantage that it has a simple implementation: it just results in generating more data, and the clients that query the dynamics models do not need to be modified. Another advantage of data augmentation is that it can factor in the position of the table and its impact on the trajectory. For example, a ball that bounces once on the table may not hit the table at all if its horizontal coordinates are shifted too much. Similarly, if an augmented trajectory hits the net, it can be removed. The disadvantage of data augmentation is that it introduces additional hyperparameters. What translation and rotation values are likely to generate trajectories that are valid and likely to occur during the game? How many augmented trajectories should be generated from each recorded trajectory to sufficiently capture different types of transformations over object states? Lastly, the expected accuracy of the data augmentation approach is upper bounded by that of the data normalization approach.

The data normalization approach does not add any new trajectories. Instead, it just modifies the collected trajectories to remove the redundant dimensions. For example, all ball trajectories can be shifted so that the initial horizontal coordinates of the ball are zero. It has the advantage that it does not introduce any hyperparameters and does not increase the size of the training dataset. Also, reducing the number of dimensions simplifies the problem for the neural network, whereas with data augmentation the neural network needs to capture the invariances in its hidden layers. The disadvantage of normalization is that it cannot model the table. For example, normalizing the trajectory of a bounced ball implicitly assumes that the ball will bounce in any direction and will never hit the net. Lastly, querying a model trained on normalized data requires normalizing the inputs and un-normalizing the outputs, which complicates the implementation.

The implementation in this work uses data normalization. Since the models are not aware of the location of the table and the net, it is up to the higher-level skills like the strategy skill to pick targets that are feasible and increase the agent’s reward. In the case of the ball-trajectory prediction model, assuming that the opponent’s ball will always bounce on the player’s table is harmless. The agent can plan to respond to a bouncing ball, and if the bounce does not happen, the agent simply wins the point. For the landing model, the strategy skill is expected to pick targets such that the actions recommended by the models are effective. For example, if the strategy skill picks a landing target close to the net, it should pick a lower landing velocity to avoid hitting the net. Therefore, the dynamics models do not need to be aware of the location of the table and the net.

The following section explains the normalizing transformations that are applied to the data used for training each dynamics model.

6.3.1 Normalizing Ball Trajectories

The ball trajectories are normalized as follows:

  1. All ball motion states are shifted such that the horizontal coordinates of the first ball motion state become zero. Suppose the position of the first ball in the original trajectory is

    (14)

    where denote the coordinates of the ball’s position. Then all points in the original trajectory are transformed to points in the normalized trajectory as

    (15)

    In particular, the first point is transformed to such that

    (16)

    This transformation does not affect the ball velocity vectors.

  2. All ball positions and velocity vectors are rotated such that the component of the velocity vector of the first ball becomes zero. More specifically, the objects and their velocity vectors are rotated around the vertical line

    (17)

    by an angle equal to

    (18)

    where denotes the angle between the two enclosed vectors, denotes the projection of the velocity vector onto the horizontal plane specified by , and is the unit vector parallel to the axis.

These transformations remove three of the six dimensions in the input to the ball-trajectory prediction model. Therefore, they simplify the job of the neural network that is modeling ball trajectories.
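A sketch of the two normalization steps follows, assuming positions and velocities are arrays of shape (T, 3) whose third component is the vertical axis; the axis conventions and array layout are assumptions.

```python
import numpy as np

def normalize_ball_trajectory(positions, velocities):
    """Normalize a recorded ball trajectory (illustrative sketch).

    1. Shift all positions so the first ball position has zero horizontal coordinates.
    2. Rotate all positions and velocities about the vertical axis so the initial
       horizontal velocity points along a canonical direction.
    """
    positions = np.asarray(positions, dtype=float).copy()
    velocities = np.asarray(velocities, dtype=float).copy()

    # Step 1: translate in the horizontal plane (heights are unchanged).
    offset = np.array([positions[0, 0], positions[0, 1], 0.0])
    positions -= offset

    # Step 2: rotate about the vertical axis through the first ball position.
    vx, vy = velocities[0, 0], velocities[0, 1]
    theta = -np.arctan2(vy, vx)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    positions = positions @ rot.T
    velocities = velocities @ rot.T
    return positions, velocities
```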

6.3.2 Normalizing Landing Trajectories

The landing trajectories are normalized as follows:

  1. All ball motion states and paddle motion states are shifted such that the horizontal coordinates of the pre-contact ball become zero. Suppose the position of the pre-contact ball in the original trajectory is

    (19)

    Then all ball motion states and paddle motion states in the original trajectory are transformed in the normalized trajectory as

    (20)
    (21)

    This transformation does not affect any velocity vectors or paddle orientations.

  2. All ball and paddle poses and velocity vectors are rotated such that the component of the velocity vector of the pre-contact ball becomes zero. More specifically, the objects and their velocity vectors are rotated around the vertical line

    (22)

    by an angle equal to

    (23)
  3. All paddle orientations are replaced by paddle normals.

  4. All paddle normals with negative components are inverted. In other words, all backhand paddles are replaced with forehand paddles.

The first two transformations above remove three of the six dimensions from the ball motion state input to the landing-prediction model. The third transformation removes one of the 13 dimensions from the paddle motion state input to the model. The last transformation cuts the space of three of the paddle motion state dimensions in half. Therefore, normalizing the landing trajectories makes it easier for the neural network to predict landing targets.

It is useful to note that the trajectories are not invariant to vertical translation. Changing the height of the ball and paddle invalidates the trajectory if the ball contacts the table at any point. On the other hand, for the landing-prediction model, data augmentation is used to generate augmented trajectories with reduced ball and paddle heights. The landing model is not concerned with the motion of the ball after its contact with the table. Given a ball trajectory that collides with the table at the end, it is possible to compute where the collision point would be if the ball were shot from a lower height. The same does not hold for increasing the height of the ball. This property is used to generate multiple landing trajectories with lower heights from each original landing trajectory.

6.4 Data Collection

The data for training the dynamics models is collected in a VR environment that is integrated with the simulation environment. The VR environment is designed specifically for data collection. The paddle strike trajectories and ball motion data collected from human demonstrations are used to train the dynamics models described in Sec. 6.2.

Fig. 12 shows the data collection process in the VR environment. A player is controlling the simulated paddle by moving the real VR controller in their hand. The player returns the balls coming from a ball launcher on the other side of the table. The paddle and ball trajectories are recorded and used in training the dynamics models.

Figure 12: Data Collection in VR Environment. The VR environment allows a human player to control a simulated paddle by moving the VR controller in the real world. The simulated paddle follows the VR controller (visualized as the black object holding the paddle). The player returns the balls coming from a ball launcher on the other side of the table. The paddle and ball trajectories are recorded and used in training the dynamics models.

Since the VR hardware has only one headset and does not support two players, the data collection environment emulates a two-player game by having the ball launcher match the direction and velocity of the balls which are successfully returned by the player. If the player lands the ball on the other side, the simulator sends the next ball to the corresponding location on the player’s side of the table, making the player respond to their own shot. This setup allows the distribution of ball trajectories to be closer to what might be observed in a real two-player game. If the player sends the ball out, the next ball is shot from the ball launcher.

Once the data is collected, it is used to extract ball motion trajectories and landing trajectories. The ball motion trajectories start at the moment when the ball launcher throws the ball and capture the motion state of the ball in every subsequent timestep. The landing trajectories start a few timesteps before contact between the ball and paddle happens and continue until the ball lands on the opponent side, or crosses the plane at the surface of the table. The ball-trajectory prediction model is trained on the ball trajectories and the landing-prediction model is trained on the landing trajectories. For training the landing-prediction model, only two timesteps of the trajectory are used: a timestep before contact happens, and the final timestep when the ball has landed or has gone out. Since the dynamics models are trained on normalized data, a ball that goes out contains useful information as well, since the same paddle motion can be useful for landing a similar ball hit from a different region of the table.

6.4.1 Data Augmentation

The data collected from human demonstrations contained about 300 successful paddle strikes where the human player was able to make contact with the ball. Strikes where the ball hit the edge of the paddle were removed from the dataset, since that type of contact is not the behavior that the agent is expected to learn from. To speed up data collection, a data augmentation process was used to generate more samples from the 300 human demonstrations. During data augmentation, the paddle and ball trajectories observed during demonstrations were replayed in the simulator with small amounts of noise to produce additional samples. Each sample created this way is counted as an extra sample.

6.4.2 Subsampling

A subsampling process is used to extract multiple training samples from each trajectory. Since the sensors operate at a higher frequency than the environment timestep, each trajectory can be subsampled with multiple offsets. For example, in a landing trajectory, there are multiple sensory observations of the ball and paddle in the 20 milliseconds (one environment timestep) leading up to contact between the ball and paddle. Any one of those observations can be used for training the forward and inverse landing-prediction models. So, with the subsampling process, every landing trajectory can be used to produce 20 samples.

In addition, the ball trajectories are replicated by considering any of the observations in the first ten timesteps as the starting point for the remainder of the trajectory. The training samples extracted with subsampling were not counted as additional samples.
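A minimal sketch of the trajectory-replication part of subsampling, where each of the first few observations is reused as an alternative starting state; the data layout and function name are illustrative.

```python
def subsample_ball_trajectory(trajectory, max_start_offset=10):
    """Extract multiple training samples from one recorded ball trajectory.

    Each of the first `max_start_offset` observations is treated as the initial
    ball state, with the remainder of the trajectory as the prediction target.
    No new data is recorded; the same trajectory is reused with different
    starting points.
    """
    samples = []
    for offset in range(min(max_start_offset, len(trajectory) - 1)):
        initial_state = trajectory[offset]
        future_states = trajectory[offset + 1:]
        samples.append((initial_state, future_states))
    return samples
```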

Although not attempted in the current implementation, there is also an opportunity to extract additional landing trajectories by applying a height reduction: computing where the landing location would have been if the height of the paddle and the ball at the moment of contact were reduced. The new landing location can be computed as the intersection of the observed trajectory and an imaginary plane marking where the table surface would be given the height adjustment. The same process cannot be used for augmenting free-moving ball trajectories, because the ball-trajectory prediction model is expected to predict the behavior of the ball after it bounces off the table as well, and that behavior cannot be computed, since reducing the initial height of the ball changes the contact parameters for when it hits the table.

6.5 Evaluation

The dynamics models are trained with two dataset sizes. The smaller dataset contains about 7 000 successful paddle strikes where the human player was able to make contact with the ball. The larger dataset contains 140 000 strikes. Once trained, the dynamics models are evaluated on 1000 strikes generated against a ball launcher; these evaluation strikes are not included in either training dataset. The observed ball and paddle trajectories resulting from the 1000 strikes are recorded and compared against the predictions from the dynamics models. The following sections report the mean errors for the ball-trajectory and landing-prediction models.

6.5.1 Ball-Trajectory Prediction

Fig. 13 and Fig. 14 show the average position and velocity errors in predicted ball trajectories. Each evaluated ball trajectory contains 30 timesteps of observations corresponding to 0.6 seconds of data. This amount of time is usually enough for the ball to cross the table and reach the striking player. As the plots show, the average position error stays less than 1 cm after 25 timesteps, and the average velocity error stays less than 0.1 m/s after 25 timesteps.

There is little difference between the accuracy of the model trained on the large dataset and the model trained on the small dataset. With data normalization, the number of inputs to the ball-trajectory prediction model is reduced to three. Moreover, the subsampling process generates many more samples from each recorded ball trajectory. In addition, the physical laws governing the behavior of a bouncing ball are relatively simple to learn. Therefore, it seems this model does not need many samples to learn to predict the behavior of the ball.

Figure 13: Mean Position Error in Ball-Trajectory Predictions. The plot shows the mean position error over 1000 ball trajectories containing 30 timesteps of observations each. The error reported is the Euclidean distance between the predicted position and the observed position of the ball. The average position error stays less than 1 cm after 25 timesteps.
Figure 14: Mean Velocity Error in Ball-Trajectory Predictions. The plot shows the mean velocity error over 1000 ball trajectories containing 30 timesteps of observations each. The error reported is the Euclidean distance between the predicted 3D velocity and observed 3D velocity vectors for the ball. The average velocity error stays around 0.02 m/s for the first 20 timesteps and starts climbing from there. The 20th timestep is around the time when the ball hits the table. It is likely that predicting the behavior of the ball after contact is more challenging than predicting its free motion for the model. At any rate, the prediction error on velocity remains low compared to the magnitude of observations (around 6 m/s).

6.5.2 Landing Prediction

Table 1 shows the mean position error over 1000 landing predictions from models trained on the small and large datasets. The error is about 0.19 m when the model is trained on 7 000 samples and about 0.114 m when the model is trained on 140 000 samples. The landing-prediction model is more complex than the ball-trajectory prediction models, since its inputs include both the ball and paddle states. Moreover, the model is expected to predict the eventual position of the ball after it has travelled for 1-2 meters. These reasons might be why the landing-prediction model benefits from more training data.

Samples    Mean Position Error
7 000      0.190 m
140 000    0.114 m

Table 1: Mean Position Error for the Landing-Prediction Model. Mean position error for models trained from 7 000 trajectories and 140 000 trajectories.

6.6 Conclusion

The dynamics models discussed in this section are only concerned with the physics of the table tennis environment. They do not deal with the physics of the robot. Therefore, the models can be trained from data collected from human games or practice sessions against a ball launcher. The evaluations show that models are able to predict the motion of the ball over multiple timesteps in the future. Absence of noise makes the simulation outcomes deterministic, and easier to predict. Yet, the experiment results show that the models have the ability to capture the physics of interactions between the objects in the environment. Sec. 10 describes an extension to the method that can handle observation noise as well. The next section describes the analytic paddle controller that can be used to execute target paddle motion states to land a given ball at a desired target.

7 Paddle Control Policy

This section discusses the implementation of the paddle control skill. Sec. 7.1 revisits the definition of the paddle control task and defines some variables used in the rest of the section. Sec. 7.2 describes an analytic paddle controller, which is derived mathematically based on the kinematics of the robot links and the motion constraints on the robot motors. Sec. 7.3 describes an analytic paddle-dynamics model that allows higher-level skills to make predictions about expected paddle motion states resulting from executing the high-level paddle targets with the paddle control skill. Lastly, Sec. 7.4 discusses an alternative implementation for the paddle control skill that uses learning.

To increase sample-efficiency, the proposed method uses the analytic paddle controller and does not rely on training to learn the internal dynamics of the robot. The alternative controller that uses learning is studied in an ablation experiment in Sec. 8.

7.1 Paddle Control Problem

As described in Sec. 4.7, the objective of the paddle control skill is to bring the paddle from its current state to the desired state at the target time. The paddle target requested from the paddle skill includes the paddle pose and its time derivative:

(24)

The paddle pose in turn includes the paddle position and surface normal :

(25)

Note that in this formulation, does not fully specify the paddle pose, as there are generally many possible paddle orientations that satisfy the specified paddle normal . Specifying the paddle pose with its surface normal instead of a fully-specified orientation has the advantage that it gives the paddle skill the freedom to choose any orientation that satisfies the normal. Also, it is easier to replicate paddle normals from human demonstrations than paddle orientations. The forces exerted from the paddle on the ball at the moment of contact depend on the paddle’s normal and stay fixed if the paddle is rotated around its surface normal vector. So any possible orientation that satisfies the given normal will have the same impact on the ball.

The time derivative of the paddle pose includes the paddle’s linear velocity and angular velocity :

(26)

The paddle’s linear and angular velocities at the time of contact affect the forces exerted on the ball and therefore its trajectory after contact.

7.2 Analytic Paddle Control

The analytic paddle controller uses 3D geometry, inverse kinematics, the robot Jacobian, and Reflexxes trajectory planning to achieve the desired paddle motion state . It works through the following steps to move the paddle from its current state at time to a desired state at time :

  1. Find a paddle orientation that satisfies the desired paddle normal in .

  2. Map the target paddle pose to target joint positions .

  3. Map the target paddle velocity to target joint velocities .

  4. Compute a joint trajectory starting with current positions and velocities and reaching the target joint states exactly at time .

  5. Use the robot’s controller (e.g., a PID controller) to execute joint commands that follow the computed trajectory.

The following sections describe each step in detail.

7.2.1 Mapping Paddle’s Normal to Orientation

The analytic controller computes a paddle orientation based on the requested paddle position and surface normal . First, it uses inverse kinematics to find an intermediate pose that satisfies only the requested position :

(27)

where denotes the inverse kinematics function starting with canonical robot rest poses, and denotes the joint positions corresponding to an intermediate pose that satisfies .

In the coordinate system introduced in Sec. 2.1 and shown in Fig. 3, one coordinate axis of the paddle normal points toward the opponent. A normal with a positive component along that axis specifies a forehand paddle, while a normal with a negative component specifies a backhand paddle. The function in Eq. 27 runs the IK optimization starting with either a canonical forehand or backhand rest pose for the robot, depending on the sign of that component in the requested paddle normal. Inverse kinematics is typically implemented as a local optimization process that iteratively uses the robot Jacobian to reduce the distance between the current pose and the requested pose. So, the solution found by inverse kinematics depends on the initial pose of the robot. Starting the search with a rest pose leads to an answer that is closer to that initial pose, and therefore likely to be well within the robot’s reachable space.

Once the intermediate solution is found, forward kinematics is used to compute the corresponding paddle pose for this solution:

(28)

where FK denotes the forward kinematics function. Assuming that the requested target paddle position is reachable by the robot, the intermediate pose should satisfy that position, i.e., one should have:

(29)

Next, the corresponding paddle normal at the intermediate solution is computed. Then a minimum rotation between and the target paddle normal is computed as:

(30)

where denotes a 3D rotation that can move to . Applying the rotation to the intermediate paddle orientation produces the desired paddle orientation:

(31)

where denotes the desired paddle orientation and denotes the paddle orientation corresponding to the intermediate paddle pose.

Due to its construction, is guaranteed to have the desired paddle normal . Also, because it is constructed with a minimum rotation from a feasible pose , it is likely that is feasible by the robot as well.
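A sketch of this construction using SciPy rotations is shown below: the minimum rotation between the intermediate paddle normal and the requested normal is applied to the intermediate orientation. The quaternion convention and the handling of the antiparallel case are simplifications, and the function names are illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def orientation_for_normal(intermediate_quat, intermediate_normal, target_normal):
    """Rotate the intermediate paddle orientation so its surface normal matches
    the requested normal, using the minimum rotation between the two normals
    (sketch; quaternions use scipy's [x, y, z, w] convention)."""
    n0 = np.asarray(intermediate_normal) / np.linalg.norm(intermediate_normal)
    n1 = np.asarray(target_normal) / np.linalg.norm(target_normal)

    axis = np.cross(n0, n1)
    norm = np.linalg.norm(axis)
    if norm < 1e-9:
        # Normals already parallel; the antiparallel case would need a
        # 180-degree rotation about any axis perpendicular to the normal.
        delta = R.identity()
    else:
        angle = np.arccos(np.clip(np.dot(n0, n1), -1.0, 1.0))
        delta = R.from_rotvec(axis / norm * angle)

    # Compose the minimum rotation with the intermediate orientation.
    return (delta * R.from_quat(intermediate_quat)).as_quat()
```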

7.2.2 Mapping Paddle’s Pose to Joint Positions

Inverse kinematics can be used to find some joint positions that satisfy the paddle pose, subject to the physical limits of the robot and the limits on the range of positions for its joints:

(32)

where IK denotes the inverse kinematics function. In other words, IK maps the desired paddle pose to a robot pose .

In general, there are multiple solutions to the IK problem. The implementation in this work uses null-space control to prefer robot poses that are closer to some canonical forehand and backhand poses.

7.2.3 Mapping Paddle’s Linear and Angular Velocities to Joint Velocities

To map the desired linear and angular velocities for the paddle to some joint velocities, the end-effector Jacobian is computed at pose :

(33)

where denotes the Jacobian at . Eq. 33 can be rewritten as:

(34)
(35)
(36)
(37)

where and denote the time derivatives of and . In other words, the Jacobian establishes a relationship between the paddle’s linear and angular velocity and the robot’s joint velocities.

In order to solve for given , the Jacobian needs to be inverted. To handle non-square Jacobians when the robot assembly has more than six joints, and also to avoid failing on singular matrices, the pseudo-inverse method is employed to invert the matrix:

(38)

Then, the required joint velocities at target can be obtained as:

(39)

The current joint positions and velocities can be obtained directly by reading them from the robot’s controller. Therefore, the paddle control policy can analytically obtain the inputs it needs to pass to the trajectory planning skill as described in Eq. 8:

(40)
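A sketch of this velocity mapping using NumPy’s pseudo-inverse follows; the Jacobian is assumed to be the 6 x N end-effector Jacobian with the linear-velocity rows stacked above the angular-velocity rows (in PyBullet, for example, it could be assembled from the output of pybullet.calculateJacobian).

```python
import numpy as np

def joint_velocities_for_paddle_twist(jacobian, paddle_lin_vel, paddle_ang_vel):
    """Map a desired paddle twist to joint velocities via the pseudo-inverse
    of the end-effector Jacobian (sketch).

    The pseudo-inverse handles redundant robots (more than six joints) and
    avoids failure when the Jacobian is singular.
    """
    twist = np.concatenate([paddle_lin_vel, paddle_ang_vel])   # shape (6,)
    return np.linalg.pinv(jacobian) @ twist                    # shape (N,)
```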

7.2.4 Trajectory Planning

At this point the problem of executing the paddle motion state is reduced to executing target joint states given the initial joint states, as described in the trajectory planning skill in Eq. 9. This task is accomplished by employing Reflexxes to compute a joint-state trajectory between the current time and the target time. Reflexxes is an analytic algorithm which computes the intermediate joint states solely based on a set of motion constraints defined on individual joints:

(41)

Reflexxes is able to produce trajectories at the desired control frequency, which is in this implementation. It is a fast library and can plan for the trajectory and return the next step typically within .

Reflexxes can compute trajectories that take the minimum time, or trajectories that complete precisely at some specified time . The implementation in this work uses both modes for different skills:

  1. The positioning skill is active when the agent is awaiting the opponent’s action. Its objective is to put the robot in some pose that is suitable for responding to the next incoming ball. So, for this skill, it is desirable to reach the target position as soon as possible. When the positioning skill is active, the paddle skill requests a trajectory that reaches the target in minimum time.

  2. The objective of the land-ball skill is to hit the ball back at a planned contact time. When producing trajectories for this skill, the paddle skill is always given a desired target time. In such cases, the robot usually starts moving slowly and builds up speed toward the target, timed so that it achieves the desired joint velocities exactly when it reaches the pose specified by the target joint positions.

It is possible that before the robot reaches the target of the positioning skill, the opponent hits the ball back and the land-ball skill becomes active again. In that case, the trajectory planned for the positioning skill is not completed, and the trajectory for the land-ball skill starts with the current joint states as its initial condition.

There are situations where no feasible trajectories exist that satisfy the constraints. For one, the initial joint states or the final joint states might violate the position and velocity constraints of the robot. In the hierarchical setup, this may happen due to the higher-level skills requesting a paddle motion state that requires executing joint velocities beyond the limits of the robot. In such cases, the requested target already violates the constraints in Eq. 41. Even with conservative limits on the paddle’s velocity, the joint velocities may end up being high when the paddle is close to the singularity points of the robot. In such regions, the inverse Jacobian matrix computed in Eq. 38 contains elements with large magnitudes. In situations where the requested final joint positions and velocities are invalid, the analytic controller does not send any commands to the robot.

Another class of infeasible trajectories are those with insufficient time. For example, if a higher-level skill demands that the paddle move from one side of the table to the other in 0.1 seconds, the trajectory would violate the acceleration and jerk limits of a typical robot. In such cases, Reflexxes can still compute a minimum-time trajectory toward the target. However, due to the insufficient time, at the requested target time the robot will be at some state in between the starting state and the target state. This state can be queried and used to evaluate the expected outcome of the action under consideration:

(42)
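The sketch below illustrates this feasibility check under assumed interfaces: plan_trajectory stands in for a Reflexxes-style time-synchronized planner that respects per-joint motion constraints; it is an assumed helper, not the actual Reflexxes API.

import numpy as np

def evaluate_reachable_state(plan_trajectory, q0, qd0, q_target, qd_target, T,
                             pos_tol=1e-3, vel_tol=1e-2):
    """Plan a trajectory toward (q_target, qd_target) with duration T and
    query the joint state the robot would actually reach at time T (Eq. 42).

    plan_trajectory is an assumed helper: given initial and final joint
    states and a duration, it returns a callable trajectory that can be
    sampled at any time in [0, T] while respecting the per-joint motion
    constraints. When T is too short, the sampled state at T falls short of
    the target, which is the signal used to score candidate actions."""
    traj = plan_trajectory(q0, qd0, q_target, qd_target, duration=T)
    q_T, qd_T = traj(T)                      # joint state actually reachable at time T
    feasible = (np.max(np.abs(q_T - q_target)) < pos_tol and
                np.max(np.abs(qd_T - qd_target)) < vel_tol)
    return feasible, q_T, qd_T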

7.2.5 Joint Control

Once a trajectory is computed, it needs to be executed on the robot. The joint trajectory includes joint positions, velocities and accelerations for each timestep. When the control frequency is high enough, the points on the trajectory are so close to each other that just sending the joint positions to a PID controller can control the robot smoothly. Some robots have PID controllers that also handle velocity targets. For such controllers, the joint velocities from the trajectory can also be fed to the robot’s controller. Another option for joint control is inverse dynamics control, where the kinematics and inertial properties of the robot links are used to directly compute the forces or torques for joints. In either case, the underlying controller that is available for the robot implements the policy

(43)

where the arguments are the current joint positions, velocities, and accelerations together with the desired joint positions, velocities, and accelerations at the next timestep, and the output is the joint control command to execute in order to achieve the desired joint states at the very next timestep.
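As a minimal illustration of such an underlying joint controller, the sketch below implements per-joint PD tracking with an optional acceleration feedforward term; the gains and the torque-style output are assumptions, and a real robot controller may instead accept position and velocity targets directly. This is a simplified stand-in for Eq. 43, not the paper's controller.

import numpy as np

class JointPDController:
    """Per-joint PD tracking controller with optional acceleration feedforward."""

    def __init__(self, kp, kd):
        self.kp = np.asarray(kp)   # proportional gains, one per joint
        self.kd = np.asarray(kd)   # derivative gains, one per joint

    def command(self, q, qd, q_des, qd_des, qdd_des=None):
        """Compute a joint command from the current state (q, qd) and the
        next trajectory point (q_des, qd_des, qdd_des)."""
        u = self.kp * (q_des - q) + self.kd * (qd_des - qd)
        if qdd_des is not None:
            u = u + qdd_des        # simple feedforward on the desired acceleration
        return u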

7.3 Paddle-Dynamics Model

The objective of the paddle control skill is to control the robot so as to achieve the requested paddle motion state at the planned target time. As outlined in Sec. 7.2, this skill is implemented with an analytic controller. However, the solution found by the analytic controller may not always achieve the desired paddle target, for the following reasons:

  1. Failure in inverse kinematics: The desired paddle motion state may not be physically feasible for the robot. The target paddle position may be out of reach, or the target pose may require an orientation that is not achievable given the robot's anatomy.

  2. Failure in trajectory planning: The desired contact time may be too close. In that case, the trajectory planning skill cannot move the robot to the target in time without violating the specified motion constraints.

The analytic controller can predict the error in achieving the paddle target due to either of the above two causes. If there is enough time for the trajectory to reach the target while satisfying the motion constraints, then the trajectory is feasible. As shown in Eq. 42, the final point on the trajectory computed by Reflexxes contains information about where the robot will be at the completion time. When the trajectory is feasible within the given time, the final joint positions and velocities are equal to the planned targets. In other words, the following holds:

(44)
(45)

When the given time is not enough to complete the trajectory, the final point on the computed trajectory can be used to predict the paddle motion state at the target time. First, forward kinematics is used to predict the paddle pose resulting from executing the final joint positions on the trajectory:

(46)

where the quantity on the left-hand side denotes the predicted resulting paddle pose.

Then, the end-effector Jacobian is used to produce a prediction for the paddle's final linear and angular velocities given the predicted final joint velocities:

(47)

Equations 46 and 47 combined specify a prediction for the full paddle motion state at the target time:

(48)
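A minimal sketch of this prediction step follows; forward_kinematics and jacobian are assumed helper functions for the robot model rather than calls into any particular library.

def predict_final_paddle_state(forward_kinematics, jacobian, q_final, qd_final):
    """Predict the paddle motion state reached at the planned target time
    from the final joint state on the planned trajectory (Eqs. 46-48)."""
    paddle_pose = forward_kinematics(q_final)      # predicted paddle pose (Eq. 46)
    paddle_twist = jacobian(q_final) @ qd_final    # predicted linear + angular velocity (Eq. 47)
    return paddle_pose, paddle_twist               # full predicted paddle motion state (Eq. 48)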

In Sec. 7.2, inverse kinematics (Eq. 32), the robot Jacobian (Eq. 39), and Reflexxes (Eq. 42) were used to map the paddle target to joint positions and velocities. In this section, forward kinematics (Eq. 46) and the robot Jacobian (Eq. 47) are used to map those joint states back to a predicted paddle motion state resulting from running the analytic controller. Combining the above equations defines a forward paddle motion state model which, given the current paddle motion state and the target paddle motion state, produces a prediction for the paddle motion state resulting from the analytic controller:

(49)

where the function above denotes the forward paddle-state model under the motion constraints specified in Eq. 41.

The prediction shows the expected state of the paddle at the planned contact time with the ball. This prediction can be used in conjunction with the forward landing model from Sec. 6.2.2 to inform the agent about the expected landing location and velocity resulting from executing the requested paddle action.

The controller discussed in Sec. 7.2 and the dynamics model discussed in this section are entirely analytic. They are derived mathematically based on the kinematics of the robot links and the motion constraints on the robot motors. This approach increases sample efficiency since no training episodes are being spent on learning robot control. Moreover, the abstract control space exposed by the analytic controller makes the remaining parts of the problem easier to learn with reinforcement learning.

7.3.1 Learning Paddle Dynamics

One of the advantages of using a trajectory planning module like Reflexxes is that it generates smooth targets which already account for the physical motion limits of the robot. This smoothness and continuity in the targets can hide some imperfections in the underlying robot controller. For example, if the PID gains are not tuned well, the resulting errors and oscillations are smaller when consecutive targets are close to each other.

However, robots and their controllers are ultimately imperfect and imprecise. Deviations between the expected and observed behavior of a robot can arise from:

  • Imperfections in the controller’s implementation, gains, and other parameters.

  • Round-trip delay of executing commands on the robot.

  • Malfunctioning or worn-out robot parts.

  • Mechanical backlash caused by gaps between components.

  • Misspecified motion constraints. If the velocity, acceleration, and jerk limits given to Reflexxes are higher than the actual limits of the robot, Reflexxes would compute trajectories that are beyond the physical limits of the robot.

It is possible to extend the notion of the robot's model to also capture such imprecisions in controlling the robot. Unlike the analytic paddle model in Sec. 7.3, which was derived mathematically, such a dynamics model over the robot's control is best learned by experimenting and observing the robot's behavior. A neural network can be trained to predict inaccuracies in control regardless of the underlying cause. As the robot executes the target joint motion states specified by the planned trajectory, the resulting joint positions and velocities at the target time can be recorded and used as labels for training a model. The trained model can then make predictions of the form:

(50)

where the learned function denotes the forward robot model and its outputs denote the expected joint position and velocity observations at the target time.

This constitutes a forward prediction which, when combined with Eq. 49, can produce a more accurate prediction of the future state of the paddle. Such a forward prediction can in turn be used to produce a more accurate estimate of the landing location of a particular strike.

In the other direction, the same training data can be used to learn a corrective robot model as in:

(51)

where the learned function denotes the inverse robot model and its outputs are alternative targets such that, if they are requested from the robot, the observed joint states at the target time are expected to be close to the actual targets. In other words, it is expected that:

(52)

The inverse robot model can be used to adjust the joint position and velocity targets before they are passed to the trajectory planning skill, increasing the likelihood that the requested paddle motion state is achieved.
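The sketch below shows one plausible way to parameterize and train such forward and inverse robot models with a small neural network in PyTorch; the architecture, layer sizes, optimizer, and input layout are illustrative assumptions and are not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

def make_robot_model(n_joints, hidden=128):
    """A small MLP standing in for the learned robot models of Eqs. 50-51.
    The input is a concatenation of two joint states (each containing joint
    positions and velocities); the output is a single joint state."""
    state_dim = 2 * n_joints
    return nn.Sequential(
        nn.Linear(2 * state_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, state_dim),
    )

def train_step(model, optimizer, inputs, labels):
    """One supervised step on logged (requested, observed) joint states.

    Forward model (Eq. 50): inputs = [current state, requested target],
    labels = the observed state at the target time.
    Inverse model (Eq. 51): swap the roles, so inputs = [current state,
    state to be observed] and labels = the target that was actually requested."""
    optimizer.zero_grad()
    loss = F.mse_loss(model(inputs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example setup for a hypothetical 8-joint robot:
model = make_robot_model(n_joints=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)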

7.4 Learning Paddle Control

The main approach in the proposed method uses the analytic paddle controller discussed in Sec. 7.2. This section discusses an alternative implementation of the paddle control skill based on learning. Since an analytic solution exists, the learning approach is less desirable; it is implemented and evaluated in an ablation experiment in Sec. 8.

As shown in the task hierarchy from Sec. 4, the paddle control skill depends on the trajectory planning skill, which in turn depends on the joint control skill. However, it is possible to treat paddle control as a single problem. Combining the task specifications for these three skills from Eqs. 8, 9, and 10 produces a contracted definition for the paddle control skill:

(53)

Eq. 53 relates the high-level paddle control policy, with a paddle motion state as its target, to the low-level joint control actions over the multiple timesteps from the current time to the target time.

Given this formulation, the paddle control task can be treated as a learning problem where the objective is to find the right joint commands to bring the paddle to the desired state at the desired time. A basic approach to learning paddle control with standard reinforcement learning may use random actions on joints with the intention of gradually discovering the right actions that move the paddle toward the requested targets. This approach is not desirable, since it is very inefficient and random joint actions can break the robot.

An alternative approach may consider action spaces that span time intervals longer than one timestep. Effective strikes generally have some continuity in joint motions and usually maintain the direction of motion for most joints. So, one can sample a set of fixed velocity/acceleration profiles for each joint and use those to create paddle strikes. However, it is hard to determine whether such motion profiles would cover the space of all useful strikes. Another problem is the initial pose of the robot at the start of the strike. If the paddle is already in front of the robot, a successful strike may need to bring the paddle back first, before it can be moved forward again to hit the ball with enough force. Moreover, the robot may be in the middle of some motion and have non-zero velocities on some joints when a new strike is to be executed. These requirements greatly increase the space of possible motions that need to be tried in order to develop a high-level paddle controller via training.

7.5 Positioning Policy

The positioning skill is active when the agent is awaiting the opponent's action, i.e., when the opponent is striking the ball. This skill is invoked either when the episode starts with the launcher sending the ball toward the opponent, or right after the agent hits the ball back toward the opponent. The skill stays active until the opponent makes contact with the ball, at which point the striking skill becomes active.

The positioning skill receives a paddle position target from the strategy skill. The objective of the skill is to move the paddle to the requested position as quickly as possible. Note that the requested paddle target has no specified time.

The paddle position is a proxy for the robot's pose. Specifying the paddle pose instead of the robot pose has the advantage of making the policy less dependent on the specific robot in use. Moreover, a paddle position can be specified with three values, while the full robot pose typically requires specifying six or more joint positions.

As discussed in Sec. 4.6, the positioning skill is defined by the policy:

(54)

where the policy on the right-hand side denotes the paddle control policy and its target is some paddle motion state that satisfies the requested paddle position and the normal direction indicated by the forehand/backhand preference. The paddle skill is expected to achieve this paddle motion state as fast as possible.

The positioning skill is implemented analytically. It simply computes a fully-specified paddle motion state which satisfies the requested paddle position and forehand/backhand pose, and passes it to the paddle control skill, which in turn executes it using the trajectory planning skill. To compute this motion state, the positioning skill first uses inverse kinematics to compute a robot pose as in:

(55)

where the inverse kinematics function starts its search from some canonical robot rest pose and returns the joint positions that satisfy the requested paddle position. The forehand/backhand preference is satisfied by starting the IK search with a canonical pose that has the requested forehand or backhand direction. Since IK algorithms perform a local search, carefully-chosen canonical forehand and backhand poses allow the positioning policy to satisfy the request without flipping the forehand/backhand orientation. Once the target joint positions are computed, forward kinematics is used to compute a paddle pose from them:

(56)

Assuming that the requested target paddle position is reachable by the robot, the computed paddle pose should satisfy that position. Otherwise, it will get as close as possible to the request. The fully-specified paddle motion state should also include the paddle's linear and angular velocities. In the current implementation, the positioning skill requests a stationary target by setting the target velocities to zero:

(57)

However, a more complex implementation could request non-zero velocities at the target in order to reduce the expected reaction time to the next ball coming from the opponent. The full paddle motion state is then sent to the paddle control policy for execution.
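The following sketch summarizes the positioning computation in Eqs. 55-57; ik_from_seed, forward_kinematics, and the canonical seed poses are assumed placeholders rather than the paper's actual implementation.

import numpy as np

def positioning_paddle_target(ik_from_seed, forward_kinematics,
                              paddle_position, forehand,
                              canonical_forehand_pose, canonical_backhand_pose):
    """Turn a requested paddle position plus a forehand/backhand preference
    into a fully-specified, stationary paddle motion state (Eqs. 55-57).

    ik_from_seed(position, seed_pose) runs a local IK search starting from
    the given seed pose; forward_kinematics(q) returns the paddle pose for
    joint positions q. Both are assumed helpers."""
    seed = canonical_forehand_pose if forehand else canonical_backhand_pose
    q_target = ik_from_seed(paddle_position, seed)   # IK seeded with a canonical pose (Eq. 55)
    paddle_pose = forward_kinematics(q_target)       # achieved paddle pose (Eq. 56)
    zero_twist = np.zeros(6)                         # stationary target velocities (Eq. 57)
    return paddle_pose, zero_twist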

7.6 Conclusion

This section discussed the analytic controller setup that handles the paddle control, trajectory planning, and joint control tasks. It also discussed the analytic paddle-dynamics model, which predicts the expected paddle motion state resulting from executing a desired paddle motion state. The next section explains how the analytic paddle controller and model can be used in conjunction with the dynamics models trained from human data over the environment's game space to implement the land-ball policy.

8 Striking Policies

The dynamics models trained from human demonstrations allow the agent to predict the trajectory of an incoming ball, and to predict the landing locations resulting from hitting the incoming ball with given paddle strikes. This section describes how the model-based land-ball policy uses these dynamics models and the paddle control policy described in the previous section to execute arbitrary landing targets. The land-ball policy is evaluated on a target practice task with random landing targets. In order to determine whether using arbitrary strikes from human demonstrations imposes a restriction on the land-ball policy, the policy is also evaluated on dynamics models trained from data generated directly on the robot. To evaluate the sample-efficiency of the model-based land-ball policy, the land-ball task is also learned directly using a model-free reinforcement learning algorithm. Lastly, the alternative hit-ball striking policy is described. The hit-ball policy does not use the strikes demonstrated by humans and is suitable for learning new striking motions.

8.1 Model-Based Land-Ball Policy

The objective of the land-ball skill is to execute a paddle strike that sends an incoming ball with a given motion state to a landing target, consisting of a target position and speed at the moment the ball lands on the opponent's side of the table. Fig. 15 illustrates the implementation of the model-based land-ball policy using three dynamics models: ball-trajectory prediction, forward landing-prediction, and inverse landing-prediction. Algorithm 1 outlines the policy steps in detail.

Figure 15: Land-Ball Policy Using Forward and Inverse Dynamics Models.

input: current ball motion state
input: desired landing target
foreach predicted striking point that is reachable by the robot do
      compute a candidate paddle strike for that point
end foreach
repeat
      emit the next action from the selected strike's trajectory
until the robot paddle hits the ball or the episode ends
Algorithm 1: Model-Based Land-Ball Policy

Fig. 16 demonstrates the different stages of the land-ball policy's algorithm. Given the incoming ball's motion state, the policy predicts the ball's future trajectory. The predicted ball trajectory contains future ball position and velocity observations at the resolution of the environment timestep (20 ms). There are multiple options for selecting a subset of the predicted trajectory as potential striking targets. In the current implementation, a heuristic is used to select all points that lie between two planes corresponding to a 20 cm band in front of the robot assembly. This band typically contains 2-3 predicted observations in the predicted ball trajectory. (An alternative is to use a heuristic that prefers balls closer to a preferred striking height for the robot, or to leave this decision entirely to a higher-level skill like the strategy skill, which could additionally specify the striking target by requesting a desired height or distance from the net for the point of strike.) These points are highlighted as light-green balls in Fig. 16. Considering multiple potential striking points allows the land-ball policy to come up with multiple striking solutions and pick the one that is most likely to satisfy the requested landing target.
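The sketch below summarizes this candidate-selection loop under assumed interfaces for the three dynamics models; the function names, the band test along the table's length axis, and the error metric used to rank candidates are illustrative choices rather than the paper's exact implementation.

import numpy as np

def plan_land_ball_strike(ball_state, landing_target, models, band_min, band_max):
    """Select a paddle strike for the requested landing target, following the
    structure of Algorithm 1.

    models is assumed to expose three callables (names are illustrative):
      predict_ball_trajectory(ball_state) -> iterable of (t, position, velocity)
      inverse_landing(ball_obs, landing_target) -> candidate paddle strike
      forward_landing(ball_obs, paddle_strike)  -> predicted landing (position, speed)
    band_min/band_max bound the striking band in front of the robot along the
    table's length axis; the concrete plane coordinates are robot-specific."""
    candidates = []
    for t, pos, vel in models.predict_ball_trajectory(ball_state):
        if not (band_min <= pos[0] <= band_max):     # keep only points inside the striking band
            continue
        strike = models.inverse_landing((pos, vel), landing_target)
        predicted = models.forward_landing((pos, vel), strike)
        error = np.linalg.norm(np.asarray(predicted) - np.asarray(landing_target))
        candidates.append((error, t, strike))
    # Execute the strike whose predicted landing best matches the request,
    # or report failure when no candidate point falls inside the band.
    return min(candidates, key=lambda c: c[0]) if candidates else None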

Figure 16: Demonstration of the Land-Ball Policy.

For each potential striking point