Reinforcement learning has been a trending topic recently with the dissemination of Deep Learning techniques [bengio2007scaling, lecun2015deep]. Various interesting applications have been made, from playing games like Atari [mnih2013playing], Go [silver2017mastering] and StarCraft [vinyals2019grandmaster], to controlling actuators (muscles) in full-body motion control simulation [lee2019scalable] or a robotic hand solving a Rubik’s cube [akkaya2019solving]. These are only some examples in a vast range of applications. Here we want to experiment with high-level control of a character in a virtual 3D environment to do simple tasks like collecting or fetching an object.
The chosen character is a dog named DogBot. It has a set of animations (stand, walk, trot, run, jump, etc.) and a traditional heuristic controller. Deep reinforcement learning is applied to it to select among the aforementioned actions at each instant to accomplish a desired task. This learned behavior can then be controlled by a higher-level heuristic or reactive pattern, which in its simplest form can enable or disable the fetch or any other behavior.
This framework can be seen as an explicit hierarchical approach (figure 1; [barto2003recent] presents an overview of hierarchical reinforcement learning with the theory of semi-Markov Decision Processes, recent techniques and challenges of this field). The lower level is the heuristic controller and animations; the mid level is the learned behaviors, each one comprising a single task; and the final level abstracts the choice among tasks. (One example of a higher-level abstraction would be using voice commands to select between tasks, similar to products like Amazon Alexa and Microsoft Cortana, but working together with previously learned behaviors.) This allows each level to be treated with some separation and freedom. One could use reinforcement learning in an end-to-end solution or mix heuristics with separately learned behaviors.
While there are plenty of successful cases of low-level training, developing such learning environments and characters demands a lot of additional work compared with the standard animation stack and also requires fine tuning. For such cases, achieving a desired artistic movement/animation can be challenging or not viable at all. Alternatively, current high level training examples are mainly used to solve already posed problems with fixed rules, such as games from a player perspective.
Our approach differs from those in its fundamental principle: it does not solve an existing game, but creates its own game environment and rules to learn a desired task. This gives the freedom of balancing between the learning difficulty, which accounts for the generalization, and the control over the actions, which are the heuristics and animations, all with access to the underlying environment model. (Access to the environment model makes it possible to shape the reward process easily and possibly avoid or soften one of the fundamental problems of reinforcement learning: credit assignment [sutton2018reinforcement].)
The mixed use of reinforcement learning and traditional animation simplifies the artistic control over the final result. It gathers the best of both worlds: the traditional animation and heuristics give control over the set of actions, whereas the reinforcement learning gives the agent the possibility of generalizing to different situations.
2 Environment and Agent Modeling
The modeling of the virtual environment and agent was done with Unity Editor 2019.3 using its physics engine simulation.
The environment is a Unity scene where the character can act and interact with objects. It consists of a square gray plane m with a white border of m diameter, which visually delimits the area of interest. While it is possible to walk past this area, if the agent goes past it, a final state (game over) is reached, leading to a reset. Figure 2 shows the top view of the environment in the Unity editor.
Inside the delimited area the following objects can be found (figure 3):
(Simple Geometry) Cube with m edge
(Complex Geometry) Coin with m diameter
The choice of objects is arbitrary, but this is an advantage of virtual environments: what is seen by the agent does not need to be the same as what is seen by a player. Hence there is freedom to use simple objects and rendering for the agent, even if a complex scene is displayed to the player. Another important point is the size of the objects: the later use of visual information at low resolution (84×84 pixels) creates the need for big objects.
An important conceptual distinction between environments is whether they are completely or partially observable. Although it is described here as an environment property, it is directly related to the agent’s perception of its world.
Completely Observable - All information of the environment state is observable.
Partially Observable - Some of the information of the environment state is observable.
One example of a completely observable environment is:
Assume the previous environment area without obstacles and with a single collectible object as the goal. If the observations are set as the agent position and direction, the direction to the target, the target position, and the distance and position of the border, the configuration is completely observable: independent of the agent state (position and direction), the agent always has complete information about itself and the target to complete its goal. (Here “all information” is relative to the information that is important to complete a given task. While it is not possible to draw the complete scene from the given observations, they are sufficient to complete the task at any state/time without the need for information such as object shape, color, etc. For this simple case it is possible to determine whether the environment is completely observable, but for complex environments this may not be trivial.) Indeed, one could even write a heuristic to solve this task.
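The heuristic mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions, not the controller used in the experiments: the function name `heuristic_action`, the reduction to 2D ground-plane coordinates, the sign convention for steering, and the `steer_gain` parameter are all hypothetical.

```python
import math


def heuristic_action(agent_pos, agent_forward, target_pos, steer_gain=1.0):
    """Return (forward, steer), each in [-1, 1], from fully observable state.

    Positions and directions are (x, z) tuples on the ground plane
    (a simplifying assumption made for this sketch).
    """
    to_target = (target_pos[0] - agent_pos[0], target_pos[1] - agent_pos[1])
    # Signed angle from the agent's forward direction to the target direction.
    cross = agent_forward[0] * to_target[1] - agent_forward[1] * to_target[0]
    dot = agent_forward[0] * to_target[0] + agent_forward[1] * to_target[1]
    angle = math.atan2(cross, dot)  # radians, in [-pi, pi]
    # Steer proportionally to the angle, clamped to the action range.
    steer = max(-1.0, min(1.0, steer_gain * angle / math.pi))
    # Move forward only when the target is in front of the agent.
    forward = 1.0 if abs(angle) < math.pi / 2 else 0.0
    return forward, steer
```

Such a rule works only because the target position is part of the observation; it has no counterpart when the agent receives an image alone.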
In contrast, following the same example, if only visual information were used (i.e., a 2D image from the agent’s vision cone), then depending on the agent position it may or may not “see” the target, and it would not know the target’s exact position. In this case the sensing of the agent varies depending on its state, and it is always a partial view of the environment which may or may not contain relevant information.
This differentiation of how the environment is perceived and how much information is available per observation is an important component of learning performance.
The agent’s goal can be put as a general mobility task: given a stimulus, it needs to reach the target point. Inside this general task two specialized tasks are implemented:
Collecting an object: reach the object position, when the agent collides with it the object is removed from the scene, i.e., collected.
Fetch: reach the object and go back to its initial position. It can also be thought of as reaching two objects, for example, the stick and then whoever threw it.
While these can all be cast as essentially the same task, their difference comes from how they are modeled inside the virtual environment and which information is available to the agent to complete them. That said, they are practical examples of mobility tasks.
The agent is an entity abstraction which is itself a behavior policy. Nevertheless, when referred to without a specific policy, it can be imagined as an entity with sensors collecting observations and actuators interacting with the environment, where a policy linking an observation to a given action can be plugged in, as shown in figure 4. (In programming, this would be equivalent to the concept of an interface which defines empty methods with input and output, where the implementation itself would be the policy.)
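The interface analogy can be made concrete with a short sketch. The names `Policy`, `Agent`, and `AlwaysForward` are hypothetical and chosen only for illustration; they do not correspond to classes in Unity or ML-Agents.

```python
from typing import Protocol, Sequence


class Policy(Protocol):
    """The 'empty method' of the interface: observation in, action out."""

    def act(self, observation: Sequence[float]) -> Sequence[float]:
        ...


class Agent:
    """Entity with sensors and actuators; the policy is plugged in."""

    def __init__(self, policy: Policy):
        self.policy = policy

    def step(self, observation: Sequence[float]) -> Sequence[float]:
        # The agent only forwards what it senses to whatever policy it holds.
        return self.policy.act(observation)


class AlwaysForward:
    """A trivial policy implementation: full forward, no steering."""

    def act(self, observation: Sequence[float]) -> Sequence[float]:
        return [1.0, 0.0]


agent = Agent(AlwaysForward())
action = agent.step([0.0] * 20)
```

Swapping `AlwaysForward` for a learned network or a hand-written heuristic changes the behavior without touching the agent's sensors or actuators.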
The important points in an agent are: what it observes to take actions, how both the observations and actions are encoded, and how the associated rewards are distributed. These topics are further discussed in the following sections.
The Unity character controller is the lowest level in the hierarchy of the agent’s actuators. It can be considered intrinsic to the agent and it controls the agent’s velocity, turn speed, gravity and other effects of the character’s physical properties. The detailed list of the parameters used in training is in section 3.1, while figure 5 shows its interface inside the Unity Editor.
An observation is any kind of sensing made from the agent itself or the environment state. Observations need to be encoded as numbers that serve as input to the behavior policy. Here, two types of observations are employed:
Vector observation: completely observable, composed of hand-crafted features:
Normalized direction to target: ,
Normalized distance to border: , ,
Normalized agent forward direction: ,
Normalized agent up direction: ,
Agent local position (agent’s position referent to the center of the environment):
Visual observation: partially observable, down-sampled from the original rendered image, 3rd-person-like camera aligned with agent forward direction:
2D image: matrix (DogBot’s experiment uses RGB images, but these virtual environments allow the use of, for example, depth maps or any other 2D input.)
The first encoding takes a total of 20 numbers (floats) as observation. These were calculated from the agent/environment state and are much like the observations taken by Puppo, the Corgi [puppo], a demo made by Unity in which a dog uses a learned behavior acting directly on its joints (low-level control); it is cited here because it is used for later comparisons. However, there are some changes to fit our modeling. First, the normalized distance to the border is added, because in an unbounded environment it is relevant for keeping the agent inside the desired area. Second, all the joint angle and torque information was removed: the Unity Puppo works with low-level control of the joint angles and torques, while this new agent works with high-level control through animations. These differences are related to how the agent senses itself, while how it senses the rest of the environment and its target remained the same.
Next, the visual encoding consists of 84×84×3=21168 numbers (floats), in a 2D RGB image. This image is the direct rendering of the agent’s view (figure 6). This kind of observation is interesting for many applications. Despite being much more complex and less complete than the first encoding in 20 numbers, because only partial information can be inferred from it, it can be more general in some aspects. For example, if a variable number of collectibles is admitted in the environment, the first encoding would not be able to handle that variation, but using a 2D image the agent could deal with many objects (because the image is already a partial view of the environment).
These observations can also be distinguished by their constraints. They can be:
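As a quick check of the sizes quoted above, the two encodings can be compared directly. The constant and function names are illustrative only.

```python
from math import prod

VECTOR_OBS_SIZE = 20            # number of floats in the vector observation
VISUAL_OBS_SHAPE = (84, 84, 3)  # height, width, RGB channels


def flat_size(shape):
    """Number of scalar inputs once an observation is flattened."""
    return prod(shape)


visual_size = flat_size(VISUAL_OBS_SHAPE)
ratio = visual_size / VECTOR_OBS_SIZE
print(visual_size, ratio)
```

The visual input is roughly a thousand times larger than the vector input, which is one reason it is more expensive to train on.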
Self-contained - Depends only on the agent state and sensors.
Environment-model dependent - Depends on the environment underlying mechanics.
With those criteria, the 2D image is an agent self-contained sensing, while the vector observation needs access to the underlying environment model to be processed and fed to the agent (e.g., the target position). This is directly related to the applicability and generalization of the agent. The disadvantage of a model-dependent agent is that it cannot be placed in a different environment where it does not have access to the underlying model.
The DogBot’s actuator is a character inside Unity. It is composed of various animations and a controller (with blend trees and the like) which receives four parameters controlling the X-axis velocity, the Y-axis rotation, and the jump/crouch booleans.
For the action encoding, two schemes were used, continuous and discrete:
Continuous action space:
Forward and backward movement:
Steering left and right:
where and are threshold parameters intrinsic to the agent controller.
Discrete action space:
Forward and backward movement:
Each item is an action branch, and branches can be chosen simultaneously. In the case of jumping/crouching, despite these being separate branches, priority is given to the jump action over the crouch, as it is not physically possible to do both at the same time. While the encoding of actions is arbitrary, it reflects the ranges of the Unity character controller, such as max velocity and turn speed, which are configurable but, from the agent’s perspective, intrinsic to its actuator.
Reinforcement learning as a whole is based on encouraging the best behavior through rewards (much as is done when teaching a trick to a pet); in other words, rewarding good actions accordingly. It can also be posed as an optimization problem of maximizing the total reward [sutton2018reinforcement].
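The branch decoding with jump priority can be sketched as follows. The branch layout (three options for movement and steering, two for jump and crouch), the index-to-value mapping, and the `max_velocity` and `max_turn` parameters are assumptions made for illustration; the text only states that branches act simultaneously and that jump overrides crouch.

```python
def decode_branches(move, steer, jump, crouch, max_velocity=1.0, max_turn=1.0):
    """Map one index per discrete branch to the four controller parameters.

    Assumed encoding: move  0=none, 1=forward, 2=backward
                      steer 0=none, 1=left,    2=right
                      jump / crouch 0=off, 1=on
    """
    velocity = {0: 0.0, 1: max_velocity, 2: -max_velocity}[move]
    rotation = {0: 0.0, 1: -max_turn, 2: max_turn}[steer]
    do_jump = bool(jump)
    # Jump has priority: crouch is ignored whenever jump is requested.
    do_crouch = bool(crouch) and not do_jump
    return velocity, rotation, do_jump, do_crouch
```

For example, `decode_branches(1, 2, 1, 1)` requests forward movement, a right turn, and both jump and crouch; only the jump is passed on to the controller.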
In this scenario two perspectives are brought up: developing a good reward signal is the key to being able to solve this optimization problem, yet most of the time it is not easy to qualify a given action and state pair individually, only the final outcome of a sequence of actions and states. In theory, even for the cases where only the final outcome is rewarded, in the limit after many (infinite) experiences it would be possible to learn an optimal behavior policy. Sadly, in the real world only limited resources are available, hence the art in learning good policies lies in modeling environments, rewards and algorithms that perform as well as possible (here good performance means reduced time and sample complexity).
The DogBot agent experiments with two types of reward:
Per action reward:
Positive reward is assigned if the agent gets closer to the target, negative reward is assigned if it distances itself from it. The formula used in this case is or if the agent achieved the objective.
Sparse reward:
Positive reward is given when the agent reaches its destination, i.e., the collectible.
In all cases, leaving the training area leads to a negative reward of ending the episode. Also a small negative reward is given at each time step . It was chosen to be close to so the accumulated penalty would not saturate the total reward signal. This is widely used as a time penalty to stimulate the completion of a task in the shortest time.
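The two reward schemes and the shared penalties can be sketched as below. This is not the formula used in the experiments: the distance-difference form of the per-action reward and every constant (`scale`, `goal_reward`, `time_penalty`, `out_penalty`) are hypothetical placeholders chosen only to illustrate the structure described above.

```python
def per_action_reward(prev_dist, dist, reached, scale=1.0, goal_reward=1.0):
    """Dense reward: positive when the agent got closer, negative otherwise."""
    if reached:
        return goal_reward
    return scale * (prev_dist - dist)


def sparse_reward(reached, goal_reward=1.0):
    """Sparse reward: non-zero only when the collectible is reached."""
    return goal_reward if reached else 0.0


def step_reward(base, left_area, time_penalty=-0.001, out_penalty=-1.0):
    """Combine a base reward with the shared penalties.

    Returns (reward, done). Leaving the training area ends the episode.
    """
    if left_area:
        return out_penalty, True
    return base + time_penalty, False
```

Note that `per_action_reward` needs the distance to the target, which comes from the environment model, while `sparse_reward` only needs to know that a collision with the collectible happened.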
It is also important to classify these rewards in another way. The first reward needs broader knowledge of the environment to be calculated, depending on both the agent’s and the environment’s underlying state. Conversely, the second could be assigned entirely from the agent’s sensing, but it brings in the credit assignment problem, which can be stated as: how much did each previous state contribute to the present reward?
This differentiation is interesting because it resembles the agent’s self-contained observation property and has implications for learning performance. In this case, even with the model-dependent reward, the agent can still be used in a new, unknown environment after the learning process.
The tools used for training were Unity Editor 2019.3 and its machine learning framework ML-Agents [juliani2018unity], version v0.13.1. All experiments use the Proximal Policy Optimization [schulman2017proximal] algorithm. Experiments were run only once given that they are computationally demanding. While the results are expected to be similar between runs, some variability should be taken into account when analyzing them.
The ML-Agents framework exposes some parameters related to its built-in models (network architecture); they are specified in section 3.1. Other specific details of the network itself are omitted because they are the default of ML-Agents and can be found in their online documentation [juliani2018unity]. This choice was made because the focus is on the development of the agent, environment and reward signal instead of network architecture.
The training of the “Unity Puppo, the Corgi” uses the same configuration presented in table 2. The environment is the same provided by [puppo], ported to the newer version of ML-Agents (v0.13.1) used here. The only change to the original environment was its training area size, to match DogBot’s area size.
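One standard way to spread a late reward over earlier steps is the discounted return, in which a reward received n steps in the future is weighted by gamma to the power n. The sketch below illustrates that idea only; the function name and the default `gamma=0.99` are assumptions, not the values used in the experiments.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates the discounted future reward.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With a sparse reward such as `[0, 0, 1]` and `gamma=0.5`, the returns are `[0.25, 0.5, 1.0]`: earlier steps receive a share of the final reward that shrinks with their distance from it.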
|Moving turn speed||deg/s||Turn speed when not stationary|
|Stationary turn speed||deg/s||Turn speed when stationary|
|Jump power||m/s||Vertical velocity applied when jumping|
|Forward Velocity||m/s||Maximum forward velocity|
|Backward Velocity||m/s||Maximum backward velocity|
|Gravity multiplier||Multiplier for gravity simulation|
|Anim speed multiplier||Multiplier for animation time scale|
|batch size (Continuous)||Number of samples used for each optimization step for continuous action space.|
|batch size (Discrete)||Number of samples used for each optimization step for discrete action space.|
|buffer size||Number of samples collected for each policy update.|
Number of neurons per hidden layer.
|num layers||Number of hidden layers used for the model.|
|learning rate||Initial learning rate for training.|
|max steps||m/s||Number of total simulation steps (actions) taken for training.|
|Number of times each collected observation is used for training.|
|time horizon||Horizon for learning, it represents how far in time steps one action can influence a past reward.|
|gamma||Discount factor, it represents how much of a n-future reward () is assigned to a present action in the form .|
|Curiosity strength||Strength of the curiosity intrinsic reward signal.|
|Curiosity gamma||Discount factor for the curiosity reward.|
|Visual encoding type||nature_cnn||Type of architecture used for the convolutional layers for the visual observation encoding.|
|Max episode length||5000||Maximum number of steps until an episode ends.|
collectibles which randomly re-spawn when collected. The evaluation metrics are the Score (the number of objects collected over all episodes) and Reset (the number of times the agent was reset due to leaving the training area). It ran for episodes or steps.
|Exp||Obs. Type||Act. Type||Train Env.||Reward Type||Test Env.||Score||Reset|
|1||Vector||Discrete||1, box||Per action||1, box||1027||61|
|2||Vector||Continuous||1, box||Per action||1, box||2324||82|
|3||Vector||Discrete||1, box||Sparse||1, box||4||670|
|4||Vector||Continuous||1, box||Sparse||1, box||0||0|
|5||Visual||Discrete||1, box||Per action||1, box||1370||7|
|6||Visual||Continuous||1, box||Per action||1, box||2883||0|
|7||Visual||Continuous||24, boxes||Sparse||1, boxes||12||0|
|8||Visual||Continuous||24, boxes||Sparse||24, boxes||7119||3|
|9||Visual||Continuous||1, box||Per action||24, boxes||6163||0|
|Exp||Obs. Type||Act. Type||Train Env.||Reward Type||Test Env.||Score||Reset|
|10||Visual||Continuous||1, box||Per action||1, box||4606||599|
|11||Visual||Continuous||24, boxes||Sparse||1, box||1502||241|
4.1 Training Convergence
Figure 7 presents the comparison of Puppo and DogBot training convergence. This result is not directly comparable with the following figure 8, because the modeling used is the same as Puppo’s, for comparison purposes, which differs from ours. (Here the modeling accounts for observation, reward, and final state, besides the changes to fit the proposed high-level control.)
High and low level control
Low-level control makes end-to-end learning possible, but in the present case it converges more slowly than high-level learning. Another downside is that it may require more work to develop such agents and fine-tune them for specific actions. For virtual scenarios and entertainment tasks it may be more convenient to use traditional animations, which are widely available and offer easier control over the final result.
Action branches and time interval
One difficulty when training with all action branches active at the same time was that the agent would not learn anything. Starting from a random policy, the agent would be stuck alternating between forward, backward, left and right. This problem with the axis movement is probably due to the interval between actions being short, not allowing the agent to commit to a given action for enough time until a strong reward signal was obtained. (This is a direct limitation of the character controller, which smooths the transition between animations; in our case, the time between transitions was greater than the agent’s decision interval.)
Nevertheless increasing the action interval would lead to unwanted delays in the reaction time of the agent. Four solutions which worked well are:
Giving a slight reward for moving forward.
Adding a bias in the actions favoring the forward movement.
Training the action branches separately.
Restricting the Y-Axis to forward movements only.
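The text does not specify how the forward bias was implemented. One possible way, shown here purely as an assumption, is to add a constant to the logit of the forward action before the softmax, so that an untrained (near-uniform) policy already prefers moving forward. The names `biased_probs`, `forward_index`, and `bias` are hypothetical.

```python
import math


def biased_probs(logits, forward_index=1, bias=1.0):
    """Softmax over the logits after boosting the forward action."""
    shifted = list(logits)
    shifted[forward_index] += bias
    # Subtract the maximum for numerical stability.
    peak = max(shifted)
    exps = [math.exp(x - peak) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# With uniform logits, the forward action becomes the most likely one.
probs = biased_probs([0.0, 0.0, 0.0])
```

Because the bias only shifts the starting distribution, the policy remains free to learn other actions once the reward signal becomes informative.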
Visual observation achieved better results than vector observation when using the same reward system. It could also learn faster in the environment with many collectibles, where vector observation cannot be applied. Although it requires more computational power, this shows that passing raw inputs may achieve better performance than using hand-crafted features.
The continuous action space achieved better results than the discrete action space, yet its convergence was slower. One point to notice is that the batch sizes for the continuous and discrete action spaces were different, and they were not extensively searched for the best result. The use of a smaller batch size for the discrete action space is justified because the larger the batch size, the more samples are averaged in the training process, and the in-between or average of discrete or categorical actions may not make sense at all; hence it is advised to use small batch sizes in these cases. On the other hand, for continuous actions the average between samples makes sense or exists. (Another way to think about it is how their sampling is done. For the discrete space each action has a given probability and the actions are disjoint; the classes themselves are not related to the encoding or numbers used to represent them. For the continuous space the final value is sampled from a distribution with a given mean and variance, hence the need to approximate it from various sample steps obtained in the training process.)
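The difference in sampling can be illustrated with a short sketch. The function names are hypothetical, and clipping the continuous value to [-1, 1] is an assumption made here for illustration.

```python
import random


def sample_discrete(probs, rng):
    """Pick one class index; the index is a label, not a magnitude."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_continuous(mean, std, rng):
    """Draw a value from a Gaussian and clip it to the action range."""
    value = rng.gauss(mean, std)
    return max(-1.0, min(1.0, value))


rng = random.Random(0)
branch_choice = sample_discrete([0.2, 0.5, 0.3], rng)
steering = sample_continuous(0.1, 0.3, rng)
```

Averaging two continuous samples gives another valid steering value, whereas averaging the indices of two discrete choices (say, "forward" and "backward") does not correspond to any meaningful action.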
Using a per-action reward strategy worked in all cases, while sparse reward only worked in the scene with multiple objects. It is clear that even with the drawback of the agent possibly exploiting the reward instead of learning the true objective a per-action strategy is more reliable.
Another crucial point is that the per-action reward developed a good exploration policy for when the agent was not seeing the collectible. In these cases it ran to the opposite side of the training area, while the agent trained with sparse reward usually got stuck running in circles until it was close enough to notice the collectible. This makes explicit how having access to the underlying environment when developing the reward system can be beneficial.
The sparse reward, given only when an object is collected, cannot be exploited, but it makes the learning process entirely dependent on the environment’s difficulty. For those cases a good approach is using a curriculum, starting with very easy tasks and increasing the difficulty as the agent’s performance increases. (Curriculum learning, proposed in [bengio2009curriculum], is an approach of increasing the environment difficulty according to the agent’s learning.) An example of such a curriculum could be starting the environment with many objects or a small area and decreasing the number of objects or increasing the area as the agent learns.
Multiplicity of collectibles
Having multiple objects in the scene completely changed the result of using sparse reward, from unable to learn anything to achieving the best result in the test environment with multiple collectibles.
Giving the agent an environment where it can often achieve a goal even with a random policy is the key point for using sparse reward. It is a good strategy to combine with curriculum learning; in this case, the difficulty level could be easily controlled by the number of collectibles. While it was not tested here, it is plausible that an agent trained with such a curriculum could perform better, especially in the test case where it performed badly with a single collectible.
Two variations were tested: changing the size of the environment and the number of collectibles. Those variations showed interesting behaviors.
First, the agent trained with multiple collectibles (which was not able to perform well in the single-collectible environment) had a good performance when the size of the environment was reduced. From visual inspection, the agent appeared to be short-sighted, and this effect was attenuated by testing on a smaller area.
Then, the agent trained with a single object performed very well with multiple collectibles, but not as well as the one trained with many objects. Also from visual inspection, the agent would only turn to the right, which indeed works but is not the optimal behavior with multiple objects. In the case of a single object, which most of the time was far away, this would not change much, but in the case of multiple objects it was a limiting factor.
These behaviors reinforce the idea of the need for variation in the training environment. They also show how details of the environment can introduce undesirable side effects or completely ruin the learning of the desired behavior.
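A curriculum controlled by the number of collectibles could look like the sketch below. It was not part of the experiments; the function name, the score threshold, and the step size are hypothetical.

```python
def next_num_collectibles(current, mean_score, threshold, step=4, minimum=1):
    """Reduce the number of collectibles once the agent scores well enough.

    Fewer collectibles means sparser rewards, i.e. a harder environment.
    """
    if mean_score >= threshold:
        return max(minimum, current - step)
    return current


# Example schedule: start easy with 24 objects and move toward a single one.
count = 24
for mean_score in [3.0, 9.0, 10.0, 12.0]:
    count = next_num_collectibles(count, mean_score, threshold=8.0)
```

Starting from 24 objects, the count stays unchanged while the mean score is below the threshold and drops by `step` each time the threshold is met, never going below `minimum`.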
Although the agent could somewhat generalize well to multiple and single objects it wasn’t as good as the performance of the trained environment. When running on a smaller area it could not cope with the delimited marks and the agent left the area much more than when using the original size. One possible cause is the intrinsics of the character controller which does not allow the agent to turn fast and was accentuated by the decrease in the training area.
To achieve good generalization, the agent must experience great variation of the environment and goal, which was only partially provided. Yet, this exposure needs to be done without a steep difficulty variation, or the agent may not be able to learn anything.
Some factors affecting the agent’s behavior were covered by the agent and environment design, yet the coverage is far from exhaustive. Other factors, such as network architecture and algorithms, were not evaluated. Using architectures with memory and different algorithms could have influenced both the convergence and the quality of the learned policies. One example of a possible improvement from architectures with memory is the case where the agent runs in circles: it is reasonable to look around when the agent cannot see the goal, but doing so repeatedly could be avoided if memory of past states were present.
One of the key features to develop is certainly the reward system. Making the agent able to receive feedback even with completely random behavior is a must, be it through a per-action reward or an easier environment where rewards are less sparse. Even so, it may get stuck if the action space is complex or if there is not enough time to commit to each individual action. This was exemplified by the lack of convergence when using the full action space despite having a per-action reward. Nevertheless, using sparse reward with an increased number of objects in the scene led to convergence.
The type of observation used also showed some less intuitive behavior: while the vector observation had all the data needed to heuristically solve the problem, the visual observation had a better result. This may be because the vector encoding, while compact and human-understandable, may not be the best for learning, since it relies on common human knowledge and concepts. Letting the model extract the features from visual observation is computationally expensive for training, but it achieved better policies and is easily extensible to new environments.
Modeling and training agents to complete tasks in the way one would expect a human being to do so is a complex problem. Good policies were learned, although side effects and less intuitive results were obtained too. This makes explicit how an agent can exploit unnoticed details of its environment in unpredictable ways, for good, bad and ugly things. Although the agent may solve the given task, expecting human-like (in our case dog-like) behavior may mislead the modeling and comprehension of the results.
Time and sample complexity are common metrics for evaluating performance, but human resources should also be taken into account. Developing good environments demands a good amount of human resources, which can be more expensive than computational resources. This alone can motivate the research for general good practices and guides.
Here, neither the neural network architecture/parameters nor the environment graphics complexity was evaluated, only the modeling of the environment, agent and reward. Certainly these aspects are important, but much has been done in those areas already and could be easily “plugged in” or replaced in the DogBot agent, for example a more complex visual system such as a segmentation/detection neural network when using photo-realistic rendering.
Last, there is no perfect formula to achieve good results. Part of it is the art of modeling the system and part is building insight from past failures and adapting, which itself is much the idea of (meta) reinforcement learning.