
VECA: A Toolkit for Building Virtual Environments to Train and Test Human-like Agents

05/03/2021
by Kwanyoung Park, et al.
Seoul National University

Building human-like agents, which aim to learn and think like human intelligence, has long been an important research topic in AI. To train and test human-like agents, we need an environment that exposes the agent to rich multimodal perception and allows comprehensive interactions, while also being easily extensible for developing custom tasks. However, existing approaches do not support comprehensive interaction with the environment or lack variety in modalities. Moreover, in most of them it is difficult or even impossible to implement custom tasks. In this paper, we propose a novel VR-based toolkit, VECA, which enables building fruitful virtual environments to train and test human-like agents. In particular, VECA provides a humanoid agent and an environment manager, enabling the agent to receive rich human-like perception and perform comprehensive interactions. To motivate VECA, we also provide 24 interactive tasks, which represent (but are not limited to) four essential aspects of early human development: joint-level locomotion and control, understanding contexts of objects, multimodal learning, and multi-agent learning. To show the usefulness of VECA for training and testing human-like learning agents, we conduct experiments on VECA and show that users can build challenging tasks for engaging human-like algorithms, and that the features supported by VECA are critical for training human-like agents.



Introduction

| Environment | 3D | Large-scale | Extensible | Physics | FPP Vision | Audio | Tactile | Multi-agent | Interaction |
|---|---|---|---|---|---|---|---|---|---|
| ALE [ALE] | X | X | X | X | X | X | X | X | X |
| DeepMind Lab [DeepMindLab] | O | X | △ | X | O | X | X | X | X |
| OpenAI Universe [OpenAIUniverse] | O | O | O | X | O | X | X | X | X |
| VizDoom [VizDoom] | O | X | O | X | O | X | X | O | X |
| Arena [Arena] | O | X | O | O | X | X | X | O | X |
| Malmo [Malmo] | O | O | O | X | O | X | X | X | O |
| Gibson [GibsonEnv] | O | O | X | X | O | X | X | X | X |
| MINOS [MINOS] | O | O | X | X | O | X | X | X | X |
| House3D [House3D] | O | O | △ | X | O | X | X | X | X |
| HoME [HoME] | O | O | △ | O | O | O | X | O | X |
| AI2-THOR [AI2-THOR] | O | X | O | O | O | X | X | O | O |
| VECA | O | X | O | O | O | O | O | O | O |

Table 1: Comparison of various virtual environments with VECA. O indicates that the environment supports the feature and that the feature reflects the corresponding characteristic of human perception; X indicates that it does not. △ indicates that the user can add novel environments and tasks, but only with difficulty, due to limitations such as the lack of a visual editor or a heavily specialized API. 3D: supports a 3D environment. Large-scale: provides a large number of environments. Extensible: users can implement and add novel environments and tasks. Physics: supports physical properties (e.g., collision, friction). FPP Vision: renders first-person-perspective vision. Audio: renders auditory perception. Tactile: renders tactile perception. Multi-agent: supports multiple agents in a single environment. Interaction: supports comprehensive interactions; for example, in Malmo, hitting with a pickaxe breaks a block when targeting a block but deals damage when targeting a monster.

It has long been an important research topic to understand how humans learn and to build human-like agents [BuildingMachine; HumanLikeAgent]. In particular, human intelligence has been a role model for many modern learning machines as an interpretable and data-efficient general intelligence. For example, deep learning methods, inspired by the structure of the human brain, have outperformed previous state-of-the-art algorithms and even humans in various domains, such as image classification [ImagenetSurpass] and complex games such as Go or StarCraft II [AlphaGo; Starcraft2]. Although there is no need to precisely duplicate human intelligence (which is error-prone and imperfect), human intelligence remains an attractive target from which to learn and draw inspiration about how learning works [DeepLearningCritical]. Human intelligence learns by experience: collecting rich multimodal perception (such as vision, audio, and touch) from an environment [ObjectPerception; PerCog] and actively interacting with it [Active1; LearningPlay]. For instance, developmental psychologists have found that a general understanding of objects develops at an early stage in toddlers without any supervision, through multimodal feedback received while interacting with objects, e.g., mouthing, chewing, and rotating them [Gibson; Piaget]. Likewise, it has been argued that artificial intelligence could benefit from multimodal [de2017guesswhat; ngiam2011multimodal] and interactive [caselles2019symmetry; hermann2017grounded] learning. To build agents that learn like humans, it is therefore crucial to provide the agent with rich multimodal perception and interactions.

Although prior works reveal that human-like agents learn from rich multimodal perception and active interaction with the environment, we still lack an understanding of how and what the agent learns. In particular, there are important unexplored problems, such as how human-like agents should be evaluated and through which tasks they should be trained. Such limitations motivate us to develop an extensible toolkit, rather than a fixed set of tasks, to support the research community in training and testing novel approaches and algorithms for human-like agents.

However, it is challenging to design an environment that can provide the agent with rich multimodal perception and comprehensive interactions while also being extensible for developing custom tasks. An intuitive way would be to develop a robot with an actionable body and sensors, but building such a human-like robot is highly costly. It is also difficult to test premature agents that may break the robot's hardware. Moreover, accelerating the training process is not straightforward, since the robot has to perceive and interact in the real world. Another promising approach is to train the agent in a virtual environment. Existing environments, such as game environments for reinforcement learning [ALE; OpenAIUniverse; DeepMindLab; Malmo; VizDoom; Arena], provide diverse games to evaluate the agent, and some of these platforms are extensible. However, those environments oversimplify the dynamics and perception of game avatars, which makes them inappropriate for developing human-like agents. There have been recent efforts to simulate realistic indoor environments [GibsonEnv; AI2-THOR; HoME; House3D; MINOS], but they do not support active interactions or lack rich multimodal perception, and they also make it hard or even impossible to implement custom tasks.

In this light, we propose a novel VR-driven toolkit, VECA, which enables training and testing emerging human-like agents in a virtual environment. VECA provides essential features for training human-like agents: 1) rich human-like perception, 2) comprehensive interaction capability with the environment, and 3) extensibility for implementing various custom tasks. Using VECA, developers can easily create a custom environment where an agent can take rich sensory inputs, learn cognition while interacting with the environment, and perform the actions necessary to solve complex tasks. More specifically, VECA is implemented on top of the Unity engine, which provides not only realistic physics and rendering but also a user-friendly visual editor. VECA provides a humanoid agent equipped with a humanoid avatar, which receives human-like perception and supports joint-level physical actions as well as animation-based actions. To accurately simulate the agent's perception and action, VECA internally simulates the environment using low-level Unity APIs instead of Unity's default simulation loop. Moreover, VECA provides a network-based Python API, which enables training Python-coded agents on VECA environments from external servers.

To motivate and demonstrate VECA, we showcase a set of environments equipped with 24 tasks for human-like agents, which reflect essential features of human learning (e.g., understanding contexts of objects, multimodal learning, multi-agent learning, and joint-level locomotion and control). Those environments can be used directly or modified to train and test human-like learning algorithms. We show the usefulness of VECA for training and testing human-like learning agents with various use cases. In particular, we analyze the results on the proposed tasks with widely used reinforcement learning algorithms and study the effect of the quality of perception. Our analysis shows that performance varies noticeably across tasks, from easily solvable to highly challenging ones. We also show that the agent's performance drops significantly when the spatialization and stereo features of audio perception are removed, and likewise when tactile perception is removed. These results show that users can build challenging tasks for engaging human-like algorithms using VECA, and that the perceptual features supported by VECA are critical for building environments for human-like agents. The contributions of this work can be summarized as follows:

  • We propose a novel VR-based toolkit named VECA to train and test human-like agents in a virtual environment. VECA is the first tool that provides rich human-like perception and interaction capability through a human avatar, and it can serve as a cornerstone for developing innovative human-like models and algorithms.

  • We provide a network-based Python API, which can be used to train agents from external servers without graphics devices.

  • We provide a set of tasks, datasets, and playgrounds, which can be directly used or modified to train and test various human-like learning algorithms.

  • Through various experiments, we show that users can build challenging tasks for engaging human-like algorithms using VECA, and that the features of VECA are important for training human-like agents.

Background & Related Works

Researchers have used various means to train artificial intelligence: (1) datasets, (2) real robots, and (3) virtual environments. However, none of these are fully suitable for developing human-like agents, motivating us to build VECA to promote research on next-generation human-like agents.

Pre-collected datasets.

The most common approach to training an agent is to use pre-collected datasets such as ImageNet [Imagenet] and AudioSet [AudioSet]. However, this approach requires developers to collect a large volume of data to build accurate models, which is costly and time-consuming. Moreover, it is inappropriate for training agents that learn by interaction.

Training in reality.

Another approach is to use a robot and train it in the real world [iCub; iCubTactile]. However, building robot hardware requires significant time, effort, and monetary cost. Furthermore, training agents in reality makes the training process hard to parallelize or accelerate, since the resources and time scale of the training process are bound to those of the real world. These difficulties motivate the use of virtual environments.

Game-based environments.

Recently, several game-based environments [ALE; OpenAIUniverse; DeepMindLab; Malmo; VizDoom; Arena] have been proposed and adopted to train and test agents with various learning algorithms (e.g., reinforcement learning [PPO], imitation learning [GAIL]). However, the tasks, dynamics, and perception in such game environments are oversimplified and do not transfer to real-world problems.

Realistic Indoor Simulators.

To overcome the problems of game-based environments, realistic virtual environments such as AI2-THOR [AI2-THOR], MINOS [MINOS], House3D [House3D], Gibson [GibsonEnv], and HoME [HoME] have been proposed. On top of those, VECA aims to go one step further by addressing critical aspects of human learning. Among these, the environments most closely related to VECA are HoME and AI2-THOR. HoME provides a large-scale multimodal environment that renders audio perception and enables physical interactions. However, it lacks tactile perception and diversity of interactions (physical interaction only), and implementing novel tasks in it is hard due to the lack of a visual editor. AI2-THOR supports object-specific interactions, which lets the agent experience diverse circumstances, but it does not support various modalities. Moreover, these environments do not reflect human characteristics, since they were not designed for this purpose. In short, VECA aims to support human-like modalities and diverse interactions while remaining extensible, so that custom tasks can be implemented easily.

VECA Toolkit

Figure 1: VECA architecture.

The VECA toolkit provides useful features for developing virtual environments to train and test human-like agents. As shown in Fig. 1, VECA provides a humanoid agent and an environment manager, enabling the agent to receive rich human-like perception and perform comprehensive interactions. To provide an extensible toolkit, we chose Unity, a state-of-the-art game engine equipped with realistic rendering, physics, and a user-friendly visual editor, as the engine for simulating and developing environments. On top of that, users can train and test Python-coded agents from external servers using the network-based communication interface provided by VECA.

Humanoid Agent

This component simulates an actionable agent with multimodal perception, which can be trained by human-like learning algorithms. The agent interacts with the virtual environment through its humanoid avatar, which is capable of performing joint-level actions as well as pre-defined actions such as walking and kicking. The agent receives four main perceptions: vision, audio, tactile, and proprioception (including vestibular senses), each of which can be perceived under various configurations.

Humanoid Avatar & Action.

We modeled a humanoid avatar with an accurate collider and human-like joint-level motion capability, which allows the agent to perform joint-level actions and physically interact with the environment. Although there are many humanoid assets in Unity, they are mainly designed for games and do not precisely model bodies and actions. We therefore modeled a humanoid agent with the Blender3D software and integrated it into a Unity asset, which is easy to customize. Specifically, the avatar has 47 bones with 82 degrees of freedom, with hard constraints similar to human joints, and a skin mesh represented with 1648 triangles that adaptively changes according to the orientation of the bones. However, training an agent purely from joint-level actions and physical interactions may not be practical for complex tasks. To mitigate the user's burden when training such tasks, we also support an animation-based agent, which provides stable primitive actions (e.g., walk, rotate, crawl) and interactions (e.g., grab, open, step on) by trading off physical plausibility. Specifically, we classify objects by their mass: for light objects, the agent is treated as kinematic relative to the object and applies the collision force only to the object; for heavy objects, the object is treated as kinematic relative to the agent and pushes the agent out of the collision. On top of that, VECA provides a user-friendly interface where users can implement interactions between the agent and objects. Similar to AI2-THOR [AI2-THOR], an object is interactable with the agent when it is visible (rendered in the agent's vision and closer than 1.5m) and not occluded by opaque objects. When there are multiple interactable objects, the agent can choose among them or follow the default order (closest to the center of the viewport).
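To make this default rule concrete, the following minimal Python sketch implements the selection logic described above. The object attributes (is_rendered, is_occluded, distance_to, viewport_xy) are hypothetical names introduced for illustration, not part of VECA's actual API.

```python
import numpy as np

def default_interactable(objects, agent):
    """Pick the default interaction target: among objects that are visible
    (rendered and within 1.5 m) and not occluded by opaque objects, choose
    the one closest to the center of the viewport. Attribute names here
    are hypothetical placeholders, not VECA's API."""
    candidates = [o for o in objects
                  if o.is_rendered and o.distance_to(agent) < 1.5
                  and not o.is_occluded]
    if not candidates:
        return None
    center = np.array([0.5, 0.5])  # normalized viewport center
    return min(candidates,
               key=lambda o: np.linalg.norm(np.asarray(o.viewport_xy) - center))
```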

Vision.

To mimic human vision, we implement binocular vision using two Unity RGB cameras. The core challenge of the vision implementation lies in simulating the diverse variants of human vision in the real world. For instance, human vision varies with age: an infant's vision has imperfect color and sharpness [babyColor; babyBlur]. To simulate these various factors, we design multiple visual filters (e.g., changing focal length, grayscale, blur) and allow users to combine these features to simulate a particular vision system.
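As an illustration of how such filters might be composed, the sketch below approximates an infant-like view by desaturating and blurring an RGB frame. The luminance weights are the standard Rec. 601 coefficients; the function name and parameter values are assumptions for illustration, not VECA defaults.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def infant_vision_filter(rgb, blur_sigma=2.0, gray_mix=0.7):
    """Approximate an infant-like view from an RGB frame (H, W, 3) in [0, 1]:
    reduce color saturation toward grayscale, then blur to reduce acuity."""
    luminance = rgb @ np.array([0.299, 0.587, 0.114])            # (H, W)
    desaturated = (1 - gray_mix) * rgb + gray_mix * luminance[..., None]
    # Blur each channel independently to mimic imperfect sharpness.
    return np.stack([gaussian_filter(desaturated[..., c], blur_sigma)
                     for c in range(3)], axis=-1)
```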

Audio.

The auditory sense allows humans to recognize events in their blind spot and even to roughly estimate an audio source's position. This is possible because the auditory system uses the time and intensity differences between the two ears and receives sound shaped by the structure of the head [HRTF]. A common approach to simulating hearing is to use the AudioListener provided by Unity together with the Unity SDK for 3D spatialization. However, we found that this approach makes it impossible to design tasks with multiple agents or to simulate multiple environments in a single application, since Unity supports only one listener per scene. To provide a rich, human-like auditory sense to the agent, we address these problems in two ways. First, we adopt the LISTEN HRTF (head-related transfer function) dataset [LISTEN-HRTF] to spatialize audio according to the structure of the human head. Second, we allow multiple listeners in a scene and support various effects such as 3D spatialization and reverb. To support these features, each listener maintains a list of audio sources, calculates the distance from the agent and the room impulse response (RIR) of each audio source, and simulates the audio using this information. Note that we also enable developers to listen to the agent's audio data on their audio devices for debugging purposes.
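The per-listener pipeline described above (distance attenuation, room impulse response, then ear-specific HRTF filtering) can be sketched as follows. This is a simplified sketch assuming time-domain impulse responses (e.g., HRIRs derived from the LISTEN dataset) and a simple 1/r falloff; it is not VECA's actual implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono, hrir_left, hrir_right, rir, distance):
    """Render binaural audio from a mono source: attenuate by distance,
    apply the room impulse response, then convolve with the left/right
    head-related impulse responses. All arrays are 1-D float signals at
    a common sample rate."""
    attenuated = mono / max(distance, 1e-3)      # simple 1/r falloff (assumption)
    reverberant = fftconvolve(attenuated, rir)   # room acoustics
    left = fftconvolve(reverberant, hrir_left)   # ear-specific filtering
    right = fftconvolve(reverberant, hrir_right)
    n = max(len(left), len(right))
    out = np.zeros((2, n))
    out[0, :len(left)], out[1, :len(right)] = left, right
    return out
```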

Tactile.

Tactile perception plays an important role when the agent physically controls and interacts with objects, but it is challenging to model in Unity. A common approach to implementing tactile perception is to model the body as a system of rigid bodies and to compute the force during a collision by dividing the collision impulse by the simulation interval. However, this makes the tactile reading sensitive to the simulation interval, since rigid-body collisions happen instantaneously. For example, if we halve the simulation interval for a more accurate simulation, the computed collision force doubles even though the same collision occurred. Moreover, Unity only reports the total impulse of a collision, so the tactile data would be computed incorrectly when there are multiple contact points, which frequently happens in real-life situations (e.g., a hand grabbing a phone). To simulate tactile perception, we instead approximate the collision force using Hooke's law, which has been adopted not only for implementing virtual tactile sensors [Tactile1] but also in real tactile sensors [Tactile2]. Specifically, we model the human body as rigid bones covered with flexible skin (which obeys Hooke's law). When a collision occurs in the soft skin, we calculate the force using Hooke's law. If the collision occurs at the bone, the collision is handled by the physics engine and the force is set to the maximum force. We normalize the tactile input by dividing the force by the maximum force; since Hooke's law makes the force proportional to the spring displacement x, this normalized pressure, with maximum displacement x_max, is

p = F / F_max = min(x / x_max, 1).   (1)

For the positions of the sensors, we distribute six sensors in a triangular formation on each triangle of the agent's mesh. Each sensor senses the component of the normalized pressure (Eq. 1) that is normal to its triangle.
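A sketch of the per-taxel computation implied by Eq. 1 follows, with the Hooke's-law normalization clipped at the rigid (bone) limit; the function and argument names are illustrative.

```python
import numpy as np

def taxel_pressure(displacement, max_displacement, force_dir, triangle_normal):
    """Normalized tactile reading for one taxel (Eq. 1): with Hooke's law
    F = k*x and F_max = k*x_max, the normalized force reduces to
    x / x_max, clipped to 1 when the collision reaches the rigid bone.
    The taxel then senses only the component normal to its triangle."""
    p = np.clip(displacement / max_displacement, 0.0, 1.0)
    # Project onto the taxel's triangle normal (both unit vectors assumed).
    return p * max(0.0, float(np.dot(force_dir, triangle_normal)))
```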

Proprioception.

VECA also provides proprioceptive sensory data. The raw forms of the human proprioceptive sense, such as pressure signals from muscles and joints, are hard to simulate and would add unnecessary noise to the desired proprioceptive information. Thus, we directly provide physical information, such as bone orientations and the current angle and angular velocity of each joint, as proprioception.

Figure 2: Workflow diagram of how VECA can be used in training and testing human-like agents. On the environment side, the user designs tasks and implements the environment using Unity assets. Using VECA, users can import a humanoid agent with human-like perception and interaction and define interactions between objects and the agent, without worrying about the management of the environment's inner simulation loop. On the algorithm side, the user designs and implements novel human-like learning algorithms in Python. Using VECA, the user can connect those algorithms to the environment.

Environment Manager

We design an environment manager that collects observations and performs actions with the avatar in the Unity environment. The main challenge is that the time interval between consecutive frames fluctuates in Unity. This is because Unity focuses on providing perceptually natural scenes to the user and adapts the frame rate to the available computing power. Communication latency also affects the time intervals; for instance, when there is a large communication delay, the simulation time is fast-forwarded to compensate for the delay. The simplest fix would be to require users to implement the environment in FixedUpdate() (which is called at a fixed rate), but this approach leaves the Update() function, which is the one mainly used in Unity, out of sync. To address the problem, we separate the clock of the virtual environment from wall-clock time by implementing a time manager class, VECATime, which replaces Unity's Time class. Specifically, we enable VECA to simulate time-dependent features (e.g., physics, audio clips, animation) with constant time steps. For physics, VECA uses the physics simulator provided by Unity to advance physics by a constant time step. For other time-dependent features, VECA searches for all objects with corresponding features in the environment and explicitly controls their simulation time to enforce a constant time interval.
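The idea behind VECATime can be summarized with a short sketch: the simulated clock advances by a constant step per simulation tick, regardless of wall-clock frame timing or communication delays. The class and method names below are illustrative, not VECA's source.

```python
class VECATimeSketch:
    """Minimal sketch of a fixed-step simulated clock, decoupled from
    wall-clock time. Names are illustrative placeholders."""
    def __init__(self, dt=0.02):
        self.dt = dt
        self.now = 0.0   # simulated time, independent of real time

    def step(self, time_dependent_objects):
        self.now += self.dt
        # Advance every time-dependent feature (physics, audio clips,
        # animations) by exactly dt, regardless of real frame timing.
        for obj in time_dependent_objects:
            obj.simulate(self.dt)
```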

Communication Interface

This component manages the data flow between the agent algorithm (Python) and the environment (Unity) and provides a socket-based network connection to support training agents on remote servers. Since Unity does not support rendering visual information on servers without graphics devices, the Unity environment needs to run on a local machine with a display to produce visual data or visualize the training status. This component enables communication between the agent on the remote server and the Unity environment on the local machine.
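A minimal sketch of the kind of socket framing such an interface needs, assuming a simple length-prefixed TCP protocol; VECA's actual wire format is not specified here, so this framing is an assumption for illustration.

```python
import socket
import struct

def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send one length-prefixed message (4-byte big-endian length header),
    e.g., a serialized action or observation blob."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock: socket.socket) -> bytes:
    """Receive one length-prefixed message, blocking until complete."""
    header = sock.recv(4, socket.MSG_WAITALL)
    (length,) = struct.unpack("!I", header)
    return sock.recv(length, socket.MSG_WAITALL)
```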

Figure 3: Set of tasks and playgrounds supported by VECA.

How to Use VECA

VECA provides user-friendly interfaces to design custom environments, tasks, agents, and object-agent interactions, as shown in Fig. 2. First, users create a virtual environment. For this, users can rely on Unity, which has many existing assets and APIs for implementing virtual environments. Second, users define the perceptions and actions of an agent. For this, VECA provides a humanoid avatar and useful APIs to customize it. Specifically, users need to implement three main functions (similar to ML-Agents [ML-Agents]): (1) AgentAction(action) to make the agent perform a particular action, (2) CollectObservation() to collect various sensory inputs, and (3) AgentReset() to reset the agent. In particular, users can implement AgentAction(action) by importing existing interactions or by implementing new interactions over the VECAObjectInteract interface. Finally, the user builds the environment (including the agent) as a standalone application. These environments can then be used to train agents with novel learning algorithms in Python, where many machine learning libraries exist, using the Python API provided by VECA.
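On the Python side, a training loop over a built VECA environment might look like the sketch below. The import path, class name, and method signatures are hypothetical placeholders, since only the existence of the network-based Python API is stated above.

```python
# Hypothetical Python-side training loop; names below are illustrative
# placeholders, not VECA's documented API.
from veca.env import VECAEnvironment   # hypothetical import path
from my_algorithms import PPOAgent     # hypothetical user-implemented agent

env = VECAEnvironment(ip="127.0.0.1", port=8870)   # connect to the built Unity app
agent = PPOAgent(obs_spec=env.observation_spec(), act_spec=env.action_spec())

obs = env.reset()
for _ in range(100_000):
    action = agent.act(obs)                  # multimodal observation -> action
    obs, reward, done = env.step(action)     # advance one fixed simulation step
    agent.observe(obs, reward, done)
    if done:
        obs = env.reset()
```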

Various Tasks Towards Human Intelligence

To motivate VECA, we provide various tasks that represent building blocks of human learning, as shown in Fig. 3. We focus on four essential aspects of early human development: joint-level locomotion and control, understanding contexts of objects, multimodal learning, and multi-agent learning. Note that these categories are neither disjoint nor unique.

Joint-level Locomotion & Control.

We provide a set of tasks featuring joint-level locomotion and object-humanoid interaction with diverse objectives. Humanoid agent control has been studied both for human assistance and to understand the underlying human cognition. However, developing human-level control is extremely challenging [akkaya2019rubiksOpenAI]. Prior environments are domain-specific and make it difficult to integrate human-like aspects. VECA supports joint-level dynamics down to the finger knuckles, physical interactions, and tactile perception. With this capability, we build a set of challenging joint-level control tasks {TurnBaby, RunBaby, CrawlBaby, SitBaby, RotateCube, SoccerOnePlayer}. We also incorporate multimodal learning tasks (GrabObject) and tasks related to understanding objects (PileUpBlock, PutDownBlock).

Understanding Contexts of Objects.

We provide various tasks {ColorSort, ShapeSort, ObjectNav, BabyZuma's Revenge, MazeNav, FillFraction} which aim at understanding the contexts of objects. Learning the abstract contexts of objects is one of the key features of human learning [ContextHuman1; ContextHuman2] and is receiving increasing attention in artificial intelligence [ContextAI1; ContextAI2]. Thanks to the extensibility of VECA, users can implement tasks and environments with various useful contexts, including properties (ColorSort, ShapeSort, ObjectNav), functionality (BabyZuma's Revenge), and mathematical meaning (FillFraction).

Multimodal Learning.

For multimodal learning, we provide tasks {KickTheBall, ObjectPhysNav, MoveToTarget} which feature learning how to incorporate multiple sensory inputs. Multimodal learning plays a big role in human learning [PerCog; ObjectPerception] and has also been suggested to be beneficial for training artificial intelligence [de2017guesswhat; ngiam2011multimodal]. Using the rich perception provided by VECA, users can develop tasks and environments for multimodal learning, such as vision-audio (KickTheBall) and vision-tactile (ObjectPhysNav, MoveToTarget) learning.

Multi-Agent Reinforcement Learning.

We provide a set of multi-agent RL tasks in which agents need to cooperate or compete with others to solve the task. Multi-agent learning is critical in human development [MultiAgentHuman1; MultiAgentHuman2] and is a rising topic for artificial general intelligence [MultiAgentAI1; MultiAgentAI2]. However, it is difficult to create a multi-agent task and to integrate multimodal perception and active interactions (including interactions between agents) into the task. With VECA's multi-agent support, we build a competitive task (SoccerTwoPlayer), a cooperative task (MultiAgentNav), and a mixed competitive-cooperative task (PushOut). Some of them also include other features, such as multimodality (MultiAgentNav) or joint-level physics (SoccerTwoPlayer).

Supervised Learning.

Although VECA is designed for interactive agents, supervised learning still plays an important role in training/testing machine-learning agents. For example, some users may pre-train their agent with supervised tasks or evaluate their agent using supervised downstream tasks. In this light, we also provide labeled datasets collected from VECA for several supervised learning problems, such as image classification, object localization, sound localization, and depth estimation. See the appendix for a detailed description of the datasets.

Playgrounds

VECA provides four exemplary playgrounds, as shown in Fig. 3. The playgrounds vary in the number of props and furniture and in their structure. This allows controlling the complexity of tasks such as ObjectNav, KickTheBall, and MultiAgentNav, i.e., the same task can be tested at different levels of complexity. For details of the playgrounds, please refer to the appendix.

Experiments

Figure 4: Training progress, measured by average reward, of agents trained with PPO (Proximal Policy Optimization) and SAC (Soft Actor-Critic). The agents are trained on a subset of the tasks provided by VECA (ObjectNav, KickTheBall, MultiAgentNav, GrabObject), which represent essential aspects of early human development. Best viewed in color.
Figure 5: Learning curves (average reward by step) for the KickTheBall task with different qualities of auditory perception, and for the GrabObject task with and without tactile perception. Best viewed in color.

We performed a set of experiments to evaluate the effectiveness of VECA for training and testing human-like agents. Note that although it would be ideal to evaluate the agents with human-like learning algorithms, we used conventional reinforcement/supervised learning algorithms, since human-like algorithms are yet to be fully developed, in part due to the lack of tools like VECA.

Results on Tasks for Human-like Agents

Experiment Setup.

Among various RL algorithms, we applied PPO [PPO] and SAC [SAC], widely used state-of-the-art reinforcement learning algorithms that can be applied to the example tasks supported by VECA. For the tasks, we picked one representative task from each of the aforementioned task categories: GrabObject, ObjectNav, KickTheBall, and MultiAgentNav. We experimented with all compatible playgrounds to also observe the effect of the playground on the agent's performance.

Results.

As shown in Fig. 4, the performance of the learning algorithms shows noticeable differences, ranging from easily solvable tasks (ObjectNav, KickTheBall) to highly challenging ones (MultiAgentNav, and ObjectNav and KickTheBall in VECAHouseEnv and VECAHouseEnv2). This shows that there is room for novel human-like learning algorithms to step in, and it also indicates that, using VECA, users can build challenging tasks for engaging human-like agents. Also, for the KickTheBall and ObjectNav tasks, the agent's performance and training speed differ significantly across playgrounds. In detail, learning the KickTheBall task in the VECASimpleEnv and VECASingleRoomEnv playgrounds shows a noticeable performance improvement, while the performance increase is negligible in the VECAHouseEnv and VECAHouseEnv2 playgrounds. This shows that users can moderate the difficulty of a task via the playgrounds.

Effectiveness of Multimodal Perception

Experiment Setup.

We compared the performance of agents trained with PPO under varying qualities of auditory and tactile perception, to show that the perception provided by VECA plays an essential role in training human-like agents. For auditory perception, we trained an agent to perform the KickTheBall task in VECASingleRoomEnv with varying audio quality: stereo + HRTF, stereo (without HRTF), and mono (without spatialization). For tactile perception, we trained an agent to perform the GrabObject task with and without tactile perception. We evaluated the performance of each agent by its average reward. For technical details, please refer to the appendix.

Results.

As shown in Fig. 5, the existence and quality of perception play an important role in the agent's performance. For auditory perception, the agent without spatialized audio obtained a low score, since it could not use audio information to infer the direction of the ball. The agent without HRTF was able to use audio information to reach a high score, but it required much more training time to do so. The agent with fully spatialized audio reached the maximum score within a short training time. For tactile perception, there is a noticeable difference between agents with and without tactile perception, both in final performance and in training speed. These results show that the perceptual features provided by VECA are critical to the agent's performance.

Conclusion

We proposed VECA, a VR-based toolkit for building virtual environments to train and test emerging human-like agents. VECA provides a virtual humanoid agent with rich human-like perception and joint-level physics, together with an environment for the agent to interact with, facilitating the development of human-like agents. VECA also provides an environment manager that handles the internal simulation loop of the environment, and a communication interface that enables users to train the agent using Python-based learning algorithms. Our experiments show that various tasks towards human intelligence can easily be generated with VECA and that they are challenging to solve with recent RL algorithms. Moreover, the multimodal perception of VECA plays an important role in training human-like agents. We believe the features of VECA will be useful for developing environments to train and test human-like agents.

Acknowledgement

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01371, Development of brain-inspired AI with human-like intelligence).


Dataset for Supervised Learning

We collected the data in VECASimpleEnv to prevent multiple objects from being included in the vision datasets. Each dataset consists of 80,000 image/audio/tactile samples, collected from various perspectives and orientations.

Object classification (Vision).

The agent has to classify images by the object they contain. There are three types of objects: doll, ball, and pyramid.

Object classification (Tactile).

The agent has to classify four shapes: pyramid, sphere, cube, and cylinder. The object is dropped from above the hand, and the agent receives the tactile data sensed by the hand during 128 time steps (0.512s in total).

Distance estimation (Vision).

The agent has to estimate the distance between the camera and the object. Note that the distance is in meters, so it may not be practical to apply regression without normalization.

Object recognition (Vision).

The agent has to estimate the bounding box of the object projected onto the camera. The bounding box is represented as (x, y, h, w), which denotes the x and y coordinates of the center, the height, and the width.

Sound localization (Audio).

The agent has to localize the direction of the sound. The agent receives spatialized audio data for 0.2s.

Training Details & Baseline results

We used an NVIDIA GeForce GTX 1060 6GB to train the agent on Ubuntu 16.04. Results are averaged over five runs. We used the same architecture as in the reinforcement learning experiments and trained each model for 50 epochs. Baseline results are shown in Table 2.

| Task (Perception) | Result (metric) |
|---|---|
| Classification (Vision) | 99.7% ± 0.02% (top-1 accuracy) |
| Classification (Tactile) | 94.8% ± 0.64% (top-1 accuracy) |
| Recognition (Vision) | 76.1% ± 1.52% (top-1 accuracy) |
| Distance (Vision) | 1.90 ± 0.30 (relative abs. error) |
| Localization (Audio) | 0.80 ± 0.03 (cosine similarity) |

Table 2: Baseline results on the datasets.

Training Details: RL

For our experiments, we used an Intel(R) Core(TM) i5-9600KF at 3.70GHz to simulate the environment on Windows 10 and an NVIDIA GeForce GTX 1060 6GB to train the agent on Ubuntu 16.04. Results are obtained from a single run.

Figure 6: Architecture used in the experiments. For SAC, the model outputs the Q value. For supervised learning problems, the model outputs the corresponding predictions.

Perception/Action

Vision.

We sampled binocular RGB vision data at a resolution of 84x84. For the GrabObject task, the agent always looks toward the object (i.e., the object is always located at the center of each view).

Audio.

We sampled the audio data at a rate of 22050Hz and converted it to the frequency domain by FFT with a window size of 1024.
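A sketch of this preprocessing, assuming non-overlapping windows and magnitude spectra (the hop size and scaling are assumptions not stated above):

```python
import numpy as np

def audio_features(wave, window=1024):
    """Convert a raw stereo waveform (2, N) sampled at 22050 Hz into
    per-window magnitude spectra via FFT with a window size of 1024.
    Non-overlapping windows are an assumption for illustration."""
    n_windows = wave.shape[1] // window
    frames = wave[:, :n_windows * window].reshape(2, n_windows, window)
    return np.abs(np.fft.rfft(frames, axis=-1))  # (2, n_windows, 513)
```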

Tactile.

For the GrabObject task, the agent senses tactile perception through the taxels on the bones of the hand. We say a taxel belongs to a bone when the taxel is closer to that bone than to any other bone.

Action.

For the GrabObject task, we set the torque applied to each joint as the action and used the applyTorque() function to move the agent according to the action. For the other tasks, we set the velocity vector of walking as the agent's action and used the walk() function to move the agent.

Helper Rewards

To accelerate the training process of RL tasks, we gave helper rewards to the agent.

KickTheBall.

We gave a helper reward proportional to cos θ, where θ is the angle between the agent's velocity vector and the displacement vector from the agent to the ball.

ObjectNav & MultiAgentNav.

We gave a helper reward calculated as c (v_f + s v_l), where v_f and v_l are the components of the velocity vector projected onto the forward and left directions, c is 1 when the object is visible to the agent (and 0 otherwise), and s is +1 / -1 when the object is on the left / right side of the agent.

GrabObject.

To encourage the agent to move its hands toward the object, we gave a helper reward proportional to the negative sum of distances -(||p_l - p_o|| + ||p_r - p_o||), where p_l, p_r, and p_o are the position vectors of the left hand, right hand, and object. On top of that, we also gave a negative quadratic penalty proportional to -||a||^2, where a is the action vector, to prevent the agent from performing extreme actions.
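A Python sketch of these helper rewards; the penalty coefficient and exact scaling are assumptions for illustration.

```python
import numpy as np

def grab_object_helper_reward(p_left, p_right, p_obj, action, penalty_coef=0.01):
    """Negative sum of hand-to-object distances plus a quadratic action
    penalty, as described above. penalty_coef is an assumed value."""
    d = np.linalg.norm(p_left - p_obj) + np.linalg.norm(p_right - p_obj)
    return -d - penalty_coef * float(np.dot(action, action))

def kick_the_ball_helper_reward(velocity, displacement_to_ball):
    """cos(theta) between the agent's velocity and the displacement to
    the ball (a sketch of the KickTheBall helper reward)."""
    v = np.linalg.norm(velocity)
    d = np.linalg.norm(displacement_to_ball)
    if v < 1e-8 or d < 1e-8:
        return 0.0
    return float(np.dot(velocity, displacement_to_ball) / (v * d))
```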

Hyperparameters & Architectures

Throughout the experiments, we used the architecture shown in Fig. 6. For unused perceptions, the corresponding component is ignored. For the hyperparameters used in the RL algorithms, see Table 3.

Locomotions with Humanoid Agent

To show the physical capability of the provided humanoid agent, we demonstrate locomotion with the humanoid agent: turning its body (Fig. 7) and sitting up from lying down (Fig. 8).

Figure 7: Humanoid agent turning its body.
Figure 8: Humanoid agent sitting from lying down.
| Parameter | PPO | SAC |
|---|---|---|
| Learning rate | Adaptive to KL div. | 2.5e-4 |
| Number of workers | 8 | 8 |
| λ (GAE) | 0.95 | N/A |
| Clipping ratio | 0.2 | N/A |
| Entropy coefficient | 0.03 | 0.01 |
| Gradient norm clipping | 5 | 5 |
| Discount factor | 0.99 | 0.99 |
| Optimizer | Adam | Adam |
| Training epochs/batches per update | 4/4 | N/A |
| Value function coefficient | 0.5 | N/A |
| Batch size | N/A | 64 |

Table 3: Hyperparameters of the algorithms used in the RL experiments.
| Class | Function | Description |
|---|---|---|
| ObservationUtils | float[] getImage(camera, height, width, grayscale) | Returns the image of the camera with size (height, width). Returns a grayscale image if grayscale is true, otherwise an RGB image. |
| | float[] getSpatialAudio(head, earL, earR, env) | Returns the raw spatialized audio data of the audio sources included in env, based on the positions of the head and both ears (earL, earR). |
| | float[] getSpatialAudio(head, earL, earR, env, Vector3 roomSize, float beta) | Returns the raw spatialized audio data, also including the room impulse response generated for a room of size roomSize with reflection coefficient beta. |
| | float[] getTactile(mesh) | Returns the tactile data for the given mesh of the agent. |
| | float[] getTactile(taxels, taxelratio, debug) | Returns the tactile data for the given taxels of the agent. If debug is true, each taxel displays a blue line representing the direction and impulse of its tactile perception. To prevent a massive display of lines during debugging, the user can disable some of the taxels using taxelratio. |
| GeneralAgent | void AddObservations(key, observation) | Adds the observation vector to the data buffer under the given key. |

Table 4: Examples of VECA APIs related to the perception of the agent.
| Class | Function | Description |
|---|---|---|
| VECAHumanoidExampleInteract | void walk(walkSpeed, turnSpeed) | Makes the agent walk. The speed and trajectory of the agent are determined by walkSpeed and turnSpeed. |
| | void kick(obj) | Makes the agent kick the object obj. |
| | void grab(obj) | Makes the agent grab the object obj. |
| | void release() | Makes the agent release the grabbed object. |
| | void adjustFocalLength(newFocalLength) | Adjusts the focal length of the cameras. |
| | void lookTowardPoint(pos) | Makes the agent look at a 3-dimensional point. Note that this function does not rotate the head: it only rotates the cameras (eyes). Also, the agent will keep looking at the point until releaseTowardPoint() is called, even while the agent is moving. |
| | void rotateUpDownHead(deg) | Rotates the head in the up-down plane. The head goes down when deg > 0. Note that this function has no constraints: excessive rotation might distort the mesh of the agent. |
| | void rotateLeftRightHead(deg) | Rotates the head in the left-right plane. The head goes right when deg > 0. Note that this function has no constraints: excessive rotation might distort the mesh of the agent. |
| | void makeSound(audiodata) | Makes a voice according to audiodata. Note that the agent also perceives its own voice. |
| | bool isVisible(obj, cam) | Checks whether the object obj is visible to the camera cam. Note that the object is considered visible even if it is occluded by transparent objects. |
| | bool isInteractable(obj) | Checks whether the object obj is interactable with the agent. The object is considered interactable when 1) it is visible, 2) there are no opaque objects between the object and the agent, and 3) the distance is closer than 2.5m. |
| | List<VECAObjectInteract> getInteractableObjects() | Returns the list of objects that are interactable with the agent. |
| | VECAObjectInteract getInteractableObject() | Returns a single object that is interactable with the agent. If there are multiple interactable objects, it chooses the one closest to the center of the agent's viewport. |
| VECAHumanoidPhysicalExampleInteract | List<Vector3> GetVelocity() | Returns the velocities of the bones. |
| | List<Vector3> GetAngVelocity() | Returns the angular velocities of the bones. |
| | List<float> GetAngles() | Returns the current angle of every joint (the length equals the degrees of freedom). |
| | void ApplyTorque(normalizedTorque) | Applies torques to the joints according to normalizedTorque. The input must have length equal to the degrees of freedom, with values within [-1, 1]. |
| | void updateTaxels() | Updates the taxels (used in the tactile calculation) according to the orientation of the bones and the mesh of the skin. By default, 6 taxels are distributed in a triangular formation on each triangle of the mesh. |

Table 5: Examples of VECA APIs related to the actions of the agent.
| Task | Observation | Action space | Description |
|---|---|---|---|
| KickTheBall | Vision, Audio | Continuous / Discrete | A ball making a buzzing sound repeatedly comes out of one of the sidewalls. The agent needs to get close to the ball to kick it. The agent has to use auditory perception to predict the ball's position, since its field of view is limited. |
| ObjectNav | Vision | Continuous / Discrete | The environment contains multiple objects, including the target object. When the agent navigates to the target object, it receives a +1 reward. If the agent navigates to a wrong object, it receives a -1 reward. |
| ObjectPhysNav | Vision, Tactile, Proprioception | Continuous | Same as ObjectNav, but the agent has to learn the task with joint-level actions. |
| MultiAgentNav | Vision, Tactile, Proprioception | Continuous / Discrete | Agents are distributed across rooms, and the object is in one of those rooms. All agents receive a +1 reward when every agent has navigated to the object. Without cooperation, each agent must traverse all rooms to find the desired object. The agent that finds the object should notify the other agents by making a sound, enabling them to navigate to the object using auditory perception. |
| GrabObject | Vision, Tactile, Proprioception | Continuous | Learn how to recognize and grab an object with joint-level actions. The agent is rewarded by the vertical position difference when the object is close enough to each hand (preventing the agent from "throwing" the object). |
| MoveToTarget | Vision, Tactile, Proprioception | Continuous | Learn how to recognize and move an object to the target position with joint-level actions. The agent is rewarded by the change in distance between the target position and the object's position. |
| TurnBaby | Tactile, Proprioception | Continuous | Learn how to turn its body with joint-level actions. The agent is rewarded based on the orientation of its torso and hip. |
| RunBaby | Tactile, Proprioception | Continuous | Learn how to run with joint-level actions. The agent is rewarded by its velocity in the direction of its torso. |
| CrawlBaby | Tactile, Proprioception | Continuous | Learn how to crawl with joint-level actions. The agent is rewarded based on the velocity and the orientation of its torso and hip. |
| SitBaby | Tactile, Proprioception | Continuous | Learn how to sit with joint-level actions. The agent is rewarded based on the position of its torso and head. |
| RotateCube | Tactile, Proprioception | Continuous | Learn how to rotate a cube to a target rotation using its hands with joint-level actions. The agent is rewarded by the change in distance between the target orientation and the cube's orientation. |
| BabyZuma's Revenge | Vision, Audio | Continuous & Discrete | Similar to Montezuma's Revenge, the agent has to learn a complex sequence of interactions to complete the level in a sparse-reward setting, using vision and audio. For instance, the agent has to: 1) navigate to the drawer, 2) open the drawer, 3) grab the key (in the drawer), and 4) open the door to clear the first room. |
| MazeNav | Vision | Continuous / Discrete | Adapted from the navigation experiment in RL^2 [RL2]. The agent is rewarded when it navigates to the object. The agent can repeat five trials in each maze. |
| ShapeSort | Vision | Continuous & Discrete | The agent has to sort objects by shape. The agent is rewarded when it grabs an object and puts it into the basket of the correct shape. |
| ColorSort | Vision | Continuous & Discrete | The agent has to sort objects by color. The agent is rewarded when it grabs an object and puts it into the basket of the correct color. |
| FillFraction | Vision | Continuous & Discrete | The agent has to fill three circles using fraction-circle objects. The agent cannot place an object if the sum of fractions would exceed the circle. The agent is rewarded when it puts an object into a circle and is additionally rewarded when the circle is full. |
| PileUpBlock | Vision, Tactile, Proprioception | Continuous | The agent has to pile up the (red) blocks with joint-level actions. The agent is rewarded according to the sum of the y-axis positions of the red blocks. |
| PutDownBlock | Vision, Tactile, Proprioception | Continuous | The agent has to put down (only) the red blocks with joint-level actions. The agent is rewarded according to the negative sum of the y-axis positions of the red blocks. Additionally, the agent receives a negative reward when blue blocks fall. |
| SoccerOnePlayer | Vision, Tactile, Proprioception | Continuous | Put the ball into the goal with joint-level actions. Note that there are no rules: e.g., the agent may use its hands. |
| SoccerTwoPlayer | Vision, Tactile, Proprioception | Continuous | The attacker must put the ball into the goal and the defender must keep the ball out of the goal, with joint-level actions. Note that there are no rules: e.g., an agent may use its hands and block or tackle the opponent. |
| DodgeCar | Vision | Continuous / Discrete | A toy car moves around the room quickly. The agent must dodge the car. |
| ChaseCar | Vision, Tactile, Proprioception | Continuous & Discrete | A toy car moves around the room quickly. The agent must chase and interact with the car, but must not collide with it. |
| PushOut1v1 | Vision, Tactile, Proprioception | Continuous | The agent must push the other agent out of the ring to win. The agent is equipped with joint-level actions. |
| PushOut2v2 | Vision, Tactile, Proprioception | Continuous | Agents must push the agents of the opposing team out of the ring to win. The agents are equipped with joint-level actions. |

Table 6: Description of the various tasks provided by VECA.