AirCapRL: Autonomous Aerial Human Motion Capture using Deep Reinforcement Learning

by Rahul Tallamraju, et al.
Max Planck Society

In this letter, we introduce a deep reinforcement learning (RL) based multi-robot formation controller for the task of autonomous aerial human motion capture (MoCap). We focus on vision-based MoCap, where the objective is to estimate the trajectory of body pose and shape of a single moving person using multiple micro aerial vehicles. State-of-the-art solutions to this problem are based on classical control methods, which depend on hand-crafted system and observation models. Such models are difficult to derive and generalize across different systems. Moreover, the non-linearity and non-convexities of these models lead to sub-optimal controls. In our work, we formulate this problem as a sequential decision making task to achieve the vision-based motion capture objectives, and solve it using a deep neural network-based RL method. We leverage proximal policy optimization (PPO) to train a stochastic decentralized control policy for formation control. The neural network is trained in a parallelized setup in synthetic environments. We performed extensive simulation experiments to validate our approach. Finally, real-robot experiments demonstrate that our policies generalize to real world conditions.




I Introduction

Human motion capture (MoCap) implies accurately estimating 3D pose and shape trajectory of a person. 3D pose, in our case, consists of the 3D positions of the major human body joints. Shape is usually parameterized by a large number (in thousands) of 3D vertices. In a laboratory setting MoCap is performed using a large number of precisely calibrated and high-resolution static cameras. To perform human MoCap in an outdoor setting or in an unstructured indoor environment, the use of multiple and autonomous micro aerial vehicles (MAVs) has recently gained attention [1, 2, 3, 4, 5]. Aerial MoCap of humans/animals facilitates several important applications, e.g., search and rescue using aerial vehicles, behavior estimation for endangered animal species, aerial cinematography and sports analysis.

Realizing an aerial MoCap system involves several challenges. The system’s robotic front-end [2] must ensure that the subject i) is accurately and continuously followed by all aerial robots, and ii) is within the field of view (FOV) of the cameras of all robots. The back-end of the system estimates the 3D pose and shape of the subject, using the images and other data acquired by the front-end [1]. The front-end poses a formation control problem for multiple MAVs. In this letter, we propose a deep neural network-based reinforcement learning (DRL) method for this formation control problem.

Fig. 1: An illustration of an aerial MoCap system where MAV agents learn formation control policies based on MoCap performance rewards.

Below, we describe the drawbacks in state-of-the-art methods and highlight the novelties in our work to address them.

In existing solutions [1, 2, 3] the front and back end are developed independently. The formation control algorithms of the existing aerial MoCap front ends assume that the person should be centered in every MAV’s camera image and should be within a threshold distance to each MAV. These assumptions are intuitive and important, and it has been shown experimentally that they lead to a good MoCap estimate. However, the resulting formation remains sub-optimal without any feedback from the estimation back-end of the MoCap system. The estimated 3D pose and shape are strongly dependent on the viewpoints of the MAVs. In the current work, we take a learning-based approach to map and embed this dependency within the formation control algorithm. This is our first key novelty.

Existing approaches [2, 3, 4, 5] depend on tediously obtained system and observation models. State-of-the-art solutions to formation control problems, which involve perception-related objectives, derive observation models for the robot’s camera and the desired subject to compute real-time robot trajectories [2, 6, 7]. Since these observation models are based on assumptions on the shape and motion of the subject, sensor noise and the system kinematics, the computed trajectories are sub-optimal. We overcome the aforementioned issue by addressing the formation control for aerial MoCap as a multi-agent reinforcement learning (RL) problem. This is the second key novelty of our approach. We let the MAVs learn the best control action given only the subject perception observable through the MAV’s on-board camera images, without making any assumptions on the observation model.

The key insights which enable us to do this are i) the sequential decision making nature of the formation control problem with MoCap objectives, and ii) the feasibility of simulating control policies in synthetic environments. We leverage the actor-critic methodology of training an RL agent with a centralized training and decentralized execution paradigm. At test time, each agent runs a decentralized instance of the trained network in real-time. We showcase the performance of our method in several simulation experiments. We evaluate the quality of the generated robot trajectories using the pose and shape estimation algorithms in [8], [9] and [1]. Additionally, we compare our new approach with the state-of-the-art model-based controller from [2]. A demonstration and comparison with the method of [2] on a real MAV is also presented. Code and implementation details of our method are provided in the supplementary material.

II Related Work

Aerial Motion Capture Methodologies: A marker-based multi-robot aerial motion capture system is presented in [4]. Here, pose of the person and the robots are jointly estimated and optimized online. A multi-robot model-predictive controller is used to compute trajectories which optimize the camera viewing angle and person visibility in the image. Marker-based methods suffer from tedious setup times, and optimal control methods for trajectory following can lead to sub-optimal policies for motion capture due to perceptual objectives. A markerless aerial motion capture system using multiple aerial robots and depth cameras is proposed in [5]. They use a non-rigid registration method to track and fuse the depth information from multiple flying cameras to jointly estimate the motion of a person and the cameras. Their approach works only indoors and the initial registration step can take a long time similar to other marker based method setups. In one of our previous works, [10], we introduced a vision-based (monocular RGB) markerless motion capture method using multiple aerial robots in outdoor scenarios. The pose and shape of the subject and the pose of the cameras are jointly estimated and optimized in [10]. While our other previous work [2] introduces a front-end of our outdoor aerial MoCap system, [10] describes the back-end.

Perception-Aware Optimal Control Methods for Target Tracking: In [6], a perception-aware MPC generates real-time motion plans which maximize the visibility of a desired static target. In [11], a deep learned optical flow algorithm and non-linear MPC are jointly utilized to optimize a general task-specific objective. The optical flow dynamics are explicitly embedded into the MPC to generate policies which ensure the visibility of target features during navigation. An occlusion-aware moving target following controller is proposed in [12]. Here, metrics for target visibility are utilized to navigate towards a moving target and constrained optimization is leveraged to navigate safely through corridors. In the above works, the motion plans are generated only for a single aerial robot to track a single generic target. In our previous work [2], a non-linear MPC based formation controller for active target perception is introduced for target following. The controller assumes Gaussian observation models and linearizes system dynamics. Using these, it identifies a collision-free trajectory which minimizes the fused uncertainty in target position estimates. In contrast to that, in our current work we learn a control policy to explicitly improve the quality of 3D reconstruction of human pose. An implicit perception-aware target following behavior evolves out of the controller for both single and multi-agent scenarios.

Learning based Control for Aerial Robots for Perception Driven Tasks: Optimal control methods are computationally expensive, require explicit estimation of the state of the system and world, and depend mostly on hand-crafted system and observation models. Thus, they can often lead to sub-optimal behaviors. A model-predictive control guided policy search was proposed in [13], where supervised learning is used to obtain policies which map the on-board aerial robot sensor observations to control actions. The method does not require explicit state estimation at test time and plans based on just input observations. In [14], the authors used a deep Q-learning based approach for cinematographic planning of an aerial robot (or MAV). A discrete action policy was trained on rewards that exploit aesthetic features in synthetic environments. User studies were performed to obtain the aesthetic criteria. In contrast to that, our current work proposes single and multi-agent MAV control policies that reward the minimization of errors in body pose and shape estimation. A proximal policy optimization (PPO) based distributed collision avoidance policy was proposed in [15]. A centralized training and decentralized execution paradigm was leveraged to obtain a policy that maps laser range scans to non-holonomic control actions. In [16] the authors propose an A3C actor-critic algorithm to develop reactive control actions in dynamic environments. Each agent’s ego observations and LSTM-encoded dynamic environmental observations are inputs to a fully connected network. Their goal is to obtain a fully distributed control policy. In contrast to the aforementioned works, we propose a model-free deep reinforcement learning approach to the MoCap-aware MAV formation control problem. In our work, a policy neural network directly maps observations of the target subject to control actions of each MAV without any underlying assumptions of the observation model or system dynamics.

III Methodology

III-A Problem Statement

Let there be a team of K MAVs (with quadcopter-type dynamics) tracking a person P. The pose of each MAV in the world frame at time t consists of the 3D position of the MAV’s center in Cartesian coordinates and its orientation in Euler angles. Each MAV has an on-board, monocular, perspective camera. It is important to note that the camera is rigidly attached to the MAV’s body frame, pitched down at an angle of 45°. The global pose of the person consists of the body’s 3D center and global orientation. The MoCap of the subject considers the 3-D positions of a total of fourteen joints. Ground truth joints considered are visualized as circles in Fig. 2. The MAVs operate in an environment with neighboring MAVs as dynamic obstacles. Their task is to autonomously fly and record images of the person using their on-board camera. The formation control goal of the MAV team is to cooperatively navigate in a way such that the error in 3D pose estimates of the subject is minimized.

III-B Formulation as a Sequential Decision Making Problem

Intuitively, the accuracy of aerial MoCap depends on the following two factors.

  • The subject should always remain completely in the FOV of every MAV’s camera, occupying maximum possible area on the image plane.

  • The subject is visually encapsulated from all possible directions (viewpoints).

Based on these intuitions and experimentally derived models for single and multiple camera-based observations, in our previous work [2] we approached this problem using a model predictive control (MPC) based formation controller. The MPC objective was to keep a threshold distance to the subject while satisfying constraints that enable uniform distribution of viewpoints around the subject. Additionally, a yaw controller ensured that the subject was always centered on the image plane. As discussed in the introduction, this method is hard to generalize because i) it is agnostic to how the 3D pose and shape are estimated by the back end, and ii) it needs carefully derived observation models.

To address these issues in this work we take a deep reinforcement learning-based approach. We model this formation control problem as a sequential decision making problem for every MAV agent. Dropping the MAV superscript $k$, for each agent the problem is defined by the tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where $\mathcal{S}$ is the state-space, $\mathcal{O}$ is the observation-space, $\mathcal{A}$ is the action-space, $\mathcal{T}$ is the environment transition model, and $\mathcal{R}$ is the reward function. At each time instance $t$, an agent at state $s_t$ has access to an observation $o_t$ using its cameras and on-board sensors. The agent then chooses an action $a_t$, which is conditioned on $o_t$ using a stochastic policy $\pi_\theta(a_t \mid o_t)$, where $\theta$ represents parameters of a neural network. The agent experiences an instantaneous reward $r_t$ from the environment indicating the goodness of the chosen action. We approach the problem without any underlying assumptions or knowledge about the environment transition model $\mathcal{T}$. To this end, we leverage a model-free deep reinforcement learning method to train the agents. We will further describe the states, observations and actions in detail. For ease of notation and to keep the RL training computationally tractable, we will consider 2 MAV agents in this letter, i.e., $K = 2$. Rewards are described later when we discuss our proposed methodology in sub-section III-C.
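As a rough illustration of the sequential decision making loop described above, the following sketch samples actions from a stochastic policy and collects observation, action and reward tuples. It is a minimal sketch and not the authors' implementation: the linear-Gaussian policy stands in for the policy neural network, a toy environment replaces the Gazebo simulation, and `gaussian_policy`, `rollout` and `toy_env_step` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)


def gaussian_policy(obs, weights, log_std):
    """Sample an action from N(weights @ obs, exp(log_std)^2).

    Assumption: a linear mean stands in for the policy neural network.
    """
    mean = weights @ obs
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)


def rollout(env_step, obs0, weights, log_std, horizon):
    """Run one episode and return a list of (obs, action, reward) tuples."""
    obs, trajectory = obs0, []
    for _ in range(horizon):
        action = gaussian_policy(obs, weights, log_std)
        next_obs, reward = env_step(obs, action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory


def toy_env_step(obs, action):
    """Hypothetical environment: the observation drifts with the action and
    the reward penalizes the distance of the new observation from the origin."""
    next_obs = obs + 0.1 * action
    return next_obs, -float(np.linalg.norm(next_obs))


trajectory = rollout(toy_env_step, np.ones(3), -np.eye(3), log_std=-2.0, horizon=5)
```

The policy only ever sees the observation, never the transition function, which mirrors the model-free setting: a learning algorithm such as PPO would update `weights` and `log_std` from the collected tuples alone.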

Fig. 2: The ground truth body joints (left) and estimated pose and shape overlaid (right).

III-B1 States and Observations

Each agent’s environment state includes the MAV’s own pose, its neighboring MAV’s pose and the MoCap subject’s pose.


The observation vector is given by (1). Its first two components are the measurements of the person’s position and velocity made by the agent in its local Cartesian coordinates. The third component of the observation vector is the measurement of the relative yaw orientation of the person with respect to the robot’s global yaw orientation. Here we emphasize that we make no assumptions regarding the uncertainty model associated with these measurements. However, we assume that these measurements are available using a vision-based detector. In our synthetic training environment we directly use the available ground truth position and orientation of the person and the MAV to compute these measurements. In real robot scenarios we use Vicon readings to calculate them. The fourth component is the 3D position measurement of the neighboring MAV agent in the local Cartesian coordinates of the observing agent. The fifth component is the measurement of the relative yaw orientation of the person with respect to the neighboring robot’s global yaw orientation.
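The five components described above can be assembled from ground-truth poses roughly as follows. This is a hedged sketch rather than the authors' code: it assumes the agent's local frame differs from the world frame only by a translation and a yaw rotation, that the velocity component is the person's world-frame velocity expressed in the local frame, and that angles are in radians. The names `yaw_rotation`, `wrap_angle` and `build_observation` are hypothetical.

```python
import numpy as np


def yaw_rotation(yaw):
    """Rotation matrix for a rotation of `yaw` radians about the z-axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def wrap_angle(angle):
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + np.pi) % (2.0 * np.pi) - np.pi


def build_observation(agent_pos, agent_yaw, person_pos, person_vel,
                      person_yaw, neighbor_pos, neighbor_yaw):
    """Stack the five observation components into one 11-dimensional vector."""
    world_to_local = yaw_rotation(agent_yaw).T  # assumption: yaw-only frame
    return np.concatenate([
        world_to_local @ (person_pos - agent_pos),    # 1) person position
        world_to_local @ person_vel,                  # 2) person velocity
        [wrap_angle(person_yaw - agent_yaw)],         # 3) person yaw w.r.t. agent
        world_to_local @ (neighbor_pos - agent_pos),  # 4) neighbor position
        [wrap_angle(person_yaw - neighbor_yaw)],      # 5) person yaw w.r.t. neighbor
    ])


obs = build_observation(
    agent_pos=np.zeros(3), agent_yaw=np.pi / 2,
    person_pos=np.array([0.0, 1.0, 0.0]), person_vel=np.array([0.0, 0.5, 0.0]),
    person_yaw=np.pi,
    neighbor_pos=np.array([2.0, 0.0, 0.0]), neighbor_yaw=-np.pi / 2,
)
```

In this example the agent faces the world y-axis, so a person one meter away along y appears directly ahead at local coordinates (1, 0, 0).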

III-B2 Actions

An action is sampled from the control policy for an input observation. In our formulation, actions consist of the egocentric 3-D linear translational velocity of the agent and a rotational velocity about its z-axis. The chosen action defines a way-point for the agent in the world frame, which is computed from the current 3D position of the agent, as defined before, and a rotation matrix applied to the commanded velocity. The way-point is provided to the low-level geometric tracking controller (Lee controller) [17] of the agent. The action is thus given by (2).
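One plausible way to turn such an egocentric velocity command into a world-frame way-point is sketched below. The exact expression used by the authors is not recoverable from the text, so this is an assumption-laden illustration: the rotation is taken to be a yaw-only rotation, the velocity is integrated over a hypothetical time step `dt`, and `action_to_waypoint` is a made-up name.

```python
import numpy as np


def action_to_waypoint(position, yaw, action, dt):
    """Map an action (vx, vy, vz, yaw_rate) to a world-frame way-point.

    Assumptions: the egocentric frame is the world frame rotated by `yaw`
    about z, and the command is held constant for `dt` seconds.
    """
    velocity = np.asarray(action[:3], dtype=float)
    yaw_rate = float(action[3])
    c, s = np.cos(yaw), np.sin(yaw)
    rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    waypoint = position + (rotation @ velocity) * dt
    return waypoint, yaw + yaw_rate * dt


waypoint, waypoint_yaw = action_to_waypoint(
    position=np.array([1.0, 2.0, 3.0]), yaw=np.pi / 2,
    action=np.array([1.0, 0.0, 0.0, 0.5]), dt=1.0,
)
```

Here a forward command of 1 m/s for an agent facing the world y-axis moves the way-point one meter along y, to approximately (1, 3, 3).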


III-C Proposed Methodology

Training multiple agents to achieve multiple objectives is a complex and computationally demanding task. In order to have a systematic comparison we first develop our approach for a single agent case and then for multi-agent scenario. Meaning, we train (and then evaluate and compare) two different kinds of agents, and hence, networks. These are i) a single agent with only MoCap objectives, and ii) multi-agents (2 in our case) with both MoCap and collision avoidance objectives.

We hypothesize that, using the first kind of network, an agent will learn to follow the person and orient itself in the direction of the person in order to achieve accurate MoCap from the back-end estimator. On the other hand, using the second network, the agents will learn how to avoid each other and distribute themselves around the person to cover all possible viewpoints. We also hypothesize that the best navigation policies for the robot(s) for the MoCap task should significantly depend on the MoCap’s accuracy-related rewards, while other rewards may or may not be required.

III-C1 Network 1: Single Agent Network

All variants of single agent network use the following states and observations, where the superscript 1 denotes single agent network.


The actions for all single agent network variants consist of the action vector as stated in (2). They are all trained on a moving subject. These variants differ only in their reward structure as described below. The rewards are computed at every timestep. However, for the sake of clarity we drop the time subscript from the reward variables.

Network 1.1 – Only Centering Reward

In this variant we only reward the agent based on the intuitive reasoning of keeping the person as close as possible to the center of the image from the MAV agent’s on-board camera. The centering reward (4) is computed from the distance between the center of the person’s bounding box on the image and the image center, measured in pixels, and is scaled by a weighting constant. Note that keeping the person centered in each frame is not the goal of this work. As per the above-stated hypothesis, the centering reward may not be required at all. Thus, Network 1.1 will only serve as a comparison benchmark to highlight that a MoCap’s accuracy-related reward is explicitly required.

Network 1.2 – SPIN Reward

In this variant of the network we reward the agent based on the output accuracy of the MoCap back end. For this, we use SPIN [8], a state-of-the-art method for human pose and shape estimation using monocular images. At every time-step of training, we use SPIN on the image acquired by the agent and compute an estimate of the 3D positions of all 14 joints. In the synthetic training environment we have access to the true values of these joints. The SPIN reward (5) is then computed from the error between the estimated and true joint positions, scaled by a weighting constant.

Network 1.3 – Weighted SPIN Reward

Network 1.2 rewards the agent equally for the accuracy of each joint. However, the joints further away from the pelvis (also referred to as the root joint), like the hands or feet, have a greater tendency to be in erratic motion than the ones closer to the root, like the hips. To account for this, in the network variant 1.3 we penalize the outward joints more and hence define a Weighted SPIN reward, in which the per-joint errors are scaled by positive weights that sum to 1.
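The weighting idea can be sketched as a weighted mean of per-joint Euclidean errors. This is only an illustration under stated assumptions: the mapping from this error to the actual reward value is not shown here, the linearly increasing raw weights are invented for the example, and `weighted_joint_error` is a hypothetical name.

```python
import numpy as np


def weighted_joint_error(estimated, ground_truth, raw_weights):
    """Weighted mean of per-joint Euclidean errors.

    `raw_weights` are positive per-joint importances; they are normalized
    here so that the weights sum to 1.
    """
    weights = np.asarray(raw_weights, dtype=float)
    weights = weights / weights.sum()
    per_joint = np.linalg.norm(estimated - ground_truth, axis=1)
    return float(weights @ per_joint)


# Hypothetical data: 14 joints, each estimate offset by 10 cm along x.
ground_truth = np.zeros((14, 3))
estimated = ground_truth + np.array([0.1, 0.0, 0.0])
raw_weights = np.linspace(1.0, 2.0, 14)  # assumed: larger for distal joints
error = weighted_joint_error(estimated, ground_truth, raw_weights)
```

Because the weights are normalized, a uniform 10 cm offset on every joint yields a weighted error of 0.1 m regardless of how the weights are distributed; the weighting only matters when joint errors differ.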

Network 1.4 – Centering and Weighted SPIN Reward

The last variant of the single agent network uses a summed reward, given as the sum of the centering reward of Network 1.1 and the weighted SPIN reward of Network 1.3.

III-C2 Network 2: Multi-Agent Network

All three variants of the multi agent network, described below, use the state as defined in (1). The observations for Network variants 2.1 and 2.2 are a reduced form of (1), as these variants are trained on a static subject. In these two variants the action space excludes yaw control. Hence, during their training, we use a separate yaw controller to always orient the agent towards the person. On the other hand, Network 2.3 is trained with the full observation space as stated in (1) on a moving subject, and it uses the full action space as stated in (2). This means that Network 2.3 also includes yaw-rate control.

The difference in the reward structure is described below.

Network 2.1: Centering, collision avoidance and AlphaPose Triangulation Reward (Trained with Static Subject)

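In this variant we use a sum of three rewards: a centering reward, a collision avoidance reward and an AlphaPose triangulation reward. Here, the centering reward is the same as defined in (4). The collision avoidance reward penalizes the agent based on its distance from the neighboring robot. It is given by (7), with a distance parameter (in meters) that is fixed in our implementation.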

The AlphaPose triangulation reward is a simplified MoCap-specific reward in a 2-agent scenario, which we obtain using a triangulation-based method. AlphaPose [18] is a state-of-the-art human joint detector which provides body joint detections on monocular images. At every time step we use it on the images obtained by the agent and its neighbor to obtain one set of 2D joint detections per view. Using known camera intrinsics and extrinsics (from self-pose estimates) for both agents, a point in the image plane and its corresponding view from another camera, we can estimate the 3-D position of the point using a least squares formulation (equation (14.42) in [19]). Therefore, by using the two sets of detections, we estimate the 3D positions of all 14 joints of the subject and compare them to the ground-truth joint positions. The reward is then computed from the resulting joint position error.
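To make the two-view estimation step concrete, the sketch below triangulates a single joint with the standard homogeneous least-squares (DLT) method solved via SVD. This is a common textbook formulation and may differ in detail from the equation cited in [19]; the camera intrinsics, the one-meter baseline and the function names `triangulate` and `project` are assumptions made for the example.

```python
import numpy as np


def triangulate(proj_a, proj_b, pixel_a, pixel_b):
    """Least-squares 3D point from two 3x4 projection matrices and the
    matching pixel coordinates (u, v) in each view."""
    rows = []
    for proj, (u, v) in ((proj_a, pixel_a), (proj_b, pixel_b)):
        rows.append(u * proj[2] - proj[0])
        rows.append(v * proj[2] - proj[1])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    point_h = vt[-1]
    return point_h[:3] / point_h[3]


def project(proj, point):
    """Pinhole projection of a 3D point to pixel coordinates."""
    p = proj @ np.append(point, 1.0)
    return p[:2] / p[2]


# Hypothetical cameras: identical intrinsics, second camera shifted 1 m in x.
intrinsics = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
proj_a = intrinsics @ np.hstack([np.eye(3), np.zeros((3, 1))])
proj_b = intrinsics @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

joint = np.array([0.5, 0.2, 4.0])
estimate = triangulate(proj_a, proj_b, project(proj_a, joint), project(proj_b, joint))
```

With noise-free detections the triangulated point matches the true joint up to numerical precision; with real detections the residual of the least-squares problem grows, which is what makes the resulting joint error sensitive to the relative viewpoints of the two MAVs.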

Fig. 3: Single Agent Network: Variants of this network are trained with different rewards as described in sub-subsection III-C1.

Network 2.2: Centering, collision avoidance and Multiview HMR Reward (Trained with Static Subject)

In this variant we use a sum of three rewards: a centering reward, a collision avoidance reward and a Multiview HMR reward. The first two are the same as (4) and (7), respectively. The Multiview HMR reward is based on the output accuracy of the MoCap back end using images from multiple agents. For this, we use MultiviewHMR [9]. It is a state-of-the-art method for human pose and shape estimation using images from multiple viewpoints. At every timestep of training, we use it on the images acquired by the agent and its neighbor to compute an estimate of the 3D positions of all 14 joints. The reward (9) is then computed from the weighted joint position errors, where the weights are as described in the previous section.

Network 2.3: Centering, continuous collision avoidance and Multiview HMR Reward (Trained with Moving Subject)

In this variant we use a sum of three rewards: a centering reward, a Multiview HMR reward and a continuous collision avoidance reward. Here, the centering and Multiview HMR rewards are the same as (4) and (9). The continuous collision avoidance reward is obtained using the potential field functions as described in our previous work [20] (equation 3), parameterized by two distances (in meters). Furthermore, its value is clamped to a fixed range.

Network 2.4 + Potential Field: Centering and Multiview HMR Reward (Trained with Moving Subject)

In this variant, we use a sum of two rewards, namely, (4) and (9). The key difference in this case w.r.t. Network 2.3 is that here we use a potential field-based collision avoidance method [20] as a part of the environment during the training to keep the robots from colliding with each other at all times. It is not embedded in the reward structure and hence, the robots are not explicitly penalized for it. Testing of this network, during experiments, was also performed with potential field-based collision avoidance as a part of the environment.

Fig. 4: Multi Agent Network: Variants of this network are trained with different rewards as described in sub-subsection III-C2.

IV Experiments and Results

IV-1 Training Setup in Simulation

We train our networks in simulation. We use the Gazebo multi-body dynamics simulator with ROS and OpenAI-Gym to train the MAV agents. For the MAV agent we use the AscTec Firefly model with an on-board RGB camera facing down at a 45° pitch angle w.r.t. the MAV body frame. We run 5 parallel instances of Gazebo and the AlphaPose network on multiple computers over a network to render the simulation. The policy network is trained on a dedicated PC which samples a batch of transition and reward tuples from the network of computers to update the networks. We use a simulated human in Gazebo as the MoCap subject and generate random trajectories using a custom plugin. Details of the network architectures, training process, libraries, instructions on how to run the code, etc., are provided in the attached supplementary material.

IV-A Simulation Results

In this sub-section we evaluate our trained policies in the Gazebo simulation environment. We create a test trajectory for the simulated human actor lasting 120 s, on which it walks with varying speeds. The best policy of each network variant, as described in subsection III-C, is run 20 times while the actor walks the trajectory. Thus, results from a total of 2400 s of evaluation runs are obtained for each network variant.

For single agent experiments, in addition to the DRL-based methods, we run 4 other methods: i) ‘Network 1.4 + AirCap’, ii) Orbiting Strategy, iii) Frontal-view Strategy and iv) MPC-based approach [2]. For multi-agent experiments we run 2 additional methods: i) ‘Network 2.3 + AirCap’ and ii) MPC-based approach [2]. All these were also run 20 times for 120 s each to allow comparison with our DRL-based policies. ‘Network 1.4 + AirCap’ and ‘Network 2.3 + AirCap’ imply running the networks with ‘true observations’ instead of directly using simulator-generated ground-truth observations. To this end, we ran the complete AirCap pipeline [2] during the test by replacing only the MPC-based high-level controller with the DRL policy in it. It executes an NN-based person detector, a Kalman filter-based estimator for the person’s 3D position estimation (not orientation), and cooperative self-localization of the MAVs using simulated GPS measurements with noise as well as communication packet loss. More details regarding this are provided in the supplementary material associated with this article. ‘Orbiting Strategy’ is essentially a ‘model-free’ approach in which a robot orbits around the person at a fixed distance in order to increase the coverage. In ‘Frontal-View Strategy’ a robot maintains a fixed distance to the person and attempts to always keep the frontal view of the person in the camera image. Below we discuss the results for single and multi-agent network variants and other aforementioned methods.

Fig. 5: Simulation results of Single Agent Network variants.

IV-A1 Single Agent Network Variants

In order to compare the network variants, we use 2 metrics: i) centering performance error (CPE) and ii) MoCap performance error (MPE). CPE is computed as the pixel distance from the center of the bounding box around the person in the agent’s camera image to the image center. MPE, for single agent networks, is simply the joint position error as defined for the reward in (5). To compute this, the SPIN method [8] is run on the images acquired by the agents during testing.
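The CPE definition translates directly into a few lines of code. The sketch below assumes an axis-aligned bounding box given as `(x_min, y_min, x_max, y_max)` in pixels and an image size given as `(width, height)`; these conventions and the name `centering_error` are assumptions for illustration.

```python
import math


def centering_error(bbox, image_size):
    """Pixel distance between the bounding-box center and the image center."""
    x_min, y_min, x_max, y_max = bbox
    width, height = image_size
    dx = (x_min + x_max) / 2.0 - width / 2.0
    dy = (y_min + y_max) / 2.0 - height / 2.0
    return math.hypot(dx, dy)


# A person centered in a 640x480 image versus one in the top-left corner.
cpe_centered = centering_error((300, 200, 340, 280), (640, 480))
cpe_corner = centering_error((0, 0, 40, 80), (640, 480))
```

A perfectly centered detection gives a CPE of zero, while the corner detection is offset by 300 pixels horizontally and 200 pixels vertically from the image center.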

Note that the metric which quantifies the MoCap accuracy of any method in this paper is MPE (the right side box plots in Fig. 5 and 6). CPE is a metric that we plot only to make the policy performance intuitively explainable and understand ‘what’ the learned RL policies are doing to achieve a good MPE.

Figure 5 shows the error statistics of the aforementioned metrics. The grey background behind any box plot signifies that the method could not keep the person, even partially, in the MAV FOV, thereby completely losing him/her, for at least some duration of the experiment runs. In these cases, the box plot represents errors computed only for those timesteps when the person was at least partially in the FOV.
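The convention of reporting errors only for timesteps in which the person is at least partially visible can be sketched as a masked statistic. The per-timestep error values and visibility flags below are made-up illustration data, and `visible_error_stats` is a hypothetical helper, not part of the authors' evaluation code.

```python
import numpy as np


def visible_error_stats(errors, visible):
    """Summarize errors over visible timesteps and report visibility.

    Assumes at least one timestep is visible; otherwise the median and
    variance are undefined.
    """
    errors = np.asarray(errors, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    shown = errors[visible]
    return {
        "median": float(np.median(shown)),
        "variance": float(np.var(shown)),
        "visible_fraction": float(visible.mean()),
    }


# Hypothetical run: the person is lost in the third timestep.
stats = visible_error_stats([0.1, 0.2, 9.9, 0.3], [True, True, False, True])
```

Reporting the visible fraction alongside the masked statistics matters because a method can show low error on the few frames in which it sees the person while still being unusable for continuous capture.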

MPE plots in Fig. 5 for single robot experiments show that for all methods the medians of the MPEs are very similar to each other. This is the most significant result, especially because we can demonstrate that in terms of accuracy our DRL-based approach is on par with the state-of-the-art MPC-based approach [2] (or fixed-strategy methods), without the need for hand-crafting observation models and system dynamics (or pre-specified robot trajectories). Furthermore, Networks 1.4 and 1.2 have significantly lower MPE variance than all other methods. For these reasons, Network 1.4 and Network 1.2 are the two most successful approaches for the MoCap task.

From Fig. 5 plots, we also see that Network 1.4 keeps the person centered much more than Network 1.2, 1.3 or MPC. This is expected because Network 1.4 is rewarded for centering the person in the image in addition to SPIN-based MoCap rewards. Network 1.2 or 1.3, on the other hand, only has SPIN-based MoCap rewards. Nevertheless, the MPE of Network 1.4 is only slightly better than that of Network 1.3. This signifies that centering the person in the image does not have a great impact on the accuracy of the motion capture estimates.

Network 1.1, which often lost the person in its FOV, outperforms all other methods in its CPE performance for the duration it could ‘see’ the person. This is expected as it is trained with only the centering reward. Even though its mean MPE for the person-visible duration is similar to other networks, the variance of its MPE is higher than that of the other networks. Moreover, the fact that it could keep the person in the FOV for only part of the time, whereas the other networks (1.2–1.4) never lost the person, makes it less desirable even for the MoCap task.

The median MPE of ‘Network 1.4 + AirCap’ is very similar to all other methods. However, ‘Network 1.4 + AirCap’ has one drawback. As the ‘ground truth observations’ are not used in this method and the simulated person can make sudden direction changes, the person is much more likely to leave the FOV of the MAV’s camera. Since the network never learned to ‘search’ for a person who is out of the FOV, the method has to ‘wait’ until the person walks back into the FOV. The cooperative estimation method of the AirCap pipeline helps in this regard as the person might still be in another robot’s FOV, but this is not possible in the single robot case. Thus, ‘Network 1.4 + AirCap’ loses the person for 35% of the time.

The strategy-based methods struggle to keep the person, even partially, in the MAV camera’s FOV. While the ‘Orbiting Strategy’ was able to keep the person in the FOV for 73% of the total time of all experiments combined, the ‘Frontal-View Strategy’ managed to do that only 20% of the total time. This is because when the person changes direction or speed, the robot has to fly around to reposition itself in front of the person, thus losing the person during the transition. On the other hand, our successful DRL-based approaches, i.e., Network 1.2, 1.3 and 1.4, never lose the person from the camera FOV. Based on this analysis, we can conclude that the strategy-based methods, while being ‘model-free’, still have a major drawback of losing the person often, if not very carefully hand-crafted. Our DRL-based approaches ‘explore’ the space of these strategies and find the most suitable one in their policies.

Fig. 6: Simulation results of Multi-Agent Network variants.

IV-A2 Multi-Agent Network Variants

The MPE in the multi-agent case is also simply the joint position error as defined for the reward in (5), but instead of using SPIN as in the single agent case, here it is computed by running Multiview HMR [9] for pose and shape estimation on every simultaneous pair of images acquired by both the agents during the evaluation runs. Network 2.1 and 2.2 were trained and tested on a static person. On the other hand, Network 2.3 and Network 2.4 + Potential Field were both trained and tested with a moving person (in the same way as for the single agent experiments). The remaining two methods in the multi-agent case were also tested with moving persons.

Figure 6 shows the error statistics of the multi-agent simulation experiments. The best performing network in the multi-agent case is Network 2.3. It is very similar to the MPC-based method in terms of the MPE median value (see Fig. 6, right side) and has much less MPE variance than MPC. This is a very significant result, as MPC required observation models of the subject and our DRL-based approach in Network 2.3 did not. In the MPC approach, the viewpoint configurations for the MAVs emerge out of the joint target perception models. In contrast, in the DRL-based approach the MAVs directly learn the viewpoint configurations from experience. We also notice that the rewards based on a triangulation method assist, to some extent, in achieving acceptable MoCap performance (see results of Network 2.1). However, they remain inferior to Network 2.3, which used the more sophisticated approach taken in Multiview HMR [9] for reward computation.

Furthermore, we find that in terms of MPE, ‘Network 2.3 + AirCap’ is close to both Network 2.3 and MPC. Similar to ‘Network 1.4 + AirCap’, ‘Network 2.3 + AirCap’ also loses the person from the robots’ FOV. However, the person is present in at least one robot’s FOV for approx. 97% of the total experiment duration. The increased visibility in the multi-robot case is due to the cooperative estimator module of the AirCap pipeline. This assessment indicates the usability of our method on real robots with real observations.
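The visibility figures quoted above can be thought of as simple fractions over time steps. The sketch below is one possible way to compute them, assuming that a boolean in-FOV flag is logged for every robot at every time step; the function name `visibility_stats` and the sample log are hypothetical.

```python
def visibility_stats(in_fov):
    """Compute per-robot and joint visibility fractions.

    in_fov[r][t] is True if robot r has the person in its FOV at time step t
    (assumed log format). Returns a list with the visible fraction for each
    robot and the fraction of steps where at least one robot sees the person.
    """
    if not in_fov or not in_fov[0]:
        raise ValueError("need at least one robot and one time step")
    steps = len(in_fov[0])
    if any(len(row) != steps for row in in_fov):
        raise ValueError("all robots must have the same number of time steps")
    per_robot = [sum(row) / steps for row in in_fov]
    seen_by_any = sum(1 for t in range(steps) if any(row[t] for row in in_fov))
    return per_robot, seen_by_any / steps


if __name__ == "__main__":
    # Two robots, four time steps (illustrative values only).
    log = [
        [True, True, False, False],
        [False, True, True, False],
    ]
    print(visibility_stats(log))
```

With more than one robot, the joint fraction can exceed every individual fraction, which is consistent with the higher visibility reported for the multi-robot case.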

Next, we find that the policy learned by ‘Network 2.4 + Potential field’ was able to achieve an MPE median value comparable to Network 2.3, but at the cost of slightly higher MPE variance and loss of the person from at least one robot’s FOV for several periods (13% of the total duration). This experiment further underlines the key benefit of our DRL-based approach in Network 2.3: it overcomes the need for knowing models and strategies, as well as for any ad-hoc collision avoidance techniques. In Network 2.3 the learned policy not only achieves good MoCap performance, but it also naturally learns to avoid collisions with the teammates. In the video associated with this paper, we show how well Network 2.3 performs. The networks for the moving person, however, did not ensure very good centering of the person in the image (see the left side of Fig. 6) as compared to the MPC-based approach. Despite this, their MPE performances are only slightly poorer than MPC (the MPE median difference is only approx. 0.05m). This further indicates that centering the person in the image has a very low effect on MoCap performance.

Fig. 7: A snapshot of the real robot experiment.

Finally, for the multi-agent case, we find that the medians of the MPEs for all multi-agent networks were substantially lowered compared to the MPEs obtained by single-drone experiments (from 0.7m to 0.22m). This highlights the benefit of using multiple drones and hence multiple views to improve MoCap performance.

IV-B Real Robot Results

In order to validate our approach in a real robot scenario, we used a DJI Ryze Tello drone. It has a forward-looking camera capturing images at hz. The drone is controllable using an SDK with a ROS interface. The Tello provides vision-based localization, which is highly inaccurate. Hence, we performed experiments within a Vicon hall with markers on top of the drone to estimate its position and velocity. The tracked subject wore a helmet with Vicon markers. The Vicon-based position estimate of the person was used to compute the observations for the neural network.

Fig. 8: Real Robot Experiments: Comparison of single agent network variant 1.1 and MPC-based [2] approach.

We performed experiments with the Tello drone and compared our DRL-based approach using Network 1.1 with the state-of-the-art MPC-based approach [2]. These were performed for approximately s and s, respectively. Figure 7 shows external camera footage of the experiment and the on-board drone view with the pose and shape overlay obtained using SPIN. As the ground truth pose and shape of the human subject are not available in the real experiment, we compare only the following criteria: i) the length and breadth of the bounding box around the person in the drone images, and ii) the proximity of the person to the center of those images, calculated as the pixel distance from the image center to the center of the bounding box around the person. The bounding boxes are computed by running the Alphapose [18] method on the images recorded by the drone. Figure 8 presents the statistics of these evaluation criteria. We notice that the performance of both approaches is similar in terms of the person’s proximity to the image center, with our DRL-based approach performing slightly better. However, we observe that the MPC-based approach is consistently able to keep a larger size (projected height) of the person in the images. This is due to the fact that the MPC’s objectives force it to keep a certain threshold distance to the person. As the DRL-based approach has no such incentive, it varies its distance to the person more, causing a greater variance in the projected height of the person. On the other hand, this enables our DRL-based approach to change its relative orientation with respect to the person such that she/he is observed from several possible sides. This is evident from the greater variance in the projected width of the person in the images. This property of our DRL-based approach will benefit pose and shape estimation methods, as demonstrated in the simulation experiments.
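The two image-based criteria above can be sketched as follows. This is an illustrative computation only: it assumes that each bounding box is given as (x_min, y_min, x_max, y_max) in pixels and that the image size is known; the function name `bbox_metrics`, the box format, and the sample image size are assumptions, not details taken from the experiments.

```python
import math


def bbox_metrics(box, image_size):
    """Return (width_px, height_px, center_distance_px) for one bounding box.

    box: (x_min, y_min, x_max, y_max) in pixels (assumed format).
    image_size: (image_width, image_height) in pixels.
    center_distance_px is the pixel distance from the image center to the
    center of the bounding box.
    """
    x_min, y_min, x_max, y_max = box
    image_width, image_height = image_size
    width_px = x_max - x_min
    height_px = y_max - y_min
    box_cx = (x_min + x_max) / 2.0
    box_cy = (y_min + y_max) / 2.0
    center_distance_px = math.hypot(box_cx - image_width / 2.0,
                                    box_cy - image_height / 2.0)
    return width_px, height_px, center_distance_px


if __name__ == "__main__":
    # Hypothetical detection in a hypothetical 960 x 720 image.
    print(bbox_metrics((400, 200, 500, 500), (960, 720)))
```

Collecting these three values over all frames yields the distributions of projected width, projected height, and centering that are compared between the two approaches.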

V Conclusions and Future Work

In this letter, we presented the first deep reinforcement learning-based approach to human motion capture using aerial robots. Our solution does not depend on hand-crafted system or observation models. Formation control policies are directly learned through experience, which is obtained in synthetic training environments. Through extensive experiments and comparisons we find that DRL-based agents learn extremely good policies, on par with carefully designed model-based (MPC) or model-free, fixed strategy-based methods. These policies even generalize to real robot scenarios. We also find that multiple agents learn even better policies and outperform single agents in performing MoCap. The learning objective (MoCap accuracy) is far simpler to construct than system or observation models are to derive [2]. Moreover, strategy-based methods, as shown in our experiments, can have various drawbacks, such as losing the person from the field of view. To overcome that, each drawback must be identified and addressed within the fixed strategy. A DRL-based approach overcomes the need for fixing a strategy by ‘exploring’ the space of such strategies. Thus, a major conclusion of our work is that DRL-based approaches are likely the ideal way forward for aerial MoCap systems. Eventually, an end-to-end approach of learning actions directly from images is needed to overcome the need for the additional person-detection method that has been used so far (e.g., SSD multibox in AirCap [3]). To this end, we are improving our training by using SMPL body models in richer, photorealistic simulated environments.

Our approach would also be applicable in a real robot setting with ‘real observations’ while achieving accuracy similar to an MPC-based approach [2]. Nevertheless, this is valid only for those durations when the person is not lost from the FOV of all cameras. In order for the policy to ‘search’ for the person, network training should be done with the AirCap pipeline’s ‘real observations’. This would involve running several DNN-based detectors and keeping track of delayed measurements. Furthermore, our approach is limited in terms of scaling up to more agents. While addressing this will require a more sophisticated network architecture, it should be noted that 2 to 3 aerial robots may be enough to achieve good MoCap accuracy [1].


The authors would like to thank Prof. Dr. Heinrich Bülthoff for his constant support and for providing us access to the Vicon tracking hall at the MPI for Biological Cybernetics. The authors also thank Igor Martinović and the anonymous reviewers for extremely helpful suggestions.


  • [1] N. Saini, E. Price, R. Tallamraju, R. Enficiaud, R. Ludwig, I. Martinovic, A. Ahmad, and M. Black, “Markerless outdoor human motion capture using multiple autonomous micro aerial vehicles,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2019, pp. 823–832.
  • [2] R. Tallamraju, E. Price, R. Ludwig, K. Karlapalem, H. H. Bülthoff, M. J. Black, and A. Ahmad, “Active perception based formation control for multiple aerial vehicles,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4491–4498, Oct. 2019.
  • [3] E. Price, G. Lawless, R. Ludwig, I. Martinovic, H. H. Bülthoff, M. J. Black, and A. Ahmad, “Deep neural network-based cooperative visual tracking through multiple micro aerial vehicles,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3193–3200, Oct 2018.
  • [4] T. Nägeli, S. Oberholzer, S. Plüss, J. Alonso-Mora, and O. Hilliges, “Flycon: real-time environment-independent multi-view human pose estimation with aerial vehicles,” in SIGGRAPH Asia 2018 Technical Papers.   ACM, 2018, p. 182.
  • [5] L. Xu, Y. Liu, W. Cheng, K. Guo, G. Zhou, Q. Dai, and L. Fang, “Flycap: Markerless motion capture using multiple autonomous flying cameras,” IEEE Transactions on Visualization and Computer Graphics, vol. 24, no. 8, pp. 2284–2297, Aug 2018.
  • [6] D. Falanga, P. Foehn, P. Lu, and D. Scaramuzza, “Pampc: Perception-aware model predictive control for quadrotors,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2018, pp. 1–8.
  • [7] T. Nägeli, J. Alonso-Mora, A. Domahidi, D. Rus, and O. Hilliges, “Real-time motion planning for aerial videography with dynamic obstacle avoidance and viewpoint optimization,” IEEE Robotics and Automation Letters, vol. 2, no. 3, pp. 1696–1703, 2017.
  • [8] N. Kolotouros, G. Pavlakos, M. Black, and K. Daniilidis, “Learning to reconstruct 3d human pose and shape via model-fitting in the loop,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2019, pp. 2252–2261.
  • [9] J. Liang and M. C. Lin, “Shape-aware human pose and shape reconstruction using multi-view images,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4352–4362.
  • [10] N. Saini, E. Price, R. Tallamraju, R. Enficiaud, R. Ludwig, I. Martinovic, A. Ahmad, and M. J. Black, “Markerless outdoor human motion capture using multiple autonomous micro aerial vehicles,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 823–832.
  • [11] K. Lee, J. Gibson, and E. A. Theodorou, “Aggressive perception-aware navigation using deep optical flow dynamics and pixelmpc,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1207–1214, 2020.
  • [12] B. F. Jeon and H. J. Kim, “Online trajectory generation of a mav for chasing a moving target in 3d dense environments,” arXiv preprint arXiv:1904.03421, 2019.
  • [13] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, “Learning deep control policies for autonomous aerial vehicles with mpc-guided policy search,” in 2016 IEEE international conference on robotics and automation (ICRA).   IEEE, 2016, pp. 528–535.
  • [14] R. Bonatti, W. Wang, C. Ho, A. Ahuja, M. Gschwindt, E. Camci, E. Kayacan, S. Choudhury, and S. Scherer, “Autonomous aerial cinematography in unstructured environments with learned artistic decision-making,” Journal of Field Robotics, vol. 37, no. 4, pp. 606–641, 2020.
  • [15] T. Fan, P. Long, W. Liu, and J. Pan, “Fully distributed multi-robot collision avoidance via deep reinforcement learning for safe and efficient navigation in complex scenarios,” arXiv:1808.03841, 2018.
  • [16] M. Everett, Y. F. Chen, and J. P. How, “Motion planning among dynamic, decision-making agents with deep reinforcement learning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2018, pp. 3052–3059.
  • [17] T. Lee, M. Leok, and N. H. McClamroch, “Geometric tracking control of a quadrotor uav on se (3),” in 49th IEEE conference on decision and control (CDC).   IEEE, 2010, pp. 5420–5425.
  • [18] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299.
  • [19] S. J. Prince, Computer vision: models, learning, and inference.   Cambridge University Press, 2012.
  • [20] R. Tallamraju, D. H. Salunkhe, S. Rajappa, A. Ahmad, K. Karlapalem, and S. V. Shah, “Motion planning for multi-mobile-manipulator payload transport systems,” in 2019 IEEE 15th International Conference on Automation Science and Engineering (CASE), Aug 2019, pp. 1469–1474.