Deep Neural Network-based Cooperative Visual Tracking through Multiple Micro Aerial Vehicles

by Eric Price, et al.
Max Planck Society

Multi-camera full-body pose capture of humans and animals in outdoor environments is a highly challenging problem. Our approach to it involves a team of cooperating micro aerial vehicles (MAVs) with on-board cameras only. The key enabling aspect of our approach is the on-board person detection and tracking method. Recent state-of-the-art methods based on deep neural networks (DNN) are highly promising in this context. However, real-time DNNs are severely constrained in input data dimensions, in contrast to available camera resolutions. Therefore, DNNs often fail at objects with small scale or far away from the camera, which are typical characteristics of a scenario with aerial robots. Thus, the core problem addressed in this paper is how to achieve on-board, real-time, continuous and accurate vision-based detections using DNNs for visual person tracking through MAVs. Our solution leverages cooperation among multiple MAVs. First, each MAV fuses its own detections with those obtained by other MAVs to perform cooperative visual tracking. This allows for predicting future poses of the tracked person, which are used to selectively process only the relevant regions of future images, even at high resolutions. Consequently, using our DNN-based detector we are able to continuously track even distant humans with high accuracy and speed. We demonstrate the efficiency of our approach through real robot experiments involving two aerial robots tracking a person, while maintaining an active perception-driven formation. Our solution runs fully on board our MAVs' CPU and GPU, with no remote processing. ROS-based source code is provided for the benefit of the community.




I Introduction

Human/animal pose tracking and full body pose estimation and reconstruction in outdoor, unstructured environments is a highly relevant and challenging problem. Its wide range of applications includes managing large public gatherings, search and rescue [1, 2], coordinating outdoor sports events [3] and facilitating animal conservation efforts in the wild [4]. In indoor settings, similar applications usually make use of body-mounted sensors, artificial markers and static cameras. While such markers might still be usable in outdoor scenarios, dynamic ambient lighting conditions and the impossibility of having static/fixed cameras make the overall problem difficult. On the other hand, body-mounted sensors are not suitable for some kinds of subjects (e.g., animals in the wild or large crowds of people). Therefore, our approach to the aforementioned problem involves a team of micro aerial vehicles (MAVs) tracking subjects by using only on-board monocular cameras and computational units, without any subject-fixed sensor or marker. Among the several challenges involved in developing such a system, multirobot cooperative detection and tracking (MCDT) of a subject’s 3D position is one of the most important. It is also the sole focus of this paper.

Fig. 1: Two octocopters cooperatively detecting and tracking a person while maintaining a perception-driven formation.

The key component of our MCDT solution is the person detection method suitable for outdoor environments and marker/sensor-free subjects. Deep convolutional neural network-based (DNN) person detection methods are, unarguably, the state-of-the-art. In the recent past, DNNs have consistently been shown to outperform traditional techniques for object detection [5]. Consequently, they have gained immense popularity in various robotics-related fields, such as autonomous driving [6] and detection and identification of pedestrians [7] and vehicles [8]. However, there are only a few works that exploit the power of DNNs for visual object detection on board MAVs.

The main limiting factor in using DNNs on MAVs is the computational requirements of these networks. The majority are not real-time capable. The fastest (open-source) DNN-based detector [9] achieves approximately frames per second (fps) on a pixel square area. However, this requires the computing power of a dedicated high-end GPU, which is usually bulky and impractical to install on board a MAV. Moreover, the higher power consumption of such a GPU often requires a bulkier power supply, further aggravating this issue. One alternative is transmitting raw images to a ground station in order to offload their processing. However, this approach is also infeasible due to communication bandwidth constraints when multiple MAVs are used. Using the most advanced light-weight GPU available at the time of performing the work presented in this paper (Jetson TX1 [10]), Wei Liu et al.’s single shot Multibox detector (SSD Multibox) achieves fps [9] in sequential processing. This hardware is suitable for airborne use, but the speed is just at the lower boundary of real-time applicability.

On the other hand, this frame rate of SSD is only achieved with low resolution images. Real-time on-board processing of high resolution images is still infeasible. Thus, unfortunately, it cannot exploit the richness of very high-resolution images provided by most modern cameras. A naïve approach to using high resolution images would be to perform an uninformed and non-selective processing of a down-sampled, low resolution input image. This will always be suboptimal due to the reduced information content. Also, a low resolution image could have a wide field of view (FOV), but often lacks detail regarding a distant subject, thus limiting detection capability and accuracy. Alternatively, an uninformed region of interest (ROI) (e.g., around the center of the original image) greatly limits the chance of the tracked person being in the FOV. Therefore, DNNs would often fail at objects with small scale or far away from the camera, which are typical characteristics of a scenario with aerial robots.

To achieve real-time, continuous and accurate vision-based detections using DNNs for MCDT, our solution leverages the mutual world knowledge which is jointly acquired by our multirobot system during cooperative person tracking. However, this does not imply sharing full resolution raw images at high frame rates between MAVs. In a multirobot environment, the required communication bandwidth would increase at least linearly with each additional robot, rendering such data sharing impractical. Instead, we only share compactly-representable detection measurements which have low bandwidth requirements. Using these, we perform cooperative person tracking and use the predicted poses of the tracked person to feed the DNN-based detector a well-informed ROI for future images. Consequently, each MAV knows where to look next and can actively select the relevant ROI that supplies the highest information content. Thus, our method not only minimizes the information loss incurred by downsampling the high resolution image, but also maximizes the chance of the person being completely in the FOV. Fusion of detection measurements within our MCDT method takes into account the associated person detection uncertainties as well as the self-localization uncertainties of all MAVs.

In this work we use the SSD Multibox detector [9] due to its versatility in detecting a wide range of object classes, availability of annotated training data, speed and accuracy. However, it is important to note that our approach is not tailored to this specific detector. We treat it as a black box over a defined interface. Any visual detector can be used, as long as it can operate on arbitrarily scaled subsets of the input space.

The core contributions of this paper are as follows.

  • A real-time, continuous and accurate DNN-based multi-robot cooperative detection and tracking method, suitable for MAVs with only on-board camera and computation. This method

    • allows increased sensitivity to small and far away targets compared to a fixed wide-angle low-resolution detector,

    • allows increased field of view compared to a fixed narrow-angle low-resolution detector,

    • performs significantly better than exhaustive searches over the entire high resolution image, such as using tiled or sliding window approaches, and

    • keeps computational complexity independent of the camera image resolution.

  • A method for characterizing the DNN-based detection measurement noise.

  • Fully open source ROS-based implementation of our method.

We evaluate the performance of our MCDT approach through active cooperative perception experiments. A team of two 8-rotor MAVs actively detects and tracks a person in an outdoor environment. We demonstrate that using our MCDT method, the MAVs are able to consistently track and follow a person, while i) jointly maintaining a high quality 3D position estimate of the person, ii) remaining oriented towards the person, and iii) satisfying other formation constraints. The formation controller is based on one of our previous works [11]. It involves the combination of a model predictive controller (MPC) and a potential field algorithm. In the next section we situate our work within the state-of-the-art. This is followed by a description of our system design, theoretical details of our proposed MCDT approach and characterization of detection measurement errors in Sec. III. Section IV presents our experiments and results, followed by conclusions in Sec. V.

II State-of-the-Art

Motivated by several applications in graphics and computer vision, e.g., virtual reality, sports analysis, pose monitoring for health purposes and crowd management, full body pose and motion estimation has attracted widespread research attention [12]. Most state-of-the-art research in this field is focused on indoor scenarios, e.g., [13]. However, recent works have started to address the challenges of performing full body motion capture in unstructured everyday environments [14] and even in unknown outdoor scenarios [15]. Authors in [15] use body-fixed sensors (IMUs) to estimate the full body pose. Although they achieve a high degree of accuracy, the setup itself may be cumbersome or even infeasible for certain subjects (e.g., animals). To avoid this, as well as any body-fixed markers, the alternative approach is to perform motion capture using only multiple cameras. In [16], authors presented such a method to extract the skeletal motion of multiple closely-interacting people by using multiple cameras, without using body-fixed markers or sensors. However, their solution assumes a scenario where cameras can be fixed to the environment and the ambient light is controlled. In unknown outdoor environments, these assumptions do not hold. Hence, our approach to motion capture in outdoor scenarios involves multiple cameras on board a fleet of MAVs. In this paper, we address the key issue of multirobot cooperative detection and tracking (MCDT) involved in developing such an outdoor motion capture system. Through a cooperative perception-driven formation, we not only ensure that the tracked subject is almost always in the FOV of all MAVs, but also prevent the MAVs from colliding with each other.

Multirobot cooperative tracking has been researched extensively over the last years [17, 18, 19]. While the focus of most of these methods is to improve the tracked target’s pose estimate by fusing information obtained from teammate robots, some recent methods, e.g., [18], simultaneously improve the localization estimates of the poorly localized robots in addition to the tracked target’s pose, within a unified framework. However, hardly any cooperative target tracking method directly facilitates detections through cooperation among the robots. In this paper we do so by sharing independently obtained measurements among the robots, regarding a mutually observed object in the world. Using these shared measurements, our MCDT method at each MAV allows its detector to focus only on the relevant and most informative ROIs for future detections.

Cooperative tracking of targets using multiple MAVs has also attracted attention in the recent past [20, 21, 22]. While some address the problem of on-board visual detection and tracking using MAVs [23], they still rely on hand-crafted visual features and traditional detection approaches. Moreover, cooperation among multiple MAVs using on-board vision-based detections is not well addressed. In this paper we address both of these issues by using a DNN-based person detector that runs on board the MAVs and sharing the detection measurements among the MAVs to perform MCDT.

Deep CNN-based detectors currently require the power of GPUs. For MAVs, light-weight GPUs or embedded solutions are critical requirements. In [24], Rallapalli et al. present an overview of the feasibility of deep neural network-based detectors for embedded and mobile devices. Also, several recent works now concern airborne applications of GPU-accelerated neural networks for computer vision tasks. However, they mostly evaluate their performance in offline scenarios, without a flight-capable implementation [25, 26]. On the other hand, there are some networks suitable for real-time detection and localization of arbitrary objects in arbitrary poses and backgrounds. These include networks such as YOLO [27] or Faster R-CNN [28], which are both outperformed in speed and detection accuracy by the SSD Multibox approach [9]. Hence, in our work we use the latter. Furthermore, we ensure its real-time operation using a camera of any given resolution. We achieve this through our cooperative approach involving active selection of the most informative ROI, a method that is novel to the best of our knowledge.

III System Overview and the Proposed Approach

III-A System Overview

Our multi-MAV system does not consist of a central computational unit. Each of our MAVs is equipped with an on-board CPU and GPU to perform all computations. Although our architecture also does not depend on a centralized communication network, the field implementation is done through a central Wi-Fi access point. Each MAV runs its own instance of the following software modules: a low-level flight-controller module, a self-localization module, a cooperative detection and tracking (CDT) module and an MPC-based formation controller module. Figure 2 details the flow of data among these modules. The data shared between any two MAVs consists of their self-pose estimates and the detection measurements of the tracked person. All the aforementioned software modules run on board in real time. In the following sections, after introducing notations, we focus on the detailed description of our CDT module. Therein we describe how our proposed MCDT approach enables the MAV formation to track seemingly-small and far-away persons with high accuracy without losing them from any MAV’s FOV during tracking.

Fig. 2: Overall architecture of our multi-MAV system, designed to cooperatively detect and track a person while maintaining a perception-driven formation and satisfying certain formation constraints. The focus of the work presented in this paper and the novelties correspond to the green blocks in this figure.

III-B Preliminaries

Let there be a team of MAVs tracking a person. The 6D pose (3D position and 3 orientation angles in body-frame coordinates) of each MAV in the world frame at each time step is obtained using a self-localization system, together with an associated uncertainty covariance matrix. Each MAV has an on-board, monocular, perspective camera, rigidly attached to it.

The state of the tracked person consists of the person’s position and velocity in the world frame at each time step. We assume that the person is represented by the position and velocity of its centroid in the 3D Euclidean space. An uncertainty covariance matrix is associated to this state.

III-C DNN-based Cooperative Detection and Tracking

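Algorithm 1 Cooperative Detector and Tracker (CDT) on each MAV. Inputs: the person’s state estimate and its covariance from the previous time step, the ROI computed at the previous time step, and the current image.
1:  Run the DNN person detector on the ROI of the current image to obtain the MAV’s own detection measurement in the world frame (mean and noise covariance).
2:  Transmit own self-pose estimate and detection measurement to all MAVs.
3:  Receive self-pose estimates and detection measurements from all other MAVs.
4:  EKF Prediction using exponentially decelerating state transition model.
5:  for each available detection measurement (own and received) do
6:      EKF Update
7:  end for
8:  Predict the person’s state at the next time step (for the next ROI).
9:  Calculate next ROI.
10:  Update self-pose bias.
11:  return updated state estimate, its covariance and the next ROI.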

Algorithm 1, which is a recursive loop, outlines our CDT approach. Each MAV runs an instance of this algorithm in real time. The algorithm is based on an EKF where the inputs are the person’s tracked 3D position estimate at the previous time step, the covariance matrix associated to that estimate and the ROI, also computed at the previous time step. The other input is the image at the current time step. Steps 1–11 correspond to one iteration of the algorithm at the current time step.

Step 1 of Alg. 1 performs the person detection using a DNN-based detector on a ROI provided to it from the previous time step and the image at the current time step. Using raw detection measurements on the image (explained further below), the detection measurement in the world frame is computed as a mean and a noise covariance matrix. The detection is performed on the latest available image and uses only the GPU. If this resource is busy processing a previous image, the detection (and therefore the EKF update, Step 6) is skipped.

Raw detection measurements on the image consist of a set of rectangular bounding boxes with associated confidence scores and a noise covariance matrix. To obtain this noise, we performed a characterization of the DNN-based detector, explained in Sec. III-D. The raw detection measurements are transformed first to the camera coordinates and finally to the world coordinates. This transformation incorporates the noise covariances in the raw detection measurements and the uncertainty covariance of the MAV’s self-pose. Note that the raw measurements are in the 2D image plane, whereas the final detection measurements are computed in the world frame. For this, we make a further assumption on the height of the person being tracked. We assume that the person’s height follows a distribution. While we assume that the person is standing or walking upright in the world frame, the model can be adapted to consider a varying pose (e.g., sitting down), for instance, by increasing the variance of this height distribution. Finally, the overall transformation of detections from image frame to world frame measurements also takes this height uncertainty into account. For mathematical details regarding the transformation, including the propagation of noise/uncertainty covariances, we refer the reader to [29].
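As a rough illustration of how an assumed person height lets a 2D bounding box constrain a 3D distance, the following sketch uses a plain pinhole camera model. It is not the transformation of [29]: the function name, the focal length in pixels and the first-order variance propagation are assumptions made here for illustration, and both the bounding-box noise and the MAV self-pose uncertainty are ignored.

```python
def distance_from_bbox_height(focal_px, bbox_height_px, height_mean_m, height_var_m2):
    """Estimate camera-to-person distance from the bounding-box height.

    Assumption (not from the paper): ideal pinhole camera, upright person,
    person height with mean height_mean_m and variance height_var_m2.
    """
    # Pinhole relation: bbox_height_px = focal_px * height / distance.
    distance = focal_px * height_mean_m / bbox_height_px
    # First-order propagation of the height variance: d(distance)/d(height) = focal / bbox.
    distance_var = (focal_px / bbox_height_px) ** 2 * height_var_m2
    return distance, distance_var


# Illustrative values only: 1000 px focal length, 100 px tall detection.
distance, distance_var = distance_from_bbox_height(1000.0, 100.0, 1.8, 0.01)
```

A larger height variance (e.g., to allow for a sitting person) directly inflates the distance variance in this simplified model, which mirrors the role of the height distribution described above.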

In Steps 2–3 of Alg. 1 we transmit and receive data among the robots. This includes self-pose estimates (used for inter-robot collision avoidance) and the detection measurements, both in world frame. Step 4 performs the prediction step of the EKF. Here we use an exponentially decelerating state transition model. When the algorithm continuously receives measurements allowing continuous update steps, predictions behave similarly to a constant velocity model. However, when the detection measurements stop arriving, the exponential decrease in velocity allows the tracked person’s estimate to become stationary. This is an important property of our CDT approach, as the ROI is calculated from the tracked person’s position estimate (Step 8). In the case of having no detection measurements, the uncertainty in the person’s position estimate would continuously grow, resulting in a larger ROI. This is further clarified in the explanation of Steps 8–9.
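The behaviour of such a transition model can be sketched for a single axis as follows. This is a minimal sketch, assuming that the velocity decays as exp(-decay_rate * t); the parameter `decay_rate`, the function name and the numeric values are hypothetical, and the covariance propagation of the EKF is omitted.

```python
import math


def predict_decelerating(pos, vel, dt, decay_rate):
    """One-axis state prediction with exponentially decaying velocity.

    Assumption: v(t) = vel * exp(-decay_rate * t), so the displacement over dt
    is the integral of v(t). decay_rate is a hypothetical tuning parameter.
    """
    if decay_rate == 0.0:
        # Degenerates to the constant velocity model.
        return pos + vel * dt, vel
    decay = math.exp(-decay_rate * dt)
    new_pos = pos + vel * (1.0 - decay) / decay_rate
    new_vel = vel * decay
    return new_pos, new_vel


# Repeated predictions without any measurement update: the velocity decays
# towards zero and the predicted position converges to a fixed point.
p, v = 0.0, 1.0
for _ in range(200):
    p, v = predict_decelerating(p, v, dt=0.1, decay_rate=0.5)
```

With frequent updates the velocity is continually re-estimated, so the short prediction intervals look like constant-velocity motion; only prolonged measurement gaps let the decay dominate.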

In Steps 5–7 we fuse measurements from all MAVs (including self-measurements). As these measurements are in the world frame, fusion is done by simply performing an EKF update for each measurement.
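The sequential fusion can be illustrated with a deliberately simplified one-axis, position-only filter. This is a sketch under stated assumptions (scalar state, direct position measurements, made-up numbers), not the full EKF used on the MAVs; the function names are hypothetical.

```python
def kalman_update_1d(mean, var, meas, meas_var):
    """Scalar Kalman update of a position estimate with a direct measurement."""
    gain = var / (var + meas_var)
    new_mean = mean + gain * (meas - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var


def fuse_measurements(mean, var, measurements):
    """Apply one update per measurement; measurements is a list of (value, variance)."""
    for meas, meas_var in measurements:
        mean, var = kalman_update_1d(mean, var, meas, meas_var)
    return mean, var


# Predicted position (variance 4.0) fused with detections from two MAVs.
fused_mean, fused_var = fuse_measurements(10.0, 4.0, [(12.0, 4.0), (11.0, 2.0)])
```

Each additional measurement can only reduce the variance in this model, which is the mechanism by which a second MAV improves the joint estimate.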

In Steps 8–9 of Alg. 1 lies the key novelty of our approach. We actively select a ROI ensuring that future detections are performed on the most informative part of the image, i.e., where the person is, while keeping the computational complexity independent of camera image resolution. As the computational complexity of DNN-based detectors grows very fast with the image resolution, using our approach we are still able to use a DNN-based method in real-time and with high detection accuracy. The ROI is calculated as follows. First (Step 8), using a prediction model similar to the EKF prediction and the estimates at the current time step (state and covariance), we predict the state of the person in the next time step. Then, in Step 9, using the predicted 3D position of the person (only the position components of the predicted state), we calculate the position and associated uncertainty of the person’s head and feet. For this we again assume the height distribution model for a person, as introduced previously, and that the person is in an upright position. The predicted positions of the person’s center, head and feet are back-projected onto the image frame along with their uncertainties.

The center-top pixel of the ROI takes its x-coordinate from the x-axis component of the projected center and its y-coordinate from the y-axis component of the projected head position, offset by a margin based on the head’s standard deviation in the y-axis. The center-bottom pixel is defined analogously, using the y-axis component of the projected feet position and the feet’s standard deviation in the y-axis. Both standard deviations are computed directly from the corresponding projected uncertainties. Lastly, the left and right borders of the ROI are calculated to match a desired aspect ratio. We chose the aspect ratio that corresponds to the majority of the training images for optimal detection performance.
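The ROI construction can be sketched as below. This is an illustrative sketch only: the multiplier `margin_sigmas` applied to the standard deviations, the clamping to the image borders, the image size and all numeric values are assumptions introduced here, and the image y-axis is assumed to grow downwards.

```python
def compute_roi(center_x, head_y, head_std_y, feet_y, feet_std_y,
                margin_sigmas, aspect_ratio, image_width, image_height):
    """Return the ROI as (left, top, right, bottom) in pixels.

    margin_sigmas (how many standard deviations to pad) and aspect_ratio
    (width / height) are assumed parameters, not values from the paper.
    """
    # Pad above the projected head and below the projected feet.
    top = head_y - margin_sigmas * head_std_y
    bottom = feet_y + margin_sigmas * feet_std_y
    # Left and right borders follow from the desired aspect ratio.
    width = aspect_ratio * (bottom - top)
    left = center_x - width / 2.0
    right = center_x + width / 2.0
    # Clamp to the image; this may alter the aspect ratio near the borders.
    left, right = max(0.0, left), min(float(image_width), right)
    top, bottom = max(0.0, top), min(float(image_height), bottom)
    return left, top, right, bottom


roi = compute_roi(center_x=960.0, head_y=400.0, head_std_y=10.0,
                  feet_y=600.0, feet_std_y=20.0, margin_sigmas=3.0,
                  aspect_ratio=1.0, image_width=1920, image_height=1080)
```

Because the ROI is always rescaled to the fixed detector input size afterwards, its pixel extent in the original image does not affect the detector's computational cost.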

Step 10 of Alg. 1 performs the bias update of the MAV self-pose. Self-pose estimates obtained using GPS and IMU sensors (as is the case of our self-localization system) often have a time-varying bias [30]. The self-pose biases of each MAV cause mismatches when fusing detection measurements in the world frame, which are detrimental to our MCDT approach because i) the calculated ROI for person detection will be biased and will often not include the person, and ii) the person’s 3D position estimate will be a result of fusing biased measurements. To truly benefit from our MCDT approach, we track and compensate the bias in each MAV’s localization system (see III-A). We track these biases by using an approach based on [30]. The difference between a MAV’s own detection measurements and the tracked estimate (after fusion) is used to update the bias estimate. Our approach is a simplified form of [30] as we only track position biases, which is sufficient if the bias in orientation is negligible.
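One simple way to picture such a position-bias tracker is exponential smoothing of the offset between a MAV's own detection and the fused estimate. The sketch below is not the estimator of [30]; the smoothing `gain`, the function name and the assumption that the own measurement was computed from the uncorrected self-pose are all introduced here for illustration.

```python
def update_position_bias(bias, own_measurement, fused_estimate, gain):
    """Smooth the offset between own detection and fused estimate into the bias.

    Assumptions: all arguments are (x, y, z) tuples in the world frame, the own
    measurement uses the uncorrected self-pose, and gain lies in (0, 1].
    """
    return tuple(b + gain * ((m - f) - b)
                 for b, m, f in zip(bias, own_measurement, fused_estimate))


bias = (0.0, 0.0, 0.0)
bias = update_position_bias(bias, (1.0, 2.0, 0.0), (0.0, 0.0, 0.0), gain=0.5)
```

A small gain makes the bias estimate robust to individual noisy detections at the cost of reacting more slowly to a drifting bias.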

III-D Noise Quantification of a DNN-based Object Detector

Our MCDT approach depends on a realistic noise model of the person detector. To this end, we perform its noise quantification. As this is dependent on the specific detector being used, we describe the procedure followed for the pre-trained SSD Multibox detector [9], used in our work.

SSD Multibox can detect the presence and location of a set of object classes in an image, in which people are included. The output of the detector is, for each detection, a bounding box, a class label and a confidence score. The input image size is defined at training time. We used a pre-trained network (SSD300 trained on the PASCAL VOC 2007+12 dataset) in this work.

We quantify the detection noise with respect to (w.r.t.) the size of the detections, i.e., smaller and distant, or larger and closer to the camera. To this end, we created an extended test set from the PASCAL VOC 2007 dataset with varying levels of downscaling and upscaling of the detected person. Figure 3 details one such example. We selected images with a single, non-truncated person to avoid the detection association problem.

Fig. 3: Example of our extended test set created from the PASCAL VOC 2007 dataset. From the left: downscaled, original and upscaled images.

We performed a statistical analysis of the person detector on our extended test set. As the test set has images of different sizes, we calculate all measures relative to their image size. Figure 4 shows the detection accuracy, using the Jaccard index, relative to the person height. It is evident that even though the detection accuracy for a given minimum confidence threshold is nearly constant w.r.t. the analyzed ROI, the absolute error decreases with a smaller ROI. The chance of successful detection falls significantly for relative sizes below of the ROI and goes down to zero at , thus forming an upper boundary for desired ROI size. This analysis clearly justifies the necessity of our MCDT approach with active selection of ROIs.
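For reference, the Jaccard index of two bounding boxes is their intersection area divided by their union area. A minimal sketch, assuming boxes are given as (left, top, right, bottom) tuples (a representation chosen here for illustration):

```python
def jaccard_index(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # Two degenerate (zero-area) boxes are treated as having no overlap.
    return inter / union if union > 0.0 else 0.0
```

A value of 1 means the detected and annotated boxes coincide, and 0 means they do not overlap at all.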

Fig. 4: Detection accuracy w.r.t. the relative person height in the ROI. Images are divided into bins according to the height.

Further analysis is presented in Figure 5, where we show the relative error in the detected person’s position over all test images. The error is shown to be well described by a Gaussian distribution. We performed similar analysis on the errors of each detection for the top, bottom, left and right-most points of the detection bounding box at different relative sizes. We found all noises to be similarly well described by Gaussian distributions, without significant correlations between them. Thus, the person detection noise can be well approximated by a constant variance model (see Table I for exact values).
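A constant variance model of this kind could be estimated along the following lines. This is a sketch under assumptions: the sample layout, the function names and the use of the unbiased sample variance are choices made here, and the numbers in the usage example are invented.

```python
from statistics import variance


def relative_side_errors(detected, truth, roi_width, roi_height):
    """Errors of the (left, top, right, bottom) sides, relative to the ROI size."""
    scale = (roi_width, roi_height, roi_width, roi_height)
    return tuple((d - t) / s for d, t, s in zip(detected, truth, scale))


def side_noise_variances(samples):
    """Per-side sample variance of the relative errors.

    samples: list of (detected_box, true_box, roi_width, roi_height);
    at least two samples are required.
    """
    errors = [relative_side_errors(*sample) for sample in samples]
    return tuple(variance(side) for side in zip(*errors))


# Two invented detections of the same annotated person in a 200 x 400 ROI.
samples = [((10, 20, 110, 220), (12, 20, 108, 224), 200, 400),
           ((14, 20, 106, 228), (12, 20, 108, 224), 200, 400)]
variances = side_noise_variances(samples)
```

Normalizing by the ROI size is what allows a single set of variances to be reused for ROIs of any pixel extent.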

Fig. 5: Error distribution of SSD Multibox detections in X and Y coordinate.
TABLE I: SSD Multibox detection’s noise variances for each side of the detection bounding boxes, relative to ROIs.

IV Experiments and Results

IV-A Hardware and Software Setup

To evaluate our approach, we conducted only real robot experiments on a team of two 8-rotor MAVs tracking a person. Each MAV is equipped with an HD camera, a computer and an NVIDIA Jetson TX1 embedded GPU. GPS, IMU and magnetic compass are used for self-pose estimates and flight control. The MAVs use a ROS [31] Multi-Master setup [32] over Wi-Fi. Each MAV continuously reads images from the camera at . The GPU runs SSD-Multibox [9]. As described in Sec. III, ROIs of images down-sampled to are sent to the GPU. The detection frame rate achieved is . Detections of persons are projected into 3D, shared between the MAVs and fused using a Kalman Filter on each MAV. MAV self-pose estimates are bias corrected based on mismatches between perceived and fused 3D position of the person within the Kalman Filter. The MAVs fly in formation using an MPC-based controller [11]. The formation constraints include maintaining i) a preset distance to the tracked person’s position estimate (), ii) a preset altitude above the ground plane (), and iii) orientation towards the tracked person. For inter-robot collision avoidance we use an additional layer of artificial potential-field based avoidance algorithm on top of MPC. State estimates and IMU sensor data are updated at . The achieved video frame rate is at a resolution of RGB. Note that during all experiments the tracked person is wearing an orange helmet only for safety. The helmet is not used for any detection. On the contrary, the DNN-based detector, which we use, was not optimized for persons with helmets.

IV-B Online Dataset

Using our proposed approach for MCDT and on-board real-time computation during the flights, both MAVs perform estimation of i) their 6D self-pose, and ii) 3D position of the tracked person. We treat this as our primary dataset and refer to it as the ‘online dataset’ for analysis further in this section.

IV-C Offline Datasets

During each experiment, we also record raw sensor and camera data. Using these, we then generate additional ‘offline datasets’, by disabling selected aspects of our MCDT approach. This allows us to evaluate the contribution of each component of our approach to the overall performance. The following offline datasets are generated.

  • No actively-selected ROI (AS-ROI) - Here we turn off the active selection of the relevant ROI. Instead, the full image frame is down-sampled to and sent to the GPU.

  • No MAV self-pose bias correction (SPBC) - Here we turn off the MAV self-pose bias correction.

  • Single MAV runs - Here we perform person tracking using the data from only a single MAV. However, the active selection of ROI is still applied in the single MAV scenario.

IV-D Stationary Person Experiment

Fig. 6: X Axis is time in seconds. Y axis represents 2D (only on the ground plane) or 3D position errors over time in the stationary person experiment.
Fig. 7: X Axis is time in seconds. Y axis represents 2D (only on the ground plane) or 3D position errors over time in the stationary person experiment.

The formation constraints for the 2 MAVs are = and = . The person remains stationary for . We do not have absolute ground truth measurements of the person’s 3D position. Since the person remains stationary, we instead calculate a reference point as the mean of all 3D position estimates for each dataset (online and offline). The error in any 3D position estimate is then calculated as its distance to this reference point.
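The error metric described above can be sketched as follows. This is a minimal sketch, assuming position estimates are (x, y, z) tuples with z pointing up, so that the 2D error is the distance on the ground plane; the function name and the sample values are illustrative.

```python
import math


def stationary_errors(estimates):
    """Reference point and mean squared errors for a stationary person.

    estimates: list of (x, y, z) position estimates from one dataset.
    Returns (reference_point, mse_2d, mse_3d).
    """
    n = len(estimates)
    # Reference point: mean of all 3D position estimates of the dataset.
    ref = tuple(sum(p[i] for p in estimates) / n for i in range(3))
    err_2d = [math.hypot(p[0] - ref[0], p[1] - ref[1]) for p in estimates]
    err_3d = [math.dist(p, ref) for p in estimates]
    mse_2d = sum(e * e for e in err_2d) / n
    mse_3d = sum(e * e for e in err_3d) / n
    return ref, mse_2d, mse_3d


ref, mse_2d, mse_3d = stationary_errors(
    [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 0.0, 2.0), (0.0, 0.0, -2.0)])
```

Since the reference point is computed per dataset, this metric measures the spread of the estimates rather than any constant offset from the true position.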

Dataset MSE 2D MSE 3D
MAV 1 only
MAV 2 only
TABLE II: Mean squared error (MSE) during the stationary person experiment.

Figures 6 and 7 and Tab. II present the results of the stationary person experiment. Here, the MAV self-pose bias correction has no significant overall effect on the person’s 3D position estimate accuracy. There are occasional situations in which not using the bias-corrected MAV self-pose produces slightly better results than when using it (see, e.g., timestamp in Fig. 6). However, this is outweighed by the risk of losing the person partially or completely if the ROI is zoomed on the wrong location (timestamp in Fig. 6). Disabling the active selection of ROI leads to roughly comparable tracking accuracy to that of the online estimates, as long as the person is still successfully detected. However, it has higher chances of detection failure, as is evident in the moving person experiment’s results (see Tab. III). Consequently, it degrades the overall performance (timestamp in Fig. 6) and results in a noticeably higher mean squared error.

As seen in Fig. 7, cooperative perception significantly outperforms individual MAV estimates.

IV-E Moving Person Experiment

Fig. 8: X Axis is time in seconds. Y axis represents 2D (only on the ground plane) or 3D position errors over time in the moving person experiment.
Fig. 9: X Axis is time in seconds. Y axis represents 2D (only on the ground plane) or 3D position errors over time in the moving person experiment.

The formation constraints for the 2 MAVs are = and = . Since we do not have the ground truth positions for the person at any time, the person walks along a predefined trajectory, which is a square with edge length aligned towards true north. The person keeps walking the outline of this square for .

Considering this known trajectory constraint, we fit a horizontal, north aligned reference square to each dataset (online and offline). The error in a pose estimate is then calculated as the distance to the closest point on this reference square. Fitting the reference square is done in such a way that the mean squared error of all pose estimates in each dataset is minimized.
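The distance to a north-aligned reference square, and a naive fit of its center, can be sketched as follows. This is an illustration under assumptions: positions are reduced to ground-plane (x, y) points, the square is axis-aligned with a known edge length, and the fit simply picks the best of a supplied list of candidate centers, which is not necessarily how the fit was performed for the paper.

```python
import math


def distance_to_square(point, center, edge):
    """Distance from a 2D point to the outline of an axis-aligned square."""
    half = edge / 2.0
    dx = abs(point[0] - center[0]) - half
    dy = abs(point[1] - center[1]) - half
    if dx > 0.0 or dy > 0.0:
        # Outside: distance to the nearest edge or corner.
        return math.hypot(max(dx, 0.0), max(dy, 0.0))
    # Inside or on the outline: distance to the nearest edge.
    return -max(dx, dy)


def fit_square_center(points, edge, candidates):
    """Pick the candidate center with the lowest mean squared distance."""
    def mse(center):
        return sum(distance_to_square(p, center, edge) ** 2 for p in points) / len(points)
    return min(candidates, key=mse)


# Invented points lying exactly on a square of edge 2 centered at (1, 1).
points = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 0.0), (2.0, 1.0)]
best = fit_square_center(points, 2.0, [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)])
```

As with the stationary case, fitting the reference to each dataset means the resulting error reflects deviation from the best-matching square, not from an independently measured ground truth.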

TABLE III: Moving person experiment: MSE and percentage of positive person detections over all frames on which detection was attempted. Rows include MAV 1 only and MAV 2 only.

Figs. 8 and 9 and Tab. III show the results of the moving person experiment. As in the stationary experiment, the MAV self-pose bias correction occasionally introduces additional errors, but overall it leads to better performance. The benefit of actively selecting the ROI, the cornerstone of our MCDT method, is more significant in the dynamic situation of a moving person and active formation following than in the stationary person experiment. This is reflected in the mean squared errors calculated over the whole experiment.

Table III shows that active selection of ROI increases the chances of successful person detection by a large margin.
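The two quantities reported in Tab. III can be computed along the following lines. This is a hedged sketch with hypothetical names (`summarize_run`, `estimate_errors`, `detection_flags`); it assumes one position error per pose estimate and one boolean per frame on which detection was attempted.

```python
import math


def summarize_run(estimate_errors, detection_flags):
    """Return (MSE, detection percentage) for one experiment run.

    estimate_errors: position errors of the pose estimates, in metres.
    detection_flags: True for each attempted frame with a positive
                     person detection, False otherwise.
    """
    if estimate_errors:
        mse = sum(e * e for e in estimate_errors) / len(estimate_errors)
    else:
        mse = math.nan  # no estimates, so the MSE is undefined
    if detection_flags:
        detection_pct = 100.0 * sum(detection_flags) / len(detection_flags)
    else:
        detection_pct = 0.0
    return mse, detection_pct
```

Keeping the two inputs separate reflects that the tracker can still output an estimate on frames where the detector fails, so the detection percentage and the MSE are not computed over the same set of samples.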

Similar to the stationary person experiment, cooperative perception with 2 MAVs outperforms the single MAV case by factor of in the mean squared error of the person’s 3D position estimate.
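One intuition for the gain from a second MAV is that a single camera is typically least certain along its viewing direction, and a second viewpoint constrains exactly that direction. The sketch below is illustrative only and is not the filter used in this work: it assumes two independent Gaussian position estimates with diagonal covariances and fuses them per axis by inverse-variance weighting. The function name `fuse_independent` and all numbers are made up.

```python
def fuse_independent(mean_a, var_a, mean_b, var_b):
    """Fuse two independent Gaussian estimates axis by axis.

    Each argument is a sequence with one entry per axis. Variances must
    be positive. Returns (fused_mean, fused_var) as lists.
    """
    fused_mean, fused_var = [], []
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        information = 1.0 / va + 1.0 / vb  # inverse variances add up
        fused_var.append(1.0 / information)
        fused_mean.append((ma / va + mb / vb) / information)
    return fused_mean, fused_var


# Hypothetical example: MAV 1 is uncertain along x, MAV 2 along y.
mean, var = fuse_independent([10.0, 5.2], [4.0, 0.25],
                             [9.6, 5.0], [0.25, 4.0])
```

In this example the fused variance on each axis is smaller than the better of the two individual variances, which is the qualitative effect cooperative perception relies on.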

The video attached to this paper illustrates our work presented here and highlights a clip from the footage of the moving person experiment from a bird’s eye view. A high resolution version of the attached video is available online. A more detailed video, showing the experimental setup, the full stationary and moving person experiments, as well as the image streams from the MAV cameras overlaid with detections and EKF estimates, is also available online. Source code related to this work can be found on our project page and at a second online location.

V Conclusions and Future Work

In this paper we presented a novel method for real-time, continuous and accurate DNN-based multi-robot cooperative detection and tracking. Leveraging cooperation between robots in a team, our method is able to harness the power of deep convolutional neural network-based detectors for real-time applications. Through real robot experiments involving only on-board computation and comparisons with baseline approaches, we demonstrated the effectiveness of our proposed method. We also showed the feasibility of real-time person detection and tracking with high accuracy from a team of MAVs that maintain a perception-driven formation. Finally, we quantified the noise of the DNN-based detector, which allowed us to use it within a Bayesian filter for person tracking.

The work in this paper paves the way for our future goal of performing full-body pose capture in outdoor unstructured scenarios. To this end, we are developing a perception-driven formation controller that not only attempts to minimize the joint 3D position uncertainty in real time, but also considers the full-body pose reconstruction errors learned offline. Our future goals also include tracking animals and developing a heterogeneous team of MAVs with different sensing and maneuvering capabilities.


  • [1] M. Andriluka, P. Schnitzspan, J. Meyer, S. Kohlbrecher, K. Petersen, O. von Stryk, S. Roth, and B. Schiele, “Vision based victim detection from unmanned aerial vehicles,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct 2010, pp. 1740–1747.
  • [2] S. Waharte, A. Symington, and N. Trigoni, “Probabilistic search with agile uavs,” in 2010 IEEE International Conference on Robotics and Automation, May 2010, pp. 2840–2845.
  • [3] “MULTIDRONE: H2020-ICT-2016-2017 H2020-ICT-2016-1 - Robotics and Artificial Intelligence Project,” 2008. [Online]. Available:
  • [4] G. A. M. d. Santos, Z. Barnes, E. Lo, B. Ritoper, L. Nishizaki, X. Tejeda, A. Ke, H. Lin, C. Schurgers, A. Lin, and R. Kastner, “Small unmanned aerial vehicle system for wildlife radio collar tracking,” in 2014 IEEE 11th International Conference on Mobile Ad Hoc and Sensor Systems, Oct 2014, pp. 761–766.
  • [5] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.
  • [6] F. Falcini, G. Lami, and A. M. Costanza, “Deep learning in automotive software,” IEEE Software, vol. 34, no. 3, pp. 56–63, 2017.
  • [7] J. Hosang, M. Omran, R. Benenson, and B. Schiele, “Taking a deeper look at pedestrians,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4073–4082.
  • [8] Z. Sun, G. Bebis, and R. Miller, “On-road vehicle detection: A review,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 5, pp. 694–711, 2006.
  • [9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European Conference on Computer Vision.   Springer, 2016, pp. 21–37.
  • [10] N. Otterness, M. Yang, S. Rust, E. Park, J. Anderson, F. D. Smith, A. Berg, and S. Wang, “An evaluation of the nvidia tx1 for supporting real-time computer-vision workloads.”   RTAS.
  • [11] P. Lima, A. Ahmad, A. Dias, A. Conceição, A. Moreira, E. Silva, L. Almeida, L. Oliveira, and T. Nascimento, “Formation control driven by cooperative object tracking,” Robotics and Autonomous Systems, vol. 63, no. 1, pp. 68–79, 2015.
  • [12] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “Human pose estimation: New benchmark and state of the art analysis,” in Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).   IEEE, June 2014, pp. 3686–3693.
  • [13] F. Bogo, M. J. Black, M. Loper, and J. Romero, “Detailed full-body reconstructions of moving people from monocular RGB-D sequences,” in International Conference on Computer Vision (ICCV), Dec. 2015, pp. 2300–2308.
  • [14] D. Vlasic, R. Adelsberger, G. Vannucci, J. Barnwell, M. Gross, W. Matusik, and J. Popović, “Practical motion capture in everyday surroundings,” ACM Trans. Graph., vol. 26, no. 3, July 2007. [Online]. Available:
  • [15] T. von Marcard, B. Rosenhahn, M. Black, and G. Pons-Moll, “Sparse inertial poser: Automatic 3d human pose estimation from sparse imus,” Computer Graphics Forum 36(2), Proceedings of the 38th Annual Conference of the European Association for Computer Graphics (Eurographics), 2017.
  • [16] Y. Liu, J. Gall, C. Stoll, Q. Dai, H.-P. Seidel, and C. Theobalt, “Markerless motion capture of multiple characters using multi-view image segmentation,” Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2720–2735, 2013.
  • [17] S. S. Dias and M. G. S. Bruno, “Cooperative target tracking using decentralized particle filtering and rss sensors,” IEEE Transactions on Signal Processing, vol. 61, no. 14, pp. 3632–3646, July 2013.
  • [18] A. Ahmad, G. Lawless, and P. Lima, “An online scalable approach to unified multirobot cooperative localization and object tracking,” IEEE Transactions on Robotics, vol. PP, no. 99, pp. 1–16, 2017.
  • [19] Z. Wang and D. Gu, “Cooperative target tracking control of multiple robots,” IEEE Transactions on Industrial Electronics, vol. 59, no. 8, pp. 3232–3240, Aug 2012.
  • [20] K. Hausman, J. Müller, A. Hariharan, N. Ayanian, and G. S. Sukhatme, “Cooperative control for target tracking with onboard sensing,” in Experimental Robotics.   Springer, 2016, pp. 879–892.
  • [21] W. Zheng-Jie and L. Wei, “A solution to cooperative area coverage surveillance for a swarm of mavs,” International Journal of Advanced Robotic Systems, vol. 10, no. 12, p. 398, 2013. [Online]. Available:
  • [22] M. Zhang and H. H. T. Liu, “Cooperative tracking a moving target using multiple fixed-wing uavs,” J. Intell. Robotics Syst., vol. 81, no. 3-4, pp. 505–529, Mar. 2016. [Online]. Available:
  • [23] C. Fu, R. Duan, D. Kircali, and E. Kayacan, “Onboard robust visual tracking for uavs using a reliable global-local object model,” Sensors, vol. 16, no. 9, p. 1406, 2016.
  • [24] S. Rallapalli, H. Qiu, A. Bency, S. Karthikeyan, R. Govindan, B. Manjunath, and R. Urgaonkar, “Are very deep neural networks feasible on mobile devices,” IEEE Trans. Circ. Syst. Video Technol, 2016.
  • [25] D. C. De Oliveira and M. A. Wehrmeister, “Towards real-time people recognition on aerial imagery using convolutional neural networks,” in Real-Time Distributed Computing (ISORC), 2016 IEEE 19th International Symposium on.   IEEE, 2016, pp. 27–34.
  • [26] Q. Zhang, J. Xu, L. Xu, and H. Guo, “Deep convolutional neural networks for forest fire detection,” in 2016 International Forum on Management, Education and Information Technology Application.   Atlantis Press, 2016.
  • [27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [29] A. Ahmad, E. Ruff, and H. Bülthoff, “Dynamic baseline stereo vision-based cooperative target tracking.”   IEEE, 2016, pp. 1728–1734.
  • [30] W. Whitacre and M. Campbell, “Cooperative geolocation and sensor bias estimation for uavs with articulating cameras,” in AIAA Guidance Navigation and Control Conference, 2009, pp. 1934–1939.
  • [31] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “Ros: an open-source robot operating system,” in ICRA workshop on open source software, vol. 3, no. 3.2.   Kobe, 2009, p. 5.
  • [32] S. H. Juan and F. H. Cotarelo, “Multi-master ros systems,” Institut de Robotics and Industrial Informatics, 2015.