Coordinate-Aligned Multi-Camera Collaboration for Active Multi-Object Tracking

by Zeyu Fang, et al.

Active Multi-Object Tracking (AMOT) is a task where cameras are controlled by a centralized system to adjust their poses automatically and collaboratively so as to maximize the coverage of targets in their shared visual field. In AMOT, each camera only receives partial information from its observation, which may mislead cameras into taking locally optimal actions. Besides, the global goal, i.e., maximum coverage of objects, is hard to optimize directly. To address the above issues, we propose a coordinate-aligned multi-camera collaboration system for AMOT. In our approach, we regard each camera as an agent and address AMOT with a multi-agent reinforcement learning solution. To represent the observation of each agent, we first identify the targets in the camera view with an image detector, and then align the coordinates of the targets in the 3D environment. We define the reward of each agent based on both global coverage and four individual reward terms. The action policy of the agents is derived with a value-based Q-network. To the best of our knowledge, we are the first to study the AMOT task. To train and evaluate the efficacy of our system, we build a virtual yet credible 3D environment, named "Soccer Court", to mimic the real-world AMOT scenario. The experimental results show that our system achieves a coverage of 71.88



1 Introduction

With recent progress in computer vision and robotics, Active Object Tracking (AOT) has become an emerging task [1, 2, 3], in which mobile cameras track objects effectively by automatically adjusting their poses based on visual observations. Existing works in the field of AOT focus only on the single-object tracking task. However, various real-world applications require multiple cameras to track multiple objects, such as sports competitions, traffic monitoring, etc. Considering the potential application value, in this work we are dedicated to the Active Multi-Object Tracking (AMOT) task, where a team of collaborative cameras automatically control their actions to track multiple objects, and we approach it with a multi-camera collaboration system.

Compared to the settings of existing works where each camera is responsible for tracking a single object, AMOT requires multiple cameras to coordinate their attention to multiple objects. AMOT is a challenging task due to the following facts. Firstly, AMOT involves multiple cameras tracking the targets, and the problem complexity grows exponentially with the number of cameras. Secondly, each camera only captures a local and partial observation of the scene field, which requires an effective information integration scheme to achieve global optimization of camera coordination. Thirdly, the evaluation metric, i.e., overall target coverage, is somewhat high-level, which makes it difficult to evaluate the control action of each individual camera and to guide the cameras to learn an effective collaboration.

Figure 1: An overview of our environment for the AMOT task. (a) A vertical view of the global situation. Multiple randomly moving targets are tracked by a team of cameras. Based on its visual observation and policy, each camera takes actions including translation, rotation and zoom to cover the maximum number of targets in the global visual field. (b) A typical view observation of a single camera. Bounding boxes of each target are generated by an object detection algorithm. A target is considered to be covered only if the size of its bounding box is larger than a threshold.

To address the above issues, we formulate AMOT as a centralized multi-agent reinforcement learning (MARL) problem and propose a coordinate-aligned multi-camera collaboration system. Specifically, we regard each camera as an agent and carefully define the state representation and reward function for the agents. For the state representation, considering that agents (cameras) directly capture video shots of the scene field from different perspectives, we first identify the targets in the video frames with an image object detector [4], and then align the detected targets by camera calibration [5]. In this way, the targets captured by all agents are mapped to the same absolute coordinate system in the 3D environment, which integrates the observations from all agents into a global state representation. To effectively train the agents with reinforcement learning, we design the reward function with both global and local terms. The global reward term is defined as the coverage rate of all targets, while the local reward terms are specifically defined for each agent considering the visibility, direction, bounding box and position rewards of the corresponding targets in the view field.

Based on the above definition, we leverage a value-based reinforcement learning algorithm to identify the optimal policy so as to select a proper action to control each agent (camera). Specifically, a shared deep Q-network [6] is used to estimate the Q-value function, which evaluates the Q-value of each possible camera action given integrated state information including coordinates as observations. In this way, agents learn to collaborate effectively, and meanwhile, the individual-global gap is largely narrowed.

To train and evaluate our proposed system, we build a 3D environment for the AMOT task, as illustrated in Figure 1, based on Unreal Engine [7]. Different from the existing 2D environment [8], which lacks visual appearance and complex phenomena like occlusion, our environment is virtual yet credible, with many human targets to mimic real-world scenarios. Our environment settings follow the conventions of a real-world soccer match, where multiple players are distributed across a large court and walk throughout the whole episode. Multiple cameras are evenly placed at the border and controlled to cover as many targets as possible. Such an environment enables our system to acquire visual observations and conduct actions on cameras, and also provides precise rewards, facilitating the training. The environment can efficiently render the next state (i.e., the next frames) based on the current environmental situation and conducted actions.

We validate the effectiveness of our method by conducting experiments in the environment mentioned above. The results show that our method achieves considerable performance gain compared to the baseline method in which cameras are fixed. Moreover, the ablation study demonstrates the effectiveness of the inverse projection transformation design and the necessity of each type of individual reward.

Our contributions are summarized as follows:

  • To the best of our knowledge, we are the first to study the AMOT task and propose a solution based on centralized multi-agent reinforcement learning, which learns an efficient collaborative tracking policy for all the involved cameras. This solution will serve as an initial baseline in this field.

  • We build a 3D virtual environment to mimic real-world multi-object active tracking scenes, which enables the training of agents on active tracking tasks and also has the potential to generalize to more complicated scenes. Our virtual environment is released to the public for further study in the community.

2 Related Work

In this section, we discuss related work from three aspects: traditional solutions to AOT, and deep RL for single-camera and multi-camera AOT.

2.0.1 Traditional solutions to AOT

Traditional solutions to active tracking usually consist of two separate modules, i.e., detection and control. The detection module first detects and analyzes motion and location features, which are then used by the control module to manipulate the camera sequentially [9, 10, 11, 12]. Thanks to the great progress achieved in conventional object detection and tracking in recent decades [13, 14, 15, 16, 17], passive tracking tasks, in which the video data is pre-captured, can be effectively addressed with existing tracking algorithms. According to these tracking results, a control module is developed to manipulate the camera and track targets actively [18, 19]. Similarly in our method, an object detector is also used to obtain visual observations as the input of RL algorithm. However, traditional methods suffer limited generalization capability due to the lack of data sets.

2.0.2 Single-camera AOT via deep RL

Recent breakthroughs in deep reinforcement learning [20, 21, 22] and multi-agent reinforcement learning [23, 24] provide an alternative way to deal with the active tracking problem. Previous works show that RL algorithms are feasible and data-efficient for many vision-based control tasks [25, 26, 27, 28], which inspires researchers to employ RL algorithms on the visual tracking problem. In the active tracking task, an RL algorithm is first applied via an end-to-end solution [1]. Its network consists of convolution and LSTM networks, taking raw frames as states and outputting camera actions. The task is formulated as a Markov Decision Process (MDP) problem, and A3C [29] is adopted as its RL algorithm. The experiments are conducted within virtual 3D environments. The results show that the end-to-end method performs favorably against many traditional trackers with a hand-engineered camera-control module, and also has good generalization capability and potential to transfer to real-world scenarios. However, this end-to-end method assumes that the target motion pattern is fixed, i.e., the target only moves along a fixed trajectory, which limits the generalization capability of the tracker.

Still based on the single-target single-camera system, the follow-up work addresses the above problem by adding an adversarial mechanism [2]. Both the tracker and the target are given a partial zero-sum reward, so that the tracker learns a better tracking policy while the target learns to escape the tracker. Such a mechanism diversifies the target behavior and probes the weaknesses of the tracker, which in turn yields a more robust and efficient tracker. Active tracking in a more complex scene with multiple distractors is also studied [30], and attention modules are used to encode the template feature and the historical feature embeddings.

2.0.3 Multi-camera AOT via deep RL

The multi-camera collaboration system is also investigated. A pose-assisted method is proposed [3], in which cameras share their pose information when tracking a single target to overcome imperfect observations like occlusion. The system has two controllers named vision-based controller and pose-based controller. A switcher is used to evaluate the observation and choose which controller to use. The empirical results in 3D virtual environments show that this method has the capability to deal with complex scenes compared with traditional tracking algorithms.

Figure 2: An overview of our multi-agent active tracking system. The diagram marks the concatenation operation and the one-hot code of the $i$-th camera. In each step, each agent gets primitive frame pixels from the environment. These observations are further integrated into a joint observation via the object detector and inverse projection transformation. Then the Q-Network picks an optimal action according to the current observation and policy. After the actions are conducted, the environment evolves into the next step, and the rewards generated are used to update the Q-Network.

Despite the progress achieved by previous works in active tracking, the active multi-object tracking (AMOT) task has not been explored yet. In this work, this task is formulated as a coverage problem. A number of cameras are supposed to cover the maximum number of random-walking targets within their view range. Previous work applies the hierarchical reinforcement learning method on a 2D version of the coverage problem [8], suggesting that multi-agent RL is a potential solution to AMOT. However, in the 2D environment, targets are simplified as points, omitting any visual appearances. Moreover, cameras are assumed to have direct access to the targets’ location, which is unrealistic. Different from that, a 3D multi-object multi-camera environment could be more complicated with raw frames as inputs and imperfect observations like occlusion.

Inspired by previous work, we first build a 3D virtual environment to simulate real-world scenarios and study the AMOT problem on it. A centralized multi-agent RL algorithm is deployed to learn a proper policy for each camera to take action. In contrast, our system efficiently deals with AMOT problem given raw frames and has much higher coverage compared with the equal number of fixed cameras.

3 Method

In this section, we introduce our system for the AMOT task in detail. We first discuss the problem formulation and then show the whole procedure sequentially in three parts: object detection, observation integration, and reinforcement learning. Figure 2 illustrates an overview of our method.

3.1 Problem Formulation

The active tracking problem mentioned above is formulated as a partially observable multi-agent Markov decision process (POMDP) [31]. A traditional POMDP is defined by a 7-tuple $(S, A, P, R, \Omega, O, \gamma)$, which denotes the set of states, actions, conditional transition probabilities, reward function, observations, conditional observation probabilities, and discount factor, respectively. For our multi-agent system, we additionally denote $N = \{1, \dots, n\}$ as the set of agents, each of which corresponds to a camera. All cameras share the same observation space and action space. The action space is discrete and described as a vector with 3 dimensions, $a = (a^{\mathrm{trans}}, a^{\mathrm{rot}}, a^{\mathrm{zoom}})$, denoting the action on translation, rotation and zoom, respectively. Based on the above definition, our multi-agent system is governed by 8 elements, i.e., $(N, S, A, P, R, \Omega, O, \gamma)$.

In step $t$, the $i$-th agent first gets its observation $o_{i,t}$ with probability $O(o_{i,t} \mid s_t)$, where $o_{i,t} \in \Omega$ and $s_t \in S$. Then the $i$-th agent sends its observation to the Q-network, and all agents’ observations are further combined into a joint observation $o_t = (o_{1,t}, \dots, o_{n,t})$. Concatenated with a one-hot code to identify each agent, $o_t$ is sent to the Q-network to obtain the estimated Q-value of each action. The $i$-th agent’s action $a_{i,t}$ is then picked via an epsilon-greedy policy based on the Q-values. All agents’ picked actions compose the joint action $a_t = (a_{1,t}, \dots, a_{n,t})$. Finally, the updated state $s_{t+1}$ is drawn with probability $P(s_{t+1} \mid s_t, a_t)$, and the $i$-th agent receives a reward $r_{i,t}$. The ultimate goal for our AMOT problem is to maximize the expected total reward within the whole episode:

$$\max \; \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{i=1}^{n} r_{i,t} \right], \tag{1}$$

where $T$ is the number of steps in an episode.

3.2 Agent State Representation

The state representation of agents has a significant impact on the efficiency of reinforcement learning. In our task, the information originally received from cameras only contains images with raw pixels and the location of each camera, from which it is hard to learn an effective RL policy. To address this, target detection and location alignment are applied sequentially to convert primitive observations into a more compact and reasonable state representation.

3.2.1 Target Detection

At the beginning of each step, each camera takes shots of the virtual environment based on the current environment state. To estimate the location of each target, object detection algorithms are adopted to predict bounding boxes from these raw frames, which act as a part of the observation defined above. Besides, since our environment runs in real time, the object detector must produce high-quality results efficiently.

To this end, we use the YoloV4-Tiny model as our object detection model, which is derived from state-of-the-art object detection algorithm YoloV4 with a simplified network structure and reduced parameters [14, 4]. Compared to YoloV4 which has more than 60 million parameters, the YoloV4-Tiny model contains only 6 million parameters. This greatly increases the feasibility of deploying this object detector in our active tracking problem. Our experimental results show that the performance of taking YoloV4-Tiny as an object detector is comparable with that of using ground truth bounding boxes generated from the environment.

3.2.2 Location Alignment by Inverse Projection Transformation

However, in the AMOT task, bounding boxes obtained from object detection only describe the relative position between the camera and the object, not the absolute position of the object in the 3D environment, which lowers the training efficiency of reinforcement learning. The absolute coordinate of each target in the 3D environment is a better state representation. Therefore, bounding boxes are mapped to targets’ coordinates after detection via an inverse projection transformation, which we design following camera calibration techniques [5].

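According to the pinhole camera model, the pixel-level 2D point position $(u, v)$ in the frame and the 3D point position $(x, y, z)$ in world coordinates are related by the projection equation:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & t \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \tag{2}$$

where $s$ is a scale factor, $K$ is the intrinsic matrix, and $R$ and $t$ are the extrinsic parameters (rotation and translation) of the camera in the world coordinate system. Since the targets walk randomly on the ground plane and the z-axis is set perpendicular to the ground, their $z$ coordinates are always zero. In this context, Equation (2) simplifies to:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \tag{3}$$

where $r_1$ and $r_2$ are the first and second columns of matrix $R$. The intrinsic matrix $K$ is defined as below:

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}. \tag{4}$$

Parameters in the intrinsic matrix contain the focal length $(f_x, f_y)$ and the principal point $(c_x, c_y)$, which depend only on the camera’s inherent structure and are invariant to the camera’s position and direction.

In our virtual environment, the origin of the world coordinate system is placed at the center of the ground. The camera’s intrinsic matrix, pose, and location at each step are known. Given a 2D point position $(u, v)$ in the frame, its position $x$ and $y$ in world coordinates is calculated by:

$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = s \left( K \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} \right)^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \tag{5}$$

with the scale $s$ fixed by requiring the third component of the left-hand side to equal 1.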

3.2.3 Agent Observation Integration

The middle point of the bottom margin of the bounding box is used as the 2D point position to estimate the target’s world coordinate in our method. With the coordinates obtained, the observation of the $i$-th agent is defined as a joint vector of the coordinates of detected targets and the camera’s posture information, including its location, rotation, zoom scale, and distance to the other cameras.
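The mapping from a bounding box to a ground-plane coordinate can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the helper names (`look_at_rotation`, `project`, `bbox_to_ground`), the camera axis convention (x right, y down, z forward), and the example intrinsics are all hypothetical.

```python
import numpy as np

def look_at_rotation(cam_pos, target):
    """World-to-camera rotation for a z-up world (camera: x right, y down, z forward)."""
    forward = target - cam_pos
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)
    return np.stack([right, down, forward])

def project(point_w, K, R, t):
    """Pinhole projection of a 3D world point to pixel coordinates."""
    p = K @ (R @ point_w + t)
    return p[:2] / p[2]

def bbox_to_ground(bbox, K, R, t):
    """Map the bottom-center of a bounding box to (x, y) on the plane z = 0."""
    x_min, y_min, x_max, y_max = bbox  # pixel coordinates, y grows downward
    pixel = np.array([(x_min + x_max) / 2.0, y_max, 1.0])
    # Homography of the ground plane: K [r1 r2 t]
    H = K @ np.column_stack([R[:, 0], R[:, 1], t])
    w = np.linalg.solve(H, pixel)
    return w[:2] / w[2]  # normalize so the third component equals 1

# Hypothetical camera: 640x480 frame, 90-degree horizontal view angle.
K = np.array([[320.0, 0.0, 320.0], [0.0, 320.0, 240.0], [0.0, 0.0, 1.0]])
cam_pos = np.array([0.0, -3000.0, 500.0])
R = look_at_rotation(cam_pos, np.zeros(3))
t = -R @ cam_pos
```

Solving the linear system rather than forming the explicit inverse is only a numerical convenience; the result is the same ground-plane point.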

3.3 Reward Definition

Designing a proper reward is nontrivial if our agents are to learn a competent and robust policy. Our reward consists of a team reward and an individual reward. For the $i$-th agent, the reward is defined as a weighted sum of the team reward and the individual reward:


3.3.1 Team Reward

With coverage rate as the evaluation metric, the team reward is defined to be the same as the coverage rate over all targets in the environment:


where the visible flag indicates whether a target is captured by a camera. If, at a given step, the ground-truth bounding box of a target in a camera’s frame exceeds a threshold proportion of the whole frame size, the flag is set to 1; otherwise it is set to 0.
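The visible flag and a per-step coverage rate can be sketched as follows. The threshold 0.0005 is the minimum size proportion listed in Table 1; the function names, the array layout, and the rule that a target counts as covered when at least one camera sees it are assumptions made for illustration.

```python
import numpy as np

def visible_flags(bbox_areas, frame_area, min_proportion=0.0005):
    """bbox_areas: (n_cameras, n_targets) ground-truth box areas in pixels, 0 if unseen."""
    return (np.asarray(bbox_areas) / frame_area > min_proportion).astype(int)

def coverage_rate(flags):
    """Fraction of targets seen by at least one camera (assumed aggregation)."""
    return float(flags.max(axis=0).mean())
```

For example, with a hypothetical 640x480 frame the threshold corresponds to a box area of 153.6 pixels, so very distant targets are not counted as covered even when they appear in the frame.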

3.3.2 Individual Reward

The individual reward is the weighted sum of four reward terms:


where the four terms consider bounding box size, visibility, direction, and position, respectively. In the following, we elaborate on each term of the individual reward separately.

Firstly, the target bounding box is expected to be large enough for tracking. Therefore, given the bounding box size of each target in each camera at each step and the frame size, the bounding box reward is defined as:


Secondly, cameras are supposed to cover the maximum number of targets, which coincides with the average coverage as an evaluation metric. Meanwhile, it’s possible that a single target is simultaneously observed by more than one camera, which has no benefits but wastes camera resources. Considering the above two issues, visibility reward is defined as below:


where the visible flag takes the same definition as in Eq. (7).

Thirdly, a camera already capturing at least one target should hold the target in its visual field instead of losing it in the near future. Therefore, the camera is supposed to adjust its pose to track the target in the center of its view, otherwise the target can easily walk out of the visual field. Thus a direction reward is designed as below:


where the angle error is measured between the camera orientation and the target direction in yaw angles. Specifically, if there is more than one target in the frame, the average position is used to represent the target position.
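The yaw angle error that feeds the direction reward can be computed as in the sketch below. It only illustrates the angle error with the averaged target position; the reward formula itself is not reproduced, and the function name and the convention that yaw is measured counterclockwise from the x-axis are assumptions.

```python
import math

def yaw_error_deg(cam_xy, cam_yaw_deg, target_xys):
    """Signed yaw error (degrees) between the camera heading and the mean target position."""
    mean_x = sum(x for x, _ in target_xys) / len(target_xys)
    mean_y = sum(y for _, y in target_xys) / len(target_xys)
    bearing = math.degrees(math.atan2(mean_y - cam_xy[1], mean_x - cam_xy[0]))
    # Wrap the difference into [-180, 180) so that 170 vs. -170 gives 20, not 340.
    return (bearing - cam_yaw_deg + 180.0) % 360.0 - 180.0
```

The wrap-around step matters near the ±180 degree boundary, where a naive subtraction would report a large error for a nearly centered target.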

Fourthly, based on the heuristic that cameras staying at the same position limit the team’s global perspective, a position reward is designed to keep cameras apart from each other:


where the position difference is taken between each pair of cameras at each step.

3.4 Centralized Reinforcement Learning

With state representation and reward function defined above, a centralized value-based multi-agent RL algorithm is used to train our agents, given both estimated target coordinates and camera pose as features. In the following, we discuss the network architecture designed in our algorithm as well as the training strategy.

Figure 3: The network architecture of our Q-Network. Note that FC and GRU represent the Fully Connected layer and the Gated Recurrent Unit, respectively. A one-hot code is used to identify different agents. The i-th camera’s output features from the first two FC layers are combined with its one-hot code and last action and then concatenated with other cameras’ features. The joint feature is the input of FC3, and finally, the network outputs three branches of Q-values for the i-th camera, evaluating actions of translation, rotation, and zoom.

3.4.1 Network Architecture

In our method, a deep Q-network is adopted to approximate the Q-value function. All agents share the same Q-network and choose their actions based on the estimated Q-values and epsilon-greedy algorithm. Figure 3 shows our network architecture. When obtaining the estimated Q-values of a single agent, partial observations of all agents will be input into the network since these observations are shared. Each agent’s partial observation will be firstly encoded through a unit consisting of two fully connected layers. The encoded feature is then concatenated with the one-hot code and agent’s last action, which enables the network to identify specific agent to estimate Q-values and provides temporal information. The combined features of all agents are then concatenated together and input into the follow-up layers. Finally, the network outputs the Q-values of each possible action of the agent. Specifically, the last fully connected layer is divided into three branches to accelerate convergence, considering that our action is denoted as a 3-dimensional vector.
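How an agent could turn the three output branches into an action is sketched below. The function name and the choice to explore independently in each branch are assumptions; the text only states that actions are chosen from the estimated Q-values with an epsilon-greedy algorithm.

```python
import numpy as np

def select_branch_actions(q_values, epsilon, rng):
    """q_values: (3, 3) array, one row per branch (translation, rotation, zoom)."""
    actions = []
    for q_row in np.asarray(q_values):
        if rng.random() < epsilon:
            # Explore: uniform random choice within this branch (assumed scheme).
            actions.append(int(rng.integers(len(q_row))))
        else:
            # Exploit: greedy choice within this branch.
            actions.append(int(np.argmax(q_row)))
    return tuple(actions)
```

Splitting the head into branches means the network outputs 3 + 3 + 3 values per agent instead of one value for each of the 27 joint actions.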

3.4.2 Training Strategy

Previous works have shown that the conventional deep Q-network algorithm suffers from substantial overestimation [32]. Thus the double Q-learning method [6] is adopted to update our Q-network. In double Q-learning, two Q functions, $Q$ and $Q'$, are stored. Both functions provide Q-values of each action given a situation $s$. However, function $Q$ is updated with the value from $Q'$ to avoid overestimation:

$$Q(s, a) \leftarrow r + \gamma \, Q'(s', a^{*}),$$

where $s$ and $s'$ are the current and next situation, $r$ is the received reward, $\gamma$ is the discount factor, and $a^{*}$ is the chosen optimal action, defined as follows:

$$a^{*} = \arg\max_{a'} Q(s', a').$$

In practice, we use two deep Q-networks to learn these two functions. The networks are updated by the temporal difference error (TD error) calculated as follows:

$$\delta = r + \gamma \, Q'(s', a^{*}) - Q(s, a).$$

Then the optimizer uses the squared TD error as the loss to update our estimate network $Q$, and the parameters of $Q$ are copied to $Q'$ at a fixed interval of episodes.
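The double Q-learning TD error described above can be illustrated with small lookup tables standing in for the two networks. This is a sketch only: the table representation and the function name are hypothetical, and a real implementation would evaluate neural networks on batches.

```python
import numpy as np

def double_q_td_error(q_online, q_target, s, a, r, s_next, gamma=0.99):
    """TD error where the online table picks the action and the target table scores it."""
    a_star = int(np.argmax(q_online[s_next]))      # action selection by the online estimate
    target = r + gamma * q_target[s_next, a_star]  # action evaluation by the target estimate
    return target - q_online[s, a]

q_online = np.array([[1.0, 2.0], [3.0, 0.5]])
q_target = np.array([[0.0, 0.0], [10.0, 20.0]])
delta = double_q_td_error(q_online, q_target, s=0, a=1, r=1.0, s_next=1)
```

In this toy case the online table selects action 0 in the next state, so the target uses the value 10 rather than the maximum 20 that a plain max over the target table would pick; this decoupling is what reduces overestimation.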

Figure 4: Our 3D environment. The left two figures show the global view while the six figures in the right respectively show the raw observation of six cameras. The size of the ground truth bounding box is used to judge whether an object is covered. And the green points denote covered objects while the red points denote the uncovered ones in the global view.

4 Experiments

In this section, we first discuss in detail our environment and settings. Then we evaluate our approach for AMOT task in this environment.

4.1 Simulation Environment

Training agents for multi-agent active tracking in real-world scenes is difficult for several reasons. Firstly, it is hard to determine the ground truth needed to compute the reward function or evaluate performance. Secondly, building a new data set specifically for active tracking is expensive due to the numerous possible states. Thirdly, our agent needs to interact with the environment frequently to optimize its policy, which incurs high-cost trial and error in the real world.

To avoid these problems, a virtual environment is built which simulates real-world multi-agent scenes to train our agents. The environment and our code are released to the public for convenience of repeating our results and conducting further study. To make the training realistic, the simulation operates in real time. The capability of generalizing trackers trained in the virtual environment to real-world scenes has already been shown in previous works [2, 1]. The 3D virtual environment is built on the Unreal Engine, with UnrealCV [7, 33] and OpenAI Gym [34], which provide convenient APIs for interaction between the algorithm and the environment.

Our environment simulates a real-world soccer court with twenty-two human-like targets and six mobile cameras, as shown in Figure 4. The scope of target activity is measured in world units, where 100 world units correspond to 1 meter in the real world. Each target is a human-like character with the appearance shown in Figure 6. To mimic real-world soccer match scenes, targets are divided into two teams with different colors, while targets in the same team differ only in the number on the back. At the beginning of each episode, each target is randomly placed within the field. To implement random walking, we randomly set a reachable point in the field as the destination for each target. The target then walks at a constant speed towards that destination. If the destination is reached or more than 15 seconds have elapsed, another point is picked as the destination for that target.
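The random-walk behavior of a target can be sketched as a small simulation class. Only the 15-second timeout and the constant walking speed come from the description above; the class name, the rectangular field, and the example field size and speed used with it are hypothetical.

```python
import math
import random

class WalkingTarget:
    """A target that walks at constant speed toward random destinations."""

    def __init__(self, field_w, field_h, speed, rng, timeout=15.0):
        self.field_w, self.field_h = field_w, field_h
        self.speed = speed        # world units per second (assumed unit)
        self.rng = rng
        self.timeout = timeout    # seconds before a new destination is forced
        self.pos = self._random_point()
        self._new_destination()

    def _random_point(self):
        return (self.rng.uniform(0.0, self.field_w), self.rng.uniform(0.0, self.field_h))

    def _new_destination(self):
        self.dest = self._random_point()
        self.elapsed = 0.0

    def step(self, dt):
        """Advance the target by dt seconds and return its new position."""
        self.elapsed += dt
        dx, dy = self.dest[0] - self.pos[0], self.dest[1] - self.pos[1]
        dist = math.hypot(dx, dy)
        travel = self.speed * dt
        if dist <= travel:
            # Destination reached: snap to it and pick the next one.
            self.pos = self.dest
            self._new_destination()
        else:
            self.pos = (self.pos[0] + dx / dist * travel, self.pos[1] + dy / dist * travel)
            if self.elapsed > self.timeout:
                self._new_destination()
        return self.pos
```

Because both the start point and every destination lie inside the rectangle, the straight-line motion keeps the target inside the field without any explicit boundary check.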

The action space of each camera is composed of three independent dimensions. Each agent chooses one action from each of translation (move clockwise, keep still, move anticlockwise), rotation (turn left, keep still, turn right) and scale (zoom in, keep still, zoom out). The three separate actions are then combined into a joint action, so each agent has 27 different actions in total. The translation action moves the camera clockwise or anticlockwise along the border by 100 world units. The rotation action changes the direction of the camera by 10 degrees. The zoom action changes the zoom scale of the camera by 10%.
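The 27 joint actions can be enumerated directly from the three dimensions. In this sketch the step magnitudes come from the text, while the sign conventions, the ordering, and the names are assumptions, and how the 10% zoom change is applied to the current scale is left unspecified.

```python
from itertools import product

TRANSLATION_STEPS = (-100, 0, 100)  # world units along the border (sign convention assumed)
ROTATION_STEPS = (-10, 0, 10)       # degrees (sign convention assumed)
ZOOM_STEPS = (-0.10, 0.0, 0.10)     # relative change in zoom scale

# Cartesian product of the three dimensions: 3 * 3 * 3 = 27 joint actions.
ACTIONS = list(product(TRANSLATION_STEPS, ROTATION_STEPS, ZOOM_STEPS))

def decode_action(index):
    """Map a flat index in [0, 27) to a (translation, rotation, zoom) triple."""
    return ACTIONS[index]
```

With this ordering, index 13 is the all-still action, which leaves the camera unchanged.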

At the beginning, the six cameras are evenly placed on the border of the court, at a height of 500 world units. Each camera captures frames at a fixed resolution, and its view angle is set to 90 degrees. Our environment also allows the cameras to observe in object-mask mode, so it is easy to obtain the ground-truth bounding boxes of targets. Therefore, we use the bounding box size as the criterion for whether an object is covered in the visual field of a camera, which is more reasonable than the relative distance between the target and the camera.

Figure 5: An example of inverse projection transformation. The black point in the right graph refers to the estimated location of each observed target, while the green point under it denotes the ground truth location.
Figure 6: Two different appearances identify the two teams in our environment. The number on the back varies among different targets.

4.2 Settings

4.2.1 Evaluation Metric

In our environment, with limited numbers of cameras, the joint visual field is unable to cover the whole area. So if the cameras are fixed, in most cases there are always some targets out of view. Therefore the ultimate goal is to train these cameras to cover as many targets as possible. For this purpose, the average coverage rate is used for the metric:


where the visible flag is defined in Equation (7), and the remaining quantities are the average coverage rate, the number of targets, and the episode length. The bounding box area is used as a threshold to determine whether a target is captured by a camera. Existing metrics in passive multi-object tracking (MOT) or single-object active tracking tasks seem plausible for the AMOT task but are actually not applicable. For example, the Multiple Object Tracking Accuracy (MOTA) is a fair metric in MOT. It is defined by the number of misses, false positives, and mismatches. However, it can only evaluate the effectiveness of the detector. The camera control policy has little influence on MOTA, since it changes the cameras’ poses rather than the performance of the detectors. Therefore, we only use coverage rate as the evaluation metric.

Physical Meaning                                              Value
Team reward weight                                            0.4
Max angle in direction reward                                 π/4
Max size proportion in bounding box reward                    0.2
Min size proportion in bounding box to be counted observed    0.0005
Max camera distance in position reward                        5000
Vision reward weight                                          0.8
Direction reward weight                                       0.2
Position reward weight                                        0.2
Table 1: The details of parameters of the rewards.

4.2.2 Baseline

Due to differences in environment setting, task goal, and problem formulation, simply transferring existing algorithms in other tasks like coverage problem in sensor deployment [35] is not fair and thus unconvincing. Besides, previous work in active tracking only involves single-target tracking tasks, with no datasets or algorithms for AMOT. Thus we provide fixed cameras as baselines for our mobile cameras to compare with. These fixed cameras are evenly distributed at the border and oriented towards the center point of the court.

4.2.3 Hyper Parameters

For our RL algorithm, the learning rate is 0.0005 and the reward discount factor is 0.99. The batch size, i.e., the number of episodes to train on, is set to 32. In the epsilon-greedy policy, the exploration factor starts at 1.0 and finishes at 0.1, with an annealing time of 50000 steps. One episode lasts for 100 steps; after the final step the environment is reset, i.e., the position of each target is randomly set, and the position and pose of each camera are re-initialized. Since we use double Q-learning, the target network is updated every 100 episodes. For evaluation, we train each model for 500k steps, i.e., 5k episodes. The parameters of the rewards are presented in Table 1.
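The exploration schedule can be written as a short function. The start value, end value, and annealing time are those given above; the linear shape of the decay and the function name are assumptions, since the text does not state how the factor is annealed.

```python
def epsilon_at(step, start=1.0, end=0.1, anneal_steps=50_000):
    """Exploration factor after `step` environment steps, assuming linear annealing."""
    frac = min(max(step, 0) / anneal_steps, 1.0)
    return start + frac * (end - start)
```

Under this assumption the factor reaches its final value after 500 episodes of 100 steps, i.e., within the first tenth of the 500k training steps.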

Figure 7: Global view of the environment when applying baseline and our method. For comparison, targets are placed at the same location. Since the bounding box size instead of distance is the only criterion of coverage, the green wedge in this figure is only an approximate visual field of each camera. So targets 4 and 22 are green in the right graph though out of any wedge.

4.3 Performance Evaluation and Discussion

We quantitatively evaluate the performance of our method and compare it to the baseline method. To show the necessity of inverse projection transformation, we also compare our method with a variant that directly uses bounding boxes as Q-network inputs. An ablation analysis is also conducted to show the effectiveness of each proposed individual reward.

4.3.1 Evaluation

Method CR(%)
Table 2: Comparative results in “soccer court” environment. Note that CR represents Coverage rate, “Ours+” refers to our method with ground truth bounding boxes replacing object detector, and “Ours-” refers to our method without inverse projection transformation.

Our agents are trained and tested with the settings mentioned above. We compare our system with the baseline and other modified methods, as shown in Table 2. To assess the effect of the object detector, we compare our method with “Ours+”, in which precise bounding boxes of each target are provided directly by the environment instead of the detector. Moreover, to demonstrate the effectiveness of inverse projection transformation for coordinate alignment, we also compare our method with the two-stage tracking method “Ours-”. In “Ours-”, the inverse projection transformation step is omitted, so the bounding boxes detected in each camera, rather than the aligned coordinates, are the inputs of the Q-networks. To account for the uncertainty introduced by initialization, we conduct 100 runs for each method and report the mean and standard deviation of the coverage rate.

The results demonstrate that our agents are able to move and track actively. When targets are about to leave the view, our agents can take action to approach them and keep them in view, whereas the fixed cameras cannot handle these situations and keep losing targets. The results also show that using the object detector or ground truth bounding boxes makes little difference in performance, which supports the effectiveness of YOLOv4-tiny as our object detector. They further show that using inverse projection transformation for alignment notably improves the convergence speed and achieves better performance. Figure 7 illustrates an example of the difference between our method and the baseline. The fixed cameras have many blind areas where targets can never be captured. In contrast, our agents track objects actively when they are about to move out of the visual field, and cooperate to cover the maximum number of targets.

To further verify the correctness of the inverse projection transformation technique, we run a test of 1000 steps and compare the calculated coordinates with the ground truth. The results show that the average Euclidean distance between them is 29.6 world units, with a standard deviation of 8.04. Relative to the extent of the area in which targets move in our environment, this estimation error is negligible. The main cause of the error is the incomplete bounding box of a partially observed target. Figure 5 shows an example of inverse projection transformation. For most targets, the estimated locations are close to the ground truth, as the black points nearly cover the green points. For target 20, however, the estimation is less accurate because the object is only partially captured by camera 6; its real position is closer to the camera.
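The error statistic described above can be sketched as follows. The function name `localisation_error` and the use of the population standard deviation are assumptions for this example; the text does not state which estimator of the standard deviation was used, and the coordinates below are made-up values, not measurements from the environment.

```python
import math
import statistics

def localisation_error(estimated, ground_truth):
    """Mean and (population) standard deviation of the Euclidean distance
    between estimated and ground-truth target coordinates."""
    if len(estimated) != len(ground_truth):
        raise ValueError("estimated and ground_truth must have the same length")
    distances = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    return statistics.fmean(distances), statistics.pstdev(distances)

if __name__ == "__main__":
    # Illustrative coordinates in world units.
    est = [(100.0, 200.0), (310.0, 395.0), (52.0, 48.0)]
    gt = [(110.0, 215.0), (300.0, 400.0), (50.0, 50.0)]
    mean_err, std_err = localisation_error(est, gt)
    print(f"mean error {mean_err:.2f}, std {std_err:.2f} world units")
```

Pairs would be collected only for targets that are actually detected at a given step, since undetected targets have no estimated coordinate to compare.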

Method CR(%)
Ours - Vis
Ours - Dir
Ours - Box
Ours - Pos
Table 3: Results of ablation study. Note that CR represents Coverage rate. We adjust the reward structure for each compared method. Models are trained for 200k steps.

4.3.2 Ablation Analysis

Our final reward is a weighted sum of the individual rewards and the team reward. In Section 3.4, four individual rewards are introduced in our reward definition: the visibility, direction, bounding box, and position rewards. These four rewards are important for our agents to learn an appropriate collaboration policy. We therefore conduct an ablation analysis to illustrate the effectiveness of these four rewards and of the combination of the individual rewards and the team reward. In each compared method, we remove one of the four rewards from our individual reward. As shown in Table 3, performance drops considerably when any of these four rewards is discarded, or when only the team reward or only the individual reward is used. In other words, fusing the individual rewards and the team reward is effective, and each type of individual reward is necessary for our method to achieve a higher coverage rate.
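The ablation procedure can be sketched as a weighted sum from which one term is dropped. This is only an illustration of the idea: the function `combined_reward`, the term keys, and all weights and reward values below are hypothetical placeholders, not the values of Table 1 or the exact formulation of Section 3.4.

```python
def combined_reward(individual_terms, team_reward, weights, team_weight, drop=None):
    """Weighted sum of the individual reward terms plus the weighted team reward.
    `drop` names one individual term to leave out, mimicking an ablation variant."""
    individual = sum(
        weights[name] * value
        for name, value in individual_terms.items()
        if name != drop
    )
    return individual + team_weight * team_reward

if __name__ == "__main__":
    # Placeholder per-step reward values and weights for one agent.
    terms = {"vis": 1.0, "dir": 0.5, "box": 0.5, "pos": 0.0}
    weights = {"vis": 0.5, "dir": 0.5, "box": 0.5, "pos": 0.5}
    print("full reward:", combined_reward(terms, 1.0, weights, team_weight=1.0))
    for name in terms:
        r = combined_reward(terms, 1.0, weights, team_weight=1.0, drop=name)
        print(f"without {name}: {r}")
```

Setting `team_weight` to zero, or passing an empty dictionary of individual terms, would correspond to the variants that use only the individual reward or only the team reward.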

5 Conclusion

In this paper, we first formulate the AMOT task as a POMDP problem and then propose a coordinate-aligned multi-camera collaboration method for it. Moreover, we establish a 3D environment that simulates real-world multi-object tracking scenes to train and evaluate our method. The specifically designed reward and the centralized multi-agent reinforcement learning network enable our agents to learn an effective collaboration policy. To address the problem of integrating partial observations, we leverage inverse projection transformation as an intermediate step, converting bounding boxes into aligned coordinates. Empirical results show that our method outperforms the traditional method with fixed cameras by achieving a higher coverage rate, and they validate the effectiveness of our object detector and of the inverse projection transformation step.


  • [1] Wenhan Luo, Peng Sun, Fangwei Zhong, Wei Liu, Tong Zhang, and Yizhou Wang. End-to-end active object tracking via reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 3286–3295. PMLR, 2018.
  • [2] Fangwei Zhong, Peng Sun, Wenhan Luo, Tingyun Yan, and Yizhou Wang. Ad-vat: An asymmetric dueling mechanism for learning visual active tracking. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [3] Jing Li, Jing Xu, Fangwei Zhong, Xiangyu Kong, Yu Qiao, and Yizhou Wang. Pose-assisted multi-camera collaboration for active object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 34, pages 759–766, 2020.
  • [4] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Scaled-yolov4: Scaling cross stage partial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 13029–13038, 2021.
  • [5] P.F. Sturm and S.J. Maybank. On plane-based camera calibration: A general algorithm, singularities, applications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 432–437, 1999.
  • [6] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 30, 2016.
  • [7] Weichao Qiu, Fangwei Zhong, Yi Zhang, Siyuan Qiao, Zihao Xiao, Tae Soo Kim, Yizhou Wang, and Alan Yuille. Unrealcv: Virtual worlds for computer vision. ACM Multimedia Open Source Software Competition, 2017.
  • [8] Jing Xu, Fangwei Zhong, and Yizhou Wang. Learning multi-agent coordination for enhancing target coverage in directional sensor networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 10053–10064. Curran Associates, Inc., 2020.
  • [9] Kye Kyung Kim, Soo Hyun Cho, Hae Jin Kim, and Jae Yeon Lee. Detecting and tracking moving object using an active camera. In Proceedings of the International Conference on Advanced Communication Technology (ICACT), volume 2, pages 817–820, 2005.
  • [10] Joachim Denzler and Dietrich WR Paulus. Active motion detection and object tracking. In Proceedings of the IEEE International Conference on Image Processing (ICIP), volume 3, pages 635–639, 1994.
  • [11] D. Comaniciu and V. Ramesh. Robust detection and tracking of human faces with an active camera. In Proceedings of the IEEE International Workshop on Visual Surveillance (VS), pages 11–18, 2000.
  • [12] Don Murray and Anup Basu. Motion tracking with an active camera. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 16(5):449–459, 1994.
  • [13] Yang Li, Jianke Zhu, Steven CH Hoi, Wenjie Song, Zhefeng Wang, and Hantang Liu. Robust estimation of similarity transformation for visual object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 33, pages 8666–8673, 2019.
  • [14] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv Preprint arXiv:2004.10934, 2020.
  • [15] Hongwei Hu, Bo Ma, Jianbing Shen, Hanqiu Sun, Ling Shao, and Fatih Porikli. Robust object tracking using manifold regularized convolutional neural networks. IEEE Transactions on Multimedia (TMM), 21(2):510–521, 2019.
  • [16] Qiurui Wang, Chun Yuan, Jingdong Wang, and Wenjun Zeng. Learning attentional recurrent neural network for visual tracking. IEEE Transactions on Multimedia (TMM), 21(4):930–942, 2019.
  • [17] Ningxin Liang, Guile Wu, Wenxiong Kang, Zhiyong Wang, and David Dagan Feng. Real-time long-term tracking with prediction-detection-correction. IEEE Transactions on Multimedia (TMM), 20(9):2289–2302, 2018.
  • [18] Yunus Çelik, Mahmut Altun, and Mahit Güneş. Color based moving object tracking with an active camera using motion information. In Proceedings of the International Artificial Intelligence and Data Processing Symposium (IDAP), pages 1–4, 2017.
  • [19] Aswin C. Sankaranarayanan, Ashok Veeraraghavan, and Rama Chellappa. Object detection, tracking and recognition for multiple smart cameras. Proceedings of the IEEE, 96(10):1606–1624, 2008.
  • [20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv Preprint arXiv:1312.5602, 2013.
  • [21] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [22] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • [23] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 4295–4304. PMLR, 2018.
  • [24] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pages 321–384, 2021.
  • [25] Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • [26] Xinlei Pan, Yurong You, Ziyan Wang, and Cewu Lu. Virtual to real reinforcement learning for autonomous driving. In Proceedings of the British Machine Vision Conference (BMVC), 2017.
  • [27] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Fei-Fei Li, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 3357–3364, 2017.
  • [28] Zhang-Wei Hong, Yu-Ming Chen, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, Hsuan-Kung Yang, Brian Hsi-Lin Ho, Chih-Chieh Tu, Yueh-Chuan Chang, Tsu-Ching Hsiao, et al. Virtual-to-real: Learning to control in visual semantic segmentation. arXiv Preprint arXiv:1802.00285, 2018.
  • [29] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 1928–1937. PMLR, 2016.
  • [30] Mao Xi, Yun Zhou, Zheng Chen, Wengang Zhou, and Houqiang Li. Anti-distractor active object tracking in 3d environments. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), pages 1–1, 2021.
  • [31] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998.
  • [32] Hado Hasselt. Double q-learning. Advances in Neural Information Processing Systems, 23:2613–2621, 2010.
  • [33] Fangwei Zhong, Weichao Qiu, Tingyun Yan, Alan Yuille, and Yizhou Wang. Gym-unrealcv: Realistic virtual worlds for visual reinforcement learning. Web Page, 2017.
  • [34] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv Preprint arXiv:1606.01540, 2016.
  • [35] Andrew Howard, Maja J. Matarić, and Gaurav S. Sukhatme. Mobile sensor network deployment using potential fields: A distributed, scalable solution to the area coverage problem. In Hajime Asama, Tamio Arai, Toshio Fukuda, and Tsutomu Hasegawa, editors, Distributed Autonomous Robotic Systems 5, pages 299–308, Tokyo, 2002. Springer Japan.