Camera networks are becoming ubiquitous in smart cities, where monitoring of urban environments has numerous applications such as traffic management, law enforcement, security and automated surveillance. In these scenarios, camera sensors are deployed in public spaces such as road intersections and common areas in residential, commercial and government complexes to collect data, which is transmitted, stored and analysed by the government or local authorities. For example, surveillance cameras in residential and commercial complexes can be used to identify and track trespassers and unauthorized personnel, or for forensic analysis when investigating crimes.
For these applications, tracking targets is important, and most approaches to multi-camera tracking are driven by state-of-the-art visual object detection, tracking and re-identification methods. While single-camera tracking poses challenges such as appearance, lighting, viewpoint and background variations and occlusions, multi-camera tracking with non-overlapping fields-of-view (FOV) poses the additional challenge of re-identifying targets across cameras. Since camera networks often have units that are spatially distant, a transition from one FOV to another may take several seconds, minutes or even longer, depending on the scale of the camera network. Depending on network size and the cameras’ frame rate (FPS), these networks generate a deluge of video frames, all of which are potential query candidates for the re-identification module. Handling such volumes requires scalable methods. One common approach is to select the camera feeds where the target is likely to be present. This benefits both manual and automated surveillance, as fewer frames need to be processed to track targets of interest. Along the same lines, we investigate camera selection decisions to identify the camera frame where the target is most likely to reappear at the next time instance.
Inter-camera target handovers are typically resolved using visual re-identification (Re-Id) techniques, where the current template of the target is matched against all target candidates in all candidate cameras. Even for small camera networks with non-overlapping camera FOVs, this association problem is very challenging because the time a target takes to transition between two non-overlapping FOVs is unknown and non-deterministic. This uncertainty results in a large number of candidate frames, each with possibly many target candidates. Since most Re-Id or verification approaches work at an operating point chosen for a fixed False Alarm Rate (FAR), the number of false alarms depends on the number of frames processed for Re-Id. Re-Id false alarms can be very detrimental to the tracker’s performance. Hence, minimizing the number of frames that undergo a Re-Id query is critical to tracking performance in camera networks, and it also reduces computational cost because frames that are not queried need not be processed. An intelligent camera frame selection strategy could therefore benefit both the accuracy and the efficiency of a multi-camera target tracking system.
In this paper, we highlight this important yet relatively unexplored problem of camera selection in multi-camera target tracking. Ideally, none of the camera frames should be selected for a Re-Id query during a target transition period. These target transition times are typically time-varying and are characterized by target speed, inter-camera distance and other processes, which nonetheless can be modeled from the video data captured by the cameras. Thus, we propose to learn a camera selection policy that intelligently schedules Re-Id queries to resolve inter-camera handovers. We design our approach so that the learning strategy directly leverages the video data and does not depend upon the network topology. We show experimentally that our proposed method makes far fewer queries to the network than the baseline and other competing methods used in the literature.
The likelihood of a target appearing at the next time step in one of the cameras is time-varying and depends on various non-deterministic factors such as target speed and occlusion. Based on this observation, it is natural to model the camera selection problem for scheduling Re-Id queries as a Markov Decision Process (MDP), which was investigated in our initial work Sharma et al. by employing the Q-learning method to solve the MDP exactly. However, exact methods are hard to scale to larger camera networks, which have larger state and action spaces. Therefore, in this paper, we present an extension of Sharma et al. and show that deep learning based approximate methods such as Deep Q-Networks (DQN) Mnih et al. (2013) can be effectively used to scale up our camera selection approach to larger camera networks. In addition to the datasets used in Sharma et al., such as NLPR-MCT Chen et al. (2014), we also evaluate the approximate approaches on larger camera networks such as the DukeMTMC dataset Ristani et al. (2016). The learned camera selection policy is used for inter-camera tracking (ICT) to generate an action that corresponds either to waiting for the next time step by selecting a dummy camera, or to selecting one of the real cameras to schedule a Re-Id query. Finally, the policy is learned directly from the videos captured by the camera units and does not assume knowledge of the underlying network topology. Nonetheless, in our experiments, we observe that the policy implicitly learns the network topology.
The specific contributions of this paper are:
We highlight the importance of camera selection decisions to enable accurate and efficient target tracking in a network of cameras.
We extend our reinforcement learning based approach for intelligent querying in camera networks (which used exact Q-learning Sharma et al.) using deep learning. The deep learning based approximation enables camera selection for larger camera networks, whereas the exact methods fail in larger state spaces. We also include a dense reward that helps to distinguish between states.
We demonstrate over multiple real-world datasets pertaining to both indoor and outdoor environments that the learned camera selection policy queries a very small number of frames with a small trade-off on the recall values.
The rest of the paper is organized as follows: Sec. 2 discusses prior work relevant to the problem of multi-camera tracking and Deep Reinforcement Learning. We present the details of our formulation and training procedure in Sec. 3 and show comparative results and empirical evaluation in Sec. 4. Then we discuss the limitations in Sec. 5 and conclude in Sec. 6.
2 Related Works
2.1 Visual Tracking in Camera Networks
Prior works Hamid et al. (2010); Zhang et al. (2015); Khan and Shah (2003) assumed overlapping cameras and found the 3D coordinates of the target object for tracking. These works rely on camera calibration and network topology to derive the 3D coordinates. Other approaches for tracking with 3D coordinates include a network flow formulation Zhang et al. (2015) and Kalman filter based tracking using the homography matrix Khan and Shah (2003). The requirement of overlapping FOVs is too strong a constraint and has limited applicability in the real world.
There have been efforts to track a target across non-overlapping FOVs. Initially, inter-camera tracking (ICT) was performed to find camera handovers using an affinity model of the target’s appearance Kuo et al. (2010), a social grouping model Zhang et al. (2015), or data association methods across multiple cameras Makris et al. (2004); Chen et al. (2015); Daliyot and Netanyahu (2013). Other approaches formulate the tracking problem using graph based approaches Zhang et al. (2015); W. Chen et al. (2017), contextual information Y. Cai (2014), spatio-temporal mapping between 3D coordinates Chen et al. (2011), and clique based formulations Ristani and Tomasi (2015); Ristani et al. (2016). Many other works also incorporate the target’s travel time to estimate the transition time of camera handovers Javed et al. (2008). These approaches perform target tracking in a unified way. However, a multi-camera network setup poses multiple challenges in terms of illumination variation, occlusions, and uncertain transition times that may have a time-varying distribution for different targets. In this direction, many related works incorporated appearance based template matching to track the target across the camera network. The appearance cues of the target were modeled in Sunderrajan and Manjunath (2013); Zhang et al. (2015); Daliyot and Netanyahu (2013). The appearance of the target is generally captured by color Zhang et al. (2015) or texture Daliyot and Netanyahu (2013) features. To handle lighting variations across different cameras, color normalization Y. Cai (2014) and brightness transfer functions Javed et al. (2008) were used. Spatio-temporal reasoning Hamid et al. (2010); Kuo et al. (2010) and graph based methods W. Chen et al. (2017) were applied over the appearance features to perform inter-camera tracking. Building on appearance based information, Bayesian inference Huang and Russell (1997) was applied by integrating the color and size of the object with the velocity and the arrival times of the target in two camera views. The approach was extended to more than two camera views using hidden variables Pasula et al. (1999) in the Bayesian formulation. Matei et al. (2011), on the other hand, used a multi-hypothesis framework instead of a Bayesian model. Other approaches to multi-camera target tracking use conditional random fields (CRF) Chen and Bhanu (2017) and a global graph model using MAP association and flow graphs W. Chen et al. (2017). The state-of-the-art method Lee et al. (2018) on the NLPR dataset uses a two-step framework to perform SCT (Single Camera Tracking) and ICT (Inter-Camera Tracking) separately. It incorporates multiple human appearance features along with segmentation using change point detection to perform SCT. It performs ICT by building a camera link model using a combination of appearance features and the distribution of the target’s transition time between camera pairs.
Apart from the above related works, re-identification based approaches Zheng et al. (2017); Lavi et al. (2018) are prominent for template matching to associate different bounding box detections. All-pair template matching using re-identification and data association is known to be NP-hard Collins (2012); Chari et al. (2015), and hence multiple methods Ristani and Tomasi (2018); Chen and Bhanu (2017) use time-consecutive frames to reduce the search complexity. Ristani and Tomasi (2018) use correlation clustering to trade off the computational cost. In contrast to these works, we propose a reinforcement learning based policy learning approach that selects a camera where the target is likely to be present at the next time step, with the goal of reducing the search space for template matching (Re-Id). Our approach can be readily integrated with any Re-Id approach, as we focus only on the frame selection component.
Deep learning based approaches have shown superior template matching performance. For example, Ristani and Tomasi (2018) proposed a weighted triplet loss for re-identification to obtain a better feature representation. They achieve multi-camera tracking using correlation clustering. However, their approach is restricted to tracking targets offline. In this paper, we show that camera selection decisions are crucial to enable tracking in camera networks by comparing the number of frames that various related methods need to process. Moreover, our camera selection approach can benefit both online and offline target tracking in multi-camera networks.
2.2 Deep Reinforcement Learning
Many vision problems Xiang et al. (2015); Yun et al. (2017); Paletta et al. (2005); Karayev et al. (2014); Mnih et al. (2013) have been formulated using a Markov Decision Process (MDP) Bellman (1957). Formulating the tracking problem as an MDP is effective because the agent learns to take actions sequentially, which implicitly models the target’s motion. In our formulation, we use one MDP for the camera selection decision to enable single target tracking in multiple cameras, which can easily be extended to tracking multiple targets by running multiple policies simultaneously. Deep Q-learning Mnih et al. (2013) has shown human-level performance in playing Atari games using visual frames. Such methods use a one-step reward during training; however, n-step rewards Sutton and Barto (1998) can help convergence by bootstrapping states with a multi-step reward. Work on time limits in reinforcement learning Pardo et al. (2018) has shown that randomizing the state vector after a time limit achieves better performance. Recently, deep reinforcement learning techniques were applied to visual object detection Mathe et al. (2016) and tracking Yun et al. (2017); III and Ramanan (2017); Luo et al. (2017). These approaches are applied to single object/target tracking in a single camera field-of-view. To our knowledge, we are the first to explore deep reinforcement learning for single target tracking in a camera network. We show that a policy learned using reinforcement learning can intelligently poll cameras to reduce the number of frames required for the target’s template matching. In our approach, we use deep Q-learning Mnih et al. (2013) to learn a policy that polls a camera frame at any time-step to look for the presence of the target.
3 Proposed Methodology
In this section, we will provide the details of the system architecture and the reinforcement learning formulation for the camera selection decision problem.
3.1 System Overview
Figure 1 shows our system architecture, which consists of two blocks. The first, block Q, learns a policy to select the next camera where the target is likely to appear given the target’s current state. The second block verifies the presence of the target in the selected camera frame. In surveillance, this is usually done manually using human input or automatically using re-identification based approaches Lavi et al. (2018); Zheng et al. (2017). We call this second block the presence block. The presence block takes the selected camera frame as input and returns whether the target is present in the frame, along with the corresponding bounding box location; otherwise it returns a null result. As our focus is on learning the policy for camera selection, we begin with the assumption that the presence block is perfect, and then investigate the impact of errors in presence prediction. We achieve this by using ground truth labels to simulate a perfect Re-Id approach, and then inducing random matching errors at different levels, in effect simulating outputs from Re-Id models at different levels of accuracy. This setting allows us to systematically evaluate the strength of our camera selection policy.
The block Q takes as input the current state (detailed in the next subsection) and selects a camera index, which is then polled to search for the target using the presence block. The policy selects one of $N+1$ actions, where $N$ is the number of cameras. The first $N$ actions correspond to the individual cameras, and the $(N+1)$-th action is to be selected when the target is transitioning from one camera’s FOV to another. The sequence of selected cameras gives the target’s trajectory in terms of the cameras in which the target appears over time. This is a non-trivial task because the transition time of each target between cameras is unknown and non-deterministic, and because any wrong selections made at previous timestamps must also be corrected. Consider the examples shown in Fig. 2, where the transition times between a pair of cameras are different for all targets. The policy is implemented using the neural network model depicted in Fig. 3. The network parameters are learned using deep Q-learning Mnih et al. (2013) with n-step bootstrapping Sutton and Barto (1998). In the subsequent subsections, we provide details of the training and testing algorithm for camera selection decisions.
3.2 Markov Decision Process and Q-learning
The goal of reinforcement learning is to learn a policy that decides sequential actions specific to the target’s state by maximizing a cumulative reward function Sutton and Barto (1998). Our system architecture uses deep Q-learning to learn a policy to make camera selection decisions. A decision problem can be formulated using a Markov Decision Process (MDP). The MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $\mathcal{T}$ is the state transition function and $\mathcal{R}$ is the reward function that determines the reward that the environment provides for taking an action $a$ at state $s$. In an MDP, we learn a stochastic policy $\pi(a \mid s)$ that gives the probability of an action given the current state of the agent. The environment then responds at the next time step with the next state (decided by the state transition function) and a reward (decided by the reward function). In the real world, both the state transition function and the reward function can be stochastic. Given the MDP formulation, we can learn the policy using a trial-and-error strategy. Our MDP formulation is given in the later part of this subsection. We define a state-action value function $Q(s, a)$ to estimate the expected value of the return for taking action $a$ in state $s$. The return $G_t$ at time $t$ is defined as the discounted sum of future rewards:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}$$

where $\gamma \in [0, 1)$ is the discount factor, which is typically included to make the return bounded. The estimates are then used to make the action selection decision. We use the state-action value function $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid s_t = s, a_t = a]$ to learn the expected return when starting at state $s$, taking action $a$ and following policy $\pi$ for further time-steps (we will use $Q$ in place of $Q^{\pi}$ in the following text). The value function tells us how good, in terms of expected future reward, the current state and action are under the current policy.
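As a small illustration of the return defined above, the following sketch computes the discounted sum for a finite reward sequence. The function name and the example rewards are hypothetical and are not taken from the paper; a finite episode is assumed so that the sum terminates.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence r_{t+1}, r_{t+2}, ...

    `rewards` and `gamma` are illustrative inputs, not values from the paper.
    """
    g = 0.0
    # Accumulate backwards so every earlier step discounts all later rewards once more.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Example: a single reward of 1.0 received three steps into the future.
example_rewards = [0.0, 0.0, 1.0]
print(discounted_return(example_rewards, gamma=0.9))
```

With a discount factor below one, a reward that arrives later contributes less to the return, which is why a policy is encouraged to resolve a handover in as few steps as possible.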
The goal is to learn the optimal state-action value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ Sutton and Barto (1998). The optimal Q-function can be learned using reinforcement learning techniques such as Q-learning or policy gradient methods. We use Q-learning to learn an optimal Q-function because it is an off-policy and model-free algorithm, i.e., an optimal policy can be learned by trial-and-error exploration of the state space without an accurate state transition model. The learning procedure is explained later in this subsection.
We formulate camera selection as a decision problem where each camera is considered a separate action. As noted in related work Ristani and Tomasi (2018), target tracking over a camera network can be NP-hard when the target is searched for in all cameras at all times, so it is important to reduce the number of search operations while tracking a target across multiple cameras. Selecting one camera for the search operation removes the need to search across all cameras. The cameras in a typical camera network are deployed far apart, and hence searching is pointless while the target is transitioning between cameras. To account for this, the policy learns to select a null camera when the target is not visible in any of the cameras. Therefore, the task is to learn a policy that gives the probability of selecting a camera (equivalently, selecting an action) given the target’s current state, i.e., $\pi(a \mid s)$, where $a$ is the action (equivalently, the camera in the context of camera selection decisions). Such a policy can predict the period of visibility of the target in the camera network (when it is visible in any of the cameras) and the period of invisibility (when it is not visible in any of the cameras). We will show that this policy can be learned directly from data using a trial-and-error based approach, i.e., by taking feedback from the environment.
However, this problem does not map to an MDP directly because the target is only partially observable, for instance due to occlusion by other targets or because the target is not present in the selected camera. For example, the policy may select camera $c_i$ while the target is present in the FOV of camera $c_j$. In that case, the observation that the learning agent gets from the environment does not provide the target’s state information, which makes the state non-Markov. Hence, we need to create a state vector from the noisy observations that is Markov (the next state is independent of previous states, given the current state). For a partially observable environment, we can keep a history vector of the observations from the initial location to the current time, which helps to estimate the next state. However, including the full history in the state vector becomes intractable. Therefore, we need to create a Markov state from a limited history. In addition to the observations, we keep the action history and the time elapsed in the state vector. For more on the partially observable problem, readers are encouraged to read Sutton and Barto (1998). The individual components of the state vector are defined in the following text:
State: The state $s_t$ at time $t$ captures the observations of the target $o_t$, the history of cameras selected by the policy $h_t$, and the time elapsed $\tau_t$. The individual elements of the state are the following:
Observation history $o_t$: An observation of the target is its spatial location in a particular camera frame, i.e., $(c, b)$, where $c$ is the camera index and $b$ is the bounding box in camera $c$. If we keep the last $k$ observations of the target, then the next location can be estimated (for example, using a Kalman filter), and hence the last $k$ observations make the state vector Markov. The last $k$ observations form the vector $o_t$, in which $c$ is encoded as a one-hot vector and $b$ is encoded by normalizing the bounding box location, i.e., $b = (x/W,\, y/H,\, w/W,\, h/H)$. Here $(x, y)$ are the pixel coordinates of the upper left corner of the bounding box, $w$ and $h$ are its width and height respectively, and $W$ and $H$ are the corresponding image dimensions.
Camera history $h_t$: The action at the next time-step depends on the current action and the previous actions selected by the policy. Hence, we include the previously selected actions in the state vector. $h_t$ represents the history of cameras selected by the learned policy, with each selection encoded as a one-hot vector.
Elapsed time $\tau_t$: It captures the number of time ticks elapsed since the target was last seen in any camera. Motivated by time limits in reinforcement learning Pardo et al. (2018), we include $\tau_t$ to work with indefinite transition times. For infinite-horizon problems, that work suggests randomizing the state once the time limit expires, which was shown to achieve better performance.
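The three state components described above can be assembled into a flat vector roughly as follows. This is a sketch under stated assumptions rather than the authors' implementation: the helper names, the window contents, and the choice to give the camera history one extra slot for the null camera are all hypothetical.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_observation(camera, box, num_cameras, frame_w, frame_h):
    """One-hot camera index followed by the bounding box normalized by image size."""
    x, y, w, h = box
    return one_hot(camera, num_cameras) + [x / frame_w, y / frame_h, w / frame_w, h / frame_h]


def build_state(observations, camera_history, elapsed, num_cameras, frame_w, frame_h):
    """Concatenate recent observations, past selections and elapsed time.

    Assumption: past selections may include the null camera, so they are
    encoded with num_cameras + 1 slots (the last index is the null camera).
    """
    state = []
    for camera, box in observations:
        state += encode_observation(camera, box, num_cameras, frame_w, frame_h)
    for action in camera_history:
        state += one_hot(action, num_cameras + 1)
    state.append(float(elapsed))
    return state


# Example: 3 cameras, one past observation in camera 1, two past selections (camera 0, then null).
state = build_state([(1, (160, 120, 32, 48))], [0, 3], elapsed=2,
                    num_cameras=3, frame_w=320, frame_h=240)
print(len(state))
```

Keeping the vector a fixed length (fixed numbers of observations and past selections) is what lets it be fed to a table lookup after discretization or directly to a neural network.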
Actions: The action $a_t$ at time $t$ is encoded by an $(N+1)$-dimensional vector, where $N$ is the number of cameras in the camera network. An optimal policy should select an action from the first $N$ actions when the target is visible in the camera with the corresponding index. The $(N+1)$-th action is selected when the policy selects no camera, i.e., when the target is not visible in any of the cameras.
State transition function: After deciding an action $a_t$ at time $t$, the next state is determined by the following state evolution function:

$$s_{t+1} = \mathcal{T}(s_t, a_t)$$

The function appends the one-hot encoding of the selected camera to the camera history vector. If the target is found in the selected camera, the last-seen observation vector is updated by including the new observation; otherwise, the elapsed time is incremented by 1.
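The update rule just described can be sketched as a small function. The fixed window lengths (`max_obs`, `max_hist`), the reset of the elapsed time to zero on a successful detection, and the representation of a detection as a `(camera, box)` pair or `None` are assumptions made for illustration.

```python
def step_state(observations, camera_history, elapsed, action, detection, max_obs, max_hist):
    """Advance the (observations, camera_history, elapsed) triple by one time step.

    `detection` is a hypothetical presence-block output: a (camera, box) pair
    when the target is found in the selected camera, otherwise None.
    """
    # The selected action is always appended; only the most recent ones are kept.
    camera_history = (camera_history + [action])[-max_hist:]
    if detection is not None:
        # Target found: refresh the observation window and reset the timer (assumed).
        observations = (observations + [detection])[-max_obs:]
        elapsed = 0
    else:
        # Target not found (wrong camera, null camera or occlusion): one more tick.
        elapsed += 1
    return observations, camera_history, elapsed


obs, hist, tau = [(0, (10, 20, 30, 60))], [0], 0
obs, hist, tau = step_state(obs, hist, tau, action=3, detection=None, max_obs=2, max_hist=3)
obs, hist, tau = step_state(obs, hist, tau, action=1, detection=(1, (5, 5, 30, 60)), max_obs=2, max_hist=3)
print(hist, tau)
```

The function returns new lists instead of mutating its inputs, so a transition (state, action, next state) can be stored without the stored state changing afterwards.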
Reward: The reward function is defined for each state and action pair. In Sharma et al., we used a binary reward function; here we use a dense reward. During training, this reward indicates how long it will take to complete the current handover. At time $t$, it is defined as follows:
Assumptions: We assume that all the cameras of the camera network are uniquely identifiable and the camera network topology doesn’t change during testing phase (the CCTV network infrastructure doesn’t frequently change in the real world too).
Policy: The policy selects an optimal action from the learned Q-value function. After learning, given the target state $s_t$, it selects the action with the highest Q-value:

$$a_t = \arg\max_{a} Q(s_t, a) \tag{5}$$
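A greedy selection over the Q-values can be sketched as below. Treating the last index as the null camera is an assumption that follows the action encoding described earlier, and the function names are hypothetical.

```python
def greedy_action(q_values):
    """Index of the largest Q-value; ties resolve to the lowest index."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def describe_action(action, num_cameras):
    """Human-readable meaning of an action, assuming index num_cameras is the null camera."""
    if action == num_cameras:
        return "wait (null camera)"
    return f"query camera {action}"


q_values = [0.1, 0.7, 0.2, 0.4]  # 3 cameras plus the null camera (illustrative values)
action = greedy_action(q_values)
print(action, describe_action(action, num_cameras=3))
```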
Q-learning: Q-learning is a temporal-difference (TD) learning algorithm Sutton and Barto (1998) that learns directly from state-space exploration without knowing a state-transition model. Q-learning learns an optimal Q-value function by iteratively updating the values using the following Bellman equation, independent of the policy being followed:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{6}$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. At state $s_t$, the learning agent performs an action $a_t$, and the environment responds with a new state $s_{t+1}$ and a reward $r_{t+1}$. An optimal Q-function should reflect the expected return for each state-action pair. Usually we start with a random policy and explore the state space by taking actions, updating the value function’s estimate of how good each state-action pair is. Sufficient exploration is essential for Q-learning methods so that returns are updated for a large number of state-action pairs. We use the epsilon-greedy exploration strategy Sutton and Barto (1998). The update above considers only the reward received at the next step, but a one-step reward does not reflect the actual future reward during the initial steps of a transition, so the policy will not learn the right camera handover for larger transition times. Hence, we incorporate n-step rewards Sutton and Barto (1998) to update the value function.
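The one-step update and epsilon-greedy exploration can be sketched with a dictionary-backed table. This is a generic tabular Q-learning sketch, not the authors' code: the state labels, the hyperparameter values and the use of a `defaultdict` are illustrative assumptions.

```python
import random
from collections import defaultdict


def epsilon_greedy(q_table, state, num_actions, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return max(range(num_actions), key=lambda a: q_table[(state, a)])


def q_update(q_table, state, action, reward, next_state, num_actions, alpha, gamma):
    """Apply the one-step Q-learning update and return the TD error."""
    best_next = max(q_table[(next_state, a)] for a in range(num_actions))
    td_error = reward + gamma * best_next - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return td_error


q_table = defaultdict(float)          # unseen (state, action) pairs default to 0.0
q_table[("s1", 2)] = 1.0              # hypothetical value for the next state
err = q_update(q_table, "s0", 0, reward=0.5, next_state="s1",
               num_actions=3, alpha=0.1, gamma=0.9)
print(err, q_table[("s0", 0)])
print(epsilon_greedy(q_table, "s1", 3, epsilon=0.0, rng=random.Random(0)))
```

The table needs one entry per visited state-action pair, which is the reason an exact method becomes impractical once the state space grows.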
Q-learning with n-step bootstrapping: The Q-learning update specified above updates the value function using the one-step reward. With n-step rewards, we update the value of a state after receiving rewards for $n$ time steps. For a given $n$, the Q-learning Bellman update (equation 6) becomes:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k+1} + \gamma^{n} \max_{a} Q(s_{t+n}, a) - Q(s_t, a_t) \right]$$
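The n-step target can be computed with a short backward recursion, sketched below in the standard form from Sutton and Barto. The function name and example values are hypothetical; `bootstrap_value` stands for the maximum Q-value at the state reached after n steps.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """n-step target: discounted rewards r_{t+1..t+n} plus gamma^n times the bootstrap value.

    `rewards` holds the n rewards observed after the action; its length defines n.
    """
    target = bootstrap_value
    # Working backwards applies one extra factor of gamma per step,
    # so the bootstrap value ends up multiplied by gamma ** n.
    for r in reversed(rewards):
        target = r + gamma * target
    return target


# Two-step example: r_{t+1} = 1.0, r_{t+2} = 0.0, bootstrap value 2.0, gamma 0.5.
print(n_step_target([1.0, 0.0], bootstrap_value=2.0, gamma=0.5))
```

With a single reward in the list this reduces to the one-step target, so the same function covers both cases.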
3.3 Camera Selection Decisions using Deep-Q Network
Earlier, in Sharma et al., we proposed an exact RL method (where the learned value function is stored in a table) for camera selection decisions, in which we discretized the state because of the very large state space. Using deep learning, however, we can learn features even from a continuous and larger state space. Neural networks have been used to map states to reward values in many related works Mnih et al. (2013); Silver et al. (2016); Hausknecht and Stone (2015). The parameters of the neural network can be updated using gradient descent based backpropagation algorithms Kingma and Ba (2014). For all implementations of the exact RL method, we used a server machine with 128 GB RAM and a 5 GB GPU (Nvidia Tesla K20m), running MATLAB R2016b. For the neural network implementation, we used a workstation with an 8 GB GPU (Nvidia GeForce GTX-1080) and 16 GB RAM, with PyTorch. The exact RL method worked only for the NLPR MCT datasets and ran out of memory (OM) on the DukeMTMC dataset.
Neural network model: Our neural network model is shown in Figure 3. We seek the network weights that help the learning agent obtain the maximum reward. For reward based learning, we use the deep Q-learning algorithm Mnih et al. (2013) to update the network weights based on the reward received from the environment. The first three hidden layers of the network have ReLU activations, and the last layer, which outputs the Q-values corresponding to the individual actions, has a linear activation. The output is an $(N+1)$-dimensional vector, where $N$ is the number of cameras in the camera network. Each output corresponds to an action $a$ and gives the Q-value for the input state $s$. The action corresponding to the maximum Q-value of the output layer is selected by the policy (equation 5). The selected camera frame is then passed to the presence block of the system to find the bounding box location of the target in the selected camera. The system then moves to the next state using the state transition function, and at the next state the policy again selects an action using the Q-values predicted by the neural network.
Training procedure: To train the network, we need target labels corresponding to each input. The deep Q-learning algorithm Mnih et al. (2013) uses the return at each step as the target label. The output of the network at state $s_t$ is $Q(s_t, a; \theta)$, where $\theta$ denotes the network parameters, and the corresponding target is the discounted future reward for $n$ steps. For simplicity, taking $n = 1$, the target for state $s_t$ after receiving a reward $r_{t+1}$ from the environment is $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$. We use the mean-square error to compute the loss at each time-step. Hence, when action $a_t$ is taken at state $s_t$, the loss (corresponding to action $a_t$) can be written as:

$$L(\theta) = \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta) - Q(s_t, a_t; \theta) \right]^{2}$$

The loss term for actions other than $a_t$ is zero (there are $N+1$ actions). In RL, the term in brackets is also known as the TD (temporal difference) error. In the loss for n-step bootstrapping, we replace the next (one-step) reward with the n-step return. The step-by-step training procedure is shown in Algorithm 1.
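The per-action loss described above, with zero contribution from actions that were not taken, can be sketched without any deep learning library. The Q-value lists stand in for network outputs and are hypothetical; in an actual implementation the same masking would be applied to the network's output tensor before backpropagation.

```python
def masked_td_loss(q_current, action, reward, q_next, gamma):
    """Squared TD error for the taken action, zero for every other action.

    `q_current` and `q_next` are lists of Q-values (one per action) that a
    network would output for the current and next state; values are illustrative.
    """
    target = reward + gamma * max(q_next)
    return [(target - q) ** 2 if a == action else 0.0
            for a, q in enumerate(q_current)]


losses = masked_td_loss(q_current=[0.2, 0.5, 0.1], action=1,
                        reward=1.0, q_next=[0.0, 0.5, 1.0], gamma=0.5)
print(losses)
```

Only the output unit of the chosen action receives a gradient, which matches the statement that the loss term for the remaining actions is zero.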
Note that the training procedure is the same irrespective of whether the target is inside a camera field-of-view or transitioning between cameras. To train the neural network, we initialize the state vector with the initial location of the target and set the history vector to all zeros. The selected action (camera index) is then used to verify the presence of the target (see Section 3.1), and the state is updated accordingly using the state transition function. At any particular time, a target can be occluded during SCT (Single Camera Tracking). To simulate such cases, we include short random jumps, so the elapsed time $\tau_t$ is incremented by the value of the random jump, or by 1 when the presence block cannot find the target. If the target is found, $\tau_t$ is set to 0. Each transition is stored in a replay memory until the end of the episode. When the episode ends, a small minibatch is sampled randomly from the replay memory for backpropagation using Adam Kingma and Ba (2014). The training process is repeated until convergence (when the reward received in each episode saturates). Instead of fixing the value of epsilon in epsilon-greedy exploration, we start with a high value and decrease it as training progresses. At later training epochs, the policy’s own decision is used, as the epsilon value reaches a minimum, as shown in Figure 4. The second plot of the figure shows that the reward saturates at later training epochs.
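A decaying exploration rate and random minibatch sampling from a replay memory could look like the sketch below. The exponential form, the start value, the floor and the decay constant are assumptions chosen for illustration; the schedule used in the paper may differ.

```python
import math
import random


def epsilon_at(epoch, eps_start=1.0, eps_min=0.1, decay=0.01):
    """Exponentially decayed epsilon with a floor (assumed schedule, not the paper's)."""
    return max(eps_min, eps_start * math.exp(-decay * epoch))


def sample_minibatch(replay_memory, batch_size, rng):
    """Sample transitions uniformly without replacement, capped at the memory size."""
    return rng.sample(replay_memory, min(batch_size, len(replay_memory)))


rng = random.Random(0)
replay_memory = [("state", a, 0.0, "next_state") for a in range(10)]  # dummy transitions
print(epsilon_at(0), epsilon_at(100), epsilon_at(10_000))
print(len(sample_minibatch(replay_memory, batch_size=4, rng=rng)))
```

Early in training almost every action is random, and once the floor is reached the learned policy decides most actions, which corresponds to the behaviour described for later epochs.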
4 Evaluation and Results
In this section, we present details of the datasets used, the evaluation metric and the experimental results of the proposed architecture on the used datasets.
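| | NLPR_MCT set-1 | NLPR_MCT set-2 | NLPR_MCT set-3 | NLPR_MCT set-4 | DukeMTMC |
|---|---|---|---|---|---|
| Duration | 20 min | 20 min | 3.5 min | 24 min | 1 hr 25 min |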
4.1 Dataset and Evaluation Metric
Dataset: We use the NLPR_MCT dataset Chen et al. (2014) and the DukeMTMC dataset Ristani et al. (2016) to evaluate the proposed architecture for camera selection in a multi-camera network for single target tracking. The NLPR_MCT dataset consists of four sub-datasets, each having 3 to 5 cameras with a resolution of $320 \times 240$. Details of the dataset are given in Table 1. The dataset comprises cameras installed in both indoor and outdoor environments, with significant illumination variation across cameras. Set-1 and set-2 have the same environment and network topology. Set-3 was captured in an office building, and set-4 was captured in a parking area. We learn a separate policy for set-1, set-3 and set-4. Since the camera network in set-2 is the same as in set-1, we use the same policy for both subsets. The DukeMTMC dataset consists of 8 cameras deployed on the Duke University campus. To date, DukeMTMC is the benchmark dataset for multi-target multi-camera (MTMC) tracking. The details of the dataset are given in Table 1. It is difficult to identify the correct topology of a camera network with both overlapping and non-overlapping FOVs, as is the case for the DukeMTMC dataset. The top view of the camera topology of the DukeMTMC dataset is shown in Figure 5.
The training and testing sets are constructed from each dataset by randomly selecting half of the people for training and the remaining half for testing. However, the evaluation benchmark of the DukeMTMC dataset does not provide a platform for measuring camera selection performance. Hence, to train the policy and evaluate its performance, we divide the available training set into two parts by splitting the person identities into two sets. Therefore, for camera selection decisions, we report performance on a sub-part of the actual training set. The two sets contain mutually exclusive person identities. We expect the policy to implicitly learn the network topology, and as long as the network is static, the policy should work for all new, unseen target individuals. CCTV network topologies in the real world are seldom modified.
We define evaluation metrics over the entire sequence of frames generated by the camera network. The sequence is indexed by time-steps corresponding to the time of frame capture for the cameras. Since the cameras operate at the same frame rate for a given subset, we can ignore any synchronization errors without any significant impact on the camera selection and tracking performance.
To evaluate the camera selection performance, we report camera selection accuracy, precision and recall computed over the entire sequence of each subset. To account for instances when the target is not visible in any of the cameras, we introduce a dummy null camera and denote it by . Given a target, let the ground truth sequence of cameras in which it appears be contained in the vector and the sequence of cameras polled by the policy be in vector with the element indicated using a subscript. The accuracy (A), precision (P) and recall (R) are defined as follows for a single target
The final value for each of these metrics is reported as an average computed over all targets. Along with , we also report number of frames polled () during an inter-camera transition of the target. It is defined as
is an important measure because with a large number of frames polled, both the chance of false alarms during a re-identification query and the computational complexity increase substantially. We perform the evaluation in two parts, one for inter-camera tracking (ICT) alone and another for ICT along with single-camera tracking (SCT). For the ICT-alone case, we do not consider the frames in which the target was seen in a single camera field-of-view. We also evaluate the overall performance of target tracking in a camera network when our camera selection policy is used for ICT. We use the standard Multi-Camera Tracking Accuracy (MCTA), which combines all components involved in multi-camera tracking into a single scalar value, i.e., the F1-score for detection, the number of target handovers for single camera tracking, and the number of handovers in inter-camera tracking. The metric is defined as
where is the precision, is recall for target IDs. The number of target-ID mismatches at time is given by and is the number of true positives in a single camera at time . The superscripts and denote the single camera tracking (SCT) or cross-camera tracking (ICT) scenario. Readers are requested to see Ristani et al. (2016); Chen et al. (2014) for details about the MCTA metric.
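The metrics in this subsection can be illustrated with a short sketch. The exact per-target definitions are not reproduced here, so the code makes explicit assumptions: accuracy counts time-steps where the polled camera equals the ground truth (including the null camera), precision and recall count only matches on real cameras, and the frame count tallies real-camera polls made while the target is in no field-of-view. The MCTA function follows the commonly stated form, an F1-score multiplied by one mismatch term for SCT and one for ICT. The names `NULL_CAM`, `camera_selection_metrics`, `frames_polled_in_transitions` and `mcta` are hypothetical.

```python
NULL_CAM = -1  # hypothetical index for the dummy null camera


def camera_selection_metrics(truth, polled, null_cam=NULL_CAM):
    """Accuracy, precision and recall for one target (assumed definitions)."""
    if len(truth) != len(polled) or len(truth) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(zip(truth, polled))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    # Matches on a real (non-null) camera.
    hits = sum(1 for g, p in pairs if g == p and g != null_cam)
    polled_real = sum(1 for _, p in pairs if p != null_cam)
    truth_real = sum(1 for g, _ in pairs if g != null_cam)
    precision = hits / polled_real if polled_real else 0.0
    recall = hits / truth_real if truth_real else 0.0
    return accuracy, precision, recall


def frames_polled_in_transitions(truth, polled, null_cam=NULL_CAM):
    """Real-camera polls made while the target is between cameras (assumed)."""
    return sum(1 for g, p in zip(truth, polled)
               if g == null_cam and p != null_cam)


def mcta(precision, recall, mme_s, tp_s, mme_c, tp_c):
    """MCTA = F1 * (1 - mme_s / tp_s) * (1 - mme_c / tp_c).

    mme_* are ID-mismatch counts and tp_* true-positive counts, summed over
    time, for the single-camera (s) and cross-camera (c) cases.
    Assumes tp_s and tp_c are non-zero.
    """
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 * (1 - mme_s / tp_s) * (1 - mme_c / tp_c)


# Toy sequences: the target is in camera 0, then in transit, then in camera 1.
truth = [0, 0, NULL_CAM, NULL_CAM, 1, 1]
polled = [0, 0, NULL_CAM, 1, 1, 2]
acc, prec, rec = camera_selection_metrics(truth, polled)
n_frames = frames_polled_in_transitions(truth, polled)
```

In a full evaluation these per-target values would be averaged over all targets, as described above.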
We have proposed a single-target tracking approach that tracks a given target across multiple cameras, whereas the related approaches on the benchmark datasets are multi-target multi-camera approaches. To make a fair comparison, we created a multi-target version of our algorithm: to compute multi-target tracking results, we run a separate pipeline of our approach for each target. In our approach, the tracking performance of one target does not depend on another, so the approach extends easily to the multi-target tracking problem.
4.2 Camera Selection Performance of the Learned Policy
[Table residue; only the panel headers survive: NLPR DB-1, NLPR DB-2, NLPR DB-3, NLPR DB-4 (appearing twice), and ICT alone, SCT + ICT.]
In this subsection, we describe the performance of the learned policy for camera selection decisions. There are two cases for tracking a target in a camera network. The first is ICT (Inter-Camera Tracking), where the task is to identify the correct camera handovers that the target performs. The second is SCT+ICT (Single Camera Tracking + ICT), where the task is to identify the correct cameras while the target is moving within a single camera field-of-view as well as during the camera handovers.
To perform the experiment, we initialize the state of the target with its initial location and a history vector of all zeros. At each time-step, the learned policy selects the index of a camera where the target is likely to be present. The selected camera is then queried to identify whether the target is present in its field-of-view. If the target is present, its spatial location (bounding box) in the selected camera frame is obtained. For surveillance, this task is usually performed by human agents who continuously watch the camera feed. Alternatively, it can be performed by re-identification based methods that automatically identify the presence of the target. Such methods use visual template matching to re-identify an object in different camera feeds given the visual template of the target. To evaluate the camera selection decisions, we use the correct presence of the target from the ground truth data. We make this simplifying choice in this experiment to eliminate the uncertainty introduced by the re-identification performance. The policy continues polling cameras until the target exits the camera network or the sequence terminates. The complete procedure for target tracking using the learned policy is shown in Algorithm 2. For infinite horizon problems, work on time limits in reinforcement learning Pardo et al. (2018) has shown on various applications that randomizing the state vector (even during testing) after a time period provides better performance, because after a large number of time steps the agent may end up in a bad state. Randomizing the state vector lets the policy select actions from another state and eventually results in better performance. Similarly, in our case, when reaches a predefined maximum value, we select a random camera index to update the state vector and let the policy continue from that point to make camera selection decisions.
For example, for NLPR DB-3, without using time limits, we obtained a camera selection accuracy of whereas by setting a time limit of time-steps we obtained an accuracy of . We observed similar behavior on the other datasets and used a different time limit for each dataset. All further results are reported with time limits of for NLPR DB-1 and 2, for NLPR DB-3, for NLPR DB-4, and for the DukeMTMC dataset.
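The polling procedure with a time limit can be sketched as below. This is an illustration under stated assumptions rather than the procedure of Algorithm 2: the counter that triggers the random camera is taken to be the number of consecutive time-steps in which the target was not found, and `policy`, `presence` and `update_state` are hypothetical callables standing in for the learned policy, the presence block and the state-transition function.

```python
import random


def poll_cameras(policy, presence, update_state, initial_state,
                 num_cameras, num_steps, max_unseen, seed=0):
    """Poll one camera per time-step.

    After `max_unseen` consecutive misses (assumed form of the time limit),
    a random camera is polled instead of the policy's choice, and the
    counter restarts.
    """
    rng = random.Random(seed)
    state = initial_state
    unseen = 0      # hypothetical counter: steps since the target was found
    polled = []
    for t in range(num_steps):
        if unseen >= max_unseen:
            camera = rng.randrange(num_cameras)  # time limit reached
            unseen = 0
        else:
            camera = policy(state)
        found = presence(camera, t)   # ground-truth presence in this sketch
        unseen = 0 if found else unseen + 1
        state = update_state(state, camera, found)
        polled.append(camera)
    return polled


# Toy usage: the state is the last polled camera and the policy repeats it.
polled = poll_cameras(
    policy=lambda s: s,
    presence=lambda cam, t: cam == 0,
    update_state=lambda s, cam, found: cam,
    initial_state=0, num_cameras=3, num_steps=5, max_unseen=2)
```

In the toy run the target stays in camera 0, so the time limit never triggers; when the target is lost, the random poll moves the state away from a choice the policy would otherwise keep repeating.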
Metrics like accuracy, precision and recall capture overall performance and allow comparative analysis, as shown in Tables 2, 3 and 4, which report the camera selection performance on each dataset. Table 2 shows accuracy (A), precision (P), and recall (R) on the NLPR MCT dataset for the ICT case only. Table 3 shows A, P and R on the NLPR MCT dataset for SCT and ICT together, and Table 4 shows the camera selection performance on the DukeMTMC dataset for both cases, ICT alone and SCT and ICT together.
In addition to the proposed policy’s performance, we compare the camera selection performance of the policy with three baseline approaches used in related works. The Exhaustive approach is a brute-force approach which polls each camera at all time steps until the target is found in one of the cameras. The table shows that it has accuracy but poor precision. The Neighbor approach assumes that the camera network topology is known and searches for the target by polling only the neighboring cameras. The approaches proposed in Zhang et al. (2015); Chen and Bhanu (2017) search for the target in the adjacent cameras and hence process the same number of frames as the neighbor search approach. Along with these two approaches, we also compare camera selection performance with the method proposed in Lee et al. (2018). This method first estimates the distribution of the camera transitions, assuming that multiple targets generally follow the same paths, and then samples a transition time to reduce the number of frames to be processed. They estimate a Gaussian distribution and hence we name this approach Gaussian. After the transition time, they start searching for the target in cameras using a camera link model, which links the cameras that have a path for transition between them. We repeated their experiment by estimating a Gaussian distribution from the training set and sampling a transition time for each person in the test set. The camera link model is used as the set of neighboring cameras. The metrics in each table are reported for two cases. For ICT, the metrics are computed using equation (9), but only over the time instances when the target is transitioning from one camera to another. For SCT + ICT, the entire sequences are used. As expected, the proposed policy has better precision than the competing approaches. The Gaussian method is excluded in the SCT + ICT case, as the distribution is only defined for the ICT case. While the A, P and R measures indicate the overall performance of camera selection, a confusion matrix shows the pairwise misclassification in camera selection. Based on the cameras polled by our policy at various time steps, we report a confusion matrix for the DukeMTMC dataset in Table 5. Our previous implementation in Sharma et al. using Q-learning runs out of memory on this dataset due to the very large state space. The confusion matrix is therefore computed using the deep learning based approximation of the Q-learning algorithm.
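To make the difference between the baselines concrete, the following sketch counts how many frames each one would query during a single inter-camera transition. It is a simplified model written for illustration, not a reimplementation of the cited methods: a transition is reduced to its length in time-steps, the topology is a hypothetical adjacency dictionary, and the Gaussian baseline simply waits for a sampled transition time before searching the neighboring cameras.

```python
import random
import statistics


def frames_exhaustive(transition_len, num_cameras):
    """Exhaustive: every camera is polled at every time-step."""
    return transition_len * num_cameras


def frames_neighbor(transition_len, neighbors, last_cam):
    """Neighbor: only cameras adjacent to the last-seen camera are polled."""
    return transition_len * len(neighbors[last_cam])


def fit_gaussian(train_transition_times):
    """Estimate mean and sample standard deviation of transition times."""
    return (statistics.mean(train_transition_times),
            statistics.stdev(train_transition_times))


def frames_gaussian(transition_len, neighbors, last_cam, mu, sigma, rng):
    """Gaussian: wait for a sampled transition time, then search neighbors.

    Simplification: if the sampled wait exceeds the true transition length,
    no frames are counted here (in practice the handover would be missed).
    """
    wait = max(0, round(rng.gauss(mu, sigma)))
    searched_steps = max(0, transition_len - wait)
    return searched_steps * len(neighbors[last_cam])


# Toy topology: camera 0 is linked to cameras 1 and 2.
neighbors = {0: [1, 2], 1: [0], 2: [0]}
mu, sigma = fit_gaussian([8, 10, 12])          # hypothetical training times
n_exhaustive = frames_exhaustive(10, num_cameras=4)
n_neighbor = frames_neighbor(10, neighbors, last_cam=0)
n_gaussian = frames_gaussian(15, neighbors, 0, mu, sigma, random.Random(0))
```

Under the same model, a policy that polls one camera per time-step queries at most `transition_len` frames per transition.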
|Inter-camera tracking (ICT)|
|Approach||NLPR DB-1||NLPR DB-2||NLPR DB-3||NLPR DB-4|
|Y. Cai (2014)||0.9152||0.9132||0.5163||0.7152|
|Chen et al. (2014)||0.7425||0.6544||0.7369||0.3945|
|Chen et al. (2014)||0.6617||0.5907||0.7105||0.5703|
|Lee et al. (2018)||0.9610||0.9264||0.7889||0.7578|
|W. Chen et al. (2017)||0.835||0.703||0.742||0.385|
|SCT + ICT|
|Approach||NLPR DB-1||NLPR DB-2||NLPR DB-3||NLPR DB-4|
|Y. Cai (2014)||0.8831||0.8397||0.2427||0.4357|
|Chen et al. (2014)||0.7477||0.6561||0.2028||0.2650|
|Chen et al. (2014)||0.6903||0.6238||0.0848||0.1830|
|W. Chen et al. (2017)||0.8525||0.7370||0.4724||0.3778|
Figure 6 shows the sequence of cameras polled by the policy compared to the ground truth. The horizontal axis is time and the vertical axis shows the camera schedules in the ground truth () and as polled by the policy (). The dark colors are camera schedules (mapped with a colormap) and white shows the length of the transition. The figure reflects the performance of the deep RL policy in making camera selection decisions. One important aspect of target tracking in multiple cameras is computational time. For offline tracking, many related methods match the target template across neighboring cameras Zhang et al. (2015); Chen and Bhanu (2017) or across all cameras Y. Cai (2014); Ristani and Tomasi (2018). However, such approaches require a large number of frames to be processed for template matching. With the proposed policy, template matching is limited to a single camera per time-step per person. In Figure 7, we compare the number of frames to be processed by these approaches. The figure shows the boxplot of -metric scores computed over all targets using the deep RL policy and various baseline approaches on the DukeMTMC dataset.
4.3 Impact of Camera Selection Decisions on Target Tracking in Camera Networks
Now we show the effectiveness of the camera selection decisions in enabling target tracking in a camera network. To complete the tracking pipeline, we simulate the presence block of our proposed architecture. To do so, we generate the errors of a typical re-identification pipeline by wrongly identifying the target as one of the other available objects. We compare the performance with state-of-the-art tracking methods.
To perform this experiment, we initialize the state vector with the initial location of the target and a history vector of all zeros. The learned policy then polls a camera frame, which is checked for the presence of the target using the presence block (refer to Section 3.1). Unlike the previous experiment, we simulate a real re-identification pipeline for the presence block by adding errors to the presence decision. For example, to simulate an error in re-identification, with probability we take another target’s bounding box; otherwise we use the correct bounding box of the target. Once the presence is identified, the state vector is updated using the state-transition function. The updated state vector is then used by the policy to poll another camera, and the process repeats until the end of the target’s trajectory or the end of the sequence. The predicted trajectory is the sequence of (c,b) pairs, i.e., camera and bounding box values, and is used to compute the MCTA metric scores. We compare the performance of the policy under simulated re-identification errors with various state-of-the-art methods on the NLPR MCT dataset. The MCTA scores are shown in Tables 6 and 7 for the ICT-alone case and the SCT+ICT case respectively. Table 6 shows MCTA values for inter-camera tracking (ICT) only, where the single-camera trajectory of the target is taken from the ground truth. Table 7 shows the overall performance of the various methods, i.e., during both single-camera tracking (SCT) and inter-camera tracking (ICT). The same experiments are reported by the related methods on the NLPR dataset. In comparison to other methods, our approach performs better in most cases at error in re-identification. For higher errors, our method (especially deep RL) starts performing worse than the others.
Also, the related approaches are multi-target multi-camera (MTMC) tracking approaches, whereas ours is a single-target multi-camera tracking approach. Therefore, to make a fair comparison, we have extended our approach to MTMC as explained in Section 4.1. Similarly, results for the DukeMTMC dataset are shown in Table 8.
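The simulated presence block used in this experiment can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `simulated_presence`, the tuple representation of bounding boxes, and the choice to return `None` when the target is absent from the polled frame are assumptions made for the example.

```python
import random


def simulated_presence(target_box, other_boxes, error_prob, rng):
    """Return the bounding box reported by a simulated re-identification step.

    With probability `error_prob` the box of another object in the frame is
    returned instead of the target's own box. If the target is not in the
    frame (`target_box` is None), None is returned (assumption).
    """
    if target_box is None:
        return None
    if other_boxes and rng.random() < error_prob:
        return rng.choice(other_boxes)   # simulated identity mismatch
    return target_box


# Toy usage with hypothetical (x, y, width, height) boxes.
rng = random.Random(0)
target = (10, 20, 50, 120)
others = [(200, 30, 45, 110), (320, 25, 48, 115)]
box_clean = simulated_presence(target, others, error_prob=0.0, rng=rng)
box_noisy = simulated_presence(target, others, error_prob=1.0, rng=rng)
```

Sweeping `error_prob` over a range of values is one way to reproduce the kind of sensitivity analysis described above, where tracking scores are compared at different re-identification error levels.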
|Approach||ICT alone||SCT + ICT|
We have proposed an approach for intelligent camera selection for dealing with target handovers in multi-camera target tracking. Our initial work Sharma et al. used exact RL methods, and here we extend it to approximate methods using deep RL in order to deal with larger camera networks. The deep RL implementation makes better camera selection decisions and can be used with larger camera networks. However, the proposed deep RL approach has a few limitations. First, its performance is sensitive to errors in Re-Id. This calls for investigating how to train the deep learning based policy with a real re-identification module, so that the policy can learn to handle errors during tracking. Second, large transition times result in a policy with heavily imbalanced action distributions, e.g., becoming the most frequent action. Hence, effort should be directed at methods for handling imbalanced action spaces. Third, the indefinite transition time of a target makes exploration difficult when deciding whether the target has left the camera network or will appear again. There is scope for improvement in identifying such cases.
We highlighted that re-identification queries in target tracking across camera networks can become a performance and computational bottleneck for practical systems. We proposed a solution that intelligently makes these queries by selecting cameras that are more likely to contain the target at a given time. We proposed a reinforcement learning based approach that learns a policy for camera selection based on previous actions and target location. We empirically show on two benchmark datasets that the proposed approach substantially reduces the number of frames queried, with negligible loss of tracking performance.
We acknowledge Infosys Center for Artificial Intelligence (CAI) at IIIT-Delhi for its partial support for conducting this research work.
- A markovian decision process. Journal of Mathematics and Mechanics 6 (5), pp. 679–684.
- On pairwise costs for network flow multi-object tracking. pp. 5537–5545.
- Adaptive learning for target tracking and true linking discovering across multiple non-overlapping cameras. IEEE Transactions on Multimedia 13 (4), pp. 625–638.
- A novel solution for multi-camera object tracking. In 2014 IEEE International Conference on Image Processing (ICIP), pp. 2329–2333.
- Multi-Camera Object Tracking (MCT) Challenge. Note: http://http://mct.idealtest.org/Datasets.html/
- Multitarget tracking in nonoverlapping cameras using a reference set. IEEE Sensors Journal 15 (5), pp. 2692–2704.
- Integrating social grouping for multitarget tracking across cameras in a crf model. IEEE Transactions on Circuits and Systems for Video Technology 27 (11), pp. 2382–2394.
- Multitarget data association with higher-order motion models. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1744–1751.
- A framework for inter-camera association of multi-target trajectories by invariant target models. In Computer Vision - ACCV 2012 Workshops, J. Park and J. Kim (Eds.), Berlin, Heidelberg, pp. 372–386.
- Player localization using multiple static cameras for sports visualization. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 731–738.
- Deep recurrent q-learning for partially observable mdps. CoRR abs/1507.06527.
- Object identification in a bayesian context. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), pp. 1276–1283.
- Tracking as online decision-making: learning a policy from streaming videos with reinforcement learning. CoRR abs/1707.04991.
- Modeling inter-camera space-time and appearance relationships for tracking across non-overlapping views. Comput. Vis. Image Underst. 109 (2), pp. 146–162.
- Anytime recognition of objects and scenes. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 572–579.
- Consistent labeling of tracked objects in multiple cameras with overlapping fields of view. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (10), pp. 1355–1360.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Inter-camera association of multi-target tracks by on-line learned appearance affinity models. In Proceedings of the 11th European Conference on Computer Vision Part I, ECCV2010, Berlin, Heidelberg, pp. 383–396.
- Survey on deep learning techniques for person re-identification task. CoRR abs/1807.05284.
- Online-learning-based human tracking across non-overlapping cameras. IEEE Transactions on Circuits and Systems for Video Technology 28 (10), pp. 2870–2883.
- End-to-end active object tracking via reinforcement learning. CoRR abs/1705.10561.
- Bridging the gaps between cameras. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), Vol. 2, pp. II–205–II–210.
- Vehicle tracking across nonoverlapping cameras using joint kinematic and appearance features. In CVPR 2011, pp. 3465–3472.
- Reinforcement learning for visual object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2894–2902.
- Playing atari with deep reinforcement learning. CoRR abs/1312.5602.
- Q-learning of sequential attention for visual object recognition from informative local descriptors. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, New York, NY, USA, pp. 649–656.
- Time limits in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 4045–4054.
- Tracking many objects with many sensors. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’99, San Francisco, CA, USA, pp. 1160–1167.
- Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking.
- Tracking multiple people online and in real time. In Computer Vision – ACCV 2014, D. Cremers, I. Reid, H. Saito, and M. Yang (Eds.), Cham, pp. 444–459.
- Features for multi-target multi-camera tracking and re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Reinforcement learning based querying in camera networks for efficient target tracking. In Proceedings of International Conference on Automated Planning and Scheduling (ICAPS), 2019.
- Mastering the game of go with deep neural networks and tree search. Nature 529, pp. 484–503.
- Multiple view discriminative appearance modeling with imcmc for distributed tracking. In 2013 Seventh International Conference on Distributed Smart Cameras (ICDSC), pp. 1–7.
- Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA.
- An equalized global graph model-based approach for multicamera object tracking. IEEE Transactions on Circuits and Systems for Video Technology 27 (11), pp. 2367–2381.
- Learning to track: online multi-object tracking by decision making. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4705–4713.
- Exploring context information for inter-camera multiple target tracking. In IEEE Winter Conference on Applications of Computer Vision, pp. 761–768.
- Action-decision networks for visual tracking with deep reinforcement learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1349–1358.
- A camera network tracking (camnet) dataset and performance baseline. In 2015 IEEE Winter Conference on Applications of Computer Vision, pp. 365–372.
- Tracking multiple interacting targets in a camera network. Computer Vision and Image Understanding 134, pp. 64–73. Note: Image Understanding for Real-world Distributed Video Networks.
- A discriminatively learned cnn embedding for person reidentification. ACM Trans. Multimedia Comput. Commun. Appl. 14 (1), pp. 13:1–13:20.