In surveillance applications, the ability to track a particular person or object from security cameras over long periods of time is highly desirable. Usually, security personnel label a suspicious person in the video, and the tracking system then follows this person in the videos for minutes or hours without further human input.
One situation that visual tracking systems for surveillance applications must handle frequently is a sudden-appearance-change of the tracked person or object. For example, a suspicious person may take off his/her jacket, pull up his/her hood, or abandon some luggage in order to fool the surveillance system. Such sudden-appearance-changes are suspicious activities and usually should be reported to the security personnel in real time.
Unfortunately, it is usually difficult for computer vision algorithms to distinguish a sudden-appearance-change from an occlusion. Surveillance scenes are often crowded with many people, so occlusions may happen very frequently. Without the ability to distinguish sudden-appearance-changes from occlusions, a visual tracking system may generate a large number of false alarms.
Detecting occlusions correctly is also important for the reliability of visual tracking algorithms. It is well known that visual tracking faces a so-called “high-adaptability-to-drifting-resistance trade-off”: how much should the models of the tracked person or object be updated on the fly? If the models are updated only slowly, then when the appearance of the tracked person changes rapidly, there is a high risk of losing track. If the models are always updated rapidly, then during an occlusion the models may be tuned to the occluders. Such wrongly tuned models may cause drifting, that is, the tracking algorithm may start to track the occluders instead. Fig. 2 in Section IV shows an example of such drifting.
In this paper, we propose a new method for detecting occlusions and appearance changes, used within a real-time visual tracking system with multiple surveillance cameras. Initially, the security personnel provide one bounding box around the suspicious person for each camera's video frame sequence. The visual tracking system then tracks the whereabouts of the suspicious person in real time. If a sudden-appearance-change of the suspicious person is detected, the system raises an alarm immediately.
Our method uses both generative and discriminative models of the video frame streams. For each camera, one discriminative model is maintained to discriminate image patches that contain the tracked person from image patches that do not. Similarly, one generative model is maintained for each camera. In this paper, we use as the discriminative model a recently proposed classifier based on compressive sensing and naive Bayes, from the real-time compressive tracking work of Zhang et al. We use linear subspace models as the generative models; that is, we assume that the image patches containing the tracked person in several adjacent video frames all lie within a certain affine subspace.
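The discriminative model can be illustrated with a minimal sketch in the spirit of compressive tracking: a patch vector is projected by a very sparse random matrix, each compressed feature receives a Gaussian class-conditional density, and the densities are combined under the naive Bayes assumption. This is a simplification under stated assumptions, not the published tracker: the sparsity parameter `s`, the equal class priors, and the helper names (`sparse_projection`, `fit_gaussians`, `nb_score`) are ours, and, as we understand it, the original work builds its measurements from rectangle filters rather than projecting raw pixels directly.

```python
import math
import random


def sparse_projection(n_features, dim, s=3, seed=0):
    """Very sparse random matrix: each entry is +sqrt(s) or -sqrt(s) with
    probability 1/(2s) each, and 0 otherwise."""
    rng = random.Random(seed)
    root = math.sqrt(s)
    matrix = []
    for _ in range(n_features):
        row = []
        for _ in range(dim):
            u = rng.random()
            row.append(root if u < 0.5 / s else (-root if u < 1.0 / s else 0.0))
        matrix.append(row)
    return matrix


def compress(matrix, patch):
    """Compressed feature vector v = R x for a stacked patch vector x."""
    return [sum(r * x for r, x in zip(row, patch)) for row in matrix]


def fit_gaussians(samples, floor=1e-6):
    """Per-feature (mean, std) estimated from a list of compressed feature vectors."""
    stats = []
    for column in zip(*samples):
        mu = sum(column) / len(column)
        var = sum((c - mu) ** 2 for c in column) / len(column)
        stats.append((mu, math.sqrt(var) + floor))  # floor avoids zero std
    return stats


def log_gauss(v, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)


def nb_score(v, pos, neg):
    """Naive Bayes log-likelihood ratio with equal priors; > 0 favours 'contains person'."""
    return sum(log_gauss(vi, mp, sp) - log_gauss(vi, mn, sn)
               for vi, (mp, sp), (mn, sn) in zip(v, pos, neg))
```

In use, `fit_gaussians` would be applied to compressed features of patches sampled near (positive) and away from (negative) the current location, and the candidate patch with the largest `nb_score` would be taken as the new target location.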
A central component of our method is a hidden Markov model for the prediction errors of the generative models. Whenever a new video frame is received, the new image patch containing the tracked person is predicted from the previous such image patches. The hidden Markov model thus has a visible part and a hidden part. The visible part consists of the observed prediction errors $e_{i,t}$. The hidden part consists of the random variables $O_{i,t}$ and $A_t$. The binary random variable $O_{i,t}$ indicates whether an occlusion has occurred at the $i$-th camera at time $t$, and the binary random variable $A_t$ indicates whether a sudden-appearance-change of the tracked person has occurred at time $t$.
Note that this hidden Markov model is the centerpiece of the proposed occlusion and sudden-appearance-change detection method. It is not tied to the particular discriminative and generative models above and can also be used with other ones.
There is a large literature on visual tracking algorithms, and we do not attempt a thorough survey here. Approaches that use multiple cameras for visual tracking have become attractive in recent years, owing to the availability of large numbers of low-cost commodity cameras, and several previous works discuss visual tracking with multiple cameras. In some approaches, the responsibility for tracking may be passed from one camera to another. In others, the location of the tracked person or object is estimated statistically and independently from each camera, and the independent estimates are then fused into a joint location estimate. In yet another approach, video frames from multiple cameras are projected onto a reference frame using homography transforms, so that the signals corresponding to the tracked person or object add constructively.
The rest of the paper is organized as follows. Section II describes the proposed visual tracking system and the hidden Markov model. Section III presents the sequential Bayesian estimation method for the hidden Markov model. Experimental results are provided in Section IV. Finally, Section V presents some concluding remarks.
II Visual Tracking System and Hidden Markov Model
A block diagram of the proposed visual tracking system is shown in Fig. 1. The system uses multiple cameras (only two are shown in the figure). It starts tracking a suspicious person after the security personnel provide one bounding box around the person for each camera. For each camera, there is a real-time compressive tracking sub-system as in the work of Zhang et al. (shown as classifiers in Fig. 1). These sub-systems work almost independently, except that their learning parameters are controlled by a central controller.
The learning parameters control how much each real-time compressive tracking sub-system updates its discriminative model after receiving each new video frame. As the tracked person changes pose, orientation, etc., the appearance of the person may change smoothly, so each sub-system may update its discriminative model with every newly observed video frame. The learning parameter $\lambda_i$ of the $i$-th sub-system is a real number between $0$ and $1$. If $\lambda_i = 1$, the sub-system does not update its model. If $\lambda_i < 1$, the sub-system updates its discriminative model using a weighted combination of the past observed video frames and the newly observed video frame.
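A minimal sketch of such an update, assuming the convention of the real-time compressive tracker in which each compressed feature keeps a Gaussian mean and standard deviation per class and the learning parameter weights the old statistics. The function name `update_gaussian` is hypothetical; the variance term is the exact variance of a two-component mixture weighted by $\lambda_i$ and $1-\lambda_i$.

```python
import math


def update_gaussian(mu_old, sigma_old, mu_new, sigma_new, lam):
    """Blend old and new per-feature Gaussian statistics.

    lam = 1 keeps the old model unchanged (no update); smaller lam adapts faster.
    The returned std is that of the mixture lam * N(mu_old) + (1 - lam) * N(mu_new).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("learning parameter must lie in [0, 1]")
    mu = lam * mu_old + (1 - lam) * mu_new
    var = (lam * sigma_old ** 2 + (1 - lam) * sigma_new ** 2
           + lam * (1 - lam) * (mu_old - mu_new) ** 2)
    return mu, math.sqrt(var)
```

With a fixed `lam < 1`, statistics computed from an occluder leak into the model on every frame, which is the drifting mechanism described in the introduction; switching to `lam = 1` while an occlusion is likely freezes the model instead.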
Our method adjusts the learning parameters based on the estimated probabilities of sudden-appearance-changes and of occlusions. Assume that $N$ security cameras are used, and that at each time $t$ one video frame $F_{i,t}$ is received from each camera $i$, where $i = 1, \dots, N$. Each tracking sub-system then finds one image patch containing the tracked person and represents it as a column vector $\mathbf{x}_{i,t}$, obtained by stacking the pixels of the patch.
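For each camera, we use one generative model (shown as predictors in Fig. 1). Each predictor maintains an estimate of an affine subspace $\mathcal{S}_{i,t}$ such that the previously observed patch vectors $\mathbf{x}_{i,\tau}$, $\tau \le t$, roughly lie within it. We then define the prediction error $e_{i,t}$ as the distance from the newly observed patch vector $\mathbf{x}_{i,t}$ to the affine subspace $\mathcal{S}_{i,t-1}$ estimated from the previous frames. Efficient algorithms exist for computing such affine subspaces, for example the sequential Karhunen-Loeve basis extraction of Levy and Lindenbaum.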
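The prediction error can be sketched as follows. As a simplifying assumption, the sketch uses the exact affine hull of a small window of recent patch vectors, built with Gram-Schmidt orthogonalization, instead of the sequential Karhunen-Loeve method cited above, which keeps only a few leading directions so that the patches lie only roughly in the subspace. The names `affine_basis` and `prediction_error` are hypothetical.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def affine_basis(patches, tol=1e-9):
    """Return (origin, orthonormal basis) of the affine hull of `patches`,
    using Gram-Schmidt on the difference vectors patches[k] - patches[0]."""
    origin = list(patches[0])
    basis = []
    for p in patches[1:]:
        v = _sub(p, origin)
        for b in basis:  # remove components along directions already found
            c = _dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(_dot(v, v))
        if norm > tol:  # skip patches that add no new direction
            basis.append([vi / norm for vi in v])
    return origin, basis


def prediction_error(x, origin, basis):
    """Euclidean distance from patch vector x to the affine subspace."""
    r = _sub(x, origin)
    for b in basis:
        c = _dot(r, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return math.sqrt(_dot(r, r))
```

In a tracker, `patches` would be a short sliding window of recent $\mathbf{x}_{i,\tau}$. With too many patches the exact hull spans almost every direction and the error stops being informative, which is one reason a low-rank basis is preferable in practice.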
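We then estimate the probabilities of sudden-appearance-changes and of occlusions from the prediction errors $e_{i,t}$, based on the following hidden Markov model. We assume that the probability density function of the prediction error $e_{i,t}$ is

$$
p(e_{i,t} \mid A_t, O_{i,t}) =
\begin{cases}
\mu \exp(-\mu\, e_{i,t}), \quad e_{i,t} \ge 0, & \text{if } A_t = 0 \text{ and } O_{i,t} = 0,\\
1/e_{\max}, \quad 0 \le e_{i,t} \le e_{\max}, & \text{otherwise},
\end{cases}
$$

where $A_t = 1$ indicates that a sudden-appearance-change has occurred at time $t$, $O_{i,t} = 1$ indicates that an occlusion has occurred at the $i$-th camera at time $t$, $\mu > 0$ is the rate of the exponential distribution, and $e_{\max}$ is the largest possible prediction error. In other words, if neither a sudden-appearance-change nor an occlusion has occurred, the prediction error $e_{i,t}$ is exponentially distributed; otherwise, it is uniformly distributed between $0$ and $e_{\max}$. We further assume that $e_{i,t}$ and $e_{j,s}$ are statistically independent conditioned on $A_t$, $A_s$, $O_{i,t}$, and $O_{j,s}$ if $(i,t) \neq (j,s)$. Finally, we assume that each random process $\{O_{i,t}\}_t$ is Markovian, and that the random process $\{A_t\}_t$ is also Markovian.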
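The observation density and the per-variable Markov chains can be written down directly. In this sketch, the rate `mu`, the bound `e_max`, and the switching probabilities `p_on` and `p_off` are hypothetical parameters standing in for quantities the model leaves unspecified, and each of $\{A_t\}$ and $\{O_{i,t}\}$ is modeled as a two-state chain.

```python
import math


def emission_density(e, appearance_change, occluded, mu, e_max):
    """p(e | A, O): exponential with rate mu if A = 0 and O = 0,
    otherwise uniform on [0, e_max]."""
    if e < 0:
        return 0.0
    if not appearance_change and not occluded:
        return mu * math.exp(-mu * e)
    return 1.0 / e_max if e <= e_max else 0.0


def binary_transition(prev, curr, p_on, p_off):
    """Two-state Markov chain: p_on = P(1 | 0), p_off = P(0 | 1)."""
    if prev == 0:
        return p_on if curr == 1 else 1 - p_on
    return p_off if curr == 0 else 1 - p_off
```

The design choice behind the model is visible in the numbers: with `mu=2.0` and `e_max=10.0`, an error of 5 has density about $9 \times 10^{-5}$ under "no event" but $0.1$ under "occlusion or change", so large prediction errors pull the posterior toward an event.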
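We can then use Bayesian decision methods to detect occlusions at each camera and sudden-appearance-changes of the tracked person or object by computing the posterior probabilities

$$
P(O_{i,t} = 1 \mid \mathcal{E}_t), \quad i = 1, \dots, N, \qquad P(A_t = 1 \mid \mathcal{E}_t),
$$

where $\mathcal{E}_t = \{e_{i,\tau} : i = 1, \dots, N,\ \tau = 1, \dots, t\}$ denotes the collection of prediction errors observed up to time $t$, $O_{i,t}$ and $A_t$ are the binary occlusion and sudden-appearance-change indicators, and $\mathbf{h}_t = (A_t, O_{1,t}, \dots, O_{N,t})$ denotes the collection of hidden variables at time $t$. We show in Section III that these probabilities can be computed recursively and efficiently.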
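Given the joint posterior over the $2^{N+1}$ values of $\mathbf{h}_t$, both kinds of probabilities are simple marginals. A small sketch, assuming the joint posterior is stored as a dictionary keyed by tuples $(A_t, O_{1,t}, \dots, O_{N,t})$, a representation chosen here for illustration; `marginals` is a hypothetical name.

```python
def marginals(joint):
    """joint maps hidden-state tuples (A, O_1, ..., O_N) to posterior probabilities.

    Returns P(A = 1 | E) and the list [P(O_i = 1 | E) for i = 1..N].
    """
    n = len(next(iter(joint))) - 1  # number of cameras
    p_change = sum(p for h, p in joint.items() if h[0] == 1)
    p_occ = [sum(p for h, p in joint.items() if h[i + 1] == 1) for i in range(n)]
    return p_change, p_occ
```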
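The proposed method raises an alarm whenever the probability of a sudden-appearance-change exceeds a preset threshold. The method may also adjust the learning parameter of each compressive tracking sub-system according to the probability of an occlusion at that camera. For example, we may stop updating the discriminative model of camera $i$ (that is, set its learning parameter to the no-update value) whenever the probability of an occlusion at camera $i$ exceeds a threshold.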
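One possible controller for this step, sketched under assumptions: the thresholds and the normal learning parameter `lam_normal` are hypothetical values, and the no-update value `lam_frozen = 1.0` assumes the compressive-tracking convention in which a learning parameter of 1 leaves the model unchanged. The name `control_step` is ours.

```python
def control_step(p_change, p_occ, alarm_threshold=0.5, occ_threshold=0.5,
                 lam_normal=0.85, lam_frozen=1.0):
    """Hypothetical central controller for one time step.

    p_change: P(A_t = 1 | E_t); p_occ: [P(O_{i,t} = 1 | E_t) for each camera].
    Returns (raise_alarm, learning parameter for each camera).
    """
    raise_alarm = p_change > alarm_threshold
    # Freeze a camera's discriminative model while it is probably occluded.
    lams = [lam_frozen if p > occ_threshold else lam_normal for p in p_occ]
    return raise_alarm, lams
```

For instance, `control_step(0.9, [0.1, 0.7])` raises the alarm and freezes only the second camera's model.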
III Recursive Bayesian Estimation
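In this section, we derive a recursive formula for computing the posterior $P(\mathbf{h}_t \mid \mathcal{E}_t)$ of the hidden state $\mathbf{h}_t = (A_t, O_{1,t}, \dots, O_{N,t})$, given the prediction errors $\mathcal{E}_t$ observed up to time $t$, from $P(\mathbf{h}_{t-1} \mid \mathcal{E}_{t-1})$. Writing $\mathbf{e}_t = (e_{1,t}, \dots, e_{N,t})$ for the prediction errors observed at time $t$,

$$
P(\mathbf{h}_t \mid \mathcal{E}_t)
= \frac{p(\mathbf{e}_t \mid \mathbf{h}_t, \mathcal{E}_{t-1})\, P(\mathbf{h}_t \mid \mathcal{E}_{t-1})}{p(\mathbf{e}_t \mid \mathcal{E}_{t-1})}
\overset{(a)}{=} \frac{1}{Z_t} \prod_{i=1}^{N} p(e_{i,t} \mid A_t, O_{i,t}) \sum_{\mathbf{h}_{t-1}} P(\mathbf{h}_t \mid \mathbf{h}_{t-1})\, P(\mathbf{h}_{t-1} \mid \mathcal{E}_{t-1}), \tag{1}
$$

where $Z_t$ normalizes the right-hand side over the $2^{N+1}$ values of $\mathbf{h}_t$, and (a) follows from the Markovian properties of the statistical model and the conditional independence of the prediction errors given the hidden variables.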
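A minimal sketch of one step of recursion (1). It assumes, beyond what the model states, that the chains $\{A_t\}$ and $\{O_{i,t}\}$ evolve independently of one another, so that $P(\mathbf{h}_t \mid \mathbf{h}_{t-1})$ factorizes into per-variable transition probabilities. The callables `trans_a`, `trans_o`, and `emission` are hypothetical inputs (for example, the two-state chain and exponential/uniform density of Section II), and posteriors are stored as dictionaries keyed by $(A_t, O_{1,t}, \dots, O_{N,t})$.

```python
import itertools
import math


def forward_step(prior, errors, trans_a, trans_o, emission):
    """Map P(h_{t-1} | E_{t-1}) to P(h_t | E_t) as in Eq. (1).

    prior:    dict (A, O_1, ..., O_N) -> P(h_{t-1} | E_{t-1})
    errors:   [e_{1,t}, ..., e_{N,t}]
    trans_a:  trans_a(prev, curr) = P(A_t = curr | A_{t-1} = prev)
    trans_o:  trans_o(prev, curr) = P(O_{i,t} = curr | O_{i,t-1} = prev)
    emission: emission(e, a, o) = p(e_{i,t} | A_t = a, O_{i,t} = o)
    """
    n = len(errors)
    unnorm = {}
    for h in itertools.product((0, 1), repeat=n + 1):
        # Prediction: sum over previous states with a factorized transition.
        pred = sum(p * trans_a(g[0], h[0])
                   * math.prod(trans_o(g[i + 1], h[i + 1]) for i in range(n))
                   for g, p in prior.items())
        # Update: errors are conditionally independent given the hidden state.
        lik = math.prod(emission(errors[i], h[0], h[i + 1]) for i in range(n))
        unnorm[h] = pred * lik
    z = sum(unnorm.values())  # Z_t in Eq. (1)
    return {h: v / z for h, v in unnorm.items()}
```

Starting, for example, from a prior that puts all mass on the state with no change and no occlusion, one would call `forward_step` once per frame and read the occlusion and sudden-appearance-change probabilities off the marginals of the result. The double loop costs $O(4^{N+1})$ operations per frame, which is small for the four cameras used in Section IV.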
IV Experimental Results
We use one of the PETS 2007 (Tenth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance) data sets, available from http://www.cvg.reading.ac.uk/PETS2007/data.html. The data set consists of 12000 video frames from 4 cameras, 3000 video frames per camera. A suspicious person enters the scene at roughly frame 500, and drops and leaves behind his backpack at roughly frame 850.
We observe that the prior-art real-time compressive tracker with a fixed learning parameter suffers from the drifting problem, as shown in Fig. 2. Fig. 3 shows that this drifting is avoided by the algorithm proposed in this paper. Our algorithm clearly detects the occlusion event around frame 577, with a corresponding probability close to 1.
The proposed algorithm also detects the unusual behavior of the tracked person at frame 851, with a high corresponding probability of a sudden-appearance-change. It also tracks the whereabouts of the suspicious person, as shown in Figs. 4 and 5, each of which shows the tracking results at one camera.
In all the figures above, the tracking results are shown as red bounding boxes, and the frame index is labelled in each image.
V Conclusion
This paper presents a visual tracking algorithm that detects sudden-appearance-changes and occlusions. Experimental results show that the proposed algorithm can reliably detect sudden-appearance-change and occlusion events. Such reliable estimates can also be used to avoid the drifting problem.
This research was originally submitted to Xinova, LLC by the author in response to a Request for Invention. It is among several submissions that Xinova has chosen to make available to the wider community. The author wishes to thank Xinova, LLC for their funding support of this research. More information about Xinova, LLC is available at www.xinova.com.
References
-  K. Zhang, L. Zhang, and M.-H. Yang, “Real-time compressive tracking,” in 12th European Conference on Computer Vision (ECCV), October 2012.
-  D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, 2003.
-  S. Avidan, “Support vector tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 8, 2004.
-  ——, “Ensemble tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2, 2007.
-  M. Quaritsch, M. Kreuzthaler, B. Rinner, H. Bischof, and B. Strobl, “Autonomous multicamera tracking on embedded smart cameras,” Journal on Embedded Systems, January 2007.
-  Q. Cai and J. Aggarwal, “Tracking human motion in structured environments using a distributed-camera system,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 12, 1999.
-  M. Bhuyan, B. Lovell, and A. Bigdeli, “Tracking with multiple cameras for video surveillance,” in 9th Biennial Conference of the Australian Pattern Recognition Society on Digital Image Computing Techniques and Applications, December 2007.
-  R. Eshel and Y. Moses, “Homography based multiple camera detection and tracking of people in a dense crowd,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2008.
-  A. Levy and M. Lindenbaum, “Sequential Karhunen-Loeve basis extraction and its applications to images,” IEEE Transactions on Image Processing, vol. 9, no. 8, 2000.