1 Introduction
Person reidentification at a distance is receiving increasing attention in video surveillance, particularly in applications where the use of face recognition is restricted. The task is very challenging because of the following difficulties:
• Robust human representation (signature). The appearance of the human body varies widely (e.g., across views, poses, and lighting conditions). It is usually intractable to construct a template of the individual to be recognized by extracting only low-level image features.
• Effective human matching (localizing). Given the template, reidentifying targets with global body information often suffers from a high rate of false positive matches, as the targets may be occluded by or conjoined with other people and the background in realistic surveillance applications. Furthermore, it is generally desirable to localize human body parts accurately.
The objective of human reidentification in this work is to recognize an individual by employing body information to address the above difficulties. We study the problem under the following settings, based on application requirements in surveillance: (1) The clothing of an individual remains unchanged across different scenarios. (2) The individual to be reidentified appears at a moderate resolution (e.g., pixels in height). Our approach builds a compositional part-based template to represent the target individual and matches the template with input images by employing a stochastic cluster sampling algorithm, as illustrated in Fig. 1.
We organize the template of a query individual with an expressive tree representation that can be produced in a very simple way. We run the human body part detectors [1, 2] on several reference images of the individual, and the images of detected parts are grouped according to their semantics. That is, a human template is decomposed into body parts, e.g., head, torso, arms, each of which is associated with a number of part instances. Note that we can prune instances whose appearance is very similar to that of others. This expressive template fully exploits information from multiple reference images to capture appearance variability well, and is partially motivated by the recently proposed hierarchical and part-based models in object recognition [23, 18, 16]. Specifically, several possible instances (namely proposals), extracted from different references, exist at each part in the template, and we refer to this representation as the multiple-instance-based compositional template (MICT). As a result, new appearance configurations can be composed from the part proposals in the MICT. One may question the scalability of building such a customized template. We argue that the critical concern is accurately identifying the target in realistic scenarios, e.g., searching for one suspect across scenes, rather than processing large numbers of targets at the same time.
In the inference stage, the body part detectors are first used to generate possible part locations in the scene shot, and human reidentification is then posed as the task of part-based template matching. Unlike traditional matching problems, the multiple part proposals in the MICT make the search space of matching combinatorially large, as the part proposals need to be activated along with the matching process. Handling the false alarms and misdetections of the part detectors is also a nontrivial issue during matching. Inspired by recent studies in cluster sampling [6, 17, 22], we propose a stochastic algorithm to solve the compositional template matching.
The matching algorithm is built on the candidacy graph, where each vertex denotes a pair of matching part proposals, and each edge represents the contextual interaction (i.e. the compatible or the competitive relation) between two matching pairs. Compatible relations encourage vertices to be activated together, while competitive relations suppress conflicting vertices from being activated at the same time. Specifically, two vertices are encouraged to be activated together if they are kinematically or symmetrically related, whereas only one of two vertices can be activated if they belong to the same part type or overlap. The algorithm iterates two steps to search for the optimal matching solution. (i) It forms several possible partial matches (clusters) by turning off edge links probabilistically and deterministically. (ii) It activates clusters to confirm partial matches, leading to a new matching solution that is accepted or rejected by the Markov Chain Monte Carlo (MCMC) mechanism [6]. Note that body parts are allowed to remain unmatched to cope with occlusions.
The main contributions of this paper are twofold. First, we propose a novel formulation to solve human reidentification by matching the composite template with cluster sampling. Second, we present a new database including realistic and general challenges for human reidentification, which is more complete than existing related databases.
2 Related Work
In the literature, previous work on human reidentification mainly focuses on constructing and selecting distinctive and stable human representations, and can be roughly divided into the following two categories.
Global-based methods define a global human appearance signature with rich image features and match given reference images with the observations [14, 24, 8]. For example, D. Gray et al. [14] propose a feature ensemble for viewpoint-invariant recognition. Some methods improve performance by extracting features with region segmentation [15, 25, 4]. Recently, advanced learning techniques have been employed to obtain more reliable matching metrics [26], more representative features [19], and a more expressive multi-valued mapping function [3]. Despite their acknowledged success, methods in this category often have difficulty handling large pose/view variations and occlusions.
Compositional approaches reidentify people by using part-based measures. They first localize salient body parts, and then search for part-to-part correspondences between reference samples and observations. These methods show promising results on very challenging scenarios [21], benefiting from powerful part-based object detectors. For example, N. Gheissari et al. [12] adopt a decomposable triangulated graph to represent the person configuration, and a pictorial structures model for human reidentification is introduced in [7]. In addition, modeling the contextual correlation between body parts is discussed in [5].
Many works [12, 8, 7] utilize multiple reference instances for each individual, i.e. multi-shot approaches, but they ignore occlusions and conjunctions in the target images and reidentify the target by computing a one-to-many distance, while we explicitly handle these problems by exploiting reconfigurable compositions and contextual interactions during inference.
3 Representation
In this section, we first introduce the definition of the multiple-instance-based compositional template, and then present the problem formulation of human reidentification.
3.1 Compositional Template
In this work, we present a compositional template to model humans with large appearance variations.
A human body is decomposed into parts: head, torso, upper arms, forearms, thighs and calves, and each limb is further decomposed into two symmetrical parts (i.e. left and right), as shown in Fig. 2(a). Each part is modeled as a rectangle and indicated by a tuple , where denotes the part type, and the part center coordinates, the part orientation, the part relative scale, as widely employed in the pictorial structures model [9, 1]. The multiple-instance-based compositional template (MICT) is defined as
(1) 
where denotes a part proposal and the set of proposals for the th part in template.
Given reference images of an individual, the MICT is constructed as follows.
We first employ body part detectors to scan every reference image and obtain detection scores for all body parts. The training and detection processes of the part detectors closely follow [2]. Given the detection scores, we further prune impossible part configurations with several strategies: (i) For all parts, a firing detection is pruned if its overlap rate with the foreground mask (obtained by background subtraction) is less than . (ii) The reference image is segmented into horizontal strips of equal height, ordered first to fourth from top to bottom. The head is detected in the first strip, the upper-body parts (i.e. torso, upper arms and forearms) in the second, and the lower-body parts (i.e. thighs and calves) in the rest. Finally, we apply non-maximum suppression and collect the proposals with the highest responses for each part from all reference images.
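Pruning strategy (ii) can be sketched as a simple lookup from a detection's vertical position to the strip it falls in. This is only an illustration in Python, not the authors' C++ implementation: the use of four strips is inferred from the phrase "first to fourth", and the part-type names, the function names, and the choice of testing the part center (rather than the whole rectangle) are assumptions.

```python
# Assumed number of equal-height strips, inferred from "first to fourth".
NUM_STRIPS = 4

# Allowed strip indices (0 = top strip) for each part type, following
# strategy (ii): head in the first strip, upper body in the second,
# lower body in the remaining strips. Part-type names are hypothetical.
ALLOWED_STRIPS = {
    "head": {0},
    "torso": {1},
    "upper_arm": {1},
    "forearm": {1},
    "thigh": {2, 3},
    "calf": {2, 3},
}


def strip_index(center_y: float, image_height: float) -> int:
    """Index of the horizontal strip containing a part center (0 = top)."""
    idx = int(center_y * NUM_STRIPS / image_height)
    # Clamp so that centers on the image border still map to a valid strip.
    return min(max(idx, 0), NUM_STRIPS - 1)


def keep_detection(part_type: str, center_y: float, image_height: float) -> bool:
    """True if the detection lies in a strip allowed for its part type."""
    return strip_index(center_y, image_height) in ALLOWED_STRIPS[part_type]
```

For a reference image 100 pixels high, `keep_detection("head", 10, 100)` keeps the detection, while a head detection centered at y = 60 would be pruned.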
Given target images (scene shots) to be matched, we obtain the target proposal set by a process similar to the construction of the MICT, except that firing detections are pruned only by the foreground mask. Given the realistic complexities of surveillance, the target proposal set is likely to contain a large number of detection false alarms.
3.2 Candidacy Graph
Given the template and the target proposal set , the problem of human reidentification can be posed as the task of partbased template matching and solved by two steps: (i) activating one proposal for each part in , (ii) finding the match in .
We define the set of activated part proposals from , each of which corresponds to a certain part:
(2) 
The binary label indicates whether the proposal is activated or remains inactivated, i.e. for activated and for inactivated. The set of matched part proposals from can be defined as
(3) 
where maps the activated proposal of the th part in to a proposal in . Note that not necessarily has a match (i.e. ), in case the matched part is occluded or missed in .
To solve these two steps simultaneously, we propose a candidacy graph representation and further formulate the problem by graph labeling. We define the candidacy graph , where each vertex denotes a candidate matching pair . A similar binary label is employed to indicate whether a matching pair is activated or not. Solving the matching problem is equivalent to labeling vertices in the candidacy graph . The label set is thus defined as
(4) 
Each edge in denotes the relation between two matching pairs and . We incorporate two kinds of relations, i.e. compatible and competitive relations, to model the contextual interactions in scene shots. In the following discussion, we drop the edge index for notational simplicity.
Compatible relations encourage matching pairs to be activated together in matching. We represent compatible relations by how two target part proposals are coupled together and mainly explore two cases: (i) kinematics relations, which couple kinematically dependent parts, and (ii) symmetry relations, which couple symmetrical parts. That is,
(5) 
where and denote the part types of and , respectively.
(i) Kinematics relations describe the spatial relationship between kinematically dependent parts (navy blue edges in Fig. 2(a)). The spatial distribution between two proposals and is modeled as a zero-mean Gaussian distribution under the coordinate system of their connected joint:
(6) 
where and are the transformations of and from image coordinate system to joint coordinate system. For detailed explanations, see [9, 1].
In the experiment, kinematics relations are learnt from reference images with body part annotations.
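To make the kinematics relation concrete, the following sketch scores how well two parts agree on the location of their shared joint under a zero-mean Gaussian. It is a simplified illustration in the spirit of the pictorial structures model [9, 1] and not a reproduction of Eq. (6): it considers joint position only, uses a diagonal covariance, and the function names, the `offset` parameterization, and the default `sigma` values are all assumptions.

```python
import math


def joint_position(center, theta, scale, offset):
    """Image location of a joint as predicted by one part.

    `offset` is the joint position in the part's local frame (a hypothetical
    parameterization); it is rotated by `theta` and scaled by `scale`.
    """
    ox, oy = offset
    c, s = math.cos(theta), math.sin(theta)
    return (center[0] + scale * (c * ox - s * oy),
            center[1] + scale * (s * ox + c * oy))


def kinematic_log_density(joint_a, joint_b, sigma=(5.0, 5.0)):
    """Log-density of a zero-mean diagonal Gaussian on the joint mismatch.

    `sigma` holds assumed per-axis standard deviations; in the paper the
    distribution is learnt from annotated reference images.
    """
    log_p = 0.0
    for a, b, sd in zip(joint_a, joint_b, sigma):
        d = a - b
        log_p += -0.5 * (d / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))
    return log_p
```

Two proposals whose predicted joint locations coincide receive the highest score, and the score decays as the predicted locations drift apart.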
(ii) Symmetry relations measure the appearance similarity between symmetrical parts (brown edges in Fig. 2(a)). We assume that symmetrical parts from the same individual tend to share a similar appearance, while those from different individuals do not. The symmetry relations are therefore represented as
(7) 
where measures the distance between two part proposals and is defined in Equ.(12).
We give an example to illustrate how kinematics relations and symmetry relations work in scene shots, as shown in Fig. 2(b). Note that we omit certain part proposals for clarity.
Competitive relations suppress conflicting matching pairs from being activated at the same time. We again consider two cases for competitive relations: (i) Two target proposals with the same part type cannot be activated simultaneously. (ii) The overlapped region between two target part proposals should only be compared once. That is,
(8) 
where indicates the intersection-over-union overlap between and , and is a scaling constant.
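The two competitive cases can be sketched as a predicate over pairs of target proposals. This is a hedged illustration rather than Eq. (8) itself: it treats parts as axis-aligned boxes `(x1, y1, x2, y2)` instead of oriented rectangles, it returns only whether a competitive edge exists and leaves out the weighting by the scaling constant, and the function names are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def are_competitive(type_a, box_a, type_b, box_b):
    """True if two target proposals should be linked by a competitive edge.

    Case (i): both proposals have the same part type.
    Case (ii): the two proposals overlap in the image.
    """
    return type_a == type_b or iou(box_a, box_b) > 0.0
```

For example, a head proposal and a torso proposal that share no pixels are not in competition, whereas two head proposals always are, regardless of where they lie.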
An illustration of the candidacy graph representation is shown in Fig. 4, corresponding to the example in Fig. 3.
In summary, the problem of matching the template to the target proposal set can be represented as
(9) 
where denotes the number of unmatched part pairs and the number of scales of the activated proposals, both of which can be computed from the labeling set . According to Bayes’ rule, can be solved by maximizing a posterior probability:
(10)  
Likelihood measures the appearance similarity between the template and the matching target. Assuming the appearance similarity of each matching pair is independent, then can be factorized into
(11) 
where denotes the distance between two proposals.
We adopt the modified HSV color histogram [7] and the MSCR descriptor [11] to describe the visual statistics of each part proposal; both have been widely used in existing human reidentification studies [8, 7]. The distance between two arbitrary proposals and is defined as
(12) 
where denotes the normalized HSV color histogram, the MSCR descriptor, and the Bhattacharyya distance and the distance defined in [8], respectively.
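As a concrete illustration of the histogram term, the sketch below computes a Bhattacharyya distance between two normalized histograms and combines it with a precomputed MSCR distance. Several points are assumptions: the `sqrt(1 - BC)` form is one common convention for the Bhattacharyya distance (another is `-ln BC`), the weighted sum with a hypothetical weight `beta` stands in for the combination in Eq. (12), and the MSCR distance of [8] is taken as a given number rather than implemented.

```python
import math


def bhattacharyya_distance(p, q):
    """sqrt(1 - BC) for two histograms, each normalized to sum to 1.

    BC is the Bhattacharyya coefficient, the sum over bins of sqrt(p_i * q_i).
    """
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - bc))


def proposal_distance(hist_a, hist_b, mscr_dist, beta=0.5):
    """Hypothetical weighted combination of the two appearance distances."""
    return beta * bhattacharyya_distance(hist_a, hist_b) + (1.0 - beta) * mscr_dist
```

Identical histograms give a distance of zero and histograms with disjoint support give a distance of one, so the color term is bounded in [0, 1] under this convention.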
Prior penalizes the undesired activation of matching pairs (e.g. missing parts) and matching inconsistency among the activated matching pairs. We define as
(13)  
where and are corresponding parameters for and , respectively.
imposes constraints on the edge links among activated vertices, that is
(14) 
where and indicate the compatible edges and competitive edges in the candidacy graph , respectively.
4 Inference Algorithm
In a scene shot containing multiple individuals, matching the template to the target becomes an extremely complicated problem. For example, in Fig. 3, the four individuals in the shot all share a similar appearance with the template. As a result, solving Equ.(10) is likely to yield only a locally optimal solution. In this case, popular inference algorithms, such as EM, Belief Propagation and Dynamic Programming, easily get stuck and thus fail to reidentify the correct target (i.e. to find the globally optimal solution), while Composite Cluster Sampling, as introduced in [17, 22], overcomes this problem by jumping between partial coupling matches in each MCMC step. Therefore, we employ Composite Cluster Sampling to search for the optimal match between the template and the correct target.
The Composite Cluster Sampling algorithm consists of the following two steps:
(I) Generating a composite cluster. Given a candidacy graph and the current matching state , we first separate the graph edges into two sets: the set of inconsistent edges (i.e. edges violating the current state) and the set of consistent edges in the other two cases. Next we introduce a boolean variable to indicate whether an edge is turned on or off. We turn off inconsistent edges deterministically and turn on every consistent edge with its edge probability . Afterwards, we regard candidates connected by "on" positive edges as a cluster and collect clusters connected by "on" negative edges to generate a composite cluster .
(II) Relabeling the composite cluster. In this step, we randomly choose a cluster from the obtained composite cluster and flip the labels of the selected cluster and its conflicting clusters (i.e. the clusters connected with the selected cluster), which generates a new state . To find a better state and achieve a reversible transition between two states and , the acceptance rate of the transition from state to state is defined by the Metropolis-Hastings method [20]:
(15) 
where and denote the state transition probability, and the posterior defined in Equ.(10).
Following instructions in [6], the state transition probability ratio is computed by
(16)  
where and denote the sets of positive and negative edges being turned off around , respectively, that is,
(17)  
Note that the subscript of , in Equ.(16) indicates the current state and is omitted for simplicity in the above definition.
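Step (I) can be sketched with a union-find structure over the vertices of the candidacy graph. The sketch rests on assumptions that should be read as such: labels are binary, a positive edge is treated as consistent when its endpoints share a label and a negative edge when they differ (the convention used in cluster sampling work such as [22]), and the function names and the edge format `(u, v, prob)` are hypothetical.

```python
import random


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def sample_composite_cluster(labels, pos_edges, neg_edges, rng):
    """Sample clusters and the negative links between them.

    labels: list of 0/1 vertex labels (the current matching state).
    pos_edges, neg_edges: lists of (u, v, prob) with edge probability prob.
    Returns (clusters, links): clusters is a list of vertex lists, links is
    a set of index pairs of clusters joined by an "on" negative edge.
    """
    n = len(labels)
    parent = list(range(n))
    # Inconsistent positive edges stay off; consistent ones turn on with prob.
    for u, v, prob in pos_edges:
        if labels[u] == labels[v] and rng.random() < prob:
            parent[find(parent, u)] = find(parent, v)
    groups = {}
    for v in range(n):
        groups.setdefault(find(parent, v), []).append(v)
    clusters = list(groups.values())
    index = {root: i for i, root in enumerate(groups)}
    links = set()
    # Negative edges are handled the same way, linking whole clusters.
    for u, v, prob in neg_edges:
        if labels[u] != labels[v] and rng.random() < prob:
            a, b = index[find(parent, u)], index[find(parent, v)]
            links.add((min(a, b), max(a, b)))
    return clusters, links
```

A composite cluster is then any group of clusters joined through `links`; flipping the labels of one cluster together with its linked neighbors corresponds to step (II).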
We show an example of one transition in composite cluster sampling in Fig. 5. In this figure, contains two clusters . In state A, is activated and the conflicting cluster is deactivated, while in state B the labels of and are flipped. The transition from state to state achieves a fast jump between two kinds of partial coupling matches and coincides with an individual-to-individual comparison in reidentification.
Applying the above mechanism, we summarize the inference algorithm in Algorithm 1.
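The accept/reject decision of step (II) follows the standard Metropolis-Hastings rule, which can be sketched in the log domain as below. This is a generic illustration, not a transcription of Algorithm 1: `log_post` and `propose` are hypothetical callables supplied by the caller, and `log_q_ratio` stands for the logarithm of the state transition probability ratio of Equ.(16).

```python
import math
import random


def acceptance_probability(log_post_new, log_post_old, log_q_ratio=0.0):
    """Metropolis-Hastings acceptance rate, computed in the log domain.

    Equals min(1, q_ratio * p(new) / p(old)), where q_ratio is the ratio of
    the reverse to the forward proposal probability.
    """
    log_alpha = log_q_ratio + log_post_new - log_post_old
    return 1.0 if log_alpha >= 0.0 else math.exp(log_alpha)


def mh_step(state, log_post, propose, rng):
    """One sampling step: propose a new state, then accept or keep the old one.

    propose(state, rng) must return (candidate, log_q_ratio).
    """
    candidate, log_q_ratio = propose(state, rng)
    alpha = acceptance_probability(log_post(candidate), log_post(state), log_q_ratio)
    return candidate if rng.random() < alpha else state
```

Working with log-posteriors avoids numerical underflow when the posterior is a product of many small likelihood and prior terms.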
5 Experiments
In this section, we first introduce the datasets and the parameter settings, and then show our experimental results as well as component analysis of the proposed approach.
5.1 Datasets and Settings
We validate our method on three public databases as follows.
(i) VIPeR dataset (available at www.umiacs.umd.edu/~schwartz/datasets.html). It is commonly used for human reidentification, containing people in outdoor scenes, and there are images for each individual.
(ii) EPFL dataset (available at cvlab.epfl.ch/data/pom/). This database is very challenging and was originally proposed for multi-view tracking [10]. It consists of different scenarios that are filmed by three or four cameras from different angles. To evaluate our method, we extract individuals from the original videos and annotate each of them with an ID and a location (bounding box). In total, there are reference images for different individuals (normalized to pixels in height), and shots in , which contain targets to be reidentified.
(iii) CAMPUS-Human dataset (available at http://vision.sysu.edu.cn/projects/humanreid/). We construct this database to include general and realistic challenges for people reidentification in surveillance. There are reference images normalized to pixels in height, for individuals, with IDs and locations provided. We present shots containing targets for evaluating methods, and the targets often appear with diverse poses/views, conjunctions and occlusions; see Fig. 7 (bottom row). Note that all images in both the EPFL dataset and the CAMPUS-Human dataset are captured from the original videos with large time gaps to guarantee appearance variety (unlike the ETHZ dataset [24]).
Experiment settings. For the VIPeR dataset, we adopt the common setting of running the algorithm on random partitions containing pairs. For the EPFL and CAMPUS-Human datasets, we randomly select reference images for each individual, and all target images are tested for matching. The results on all three datasets are averaged over ten runs. Our approach is evaluated with both a single reference image (single-shot, SvsS) and multiple reference images (multi-shot, MvsS, ).
All the parameters are fixed in the experiments, including for scaling the overlap , and for penalizing the activation of vertices. We construct the MICT for each individual with their selected reference images. In the reidentification, a number of body part proposals are generated. In practice, we set approximately times the number of individuals in the shot.
We implement our approach in C++ and run the program on a PC with an Intel i5 2.8 GHz CPU and 4 GB of memory. On average, the inference algorithm converges after around samplings, which costs . The time cost depends on the complexity of the candidacy graph.
5.2 Experimental Results
We compare our approach with state-of-the-art methods: Pictorial Structures (PS) [1], View-based Pictorial Structures (VPS) [2], Custom Pictorial Structures (CPS) [7], Symmetry-driven Accumulation of Local Features (SDALF) [8] and Ensemble of Localized Features (ELF) [14]. We adopt the provided code of PS and implement VPS and CPS according to their descriptions. For a fair comparison, PS, VPS and CPS employ the same likelihood as the proposed method. The results are evaluated in two ways: (i) reidentifying individuals in segmented images, i.e. targets already localized, and (ii) reidentifying individuals from scene shots without provided segmentations.
For the first evaluation, we adopt the cumulative match characteristic (CMC) curve for quantitative analysis, as in previous works [13, 24]. The curve reflects the overall ranked matching rates; precisely, a rank matching rate indicates the percentage of correct matches found in top ranks. As Fig. 6 shows, our approach outperforms the competing approaches in both the single-shot and the multi-shot case, and it yields the best rank matching rate on the EPFL and CAMPUS-Human datasets. We observe that the performance of reidentification can be improved significantly by fully exploiting reconfigurable compositions and contextual interactions in inference. Our performance improves only slightly on the VIPeR dataset, as most erroneous matches are due to severe illumination changes, as also observed in [8].
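The CMC curve used in the first evaluation can be computed from a probe-by-gallery distance matrix as sketched below. This is a generic illustration with hypothetical names, and it assumes that every probe identity appears exactly once in the gallery.

```python
def cmc_curve(distances, gallery_ids, probe_ids):
    """Cumulative match characteristic from a distance matrix.

    distances[i][j] is the distance between probe i and gallery item j.
    Returns a list whose entry k - 1 is the fraction of probes whose correct
    gallery match is found within the top k ranks.
    """
    n_gallery = len(gallery_ids)
    hits = [0] * n_gallery
    for i, row in enumerate(distances):
        # Gallery indices sorted from most to least similar to probe i.
        order = sorted(range(n_gallery), key=lambda j: row[j])
        # Zero-based rank at which the correct identity appears.
        rank = next(k for k, j in enumerate(order) if gallery_ids[j] == probe_ids[i])
        # A match at this rank also counts for every larger rank.
        for k in range(rank, n_gallery):
            hits[k] += 1
    return [h / len(distances) for h in hits]
```

With two probes, one matched at the first rank and one at the third, the curve is `[0.5, 0.5, 1.0]`; by construction it is non-decreasing and reaches 1.0 at the last rank.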
Table 1: Results of reidentification on scene shots without provided segmentations.
Method        EPFL     CAMPUS-Human
Ours (M=2)    57/294   215/1519
VPS (M=2)     54/294   175/1519
PS (M=2)      32/294   141/1519
Ours (Single) 50/294   173/1519
VPS (Single)  49/294   139/1519
PS (Single)   24/294   118/1519
The second test is stricter, since the algorithms must also localize the target during reidentification. We adopt the PASCAL Challenge criterion to evaluate the localization results: a match is counted as correct only if the intersection-over-union ratio () with the ground-truth bounding box is greater than . We compare our method with PS [1] and VPS [2], which can localize the body at the same time as localizing the parts. The quantitative results are reported in Table 1. A number of representative results generated by our method are shown in Fig. 7. The results show that existing methods perform poorly when individuals are not well segmented and scaled to a uniform size. In contrast, our method can reidentify challenging target individuals by searching for and matching their salient parts, and thus achieves better performance. Note that the performance of our approach also drops significantly due to inaccurate part localizations and interference from other individuals.
Component Analysis. We further analyze the benefits of the components of our approach on the CAMPUS-Human dataset under the setting: multi-shot . Regarding feature effectiveness, we separately evaluate different image features, as shown in Fig. 8(left). It is apparent that the combined feature improves the result. We also demonstrate the effectiveness of the constraints employed, and Fig. 8(right) confirms that both kinematics and symmetry constraints help construct better matching solutions.
6 Conclusion
This paper studies a novel compositional template for human reidentification, in the form of an expressive multiple-instance-based compositional representation of the query individual. By exploiting reconfigurable compositions and contextual interactions during inference, our method handles the challenges of human reidentification well. In future work, we will explore more robust and flexible part representations and better inter-part relations.
References

[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In Proc. CVPR, 2009.
[2] M. Andriluka, S. Roth, and B. Schiele. Monocular 3D pose estimation and tracking by detection. In Proc. CVPR, 2010.
[3] T. Avraham, I. Gurvich, M. Lindenbaum, and S. Markovitch. Learning implicit transfer for person reidentification. ECCV Workshops, 2012.
[4] S. Bak, E. Corvee, F. Bremond, and M. Thonnat. Person reidentification using Haar-based and DCD-based signature. In Proc. AVSS, 2010.
[5] S. Bak, E. Corvee, F. Bremond, and M. Thonnat. Person reidentification using spatial covariance regions of human body parts. In Proc. AVSS, 2010.
[6] A. Barbu and S. Zhu. Generalizing Swendsen-Wang to sampling arbitrary posterior probabilities. IEEE Trans. PAMI, 27(8):1239–1253, 2005.
[7] D. Cheng, M. Cristani, M. Stoppa, L. Bazzani, and V. Murino. Custom pictorial structures for reidentification. In Proc. BMVC, 2011.
[8] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person reidentification by symmetry-driven accumulation of local features. In Proc. CVPR, 2010.
[9] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, 2005.
[10] F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE Trans. PAMI, 33(9):1806–1819, 2011.
[11] P.E. Forssén. Maximally stable colour regions for recognition and matching. In Proc. CVPR, 2007.
[12] N. Gheissari, T. Sebastian, P. Tu, and J. Rittscher. Person reidentification using spatiotemporal appearance. In Proc. CVPR, 2006.
[13] D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition and tracking. PETS, 2007.
[14] D. Gray and H. Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In Proc. ECCV, 2008.
[15] W. Hu, M. Hu, X. Zhou, T. Tan, J. Lou, and S. Maybank. Principal axis-based correspondence between multiple cameras for people tracking. IEEE Trans. PAMI, 28(4):663–671, 2006.
[16] L. Lin, X. Liu, S. Peng, H. Chao, Y. Wang, and B. Jiang. Object categorization with sketch representation and generalized samples. Pattern Recognition, 45:3648–3660, 2012.
[17] L. Lin, X. Liu, and S. Zhu. Layered graph matching with composite cluster sampling. IEEE Trans. PAMI, 32(8):1426–1442, 2010.
[18] L. Lin, T. Wu, J. Porway, and Z. Xu. A stochastic graph grammar for compositional object representation and recognition. PR, 42:1297–1307, 2009.
[19] C. Liu, S. Gong, C. Loy, and X. Lin. Person reidentification: What features are important? ECCV Workshops, 2012.
[20] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. J. Chemical Physics, 21(6):1087–1092, 1953.
[21] U. H. Office. iLIDS multiple camera tracking scenario definition. 2008.
[22] J. Porway and S. Zhu. C4: Exploring multiple solutions in graphical models by cluster sampling. IEEE Trans. PAMI, 33(9):1713–1727, 2011.
[23] B. Rothrock and S. Zhu. Human parsing using stochastic and-or grammars and rich appearances. ICCV Workshops, 2011.
[24] W. Schwartz and L. Davis. Learning discriminative appearance-based models using partial least squares. In XXII SIBGRAPI, 2009.
[25] X. Wang, G. Doretto, T. Sebastian, J. Rittscher, and P. Tu. Shape and appearance context modeling. In Proc. ICCV, 2007.
[26] W. Zheng, S. Gong, and T. Xiang. Person reidentification by probabilistic relative distance comparison. In Proc. CVPR, 2011.