Person re-identification at a distance is receiving increasing attention in video surveillance, particularly in applications where the use of face recognition is restricted. The task is very challenging, however, owing to the following difficulties:
• Robust human representation (signature). The appearance of the human body varies greatly (e.g., with view, pose, and lighting conditions). It is usually intractable to construct a template of the individual to be recognized by extracting only low-level image features.
• Effective human matching (localizing). Given the template, re-identifying targets with global body information often yields many false-positive matches, as the targets may be occluded or conjoined with other people and the background in realistic surveillance applications. Furthermore, it is generally desirable to localize human body parts accurately.
The objective of human re-identification in this work is to recognize an individual by employing body information to address the above difficulties. We study the problem under the following settings, based on application requirements in surveillance: (1) The clothing of individuals remains unchanged across different scenarios. (2) The individual to be re-identified should appear at a moderate resolution, (e.g., pixels in height). Our approach builds a compositional part-based template to represent the target individual and matches the template against input images by employing a stochastic cluster sampling algorithm, as illustrated in Fig. 1.
We organize the template of a query individual as an expressive tree representation that can be produced in a very simple way. We run human body part detectors [1, 2] on several reference images of the individual and group the images of detected parts according to their semantics. That is, a human template is decomposed into body parts (e.g., head, torso, arms), each of which is associated with a number of part instances. Note that instances whose appearance is very similar to that of others can be pruned. This expressive template fully exploits information from multiple reference images to capture appearance variability well, and is partially motivated by recently proposed hierarchical and part-based models in object recognition [23, 18, 16]. Specifically, several possible instances (namely proposals), extracted from different reference images, exist at each part in the template, and we refer to this representation as the multiple-instance-based compositional template (MICT). As a result, new appearance configurations can be composed from the part proposals in the MICT. One may question the scalability of building such a customized template. We argue that the critical concern is accurately identifying the target in realistic scenarios (e.g., searching for one suspect across scenes) rather than processing large numbers of targets at the same time.
In the inference stage, the body part detectors are first used to generate possible part locations in the scene shot, and human re-identification is then posed as a part-based template matching task. Unlike traditional matching problems, the multiple part proposals in the MICT make the search space of matching combinatorially large, as the part proposals need to be activated along with the matching process. Handling the false alarms and missed detections of the part detectors is also a non-trivial issue during matching. Inspired by recent studies in cluster sampling [6, 17, 22], we propose a stochastic algorithm to solve the compositional template matching problem.
The matching algorithm is built upon the candidacy graph, where each vertex denotes a pair of matching part proposals and each edge represents the contextual interaction (i.e., the compatible or the competitive relation) between two matching pairs. Compatible relations encourage vertices to be activated together, while competitive relations suppress conflicting vertices from being activated at the same time. Specifically, two vertices are encouraged to be activated together if they are kinematically or symmetrically related, whereas at most one of two vertices can be activated if they belong to the same part type or overlap. The algorithm iterates between two steps to search for the optimal matching solution. (i) It forms several possible partial matches (clusters) by turning edge links off, either probabilistically or deterministically. (ii) It activates clusters to confirm partial matches, leading to a new matching solution that is accepted or rejected by the Markov Chain Monte Carlo (MCMC) mechanism. Note that body parts are allowed to remain unmatched in order to cope with occlusions.
The main contributions of this paper are two-fold. First, we propose a novel formulation to solve human re-identification by matching the composite template with cluster sampling. Second, we present a new database including realistic and general challenges for human re-identification, which is more complete than existing related databases.
2 Related Work
In the literature, previous work on human re-identification mainly focuses on constructing and selecting distinctive and stable human representations, and it can be roughly divided into the following two categories.
Global-based methods define a global appearance signature of the human with rich image features and match the given reference images with the observations [14, 24, 8]. For example, D. Gray et al. [14] propose a feature ensemble to deal with viewpoint-invariant recognition. Some methods improve performance by extracting features with region segmentation [15, 25, 4]. Recently, advanced learning techniques have been employed to obtain more reliable matching metrics, more representative features, and more expressive multi-valued mapping functions. Despite their acknowledged success, methods in this category often have difficulty handling large pose/view variations and occlusions.
Compositional approaches re-identify people by using part-based measures. They first localize salient body parts, and then search for part-to-part correspondences between reference samples and observations. These methods show promising results in very challenging scenarios, benefiting from powerful part-based object detectors. For example, N. Gheissari et al. [12] adopt a decomposable triangulated graph to represent the person configuration, and the pictorial structures model has been introduced for human re-identification [7]. Modeling the contextual correlation between body parts has also been discussed.
Many works [12, 8, 7] utilize multiple reference instances for each individual (i.e., multi-shot approaches), but they ignore occlusions and conjunctions in the target images and re-identify the target by computing a one-to-many distance. In contrast, we explicitly handle these problems by exploiting reconfigurable compositions and contextual interactions during inference.
In this section, we first introduce the definition of the multiple-instance-based compositional template, and then present the problem formulation of human re-identification.
3.1 Compositional Template
In this work, we present a compositional template to model humans with large appearance variations.
A human body is decomposed into parts: head, torso, upper arms, forearms, thighs and calves, and each limb is further decomposed into two symmetrical parts (i.e., left and right), as shown in Fig. 2(a). Each part is modeled as a rectangle and described by a tuple consisting of the part type, the part center coordinates, the part orientation, and the part relative scale, as widely employed in the pictorial structures model [9, 1]. The multiple-instance-based compositional template (MICT) is defined as
where denotes a part proposal and the set of proposals for the th part in template.
Given reference images of an individual, the MICT is constructed as follows.
We first employ body part detectors to scan every reference image and obtain detection scores for all body parts. The training and detection processes of the part detectors closely follow [1, 2]. Given the detection scores, we further prune impossible part configurations using two strategies: (i) For all parts, a firing detection is pruned if its overlap rate with the foreground mask (obtained by background subtraction) falls below a fixed threshold. (ii) The reference image is segmented into horizontal strips of equal height (numbered first to fourth from top to bottom). The head is detected in the first strip, the upper-body parts (i.e., torso, upper arms and forearms) in the second, and the lower-body parts (i.e., thighs and calves) in the rest. Finally, we apply non-maximum suppression and collect the proposals with the highest responses for each part from all reference images.
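The construction above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names `score` and `fg_overlap`, the function `build_template`, and the parameters `min_fg_overlap` and `top_k` are hypothetical, and the strip constraint and non-maximum suppression are omitted for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PartProposal:
    part_type: str      # e.g. "head", "torso", "left_thigh"
    x: float            # part center, horizontal coordinate
    y: float            # part center, vertical coordinate
    orientation: float  # part orientation
    scale: float        # part relative scale
    score: float        # detector response (hypothetical field)
    fg_overlap: float   # overlap rate with the foreground mask (hypothetical field)


def build_template(detections, min_fg_overlap, top_k):
    """Group detections by part type and keep the top_k strongest per part."""
    by_part = defaultdict(list)
    for det in detections:
        # Foreground-mask pruning: drop detections that mostly cover background.
        if det.fg_overlap >= min_fg_overlap:
            by_part[det.part_type].append(det)
    # Keep the highest-scoring proposals for every part type.
    return {
        part: sorted(props, key=lambda p: p.score, reverse=True)[:top_k]
        for part, props in by_part.items()
    }
```

Detections from all reference images would be passed in together, so that each part of the resulting template can hold proposals originating from different images.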
Given target images (scene shots) to be matched, we can obtain the target proposal set
by a process similar to that used for constructing the MICT, except that firing detections are pruned only by the foreground mask. Given the realistic complexities of surveillance, the target proposal set is likely to contain a large number of false alarms.
3.2 Candidacy Graph
Given the template and the target proposal set, the problem of human re-identification can be posed as a part-based template matching task and solved in two steps: (i) activating one proposal for each part in the template, and (ii) finding its match in the target proposal set.
We define the set of activated part proposals from , each of which corresponds to a certain part:
The binary label indicates whether the proposal is activated or remains inactive. The set of matched part proposals from the target proposal set can be defined as
where a mapping function assigns the activated proposal of each part in the template to a proposal in the target proposal set. Note that an activated proposal does not necessarily have a match, since the corresponding part may be occluded or missed in the scene shot.
To solve these two steps simultaneously, we propose a candidacy graph representation and formulate the problem as graph labeling. We define the candidacy graph such that each vertex denotes a candidate matching pair. A similar binary label is employed to indicate whether a matching pair is activated or not. Solving the matching problem is then equivalent to labeling the vertices of the candidacy graph. The label set is thus defined as
Each edge in the candidacy graph denotes the relation between two matching pairs. We incorporate two kinds of relations, i.e., compatible and competitive relations, to model the contextual interactions in scene shots. In the following discussion, we drop the edge index for notational simplicity.
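To make the vertex set concrete, the following Python sketch enumerates candidate matching pairs. It assumes, as a simplification not stated explicitly in the text, that a template proposal is only paired with target proposals of the same part type; the function name `build_vertices` and the dictionary layout are hypothetical.

```python
from itertools import product


def build_vertices(template, targets):
    """Enumerate candidate matching pairs (vertices of the candidacy graph).

    template: dict mapping part type -> list of template proposal ids
    targets:  dict mapping part type -> list of target proposal ids
    Assumption: only proposals of the same part type form a candidate pair.
    """
    vertices = []
    for part, template_props in template.items():
        for u, v in product(template_props, targets.get(part, [])):
            vertices.append((part, u, v))
    return vertices


def initial_labels(vertices):
    """Start from the empty match: every matching pair is inactive (label 0)."""
    return {vertex: 0 for vertex in vertices}
```

A labeling of the graph is then a dictionary from vertices to binary labels, and a matching solution corresponds to the set of vertices labeled 1.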
Compatible relations encourage matching pairs to be activated together during matching. We represent compatible relations by how two target part proposals are coupled together, and mainly explore two cases: (i) kinematics relations, which couple kinematically dependent parts, and (ii) symmetry relations, which couple symmetrical parts. That is,
where and denotes the part type of and , respectively.
(i) Kinematics relations describe the spatial relationship between kinematically dependent parts (navy blue edges in Fig. 2(a)). The spatial relationship between two proposals is modeled as a zero-mean Gaussian distribution in the coordinate system of their connecting joint:
In the experiments, kinematics relations are learnt from reference images with body part annotations.
(ii) Symmetry relations measure the appearance similarity between symmetrical parts (brown edges in Fig. 2(a)). We suppose that symmetrical parts from the same individual tend to share a similar appearance, while those from different individuals do not. Therefore the symmetry relations are represented as
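As an illustration of a zero-mean Gaussian kinematics term, the sketch below scores the displacement between the joint positions predicted by two connected parts. The axis-aligned (diagonal-covariance) form and the names `kinematic_compatibility`, `sigma_x` and `sigma_y` are assumptions made for this example; the covariance actually used in the paper is learnt from annotated reference images and may differ.

```python
import math


def kinematic_compatibility(joint_a, joint_b, sigma_x, sigma_y):
    """Zero-mean Gaussian density of the displacement between two joint estimates.

    joint_a, joint_b: (x, y) positions of the shared joint as predicted by each
    part, expressed in the joint's coordinate system.
    sigma_x, sigma_y: standard deviations (assumed diagonal covariance).
    """
    dx = joint_a[0] - joint_b[0]
    dy = joint_a[1] - joint_b[1]
    norm = 1.0 / (2.0 * math.pi * sigma_x * sigma_y)
    return norm * math.exp(-0.5 * ((dx / sigma_x) ** 2 + (dy / sigma_y) ** 2))
```

The density is largest when both parts agree on the joint position and decays as the two estimates drift apart, which is the behavior the compatible edge is meant to reward.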
where measures the distance between two part proposals and is defined in Equ.(12).
We give an example to illustrate how kinematics relations and symmetry relations work in scene shots, as shown in Fig. 2(b). Note that we omit certain part proposals for clarity.
Competitive relations suppress conflicting matching pairs from being activated at the same time. We also develop two cases for competitive relations: (i) Two target proposals with the same part type cannot be activated simultaneously. (ii) The overlapping region between two target part proposals should be compared only once. That is,
where indicates the overlap intersection-over-union between and , is a scaling constant.
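The two competitive cases can be sketched in Python as follows. The example treats proposals as axis-aligned boxes `(x1, y1, x2, y2)` rather than oriented rectangles, and returns the raw intersection-over-union as the overlap strength; both choices, the omission of the scaling constant, and the function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def competitive_strength(type_a, box_a, type_b, box_b):
    """Strength of the competitive relation between two target proposals.

    Case (i): proposals of the same part type always conflict (strength 1).
    Case (ii): otherwise the conflict grows with their overlap (IoU).
    """
    if type_a == type_b:
        return 1.0
    return iou(box_a, box_b)
```

A competitive edge would be added to the candidacy graph whenever this strength is positive.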
In summary, the problem of matching the template to the target proposal set can be represented as
where denotes the number of unmatched part pairs and the number of scales of the activated proposals, and they can be computed from the labeling set . According to Bayes’ Rule,
can be solved by maximizing a posterior probability:
Likelihood measures the appearance similarity between the template and the matching target. Assuming the appearance similarity of each matching pair is independent, then can be factorized into
where denotes the distance between two proposals.
We adopt a modified HSV color histogram and the MSCR descriptor [11] to describe the visual statistics of each part proposal; both have been widely used in existing human re-identification studies [8, 7]. The distance between two arbitrary proposals is defined as
where denotes the normalized HSV color histogram, the MSCR descriptor, and the Bhattacharyya distance and the distance defined in , respectively.
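For the histogram term, a minimal Python sketch of the Bhattacharyya distance between two color histograms is given below. It uses the common form sqrt(1 - BC), where BC is the Bhattacharyya coefficient; whether the paper uses this form or the logarithmic variant is not recoverable from the text, so this choice is an assumption. The MSCR term and the weighting between the two terms are not shown.

```python
import math


def normalize(hist):
    """Scale a non-negative histogram so that its bins sum to one."""
    total = float(sum(hist))
    if total <= 0:
        raise ValueError("histogram must have positive mass")
    return [h / total for h in hist]


def bhattacharyya_distance(hist_p, hist_q):
    """Bhattacharyya distance sqrt(1 - BC) between two histograms.

    BC = sum_i sqrt(p_i * q_i) is the Bhattacharyya coefficient of the
    normalized histograms; the result lies in [0, 1].
    """
    p, q = normalize(hist_p), normalize(hist_q)
    coeff = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - coeff))
```

Identical histograms give a distance of zero, and histograms with no bin in common give a distance of one.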
Prior penalizes the undesired activation of matching pairs (e.g. missing parts) and matching inconsistency among the activated matching pairs. We define as
where and are corresponding parameters for and , respectively.
imposes constraints on the edge links among activated vertices, that is
where and indicate the compatible edges and competitive edges in the candidacy graph , respectively.
4 Inference Algorithm
In a scene shot containing multiple individuals, matching the template to the target becomes an extremely complicated problem. For example, in Fig. 3, the four individuals in the shot all share a similar appearance with the template. As a result, solving Equ.(10) is likely to yield only a locally optimal solution. In this case, popular inference algorithms, such as EM, Belief Propagation and Dynamic Programming, easily get stuck and thus fail to re-identify the correct target (i.e., to find the globally optimal solution), while Composite Cluster Sampling, as introduced in [17, 22], overcomes this problem by jumping between partial coupling matches in each MCMC step. Therefore, we employ Composite Cluster Sampling to search for the optimal match between the template and the correct target.
Composite Cluster Sampling algorithm consists of the following two steps:
(I) Generating a composite cluster. Given a candidacy graph and the current matching state, we first separate the graph edges into two sets: a set of inconsistent edges (i.e., edges violating the current state) and a set of consistent edges (the remaining cases). Next, we introduce a boolean variable to indicate whether an edge is turned on or turned off. We turn off inconsistent edges deterministically and turn on every consistent edge with its edge probability. Afterwards, we regard candidates connected by "on" positive edges as a cluster and collect clusters connected by "on" negative edges to generate a composite cluster.
(II) Relabeling the composite cluster. In this step, we randomly choose a cluster from the obtained composite cluster and flip the labels of the selected cluster and its conflicting clusters (i.e., the clusters connected with the selected cluster), which generates a new state. To find a better state and achieve a reversible transition between the two states, the acceptance rate of the transition from the current state to the new state is defined by a Metropolis-Hastings method [20]:
where and denote the state transition probability, and the posterior defined in Equ.(10).
Following instructions in , the state transition probability ratio is computed by
where and denote the sets of positive and negative edges being turned off around , respectively, that is,
Note that the subscript of , in Equ.(16) indicates the current state and is omitted for simplicity in the above definition.
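Step (I) can be sketched in Python as follows. The sketch adopts the consistency convention of cluster sampling in [22] as an assumption: a positive (compatible) edge is consistent when its endpoints share a label, and a negative (competitive) edge is consistent when its endpoints carry different labels. The function names and the data layout are hypothetical, and the edge probabilities are supplied by the caller.

```python
import random


def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def sample_composite_clusters(labels, pos_edges, neg_edges, edge_prob, rng):
    """Turn edges on/off and return the resulting composite clusters.

    labels: dict vertex -> 0/1 (current matching state)
    pos_edges, neg_edges: lists of (vertex, vertex) tuples
    edge_prob: dict edge -> probability of turning a consistent edge on
    rng: random.Random instance
    """
    # Inconsistent edges stay off; consistent edges are turned on at random.
    on_pos = [e for e in pos_edges
              if labels[e[0]] == labels[e[1]] and rng.random() < edge_prob[e]]
    on_neg = [e for e in neg_edges
              if labels[e[0]] != labels[e[1]] and rng.random() < edge_prob[e]]
    # Clusters: vertices joined by "on" positive edges.
    clusters = connected_components(list(labels), on_pos)
    cluster_of = {v: i for i, c in enumerate(clusters) for v in c}
    # Composite clusters: clusters joined by "on" negative edges.
    links = [(cluster_of[a], cluster_of[b]) for a, b in on_neg]
    groups = connected_components(range(len(clusters)), links)
    return [[clusters[i] for i in sorted(g)] for g in groups]
```

Step (II) would then pick one cluster inside a composite cluster, flip its labels together with those of the clusters linked to it, and submit the resulting state to the acceptance test.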
We show an example of one transition in composite cluster sampling in Fig. 5. In this figure, the composite cluster contains two clusters. In state A, one cluster is activated and its conflicting cluster is deactivated, while in state B the labels of the two clusters are flipped. The transition from state A to state B achieves a fast jump between two kinds of partial coupling matches and coincides with an individual-to-individual comparison in re-identification.
Applying the above mechanism, we summarize the inference algorithm in Algorithm 1.
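A generic Metropolis-Hastings loop of the kind summarized above can be sketched in Python as follows. This is a textbook sampler written under stated assumptions, not a transcription of Algorithm 1: the callables `posterior` and `propose` are hypothetical placeholders, where `propose` returns a candidate state together with its forward and backward transition probabilities.

```python
import random


def acceptance_rate(post_new, post_old, q_backward, q_forward):
    """Metropolis-Hastings acceptance rate for a move from old to new state."""
    return min(1.0, (q_backward * post_new) / (q_forward * post_old))


def run_sampler(state, posterior, propose, n_iters, rng):
    """Run the chain for n_iters steps and return the best state visited.

    posterior(state) -> unnormalized posterior value (must be positive)
    propose(state, rng) -> (candidate, q_forward, q_backward)
    """
    current_post = posterior(state)
    best, best_post = state, current_post
    for _ in range(n_iters):
        candidate, q_forward, q_backward = propose(state, rng)
        candidate_post = posterior(candidate)
        rate = acceptance_rate(candidate_post, current_post, q_backward, q_forward)
        if rng.random() < rate:
            state, current_post = candidate, candidate_post
            if current_post > best_post:
                best, best_post = state, current_post
    return best, best_post
```

In the setting of this paper, `propose` would wrap the two steps of composite cluster sampling, and the returned state would be the labeling of the candidacy graph with the highest posterior encountered.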
5 Experiments
In this section, we first introduce the datasets and the parameter settings, and then present our experimental results together with a component analysis of the proposed approach.
5.1 Datasets and Settings
We validate our method on three public databases as follows.
(i) VIPeR dataset (available at www.umiacs.umd.edu/~schwartz/datasets.html). It is commonly used for human re-identification, containing people in outdoor scenes, and there are images for each individual.
(ii) EPFL dataset (available at cvlab.epfl.ch/data/pom/). This database is very challenging and was originally proposed for multi-view tracking [10]. It consists of different scenarios that are filmed by three or four cameras from different angles. For evaluating our method, we extract individuals from the original videos and annotate each of them with an ID and a location (bounding box). In total, there are reference images for different individuals, (normalized to pixels in height), and shots in , which contain targets to be re-identified.
(iii) CAMPUS-Human dataset (available at http://vision.sysu.edu.cn/projects/human-reid/). We construct this database to include general and realistic challenges for people re-identification in surveillance. There are reference images normalized to pixels in height, for individuals, with IDs and locations provided. We present shots containing targets for evaluating methods, and the targets often appear with diverse poses/views, conjunctions and occlusions; see Fig. 7 (bottom row). Note that all images in both the EPFL dataset and the CAMPUS-Human dataset are captured from the original videos with large time gaps to guarantee appearance variety (unlike the ETHZ dataset).
Experiment settings. For the VIPeR dataset, we adopt the common setting of running the algorithm on random partitions containing pairs. For the EPFL and CAMPUS-Human datasets, we randomly select reference images for each individual, and all target images are tested for matching. The results on all three datasets are averaged over ten runs. Our approach is evaluated with both a single reference image (single-shot, SvsS) and multiple reference images (multi-shot, MvsS, ).
All the parameters are fixed in the experiments, including for scaling the overlap , and for penalizing the activation of vertices. We construct the MICT for each individual with their selected reference images. In the re-identification, a number of body part proposals are generated. In practice, we set approximately times the number of individuals in the shot.
We implement our approach in C++ and run the program on a PC with an Intel i5 2.8 GHz CPU and 4 GB of memory. On average, the inference algorithm converges after around samplings, which costs . The time cost depends on the complexity of the candidacy graph.
5.2 Experimental Results
We compare our approach with the following state-of-the-art methods: Pictorial Structures (PS), View-based Pictorial Structures (VPS), Custom Pictorial Structures (CPS) [7], Symmetry-driven Accumulation of Local Features (SDALF) [8] and Ensemble of Localized Features (ELF) [14]. We adopt the provided code of PS and implement VPS and CPS according to their descriptions. For a fair comparison, the same likelihood is employed for PS, VPS and CPS as for the proposed method. The results are evaluated in two ways: (i) re-identifying individuals in segmented images, i.e., with targets already localized, and (ii) re-identifying individuals from scene shots without provided segmentations.
For the first evaluation, we adopt the cumulative match characteristic (CMC) curve for quantitative analysis, as in previous works [13, 24]. The curve reflects the overall ranked matching rates; precisely, a rank-r matching rate indicates the percentage of correct matches found within the top r ranks. As Fig. 6 shows, our approach outperforms the competing approaches in both the single-shot and the multi-shot case, and it yields the best rank matching rate on the EPFL and CAMPUS-Human datasets. We observe that re-identification performance can be improved significantly by fully exploiting reconfigurable compositions and contextual interactions in inference. Our performance improves only slightly on the VIPeR dataset, as most erroneous matches are due to severe illumination changes, an effect that has also been observed in prior work.
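The CMC curve described above can be computed with a few lines of Python. The sketch assumes that, for every query, the 1-based rank of the correct match in the sorted candidate list is already known; the function name `cmc_curve` is hypothetical.

```python
def cmc_curve(correct_ranks, max_rank):
    """Cumulative match characteristic curve.

    correct_ranks: for each query, the 1-based rank at which the correct
    match appears in the sorted list of candidates.
    Returns a list whose entry k-1 is the fraction of queries whose correct
    match is found within the top k ranks.
    """
    n_queries = len(correct_ranks)
    if n_queries == 0:
        raise ValueError("at least one query is required")
    return [
        sum(1 for r in correct_ranks if r <= k) / n_queries
        for k in range(1, max_rank + 1)
    ]
```

For example, `cmc_curve([1, 3, 2, 1], 3)` returns `[0.5, 0.75, 1.0]`: half of the queries are matched correctly at rank 1, and all of them within the top three ranks.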
The second test is stricter, since the algorithms must also localize the target during re-identification. We adopt the PASCAL Challenge criterion to evaluate the localization results: a match is counted as correct only if its intersection-over-union ratio with the ground-truth bounding box exceeds a fixed threshold. We compare our method with PS and VPS, which can localize the body while localizing the parts. The quantitative results are reported in Table 1. A number of representative results generated by our method are shown in Fig. 7. The results show that existing methods perform poorly when individuals are not well segmented and scaled to a uniform size. In contrast, our method can re-identify challenging target individuals by searching for and matching their salient parts, and thus achieves better performance. Note that the performance of our approach also drops significantly in the presence of inaccurate part localizations and interference from other individuals.
Component Analysis. We further analyze the benefit of each component of our approach on the CAMPUS-Human dataset under the multi-shot setting. Regarding feature effectiveness, we separately evaluate different image features, as shown in Fig. 8 (left). It is apparent that the combined feature improves the result. We also demonstrate the effectiveness of the constraints employed: Fig. 8 (right) confirms that both the kinematics and the symmetry constraints help construct a better matching solution.
6 Conclusion
This paper studies a novel compositional template for human re-identification, in the form of an expressive multiple-instance-based compositional representation of the query individual. By exploiting reconfigurable compositions and contextual interactions during inference, our method handles the challenges of human re-identification well. In future work, we will explore more robust and flexible part representations and better inter-part relations.
References
[1] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In Proc. CVPR, 2009.
[2] M. Andriluka, S. Roth, and B. Schiele. Monocular 3D pose estimation and tracking by detection. In Proc. CVPR, 2010.
[3] T. Avraham, I. Gurvich, M. Lindenbaum, and S. Markovitch. Learning implicit transfer for person re-identification. ECCV Workshops, 2012.
[4] S. Bak, E. Corvee, F. Bremond, and M. Thonnat. Person re-identification using Haar-based and DCD-based signature. In Proc. AVSS, 2010.
[5] S. Bak, E. Corvee, F. Bremond, and M. Thonnat. Person re-identification using spatial covariance regions of human body parts. In Proc. AVSS, 2010.
[6] A. Barbu and S. Zhu. Generalizing Swendsen-Wang to sampling arbitrary posterior probabilities. IEEE Trans. PAMI, 27(8):1239–1253, 2005.
[7] D. Cheng, M. Cristani, M. Stoppa, L. Bazzani, and V. Murino. Custom pictorial structures for re-identification. In Proc. BMVC, 2011.
[8] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In Proc. CVPR, 2010.
[9] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55–79, 2005.
[10] F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE Trans. PAMI, 33(9):1806–1819, 2011.
[11] P.-E. Forssén. Maximally stable colour regions for recognition and matching. In Proc. CVPR, 2007.
[12] N. Gheissari, T. Sebastian, P. Tu, and J. Rittscher. Person reidentification using spatiotemporal appearance. In Proc. CVPR, 2006.
[13] D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition and tracking. PETS, 2007.
[14] D. Gray and H. Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In Proc. ECCV, 2008.
[15] W. Hu, M. Hu, X. Zhou, T. Tan, J. Lou, and S. Maybank. Principal axis-based correspondence between multiple cameras for people tracking. IEEE Trans. PAMI, 28(4):663–671, 2006.
[16] L. Lin, X. Liu, S. Peng, H. Chao, Y. Wang, and B. Jiang. Object categorization with sketch representation and generalized samples. Pattern Recognition, 45:3648–3660, 2012.
[17] L. Lin, X. Liu, and S. Zhu. Layered graph matching with composite cluster sampling. IEEE Trans. PAMI, 32(8):1426–1442, 2010.
[18] L. Lin, T. Wu, J. Porway, and Z. Xu. A stochastic graph grammar for compositional object representation and recognition. Pattern Recognition, 42:1297–1307, 2009.
[19] C. Liu, S. Gong, C. Loy, and X. Lin. Person re-identification: What features are important? ECCV Workshops, 2012.
[20] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. J. Chemical Physics, 21(6):1087–1092, 1953.
[21] U. H. Office. i-LIDS multiple camera tracking scenario definition. 2008.
[22] J. Porway and S. Zhu. C4: Exploring multiple solutions in graphical models by cluster sampling. IEEE Trans. PAMI, 33(9):1713–1727, 2011.
[23] B. Rothrock and S. Zhu. Human parsing using stochastic and-or grammars and rich appearances. ICCV Workshops, 2011.
[24] W. Schwartz and L. Davis. Learning discriminative appearance-based models using partial least squares. In XXII SIBGRAPI, 2009.
[25] X. Wang, G. Doretto, T. Sebastian, J. Rittscher, and P. Tu. Shape and appearance context modeling. In Proc. ICCV, 2007.
[26] W. Zheng, S. Gong, and T. Xiang. Person re-identification by probabilistic relative distance comparison. In Proc. CVPR, 2011.