I Introduction
The task of person re-identification (re-id) is to match people across a distributed multi-camera surveillance system at different times and locations, with wide applications to forensic search, multi-camera tracking and access control. In most short-term applications, low-level features such as color and texture are important appearance cues for matching. Lighting clearly has a significant effect on the performance of these low-level features. In more extreme cases, when the lighting condition changes greatly (e.g., with vs. without lighting), the color information of clothes becomes unavailable. Moreover, when people change clothes, color and texture become unreliable. For example, Figure 1 shows how color histograms change when people change clothes or appear under extreme illumination. In these cases, most existing re-id systems fail, since they are RGB-based.
In comparison to RGB information, depth information remains more invariant under clothing change and extreme illumination. As shown in Figure 1, the shape and skeleton of the body are likely to be more invariant under extreme lighting and clothing change. Nowadays, extracting depth and skeleton information with depth cameras (e.g., Microsoft Kinect) is not difficult in an indoor environment. The Kinect sensor obtains the depth value (distance to the camera) of each pixel by infrared sensing, regardless of object color and illumination in indoor applications. With depth information, a life-size point cloud and the skeleton of a person can be extracted, providing shape and physical information of his/her body. Moreover, with the depth value of each pixel, pedestrians can be segmented from the background more easily, so that background influence can be largely eliminated. Hence, using depth information can overcome some difficulties of RGB appearance-based methods, such as color change, illumination change and background clutter.
Although depth-based re-identification has advantages over RGB appearance-based methods, challenges and limitations also come along with depth information. Firstly, the depth images captured by a depth device change significantly when a person's viewpoint changes. Secondly, sensor noise is present in the captured depth images. These two problems seriously affect the use of depth information for person re-identification. So far, a few methods [1, 2, 3, 4] have been developed to exploit depth information for person re-identification, but the above two problems remain unsolved. The method in [1] uses only skeletons to extract features. In [2, 3], besides using the skeleton to extract physical information, point clouds converted from depth images are also applied for 3D body shape matching, but alignment errors and point cloud noise remain unsolved problems. In [4], a deep model is applied to classify person point cloud sequences, in which feature extraction and classification are jointly modeled and body shape is not explicitly described.
Therefore, body shape description is still an important biometric cue that needs further study for person re-identification. In this work, we aim to design a depth shape descriptor that is locally invariant to rotation and insensitive to noise. (Local rotation invariance means that the feature of a body part is invariant as long as the viewpoint change of the pedestrian does not make that body part invisible due to self-occlusion.) We propose two depth shape descriptors: the depth voxel covariance descriptor and the Eigen-depth feature. The Eigen-depth feature is based on the depth voxel covariance descriptor and is locally rotation invariant. We then combine the depth shape descriptor with a skeleton-based feature to form a complete depth representation of body shape and physical information. The pipeline of constructing a descriptor for our depth-based person re-id method is illustrated in Figure 2. Our method takes the following steps: (1) segmentation and computation of the point cloud and normals of torso and head; (2) extraction of the depth voxel covariance descriptor and the locally rotation invariant Eigen-depth feature; (3) enrichment of the body depth shape descriptor by additionally combining the skeleton-based feature. In the second step, the Eigen-depth feature is more suitable when the viewpoint of a person varies greatly, owing to its stability against local rotation of the body, while the depth voxel covariance descriptor is more effective when the viewpoint change is slight, because of the rich information it contains. In the third step, the skeleton-based feature complements the depth shape descriptor extracted in step (2), so more robust matching can be achieved.
In addition, in real-world applications, most of the cameras deployed in existing surveillance systems cannot capture depth information, so making a depth-based method work in an existing system is also a challenge, while existing depth-based and RGB-D-based methods assume depth information is available. To overcome this limitation, we learn the relation between depth features and RGB-based appearance features by a kernelized implicit feature transfer scheme. For this purpose, an auxiliary RGB-D dataset is employed to learn the nonlinear transformation between RGB-based appearance features and depth features. When a depth device is not available, the depth feature is estimated from the RGB image and used to augment the RGB-based appearance feature. The experimental results show that this further improves the re-identification performance for top-ranked matching.
We tested our methods on three publicly available datasets: PAVIS [1], BIWI RGBD-ID [2] and IAS-Lab RGBD-ID [3]. The results show the effectiveness of our depth-based approach in overcoming clothing change and extreme illumination conditions. When clothes are completely different between gallery and probe, RGB appearance-based methods fail while our depth-based method remains effective. Our approach outperforms other existing depth-based re-identification methods, including skeleton-based methods, PCM, their combination [3], and the recurrent attention model [4]. Compared to other favorable rotation invariant depth shape descriptors, our descriptor also outperforms RIFT2M [5] and Fehr's descriptor [6].

In summary, the contributions of our work are: (1) proposing the depth voxel covariance descriptor and the Eigen-depth feature for depth-based re-identification and proving the local rotation invariance of the Eigen-depth feature in theory; (2) forming a depth-based re-id recognition framework by unifying the depth shape descriptor and the skeleton-based feature for a complete representation; (3) proposing a kernelized implicit feature transfer scheme to estimate the Eigen-depth feature implicitly from RGB images when a depth device is not available.
II Related Work
In this section, we present an overview of related image-based re-id technologies in three aspects: (1) RGB appearance-based re-id, (2) depth-based re-id, and (3) RGB-D re-id. Currently, most person re-identification approaches are based on 2D RGB appearance features.
II-A RGB Appearance-based Person Re-identification
Most existing works rely on RGB-based appearance features. Among them, color is the most frequently used cue and is often encoded into histograms [7, 8, 9, 10, 11, 12]. Besides, texture-based features are also employed, including HOG-like signatures [13], Gabor features [11, 14], graph models [15], differential filters [11, 14] and Haar-like representations [16]. Many other hand-crafted features, such as the covariance descriptor [17], Fisher vectors [18], spatial co-occurrence representations [19], custom pictorial structures [20] and SARC3D [21], were also developed for achieving more reliable representations. Recently, feature learning methods have received more attention, such as salience learning [22], mirror representation [23], salient color names [24], reference descriptors [25], context-based features [26, 27, 28, 29, 30, 31, 32], dictionary learning [33, 34, 35] and attribute learning [36, 37, 38]. However, under clothing change or extreme illumination, these RGB-based appearance features tend to fail.

Besides feature representation, a large number of metric/subspace learning models [39, 11, 40, 14, 41, 42, 43, 44, 45, 46, 47, 48, 12, 49, 50, 51, 52, 53, 54, 55, 56] have been developed to achieve more reliable matching, such as LMNN [39], RankSVM [14], RDC [41], PCCA [42], KISSME [40], LFDA [43], CVDCA [50], CRAFT [51], MLAPG [52], TDL [55] and DNS [56]. Some other methods have also been proposed for this purpose, e.g., re-ranking [57, 58] and correspondence structure [59]. Unsupervised learning models [60, 61] have also been developed for person re-identification. However, these methods cannot solve the illumination and clothing change problems. Compared to RGB-based appearance features, depth information is a solution to these problems, because it is independent of color and remains invariant over a longer period of time.

II-B Depth-based Person Re-identification
So far, only a few depth-based re-identification methods based on depth images, point clouds and anthropometric measurements [1, 3, 2, 62, 63, 64, 4] have been developed. To some extent, depth-based methods can solve the problems of clothing change and extreme illumination. Barbosa et al. exploited a skeleton-based feature [1] based on anthropometric measurements of distances between joints and geodesic distances on the body surface. Munaro et al. built a point cloud model for each person as gallery by fusing a set of point clouds from different views and then applied Point Cloud Matching (PCM) to compute the distance between samples [2]. In [3, 62], Munaro et al. combined PCM and a skeleton-based feature modified from Barbosa et al.'s work [1]. These methods need to align the point clouds, and no depth shape descriptor is applied to describe body shape. Haque et al. proposed a recurrent attention model [4] for depth-video-based person identification, in which a 3D RAM model handles still 3D point clouds and a 4D RAM model handles 3D point cloud sequences. However, among the above depth-based frameworks, PCM and Haque's method are not suitable for person re-identification under the setting where the people in training and testing do not overlap.
Compared to existing depth-based re-identification frameworks, the main difference of our work is that we propose the depth voxel covariance descriptor and the Eigen-depth feature to describe body shape. The Eigen-depth feature is a covariance-based feature; it is locally rotation invariant and does not require alignment of point clouds. The Eigen-depth feature can be viewed as a depth shape descriptor and thus removes the ambiguity of using only anthropometric measurements of skeletons, as in previous depth modeling for re-identification. Compared to the direct utilization of point clouds in PCM [2], it better handles the noise of the non-rigid human body.
We also discuss some related depth shape descriptors, including the covariance descriptor in [65], RIFT2M [5] and Fehr's covariance descriptor [6], none of which was applied to person re-identification. Compared to the covariance descriptor in [65], the Eigen-depth feature is locally rotation invariant. Compared to the rotation invariant descriptors RIFT2M [5] and Fehr's covariance descriptor [6], the Eigen-depth feature is densely extracted rather than relying on interest points, so it contains richer information about body shape. Moreover, its rotation invariance is achieved at the eigen-analysis level, so alignment of point clouds is not needed and a more compact representation can be obtained by eigen-analysis.
II-C RGB-D Person Re-identification
Since RGB and depth information can be obtained simultaneously when using Kinect, some re-identification methods combine depth information and RGB appearance cues to extract more discriminative feature representations. Pala et al. [66] improved the accuracy of clothing appearance descriptors by fusing them with anthropometric measures extracted from depth data. Mogelmose et al. [67] presented a tri-modal method combining RGB, depth and thermal features. Mogelmose et al. [68] combined a color histogram with a height feature extracted from depth information. John et al. [69] combined an RGB-Height histogram with a gait feature from depth information. Satta et al. [70] exploited the skeleton to segment the human body and extracted color features. In [71], each color pixel was assigned to the nearest bone in the skeleton, and color histograms were computed for each region. In [72], the proposed bodyprint feature exploited the mean RGB values of regions at different heights. In [73], the descriptor was based on a 3D cylindrical grid that unified color variations together with angle and height. Takac et al. [74] exploited color histograms of the upper body and lower body separately. Xu et al. [75] proposed a distance metric using RGB-D data to improve RGB-based person re-identification.
As reported in these works, the combination of RGB and depth is effective, but they all assume that depth information is available along with RGB images. In our work, we propose to learn the relation between RGB and depth by a kernelized implicit feature transfer scheme, which enables the estimation of depth features from RGB features so as to improve re-identification performance even when the deployed cameras cannot capture depth information.
A preliminary version of this work appeared in [76]. In this paper, apart from providing a more in-depth discussion of the proposed Eigen-depth feature and the depth-based person re-identification framework, we propose a kernelized implicit feature transfer scheme to learn the relation between depth features and RGB features, so as to estimate depth features from RGB images when a depth sensor is not available. In addition, more extensive experiments have been conducted.
III Depth Voxel Covariance and Eigen-depth Feature
This section presents the extraction of the depth voxel covariance descriptor and the locally rotation invariant Eigen-depth feature. Our descriptors are extracted from the point cloud, a set of points on the object surface expressed in real-world 3D coordinates converted from the depth image. We first tabulate the notations defined in this section in Table I.
III-A Basic Feature Extraction
We first extract basic features of the point cloud. We assume another kind of biometric cue, the skeleton joints of the pedestrian's body, is also available along with depth images (e.g., when using Kinect). We intend to extract features on the body parts whose surfaces are more invariant and reliable. As shown in Figure 3 (a) and (b), due to pose differences, part of a limb's surface is sometimes unobserved because of self-occlusion, so the surface shapes of arms and legs contain more noise than valuable information. Therefore, we divide each point cloud of the whole body at the two shoulder joints and the two hip joints, and only the points of head and torso are used for feature extraction, while the four limbs are not.
For each point in the point cloud, a normal vector [77] is computed as a basic feature. The direction of the normal vector describes the shape of a small neighbouring region of that point. For a point $\mathbf{p}$, its nearest neighbours are found, and the direction along which the data is least scattered is computed by PCA [78] as the unit normal direction $(n_x, n_y, n_z)$. For each point (unit: mm), a feature vector is composed of the coordinate $(x, y, z)$ and the unit normal vector:

$\mathbf{f} = [x, y, z, n_x, n_y, n_z]^\top. \quad (1)$
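As a hedged illustration, the PCA-based normal estimation above can be sketched in a few lines of NumPy. The brute-force neighbour search and the value of `k` here are illustrative choices, not the paper's implementation:

```python
import numpy as np

def estimate_unit_normal(points, idx, k=10):
    """Estimate the unit normal at points[idx] from its k nearest neighbours.

    The normal is taken as the direction of least scatter, i.e. the
    eigenvector of the neighbourhood covariance with the smallest eigenvalue.
    """
    d = np.linalg.norm(points - points[idx], axis=1)
    nbrs = points[np.argsort(d)[:k]]   # brute-force kNN (illustrative only)
    cov = np.cov(nbrs.T)               # 3x3 scatter of the neighbourhood
    _, vecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    n = vecs[:, 0]                     # least-scattered direction
    return n / np.linalg.norm(n)

def point_feature(points, idx, k=10):
    """6-D feature f = [x, y, z, nx, ny, nz] of Equation (1)."""
    return np.concatenate([points[idx], estimate_unit_normal(points, idx, k)])
```

For points sampled from a plane, the estimated normal is (up to sign) the plane normal, as expected.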
III-B Depth Voxel Covariance Descriptor
To depict the variation of local feature vectors and alleviate noise, we exploit two types of covariance matrices, namely within-voxel covariance and between-voxel covariance.
III-B1 Within-voxel Covariance
We divide a point cloud into rectangular voxels with 50% overlap between adjacent voxels; an example is shown in Figure 3 (c). In each voxel, a within-voxel covariance matrix is computed to describe the shape. For a voxel $P$, let $\{\mathbf{f}_1, \ldots, \mathbf{f}_n\}$ be the 6-dimensional feature vectors inside $P$. The within-voxel covariance matrix is then defined as follows:

$\mathbf{C}_P = \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{f}_i - \boldsymbol{\mu})(\mathbf{f}_i - \boldsymbol{\mu})^\top, \quad (2)$

where $\boldsymbol{\mu}$ is the mean of the feature vectors of $P$.
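A minimal NumPy sketch of the within-voxel covariance in Equation (2); the voxel partitioning is omitted, and `F` is assumed to already hold the feature vectors of one voxel:

```python
import numpy as np

def within_voxel_cov(F):
    """Within-voxel covariance of an (n, 6) stack of feature vectors."""
    mu = F.mean(axis=0)                # mean feature vector of the voxel
    D = F - mu
    return D.T @ D / (F.shape[0] - 1)  # unbiased 6x6 sample covariance
```

This is just the standard unbiased sample covariance, so it agrees with `np.cov(F.T)`.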
III-B2 Between-voxel Covariance
While the within-voxel covariance describes the shape within a voxel, the differences of shapes between voxels also contain discriminative information. Similar to the standard covariance, we define a novel between-voxel covariance to represent the relation between different voxels.
As shown in Figure 3 (d), the point cloud is divided into voxels without overlap. Between-voxel covariance matrices are computed for each pair of 8-adjacent voxels. For two adjacent voxels $P$ and $Q$, let $\{\mathbf{f}^P_i\}_{i=1}^{n_P}$ and $\{\mathbf{f}^Q_j\}_{j=1}^{n_Q}$ be the 6-dimensional feature vectors inside $P$ and $Q$, respectively. We define the between-voxel covariance matrix as follows:
(3) 
For a depth image, the collection of the within-voxel and between-voxel covariance matrices of all voxels is called the depth voxel covariance descriptor (DVCov).
III-C Local Rotation Invariance of Eigenvalues
Assume $P$ and $Q$ are voxels of a person captured by depth camera $a$. Let $\mathbf{f}^a_i$ denote the feature vectors, $\mathbf{C}^a_P$ the within-voxel covariance matrix of voxel $P$, and $\mathbf{C}^a_{P,Q}$ the between-voxel covariance matrix between voxels $P$ and $Q$. We assume that only viewpoint rotation and location change take place between two camera views $a$ and $b$, and that the rotation is local so that the body part within voxels $P$ and $Q$ does not become invisible due to self-occlusion. To express the transformation from camera view $a$ to camera view $b$, let $\mathbf{R}_p$ denote the rotation matrix of the point coordinates $(x, y, z)^\top$, $\mathbf{R}_n$ the rotation matrix of the unit normal vector $(n_x, n_y, n_z)^\top$, and $\mathbf{t}$ the shift of the pedestrian's location. Then the transformation of feature vectors from camera view $a$ to camera view $b$ is

$\mathbf{f}^b_i = \mathbf{R}\mathbf{f}^a_i + \tilde{\mathbf{t}}, \quad (4)$

$\mathbf{R} = \begin{bmatrix} \mathbf{R}_p & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_n \end{bmatrix}, \quad (5)$

where $\tilde{\mathbf{t}} = [\mathbf{t}^\top, \mathbf{0}^\top]^\top$.
Since $\mathbf{R}_p$ and $\mathbf{R}_n$ satisfy $\mathbf{R}_p^\top\mathbf{R}_p = \mathbf{I}$ and $\mathbf{R}_n^\top\mathbf{R}_n = \mathbf{I}$, we have $\mathbf{R}^\top\mathbf{R} = \mathbf{I}$, so $\mathbf{R}$ is an orthogonal transformation. Hence, the eigenvalues of the within-voxel covariance matrices $\mathbf{C}^a_P$ and $\mathbf{C}^b_P$ are the same, and the eigenvalues of the between-voxel covariance matrices $\mathbf{C}^a_{P,Q}$ and $\mathbf{C}^b_{P,Q}$ are the same as well. That means the eigenvalues of the within-voxel and between-voxel covariance matrices are invariant to rotation and location change.
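The invariance argument above is easy to verify numerically: conjugating a covariance matrix by a block-diagonal orthogonal matrix leaves its eigenvalues unchanged (a sketch, with random orthogonal matrices standing in for actual viewpoint changes; the location shift cancels out in the covariance):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(k):
    """A random k x k orthogonal matrix via QR decomposition."""
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return Q

A = rng.standard_normal((6, 6))
C = A @ A.T                                   # a within-voxel covariance matrix
R = np.block([[random_rotation(3), np.zeros((3, 3))],
              [np.zeros((3, 3)), random_rotation(3)]])
C_rot = R @ C @ R.T                           # covariance after a view rotation
same = np.allclose(np.linalg.eigvalsh(C), np.linalg.eigvalsh(C_rot))
```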
III-D Eigen-depth Feature and Analysis
In this section, we provide a more in-depth analysis of the role of those eigenvalues. Let $\mathbf{C}_1, \mathbf{C}_2$ denote two covariance matrices, with eigen-decompositions $\mathbf{C}_1 = \mathbf{V}_1\boldsymbol{\Lambda}_1\mathbf{V}_1^\top$ and $\mathbf{C}_2 = \mathbf{V}_2\boldsymbol{\Lambda}_2\mathbf{V}_2^\top$, respectively. Here $\lambda_1 \geq \cdots \geq \lambda_6$ are the eigenvalues of $\mathbf{C}_1$, $\sigma_1 \geq \cdots \geq \sigma_6$ are the eigenvalues of $\mathbf{C}_2$, and $\mathbf{V}_1$ and $\mathbf{V}_2$ are orthogonal matrices whose columns are the corresponding eigenvectors.
We note that the rotation between point clouds and normal vectors can be normalized by matching the principal axes of $\mathbf{C}_1$ and $\mathbf{C}_2$ according to the descending order of eigenvalues. That is, one can find an orthogonal transformation matrix that maps the eigenvectors $\mathbf{V}_2$ to $\mathbf{V}_1$, which estimates the rotation transformation. Hence, we construct a normalized covariance matrix $\mathbf{C}_2' = \mathbf{V}_1\boldsymbol{\Lambda}_2\mathbf{V}_1^\top$, where $\boldsymbol{\Lambda}_2 = \mathrm{diag}(\sigma_1, \ldots, \sigma_6)$ contains the eigenvalues of $\mathbf{C}_2$ and $\mathbf{V}_1$ contains the eigenvectors of $\mathbf{C}_1$. We call $\mathbf{C}_2'$ the rotation normalized covariance matrix from $\mathbf{C}_2$ to $\mathbf{C}_1$.
Now we present how to use the above eigenvalues to construct feature vectors. Let $\mathbf{u} = [\ln\lambda_1, \ldots, \ln\lambda_6]^\top$ and $\mathbf{v} = [\ln\sigma_1, \ldots, \ln\sigma_6]^\top$. Interestingly, we have the following theorem.
Theorem 1
Computing the Euclidean distance between $\mathbf{u}$ and $\mathbf{v}$ is equivalent to computing the geodesic distance between the covariance matrix $\mathbf{C}_1$ and the rotation normalized covariance matrix $\mathbf{C}_2'$ on the Riemannian manifold.
Proof. The Euclidean distance between $\mathbf{u}$ and $\mathbf{v}$ is

$d_E(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{6}(\ln\lambda_i - \ln\sigma_i)^2}. \quad (7)$

The geodesic distance between $\mathbf{C}_1$ and $\mathbf{C}_2'$ on the Riemannian manifold [79] can be calculated as follows:

$\rho(\mathbf{C}_1, \mathbf{C}_2') = \sqrt{\sum_{i=1}^{6}\ln^2\gamma_i(\mathbf{C}_1, \mathbf{C}_2')}, \quad (8)$

where $\gamma_i(\mathbf{C}_1, \mathbf{C}_2')$ are the generalized eigenvalues of $\mathbf{C}_1$ and $\mathbf{C}_2'$, computed from $\mathbf{C}_1\mathbf{x}_i = \gamma_i\mathbf{C}_2'\mathbf{x}_i$, i.e., the eigenvalues of $\mathbf{C}_2'^{-1}\mathbf{C}_1$. Since

$\mathbf{C}_2'^{-1}\mathbf{C}_1 = \mathbf{V}_1\boldsymbol{\Lambda}_2^{-1}\mathbf{V}_1^\top\mathbf{V}_1\boldsymbol{\Lambda}_1\mathbf{V}_1^\top = \mathbf{V}_1(\boldsymbol{\Lambda}_2^{-1}\boldsymbol{\Lambda}_1)\mathbf{V}_1^\top, \quad (9)$

the generalized eigenvalues of $\mathbf{C}_1$ and $\mathbf{C}_2'$ are

$\gamma_i(\mathbf{C}_1, \mathbf{C}_2') = \lambda_i/\sigma_i. \quad (10)$

By substituting Equation (10) into (7) and (8), we have

$\rho(\mathbf{C}_1, \mathbf{C}_2') = \sqrt{\sum_{i=1}^{6}(\ln\lambda_i - \ln\sigma_i)^2} = d_E(\mathbf{u}, \mathbf{v}). \quad (11)$

It can be seen that the geodesic distance on the Riemannian manifold is equivalent to the Euclidean distance between the feature vectors $\mathbf{u}$ and $\mathbf{v}$.
Eigen-depth Feature. The above theorem shows that, when there is only local rotation variation, the logarithm-eigenvalue vector converts the distance between covariance matrices on the Riemannian manifold into the Euclidean distance between two feature vectors. In our work, we define the Eigen-depth feature (ED) of a covariance matrix $\mathbf{C}$ as

$\mathrm{ED}(\mathbf{C}) = [\ln\lambda_1(\mathbf{C}), \ldots, \ln\lambda_6(\mathbf{C})]^\top, \quad (12)$

where $\mathbf{C}$ is either a within-voxel or a between-voxel covariance matrix. Using eigenvalues makes the feature more compact than the depth voxel covariance descriptor.
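Theorem 1 can likewise be checked numerically: the Euclidean distance between two Eigen-depth vectors matches the Riemannian geodesic distance to the rotation normalized covariance. A sketch using generic SPD matrices in place of real voxel covariances:

```python
import numpy as np

def eigendepth(C):
    """Eigen-depth feature of Equation (12): log-eigenvalues of C."""
    return np.log(np.linalg.eigvalsh(C))  # ascending order, used consistently

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6)); C1 = A @ A.T + 0.1 * np.eye(6)
B = rng.standard_normal((6, 6)); C2 = B @ B.T + 0.1 * np.eye(6)

euclid = np.linalg.norm(eigendepth(C1) - eigendepth(C2))

# rotation normalized covariance: eigenvectors of C1, eigenvalues of C2
_, V1 = np.linalg.eigh(C1)
C2n = V1 @ np.diag(np.linalg.eigvalsh(C2)) @ V1.T
# geodesic distance via the generalized eigenvalues of (C1, C2n)
gamma = np.real(np.linalg.eigvals(np.linalg.solve(C2n, C1)))
geodesic = np.sqrt(np.sum(np.log(gamma) ** 2))
```

Both distances agree because the generalized eigenvalues reduce to the ratios of eigenvalues, exactly as in the proof.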
To give an intuition about Eigen-depth features, i.e., the logarithms of eigenvalues, we show some sample images, their Eigen-depth features and the distances between positive and negative pairs in Figure 4. For demonstration, we selected one sample from the BIWI RGBD-ID dataset as the probe image and 18 samples of the same person captured from different views as gallery images. For the fixed voxel indicated by the red bounding box, we extracted its within-voxel Eigen-depth feature and obtained a 6-dimensional feature vector consisting of the logarithms of eigenvalues for each sample. The logarithms of eigenvalues are shown as histograms in the second row of Figure 4. We find that the histograms of eigenvalues look very similar across the rotation change. Since in practice there are extra variations beyond local rotation, we further compare the distance of a positive pair (i.e., samples from the same class) with the distance of a negative pair (i.e., samples from different classes) in the third row of Figure 4. We first computed the Euclidean distance between the probe image and each gallery image as the distance of the positive pair (plotted as the blue curve), and computed the average distance between each gallery image and all samples from other classes in the dataset as the distance of the negative pair (plotted as the red dashed curve). We observe that the distance between samples of the same class across multiple view angles is not sensitive to rotation in practice. Moreover, the distance of the positive pair is smaller than the average distance of the negative pair. Hence the Eigen-depth feature is a useful shape descriptor for recognition.
Remark. In the existing literature on covariance descriptors, such as [17], the geodesic distance on the Riemannian manifold is used to measure the similarity between covariance matrices. However, directly using the geodesic distance is not invariant to rotation: given two different rotation transformation matrices applied to the two covariance matrices, the resulting generalized eigenvalues are in general different from those of the original pair. Moreover, covariance matrices do not lie in a Euclidean space, so most common machine learning methods cannot be applied to them directly.
In practice, although the Eigen-depth feature is proved to be locally invariant to rotation, some problems come along with this property. As shown in Figure 5 on the left, given depth images (in which grayscale denotes depth) of two different voxels of the body surface, one can be transformed to the other by rotation. Their shapes are clearly different, but the Eigen-depth features of the within-voxel covariance matrices are the same; such a situation can cause confusion in the matching stage, which is also a problem for other rotation invariant depth shape descriptors. This kind of confusion can take place if the voxel size is too small, so that a voxel contains only a small region of the body surface. To alleviate this problem, as illustrated in Figure 5 on the right, we divide the point cloud into larger voxels for feature extraction, so that each voxel contains a large area of the body surface, making confusion after rotation less likely.
IV Depth-based Re-identification Framework
In the previous section, we extracted depth voxel covariance descriptors (DVCov) and constructed the Eigen-depth feature (ED) for describing body shape. Besides body shape, incorporating more physical information brings extra benefit for identifying a person. As indicated in the previous section, the four limbs are not suitable for extracting an invariant shape representation, but the lengths of the limbs contain physical information that is also a biometric cue for distinguishing people. Hence, to obtain a complete feature representation of a pedestrian, we additionally employ the skeleton-based feature (SKL) as complementary physical information, and build a depth-based re-identification framework by combining the proposed depth shape descriptors and the skeleton-based feature.
The whole framework is illustrated in Figure 2. For the feature representation of the skeleton, we apply the skeleton-based feature in [2]. This is a feature vector containing 13 distance values and ratios computed from the positions of skeleton joints provided by a skeleton tracker. The elements of the feature vector include: (a) head height, (b) neck height, (c) neck to left shoulder distance, (d) neck to right shoulder distance, (e) torso to right shoulder distance, (f) right arm length, (g) left arm length, (h) right upper leg length, (i) left upper leg length, (j) torso length, (k) right hip to left hip distance, (l) ratio between torso length and right upper leg length (i.e., j/h) and (m) ratio between torso length and left upper leg length (i.e., j/i); the unit of the distances is cm.
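Assembling this 13-dimensional vector from tracked joints can be sketched as follows. The joint names and the tracker output format here are hypothetical stand-ins, and the arm/leg length computations are simplified to single joint-to-joint distances:

```python
import numpy as np

def skeleton_feature(J):
    """13-D skeleton feature: 11 distances/heights (cm) plus 2 ratios."""
    d = lambda a, b: float(np.linalg.norm(J[a] - J[b]))
    f = [
        float(J['head'][1]),            # (a) head height
        float(J['neck'][1]),            # (b) neck height
        d('neck', 'l_shoulder'),        # (c) neck to left shoulder
        d('neck', 'r_shoulder'),        # (d) neck to right shoulder
        d('torso', 'r_shoulder'),       # (e) torso to right shoulder
        d('r_shoulder', 'r_hand'),      # (f) right arm length (simplified)
        d('l_shoulder', 'l_hand'),      # (g) left arm length (simplified)
        d('r_hip', 'r_knee'),           # (h) right upper leg length
        d('l_hip', 'l_knee'),           # (i) left upper leg length
        d('neck', 'torso'),             # (j) torso length
        d('r_hip', 'l_hip'),            # (k) hip width
    ]
    f += [f[9] / f[7], f[9] / f[8]]     # (l) j/h and (m) j/i ratios
    return np.array(f)
```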
After extracting the skeleton-based feature, in the feature fusion stage we combine our proposed depth shape descriptors and the skeleton-based feature to form a complete representation of the human body. In this work, we offer the two fusion models below.

DVCov+SKL: When the viewpoint variation of a person across camera views is small, as in some special cases (e.g., security check or walking in a narrow passage), the influence of rotation is secondary. In such cases, we select the depth voxel covariance descriptor as the depth shape descriptor, as it contains richer information and is more effective for describing the shape. We measure the similarity of two subjects by a combined distance consisting of two terms: the sum of geodesic distances between the corresponding within-voxel and between-voxel covariance matrices, and the Euclidean distance between the skeleton-based features.

ED+SKL: When the viewpoint variation of a pedestrian across camera views is large, we select the Eigen-depth feature as the depth shape descriptor, since it is locally rotation invariant. Let $\mathbf{x}_W$ and $\mathbf{x}_B$ denote the concatenated Eigen-depth feature vectors of all within-voxel and all between-voxel covariance matrices of a person, and let $\mathbf{x}_S$ denote the skeleton-based feature. We fuse these three features by concatenation into a combined feature $\mathbf{x} = [\mathbf{x}_W^\top, \mathbf{x}_B^\top, \mathbf{x}_S^\top]^\top$. Then we apply LDA [78] to $\mathbf{x}$ for feature selection: we first reduce the feature dimension to 100 by principal component analysis (PCA) and then extract $c - 1$ discriminant vectors by LDA, where $c$ is the number of classes. After dimension reduction, the projected features are matched using the Euclidean distance.
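The PCA-then-LDA step can be sketched in plain NumPy as follows (a simplified illustration with a small ridge regularizer on the within-class scatter; the actual experiments may use a different solver):

```python
import numpy as np

def pca_lda(X, y, n_pca=100):
    """PCA to n_pca dims, then LDA to (#classes - 1) dims (a plain sketch)."""
    n_pca = min(n_pca, X.shape[1], X.shape[0] - 1)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_pca].T                              # (d, n_pca) PCA projection
    Z = Xc @ P
    classes = np.unique(y)
    m = Z.mean(axis=0)
    Sw = np.zeros((n_pca, n_pca))
    Sb = np.zeros((n_pca, n_pca))
    for c in classes:
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        Sw += (Zc - mc).T @ (Zc - mc)             # within-class scatter
        diff = (mc - m)[:, None]
        Sb += len(Zc) * (diff @ diff.T)           # between-class scatter
    # discriminant directions: top eigenvectors of Sw^{-1} Sb (regularized)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(n_pca), Sb))
    order = np.argsort(-evals.real)[: len(classes) - 1]
    W = evecs[:, order].real                      # (n_pca, c-1) LDA projection
    return P, W
```

Projected features `(X - X.mean(0)) @ P @ W` are then compared with the Euclidean distance.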
V Depth Feature Transfer
We have proposed a depth-based person re-identification framework in the previous sections. However, in most existing surveillance systems, a large number of cameras do not support capturing depth information, so only RGB images are available. In this section, we exploit a transfer technique to implicitly estimate depth features for RGB person images when no depth device is available. We tabulate the notations defined in this section in Table II.
V-A Kernelized Implicit Feature Transfer Scheme
Depth features describe the body shape of a person, while some visual features (e.g., HOG [13] and LBP [80]) extracted from RGB images can also describe body shape coarsely to some extent. Therefore, we aim to learn the relation between depth features and these kinds of visual features, so as to estimate depth features from RGB images when no depth device is available.
For this purpose, we assume an auxiliary RGB-D dataset is given. This RGB-D dataset is regarded as the source domain, and the RGB images from which we want to estimate depth features are regarded as the target domain. We propose a kernelized implicit feature transfer scheme to transfer the depth feature from the source domain to the target domain. An overview of the feature transfer procedure is shown in Figure 6.
In detail, suppose there exists an auxiliary RGB-D dataset. Let the source-domain samples be denoted by $\{(\mathbf{x}^v_i, \mathbf{x}^d_i, y_i)\}_{i=1}^{N}$, where $N$ is the total number of samples, $\mathbf{x}^v_i$ denotes the visual feature of the $i$-th sample, $\mathbf{x}^d_i$ denotes the depth feature of the $i$-th sample, and $y_i \in \{1, \ldots, C\}$ denotes the label ($C$ is the total number of classes/identities). Depth features and visual features are heterogeneous, and we assume that they can be mapped onto a common latent subspace after being transformed into high-dimensional nonlinear spaces implicitly by kernel functions. Let us denote the nonlinear visual feature by $\phi_v(\mathbf{x}^v)$ and the nonlinear depth feature by $\phi_d(\mathbf{x}^d)$, whose dimensions are unknown. We then project the visual features and depth features onto a common latent subspace by projection matrices $\mathbf{P}_v$ and $\mathbf{P}_d$, respectively. In the common latent subspace, we aim to make each projected visual feature close to the corresponding depth feature. For this purpose, we minimize the distance between the means of the RGB-based visual features and of the depth features of each person in the common latent subspace:

$E = \sum_{c=1}^{C} \left\| \mathbf{P}_v^\top \mathbf{m}^v_c - \mathbf{P}_d^\top \mathbf{m}^d_c \right\|_2^2, \quad (13)$

where $\mathbf{m}^v_c$ and $\mathbf{m}^d_c$ are the means of the nonlinear visual and depth features of person $c$, respectively.
In addition, we wish the above transformation between depth and RGB features to be learned in a discriminative way. In order to make both visual features and depth features discriminative in the common latent subspace, we minimize the within-class variance and maximize the between-class variance of both feature types at the same time. The between-class and within-class scatter matrices are defined as follows:

$\mathbf{S}^{\pi}_b = \sum_{c=1}^{C} N_c (\mathbf{m}^{\pi}_c - \mathbf{m}^{\pi})(\mathbf{m}^{\pi}_c - \mathbf{m}^{\pi})^\top, \quad (14)$

$\mathbf{S}^{\pi}_w = \sum_{c=1}^{C} \sum_{y_i = c} (\phi_{\pi}(\mathbf{x}^{\pi}_i) - \mathbf{m}^{\pi}_c)(\phi_{\pi}(\mathbf{x}^{\pi}_i) - \mathbf{m}^{\pi}_c)^\top, \quad (15)$

where $\pi \in \{v, d\}$ indicates the visual feature ($v$) or the depth feature ($d$), and

$\mathbf{m}^{\pi}_c = \frac{1}{N_c} \sum_{y_i = c} \phi_{\pi}(\mathbf{x}^{\pi}_i), \quad (16)$

$\mathbf{m}^{\pi} = \frac{1}{N} \sum_{i=1}^{N} \phi_{\pi}(\mathbf{x}^{\pi}_i), \quad (17)$

where $N_c$ is the number of samples of class $c$.
We then combine the minimization of $E$ with discriminant feature learning, which maximizes the between-class variance while minimizing the within-class variance, as follows:
(18) 
where $\mathbf{S}^v_b$ ($\mathbf{S}^d_b$) and $\mathbf{S}^v_w$ ($\mathbf{S}^d_w$) are the between-class and within-class scatter matrices of visual features (depth features), respectively, weighted by nonnegative trade-off parameters. We call the above transfer model the kernelized implicit feature transfer scheme. It is unsupervised in the sense that no information from the target domain is used.
V-B Optimization
We show that the model developed in the last section can be converted into a generalized eigen-decomposition problem. As suggested by the representer theorem [81], the projection matrices can be represented as combinations of the training samples, i.e., $\mathbf{P}_v = \boldsymbol{\Phi}_v \mathbf{A}_v$ and $\mathbf{P}_d = \boldsymbol{\Phi}_d \mathbf{A}_d$, where $\boldsymbol{\Phi}_v$ and $\boldsymbol{\Phi}_d$ are the (implicit) visual and depth feature matrices of the training samples and $\mathbf{A}_v$ and $\mathbf{A}_d$ are the combination coefficient matrices. For a visual feature $\mathbf{x}^v$ and a depth feature $\mathbf{x}^d$, we define the kernel vectors

$\mathbf{k}_v(\mathbf{x}^v) = \boldsymbol{\Phi}_v^\top \phi_v(\mathbf{x}^v), \quad (19)$

$\mathbf{k}_d(\mathbf{x}^d) = \boldsymbol{\Phi}_d^\top \phi_d(\mathbf{x}^d), \quad (20)$

computed with the kernel functions for the visual feature and the depth feature, respectively. The projection of a visual feature is then expressed as $\mathbf{P}_v^\top \phi_v(\mathbf{x}^v) = \mathbf{A}_v^\top \mathbf{k}_v(\mathbf{x}^v)$; in the same way, $\mathbf{P}_d^\top \phi_d(\mathbf{x}^d) = \mathbf{A}_d^\top \mathbf{k}_d(\mathbf{x}^d)$. To jointly solve $\mathbf{A}_v$ and $\mathbf{A}_d$, we define
the stacked coefficient matrix $\mathbf{A} = [\mathbf{A}_v^\top, \mathbf{A}_d^\top]^\top$ together with zero-padding transformation matrices that select the visual part and the depth part of $\mathbf{A}$, respectively. Now the objective function (18) can be reformulated as:

(21)
where the between-class and within-class scatter matrices are re-expressed in terms of the kernel vectors of the training samples:

(22)
With the corresponding zero-padded scatter matrices, the objective function is finally formulated as:

(23)

from which a generalized eigen-decomposition problem can be derived:

(24)
Solving the above amounts to computing an eigen-decomposition, in which a diagonal matrix holds the eigenvalues sorted in descending order and the other matrix holds the corresponding eigenvectors. The coefficient matrix for the visual part is obtained by extracting the corresponding rows of the eigenvector matrix. To specify the dimension of the common latent subspace, we keep the leading columns as the projection matrix, so that a visual feature can be projected onto the low-dimensional common latent subspace.
V-C Depth Feature Estimation on Target Domain
After learning the projection to the discriminative common latent subspace, we can implicitly estimate the depth feature of an RGB image in target domain by mapping visual feature to highdimensional nonlinear space by and projecting it to the learned common latent subspace by . Given two new samples and in target domain, let , denote the visual features of RGB images. The estimated depth features in the learned discriminative common latent subspace are computed by and . Then the distance of depth features between and is computed by
(25) 
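A minimal sketch of this implicit estimation (our symbol choices, not the paper's code): form the kernel vector of a target-domain visual feature against the auxiliary training features, project it with the learned coefficient matrix, and compare in the latent subspace. A Euclidean distance in the subspace is assumed for Equation (25):

```python
import numpy as np

def rbf_kernel_vec(x, X_train, sigma):
    """k(x) = [k(x_1, x), ..., k(x_n, x)] with a Gaussian kernel."""
    diffs = X_train - x
    return np.exp(-np.sum(diffs**2, axis=1) / (2 * sigma**2))

def estimate_depth_feature(x, X_train, A, sigma):
    """Project a visual feature into the learned common latent subspace.
    A is the (n, d) coefficient matrix from the kernelized model."""
    return A.T @ rbf_kernel_vec(x, X_train, sigma)

def depth_distance(x_i, x_j, X_train, A, sigma):
    """Distance between estimated depth features, in the spirit of (25)."""
    f_i = estimate_depth_feature(x_i, X_train, A, sigma)
    f_j = estimate_depth_feature(x_j, X_train, A, sigma)
    return np.linalg.norm(f_i - f_j)

rng = np.random.default_rng(1)
X_train = rng.standard_normal((10, 5))   # auxiliary visual features (toy)
A = rng.standard_normal((10, 3))         # learned coefficients (toy values)
x_i, x_j = rng.standard_normal(5), rng.standard_normal(5)
print(depth_distance(x_i, x_j, X_train, A, sigma=1.0))
```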
VI Experiments
Our depth-based person re-identification framework was evaluated on three RGB-D person re-identification datasets, PAVIS [1], BIWI RGBD-ID [2] and IAS-Lab RGBD-ID [3], all captured by Kinect. In Section VI-E, the kernelized implicit feature transfer scheme was evaluated on 3DPeS [21] and CAVIAR4REID [20]. The experiment results are presented as Cumulative Matching Characteristic (CMC) curves [82] and rank accuracies. Rank-k accuracy is the cumulative recognition rate of correct matches at rank k; the CMC curve represents the cumulative recognition rates at all ranks. Each evaluation was repeated 10 times and average results are reported.
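For reference, the rank-k accuracies and CMC curve can be computed from a probe-by-gallery distance matrix as in the following generic sketch (function and variable names are ours):

```python
import numpy as np

def cmc(dist, probe_ids, gallery_ids):
    """Cumulative Matching Characteristic from a distance matrix.

    dist        : (num_probe, num_gallery) distances.
    probe_ids   : identity label of each probe sample.
    gallery_ids : identity label of each gallery sample.
    Returns an array where entry k-1 is the rank-k recognition rate.
    """
    num_probe, num_gallery = dist.shape
    matched_at = np.empty(num_probe, dtype=int)
    for i in range(num_probe):
        order = np.argsort(dist[i])          # gallery sorted by distance
        ranked = gallery_ids[order]
        # first position at which the correct identity appears
        matched_at[i] = np.nonzero(ranked == probe_ids[i])[0][0]
    # cumulative recognition rate at each rank k
    return np.array([(matched_at < k).mean() for k in range(1, num_gallery + 1)])

# Tiny example: 2 probes against a gallery of 3 identities.
dist = np.array([[0.2, 0.9, 0.5],
                 [0.7, 0.1, 0.4]])
probe_ids = np.array([0, 2])
gallery_ids = np.array([0, 1, 2])
curve = cmc(dist, probe_ids, gallery_ids)
print(curve)  # [0.5 1.  1. ] -> rank-1 is 50%, rank-2 and rank-3 are 100%
```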
Compared Methods. Following the general reid setting, we tested the Eigen-depth feature (ED), our depth voxel covariance descriptor (DVCov), the skeleton-based feature (SKL) and the combinations of the depth shape descriptors with the skeleton-based feature (ED+SKL and DVCov+SKL). We conducted comparisons with RGB-based appearance features, including the LOMO feature [12], the ELF18 feature [50], color histograms (RGB, HS and YCbCr space) [11], HOG [13] and LBP [80]; rotation invariant depth shape descriptors, including RIFT2M [5] and Fehr's covariance descriptor [6]; and the skeleton-based feature designed for depth-based reid [2]. All RGB-based appearance features were extracted from images resized to a common size. RIFT2M and Fehr's descriptor were densely extracted using the same voxels as the Eigen-depth feature. We used LDA to learn the distance metric for all features, except that the skeleton-based feature was matched by Euclidean distance and our depth voxel covariance descriptor was matched by geodesic distance using Equation (8).
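Equation (8) is not reproduced in this excerpt, but a standard geodesic distance between covariance descriptors, the affine-invariant metric of Förstner and Moonen, uses the generalized eigenvalues of the two matrices; it may differ in detail from the paper's Equation (8):

```python
import numpy as np
from scipy.linalg import eigvalsh

def geodesic_cov_distance(C1, C2):
    """Affine-invariant distance between SPD covariance matrices:
    sqrt(sum_i ln^2(lambda_i)), where lambda_i are the generalized
    eigenvalues of (C1, C2). This is the standard Foerstner metric;
    the paper's Equation (8) may differ in detail."""
    lam = eigvalsh(C1, C2)          # generalized eigenvalues, all positive
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

# Identical matrices lie at distance zero; scaling one moves it away.
C = np.array([[2.0, 0.3],
              [0.3, 1.0]])
print(geodesic_cov_distance(C, C))        # ~0.0
print(geodesic_cov_distance(C, 2.0 * C))  # > 0
```

The metric is invariant to joint affine transformations of both matrices, which is why it suits covariance descriptors extracted from differently posed bodies.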
VI-A Evaluation on PAVIS
We used two groups of images in the PAVIS dataset [1] for evaluation, denoted by “Walking1” and “Walking2”. Both groups were obtained by recording the same 79 people walking slowly in an indoor scenario with a frontal view. Among the 79 people, 60 wore different clothes in “Walking2” than in “Walking1”.
The characteristic of this experiment is that some people changed their clothes (by wearing an additional red shirt) from “Walking1” to “Walking2”, as shown in Figure 7. However, one could still exploit some appearance cues (e.g., trousers and body shape) for matching persons across the two sets. Since the frontal bodies were captured from nearly the same view in both sets, there was little rotation variation in the point clouds. In this case, we can apply DVCov+SKL in our framework.
We used images in “Walking1” to form the gallery set and images in “Walking2” to form the probe set. Following the usual train-test policy for person re-identification, we randomly sampled half of “Walking1”, i.e., the images of 40 persons, for training, and the remaining 39 persons were used for testing. Images of these 39 testing persons in “Walking1” were randomly selected as gallery, and all images of these 39 persons in “Walking2” were used as probe. In single-shot experiments, one image of each person was randomly selected as gallery. In multi-shot experiments, five images of each person were selected as gallery, and in that case the distance between a probe image and a gallery class was the minimum distance between the probe image and all gallery images of that class. The performance of the tested methods is reported in Figure 8, Figure 9 (a) and Table III.
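The multi-shot matching rule above (the distance from a probe image to a gallery class is the minimum over that class's gallery images) can be sketched as follows; the names are ours:

```python
import numpy as np

def multishot_distances(dist, gallery_ids):
    """Reduce image-level distances to class-level distances.

    dist        : (num_probe, num_gallery_images) distance matrix.
    gallery_ids : class label of each gallery image.
    Returns a (num_probe, num_classes) matrix of minimum distances,
    plus the ordered class labels.
    """
    classes = np.unique(gallery_ids)
    out = np.stack(
        [dist[:, gallery_ids == c].min(axis=1) for c in classes], axis=1)
    return out, classes

# Example: 1 probe image against 5 gallery images of 2 classes.
dist = np.array([[0.8, 0.3, 0.6, 0.9, 0.2]])
gallery_ids = np.array([0, 0, 0, 1, 1])
class_dist, classes = multishot_distances(dist, gallery_ids)
print(class_dist)  # [[0.3 0.2]] -> the probe is matched to class 1
```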
Setting  Single-shot  Multi-shot
Rank  1  5  1  5
RGB-based appearance features
LOMO [12]  12.05  35.03  19.74  44.36
ELF18 [50]  52.15  77.85  52.62  78.26
Color Hist [11]  47.90  74.97  48.92  74.82
HOG [13]  45.03  73.49  45.33  73.95
LBP [80]  42.92  71.33  45.64  72.36
Depth-based features
RIFT2M [5]  7.13  22.77  8.77  27.69
Fehr's [6]  24.26  51.64  30.56  58.67
Skeleton [2]  33.13  67.85  37.33  71.13
Proposed
DVCov (depth voxel covariance)  61.49  81.23  66.00  82.92
DVCov+SKL  67.64  87.33  71.74  88.46
ED (Eigen-depth feature)  44.67  72.10  51.59  76.15
ED+SKL  55.95  84.77  61.23  87.64
The results suggest that both the Eigen-depth feature (ED) and our depth voxel covariance descriptor (DVCov) are more effective than RIFT2M and Fehr's descriptor for describing body shape. Since the view angles of persons are nearly the same in “Walking1” and “Walking2”, our depth voxel covariance descriptor is more effective than the Eigen-depth feature, because it contains richer texture information than eigenvalues alone. Nevertheless, the Eigen-depth feature is still more effective than all other methods except our depth voxel covariance descriptor. Using the skeleton-based feature alone cannot achieve high performance, but it provides complementary information for our depth voxel covariance descriptor and the Eigen-depth feature. The combination of our depth voxel covariance descriptor and the skeleton-based feature achieves encouraging performance: rank-1 accuracy of 67.64% for single-shot recognition and 71.74% for multi-shot recognition. This fusion clearly outperformed the RGB appearance-based methods and the other tested depth-based methods. We note that not all RGB-based appearance features performed as badly as expected, because among the 79 people, 19 did not change clothes and the other 60 did not change their trousers from the gallery set to the probe set. In conclusion, this test showed the effectiveness of our depth voxel covariance descriptor for shape description.
VI-B Evaluation on BIWI RGBD-ID
The BIWI RGBD-ID dataset [2] contains three groups of sequences, “Training”, “Still” and “Walking”, captured from 50 different people. The sequence of each person contains about 300 frames of depth images and skeletons. Before feature extraction, we converted the depth images to point clouds. In “Training”, people performed motions such as walking and rotating. Only 28 of the people in “Training” were also recorded in “Still” and “Walking”, which were collected on a different day and in a different scene, so most persons were dressed differently. In “Still”, people moved slightly, while in “Walking”, every person walked through different view angles. Examples of images in “Training”, “Still” and “Walking” are shown in Figure 10. Since pedestrians' viewpoint variation is large here, ED+SKL is the more suitable choice in our framework.
For the BIWI RGBD-ID dataset, images of the 22 people who only appeared in “Training” were used for training, and images of the remaining 28 people were used for testing. In the testing set, we used images in “Training” as gallery and images in “Still” and “Walking” as probe, so the same person wore different clothes in gallery and probe.
We selected the samples for evaluation by face detection, as advised in [2]. Since the persons were captured from different view angles, this dataset is suitable for evaluating the local rotation invariance of the proposed Eigen-depth feature. The average CMC curves and rank accuracies over 10 trials are reported in Figure 9 (b), (c) and Table IV.

Probe  Still  Walking
Setting  Single-shot  Multi-shot  Single-shot  Multi-shot
Rank  1  5  1  5  1  5  1  5
RGB-based appearance features
LOMO [12]  9.07  28.21  18.17  35.47  8.74  23.33  10.31  25.39
ELF18 [50]  2.79  18.18  4.11  19.13  1.32  16.03  1.50  16.77
Color Hist [11]  7.02  25.47  10.61  31.92  5.43  19.56  5.86  21.70
HOG [13]  8.42  25.69  12.35  30.39  6.38  21.00  6.94  23.29
LBP [80]  7.37  26.04  10.87  33.57  4.87  20.04  5.34  23.31
Depth-based features
RIFT2M [5]  4.04  19.52  4.34  20.78  3.25  17.46  3.75  18.31
Fehr's [6]  12.08  38.17  14.06  43.78  9.33  32.39  12.09  39.60
Skeleton [2]  21.34  53.32  26.55  62.73  14.52  42.36  16.94  47.18
Proposed
DVCov  16.32  45.93  23.07  58.89  12.58  39.22  17.24  45.93
DVCov+SKL  23.49  57.06  34.37  72.77  16.59  46.67  21.40  54.12
ED  28.98  61.85  36.22  73.11  20.90  51.98  28.71  63.85
ED+SKL  30.52  67.86  39.38  72.13  24.47  60.63  29.96  65.18
As shown in Figure 9 (b), (c), the RGB-based appearance features completely failed, because most people changed clothes, making color features unreliable. Our depth-based methods outperformed all RGB appearance-based methods. On BIWI RGBD-ID, people appeared in different view angles, so the problem is more challenging than on PAVIS. On “Walking”, the problem is even more difficult, since more frames were captured from multiple viewpoints. In these situations, a rotation invariant depth shape descriptor is more suitable, which is why our Eigen-depth feature outperformed our depth voxel covariance descriptor. Compared with the other rotation invariant depth shape descriptors, the Eigen-depth feature outperformed RIFT2M and Fehr's descriptor. The combination of the Eigen-depth feature and the skeleton-based feature (ED+SKL) achieved better performance than using either separately, and is the best on BIWI RGBD-ID. The results demonstrate the local rotation invariance and the effectiveness of the Eigen-depth feature for body shape description.
VI-C Evaluation on IAS-Lab RGBD-ID
There are 11 different people in the IAS-Lab RGBD-ID dataset [3]. Three groups of sequences, “Training”, “TestingA” and “TestingB”, were recorded, in which each person rotated in place and walked. There are about 500 frames of depth images and skeletons for each person. The sequences in “Training” and “TestingA” were acquired with the same persons wearing different clothes. The sequences in “TestingB” were collected in a different room, where each person dressed the same as in “Training”; some sequences in “TestingB” were recorded in a dark environment. Examples of images in “Training”, “TestingA” and “TestingB” are shown in Figure 11.
On this dataset, the evaluation followed the settings on PAVIS. Half of the “Training” sequences were randomly selected to form the training set, and the rest formed the gallery in the test. The samples in “TestingA” and “TestingB” corresponding to the gallery persons formed the probe set. Following the settings in [3], all images were used in this experiment. On this dataset, mismatches can occur when matching a rear-view image of a person against a frontal-view image, which challenges body shape descriptors. The average rank-1 and rank-3 accuracies over 10 trials are reported in Table V.
Probe  TestingA  TestingB
Setting  Single-shot  Multi-shot  Single-shot  Multi-shot
Rank  1  3  1  3  1  3  1  3
RGB-based appearance features
LOMO [12]  26.37  65.82  25.79  66.28  30.97  75.00  30.06  79.90
ELF18 [50]  22.35  60.96  21.81  67.77  24.03  67.36  23.01  67.81
Color Hist [11]  27.69  63.71  24.42  66.48  18.45  63.33  23.89  60.93
HOG [13]  31.00  66.48  38.89  72.67  47.21  81.16  49.62  86.79
LBP [80]  28.71  67.97  32.81  68.22  51.38  84.28  52.88  89.81
Depth-based features
RIFT2M [5]  19.69  60.76  20.94  60.87  19.88  59.78  19.88  60.02
Fehr's [6]  23.78  67.34  24.05  64.95  20.58  63.21  20.46  62.65
Skeleton [2]  41.36  85.29  49.83  91.49  54.18  87.07  60.25  93.58
Proposed
DVCov  27.95  67.20  35.56  72.53  25.38  59.67  36.14  71.45
DVCov+SKL  34.10  71.00  46.57  79.23  27.74  62.28  45.91  80.42
ED  32.09  75.23  31.76  75.15  35.82  73.60  39.20  79.86
ED+SKL  48.75  90.57  52.30  90.15  58.65  94.36  63.29  91.21
On “TestingA”, the RGB-based appearance features nearly failed, and the Eigen-depth feature and the skeleton-based feature outperformed them. On “TestingB”, HOG and LBP could still adapt to the illumination change to some extent, while the color histogram completely failed. Since body rotation occurs in this dataset, the proposed Eigen-depth feature outperformed our depth voxel covariance descriptor and is more suitable for shape description in such a situation. The Eigen-depth feature also outperformed the compared rotation invariant depth shape descriptors RIFT2M and Fehr's descriptor. In most cases, combining the Eigen-depth feature with the skeleton-based feature worked better than using either separately. Since the viewpoint of each person's pose changed from 0° to 360° in the training and testing sets, the shape description from front to back of the same person changes notably, which can cause confusion in matching. The skeleton-based feature is more effective here because there are only 5 persons in the testing set, so there are fewer persons of similar somatotype; hence the skeleton-based feature is better than the Eigen-depth feature in this case. In general, the combination of the Eigen-depth feature and the skeleton-based feature is the most effective. This test showed the effectiveness of our depth-based method when people change clothes and appear in extreme lighting conditions.
VI-D Comparison to Depth-based Reid Frameworks
Existing well-known methods related to depth-based person re-identification include the still-image-based recurrent attention model (3D RAM) [4], the skeleton-based feature (SKL), Point Cloud Matching (PCM) and the combination of PCM and SKL (PCM+SKL) [3]. 3D RAM, PCM and PCM+SKL are designed under a different setting from the usual person reid one: they require that the group of persons used for training is the same as the one used for testing, whereas there is no overlap of persons between training and testing in the usual reid setting. To compare our method with the above methods, we tested our Eigen-depth feature (ED) and the combination of the Eigen-depth feature and the skeleton-based feature (ED+SKL) on PAVIS and IAS-Lab RGBD-ID under the same setting as the compared methods reported in [4, 3]. The experiment results are reported in Table VI. As shown, ED and ED+SKL clearly outperformed the other existing depth-based frameworks, especially on PAVIS, a much larger dataset with more persons involved.
Dataset  Probe  ED  ED+SKL  3D RAM [4]  PCM [3]  PCM+SKL [3]  SKL [3]
PAVIS  Walking2  54.4  57.0  41.3  –  –  28.6
IAS-Lab RGBD-ID  TestingA  44.0  49.9  48.3  28.6  25.6  22.5
IAS-Lab RGBD-ID  TestingB  55.5  66.6  63.7  43.7  63.3  55.5
*The experiments here are under a different setting from those in previous sections. See Sec. VI-D for details.
VI-E Depth Feature Transfer Evaluation
The effectiveness of the kernelized implicit feature transfer scheme was evaluated on the RGB datasets 3DPeS [21] and CAVIAR4REID [20]. Before presenting the experiment results, we first describe the implementation details of the feature transfer scheme.
Implementation Details. We selected the BIWI RGBD-ID dataset [2] as the auxiliary dataset. In “Training” of BIWI RGBD-ID, 50 persons performed rotation and walking actions. For each of the 50 persons, 8 RGB images from 8 different views ranging from 0° to 360° were selected as auxiliary RGB images. As for depth information, for each person, 8 point clouds from the frontal view were selected for extracting the depth features corresponding to those 8 RGB images. Some samples of the auxiliary RGB-D dataset are shown in Figure 6.
After constructing the RGB-D auxiliary dataset, we extracted visual features and depth features to establish the connection between the two modalities via the kernelized implicit feature transfer scheme. Since depth features describe the body shape of pedestrians, the visual features used for learning the transformation should also describe body shape to some extent. We used HOG [13] and LBP [80] to describe body silhouette and textures. All RGB images in the auxiliary dataset were resized to the same size, and HOG and LBP features were extracted over a regular grid of cells. We extracted the same visual features for samples in the target domain. From the point clouds, the Eigen-depth feature was extracted to describe body shape.
With the extracted visual features and depth features, we applied the proposed kernelized implicit feature transfer scheme. We chose Gaussian kernel functions for the visual feature and the depth feature. Let μ_v and μ_z denote the means of the distances of visual features and depth features between any two samples in the auxiliary dataset, respectively; the bandwidth parameters of the two kernels were set according to these mean distances. As for the parameters of the objective function, we empirically set default values, normalized by traces, such that the terms related to depth features were assigned much larger weights, since we focused on learning the relation between depth feature and visual feature in order to exploit the discriminative information in depth features. The dimension of the common latent subspace was set to a fixed default value.
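An illustrative sketch of the bandwidth heuristic described above (the names, and the assumption that the bandwidth equals the mean pairwise distance, are ours):

```python
import numpy as np

def mean_pairwise_distance(X):
    """Mean Euclidean distance between any two distinct samples in X."""
    n = X.shape[0]
    dists = [np.linalg.norm(X[i] - X[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

def gaussian_kernel_matrix(X, sigma):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(d2, 0) / (2 * sigma**2))

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))          # toy auxiliary visual features
sigma = mean_pairwise_distance(X)        # data-driven bandwidth, as in the text
K = gaussian_kernel_matrix(X, sigma)
print(K.shape)  # (6, 6)
```

Tying the bandwidth to the mean pairwise distance is a common scale-free heuristic: it keeps the kernel responsive to the actual spread of the data.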
Score-level Feature Fusion. We estimated depth features on RGB images in order to augment the visual features with the complementary information in depth features. Let d_RGB(i, j) denote the distance between the RGB-based appearance features of two samples i and j, and let d_depth(i, j) denote the distance between their estimated depth features computed according to Equation (25). We fused these two types of distances with a weight β as follows:

(26)
In our experiments, each type of distance was normalized by its mean distance between any two samples in the training set.
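A minimal sketch of the score-level fusion with mean normalization, assuming a convex combination with weight β (the exact form of Equation (26) is not reproduced here, and the names are ours):

```python
import numpy as np

def normalize_by_mean(D):
    """Normalize a distance matrix by its mean distance."""
    return D / D.mean()

def fuse_distances(D_rgb, D_depth, beta=0.3):
    """Weighted score-level fusion of normalized RGB and depth distances.
    Assumes a convex combination; the paper's Equation (26) may differ."""
    return (1 - beta) * normalize_by_mean(D_rgb) + beta * normalize_by_mean(D_depth)

rng = np.random.default_rng(3)
D_rgb = rng.random((4, 5)) + 0.1    # toy probe-by-gallery RGB distances
D_depth = rng.random((4, 5)) + 0.1  # toy estimated-depth distances
D = fuse_distances(D_rgb, D_depth, beta=0.3)
print(D.shape)  # (4, 5)
```

Normalizing each distance type by its mean before fusing keeps the two score scales comparable, so β controls the relative influence rather than compensating for scale.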
Experiment Settings. We evaluated how the transferred Eigen-depth feature (TED) can help improve performance when combined with LOMO [12] and ELF18 [50], two recently proposed effective RGB-based appearance features for person re-identification. To compute the similarity of RGB-based appearance features, we applied three popular distance metric learning methods: LFDA [43], MLAPG [52] and KISSME [40]. This gave the following settings: ELF18(LFDA)+TED, LOMO(LFDA)+TED, ELF18(MLAPG)+TED, LOMO(MLAPG)+TED, ELF18(KISSME)+TED and LOMO(KISSME)+TED, with the corresponding default distance fusion weights set to 0.3, 0.2, 0.3, 0.15, 0.3 and 0.2, respectively. It is reasonable for the distance fusion weight to differ when fusing different RGB-based distance metrics with the depth one. Experiments were conducted on 3DPeS and CAVIAR4REID, following the experiment settings on PAVIS. For each person in the testing set, one image was randomly selected as gallery and the remaining images were used as probe.
As baseline methods, CCA [83] and sparse regression [84] were compared. Specifically, we used CCA to maximize the correlation between RGB features and depth features on the auxiliary dataset. For sparse regression, we made the sparse representation shared between the RGB and depth feature dictionaries so as to derive a transferred depth feature. The depth features transferred by CCA and sparse regression are denoted by DCCA and DSPA, respectively. We combined the distance of the transferred depth feature with the distance of the RGB-based appearance feature for recognition. The average rank-1 to rank-5 accuracies over 10 trials are reported in Table VII.
Results. The transferred Eigen-depth feature (TED) achieves rank-1 accuracy of 16.0% on 3DPeS and 27.8% on CAVIAR4REID. For all RGB-based appearance features and distance metrics in our experiments, TED is effective for improving the top-rank matching accuracies. Augmentation with TED boosts the rank-1 accuracy of ELF18 with the LFDA metric by 4.4% on both 3DPeS and CAVIAR4REID. Although LOMO is a state-of-the-art feature for person re-identification, the transferred depth feature yields consistent improvement, especially at rank-1. Compared with the baselines, the proposed implicit feature transfer scheme clearly outperformed CCA (DCCA) and sparse regression (DSPA) when applied for the same purpose, which indicates that CCA and sparse regression are less effective for exploiting transferred depth features. Overall, it is evident that the transferred Eigen-depth feature (TED) is complementary to RGB color and texture features, so it can augment the RGB feature representation and improve the ranking results.
Dataset  3DPeS  CAVIAR4REID
Rank  1  2  3  4  5  1  2  3  4  5
TED  16.0  21.7  26.4  29.0  32.1  27.8  35.2  39.7  43.6  46.8
DCCA  6.5  11.0  15.0  17.8  20.5  24.2  31.9  36.6  40.1  42.9
DSPA  2.7  4.5  6.4  7.5  9.0  5.7  9.2  11.8  14.7  17.9
LFDA metric
ELF18  30.3  40.5  46.4  51.5  55.3  32.6  42.9  49.6  55.2  59.0
ELF18+TED  34.7  45.2  51.3  56.5  60.0  37.0  45.8  52.0  56.9  60.8
ELF18+DCCA  30.0  40.5  47.2  52.7  56.7  35.7  44.0  48.9  53.3  56.6
ELF18+DSPA  30.3  40.5  47.1  51.5  55.6  32.2  41.9  47.6  53.0  57.5
LOMO  41.4  53.4  60.4  64.3  68.0  40.2  50.1  56.7  61.8  65.6
LOMO+TED  43.8  54.9  61.2  65.7  68.8  42.2  51.4  56.9  62.1  65.7
LOMO+DCCA  41.2  52.8  59.6  64.0  67.2  40.9  49.3  54.9  59.5  63.3
LOMO+DSPA  40.2  51.8  59.2  63.8  66.9  38.8  47.9  53.8  59.0  62.7
MLAPG metric
ELF18  35.5  47.1  54.2  59.1  62.8  34.5  46.4  54.0  60.0  65.1
ELF18+TED  38.6  49.7  56.6  61.8  65.1  38.5  49.3  55.7  60.8  65.7
ELF18+DCCA  33.9  45.9  52.3  57.5  61.6  36.9  46.1  52.1  56.8  60.3
ELF18+DSPA  34.5  46.7  53.3  59.1  62.9  34.2  45.0  52.2  58.4  62.5
LOMO  47.1  58.5  64.5  68.5  71.7  40.6  51.8  59.4  65.2  69.4
LOMO+TED  48.4  58.7  64.6  68.8  72.0  42.8  52.9  59.8  65.3  69.6
LOMO+DCCA  43.7  55.2  62.3  66.5  69.7  41.6  50.0  56.3  60.9  64.8
LOMO+DSPA  44.3  55.8  62.3  66.5  69.3  39.0  48.8  55.3  60.9  65.2
KISSME metric
ELF18  32.4  42.8  48.9  53.5  57.0  33.3  42.6  48.7  53.5  57.7
ELF18+TED  35.3  45.4  52.1  56.7  59.8  36.3  45.6  50.9  55.3  59.5
ELF18+DCCA  32.6  42.6  49.7  54.4  57.8  35.9  43.7  48.4  52.8  56.0
ELF18+DSPA  32.4  42.6  48.5  53.5  57.0  33.1  42.1  47.8  52.4  56.7
LOMO