The task of person re-identification (re-id) is to match people across a distributed multi-camera surveillance system at different times and locations, with wide applications in forensic search, multi-camera tracking and access control. In most short-term applications, low-level features such as color and texture are the main appearance cues used for matching. Lighting significantly affects the performance of these low-level features. In more extreme cases, when the lighting condition changes greatly (e.g., with vs. without lighting), the color information of clothes becomes unavailable. Moreover, when people change clothes, color and texture become unreliable. For example, Figure 1 shows how color histograms change when people change clothes or appear under extreme illumination. In these cases, most existing re-id systems fail, since they are RGB-based.
In comparison to RGB information, depth information remains more invariant under clothing change and extreme illumination. As shown in Figure 1, the shape and skeleton of the body are likely to be more invariant under extreme lighting and clothing change. Nowadays, extracting depth and skeleton information with depth cameras (e.g., Microsoft Kinect) is not difficult in an indoor environment. The Kinect sensor obtains the depth value (distance to the camera) of each pixel by infrared, regardless of object color and illumination in indoor applications. With depth information, the life-size point cloud and skeleton of a person can be extracted, providing shape and physical information about his/her body. Moreover, with the depth value of each pixel, pedestrians can be segmented from the background more easily, so that background influence can be largely eliminated. Hence, using depth information could overcome some difficulties of RGB appearance-based methods, such as color change, illumination change and background clutter.
Although depth-based re-identification has some advantages over RGB appearance-based methods, depth information also brings challenges and limitations. Firstly, the depth images captured by a depth device change significantly when a person’s viewpoint changes. Secondly, the captured depth images contain device noise. These two aspects seriously affect the use of depth information for person re-identification. So far, a few methods [1, 2, 3, 4] have been developed to exploit depth information for person re-identification, but the above two problems are still not well solved by existing methods. One of them uses only skeletons to extract features. In [2, 3], besides using the skeleton to extract physical information, point clouds converted from depth images are also applied for 3D body shape matching, but alignment errors and noise in the point clouds remain unsolved problems. Therefore, body shape description is still an important biometric cue which needs further study for person re-identification.
In this work, we aim to design a depth shape descriptor which is locally invariant to rotation and insensitive to noise. Here, local rotation invariance means that the feature of a body part is invariant as long as the viewpoint change of the pedestrian does not make that body part invisible due to self-occlusion. We propose two depth shape descriptors: the depth voxel covariance descriptor and the Eigen-depth feature. The Eigen-depth feature is based on the depth voxel covariance descriptor and is locally rotation invariant. We then combine the depth shape descriptor with a skeleton-based feature to form a complete depth representation of body shape and physical information. The pipeline for constructing the descriptor in our depth-based person re-id method is illustrated in Figure 2. Our method takes the following steps: (1) segmentation and computation of the point cloud and normals of the torso and head; (2) extraction of the depth voxel covariance descriptor and the locally rotation invariant Eigen-depth feature; (3) enrichment of the body depth shape descriptor by additionally combining the skeleton-based feature. In the second step, the Eigen-depth feature is more suitable when the viewpoint of a person varies considerably, owing to its stability against local rotation of the body, while the depth voxel covariance descriptor is more effective when the viewpoint change is slight, because of the rich information it contains. In the third step, the skeleton-based feature is complementary to the depth shape descriptor extracted in step (2), so more robust matching can be achieved.
In addition, in real-world applications, most cameras deployed in existing surveillance systems cannot capture depth information, whereas existing depth-based and RGB-D-based methods assume that depth information is available. Making a depth-based method work in such systems is therefore also a challenge. To overcome this limitation, we learn the relation between depth features and RGB-based appearance features by a kernelized implicit feature transfer scheme. For this purpose, an auxiliary RGB-D dataset is employed to learn the nonlinear transformation between RGB-based appearance features and depth features. When a depth device is not available, the depth feature is estimated from the RGB image and used to augment the RGB-based appearance feature. The experimental results show that this further improves re-identification performance for top-ranked matching.
We tested our methods on three publicly available datasets, PAVIS, BIWI RGBD-ID and IAS-Lab RGBD-ID. The results show the effectiveness of our depth-based approach in overcoming change of clothes and extreme illumination conditions. When clothes are completely different between gallery and probe, RGB appearance-based methods fail while our depth-based method remains effective. Our approach outperforms other existing depth-based re-identification methods, including skeleton-based methods, PCM, the combination of them and the recurrent attention model. Compared to other rotation invariant depth shape descriptors, our descriptor also outperforms RIFT2M and Fehr’s descriptor.
In summary, the contributions of our work are: (1) proposing depth voxel covariance descriptor and Eigen-depth feature for depth-based re-identification and proving the local rotation invariance of Eigen-depth feature in theory; (2) forming a depth re-id recognition framework by unifying depth shape descriptor and skeleton-based feature for a complete representation; (3) proposing a kernelized implicit feature transfer scheme to estimate the Eigen-depth feature from RGB images implicitly when depth device is not available.
II Related Work
In this section, we present an overview of related image-based re-id technologies in three aspects: (1) RGB appearance-based re-id, (2) depth-based re-id, and (3) RGB-D re-id. Currently, most person re-identification approaches are based on 2D RGB-based appearance features.
II-A RGB Appearance-based Person Re-identification
Most existing works rely on RGB-based appearance features. Among them, color is most frequently used and it is often encoded into histograms [7, 8, 9, 10, 11, 12]. Besides, texture-based features are also employed, including HOG-like signature , Gabor feature [11, 14], graph model , differential filters [11, 14] and Haar-like representations .
Many other hand-crafted features such as covariance descriptor, Fisher vector, spatial co-occurrence representation, custom pictorial structure and SARC3D were also developed for achieving more reliable representations. Recently, more attention has been paid to feature learning methods, such as salience learning, mirror representation, salient color names, reference descriptor, context-based feature [27, 28, 29, 30, 31, 32], dictionary learning [33, 34, 35] and attribute learning [36, 37, 38]. However, in situations of clothing change or extreme illumination, these RGB-based appearance features tend to fail.
Besides feature representation, a large number of metric/subspace models [39, 11, 40, 14, 41, 42, 43, 44, 45, 46, 47, 48, 12, 49, 50, 51, 52, 53, 54, 55, 56] have been developed to achieve more reliable matching, such as LMNN, RankSVM, RDC, PCCA, KISSME, LFDA, CVDCA, CRAFT, MLAPG, TDL and DNS. Some other methods have also been proposed for this purpose, e.g., re-ranking [57, 58] and correspondence structure. Unsupervised learning models [60, 61] have also been developed for person re-identification. However, they cannot solve the illumination and clothing change problems. Compared to RGB-based appearance features, depth information offers a solution to this problem, because it is independent of color and remains more invariant over a longer period of time.
II-B Depth-based Person Re-identification
So far, only a few depth-based re-identification methods based on depth image, point cloud and anthropometric measurement [1, 3, 2, 62, 63, 64, 4] have been developed. To some extent, depth-based methods can solve the problems of changing clothes and extreme illumination. Barbosa et al. exploited skeleton-based feature  based on anthropometric measurement of distances between joints and geodesic distances on body surface. Munaro et al. built a point cloud model for each person as gallery by fusing a set of point clouds from different views and then applied Point Cloud Matching (PCM) to compute the distance between samples . In [3, 62], Munaro et al. combined PCM and skeleton-based feature modified based on Barbosa et al.’s work . These methods needed to align the point clouds, and no depth shape descriptor was applied for describing body shape. Haque et al. proposed a recurrent attention model  for depth-video-based person identification, in which 3D RAM model was for still 3D point clouds and 4D RAM model was for 3D point cloud sequences. However, among the above depth-based frameworks, PCM and Haque’s method were not suitable for solving person re-identification problem under the setting when there is no overlap on people between training and testing.
Compared to existing depth-based re-identification frameworks, the main difference of our work is that we propose depth voxel covariance descriptor and Eigen-depth feature to describe body shape. Eigen-depth feature is a covariance-based feature, and it is locally rotation invariant and does not require alignment of point clouds. The Eigen-depth feature can be viewed as a depth shape descriptor and thus can remove the ambiguity of using only anthropometric measurement of skeletons in the previous depth modeling for re-identification. Compared to direct utilization of point cloud in PCM , it deals with noises of non-rigid human body better.
We also discuss some related depth shape descriptors, including the covariance descriptor in , RIFT2M  and Fehr’s covariance descriptor , which were not applied for person re-identification. Compared to the covariance descriptor in , Eigen-depth feature is locally rotation invariant. Compared to rotation invariant descriptors RIFT2M  and Fehr’s covariance descriptor , Eigen-depth feature is densely extracted rather than using interest points, so that it contains richer information of body shape. Moreover, its rotation invariance is achieved in eigen-analysis level, so alignment of point cloud is not needed and more compact representation can be obtained by eigen-analysis.
II-C RGB-D Person Re-identification
Since RGB and depth information can be obtained simultaneously when using Kinect, some re-identification methods have been developed to combine depth information and RGB appearance cues in order to extract more discriminative feature representation. Pala et al.  improved accuracy of clothing appearance descriptors by fusing them with anthropometric measures extracted from depth data. Mogelmose et al.  presented a tri-modal method to combine RGB, depth and thermal features. Mogelmose et al.  combined color histogram and height feature extracted from depth information. John et al.  combined RGB-Height histogram and gait feature of depth information. Satta et al.  exploited skeleton to segment human body and extracted color feature. In , each color pixel was assigned to the nearest bone in the skeleton, and color histograms were computed for each region. In , the proposed feature bodyprint exploited the mean RGB values of regions in different heights. In , the descriptor was based on a 3D cylindrical grid that unified color variations together with angle and height. Takac et al.  exploited color histograms of upper body and lower body separately. Xu et al.  proposed a distance metric using RGB-D data to improve RGB-based person re-identification.
As reported in these works, the combination of RGB and depth is effective. They all assume that depth information is available along with RGB images. In our work, we propose to learn the relation between RGB and depth by a kernelized implicit feature transfer scheme, which enables estimation of depth features from RGB features so as to improve the re-identification performance even though the deployed cameras are not ready for capturing depth information.
A preliminary version of this work appeared in . In this work, apart from providing more in-depth discussion on the proposed Eigen-depth feature and the depth-based person re-identification framework, a kernelized implicit feature transfer scheme is proposed to learn the relation between depth features and RGB features so as to estimate depth features in RGB images when depth sensor is not ready. In addition, more extensive experiments have been conducted.
III Depth Voxel Covariance and Eigen-depth Feature
This section will present the extraction of depth voxel covariance descriptor and locally rotation invariant Eigen-depth feature. Our descriptors are extracted from point cloud, a set of points on object surface expressed by 3D coordinate in real world converted from depth image. We first tabulate the notations defined in this section in Table I.
III-A Basic Feature Extraction
We first extract basic features of point cloud. We assume another kind of biometric cue, skeleton joints of pedestrian body, is also available along with depth images (e.g., when using Kinect). We intend to extract features on the body parts whose surfaces are more invariant and reliable. As shown in Figure 3 (a) and (b), due to pose difference, sometimes a part of limb surface is not observed under self-occlusion, so the surface shapes of arms and legs contain more noises rather than valuable information. Therefore, we divide each point cloud of the whole body by two shoulder joints and two hip joints and only the points of head and torso are used for feature extraction, while the four limbs are not.
For each point in the point cloud, a normal vector is computed as the basic feature. The direction of the normal vector describes the shape of a small neighbouring region of that point. For a point $\mathbf{p}$, the $k$ nearest neighbours of $\mathbf{p}$ are found, and the direction along which the data is least scattered is computed by PCA as the unit normal vector $\mathbf{n}$. For each point $\mathbf{p} = [x, y, z]^\top$ (unit: mm), a feature vector is composed of the coordinate and the unit normal vector $\mathbf{n} = [n_x, n_y, n_z]^\top$:

$$\mathbf{f} = [x, y, z, n_x, n_y, n_z]^\top.$$
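The normal estimation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `estimate_normals`, the neighbourhood size `k=10`, the brute-force neighbour search, and the rule used to fix the sign of each normal (pointing back toward a camera assumed to look along +z) are all assumptions made here.

```python
import numpy as np

def estimate_normals(points, k=10):
    """Return (N, 6) features [x, y, z, nx, ny, nz] for an (N, 3) point cloud.

    The normal of each point is the PCA direction of least scatter among its
    k nearest neighbours. k and the sign convention are assumptions.
    """
    points = np.asarray(points, dtype=float)
    # Brute-force pairwise squared distances (fine for a small sketch).
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)
    # Indices of the k nearest neighbours of every point (the point itself included).
    nn_idx = np.argsort(dist2, axis=1)[:, :k]

    normals = np.empty_like(points)
    for i, idx in enumerate(nn_idx):
        cov = np.cov(points[idx], rowvar=False)   # 3x3 scatter of the neighbourhood
        # eigh returns eigenvalues in ascending order, so column 0 is the
        # direction along which the neighbours are least scattered.
        _, vecs = np.linalg.eigh(cov)
        n = vecs[:, 0]
        # Assumed sign convention: normals point back toward the camera (-z).
        if n[2] > 0:
            n = -n
        normals[i] = n
    return np.hstack([points, normals])
```

For real depth frames a spatial index (e.g., a k-d tree) would replace the brute-force distance matrix, which grows quadratically with the number of points.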
III-B Depth Voxel Covariance Descriptor
To depict the variation of local feature vectors and alleviate noises, we exploit two types of covariance matrices, namely within-voxel covariance and between-voxel covariance.
III-B1 Within-voxel Covariance
We divide a point cloud into rectangular voxels with 50% overlap, and an example is shown in Figure 3 (c). In each voxel, a within-voxel covariance matrix is computed to describe the shape. For a voxel $V$, let $\{\mathbf{f}_i\}_{i=1}^{N}$ be the 6-dimensional feature vectors inside $V$. The within-voxel covariance matrix is then defined as follows:

$$\mathbf{C}^{w}_{V} = \frac{1}{N-1}\sum_{i=1}^{N} (\mathbf{f}_i - \boldsymbol{\mu})(\mathbf{f}_i - \boldsymbol{\mu})^\top,$$

where $\boldsymbol{\mu}$ is the mean of the feature vectors of $V$.
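A minimal sketch of the within-voxel covariance computation follows. Several choices are assumptions made for illustration only: the voxels are taken as windows over the x and y extent of the point cloud, the grid size `grid=(2, 3)` is hypothetical, and the unbiased normalisation (division by the number of points minus one) is assumed.

```python
import numpy as np

def voxel_starts(lo, hi, n_cells, overlap=0.5):
    """Start positions and common size of n_cells windows covering [lo, hi].

    With stride = size * (1 - overlap), the windows span
    size * (1 + (n_cells - 1) * (1 - overlap)) = hi - lo.
    """
    size = (hi - lo) / (1 + (n_cells - 1) * (1 - overlap))
    starts = lo + np.arange(n_cells) * size * (1 - overlap)
    return starts, size

def within_voxel_covariances(features, grid=(2, 3), overlap=0.5, tol=1e-9):
    """Map voxel index (ix, iy) -> 6x6 covariance of the features inside it.

    features: (N, 6) array of [x, y, z, nx, ny, nz]. Voxels holding fewer
    than two points are skipped. grid is a hypothetical setting.
    """
    feats = np.asarray(features, dtype=float)
    x, y = feats[:, 0], feats[:, 1]
    xs, sx = voxel_starts(x.min(), x.max(), grid[0], overlap)
    ys, sy = voxel_starts(y.min(), y.max(), grid[1], overlap)
    covs = {}
    for ix, x0 in enumerate(xs):
        for iy, y0 in enumerate(ys):
            # tol absorbs floating-point rounding at the outer voxel borders.
            mask = ((x >= x0 - tol) & (x <= x0 + sx + tol) &
                    (y >= y0 - tol) & (y <= y0 + sy + tol))
            if mask.sum() < 2:
                continue
            centered = feats[mask] - feats[mask].mean(axis=0)
            covs[(ix, iy)] = centered.T @ centered / (mask.sum() - 1)
    return covs
```

With `overlap=0.5`, two windows over the interval [0, 3] are [0, 2] and [1, 3], so each point in the overlapping half contributes to both voxels.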
III-B2 Between-voxel Covariance
While within-voxel covariance describes shape in a voxel, the differences of shapes between voxels also contain discriminative information. Similar to standard covariance, we define a novel between-voxel covariance to represent the relation between different voxels.
As shown in Figure 3 (d), the point cloud is divided into voxels without overlap. Between-voxel covariance matrices are computed for each pair of 8-adjacent voxels. For two adjacent voxels and , let and be the 6-dimensional feature vectors inside and , respectively. We define the between-voxel covariance matrix as follows:
For a depth image, the unification of the within-voxel and between-voxel covariance matrices of all voxels is called the depth voxel covariance descriptor (DVCov).
III-C Local Rotation Invariance of Eigenvalues
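Assume $V_m$ and $V_n$ are voxels of a person in depth camera $a$. Let $\mathbf{f}^{a}$ denote the feature vectors, $\mathbf{C}^{w,a}_{m}$ denote the within-voxel covariance matrix of voxel $V_m$ and $\mathbf{C}^{b,a}_{mn}$ denote the between-voxel covariance matrix between voxels $V_m$ and $V_n$. We assume that only viewpoint rotation and location change take place between two different camera views $a$ and $b$, and that the rotation is local, so that the body part within voxels $V_m$ and $V_n$ does not become invisible due to self-occlusion. To express the transformation from camera view $a$ to camera view $b$, let $\mathbf{R}_p$ denote the rotation transformation matrix of the point coordinate $\mathbf{p}$, $\mathbf{R}_n$ denote the rotation transformation matrix of the unit normal vector $\mathbf{n}$, and $\mathbf{t}$ denote the shift of pedestrian location. Then the transformation of feature vectors from camera view $a$ to camera view $b$ is

$$\mathbf{f}^{b} = \mathbf{R}\,\mathbf{f}^{a} + \begin{bmatrix}\mathbf{t}\\ \mathbf{0}\end{bmatrix}, \qquad \mathbf{R} = \begin{bmatrix}\mathbf{R}_p & \mathbf{0}\\ \mathbf{0} & \mathbf{R}_n\end{bmatrix}.$$

Since $\mathbf{R}_p$ and $\mathbf{R}_n$ satisfy $\mathbf{R}_p^\top\mathbf{R}_p = \mathbf{I}$ and $\mathbf{R}_n^\top\mathbf{R}_n = \mathbf{I}$, we have $\mathbf{R}^\top\mathbf{R} = \mathbf{I}$, so that $\mathbf{R}$ is an orthogonal transformation. For the within-voxel covariance, the shift cancels when the mean is subtracted, giving $\mathbf{C}^{w,b}_{m} = \mathbf{R}\,\mathbf{C}^{w,a}_{m}\mathbf{R}^\top$, which is a similarity transformation. Hence, the eigenvalues of the within-voxel covariance matrices $\mathbf{C}^{w,a}_{m}$ and $\mathbf{C}^{w,b}_{m}$ are the same, and the eigenvalues of the between-voxel covariance matrices $\mathbf{C}^{b,a}_{mn}$ and $\mathbf{C}^{b,b}_{mn}$ are the same as well. That means the eigenvalues of the within-voxel covariance and the between-voxel covariance are invariant to rotation and location change.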
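The invariance argument can be checked numerically for the within-voxel case. The sketch below is an illustration under stated assumptions: the feature vectors are synthetic, the helper names are hypothetical, and independent random rotations are drawn for the coordinates and the normals together with an arbitrary shift.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation matrix (orthogonal, determinant +1) via QR."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix the column signs of the factorisation
    if np.linalg.det(q) < 0:         # turn a reflection into a proper rotation
        q[:, 0] = -q[:, 0]
    return q

def transform_features(feats, rot_p, rot_n, shift):
    """Rotate coordinates by rot_p and shift them; rotate normals by rot_n."""
    out = np.empty_like(feats)
    out[:, :3] = feats[:, :3] @ rot_p.T + shift
    out[:, 3:] = feats[:, 3:] @ rot_n.T
    return out

def sorted_eigenvalues(feats):
    """Eigenvalues of the 6x6 feature covariance, in descending order."""
    cov = np.cov(feats, rowvar=False)
    return np.linalg.eigvalsh(cov)[::-1]

rng = np.random.default_rng(1)
feats_a = rng.normal(size=(500, 6)) * np.array([30.0, 60.0, 10.0, 1.0, 1.0, 1.0])
feats_b = transform_features(feats_a, random_rotation(rng), random_rotation(rng),
                             shift=np.array([100.0, -50.0, 800.0]))
# The two spectra agree up to floating-point error.
print(np.max(np.abs(sorted_eigenvalues(feats_a) - sorted_eigenvalues(feats_b))))
```

The covariance matrices themselves differ between the two views; only their eigenvalues coincide, which is what motivates building the descriptor from eigenvalues.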
III-D Eigen-depth Feature and Analysis
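In this section, we provide a more in-depth analysis of the role of those eigenvalues. Let $\mathbf{C}_1, \mathbf{C}_2 \in \mathbb{R}^{d\times d}$ denote two covariance matrices. The eigen-decompositions of $\mathbf{C}_1$ and $\mathbf{C}_2$ are $\mathbf{C}_1 = \mathbf{U}_1\boldsymbol{\Lambda}_1\mathbf{U}_1^\top$ and $\mathbf{C}_2 = \mathbf{U}_2\boldsymbol{\Lambda}_2\mathbf{U}_2^\top$, respectively. Here $\boldsymbol{\Lambda}_1 = \mathrm{diag}(\lambda^{(1)}_1, \dots, \lambda^{(1)}_d)$ holds the eigenvalues of $\mathbf{C}_1$ in descending order, $\boldsymbol{\Lambda}_2 = \mathrm{diag}(\lambda^{(2)}_1, \dots, \lambda^{(2)}_d)$ holds the eigenvalues of $\mathbf{C}_2$ in descending order, and $\mathbf{U}_1$ and $\mathbf{U}_2$ are orthogonal matrices whose columns are the corresponding eigenvectors.

We note that rotation of point clouds and normal vectors can be normalized by matching the principal axes of $\mathbf{C}_1$ and $\mathbf{C}_2$ according to the descending order of eigenvalues. That is, one can find an orthogonal transformation matrix $\mathbf{R}$ such that $\mathbf{U}_2 = \mathbf{R}\mathbf{U}_1$, where $\mathbf{R}$ is the rotation transformation we want to estimate. Hence, we construct a normalized covariance matrix $\hat{\mathbf{C}}_1 = \mathbf{R}\mathbf{C}_1\mathbf{R}^\top = \mathbf{U}_2\boldsymbol{\Lambda}_1\mathbf{U}_2^\top$, where $\boldsymbol{\Lambda}_1$ contains the eigenvalues of $\mathbf{C}_1$ and $\mathbf{U}_2$ contains the eigenvectors of $\mathbf{C}_2$. We call $\hat{\mathbf{C}}_1$ the rotation normalized covariance matrix from $\mathbf{C}_1$ to $\mathbf{C}_2$.

Now we present how to use the above eigenvalues to construct feature vectors. Let $\mathbf{x}_1 = [\log\lambda^{(1)}_1, \dots, \log\lambda^{(1)}_d]^\top$ and $\mathbf{x}_2 = [\log\lambda^{(2)}_1, \dots, \log\lambda^{(2)}_d]^\top$. Interestingly, we have the following theorem.

Theorem 1. Computing the Euclidean distance between $\mathbf{x}_1$ and $\mathbf{x}_2$ is equivalent to computing the geodesic distance between the covariance matrix $\mathbf{C}_2$ and the rotation normalized covariance matrix $\hat{\mathbf{C}}_1$ on the Riemannian manifold.

Proof. The Euclidean distance between $\mathbf{x}_1$ and $\mathbf{x}_2$ is

$$\|\mathbf{x}_1 - \mathbf{x}_2\|_2 = \sqrt{\sum_{k=1}^{d}\left(\log\lambda^{(1)}_k - \log\lambda^{(2)}_k\right)^2}.$$

The geodesic distance between $\hat{\mathbf{C}}_1$ and $\mathbf{C}_2$ on the Riemannian manifold can be calculated as follows:

$$d_g(\hat{\mathbf{C}}_1, \mathbf{C}_2) = \sqrt{\sum_{k=1}^{d}\log^2\sigma_k}, \tag{8}$$

where $\sigma_1, \dots, \sigma_d$ are the generalized eigenvalues of $\hat{\mathbf{C}}_1$ and $\mathbf{C}_2$, computed by $\hat{\mathbf{C}}_1\mathbf{v}_k = \sigma_k\mathbf{C}_2\mathbf{v}_k$, i.e., the eigenvalues of $\mathbf{C}_2^{-1}\hat{\mathbf{C}}_1$.

Since $\mathbf{C}_2^{-1}\hat{\mathbf{C}}_1 = \mathbf{U}_2\boldsymbol{\Lambda}_2^{-1}\mathbf{U}_2^\top\mathbf{U}_2\boldsymbol{\Lambda}_1\mathbf{U}_2^\top = \mathbf{U}_2\boldsymbol{\Lambda}_2^{-1}\boldsymbol{\Lambda}_1\mathbf{U}_2^\top$, the generalized eigenvalues of $\hat{\mathbf{C}}_1$ and $\mathbf{C}_2$ are

$$\sigma_k = \frac{\lambda^{(1)}_k}{\lambda^{(2)}_k}, \qquad k = 1, \dots, d.$$

Substituting into the geodesic distance gives $d_g(\hat{\mathbf{C}}_1, \mathbf{C}_2) = \sqrt{\sum_{k}(\log\lambda^{(1)}_k - \log\lambda^{(2)}_k)^2} = \|\mathbf{x}_1 - \mathbf{x}_2\|_2$. It can be seen that the geodesic distance on the Riemannian manifold is equivalent to the Euclidean distance between the feature vectors $\mathbf{x}_1$ and $\mathbf{x}_2$.

Eigen-depth Feature. The above theorem tells us that, if there exists only local rotation variation, the logarithm eigenvalue vector converts the distance between covariance matrices on the Riemannian manifold to the Euclidean distance between two feature vectors. In our work, we define the Eigen-depth feature (ED) of a covariance matrix $\mathbf{C}$ with eigenvalues $\lambda_1 \ge \dots \ge \lambda_d$ as

$$\mathrm{ED}(\mathbf{C}) = [\log\lambda_1, \dots, \log\lambda_d]^\top,$$

where $\mathbf{C}$ is either a within-voxel covariance or a between-voxel covariance. Using eigenvalues makes the feature more compact than using the depth voxel covariance descriptor.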
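The equivalence stated by the theorem can be verified numerically. The following sketch assumes synthetic covariance matrices and hypothetical helper names; the small floor `eps` inside the logarithm is an added safeguard against zero eigenvalues and is not part of the definition.

```python
import numpy as np

def eigen_depth(cov, eps=1e-12):
    """Logarithms of the eigenvalues of cov, in descending order of eigenvalue."""
    lam = np.linalg.eigvalsh(cov)[::-1]
    return np.log(np.maximum(lam, eps))   # eps is an assumed guard against log(0)

def geodesic_distance(c1, c2):
    """sqrt(sum_k log^2 sigma_k), with sigma_k the eigenvalues of inv(c2) @ c1."""
    sigma = np.linalg.eigvals(np.linalg.solve(c2, c1)).real
    return np.sqrt(np.sum(np.log(sigma) ** 2))

def rotation_normalized(c1, c2):
    """Keep the eigenvalues of c1 but pair them with the eigenvectors of c2."""
    lam1 = np.linalg.eigvalsh(c1)[::-1]
    _, u2 = np.linalg.eigh(c2)
    u2 = u2[:, ::-1]                      # descending order, matching lam1
    return u2 @ np.diag(lam1) @ u2.T

rng = np.random.default_rng(2)
c1 = np.cov(rng.normal(size=(50, 6)), rowvar=False)
c2 = np.cov(rng.normal(size=(50, 6)) * 2.0, rowvar=False)

d_euclid = np.linalg.norm(eigen_depth(c1) - eigen_depth(c2))
d_geo = geodesic_distance(rotation_normalized(c1, c2), c2)
print(d_euclid, d_geo)   # the two values coincide up to rounding
```

Note that the geodesic distance between the unnormalized `c1` and `c2` is in general different from `d_euclid`; the equality holds only after the principal axes have been matched.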
To give a direct perception of Eigen-depth features, i.e., the logarithms of eigenvalues, we show some sample images, the Eigen-depth features and distances between positive and negative pairs in Figure 4. For demonstration, we selected one sample as probe image from BIWI RGBD-ID dataset and 18 samples of the same person captured from different views as gallery images. For a fixed voxel indicated in the red bounding box, we extracted its within-voxel Eigen-depth feature and obtained a 6-dimensional feature vector consisting of the logarithms of eigenvalues for each sample. The logarithms of eigenvalues are shown in the second row in Figure 4 by the histograms. We can find that the histograms of eigenvalues look very similar over the rotation change. Since there are still extra variations but not just local rotation variation in practice, we further make comparison between the distance of positive pair (i.e., samples from the same class) and the distance of negative pair (i.e., samples from different classes) on the third row of Figure 4. We first computed the Euclidean distance between the probe image and each gallery image given above as the distance of positive pair (plotted as blue curve), and computed the average distance between each gallery image and all samples from other classes in this dataset as the distance of negative pair (plotted as red dashed curve). We can observe that the distance between samples of the same class across multiple view angles is less sensitive to rotation in practice. Moreover, the distance of positive pair is smaller than the average distance of negative pair. So the Eigen-depth feature is a useful shape descriptor for recognition.
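Remark. In the existing literature on covariance descriptors, the geodesic distance on the Riemannian manifold is used for measuring the similarity between covariance matrices. However, directly using the geodesic distance is not invariant to rotation. Given two different rotation transformation matrices $\mathbf{R}_1$ and $\mathbf{R}_2$ applied to a covariance matrix $\mathbf{C}_1$, the generalized eigenvalues of $\mathbf{R}_1\mathbf{C}_1\mathbf{R}_1^\top$ and of $\mathbf{R}_2\mathbf{C}_1\mathbf{R}_2^\top$ with respect to another covariance matrix $\mathbf{C}_2$ are generally different, so the geodesic distance changes with the rotation. Moreover, covariance matrices do not lie in a Euclidean space, so most common machine learning methods cannot properly be applied to them directly.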
In practice, although Eigen-depth feature is proved to be locally invariant to rotation, some problems come along with this property. As shown in Figure 5 on the left, given depth images (in which grayscale denotes depth) of two different voxels of body surface, they can be transformed to each other by rotation. Obviously, their shapes are clearly different but the Eigen-depth features of within-voxel covariance matrices are the same, and such a situation could make confusion in the matching stage, which is also a problem for other rotation invariant depth shape descriptors. This kind of confusion could take place if the voxel size is too small and the voxel contains only a small region of body surface. To alleviate this problem, as illustrated in Figure 5 on the right, we divide the point cloud into voxels to extract feature for more robust representation. So the voxels are large enough to contain a big area of body surface, making it less possible to cause confusion after rotation.
IV Depth-based Re-identification Framework
In the previous section, we have extracted depth voxel covariance descriptors (DVCov), and constructed Eigen-depth feature (ED) for describing body shape. Besides using body shape, incorporating more physical information would have extra benefit on the identification of a person. As indicated in the previous section, the four limbs are not suitable for extracting invariant shape representation, but the lengths of limbs contain physical information which is also a biometric cue for distinguishing people. Hence, to obtain a complete feature representation of pedestrian, we additionally employ the skeleton-based feature (SKL) as complementary physical information, and then build a depth-based re-identification framework by combining the proposed depth shape descriptors and the skeleton-based feature together.
The whole framework is illustrated in Figure 2. For the feature representation of the skeleton, we apply the skeleton-based feature in . This skeleton-based feature is a feature vector that contains 13 distance values and ratios computed from the positions of skeleton joints provided by a skeleton tracker. The elements of the feature vector include: (a) head height, (b) neck height, (c) neck to left shoulder distance, (d) neck to right shoulder distance, (e) torso to right shoulder distance, (f) right arm length, (g) left arm length, (h) right upper leg length, (i) left upper leg length, (j) torso length, (k) right hip to left hip distance, (l) ratio between torso length and right upper leg length (i.e., j/h) and (m) ratio between torso length and left upper leg length (i.e., j/i). Distances are measured in cm.
After extracting skeleton-based feature, in the stage of feature fusion, we combine our proposed depth shape descriptors and the skeleton-based feature together to form complete representation of human body. In this work, we offer two fusion models below.
DVCov+SKL: When the viewpoint variation of a person across camera views is not large in some special cases (e.g., security check or walking in narrow passage), the influence of rotation can be secondary. In such cases, we select our depth voxel covariance descriptor as depth shape descriptor, as it contains richer information about texture and is more effective for describing the shape. We measure the similarity of two subjects by computing the combined distance , where denotes the sum of geodesic distances between the corresponding within-voxel covariance matrices and between-voxel covariance matrices, and denotes the Euclidean distance between skeleton-based features.
ED+SKL: When the viewpoint variation of a pedestrian across different camera views is large, we select the Eigen-depth feature as the depth shape descriptor since it is locally rotation invariant. Let $\mathbf{x}_w$ and $\mathbf{x}_b$ denote the concatenated Eigen-depth feature vectors of all within-voxel covariance matrices and all between-voxel covariance matrices of a person, respectively. Let $\mathbf{x}_s$ denote the skeleton-based feature. We fuse these three features by concatenating them to obtain a combined feature $\mathbf{x} = [\mathbf{x}_w^\top, \mathbf{x}_b^\top, \mathbf{x}_s^\top]^\top$. Then we apply LDA to reduce the dimension by projecting $\mathbf{x}$ onto the $c-1$ discriminant vectors learned by LDA, where $c$ is the number of classes. After dimension reduction, the projected features are matched by using Euclidean distance.
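The ED+SKL fusion and matching steps can be sketched as below. This is a plain NumPy illustration, not the authors' code: the function names, the ridge term `reg` added to the within-class scatter, and the Cholesky-based solver for the LDA eigenproblem are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def fuse(ed_within, ed_between, skl):
    """Concatenate the two Eigen-depth vectors and the skeleton feature."""
    return np.concatenate([ed_within, ed_between, skl])

def fit_lda(X, y, reg=1e-6):
    """Return a (d, c-1) projection matrix learned by LDA on the training set.

    reg is a hypothetical ridge term keeping the within-class scatter invertible.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                      # within-class scatter
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)     # between-class scatter
    Sw += reg * np.trace(Sw) / d * np.eye(d)
    # Solve Sb w = lambda Sw w by whitening with the Cholesky factor of Sw.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    _, vecs = np.linalg.eigh(L_inv @ Sb @ L_inv.T)
    top = vecs[:, ::-1][:, : len(classes) - 1]             # largest eigenvalues first
    return L_inv.T @ top

def rank_gallery(probe, gallery, W):
    """Gallery indices sorted by Euclidean distance to the probe in LDA space."""
    dist = np.linalg.norm(gallery @ W - probe @ W, axis=1)
    return np.argsort(dist)
```

In a re-identification setting the projection would be learned on the training identities only and then applied to the unseen gallery and probe identities.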
V Depth Feature Transfer
We have proposed a depth-based person re-identification framework in the previous sections. However, in most existing surveillance systems, a large number of cameras do not support capturing depth information, so only RGB images are available. In this section, we exploit a transfer technique to implicitly estimate depth features for RGB person images when a depth device is not available. We tabulate the notations defined in this section in Table II.
V-A Kernelized Implicit Feature Transfer Scheme
Depth features can describe body shape of a person, while some visual features (e.g., HOG  and LBP ) extracted from RGB images can also describe body shape coarsely to some extent. Therefore, we aim to learn the relation between depth features and these kinds of visual features, so as to estimate the depth features from RGB images when depth device is not ready.
For this purpose, we assume an auxiliary RGB-D dataset is given. This RGB-D dataset is regarded as source domain, and the RGB images from which we want to estimate depth features are regarded as target domain. We propose a kernelized implicit feature transfer scheme to transfer the depth feature from source domain to target domain. The overview of the feature transfer procedure is shown in Figure 6.
In details, suppose there exists an auxiliary RGB-D dataset that consists of RGB-D images for each person. Let the source domain samples be denoted by , where is the total number of samples, denotes the visual feature of the sample, denotes the depth feature of the sample, and denotes the label ( is the total number of classes/identities). Depth feature and visual feature are heterogeneous features, and we assume that they can be mapped onto a common latent subspace if they are transformed onto high dimensional nonlinear space implicitly by some kernel functions. Let us denote the nonlinear visual feature as and the nonlinear depth feature as , where the dimensions and are unknown. Then we project the visual features and depth features onto a common latent subspace by projection matrices and , respectively, where is the dimension of the common latent subspace. In the common latent subspace, we aim to make the projected visual feature close to the corresponding depth feature . For this purpose, we minimize the distance between the means of RGB-based visual features and the depth features of each person in the common latent subspace by minimizing
In addition, we wish that the above transformation between depth and RGB features is learned in a discriminative way. In order to make both visual features and depth features discriminative in the common latent subspace, it is expected to minimize the within-class variance and maximize the between-class variance of both visual features and depth features at the same time. The between-class scatter matrices and within-class scatter matrices are defined as follows:
where , denotes visual feature and denotes depth feature, and
and is the number of samples of class .
Then we combine the minimization of with the discriminant feature learning that maximizes between-class variance while minimizes within-class variance as follows:
where () and () are between-class scatter matrix and within-class scatter matrix of visual features (depth features) respectively, and , , , and are non-negative trade-off parameters. We call the above transfer model the kernelized implicit feature transfer scheme. It is unsupervised without using information in target domain.
We show that the model developed in the last section can be converted to a generalized eigen-decomposition problem. As suggested by the representer theorem , the projection matrices can be represented by the combination of training samples, i.e., , , where and are visual feature matrix and depth feature matrix of training samples respectively and and are the combination coefficient matrices. For visual feature and depth feature , we define
where and are kernel functions for visual feature and depth feature, respectively. The projection of a visual feature is expressed as . In the same way, . To jointly solve and , we define
, zero-padding transformation matrixfor and for . Now the objective function (18) can be reformulated as:
where , , , are scatter matrices defined by and , .
where , .
Let , , , denote the zero-padding scatter matrices. Finally, the objective function is formulated by:
where , . Hence, a generalized eigen-decomposition problem can be derived below:
Solving the above is to compute the eigen-decomposition , in which is a diagonal matrix with sorted eigenvalues in descending order lying on the diagonal and contains the eigenvectors. Since , we can obtain by extracting the first rows of . To specify the dimension of the common latent subspace, we use the first columns of to form the projection matrix so that we can project visual feature to the -dimensional common latent subspace.
V-C Depth Feature Estimation on Target Domain
After learning the projection to the discriminative common latent subspace, we can implicitly estimate the depth feature of an RGB image in target domain by mapping visual feature to high-dimensional nonlinear space by and projecting it to the learned common latent subspace by . Given two new samples and in target domain, let , denote the visual features of RGB images. The estimated depth features in the learned discriminative common latent subspace are computed by and . Then the distance of depth features between and is computed by
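As a rough illustration of how the estimation step might look in code, the sketch below relies on the representer-theorem form mentioned earlier: a projection expressed as a combination of training samples reduces to multiplying a coefficient matrix with the vector of kernel values between the new sample and the training samples. Everything concrete here is an assumption: the RBF kernel and its `gamma`, the names `alpha`, `estimate_depth_feature` and `latent_distance`, and the use of Euclidean distance in the latent subspace. The coefficient matrix `alpha` would come from the learned model; here it is only a placeholder argument.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """k(a, b) = exp(-gamma * ||a - b||^2); kernel type and gamma are assumptions."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def estimate_depth_feature(x_new, X_train, alpha, gamma=0.1):
    """Project visual features onto the latent subspace.

    x_new:   (m, d) or (d,) visual features of target-domain RGB images
    X_train: (n, d) visual features of the auxiliary source-domain samples
    alpha:   (n, r) combination coefficients (placeholder for the learned ones)
    Returns an (m, r) array of latent features.
    """
    K = rbf_kernel(np.atleast_2d(x_new), X_train, gamma)   # (m, n) kernel values
    return K @ alpha

def latent_distance(x_i, x_j, X_train, alpha, gamma=0.1):
    """Assumed Euclidean distance between two estimated latent features."""
    z_i = estimate_depth_feature(x_i, X_train, alpha, gamma)
    z_j = estimate_depth_feature(x_j, X_train, alpha, gamma)
    return float(np.linalg.norm(z_i - z_j))
```

Because only kernel values are needed, the high-dimensional nonlinear mapping never has to be computed explicitly, which is the sense in which the estimation is implicit.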
Our depth-based person re-identification framework was evaluated on three RGB-D person re-identification datasets, PAVIS, BIWI RGBD-ID and IAS-Lab RGBD-ID, which were captured by Kinect. In Section VI-E, the kernelized implicit feature transfer scheme was evaluated on 3DPeS and CAVIAR4REID. The experimental results are presented as Cumulative Matching Characteristic (CMC) curves and rank-$k$ accuracy. Rank-$k$ accuracy is the cumulative recognition rate of correct matches at rank $k$. The CMC curve represents the cumulative recognition rates at all ranks. The evaluation was repeated 10 times and the average results are reported.
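For concreteness, the CMC curve and rank-based accuracy can be computed from a probe-by-gallery distance matrix as in the following sketch. The function name and the toy distance matrix are illustrative assumptions, and the gallery is taken to hold one entry per identity.

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids):
    """Cumulative matching rates; entry k-1 is the accuracy at rank k.

    dist: (P, G) distances between P probes and G gallery entries.
    """
    dist = np.asarray(dist, dtype=float)
    probe_ids, gallery_ids = np.asarray(probe_ids), np.asarray(gallery_ids)
    order = np.argsort(dist, axis=1)                   # gallery sorted per probe
    hits = gallery_ids[order] == probe_ids[:, None]    # correct-match indicator
    first_hit = hits.argmax(axis=1)                    # 0-based rank of first match
    has_match = hits.any(axis=1)                       # probes present in the gallery
    counts = np.bincount(first_hit[has_match], minlength=dist.shape[1])
    return np.cumsum(counts) / len(probe_ids)

# Toy example: three probes against a gallery of three identities.
dist = [[0.1, 0.5, 0.9],
        [0.6, 0.2, 0.4],
        [0.3, 0.2, 0.9]]
cmc = cmc_curve(dist, probe_ids=[0, 1, 2], gallery_ids=[0, 1, 2])
print(cmc)   # rank-1 and rank-2 accuracy are 2/3, rank-3 accuracy is 1.0
```

Averaging over repeated trials then amounts to taking the element-wise mean of the per-trial curves, e.g. `np.mean(curves, axis=0)`.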
Compared Methods. Following the general re-id setting, we tested the Eigen-depth feature (ED), our depth voxel covariance descriptor (DVCov), the skeleton-based feature (SKL), and the combinations of the depth shape descriptors with the skeleton-based feature (ED+SKL and DVCov+SKL). We compared them with RGB-based appearance features including the LOMO feature, the ELF18 feature, color histograms (RGB, HS and YCbCr space), HOG and LBP; rotation-invariant depth shape descriptors including RIFT2M and Fehr’s covariance descriptor; and a skeleton-based feature designed for depth-based re-id. All RGB-based appearance features were extracted from images which were resized to . RIFT2M and Fehr’s descriptor were densely extracted using the same voxels as the Eigen-depth feature. We used LDA to learn the distance metric for all features, except that the skeleton-based feature was matched by Euclidean distance and our depth voxel covariance descriptor was matched by geodesic distance using Equation (8).
VI-A Evaluation on PAVIS
We used two groups of images in the PAVIS dataset for evaluation here, denoted by “Walking1” and “Walking2”. Both groups were obtained by recording the same 79 people from a frontal view, walking slowly in an indoor scenario. Among the 79 people, 60 wore different clothes in “Walking2” than in “Walking1”.
The characteristic of this experiment is that some people changed their clothes (by wearing one more red shirt) from “Walking1” to “Walking2” as shown in Figure 7. However, one could still explore some appearance cues (e.g., trousers and body shape) for matching persons across these two sets. Since the images of frontal bodies were captured from nearly the same view in these two sets, there was little rotation variation of point clouds. In this case, we can apply DVCov+SKL in our framework.
We used images in “Walking1” to form the gallery set and the images in “Walking2” to form the probe. By following the usual train-test policy for person re-identification, we randomly sampled half of the group “Walking1”, i.e., images of 40 persons for training, and the remaining 39 persons were used for testing. Images of these 39 testing persons in “Walking1” were randomly selected as gallery and all images of these 39 persons in “Walking2” were used as probe. In single-shot experiments, one image of each person was randomly selected as gallery. In multi-shot experiments, five images of a person were selected as gallery, and in such a case the distance between each probe image and each gallery class was the minimum distance between each probe image and each gallery image of that class. The performance of the tested methods was reported in Figure 8, Figure 9 (a) and Table III.
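The multi-shot matching rule above, where the distance from a probe image to a gallery class is the minimum over that class's gallery images, can be sketched as follows. The function and argument names are hypothetical, and the image-level distance matrix is assumed to have been computed beforehand with whichever metric applies to the feature.

```python
import numpy as np

def probe_to_class_distance(dist, gallery_ids):
    """Collapse image-level distances to class-level distances.

    dist[i, j]  : distance from probe image i to gallery image j.
    gallery_ids : identity label of each gallery image.
    Returns (class_dist, classes), where class_dist[i, c] is the minimum
    distance from probe i to any gallery image of identity classes[c].
    """
    dist = np.asarray(dist, dtype=float)
    gallery_ids = np.asarray(gallery_ids)
    classes = np.unique(gallery_ids)               # sorted unique identities
    class_dist = np.stack(
        [dist[:, gallery_ids == c].min(axis=1) for c in classes], axis=1
    )
    return class_dist, classes
```

In the single-shot case each class has one gallery image, so the reduction leaves the distances unchanged.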
|RGB-based appearance features|
|DVCov (depth voxel covariance)||61.49||81.23||66.00||82.92|
|ED (Eigen-depth feature)||44.67||72.10||51.59||76.15|
The results suggest that both the Eigen-depth feature (ED) and our depth voxel covariance descriptor (DVCov) are more effective than RIFT2M and Fehr’s descriptor for describing body shape. Since the view angles of persons are nearly the same in “Walking1” and “Walking2”, our depth voxel covariance descriptor is more effective than the Eigen-depth feature, because it contains richer information about textures than the eigenvalues alone. However, the Eigen-depth feature is still more effective than all other methods except our depth voxel covariance descriptor. The skeleton-based feature alone cannot achieve high performance, but it provides complementary information for our depth voxel covariance descriptor and the Eigen-depth feature. The combination of our depth voxel covariance descriptor and the skeleton-based feature achieves encouraging performance, with a rank-1 accuracy of 67.64% for single-shot recognition and 71.74% for multi-shot recognition. This fusion outperformed the RGB appearance-based methods and the other tested depth-based methods. We note that not all RGB-based appearance features performed as badly as might be expected, because among the 79 people, 19 did not change clothes and the other 60 kept the same trousers from the gallery set to the probe set. In conclusion, this test showed the effectiveness of our depth voxel covariance descriptor for shape description.
VI-B Evaluation on BIWI RGBD-ID
The BIWI RGBD-ID dataset contains three groups of sequences, “Training”, “Still” and “Walking”, captured from 50 different people. Each person's sequence contains about 300 frames of depth images and skeletons. Before feature extraction, we converted the depth images to point clouds. In “Training”, people performed motions such as walking and rotating. Only 28 of the people present in “Training” were also recorded in “Still” and “Walking”, which were collected on a different day and in a different scene, so that most persons were dressed differently. In “Still”, people moved slightly, while in “Walking”, every person walked at different view angles. Examples of images in “Training”, “Still” and “Walking” are shown in Figure 10. Since the pedestrians’ viewpoint variation was large here, it was more suitable to use ED+SKL in our framework.
For BIWI RGBD-ID dataset, images of the 22 people who only appeared in “Training” were used for training, and images of the remaining 28 people were used for testing. In the testing set, we used images in “Training” as gallery and images in “Still” and “Walking” as probe, so the same person wore different clothes in gallery and probe.
We selected the samples for evaluation by face detection as advised in. Since the persons were captured from different view angles, this dataset is suitable to evaluate the effect of the local rotation invariance property of the proposed Eigen-depth feature. The average results of CMC curve and rank- accuracy over 10 trials were reported in Figure 9 (b), (c) and Table IV.
As shown in Figure 9 (b), (c), RGB-based appearance features completely failed, because most people changed clothes so that color feature was not reliable. Our depth-based methods outperformed all RGB appearance-based methods. On BIWI RGBD-ID, people appeared in different view angles, so the problem became more challenging than the one on PAVIS. On “Walking”, the problem was even more difficult since more frames were captured in multiple viewpoints. In these situations, rotation invariant depth shape descriptor is more suitable, so that our Eigen-depth feature outperformed our depth voxel covariance descriptor. Compared with other rotation invariant depth shape descriptors, Eigen-depth feature outperformed RIFT2M and Fehr’s descriptor. The combination of Eigen-depth feature and skeleton-based feature (ED+SKL) can achieve better performance than using them separately, which is the best on BIWI RGBD-ID. The results showed the local rotation invariance and the effectiveness of body shape description of Eigen-depth feature.
VI-C Evaluation on IAS-Lab RGBD-ID
There are 11 different people in the IAS-Lab RGBD-ID dataset. In this dataset, three groups of sequences, “Training”, “TestingA” and “TestingB”, were recorded, and each person rotated on the spot and walked during the recording. There are about 500 frames of depth images and skeletons for each person. In “Training” and “TestingA”, the same person wore different clothes. The sequences in “TestingB” were collected in a different room, where each person was dressed the same as in “Training”. Some sequences in “TestingB” were recorded in a dark environment. Examples of images in “Training”, “TestingA” and “TestingB” are shown in Figure 11.
On this dataset, the evaluation also followed the settings on PAVIS. Half of “Training” sequences were randomly selected to form the training set and the rest were selected to form the gallery in the test. The samples in “TestingA” and “TestingB” corresponding to the gallery persons were selected to form the probe set. By following the settings in , all images were used in this experiment. On this dataset, mismatch would be observed when performing the matching between a person image of rear view and his/her image of frontal view, so that it challenges body shape descriptors. The average rank-1 and rank-3 accuracies over 10 trials of evaluation were reported in Table V.
On “TestingA”, the RGB-based appearance features nearly failed, and the Eigen-depth feature and the skeleton-based feature outperformed them. On “TestingB”, HOG and LBP could still adapt to the illumination change to some extent, while the color histograms completely failed. Since the samples in this dataset are rotated, the proposed Eigen-depth feature outperformed our depth voxel covariance descriptor and is more suitable for shape description in such a situation. The Eigen-depth feature also outperformed the compared rotation-invariant depth shape descriptors, RIFT2M and Fehr’s descriptor. In most cases, combining the Eigen-depth feature with the skeleton-based feature worked better than using either separately. Since each person's viewpoint ranged from 0° to 360° in the training and testing sets, the shape description of the same person changes notably from front to back and can therefore cause confusion in matching. The skeleton-based feature is better than the Eigen-depth feature when there are only 5 persons in the testing set, because there are fewer persons of similar somatotype. In general, the combination of the Eigen-depth feature and the skeleton-based feature is the most effective. This test showed the effectiveness of our depth-based method when people change clothes and appear under extreme lighting conditions.
VI-D Comparison to Depth-based Re-id Frameworks
Existing well-known methods related to depth-based person re-identification include the still-image-based recurrent attention model (3D RAM), the skeleton-based feature (SKL), Point Cloud Matching (PCM) and the combination of PCM and SKL (PCM+SKL). 3D RAM, PCM and PCM+SKL are designed under a different setting from the usual one for person re-id: they require the persons used for training to be the same as those used for testing, whereas in the usual re-id setting there is no overlap of persons between training and testing. To compare our method with the above methods, we tested our Eigen-depth (ED) feature and the combination of the Eigen-depth feature and the skeleton-based feature (ED+SKL) on PAVIS and IAS-Lab RGBD-ID under the same setting used for the compared methods in [4, 3]. The experimental results are reported in Table VI. As shown, ED and ED+SKL clearly outperformed the other existing depth-based frameworks, especially on PAVIS, a much larger dataset with more persons involved.
|Dataset||Probe||ED||ED+SKL||3D RAM||PCM||PCM+SKL ||SKL|
*The experiments here are under a different setting from the experiments in previous sections. See Sec. VI-D for details.
VI-E Depth Feature Transfer Evaluation
The effectiveness of the kernelized implicit feature transfer scheme was evaluated on RGB datasets 3DPeS  and CAVIAR4REID . Before showing the experiment results, we first present implementation details of the feature transfer scheme.
Implementation Details. In this work, we selected the BIWI RGBD-ID dataset as the auxiliary dataset. In “Training” of BIWI RGBD-ID, 50 persons performed rotation and walking actions. For each of these 50 persons, 8 RGB images from 8 different views ranging from 0° to 360° were selected as auxiliary RGB images. As for the depth information, 8 point clouds from the frontal view were selected for each person to extract the depth features corresponding to those 8 RGB images. Some samples of the auxiliary RGB-D dataset are shown in Figure 6.
After constructing the RGB-D auxiliary dataset, we extracted visual features and depth features to establish the connection between two modalities by the kernelized implicit feature transfer scheme. Since depth features describe body shape of pedestrians, the visual features for learning the transformation should also be able to describe body shape to some extent. We used HOG  and LBP  for describing body silhouette and textures. All RGB images in auxiliary dataset were resized to for extracting HOG and LBP features using cells. We also extracted the same visual features for samples in target domain. As for the point clouds, Eigen-depth feature was extracted to describe body shape.
With the extracted visual features and depth features, we conducted the proposed kernelized implicit feature transfer scheme. We chose Gaussian kernel functions for the visual feature and the depth feature, which are and , respectively. Let and denote the means of the distances of visual features and depth features between any two samples in the auxiliary dataset, respectively. We set the bandwidth parameters as and . As for the parameter settings of the objective function, we empirically set the default parameters as , , , , which were normalized by traces. That is to say, the terms related to depth features were assigned much larger weights, since we focused on learning the relation between the depth feature and the visual feature in order to take advantage of the discriminative information in depth features. As for the dimension of the common latent subspace, we set .
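The bandwidth heuristic described above, where a Gaussian kernel's width is tied to the mean pairwise distance in the auxiliary set, can be sketched as follows. The exact kernel expressions and bandwidth values are not reproduced in the text, so this sketch assumes the common form exp(-||x - y||^2 / (2σ^2)) and sets σ to a hypothetical multiple `scale` of the mean pairwise distance; all function names are illustrative.

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """Squared Euclidean distances between the rows of X and the rows of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.maximum(sq, 0.0)                     # clip tiny negative round-off

def bandwidth_from_mean_distance(X, scale=1.0):
    """sigma = scale * mean distance over all distinct sample pairs.

    `scale` is a hypothetical knob; the paper's actual setting is not shown here.
    """
    d = np.sqrt(pairwise_sq_dists(X, X))
    upper = np.triu_indices(d.shape[0], k=1)       # each unordered pair once
    return scale * d[upper].mean()

def gaussian_kernel(X, Y, sigma):
    """Assumed kernel form: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-pairwise_sq_dists(X, Y) / (2.0 * sigma ** 2))
```

Under these assumptions, one bandwidth would be computed from the auxiliary visual features and another from the auxiliary depth features, and each kernel matrix would then be evaluated with its own bandwidth.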
Score-level Feature Fusion. We estimated depth features on RGB images in order to augment the visual features with the complementary information in depth features. Let denote the distance between the RGB-based appearance features of the RGB images of two samples and , and let denote the distance between their depth features computed according to Equation (25). We fused these two types of distances with a weight as follows:
In our experiments, each type of distance was normalized by its mean distance between any two samples in training set.
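A minimal sketch of this score-level fusion follows. The fusion equation itself is not reproduced in the text above, so the sketch assumes a weighted sum of the form (1 - w) * d_rgb + w * d_depth, which may differ from the authors' exact expression; the function names and the `train_pairwise` argument are hypothetical.

```python
import numpy as np

def normalize_by_train_mean(dist, train_pairwise):
    """Divide distances by the mean pairwise distance observed on the training set.

    train_pairwise holds the distances between distinct training samples
    for the same feature type and metric.
    """
    return np.asarray(dist, dtype=float) / np.asarray(train_pairwise, dtype=float).mean()

def fuse_distances(d_rgb, d_depth, w):
    """Assumed fusion rule: (1 - w) * d_rgb + w * d_depth, with 0 <= w <= 1."""
    return (1.0 - w) * np.asarray(d_rgb, dtype=float) + w * np.asarray(d_depth, dtype=float)
```

Normalizing each distance type first puts the two on a comparable scale, so that the weight reflects relative importance rather than differences in magnitude between the two metrics.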
Experiment Settings. We evaluated how the transferred Eigen-depth feature (TED) can help to improve performance when combined with LOMO and ELF18, two recently proposed and effective RGB-based appearance features for person re-identification. To compute the similarity of the RGB-based appearance features, we applied three popular distance metric learning methods: LFDA, MLAPG and KISSME. This gave the following settings: ELF18(LFDA)+TED, LOMO(LFDA)+TED, ELF18(MLAPG)+TED, LOMO(MLAPG)+TED, ELF18(KISSME)+TED and LOMO(KISSME)+TED. For these settings, the corresponding default distance fusion weight was set to 0.3, 0.2, 0.3, 0.15, 0.3 and 0.2, respectively. It is reasonable for the distance fusion weight to take different values when fusing different RGB-based distance metrics with the depth one. Experiments were conducted on 3DPeS and CAVIAR4REID, following the experiment settings used on PAVIS. For each person in the testing set, one image was randomly selected as gallery and the remaining images were used as probes.
As baseline methods, CCA and sparse regression were compared. In detail, we used CCA to maximize the correlation between the RGB feature and the depth feature on the auxiliary dataset. For sparse regression, we made the sparse representation shared between the RGB and depth feature dictionaries so as to derive a transferred depth feature. The depth features transferred by CCA and sparse regression are denoted by D-CCA and D-SPA, respectively. We combined the distance of the transferred depth feature with the distance of the RGB-based appearance feature for recognition. The average rank-1 to rank-5 accuracies over 10 trials are reported in Table VII.
Results. The transferred Eigen-depth feature (TED) can achieve rank-1 accuracy 16.0% on 3DPeS and 27.8% on CAVIAR4REID. For all RGB-based appearance features and distance metrics in our experiments, TED is effective for improving the top-rank matching accuracies. The augmentation of TED can boost rank-1 accuracy of ELF18 using LFDA metric by 4.4% on both 3DPeS and CAVIAR4REID. Although LOMO is a state-of-the-art feature for person re-identification, the transferred depth feature makes consistent improvement especially at the rank-1 matching case. Compared to the baseline methods, the proposed implicit feature transfer scheme clearly outperformed CCA (D-CCA) and sparse regression (D-SPA) when applied for the same purpose. The results indicate that it may not be effective to use CCA and sparse regression to exploit transferred depth feature. Overall, it is evident that the transferred Eigen-depth feature (TED) is complementary to RGB color and texture features, so that it can augment the RGB feature representation and help to get better ranking results.