Person Re-ID framework developed during my PhD in collaboration with Prof. Dr. William Robson Schwartz (UFMG)
Feature representation and metric learning are two critical components in person re-identification models. In this paper, we focus on feature representation and claim that hand-crafted histogram features can be complementary to Convolutional Neural Network (CNN) features. We propose a novel feature extraction model called Feature Fusion Net (FFN) for pedestrian image representation. In FFN, back propagation makes the CNN features constrained by the hand-crafted features. Utilizing color histogram features (RGB, HSV, YCbCr, Lab and YIQ) and texture features (multi-scale and multi-orientation Gabor features), we obtain a new deep feature representation that is more discriminative and compact. Experiments on three challenging datasets (VIPeR, CUHK01, PRID450s) validate the effectiveness of our proposal.
My first IEEE conference paper on Computer Vision.
Person re-identification aims at matching people from different views under surveillance cameras, which has been studied extensively in the past five years. To address the re-identification problem, existing methods exploit either cross-view invariant features [9, 7, 27, 19, 14, 33, 12, 20, 18] or cross-view robust metrics [4, 5, 12, 17, 33, 23, 3, 28, 34, 25].
Deep learning provides a powerful and adaptive approach to computer vision problems without excessive hand-crafting of image features. The back-propagation algorithm dynamically adjusts the parameters of the CNN, which unifies feature extraction and the pairwise comparison process in a single network.
However, in real-world person re-identification, a person's appearance often undergoes large variations across non-overlapping camera views, due to significant changes in view angle, lighting, background clutter and occlusion (see Fig. 1). Hand-crafted concatenations of different appearance features, such as RGB and HSV color histograms and the LBP descriptor, which are designed to overcome cross-view appearance variations in re-identification tasks, can sometimes be more distinctive and reliable.
In order to effectively combine hand-crafted and deeply learned features, we investigate the combination and complementarity of multi-colorspace hand-crafted features (ELF16) and deep features extracted from a CNN. A deep Feature Fusion Network (FFN) is proposed that uses the hand-crafted features to regularize the CNN, so that the convolutional neural network extracts features complementary to the hand-crafted ones. After extracting features with FFN, traditional metric learning methods can be applied to boost performance. Experimental results on three challenging person re-identification datasets (VIPeR, CUHK01, PRID450s) demonstrate the effectiveness of our new features: a significant improvement in Rank-1 matching rate (8.09%, 7.98% and 11.2% on the three datasets) is achieved compared to state-of-the-art methods. In short, we show that hand-crafted features can improve the extraction of CNN features in FFN, achieving a more robust image representation.
Hand-crafted Features. Color and texture are two of the most useful characteristics in image representation. For example, HSV and LAB color histograms measure the color information in an image, while LBP histograms and Gabor filters describe its texture. Recent papers combine different features to produce more effective representations [27, 9, 7, 32, 33, 20].
Recently, features specifically designed for person re-identification have significantly boosted the matching rate. Local Descriptors encoded by Fisher Vectors (LDFV) build image descriptors on the Fisher Vector. Color invariants (ColorInv) use color distributions as the sole cue and achieve good recognition performance. Symmetry-Driven Accumulation of Local Features (SDALF) shows that the symmetry structure of body segments can improve performance significantly, and that accumulating features provides robustness to image distortions. Local Maximal Occurrence (LOMO) features analyze the horizontal occurrence of local features and maximize the occurrence to stably represent re-identification images.
Deep Learning. Convolutional Neural Networks have been widely used in many computer vision problems, but only a few works apply deep learning to person re-identification.
Li et al. first proposed the deep Filter Pairing Neural Network (FPNN), which uses a patch-matching layer and a maxout pooling layer to handle pose and viewpoint variations; FPNN was also the first work to employ deep learning for person re-identification. Ahmed et al. improved the deep learning architecture by specifically designing a cross-input neighbourhood difference layer. Later deep metric learning work used a "siamese" deep neural structure with a cosine layer to deal with large variations between person images. Hu et al. proposed Deep Transfer Metric Learning (DTML), which transfers cross-domain visual knowledge to target datasets.
These deep methods combine feature extraction and image-pair classification into a single CNN network. Pairwise comparison and symmetric structures are widely used among them, which can be seen as inheritances from traditional metric learning methods [9, 7, 27, 19, 14, 33, 12, 20, 18, 34, 25]. Since pairwise comparison is used to train the deep neural network, a large number of pairs must be formed for each probe image and passed through the deep convolutions. In contrast, our FFN is not based on pairwise input but extracts deep features directly from a single image, so our deep architecture can be followed by any conventional classifier, which existing deep learning works cannot.
We use our modification of a convolutional neural network, the Feature Fusion Network (FFN), to learn new features. The network architecture is shown in Fig. 2. Our Feature Fusion Network consists of two parts: the first applies traditional convolution, pooling and activation neurons to the input image; the second processes additional hand-crafted feature representations of the same image. These two sub-networks are finally linked together to produce a full-fledged image description, so the second part regularizes the first during learning. Finally, our new feature (a 4096D vector) is extracted from the last fully connected layer (Fusion Layer) of FFN.
The upper part of Fig. 2 describes a traditional pipeline of convolution and pooling. Every convolution layer is followed by a pooling layer and a local response normalization (LRN) layer, with one exception. Finally, the output of the last pooling layer is a 4096D vector, which we regard as the CNN Features.
Most re-identification models treat the CNN as a whole binary classifier with direct image input, as in DeepReID and Ahmed's improved deep re-id model. However, prior work inspires a strong reason for treating the convolution layers as a feature extractor. Re-identification images are typically whole-body images captured under different camera views: most body parts can be found in all views, but they suffer from serious malposition, distortion and misalignment. Convolution allows part displacement and visual changes to be alleviated in higher-level convolution layers, and multiple convolution kernels provide different descriptions of pedestrian images. In addition, the pooling and LRN layers provide nonlinear expressions of the corresponding descriptions, which significantly reduces overfitting. These layers contribute to a stable convolutional neural network that can be applied to new datasets (see Section 4 for the detailed training process).
The lower part of Fig. 2 extracts conventional hand-crafted features widely used in person re-identification. In this work, we employ the Ensemble of Local Features (ELF), as improved in [32, 33]. It extracts RGB, HSV and YCbCr histograms from 6 horizontal stripes of the input image. In addition, 8 Gabor filters and 13 Schmid filters are applied to capture the corresponding texture information.
We further modify the ELF feature by enriching the color spaces and the stripe division. The input image is equally partitioned into 16 horizontal stripes, and our features are composed of color features (RGB, HSV, LAB, XYZ, YCbCr and NTSC) and texture features (Gabor, Schmid and LBP). A 16D histogram is extracted for each channel and then normalized. All histograms are concatenated together to form a single vector. We denote this type of hand-crafted feature as ELF16.
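As a minimal sketch of the stripe-histogram construction (RGB channels only; the full ELF16 also covers the other color spaces and the texture filters, and function names here are illustrative, not the paper's implementation):

```python
def stripe_histograms(image, n_stripes=16, n_bins=16):
    """image: H x W x 3 nested lists with channel values in [0, 255].

    Returns one normalized 16-bin histogram per stripe per channel,
    concatenated into a single feature vector.
    """
    h = len(image)
    feature = []
    for s in range(n_stripes):
        # one horizontal stripe of the image
        rows = image[s * h // n_stripes:(s + 1) * h // n_stripes]
        for c in range(3):  # one histogram per color channel
            hist = [0] * n_bins
            for row in rows:
                for px in row:
                    hist[min(px[c] * n_bins // 256, n_bins - 1)] += 1
            total = sum(hist) or 1
            feature.extend(v / total for v in hist)  # normalize histogram
    return feature
```

For a 16-stripe, 3-channel, 16-bin configuration this yields a 768D color vector per colorspace; ELF16 concatenates several such vectors plus texture histograms.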
We aim to jointly map CNN features and hand-crafted features to a unitary feature space. A feature fusion deep neural network is proposed that uses the hand-crafted features to regularize the CNN features, making the CNN extract complementary features. In our framework, through back propagation, the parameters of the whole CNN can be affected by the hand-crafted features. In general, as a result of the fusion, the regularized CNN features output by the proposed network should be more discriminative than both the plain CNN features and the employed hand-crafted features.
Fusion Layer and Buffer Layer. Our Fusion Layer uses full connection to provide self-adaptation to person re-identification problems. Both the ELF16 features and the CNN features are followed by a 4096D-output fully connected layer (Buffer Layer), which provides a buffer for the fusion. The Buffer Layer is essential in our architecture, since it bridges the gap between two very different features and guarantees the convergence of FFN.
If the input of the Fusion Layer is the concatenation $z = [x^{C}; x^{E}]$ of the CNN features and the ELF16 features, then the output of this layer is computed by:

$$a = \varphi(Wz + b),$$

where $\varphi(\cdot)$ is the activation function and $W$, $b$ are the weights and biases of the layer. According to the back-propagation algorithm, the parameters of the layer after a new iteration are written as:

$$W^{(t+1)} = W^{(t)} - \eta\,\frac{\partial \mathcal{L}}{\partial W^{(t)}}, \qquad b^{(t+1)} = b^{(t)} - \eta\,\frac{\partial \mathcal{L}}{\partial b^{(t)}},$$

where the learning rate $\eta$ and the initialization of $W$ and $b$ are set as described in Section 4.
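The forward pass and parameter update above can be sketched in a few lines (a toy illustration on plain Python lists; a ReLU activation is assumed for $\varphi$, and the names are illustrative):

```python
def fusion_forward(w, b, z):
    """Fully connected Fusion Layer: a = relu(W z + b)."""
    return [max(0.0, sum(wi * zi for wi, zi in zip(row, z)) + bi)
            for row, bi in zip(w, b)]

def sgd_step(w, grad_w, lr):
    """One back-propagation update: W <- W - eta * dL/dW."""
    return [[wi - lr * gi for wi, gi in zip(row, grow)]
            for row, grow in zip(w, grad_w)]
```

Here `z` is the concatenation of the CNN and ELF16 feature vectors, so a single weight matrix mixes both sources.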
Existing deep re-identification networks are trained with pairwise comparison as the loss function. But we aim at extracting deep features from every image effectively rather than performing pairwise comparison through a deep neural network. Therefore, the softmax loss function is applied in our model; intuitively, a more discriminative feature representation should also result in a lower softmax loss. For a single input vector $z$ and output node $j$ in the last layer, the predicted probability is calculated by:

$$p_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}.$$

The last layer of our network is designed to minimize the cross-entropy loss:

$$\mathcal{L} = -\sum_{j} y_j \log p_j,$$

in which the number of output nodes varies with the training set, as described in Section 4.
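The softmax cross-entropy computation can be sketched for a single sample (a standard, numerically stabilized formulation; function name is illustrative):

```python
import math

def softmax_cross_entropy(logits, label):
    """Softmax probabilities and cross-entropy loss for one sample.

    Subtracting the max logit before exponentiating avoids overflow
    without changing the result."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[label]), probs
```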
If the parameters of the network are influenced by the ELF16 features, i.e., the gradients of the network parameters are adjusted according to them, then the ELF16 features in the lower part of FFN can make the CNN features more complementary to them, since the final objective of FFN is to make the fused features more discriminative across different images.
Denote the CNN features (in the FC7 layer) as $x^{C}$ and the ELF16 features as $x^{E}$, and denote the weight connecting the $i$-th node in layer $l$ and the $j$-th node in layer $l+1$ as $w_{ij}^{(l)}$. The input to the Fusion Layer is then $z = [x^{C}; x^{E}]$, and its pre-activation is $s_j = \sum_i w_{ij} z_i + b_j$.

We show that, by using back propagation, $x^{C}$ is influenced by $x^{E}$: the CNN part learns parameters that form features complementary to the ELF16 features.

$x^{C}$ is influenced by $x^{E}$ in two ways. Firstly, the gradient $\partial \mathcal{L} / \partial w_{ij}$ of a Fusion Layer weight depends on the input $z_i$, part of which is $x^{E}$; in other words, the information in the ELF16 features propagates through the shared weights, and thus the convolution filters of the Deep Feature Extraction part adapt themselves according to $x^{E}$. Secondly, the output of the softmax loss layer is influenced by $x^{E}$ during forward propagation, and thus the back-propagated error, and with it $x^{C}$, is also influenced by $x^{E}$.
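This dependence can be made concrete with a toy one-layer example (illustrative names, not the paper's implementation): the error term of a softmax-trained fusion layer depends on the whole concatenated input, so changing only the ELF16 half of the input changes the gradients that flow back toward the CNN side.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def fusion_grads(w, z, label):
    """Gradients of the cross-entropy loss w.r.t. fusion weights.

    z is the concatenation [x_cnn; x_elf]; the error delta_j = p_j - y_j
    depends on the WHOLE input, so dL/dw[j][i] on the CNN-side columns
    changes when only the ELF16 half of z changes."""
    logits = [sum(wi * zi for wi, zi in zip(row, z)) for row in w]
    p = softmax(logits)
    delta = [pj - (1.0 if j == label else 0.0) for j, pj in enumerate(p)]
    return [[dj * zi for zi in z] for dj in delta]  # dL/dw[j][i]
```

Evaluating `fusion_grads` with the same CNN half but a different ELF16 half yields different gradients even on the CNN-side weight columns, which is the coupling the text describes.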
Market-1501 is a recently released multi-shot person re-identification dataset. It consists of 38195 images of 1501 identities, making it the largest public person re-identification dataset available. We trained our Feature Fusion Network on Market-1501 and used it to extract features in Section 5.
Our training strategy applied mini-batch stochastic gradient descent (SGD) for faster back propagation and smoother convergence. In each training iteration, 25 images formed a mini-batch and were forwarded to the softmax loss layer. The initial learning rate was set significantly smaller than in most other CNN models, and the learning rate was decreased every 20000 iterations. We fine-tuned our network from a publicly available ImageNet pre-trained model. Our FFN model took 50000 iterations to converge (about 4 hours on a Tesla K20m GPU). In order to improve the adaptation of our model, we further used difficult samples to fine-tune the network.
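The step-decay schedule described above can be sketched as follows (the initial rate and decay factor are not stated in the text; the values below, including the common Caffe default gamma of 0.1, are purely illustrative):

```python
def step_lr(base_lr, iteration, step=20000, gamma=0.1):
    """Step decay: every `step` iterations the learning rate is
    multiplied by `gamma`. `gamma=0.1` is an assumed value for
    illustration, not taken from the paper."""
    return base_lr * (gamma ** (iteration // step))
```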
Hard negative mining gives us a principled way to emphasize difficult samples in CNN training. This strategy was originally designed to balance positive and negative samples in pairwise comparison for person re-identification; we applied it to our Feature Fusion Network as well. About 12000 images of 630 identities were misclassified by the previous network and were picked out for further fine-tuning. We replaced the last softmax loss layer with one having fewer output nodes and continued to fine-tune the model on these difficult samples, with a lower learning rate and fewer iterations (about 10000). The whole training process took about 5-6 hours to converge to a tolerable training loss (typically about 0.05).
This section evaluates our new features from different perspectives. We present extensive experimental results on three benchmark datasets to clearly demonstrate the effectiveness of our features.
Our tests are based on three publicly available datasets: VIPeR, CUHK01 and PRID450s. Each dataset contains two disjoint camera views, with significant misalignment, lighting change and body-part distortion. Table 1 briefly introduces the three datasets, and sample images are shown in Fig. 1.
In each individual experiment, we randomly selected half of the identities as the training set and the other half as the testing set. The training set was used to learn the projection matrix (in metric learning methods); the testing set was projected accordingly, and the distances between pairs of input images were measured. For reliability and stability of the results, each experiment was repeated 10 times and the average Rank-i accuracy was computed. Cumulative Matching Characteristic (CMC) curves are also provided in Fig. 3, giving a more intuitive comparison between the algorithms.
We applied the single-shot protocol in our experiments: during the testing phase, one image was chosen from View 2 as the probe, and all images in View 1 were regarded as the gallery. For CUHK01 specifically, which has 2 images of the same person per camera view, we randomly chose one image of each identity for the gallery.
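The Rank-i accuracy and CMC evaluation under this single-shot protocol can be sketched as follows (function and variable names are illustrative):

```python
def cmc(distance_rows, gallery_labels, probe_labels, max_rank=20):
    """Cumulative Matching Characteristic, single-shot protocol.

    For each probe, gallery items are ranked by distance; Rank-k
    accuracy is the fraction of probes whose true match appears
    within the top k ranked gallery items."""
    hits = [0] * max_rank
    for dists, p in zip(distance_rows, probe_labels):
        order = sorted(range(len(dists)), key=lambda g: dists[g])
        rank = next(i for i, g in enumerate(order) if gallery_labels[g] == p)
        for k in range(rank, max_rank):
            hits[k] += 1
    n = len(probe_labels)
    return [h / n for h in hits]
```

Averaging the returned curve over the 10 random splits gives the reported results.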
Six feature extraction approaches were evaluated in our experiments for comparison: LDFV, gBiCov, ImageNet CNN features, LOMO features, ELF16 features and our proposed features (available at http://isee.sysu.edu.cn/resource). For the stand-alone ImageNet CNN features, the FC7 layer output, which produced the highest accuracy in our tests, was chosen. Local Maximal Occurrence (LOMO) is another high-performance feature representation specifically designed for re-identification. LDFV features were evaluated only on the VIPeR dataset due to code copyright restrictions. All images were resized to 224×224 for our feature extraction.
In order to demonstrate the effectiveness of our new feature, two compound features (ELF16+CNN-FC7 and Ours+LOMO) were also added for the comparison. ELF16+CNN-FC7 denotes the concatenation of normalized CNN-FC7 feature to ELF16 feature. Ours+LOMO denotes the concatenation of our new features and normalized LOMO features.
All of these features were extracted and evaluated in their default dimensions (see Table 5).
Table 1. Overview of the three datasets.

                                  VIPeR   CUHK01   PRID450s
 No. of images                     1264     3884        900
 No. of identities                  632      971        450
 No. of images in training set      316      485        225
 No. of camera views                  2        2          2
 No. of images per view per ID        1        2          1
Fig. 3 (a)-(c) shows the performance of our features compared to other features under direct distance matching, evaluating each algorithm's capability in an unsupervised setting. Our features significantly outperformed the other stand-alone features (see Fig. 3 (a)-(c)), suggesting that the raw information provided by our feature represents re-identification images more accurately in most cases.
ELF16+CNN-FC7 performed second best and outperformed both ELF16 and CNN-FC7, which supports our assumption that traditional features and CNN features are complementary. Moreover, our new features significantly outperformed ELF16+CNN-FC7, which may be because of the following two reasons:
(1) The CNN features in our network were trained to be complementary to the traditional features, while in ELF16+CNN-FC7 the CNN features are simply concatenated with the ELF16 features, which may not be optimal. (2) The Buffer Layer and Fusion Layer automatically tune the weights of each feature, making the fused feature perform much better.
LOMO features were specifically designed for person re-identification images; however, they ranked seventh on VIPeR and third on CUHK01, which suggests they are not stable enough under direct distance matching.
To demonstrate the maximal effectiveness of our image description, we fed it into two metric learning methods, LFDA and Mirror KMFA, along with other widely used features. We used each feature to learn a distance metric between each probe image and the gallery set. In these experiments, we evaluated the features' capability under supervised metric learning.
Fig. 3 (d)-(i) shows the CMC curves on the three datasets, with the Rank-1 identification rate labeled for each feature type. Note that LDFV performed badly with the chi-square kernel, so we adopted Mirror MFA without the kernel trick in that comparison.
The results clearly show the outstanding performance of our proposed features, which exceeded all stand-alone features on VIPeR and CUHK01. Compared to ELF16 and CNN-FC7 alone, our new features yielded much better results. Moreover, the simple concatenation of these two features (ELF16+CNN-FC7) could not represent the image as well as ours, which indicates the necessity of the Fusion Layer in the proposed FFN.
Rows excerpted from Tables 2-4 (matching rates in %, Rank-1 first, at increasing ranks):

 Deep Feature Learning    40.50   60.80   70.40   84.40
 Mirror KMFA              42.97   75.82   87.28   94.84
 Mirror KMFA              40.40   64.63   75.34   84.08
 Ahmed's Deep Re-id       47.53   72.10   80.53   88.49
 Mirror KMFA              55.42   79.29   87.82   93.87
 Ahmed's Deep Re-id       34.81   63.72   76.24   81.90
Our proposed features generally outperform LOMO features. Since LOMO emphasizes HSV and SILTP histograms, it performed relatively better on PRID450s, which exhibits specific lighting conditions; on the other datasets, our new features are still better than LOMO.
The concatenation of these two features (Ours+LOMO) has strong discriminative ability and outperformed all other features under Mirror KMFA. This indicates that our deeply learned features are complementary to LOMO features. We therefore adopt this 31056D mixed feature as the final image representation in the Mirror KMFA person re-identification model.
This experiment compares the overall performance of state-of-the-art person re-identification models and ours. Our model is based on Mirror KMFA, using the concatenation of our new features and normalized LOMO features (Ours+LOMO).
Tables 2-4 summarize some of the highest-performing models on VIPeR, CUHK01 and PRID450s, including LOMO+XQDA, Mirror KMFA, Ahmed's Improved Deep Re-ID and Mid-level Filter. Our model beats them by roughly 8-11% in Rank-1 matching rate.
Three deep learning methods (DeepReID, Ahmed's Deep Re-id, and Ding's Deep Feature Learning) are specifically listed in Tables 3 and 4. All of them modify a CNN for pairwise comparison and employ special layers to match the two views of the input images.
In comparison, our model treats the CNN as a feature extractor and adopts metric learning to compute relative distances between images. This not only improves accuracy but also lets us use larger datasets in the CNN training process. Our model also clearly exceeded their performance on CUHK01 (by 7.98%) and PRID450s (by 11.2%).
We also evaluated the running time of these feature extraction algorithms, as shown in Fig. 5. The reported time is the average feature extraction time for a single image on the VIPeR dataset (in each feature's default dimension); note that the time for extracting ELF16 features is included in the last row. Our Fusion Feature Network is even faster than some hand-crafted methods (such as gBiCov), which breaks the stereotype of the huge and clumsy convolutional neural network. Moreover, most of the time is spent extracting the ELF16 features. Compared to LOMO features, our features have a much lower dimension and therefore make the subsequent metric learning step faster. With this balance between speed and dimensional complexity, our Feature Fusion Network can easily be applied in practice. Besides, compared to other CNN-based models, our FFN does not need to be fine-tuned on the target dataset, which makes it faster to deploy.
 Feature             Extraction time      Default dimension
 Ours (with ELF16)   0.1769s + 0.5720s    4096
In this paper, we have presented a novel and effective feature extraction method for person re-identification called the Feature Fusion Network (FFN). The model jointly utilizes CNN features and hand-crafted features, including RGB, HSV, YCbCr, Lab and YIQ color features and Gabor texture features, and automatically adjusts the weights of this information through the back propagation of the neural network. We have also shown that FFN regularizes the CNN so that it focuses on extracting complementary features. Experiments on three challenging person re-identification datasets (VIPeR, CUHK01, PRID450s) show the effectiveness of our learned deep features. Using Mirror Kernel Marginal Fisher Analysis (KMFA), our proposed features significantly outperform state-of-the-art person re-identification models on the three datasets by 8.09%, 7.98% and 11.2% in Rank-1 accuracy, respectively.
This research was partly supported by Guangdong Provincial Government of China through the Computational Science Innovative Research Team Program, and partially by Natural Science Foundation of China (Nos. 61472456, 61522115, 61573387), Guangzhou Pearl River Science and Technology Rising Star Project under Grant 2013J2200068, the Guangdong Natural Science Funds for Distinguished Young Scholar under Grant S2013050014265, and the GuangDong Program (No. 2015B010105005).