Code for https://arxiv.org/abs/1703.05693
This paper proposes the SVDNet for retrieval problems, with a focus on the application of person re-identification (re-ID). We view each weight vector within a fully connected (FC) layer in a convolutional neural network (CNN) as a projection basis. It is observed that the weight vectors are usually highly correlated. This problem leads to correlations among entries of the FC descriptor, and compromises the retrieval performance based on the Euclidean distance. To address the problem, this paper proposes to optimize the deep representation learning process with Singular Value Decomposition (SVD). Specifically, with the restraint and relaxation iteration (RRI) training scheme, we are able to iteratively integrate the orthogonality constraint in CNN training, yielding the so-called SVDNet. We conduct experiments on the Market-1501, CUHK03, and DukeMTMC-reID datasets, and show that RRI effectively reduces the correlation among the projection vectors, produces more discriminative FC descriptors, and significantly improves the re-ID accuracy. On the Market-1501 dataset, for instance, rank-1 accuracy is improved from 55.3% to 80.5% for CaffeNet, and from 73.8% to 82.3% for ResNet-50.
This paper considers the problem of pedestrian retrieval, also called person re-identification (re-ID). This task aims at retrieving images containing the same person as the query.
Person re-ID is different from image classification in that the training and testing sets contain entirely different classes. So a popular deep learning method for re-ID consists of 1) training a classification deep model on the training set, 2) extracting image descriptors using the fully-connected (FC) layer for the query and gallery images, and 3) computing similarities based on Euclidean distance before returning the sorted list [33, 31, 26, 10].
Our work is motivated by the observation that after training a convolutional neural network (CNN) for classification, the weight vectors within a fully-connected (FC) layer are usually highly correlated. This problem can be attributed to two major reasons. The first reason is related to the non-uniform distribution of training samples. This problem is especially obvious when focusing on the last FC layer. The output of each neuron in the last FC layer represents the similarity between the input image and a corresponding identity. After training, neurons corresponding to similar persons (e.g., the persons who wear red and pink clothes) learn highly correlated weight vectors, as shown in Fig. 1. The second is that during the training of a CNN, there exist few, if any, constraints for learning orthogonalization. Thus the learned weight vectors may be naturally correlated.
Correlation among weight vectors of the FC layer compromises the descriptor significantly when we consider the retrieval task under the Euclidean distance. In fact, a critical assumption of using Euclidean distance (or equivalently the cosine distance after $\ell_2$-normalization) for retrieval is that the entries in the feature vector should be as independent as possible. However, when the weight vectors are correlated, the FC descriptor (the projection on these weight vectors of the output of a previous CNN layer) will have correlated entries. This may lead to some entries of the descriptor dominating the Euclidean distance, and cause poor ranking results. For example, during testing, the images of two different persons are passed through the network to generate the green and black dotted feature vectors and then projected onto the red, pink and blue weight vectors to form the descriptors, as shown in Fig. 1. The projection values on both red and pink vectors are close, making the two descriptors appear similar despite the difference projected on the blue vector. As a consequence, it is of vital importance to reduce the redundancy in the FC descriptor to make it work under the Euclidean distance.
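The effect can be illustrated with a small NumPy sketch. The three weight vectors below (named red, pink and blue after Fig. 1) and the random input features are hypothetical values chosen for illustration; they are not taken from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weight vectors: red and pink are nearly parallel, blue is orthogonal to both.
w_red = np.array([1.0, 0.0, 0.0])
w_pink = np.array([0.95, 0.31, 0.0])
w_blue = np.array([0.0, 0.0, 1.0])
W = np.stack([w_red, w_pink, w_blue], axis=1)   # columns are weight vectors

# Hypothetical input features from the previous layer (5000 samples, 3 dims).
H = rng.standard_normal((5000, 3))
F = H @ W                                        # descriptor entries are projections onto the columns

corr = np.corrcoef(F, rowvar=False)              # 3x3 correlation matrix between descriptor entries
corr_red_pink = corr[0, 1]
corr_red_blue = corr[0, 2]
print(f"corr(red, pink) = {corr_red_pink:.2f}, corr(red, blue) = {corr_red_blue:.2f}")
```

The red and pink entries carry almost the same information, so together they count roughly twice in a Euclidean distance, while the blue entry counts once.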
To address the correlation problem, we propose SVDNet, which is featured by an FC layer containing decorrelated weight vectors. We also introduce a novel three-step training scheme. In the first step, the weight matrix undergoes singular value decomposition (SVD) and is replaced by the product of the left unitary matrix and the singular value matrix. Second, we keep the orthogonalized weight matrix fixed and only fine-tune the remaining layers. Third, the weight matrix is unfixed and the network is trained for overall optimization. The three steps are iterated to approximate orthogonality on the weight matrix. Experimental results on three large-scale re-ID datasets demonstrate significant improvement over the baseline network, and our results are on par with the state of the art.
Deep learning for person re-ID.
In the person re-ID task, deep learning methods can be classified into two classes: similarity learning and representation learning. The former is also called deep metric learning, in which image pairs or triplets are used as input to the network [25, 24, 1, 13, 5, 19]. In two early works, Yi et al. and Li et al. use image pairs and inject part priors into the learning process. In later works, Varior et al. incorporate long short-term memory (LSTM) modules into a siamese network. LSTMs process image parts sequentially so that the spatial connections can be memorized to enhance the discriminative ability of the deep features. Varior et al. also insert a gating function after each convolutional layer to capture effective subtle patterns between image pairs. The above-mentioned methods are effective in learning image similarities in an adaptive manner, but may have efficiency problems under large-scale galleries.
The second type of CNN-based re-ID methods focuses on feature learning, which categorizes the training samples into pre-defined classes and uses the FC descriptor for retrieval [33, 21, 26]. In [33, 34], the classification CNN model is fine-tuned using either the video frames or image bounding boxes to learn a discriminative embedding for pedestrian retrieval. Xiao et al. propose learning generic feature representations from multiple re-ID datasets jointly. To deal with spatial misalignment, Zheng et al. propose the PoseBox structure, similar to the pictorial structure, to learn pose-invariant embeddings. To take advantage of both feature learning and similarity learning, Zheng et al. and Geng et al. combine the contrastive loss and the identification loss to improve the discriminative ability of the learned feature embedding, following the success in face verification. This paper adopts the classification mode, which is shown to produce competitive accuracy without losing efficiency potentials.
PCANet and truncated SVD for CNN. We clarify the difference between SVDNet and several “look-alike” works. PCANet is proposed for image classification. It is featured by cascaded principal component analysis (PCA) filters. PCANet is related to SVDNet in that it also learns orthogonal projection directions to produce the filters. The proposed SVDNet differs from PCANet in two major aspects. First, SVDNet performs SVD on the weight matrix of the CNN, while PCANet performs PCA on the raw data and features. Second, the filters in PCANet are learned in an unsupervised manner, which does not rely on back-propagation as in the case of SVDNet. In fact, SVDNet establishes a stronger connection between CNN and SVD: its parameters are learned through back-propagation and decorrelated iteratively using SVD.
Truncated SVD is widely used for CNN model compression. SVDNet departs from it in two aspects. First, truncated SVD decomposes the weight matrix in FC layers and reconstructs it with several dominant singular vectors and values. SVDNet does not reconstruct the weight matrix but replaces it with an orthogonal matrix, which is the product of the left unitary matrix and the singular value matrix. Second, truncated SVD reduces the model size and testing time at the cost of an acceptable precision loss, while SVDNet significantly improves the retrieval accuracy without impact on the model size.
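The contrast between the two uses of SVD can be sketched in NumPy. The matrix size, the random weights and the truncation rank `r` below are hypothetical; the sketch only shows that a truncated reconstruction approximates the original matrix at lower rank, whereas the SVDNet-style replacement keeps every singular vector and yields mutually orthogonal columns.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))          # hypothetical FC weights: 6-dim input, 4 weight vectors (columns)

# Thin SVD: W = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Truncated SVD (compression): keep the r dominant singular triplets and reconstruct W.
r = 2
W_trunc = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# SVDNet-style replacement: keep all singular vectors and values, drop V.
W_svdnet = U @ np.diag(s)

def max_off_diag(M):
    """Largest absolute off-diagonal entry of the Gram matrix M^T M."""
    G = M.T @ M
    return np.abs(G - np.diag(np.diag(G))).max()

print("rank of truncated W:", np.linalg.matrix_rank(W_trunc))
print("off-diagonal Gram, truncated:", max_off_diag(W_trunc))
print("off-diagonal Gram, SVDNet-style:", max_off_diag(W_svdnet))
```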
Orthogonality in the weight matrix. We note a concurrent work that also aims to orthogonalize the CNN filters, yet our work differs from it. In that work, the regularization effect of orthogonalization benefits the back-propagation of very deep networks, thus improving the classification accuracy. The regularization it proposes may not directly benefit the embedding learning process. In this paper, by contrast, orthogonalization is used to generate decorrelated descriptors suitable for retrieval. Our network may not be suitable for improving classification.
This section describes the structure of SVDNet, its training strategy, and its working mechanism.
SVDNet mostly follows the backbone networks, e.g., CaffeNet and ResNet-50. The only difference is that SVDNet uses the Eigenlayer as the second-to-last FC layer. As shown in Fig. 2, the Eigenlayer contains an orthogonal weight matrix and is a linear layer without bias. The reason for not using bias is that the bias will disrupt the learned orthogonality. In fact, our preliminary experiments indicate that adding the ReLU activation and the bias term slightly compromises the re-ID performance, so we choose to implement the Eigenlayer based on a linear layer. The reason for positioning the Eigenlayer at the second-to-last FC layer, rather than the last one, is that the model fails to converge when orthogonality is enforced on the last FC layer. This might be because the correlation of weight vectors in the last FC layer is determined by the training sample distribution, as explained in the introduction.
During training, the input feature from a previous layer is passed through the Eigenlayer. Its inner products with the weight vectors of the Eigenlayer form the output feature, which is fully connected to the last layer of $c$-dim, where $c$ denotes the number of training classes.
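A minimal NumPy sketch of this forward pass follows. The layer sizes, the random weights, and the presence of a bias in the last classification layer are assumptions made for illustration; only the bias-free, activation-free form of the Eigenlayer comes from the description above.

```python
import numpy as np

def eigenlayer_forward(h, W):
    """Bias-free linear layer: output entry k is the inner product of h with column k of W."""
    return h @ W

rng = np.random.default_rng(0)
d_in, d_out, c = 8, 4, 5                 # hypothetical sizes; c = number of training identities
W_eigen = rng.standard_normal((d_in, d_out))
W_cls = rng.standard_normal((d_out, c))  # last FC layer (classifier); bias assumed here
b_cls = np.zeros(c)

h = rng.standard_normal((2, d_in))       # a batch of two input features
f = eigenlayer_forward(h, W_eigen)       # Eigenlayer output, one candidate descriptor at test time
logits = f @ W_cls + b_cls               # c-dim identity scores, used only for training
```

At test time the classifier is not needed; either `h` or `f` would serve as the descriptor.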
During testing, we extract the learned embeddings for the query and gallery images. In this step, we can use either the input or the output of Eigenlayer for feature representation, as shown in Fig. 2. Our experiment shows that using the two features can achieve similar performance, indicating that the orthogonality of Eigenlayer improves the performance of not only output but also input. The reason is a bit implicit, and we believe it originates from the back-propagation training of CNN, during which the orthogonal characteristic of weight matrix within the Eigenlayer will directly impact the characteristic of its input feature.
The algorithm for training SVDNet is presented in Alg. 1. We first briefly introduce Step 0 and then describe the Restraint and Relaxation Iteration (RRI) (Steps 1, 2 and 3).
Step 0. We first add a linear layer to the network. Then the network is fine-tuned till convergence. Note that after Step 0, the weight vectors in the linear layer are still highly correlated. In the experiment, we will present the re-ID performance of the CNN model after Step 0. Various output dimensions of the linear layer will be evaluated.
Restraint and Relaxation Iteration (RRI). It is the key procedure in training SVDNet. Three steps are involved.
Step 1: Decorrelation. We perform SVD on the weight matrix as follows:
$$ W = U S V^{\top}, \tag{1} $$
where $W$ is the weight matrix of the linear layer, $U$ is the left-unitary matrix, $S$ is the singular value matrix, and $V$ is the right-unitary matrix. After the decomposition, we replace $W$ with $US$. Then the linear layer uses all the eigenvectors of $WW^{\top}$ as weight vectors and is named the Eigenlayer.
Step 2: Restraint. The backbone model is fine-tuned till convergence, but the Eigenlayer is fixed.
Step 3: Relaxation. The fine-tuning goes on for some more epochs with the Eigenlayer unfixed.
After Step 1 and Step 2, the weight vectors are orthogonal, i.e., in an eigen state. But after Step 3, i.e., relaxation training, $W$ shifts away from the eigen state. So the training procedure enters another iteration of “restraint and relaxation”.
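The control flow of RRI can be sketched as follows. This is a schematic under stated assumptions, not the authors' training code: `restraint_step` and `relaxation_step` are hypothetical placeholders for real fine-tuning, the random drift only mimics the Eigenlayer weights moving during relaxation, and the weight matrix is assumed to have at least as many rows as columns so that $US$ keeps the shape of $W$.

```python
import numpy as np

def decorrelate(W):
    """Step 1: replace W = U S V^T with U S, whose columns are mutually orthogonal."""
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    return U * s                      # equivalent to U @ np.diag(s)

def train_rri(W, restraint_step, relaxation_step, num_rri):
    """Run num_rri restraint-and-relaxation iterations on the Eigenlayer weights W."""
    for _ in range(num_rri):
        W = decorrelate(W)            # Step 1: decorrelation
        restraint_step(W)             # Step 2: fine-tune the other layers, W stays fixed
        W = relaxation_step(W)        # Step 3: fine-tune all layers, W is free to drift
    return W

# Hypothetical placeholders standing in for real fine-tuning.
rng = np.random.default_rng(0)

def restraint_step(W):
    pass                              # a real implementation would update the backbone only

def relaxation_step(W):
    return W + 0.01 * rng.standard_normal(W.shape)

W0 = rng.standard_normal((6, 4))
W_final = train_rri(W0, restraint_step, relaxation_step, num_rri=3)
```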
Why is SVD employed? Our key idea is to find a set of orthogonal projection directions based on what the CNN has already learned from the training set. Basically, for a linear layer, a set of basis vectors in the range space of $W$ (i.e., the linear subspace spanned by the column vectors of $W$) is a potential solution. In fact, there exist numerous sets of orthogonal bases. So we decide to use the singular vectors of $W$ as new projection directions and to weight the projection results with the corresponding singular values. That is, we replace $W = USV^{\top}$ with $US$. By doing this, the discriminative ability of the feature representation over the whole sample space will be maintained. We give a mathematical proof as follows:
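Given two images $x_i$ and $x_j$, we denote $h_i$ and $h_j$ as the corresponding features before the Eigenlayer, respectively. $f_i$ and $f_j$ are their output features from the Eigenlayer, i.e., $f = W^{\top} h$. The Euclidean distance $D_{ij}$ between the features of $x_i$ and $x_j$ is calculated by:
$$ D_{ij} = \|f_i - f_j\|_2 = \sqrt{(f_i - f_j)^{\top}(f_i - f_j)} = \sqrt{(h_i - h_j)^{\top} W W^{\top} (h_i - h_j)} = \sqrt{(h_i - h_j)^{\top} U S V^{\top} V S^{\top} U^{\top} (h_i - h_j)}, \tag{2} $$
where $U$, $S$ and $V$ come from the SVD $W = USV^{\top}$. Since $V$ is a unit orthogonal matrix, $V^{\top}V = I$ and Eq. 2 is equal to:
$$ D_{ij} = \sqrt{(h_i - h_j)^{\top} U S S^{\top} U^{\top} (h_i - h_j)}. \tag{3} $$
Eq. 3 shows that when $W = USV^{\top}$ is changed to $US$, $D_{ij}$ remains unchanged. Therefore, the decorrelation step fully preserves the discriminative ability of the fine-tuned CNN model.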
There are some other decorrelation methods in addition to SVD. But these methods do not preserve the discriminative ability of the CNN model. To illustrate this point, we compare SVD with several competitors below.
1. Use the originally learned $W$ (denoted by Orig).
2. Replace $W$ with $US$ (denoted by $US$).
3. Replace $W$ with $U$ (denoted by $U$).
4. Replace $W$ with $UV^{\top}$ (denoted by $UV^{\top}$).
5. Replace $W = QR$ (Q-R decomposition) with $QD$, where $D$ is the diagonal matrix extracted from the upper triangular matrix $R$ (denoted by $QD$).
Comparisons on Market-1501 are provided in Table 1. We replace the FC layer with a 1,024-dim linear layer and fine-tune the model till convergence (Step 0 in Alg. 1). We then replace the fine-tuned $W$ using methods 2 to 5. All four decorrelation methods update $W$ to be an orthogonal matrix, but Table 1 indicates that only replacing $W$ with $US$ retains the re-ID accuracy, while the others degrade the performance.
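A numerical check of this behaviour can be written in a few lines of NumPy. The sketch assumes the candidate replacements are $US$, $U$, $UV^{\top}$ and $QD$, with $f = W^{\top}h$; the matrix sizes and random inputs are hypothetical. It reports, for each candidate, how far its columns are from orthogonal and how much the pairwise Euclidean distances between descriptors change relative to the original $W$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))                 # hypothetical fine-tuned weights (columns = weight vectors)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
Q, R = np.linalg.qr(W)                          # reduced QR: Q is 8x5, R is 5x5 upper triangular
D = np.diag(np.diag(R))                         # diagonal part of R

candidates = {
    "Orig": W,
    "US": U * s,
    "U": U,
    "UV^T": U @ Vt,
    "QD": Q @ D,
}

def max_off_diag(M):
    """Largest absolute off-diagonal entry of the Gram matrix M^T M (0 means orthogonal columns)."""
    G = M.T @ M
    return np.abs(G - np.diag(np.diag(G))).max()

def pairwise_dist(H, M):
    """Euclidean distances between all pairs of descriptors F = H @ M."""
    F = H @ M
    return np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)

H = rng.standard_normal((10, 8))                # hypothetical input features for 10 images
ref = pairwise_dist(H, W)
dist_change = {name: np.abs(pairwise_dist(H, M) - ref).max() for name, M in candidates.items()}
for name, M in candidates.items():
    print(f"{name:5s} off-diag = {max_off_diag(M):.2e}  max distance change = {dist_change[name]:.2e}")
```

Only the $US$ replacement combines orthogonal columns with unchanged pairwise distances, which matches the argument of Eq. 3.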
When does performance improvement happen? As proven above, Step 1 in Alg. 1, i.e., replacing $W = USV^{\top}$ with $US$, does not bring an immediate accuracy improvement, but keeps it unchanged. Nevertheless, after this operation, the model has been pulled away from the original fine-tuned solution, and the classification loss on the training set will increase to a certain extent. Step 2 and Step 3 in Alg. 1 aim to fix this problem. The major effect of these two steps is to improve the discriminative ability of the input feature as well as the output feature of the Eigenlayer (Fig. 2). On the one hand, the restraint step learns the upstream and downstream layers of the Eigenlayer, which still preserves the orthogonal property. We show in Fig. 7 that this step improves the accuracy. On the other hand, the relaxation step makes the model deviate from orthogonality again, but it reaches closer to convergence. This step, as shown in Fig. 7, deteriorates the performance. But within an RRI, the overall performance improves. Interestingly, when educating children, an alternating rhythm of relaxation and restraint is also encouraged.
Until now, we have not provided a metric for evaluating vector correlations. The correlation between two vectors can be estimated by the correlation coefficient. However, to the best of our knowledge, there is no evaluation protocol for diagnosing the overall correlation of a vector set. In this paper, we propose to evaluate the overall correlation as below. Given a weight matrix $W$, we define the Gram matrix of $W$ as
$$ G = W^{\top} W = [g_{ij}]_{k \times k}, \qquad g_{ij} = w_i^{\top} w_j, \tag{4} $$
where $k$ is the number of weight vectors in $W$ ($k$ = 4,096 in FC7 of CaffeNet), $g_{ij}$ ($i, j = 1, \dots, k$) are the entries in $G$, and $w_i$ ($i = 1, \dots, k$) are the weight vectors in $W$. Given $W$, we define $S(\cdot)$ as a metric to denote the extent of correlation between all the column vectors of $W$:
$$ S(W) = \frac{\sum_{i=1}^{k} g_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} |g_{ij}|}. \tag{5} $$
From Eq. 5, we can see that the value of $S(W)$ falls within $[\frac{1}{k}, 1]$. $S(W)$ achieves the largest value 1 only when $W$ is an orthogonal matrix, i.e., $g_{ij} = 0$ if $i \neq j$. $S(W)$ has the smallest value $\frac{1}{k}$ when all the weight vectors are totally the same, i.e., all entries $g_{ij}$ are equal. So when $S(W)$ is close to $\frac{1}{k}$ or is very small, the weight matrix has a high correlation extent. For example, in our baseline, when directly fine-tuning a CNN model (without SVDNet training) using CaffeNet, $S(W)$ of the FC7 layer is very small, indicating that the weight vectors in the FC7 layer are highly correlated. As we will show in Section 4.5, $S(W)$ is an effective indicator of the convergence of SVDNet training.
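The metric of Eq. 5 is straightforward to compute. The following NumPy sketch assumes the weight vectors are stored as the columns of `W`; the function name `correlation_score` and the two toy matrices are illustrative choices, used to exercise the two extreme cases described above.

```python
import numpy as np

def correlation_score(W):
    """S(W) from Eq. 5: trace of the Gram matrix over the sum of absolute Gram entries."""
    G = W.T @ W                       # g_ij = w_i^T w_j, with weight vectors as columns of W
    return np.trace(G) / np.abs(G).sum()

k = 4
W_orth = np.eye(6)[:, :k] * np.array([1.0, 2.0, 3.0, 4.0])   # orthogonal columns with unequal norms
W_same = np.tile(np.array([[1.0], [2.0], [0.5]]), (1, k))     # k identical columns

s_orth = correlation_score(W_orth)    # upper bound: orthogonal columns
s_same = correlation_score(W_same)    # lower bound 1/k: identical columns
print(s_orth, s_same)
```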
Convergence Criteria for RRI. When to stop RRI is a non-trivial problem, especially in application. We employ Eq. 5 to evaluate the orthogonality of $W$ after the relaxation step and find that $S(W)$ increases as the iteration goes on. It indicates that the correlation among the weight vectors in $W$ is reduced step-by-step with RRI. So when $S(W)$ becomes stable, the model converges, and RRI stops. Detailed observations can be accessed in Fig. 7.
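One way to turn "becomes stable" into a concrete stopping rule is sketched below. The tolerance `tol`, the window length `patience`, and the example score sequence are hypothetical; the text above does not specify numeric thresholds.

```python
def rri_converged(scores, tol=1e-3, patience=2):
    """Return True when the last `patience` changes of S(W) are all smaller than `tol`.

    scores: S(W) values measured after each relaxation step, oldest first.
    """
    if len(scores) <= patience:
        return False                          # not enough history to judge stability
    recent = scores[-(patience + 1):]
    return all(abs(b - a) < tol for a, b in zip(recent, recent[1:]))

# Hypothetical S(W) trajectory over successive RRIs.
history = [0.10, 0.50, 0.90, 0.9003, 0.9005]
print(rri_converged(history[:3]), rri_converged(history))
```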
Datasets. This paper uses three datasets for evaluation, i.e., Market-1501, CUHK03 and DukeMTMC-reID [18, 37]. The Market-1501 dataset contains 1,501 identities, 19,732 gallery images and 12,936 training images captured by 6 cameras. All the bounding boxes are generated by the DPM detector. Most experiments relevant to mechanism study are carried out on Market-1501. The CUHK03 dataset contains 13,164 images of 1,467 identities. Each identity is observed by 2 cameras. CUHK03 offers both hand-labeled and DPM-detected bounding boxes, and we use the latter in this paper. For CUHK03, 20 random train/test splits are performed, and the averaged results are reported. The DukeMTMC-reID dataset is collected with 8 cameras and used for cross-camera tracking. We adopt its re-ID version. It contains 1,404 identities (one half for training, and the other for testing), 16,522 training images, 2,228 queries, and 17,661 gallery images. For Market-1501 and DukeMTMC-reID, we use the evaluation packages released with the respective benchmarks.
|Models & Features||dim||Market-1501||CUHK03||DukeMTMC-reID|
For performance evaluation on all the 3 datasets, we use both the Cumulative Matching Characteristics (CMC) curve and the mean Average Precision (mAP).
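As a reminder of what the two measures compute, here is a simplified pure-Python sketch. It assumes each query comes with a gallery ranking already reduced to a list of booleans (True where the gallery image shows the query identity); the two example rankings are made up, and benchmark-specific details such as discarding same-camera or junk images are not modelled.

```python
def cmc_hit(ranked_matches, k):
    """1 if a correct match appears within the top-k of the ranked gallery list, else 0."""
    return int(any(ranked_matches[:k]))

def average_precision(ranked_matches):
    """Mean of the precision values measured at the rank of each correct match."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Two hypothetical queries with their ranked gallery lists.
queries = [
    [True, False, True, False],
    [False, True, False, False],
]
rank1 = sum(cmc_hit(q, 1) for q in queries) / len(queries)
rank2 = sum(cmc_hit(q, 2) for q in queries) / len(queries)
mean_ap = sum(average_precision(q) for q in queries) / len(queries)
print(rank1, rank2, mean_ap)
```

Rank-1 accuracy is the first point of the CMC curve, and mAP averages the per-query average precision over all queries.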
Backbones. We mainly use two networks pre-trained on ImageNet as backbones, i.e., CaffeNet and ResNet-50. When using CaffeNet as the backbone, we directly replace the original FC7 layer with the Eigenlayer, so that the performance gain cannot be attributed to a deeper architecture. When using ResNet-50 as the backbone, we have to insert the Eigenlayer before the last FC layer because ResNet has no hidden FC layer, and the influence of adding a layer to a 50-layer architecture can be neglected. In several experiments on Market-1501, we additionally use VGGNet and a Tiny CaffeNet as backbones to demonstrate the effectiveness of SVDNet on different architectures. The Tiny CaffeNet is generated by reducing the FC6 and FC7 layers of CaffeNet to 1,024 and 512 dimensions, respectively.
Baseline. Following prior practice, baselines using CaffeNet and ResNet-50 are fine-tuned with the default parameter settings except that the output dimension of the last FC layer is set to the number of training identities. The CaffeNet baseline is trained for 60 epochs with a learning rate of 0.001 and then for another 20 epochs with a learning rate of 0.0001. The ResNet baseline is trained for 60 epochs with the learning rate initialized at 0.001 and divided by 10 at epochs 25 and 50. During testing, the FC6 or FC7 descriptor of CaffeNet and the Pool5 or FC descriptor of ResNet-50 are used for feature representation.
On Market-1501, CaffeNet (ResNet-50) achieves rank-1 accuracy of 55.3% (73.8%) with the FC6 (Pool5) descriptor, which is consistent with previously reported results.
Detailed settings. CaffeNet-backboned SVDNet takes 25 RRIs to reach final convergence. For both the restraint stage and the relaxation stage within each RRI except the last one, we use 2000 iterations and fix the learning rate at 0.001. For the last restraint training, we use 5000 iterations (learning rate 0.001) + 3000 iterations (learning rate 0.0001). The batch size is set to 64. ResNet-backboned SVDNet takes 7 RRIs to reach final convergence. For both the restraint stage and the relaxation stage within each RRI, we use 8000 iterations and divide the learning rate by 10 after 5000 iterations. The initial learning rate for the 1st to the 3rd RRI is set to 0.001, and the initial learning rate for the rest RRIs is set to 0.0001. The batch size is set to 32.
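The ResNet schedule described above can be written as a small helper. This is one reading of the text: `rri_index` is assumed to count from 1, `iteration` from 0 within a restraint or relaxation stage, and the drop is assumed to take effect once 5000 iterations have been completed. The function name is hypothetical.

```python
def resnet_rri_learning_rate(rri_index, iteration):
    """Learning rate for a ResNet-backboned restraint or relaxation stage of 8000 iterations."""
    base = 1e-3 if rri_index <= 3 else 1e-4    # RRIs 1 to 3 start at 0.001, later ones at 0.0001
    return base / 10 if iteration >= 5000 else base

for rri_index in (1, 4):
    print(rri_index, resnet_rri_learning_rate(rri_index, 0), resnet_rri_learning_rate(rri_index, 5000))
```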
The effectiveness of SVDNet. We comprehensively evaluate the proposed SVDNet on all the three re-ID benchmarks. The overall results are shown in Table 2.
|Methods||Market-1501 rank-1||Market-1501 mAP||CUHK03 rank-1||CUHK03 mAP|
|ReRank (C)||61.3||46.8||58.5||64.7|
|ReRank (R)||77.1||63.6||64.0||69.3|
|PIE (A)*||65.7||41.1||62.6||67.9|
|PIE (R)*||79.3||56.0||67.1||71.3|
|SOMAnet (VGG)*||73.9||47.9||72.4||-|
|DLCE (C)*||62.1||39.6||59.8||65.8|
|DLCE (R)*||79.5||59.9||83.4||86.4|
|Transfer (G)*||83.7||65.5||84.1||-|
|GAN (R)||67.7||47.1||-||-|
|PAN (R)||71.6||51.5||36.3||34.0|
The improvements achieved on both backbones are significant. When using CaffeNet as the backbone, the rank-1 accuracy on Market-1501 rises from 55.3% to 80.5%, and the mAP rises from 30.4% to 55.9%. On the CUHK03 (DukeMTMC-reID) dataset, the rank-1 accuracy rises by +26.3% (+20.7%), and the mAP rises by +24.7% (+17.5%). When using ResNet as the backbone, the rank-1 accuracy rises by +8.4%, +15.6% and +11.2% on the Market-1501, CUHK03 and DukeMTMC-reID datasets, respectively. The mAP rises by +14.2%, +13.7% and +12.7% correspondingly. Some retrieval examples on Market-1501 are shown in Fig. 3.
Comparison with state of the art. We compare SVDNet with the state-of-the-art methods. Comparisons on Market-1501 and CUHK03 are shown in Table 3. Compared with already published papers, SVDNet achieves competitive performance. We report rank-1 = 82.3%, mAP = 62.1% on Market-1501, and rank-1 = 81.8%, mAP = 84.8% on CUHK03. The re-ranking method (ReRank) is higher than ours in mAP on Market-1501, because re-ranking exploits the relationship among the gallery images and results in a high recall. We speculate that this re-ranking method will also bring improvement for SVDNet. Compared with the unpublished arXiv papers, (some of) our numbers are slightly lower than those of DLCE and Transfer. Both works combine the verification and classification losses, and we will investigate integrating this strategy into SVDNet.
Moreover, the performance of SVDNet based on a relatively simple CNN architecture is impressive. On Market-1501, CaffeNet-backboned SVDNet achieves 80.5% rank-1 accuracy and 55.9% mAP, exceeding other CaffeNet-based methods by a large margin. Additionally, using VGGNet and Tiny CaffeNet as backbones achieves 79.7% and 77.4% rank-1 accuracy, respectively. On CUHK03, CaffeNet-backboned SVDNet even exceeds the ResNet-based competing methods except DLCE (R). This observation suggests that our method can achieve acceptable performance with high computational efficiency.
In Table 4, comparisons on DukeMTMC-reID and on CUHK03 under a recently proposed training/testing protocol (denoted as CUHK03-NP) are summarized. Relatively fewer results are reported because both DukeMTMC-reID and CUHK03-NP have only been recently benchmarked. On DukeMTMC-reID, this paper reports rank-1 = 76.7%, mAP = 56.8%, which is higher than several competing methods including a recent GAN approach. On CUHK03-NP, this paper reports rank-1 = 41.5%, mAP = 37.3%, which is also the highest among all the methods.
We vary the dimension of the output of Eigenlayer. Results of CaffeNet and ResNet-50 are drawn in Fig. 6.
When trained without RRI, the model has no intrinsic difference with a baseline model. It can be observed that the output dimension of the penultimate layer significantly influences the performance. As the output dimension increases, the re-ID performance first increases, reaches a peak and then drops quickly. In this scenario, we find that lowering the dimension is usually beneficial, probably due to the reduced redundancy in filters of FC layer.
The influence of the output dimension on the final performance of SVDNet presents another trend. As the output dimension increases, the performance gradually increases until reaching a stable level, which suggests that our method is immune to harmful redundancy.
This experiment reveals how the re-ID performance changes after each restraint step and each relaxation step, and how SVDNet reaches stable performance step by step. In our experiment, we use 25 epochs for both the restraint phase and the relaxation phase in one RRI. The output dimension of the Eigenlayer is set to 2,048. We exhaustively test the re-ID performance and $S(W)$ values of all the intermediate CNN models. We also increase the training epochs of the baseline models to match those of SVDNet training, in order to compare $S(W)$ of models trained with and without RRI. Results are shown in Fig. 7, from which four conclusions can be drawn.
First, within each RRI, rank-1 accuracy takes on a pattern of “increase and decrease” echoing the restraint and relaxation steps: when $W$ is fixed to maintain orthogonality during restraint training, the performance increases, implying a boost in the discriminative ability of the learned feature. Then during relaxation training, $W$ is unfixed, and the performance stagnates or even decreases slightly. Second, as the RRI goes on, the overall accuracy increases, and reaches a stable level when the model converges. Third, it is reliable to use $S(W)$, the degree of orthogonality, as the convergence criterion for RRI. During RRI training, $S(W)$ gradually increases until reaching stability, while without RRI training, $S(W)$ fluctuates slightly around a relatively low value, indicating high correlation among weight vectors. Fourth, ResNet-backboned SVDNet needs far fewer RRIs to converge than CaffeNet-backboned SVDNet.
In Section 3.3, several decorrelation methods are introduced. We show that only the proposed method of replacing $W$ with $US$ maintains the discriminative ability of the output feature of the Eigenlayer, while all the other three methods lead to performance degradation to some extent. Here, we report their final performance when RRI training is used.
Results on Market-1501 are shown in Table 5. It can be observed that the proposed decorrelating method, i.e., replacing $W$ with $US$, achieves the highest performance, followed by the “$U$”, “$QD$” and “$UV^{\top}$” methods. In fact, the “$UV^{\top}$” method does not bring about observable improvement compared with “Orig”. This experiment demonstrates that not only the orthogonality itself, but also the decorrelation approach, is vital for SVDNet.
In this paper, SVDNet is proposed for representation learning in pedestrian retrieval, or re-identification. Decorrelation is enforced among the projection vectors in the weight matrix of the FC layer. Through iterations of “restraint and relaxation”, the extent of vector correlation is gradually reduced. In this process, the re-ID performance undergoes iterative “increase and decrease”, and finally reaches a stable accuracy. Due to elimination of correlation of the weight vectors, the learned embedding better suits the retrieval task under the Euclidean distance. Significant performance improvement is achieved on the Market-1501, CUHK03, and DukeMTMC-reID datasets, and the re-ID accuracy is competitive with the state of the art.
In future work, we will investigate more extensions of SVDNet to find out more about its working mechanism. We will also apply SVDNet to the generic instance retrieval problem.
Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In CVPR, 2016.
European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, 2016.
Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, 2013.