Biometric recognition, especially face recognition and person re-identification (re-id), has attracted significant research attention as the demand for identification from images captured by CCTV cameras and video surveillance systems grows rapidly. In these scenarios, arbitrary poses and viewpoints of the target object, unwanted occlusions caused by other objects (e.g., hair, sunglasses, or even other individuals for a person; eyelids and eyelashes for irises), and only partially captured images of the target degrade the performance of surveillance systems. Driven by deep learning, biometric identification has made great progress in recent years, and many approaches have been proposed. We consider that these approaches can be divided into two generations.
The first-generation approaches generally assume that each image covers a full view of one object. However, the assumption of matching on full, frontal images does not always hold in real-world scenarios, where only part of the object may be available for identification. For instance, as shown in Fig. 1, a face is easily occluded by accessories such as sunglasses and scarves, and a person on the street is easily occluded by moving obstacles (e.g., cars, other persons) and static ones (e.g., trees, barriers), resulting in partial observations of the target object. Besides, the arbitrary postures frequently seen in video surveillance introduce additional difficulties for real-world biometric identification. Moreover, an object may be positioned partially outside the camera's view, resulting in an arbitrary-size image. These problems reduce the performance of first-generation methods.
The drawbacks of first-generation approaches motivated researchers to design frameworks for partial biometric identification, which gave rise to the second-generation approaches. To match an arbitrary patch of an image, some researchers re-scale the patch to a fixed size. However, performance degrades significantly because of the resulting deformation. Part-based models offer a possible solution for partial biometric identification by dividing an image into multiple patches and then fusing patch-to-patch matching results. However, these methods may fail because they require the presence of certain body components and pre-alignment. To address the alignment problem, external cues are widely used to align persons and faces: human parsing, masks, and skeletons in person re-identification, and landmarks in face recognition. However, over-reliance on external cues makes a biometric system unstable in real-world scenarios. Image alignment is therefore a crucial problem for partial biometric identification.
In this paper, we propose a general, robust framework for biometric matching, shown in Fig. 2, that addresses the problems mentioned above and removes the need for fixed-size inputs across multiple partial biometric identification tasks. In the proposed framework, a Fully Convolutional Network (FCN) generates spatial feature maps of a corresponding size. A feature post-processing unit consisting of global average pooling and pyramid pooling then produces multi-scale spatial features to reduce the influence of scale variation. Motivated by the remarkable success of dictionary learning in face recognition [20, 42, 45], Spatial Feature Reconstruction (SFR) sparsely reconstructs each spatial feature in the multi-scale spatial maps of the probe image from the multi-scale spatial maps of gallery images. In this manner, the model is independent of image size and naturally avoids the time-consuming alignment step. Besides, we introduce an objective function based on the batch hard triplet loss, which encourages the reconstruction error between spatial feature maps of the same identity to be minimized while that of different identities is maximized. The major contributions of our work are summarized as follows:
We propose a robust biometric matching method based on Spatial Feature Reconstruction (SFR) for identification of partially captured objects. The method is alignment-free and flexible with respect to arbitrary image sizes and scales, so it works well for partial biometric identification.
Spatial feature reconstruction combined with pyramid pooling and global feature matching makes SFR more robust to scale variations, which improves performance.
We embed dictionary learning into batch hard triplet learning in a unified framework, and train an end-to-end deep model by minimizing the reconstruction error for coupled images from the same identity and maximizing that of different identities.
This paper builds upon our preliminary work with the following improvements: (1) a pyramid pooling layer is added to improve robustness to scale variations; (2) $\ell_2$ regularization replaces $\ell_1$ regularization in the spatial reconstruction equation so that the reconstruction coefficients can be solved quickly; (3) SFR-embedded batch hard triplet learning replaces pairwise learning to make the spatial features more discriminative; (4) spatial feature reconstruction and global feature matching are fused to improve model performance; and (5) SFR is extended to more person re-id datasets, such as CUHK03 and DukeMTMC-reID, and to the partial face datasets CASIA-NIR-Distance and Partial LFW, which shows that SFR extends well to new tasks.
The remainder of this paper is organized as follows. Sec. 2 reviews related work on person re-id, partial person re-id, and partial face recognition. Sec. 3 introduces the technical details of spatial feature reconstruction and batch hard triplet SFR learning. Sec. 4 presents the experimental results and analyzes accuracy. Sec. 5 discusses the advantages and disadvantages of the proposed approach. Finally, we conclude our work in Sec. 6.
2 Literature Review
Our approach targets multiple biometric identification problems, whereas existing approaches are specialized to one of person re-identification, partial person re-identification, or partial face recognition. We therefore review representative methods for each problem and compare against them in Sec. 4 to show that our approach achieves state-of-the-art performance on these problems without task-specific design or pre-alignment.
2.1 Person Re-identification
Part-based models are widely applied to person re-identification because they achieve strong performance. Zhao et al. proposed Spindle Net, based on human body region guided multi-stage feature decomposition and tree-structured competitive feature fusion. Li et al. designed a Multi-Scale Context-Aware Network (MSCAN) to learn powerful features over the full body and body parts, which captures local context by stacking multi-scale convolutions in each layer. Moreover, instead of using predefined rigid parts, they proposed to learn and localize deformable pedestrian parts using Spatial Transformer Networks (STN) with novel spatial constraints, which relieves some difficulties of part-based representation, e.g., pose variations and background clutter. Besides, Sun et al. proposed a network named Part-based Convolutional Baseline (PCB) that outputs a convolutional descriptor consisting of several part-level features. PCB emphasizes content consistency within each part. However, these methods require the presence of certain body components and pre-alignment.
Mask-guided models provide another solution for person re-identification. As an external cue, a mask helps to remove background clutter at the pixel level and contains body shape information. Song et al. introduced binary segmentation masks to construct synthetic RGB-Mask pairs as inputs, and then designed a mask-guided contrastive attention model (MGCAM) to learn features separately from the body and background regions. Kalayeh et al. proposed a model that integrates human semantic parsing into person re-identification. Similarly, Qi et al. combined source images with person masks as inputs to remove appearance variations (illumination, pose, occlusion, etc.). Although mask-guided approaches can achieve satisfying performance, they rely heavily on an accurate pedestrian segmentation model; without one, performance is poor.
Pose-guided models use the skeleton as an external cue in person re-identification to reduce part misalignment. Each part can be well located using person landmarks. Su et al. proposed a Pose-driven Deep Convolutional (PDC) model to learn improved feature extractors and matching models end to end; PDC explicitly leverages human part cues to alleviate pose variations. Suh et al. proposed a two-stream network that consists of an appearance map extraction stream and a body part map extraction stream. A part-aligned feature map is then obtained by a bilinear mapping of the corresponding local appearance and body part descriptors. Beyond person alignment, some works proposed pose-transferable models that combine pose estimation and Generative Adversarial Networks (GAN) to augment training samples. As with the mask-guided models, pose estimation may fail because of missing body components and severe occlusions.
Attention-based models take advantage of the attention mechanism to extract more discriminative features. In essence, the attention mechanism is a feature selection approach. Li et al. formulated a Harmonious Attention CNN (HA-CNN) model for joint learning of soft pixel attention and hard regional attention along with simultaneous optimisation of feature representations, dedicated to optimising person re-id in uncontrolled (misaligned) images. Si et al. proposed a dual attention mechanism in which intra-sequence and inter-sequence attention strategies are used for feature refinement and feature-pair alignment, respectively. Besides, attentive spatial-temporal networks are widely used in video-based person re-identification.
2.2 Partial Person Re-identification
Some existing methods warp an arbitrary patch of an image to a fixed-size image and then extract fixed-length feature vectors for matching. However, such warping results in undesired deformation. Part-based models are considered a solution to partial person re-id: a patch-to-patch matching strategy is employed to handle occlusions and cases where the target is partially out of the camera's view. Zheng et al. proposed a local patch-level matching model called Ambiguity-sensitive Matching Classifier (AMC), based on dictionary learning with explicit patch ambiguity modeling, and introduced a global part-based matching model called Sliding Window Matching (SWM) that provides complementary spatial layout information. However, the computational cost of AMC+SWM is high because features are calculated repeatedly without further acceleration.
2.3 Partial Face Recognition
Many approaches [14, 20, 41] proposed for partial face recognition are keypoint-based. Hu et al. proposed an alignment-free approach based on the SIFT descriptor, in which the similarities between a probe patch and each face image in the gallery are computed by the instance-to-class (I2C) distance with a sparse constraint. Liao et al. proposed an alignment-free approach called multiple keypoint descriptor SRC (MKD-SRC), in which multiple affine-invariant keypoints are extracted to represent facial features and sparse representation-based classification (SRC) is used for classification. Weng et al. proposed a Robust Point Set Matching (RPSM) method based on the SIFT descriptor, the SURF descriptor, and the LBP histogram for partial face matching. Their approach first aligns the partial face and then computes the similarity between the partial face and a gallery face image. However, the computational cost of each of these algorithms is high, and the alignment step required by RPSM limits its practical applications. Besides, region-based models [3, 9, 23, 24, 25, 32, 33] also offer a solution for partial face recognition. They only require face sub-regions as input, such as the eye, the nose, half (left or right portion) of the face, or the periocular region. He et al. proposed a Dynamic Feature Matching (DFM) model that achieves the highest performance (94.96%) for partial face recognition on the CASIA-NIR-Distance database. However, region-based methods require the presence of certain facial components and pre-alignment. To this end, we propose an alignment-free partial re-identification algorithm that achieves better performance with higher computational efficiency.
3 The Proposed Approach
This section explains the proposed approach, from the network definition to the loss construction. The code is available at https://github.com/lingxiao-he/Partial-Person-ReID.
3.1 Architecture of Deep Network
For a quick view, the feature matching process is shown in Fig. 2. In the proposed network, a Fully Convolutional Network (FCN) extracts spatial features, which are post-processed by a unit with two feature extraction branches: global features are extracted by a global average pooling (GAP) layer, and multi-scale spatial features are extracted by a pyramid pooling layer. In the feature matching step, the multi-scale spatial features are fed to SFR, a dictionary-learning-based reconstruction mechanism that supports matching on arbitrary-sized inputs. Finally, the matching score is the weighted sum of the global matching and SFR matching results.
3.1.1 FCN Encoder
Models pre-trained on ImageNet, such as VGG and ResNet, can be viewed as a stack of multi-stage convolution layers followed by a sequence of fully-connected layers. Here we use the convolution layers of ResNet (the FCN part) as our feature encoder. The parameters of the encoder are fine-tuned during training.
3.1.2 Feature Representation
This part introduces the two branches of the feature representation step. Global average pooling (GAP) produces one feature vector, with one value per channel, that represents the whole picture, while pyramid pooling gives a set of features computed over different receptive fields, which leads to better performance when matching objects of arbitrary size and posture.
Global Feature. The global feature is widely exploited in modern person re-id algorithms. GAP, realized by a single averaging layer, takes the feature maps from the FCN as input and averages each channel over all spatial positions, outputting one vector per image as its global feature. Because existing re-id methods show that the global feature carries useful information for matching, we include it as one of our matching cues.
Pyramid Feature. Invariance to varying person scale is a challenging problem for arbitrary-size person images. It is difficult to align an arbitrary-size person image to a pre-defined scale, so the scales of two person images are easily mismatched, which degrades performance. To this end, we use a pyramid pooling layer to extract multi-scale spatial features and alleviate the influence of scale mismatch.
As shown in Fig. 3, the pyramid pooling (PP) layer consists of multiple average pooling layers with different kernel sizes, and therefore different receptive fields. For an input person image, we implement four pooling layers of different kernel sizes in the pyramid pooling layer. Each pooling layer filters the output spatial features at a stride of 1 to generate multi-scale spatial features. A pooling layer with a small kernel generates dense spatial features, each representing a small local region. A pooling layer with a large kernel generates sparse spatial features, each representing a relatively large source region. Finally, we concatenate these output spatial features to obtain the multi-scale spatial features, which we denote as PP.
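The pyramid pooling step can be illustrated with a minimal NumPy sketch. It assumes a single feature map of shape (C, H, W), no padding, and hypothetical kernel sizes (1, 2, 3, 4), since the kernel sizes used in the paper are not given in the text here. The function names `avg_pool_stride1` and `pyramid_features` are illustrative only and are not taken from the released code.

```python
import numpy as np

def avg_pool_stride1(fmap, k):
    """Average-pool a (C, H, W) feature map with a k x k window at stride 1, no padding."""
    c, h, w = fmap.shape
    out_h, out_w = h - k + 1, w - k + 1
    out = np.zeros((c, out_h, out_w), dtype=np.float64)
    # Sum the k*k shifted views of the map, then divide by the window area.
    for i in range(k):
        for j in range(k):
            out += fmap[:, i:i + out_h, j:j + out_w]
    return out / (k * k)

def pyramid_features(fmap, kernel_sizes=(1, 2, 3, 4)):
    """Return a (C, N) matrix whose columns are the multi-scale spatial features."""
    c, h, w = fmap.shape
    columns = []
    for k in kernel_sizes:
        if k > min(h, w):
            continue  # window larger than the map: skip this scale (sketch assumption)
        pooled = avg_pool_stride1(fmap, k)
        columns.append(pooled.reshape(c, -1))  # one column per spatial position
    return np.concatenate(columns, axis=1)

# Toy map with 2 channels and a 4 x 3 spatial grid.
fmap = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
features = pyramid_features(fmap)
print(features.shape)  # (2, 20): 12 + 6 + 2 positions from kernels 1, 2, 3
```

Because the stride is 1 and no padding is used, small kernels contribute many columns (dense features) and large kernels contribute few (sparse features), and the number of columns varies with the input size.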
3.1.3 Spatial Feature Reconstruction
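Spatial feature reconstruction (SFR) between a pair of person images is introduced in this part. As shown in Fig. 4, for a pair of person images $I$ and $J$ of different sizes, correspondingly sized multi-scale spatial features $X = \mathrm{PP}(\mathrm{conv}(I, \theta))$ and $Y = \mathrm{PP}(\mathrm{conv}(J, \theta))$ are extracted, where $\theta$ denotes the parameters of the FCN, $X = [x_1, \dots, x_N] \in \mathbb{R}^{d \times N}$ and $Y = [y_1, \dots, y_M] \in \mathbb{R}^{d \times M}$. Each spatial feature $x_n$ can then be represented by a linear combination of the columns of $Y$. That is to say, we search for similar spatial features in $Y$ to reconstruct $x_n$. We therefore solve for the linear representation coefficients $w_n$ of $x_n$ with respect to $Y$, where $w_n \in \mathbb{R}^{M \times 1}$. We constrain $w_n$ using the $\ell_2$-norm. The linear representation formulation is then defined as
$$\min_{w_n} \; \|x_n - Y w_n\|_2^2 + \beta \|w_n\|_2^2. \qquad (1)$$
For the $N$ spatial features in $X$, Eq. (1) can be rewritten as
$$\min_{W} \; \|X - Y W\|_F^2 + \beta \|W\|_F^2, \qquad (2)$$
where $W = [w_1, \dots, w_N] \in \mathbb{R}^{M \times N}$, and $\beta$ controls the smoothness of the coding vectors in $W$.
We use the least-squares algorithm to solve for $W$, so $W = (Y^\top Y + \beta I)^{-1} Y^\top X$. Let $P = (Y^\top Y + \beta I)^{-1} Y^\top$; then the spatial feature reconstruction between $X$ and $Y$ can be defined as
$$S(X, Y) = \frac{1}{N} \|X - Y P X\|_F^2, \qquad (3)$$
where $S(X, Y)$ is the Spatial Feature Reconstruction distance between a pair of person images.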
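The reconstruction described above is a ridge-regularized least-squares fit, so it has a closed-form solution. The NumPy sketch below shows one way it could be computed. The names `X` (probe spatial features, shape d x N), `Y` (gallery spatial features, shape d x M) and `beta` (regularization weight), the default `beta=0.01`, and the averaging of the error over the N probe features are all assumptions of this sketch rather than values confirmed by the text.

```python
import numpy as np

def sfr_distance(X, Y, beta=0.01):
    """Reconstruct the columns of X (d, N) from the columns of Y (d, M).

    Returns (mean squared reconstruction error, coefficient matrix W of shape (M, N)).
    """
    M = Y.shape[1]
    # Closed-form ridge solution W = (Y^T Y + beta * I)^{-1} Y^T X.
    W = np.linalg.solve(Y.T @ Y + beta * np.eye(M), Y.T @ X)
    residual = X - Y @ W
    return float(np.sum(residual ** 2) / X.shape[1]), W

rng = np.random.default_rng(0)
Y = rng.standard_normal((8, 5))        # gallery: 5 spatial features of dimension 8
X_same = Y[:, :3]                      # probe features that also occur in the gallery
X_other = rng.standard_normal((8, 3))  # unrelated probe features

d_same, W_same = sfr_distance(X_same, Y)
d_other, _ = sfr_distance(X_other, Y)
print(d_same < d_other)  # True
```

Note that `X` and `Y` only need to share the feature dimension d. N and M may differ, which is what makes the matching independent of image size.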
3.2 Loss Function
Our previous work used a pairwise loss with $\ell_1$ regularization. In this paper we replace it with the proposed batch hard triplet loss with $\ell_2$ regularization, which we found to perform better than the earlier implementation.
3.2.1 Batch Hard Triplet Loss
The goal of triplet embedding learning is to learn a function $f_\theta(\cdot)$. Here, we want to ensure that an image (anchor) of a specific person is closer to all other images (positive) of the same person than it is to any image (negative) of any other person. Thus, we want $D(f_a, f_p) < D(f_a, f_n)$, where $D(\cdot, \cdot)$ is the Euclidean distance between a pair of person images. The triplet loss with $N$ samples is defined as
$$L_{tri}(\theta) = \sum_{i=1}^{N} \big[ m + D(f_{a_i}, f_{p_i}) - D(f_{a_i}, f_{n_i}) \big]_+ ,$$
where $m$ is a margin that is enforced between positive and negative pairs, and $f_a = \mathrm{GAP}(\mathrm{conv}(I_a, \theta))$, $f_p = \mathrm{GAP}(\mathrm{conv}(I_p, \theta))$ and $f_n = \mathrm{GAP}(\mathrm{conv}(I_n, \theta))$.
To select triplets effectively, we adopt the batch hard triplet loss, a modification of the triplet loss. The core idea is to form batches by randomly sampling $P$ subjects and then randomly sampling $K$ images of each subject, resulting in a batch of $PK$ images. For each anchor sample in the batch, we select the hardest positive and hardest negative samples within the batch when forming the triplets for computing the loss, which is called the Batch Hard Triplet Loss:
$$L_{BH}(\theta; \mathcal{X}) = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[ m + \max_{p = 1, \dots, K} D\big(f(x_a^i), f(x_p^i)\big) - \min_{j \neq i,\; n = 1, \dots, K} D\big(f(x_a^i), f(x_n^j)\big) \Big]_+ ,$$
which is defined for a mini-batch $\mathcal{X}$ and where a data point $x_j^i$ corresponds to the $j$-th image of the $i$-th person in the batch. This results in $PK$ terms contributing to the loss. Additionally, the selected triplets can be considered moderate triplets, since they are the hardest within a small subset of the data, which is exactly what is best for learning with the triplet loss.
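The batch hard selection can be sketched in a few lines of NumPy. This is a minimal illustration on global feature vectors only, using the Euclidean distance. The margin value 0.3 is a placeholder (the margin used in the paper is not given in the text here), and averaging over anchors instead of summing is a choice made for this sketch.

```python
import numpy as np

def batch_hard_triplet_loss(features, labels, margin=0.3):
    """features: (B, d) array, labels: (B,) identity ids. Returns the mean hinge loss."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))   # (B, B) pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]   # True where two samples share an identity
    # Hardest positive: the farthest sample with the anchor's identity.
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    # Hardest negative: the closest sample with a different identity.
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return float(np.mean(np.maximum(margin + hardest_pos - hardest_neg, 0.0)))

# Toy batch: 2 identities with 2 images each, interleaved along a line.
features = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
labels = np.array([0, 0, 1, 1])
loss = batch_hard_triplet_loss(features, labels)
print(round(loss, 2))  # 1.3: every anchor has hardest positive 2.0 and hardest negative 1.0
```

Each of the B anchors contributes exactly one term, which matches the statement that the number of loss terms equals the batch size.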
3.2.2 SFR Embedded Batch Hard Triplet
Batch Hard Triplet Spatial Feature Reconstruction is proposed to improve the discriminability of spatial features (see Fig. 5). It encourages the spatial features of the same identity to be similar while pushing apart the spatial features of different identities. Batch Hard Triplet Spatial Feature Reconstruction can be defined as
where is Euclidean distance, is Spatial Feature Reconstruction distance.
The similarity distance thus consists of the global feature matching distance (Euclidean distance) and the local feature matching distance (spatial feature reconstruction).
We employ an alternating optimization method to optimize .
step 1: fix , obtain and . The aim of this step is to solve linear reconstruction coefficient matrix and where and .
step 2: fix and , optimize . We only give the gradients of with respect to and , and the gradients of with respect to and .
3.3 Weighted Feature Matching
This subsection will demonstrate the detail of global feature matching, spatial feature reconstruction matching and the weighted fusion of them. Suppose global feature and spatial feature are generated from subject in the gallery. So the gallery global feature set and spatial feature set are built as respectively:
where , . is the number of spatial features. Given an arbitrary-size probe face image, global feature and spatial feature are generated respectively. Global feature represents the appearance information of person, we directly use the Euclidean distance: to measure the similarity between two images. Then a distance vector of global feature matching for all the subjects is denoted as
Moreover, the spatial feature matching presented above not only captures the spatial layout information of local features but also achieves spatial feature matching without alignment. It is therefore robust to pose and view variations and to person deformation. Meanwhile, the multi-scale spatial feature representation helps to handle scale inconsistency: spatial feature reconstruction can always search the multi-scale spatial feature pool for similar spatial features and reconstruct each probe spatial feature with minimum error. The spatial feature reconstruction distance is represented as
where , and . Then, a distance vector for all the subjects is denoted as
To improve retrieval accuracy, we combine the two distance vectors. The final distance vector can be written as
where is a weight for regulating the effect of global feature matching and spatial feature reconstruction. Finally, the identity of the probe image can be determined by , where is the entry of .
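The fusion and decision steps can be sketched as follows. The convex-combination form, in which a weight `lam` multiplies the SFR distance and `1 - lam` multiplies the global distance, is an assumption of this sketch. It is chosen to agree with the later statement that larger weights favour spatial feature reconstruction. The function names are illustrative only.

```python
def fuse_distances(d_global, d_sfr, lam=0.7):
    """Combine per-subject global and SFR distances into one distance list."""
    if len(d_global) != len(d_sfr):
        raise ValueError("both distance vectors need one entry per gallery subject")
    return [(1.0 - lam) * g + lam * s for g, s in zip(d_global, d_sfr)]

def identify(fused):
    """Return the index of the gallery subject with the smallest fused distance."""
    return min(range(len(fused)), key=fused.__getitem__)

# Toy distances from one probe to three gallery subjects.
d_global = [0.2, 0.5, 0.9]
d_sfr = [0.8, 0.1, 0.7]
fused = fuse_distances(d_global, d_sfr, lam=0.7)
print(identify(fused))  # 1
```

With `lam=0.0` the decision relies on global matching alone, and with `lam=1.0` on spatial feature reconstruction alone. In practice the two distances may need to be brought to comparable ranges before fusion, which this sketch does not do.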
4 Experiments
To verify the performance and generalization ability of the proposed method, this section presents experiments on person re-identification, partial person re-identification, and partial face recognition, in that order.
Table I: Rank-1 accuracy (R1) and mAP (%) on Market1501 under the single query (SQ) and multiple query (MQ) settings and on CUHK03 with labeled and detected bounding boxes. "-" denotes a result that is not reported.
|Category||Method||Market1501 SQ R1||Market1501 SQ mAP||Market1501 MQ R1||Market1501 MQ mAP||CUHK03 Labeled R1||CUHK03 Labeled mAP||CUHK03 Detected R1||CUHK03 Detected mAP|
|Part-based||Spindle (CVPR17)||76.50||-||-||-||-||-||-||-|
|Part-based||MSCAN (CVPR17)||80.31||57.53||86.79||66.70||-||-||-||-|
|Part-based||DLPAP (CVPR17)||81.00||63.40||-||-||-||-||-||-|
|Part-based||AlignedReID (arXiv17)||91.80||79.30||-||-||-||-||-||-|
|Part-based||PCB (arXiv17)||92.30||77.40||-||-||-||-||61.30||57.50|
|Mask-guided||SPReID (CVPR18)||92.54||81.34||-||-||-||-||-||-|
|Mask-guided||MGCAM (CVPR18)||83.79||74.33||-||-||50.14||50.21||46.71||46.87|
|Mask-guided||MaskReID (arXiv18)||90.02||75.30||93.32||82.29||-||-||-||-|
|Pose-guided||PDC (ICCV17)||84.14||63.41||-||-||-||-||-||-|
|Pose-guided||PABR (arXiv18)||90.20||76.00||93.20||82.70||-||-||-||-|
|Pose-guided||Pose-transfer (CVPR18)||87.65||68.92||-||-||33.80||30.50||30.10||28.20|
|Pose-guided||PN-GAN (arXiv17)||89.43||72.58||-||-||-||-||-||-|
|Pose-guided||PSE (CVPR18)||87.70||69.00||-||-||-||-||27.30||30.20|
|Attention-based||DuATM (CVPR18)||91.42||76.62||-||-||-||-||-||-|
|Attention-based||HA-CNN (CVPR18)||91.20||75.70||93.80||82.80||44.40||41.00||41.70||38.60|
|Attention-based||AACN (CVPR18)||85.90||66.87||89.78||75.10||-||-||-||-|
|Previous work||DSR (CVPR18)||91.26||75.62||93.45||82.44||-||-||61.78||56.87|
4.1 Implementation Details and Evaluation Protocol
Our implementation is built on the publicly available PyTorch code. All models in this paper are trained and tested on Linux with a GTX TITAN X GPU. During training, all training samples are re-scaled to a common size, from which the FCN generates the spatial features. No data augmentation is applied to the training samples. The margin and the remaining hyperparameter are set to the values that achieve the best performance. For the batch hard triplet SFR function, one batch consists of 32 subjects with 4 different images each, so each batch yields 128 groups of hard triplets. The model is trained for 400 epochs with the learning rate schedule shown in Fig. 6.
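The batch composition described above (32 subjects with 4 images each) can be sketched with the standard library alone. The function name `sample_pk_batch` is hypothetical, and sampling with replacement for identities that have fewer than 4 images is an assumption of this sketch, not a detail stated in the text.

```python
import random
from collections import defaultdict

def sample_pk_batch(labels, num_ids=32, num_imgs=4, rng=None):
    """Return dataset indices for a batch of num_ids identities with num_imgs images each."""
    rng = rng or random.Random()
    by_id = defaultdict(list)
    for index, label in enumerate(labels):
        by_id[label].append(index)
    # Raises ValueError if the dataset has fewer than num_ids identities.
    chosen_ids = rng.sample(sorted(by_id), num_ids)
    batch = []
    for pid in chosen_ids:
        pool = by_id[pid]
        if len(pool) >= num_imgs:
            batch.extend(rng.sample(pool, num_imgs))
        else:
            # Too few images for this identity: sample with replacement (sketch assumption).
            batch.extend(rng.choices(pool, k=num_imgs))
    return batch

# Toy dataset: 40 identities with 6 images each.
labels = [i // 6 for i in range(40 * 6)]
batch = sample_pk_batch(labels, rng=random.Random(0))
print(len(batch))  # 128
```

Every index in the batch then serves as one anchor, which gives the 128 hard triplets per batch mentioned above.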
For performance evaluation, we employ the standard metrics used in most person ReID literature, namely the cumulative matching curve (CMC) and the mean Average Precision (mAP). To evaluate our method, we re-implement the publicly provided evaluation code in Python.
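For readers unfamiliar with the two metrics, the following simplified sketch computes the CMC value at rank k and the mAP from a query-by-gallery distance matrix. It is not the evaluation code referred to above. In particular, it omits the camera-based filtering of gallery images that the standard ReID protocols apply, and the function names are illustrative only.

```python
def rank_k_and_ap(distances, gallery_ids, query_id, k=1):
    """Return (True if a correct match is in the top k, average precision) for one query."""
    order = sorted(range(len(distances)), key=distances.__getitem__)
    matches = [gallery_ids[i] == query_id for i in order]
    hit = any(matches[:k])
    num_correct, precisions = 0, []
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            num_correct += 1
            precisions.append(num_correct / rank)  # precision at each correct match
    ap = sum(precisions) / len(precisions) if precisions else 0.0
    return hit, ap

def evaluate(dist_matrix, gallery_ids, query_ids, k=1):
    """Return (CMC at rank k, mAP) averaged over all queries."""
    results = [rank_k_and_ap(row, gallery_ids, qid, k)
               for row, qid in zip(dist_matrix, query_ids)]
    cmc_k = sum(hit for hit, _ in results) / len(results)
    mean_ap = sum(ap for _, ap in results) / len(results)
    return cmc_k, mean_ap

gallery_ids = [1, 2, 1, 3]
query_ids = [1, 2]
dist_matrix = [
    [0.1, 0.2, 0.9, 0.5],  # query of identity 1: correct matches at ranks 1 and 4
    [0.3, 0.4, 0.5, 0.6],  # query of identity 2: correct match at rank 2
]
rank1, mean_ap = evaluate(dist_matrix, gallery_ids, query_ids, k=1)
print(rank1, mean_ap)  # 0.5 0.625
```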
4.2 Person Re-identification
Market1501 has 12,936 training and 19,732 testing images with 1,501 identities in total from 6 cameras. The Deformable Part Model (DPM) is used as the person detector. We follow the standard training and evaluation protocols, where 751 identities are used for training and the remaining 750 identities for testing.
CUHK03 consists of 13,164 images of 1,467 subjects captured by two cameras on the CUHK campus. Both manually labelled and DPM-detected person bounding boxes are provided. We adopt the new training/testing protocol since it defines a more realistic and challenging ReID task. In particular, 767 identities are used for training and the remaining 700 identities are used for testing.
DukeMTMC-reID is a subset of the Duke dataset, which consists of 16,522 training images from 702 identities, 2,228 query images and 17,661 gallery images from the other identities. It provides manually labelled person bounding boxes. Here, we follow the setup used in prior work.
Examples from the three datasets are shown in Fig. 7. We set the fusion weight to 0.7 in all person re-identification experiments.
Results on Market1501. Comparisons between SFR and 17 state-of-the-art approaches from four categories (part-based, mask-guided, pose-guided and attention-based models) published since 2017 on Market-1501 are shown in Table I. We conduct both single query and multiple query experiments. The results suggest that the proposed SFR achieves competitive performance on all evaluation criteria under both settings.
It is noted that: (1) The gaps between our results and the baseline model (ResNet-50+Triplet) are significant: SFR raises Rank-1 accuracy from 88.18% to 93.04% under the single query setting and from 92.25% to 94.84% under the multiple query setting, which suggests that spatial features with alignment-free reconstruction are more effective than global feature matching alone. (2) Benefiting from batch hard triplet spatial reconstruction (BHTSR) and pyramid pooling, SFR outperforms our previous work DSR by 1.78% and 1.39% in Rank-1 accuracy under the single and multiple query settings, respectively. BHTSR learns more discriminative local features, and pyramid pooling reduces the influence of scale variations of the detected person. (3) SFR achieves the best Rank-1 accuracy. Thanks to accurate human semantic parsing, SPReID also achieves competitive accuracy. However, SPReID relies heavily on an excellent human semantic parsing model and would fail on arbitrary-size person patches. (4) Although masks and pose estimation provide external cues that improve person re-identification performance relative to methods without external cues, over-reliance on external cues easily makes these methods unstable under partial occlusion and missing body parts. (5) Performance differences among the existing approaches mainly come from the input size (e.g., 224 × 224, 256 × 128 and 384 × 192), the baseline model (e.g., AlexNet, VGGNet, ResNet, and Inception), and the algorithms themselves.
Table II: Rank-1 accuracy (R1) and mAP (%) on DukeMTMC-reID. "-" denotes a result that is not reported.
|Category||Method||R1||mAP|
|Part-based||Spindle (CVPR17)||-||-|
|Part-based||MSCAN (CVPR17)||-||-|
|Part-based||DLPAP (CVPR17)||-||-|
|Part-based||AlignedReID (arXiv17)||-||-|
|Part-based||PCB (arXiv17)||81.80||66.10|
|Mask-guided||SPReID (CVPR18)||84.43||70.97|
|Mask-guided||MGCAM (CVPR18)||-||-|
|Mask-guided||MaskReID (arXiv18)||78.86||61.89|
|Pose-guided||PDC (ICCV17)||-||-|
|Pose-guided||PABR (arXiv18)||84.40||49.30|
|Pose-guided||Pose-transfer (CVPR18)||78.52||56.91|
|Pose-guided||PN-GAN (arXiv17)||73.58||53.20|
|Pose-guided||PSE (CVPR18)||79.80||62.00|
|Attention-based||DuATM (CVPR18)||81.16||62.26|
|Attention-based||HA-CNN (CVPR18)||80.50||63.80|
|Attention-based||AACN (CVPR18)||41.37||-|
|Previous work||DSR (CVPR18)||82.43||68.73|
Results on CUHK03. We only list the results of methods that use the new training/testing protocol. Table I shows results on CUHK03 when detected and manually labeled person bounding boxes are respectively used for both training and testing. SFR achieves 65.86% and 63.86% Rank-1 accuracy with manually labeled bounding boxes and DPM-detected bounding boxes, respectively. SFR outperforms the previous best part-based deep model, PCB, by 2.56% at Rank 1 using detected person bounding boxes. It is also noted that: (1) SFR performs much better than the mask-guided model MGCAM, the pose-guided models Pose-transfer and PSE, and the attention-based model HA-CNN. The gaps are clear: the Rank-1 accuracy of SFR is more than 15% higher than that of the best of these methods, MGCAM, on both labeled (65.86% vs. 50.14%) and detected (63.86% vs. 46.71%) person images. (2) Training with BHTSR and a multi-scale spatial representation from pyramid pooling performs better than DSR, which is trained with single-scale spatial features and the pairwise loss function. Similar results are observed using the mAP metric.
Results on DukeMTMC-reID. Person Re-ID results on DukeMTMC-reID are given in Table II. This dataset is challenging because the person bounding box size varies drastically across camera views, which naturally suits the proposed SFR with its multi-scale representation. All comparison methods except Spindle Net, MSCAN, DLPAP, AlignedReID, MGCAM and PDC have reported results on DukeMTMC-reID. The results show that SFR is 0.40% and 0.27% higher than the second-best ReID model, SPReID, at the Rank 1 and mAP metrics, respectively. Besides, SFR beats the previous work DSR by 2.40% and 2.51% at the Rank 1 and mAP metrics, respectively, which indicates that the multi-scale representation from pyramid pooling can cope with scale variations to some extent.
Influence of the fusion weight. The similarity between two images is measured by combining global feature matching and spatial feature reconstruction. We vary the weight from 0 to 1 in steps of 0.1. The similarity distance contains only the global feature matching distance when the weight is 0, and only the spatial feature reconstruction distance when the weight is 1. Spatial feature reconstruction performs better than global feature matching by 3.95%, 1.82%, 1.07%, 0.93% and 3.95% on Market1501 under the single query and multiple query settings, CUHK03 using labeled and detected person images, and DukeMTMC-reID, respectively. This shows that spatial feature reconstruction is more effective because it discovers detailed information about the persons. It is worth noting that the fusion of global feature matching and spatial feature reconstruction performs better than either distance alone, which suggests that combining the two improves ReID performance. From the results in Fig. 8, SFR achieves the best performance when the weight is set between 0.7 and 0.9, indicating that spatial feature reconstruction is more important than global feature matching.
4.3 Partial Person Re-identification
Partial REID is a specially designed partial person dataset that includes 600 images from 60 people, with 5 full-body images and 5 partial images per person. These images are collected at a university campus from different viewpoints, backgrounds and different types of severe occlusions. The examples of partial persons in the Partial REID dataset are shown in Fig. 9(a). The region in the red bounding box is the partial person image. The testing protocol can be found in the open code.
Partial-iLIDS is a simulated partial person dataset based on iLIDS . The iLIDS contains a total of 238 images of 119 people captured by multiple non-overlapping cameras. Some images in the dataset contain people occluded by other individuals or luggage. Fig. 9(b) shows some examples of individual images from the iLIDS dataset. For the occluded individuals, the partial observation is generated by cropping the non-occluded region of one image of each person to construct the probe set. The non-occluded images of each person are selected to construct a gallery set.
|DSR (CVPR18) ||69.67||88.33||78.33||88.00|
|DSR (CVPR18) ||79.67||91.33||81.00||90.67|
The designed Fully Convolutional Network (FCN) is trained on Market1501. We follow the standard training protocol, where 751 identities are used for training the FCN model. For comparison, we consider multi-task sparse representation (MTSR), which was proposed for partial face modeling, and ambiguity-sensitive matching with sliding window matching (AMC-SWM). Besides, a Resizing model is also used for comparison, in which the person images in the gallery and probe sets are all resized to a fixed size. A 2,048-dimensional feature vector is then extracted by the FCN followed by global average pooling (GAP).
Single-shot experiments (N=1). In the single-shot experiment, only one image per person exists in the probe set. Table III shows the single-shot results, which are similar on Partial REID and Partial-iLIDS. The proposed SFR outperforms the Resizing model, MTSR, and AMC-SWM. We note that: (1) The gaps between SFR and the Resizing model are significant: Rank-1 accuracy increases from 43.87% to 66.20% on Partial REID and from 26.87% to 63.87% on Partial-iLIDS. SFR takes full advantage of the FCN, which operates in a sliding-window manner and outputs feature maps without deformation. These results confirm that deforming person images significantly affects recognition performance. For example, resizing the upper part of a person image to a fixed size stretches and deforms the entire image. (2) AMC-SWM achieves comparable results because combining the local features in AMC with the global features in SWM makes it robust to occlusions and view/pose variations. However, its features are not learned automatically, so it does not perform as well as SFR. (3) Spatial feature reconstruction combined with global feature matching (with the weight set separately for Partial REID and Partial-iLIDS) performs much better than global feature matching alone (ResNet-50+tri), which strongly suggests that local features play a very important role in person re-identification.
In addition, we conduct another experiment in which we exchange the gallery set and the probe set, so that the gallery set contains partial person images and the probe set contains holistic person images. Table IV shows the results under the single-shot setting. The proposed SFR again performs much better than the Resizing model, MTSR, and AMC-SWM, so it remains effective when the gallery set contains only partial person images. Furthermore, compared to the results in Table V, the presence of partial person images in the gallery set influences the performance to some extent.
Multi-shot experiments (N>1). In the multi-shot setting, multiple person images per subject exist in the gallery set. The results are shown in Table V. The trends are similar to those of the single-shot experiment, and all approaches improve significantly over their single-shot results. Specifically, the multi-shot setup improves the performance of SFR: its Rank-1 accuracy on the Partial REID dataset increases from 66.20% to 73.33%, 81.33%, 82.67% and 86.33% across the multi-shot settings.
Influence of the weight. The similarity between two images is measured by combining global feature matching and spatial feature reconstruction. We vary the weight from 0 to 1 in steps of 0.1. The similarity distance contains only the global feature matching distance when the weight is 0, and only the spatial feature reconstruction distance when the weight is 1. The results are shown in Fig. 10. Under the single-shot setting, SFR achieves its best rank-1 accuracy on Partial REID (66.20%) and Partial-iLIDS (63.87%) when the weight is 0.7 and 0.6, respectively. In the multi-shot experiments, SFR performs much better than global feature matching, improving Rank-1 accuracy by more than 10.00%. This shows that spatial feature reconstruction is more effective because it captures detailed information about the persons.
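One plausible way to score a multi-shot gallery is sketched below. The aggregation rule is an assumption, since the text does not state how distances to several gallery images of one subject are combined: here a probe's distance to a subject is taken as the minimum over that subject's gallery images. The function names and toy distances are hypothetical.

```python
import numpy as np

def identity_distances(dist, gallery_ids):
    """Reduce (num_probe, num_gallery_images) distances to one column per identity.

    Assumed rule: the distance to an identity is the minimum over its gallery images.
    """
    identities = np.unique(gallery_ids)
    columns = [dist[:, gallery_ids == pid].min(axis=1) for pid in identities]
    return identities, np.stack(columns, axis=1)

def multi_shot_rank1(dist, probe_ids, gallery_ids):
    """Rank-1 accuracy after aggregating distances per gallery identity."""
    identities, per_identity = identity_distances(dist, gallery_ids)
    predicted = identities[np.argmin(per_identity, axis=1)]
    return float(np.mean(predicted == probe_ids))

# Toy example: two probes, two identities, two gallery images each (made-up numbers).
probe_ids = np.array([0, 1])
gallery_ids = np.array([0, 0, 1, 1])
dist = np.array([[0.9, 0.2, 0.5, 0.6],
                 [0.4, 0.7, 0.8, 0.3]])

multi_shot = multi_shot_rank1(dist, probe_ids, gallery_ids)
# Keeping only the first gallery image of each identity mimics the single-shot case.
single_shot = multi_shot_rank1(dist[:, [0, 2]], probe_ids, gallery_ids[[0, 2]])
```

In this toy case the additional gallery images let both probes find a close match of the correct identity, which mirrors the qualitative trend reported for the multi-shot setting.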
4.4 Partial Face Re-identification
The CASIA-NIR-Distance database is a newly proposed partial face database that contains 4,300 face images from 276 subjects. Half of the images contain the entire facial region of the subject. The partial face images are captured by cameras under near-infrared illumination, with subjects presenting different arbitrary regions of the face. The variations among the partial face images in the CASIA-NIR-Distance database include imaging at different distances, view angles, scales, and illumination. Fig. 11 (second row) shows some examples of partial faces in the CASIA-NIR-Distance database and the acquisition device.
Partial LFW, a simulated partial face database based on the LFW database, is also used for evaluation. The LFW database contains 13,233 images from 5,749 individuals. Face images in LFW have large variations in pose, illumination, and expression, and may be partially occluded by other individuals' faces, sunglasses, etc.
The VGGFace model is used as the base model. Its fully-connected layers are discarded so that it becomes a Fully Convolutional Network (FCN). Closed-set experiments are conducted on the CASIA-NIR-Distance and Partial LFW datasets, containing images of 276 and 1,000 subjects, respectively. One image per subject (N=1) is selected to construct the gallery set, and one different image per subject is used to construct the probe set. Because some subjects in CASIA-NIR-Distance have no holistic face images captured by the iris recognition system, partial face images may exist in the gallery set, which makes accurate recognition more difficult. In this experiment, the setting of parameters is that = 0.8. For the Partial LFW dataset, the gallery set contains 1,000 holistic face images from 1,000 individuals. The probe set shares the same subjects with the gallery set, but contains different images of each individual. Gallery face images are re-scaled to a fixed size. To generate partial face images for the probe set, a region of random size is cropped at a random position from a holistic face image. Fig. 11 (first row) shows some partial face images and holistic face images in Partial LFW.
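The probe generation step for Partial LFW can be illustrated with the sketch below, which crops a region of random size at a random position. The function name, the `min_fraction` lower bound on the crop size, and the 224 x 224 stand-in image are assumptions made for illustration. The text does not specify how the crop size is sampled.

```python
import numpy as np

def random_partial_crop(image, rng, min_fraction=0.3):
    """Crop a random-size region at a random position from an (H, W, C) image.

    min_fraction is a hypothetical lower bound on the crop side length,
    expressed as a fraction of the corresponding image side.
    """
    height, width = image.shape[:2]
    crop_h = int(rng.integers(max(1, int(min_fraction * height)), height + 1))
    crop_w = int(rng.integers(max(1, int(min_fraction * width)), width + 1))
    top = int(rng.integers(0, height - crop_h + 1))
    left = int(rng.integers(0, width - crop_w + 1))
    return image[top:top + crop_h, left:left + crop_w]

rng = np.random.default_rng(0)
holistic_face = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for a face image
partial_face = random_partial_crop(holistic_face, rng)
```

Because the crop is returned at its native size, it can be fed to a fully convolutional network without the re-scaling that a fixed-size model would require.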
The proposed SFR is compared against existing partial face algorithms including MRCNN, MKDSRC-GTP, RPSM, I2C, and DFM. MKDSRC-GTP, RPSM and DFM are implemented using the source code provided by the authors. I2C is implemented by ourselves according to the paper. Table VI and Table VII show the performance of the proposed SFR algorithm on the CASIA-NIR-Distance and Partial LFW datasets, respectively. The rank-1 matching accuracies achieved on the two databases are 96.74% and 46.30%, which clearly shows that our algorithm performs much better than the traditional algorithms for partial face recognition. The reasons are analyzed as follows: (1) The multi-scale spatial features in SFR take full advantage of local and global information, and can therefore represent a partial face more robustly than keypoint-based algorithms (MKDSRC-GTP, RPSM, and I2C). (2) RPSM, which is based on the SIFT descriptor, the SURF descriptor and the LBP histogram, first aligns the partial faces and then computes the similarity between the partial face and a gallery face image. However, the required alignment step limits the practical applications of RPSM, and the same holds for MRCNN. (3) Although I2C does not require alignment, the similarity between a probe patch and each face image in the gallery is computed by the instance-to-class (I2C) distance with a sparse constraint. MKDSRC-GTP likewise uses only local features, which leads to poor performance. From these perspectives, the alignment-free property and the more distinctive and robust descriptions in SFR contribute to the improvement in partial face recognition and give it a large advantage over existing partial face recognition approaches.
The experiments on person, partial person and partial face re-identification datasets demonstrate the extensibility of our approach. On each dataset, the proposed approach, SFR, outperforms other state-of-the-art approaches, including part-based, mask-guided, pose-guided and attention-based models. This is expected, as these methods require either alignment or external cues, which makes them unstable because they rely on segmentation or pose estimation. In contrast, SFR relies on both global and spatial features, which makes it alignment-free, more robust to scale variation, and independent of external cues.
In addition, a model with SFR embedded achieves remarkable performance without requiring a fixed-size input image, which is demanded by AMC-SWM, MTSR and the Resizing model. Formulated as dictionary learning, SFR is designed for matching a pair of images of different sizes, which frees the model to address re-id problems with partial images of arbitrary size.
Nevertheless, the proposed approach also has a drawback. Compared to global feature matching, SFR incurs a higher computational cost when solving for the reconstruction coefficients. Therefore, we consider accelerating the proposed approach as future work.
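The idea of matching two feature maps of different sizes by linear reconstruction, and the extra cost of solving for the coefficients, can be sketched as follows. This is not the solver used by the proposed method: ridge-regularized least squares is swapped in here because it has a closed form, and the `ridge` value, the function name and the random stand-in features are assumptions.

```python
import numpy as np

def reconstruction_distance(probe_feats, gallery_feats, ridge=1e-3):
    """Mean squared error of reconstructing probe features from gallery features.

    probe_feats:   (dim, num_probe_positions) spatial features of the probe.
    gallery_feats: (dim, num_gallery_positions) spatial features of the gallery image.
    Ridge-regularized least squares is an assumed stand-in for the actual solver.
    """
    gram = gallery_feats.T @ gallery_feats
    rhs = gallery_feats.T @ probe_feats
    coeffs = np.linalg.solve(gram + ridge * np.eye(gram.shape[0]), rhs)
    residual = probe_feats - gallery_feats @ coeffs
    return float(np.mean(np.sum(residual ** 2, axis=0)))

rng = np.random.default_rng(0)
gallery = rng.standard_normal((16, 12))       # 12 spatial positions, 16-dim features
probe_same = gallery[:, :5]                   # a "partial view": 5 of the gallery positions
probe_other = rng.standard_normal((16, 5))    # unrelated features of the same size

d_same = reconstruction_distance(probe_same, gallery)
d_other = reconstruction_distance(probe_other, gallery)
```

The probe and gallery maps have different numbers of spatial positions, yet a distance is still defined. The linear system solved per image pair is the additional cost relative to a single vector comparison.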
In this paper, we have proposed a novel approach called Spatial Feature Reconstruction (SFR) to remove the fixed-size input limitation. The proposed spatial feature reconstruction method provides a feasible scheme to reconstruct the probe spatial feature map linearly from a gallery spatial feature map. In addition, a pyramid pooling layer combined with a global pooling layer reduces the influence of scale variation and avoids the alignment step required by many other approaches. Furthermore, we embed SFR into the batch hard triplet loss function to learn more discriminative features, minimizing the reconstruction error for an image pair from the same target and maximizing that of an image pair from different targets.
Experimental results on several public datasets, including Market1501, CUHK03, and DukeMTMC-reID, validate the effectiveness and efficiency of SFR. Additionally, the extensibility of the proposed method is demonstrated by state-of-the-art results on two partial person datasets, Partial REID and Partial-iLIDS, and on two partial face recognition datasets, CASIA-NIR-Distance and Partial LFW. Finally, the best performance on many datasets suggests that combining global feature matching with SFR is the better solution, as it exploits the complementarity of the two feature matching models.
-  T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 28(12):2037–2041, 2006.
-  H. Bay, T. Tuytelaars, and L. Van Gool. Surf: Speeded up robust features. In European Conference on Computer Vision (ECCV), pages 404–417, 2006.
-  I. Cheheb, N. Al-Maadeed, S. Al-Madeed, A. Bouridane, and R. Jiang. Random sampling for patch-based face recognition. In IEEE International Workshop on Biometrics and Forensics (IWBF), pages 1–5, 2017.
-  D. Chen, Z. Yuan, B. Chen, and N. Zheng. Similarity learning with spatial constraints for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1268–1277, 2016.
-  D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1335–1344, 2016.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
-  S. Gutta, V. Philomin, and M. Trajkovic. An investigation into the use of partial-faces for face recognition. In FG, pages 33–38, 2002.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  L. He, H. Li, Q. Zhang, and Z. Sun. Dynamic feature learning for partial face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  L. He, H. Li, Q. Zhang, Z. Sun, and Z. He. Multiscale representation for partial face recognition under near infrared illumination. In Proceedings of the IEEE International Conference on Biometrics Theory, Applications and Systems (BTAS), 2016.
-  L. He, J. Liang, H. Li, and Z. Sun. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7073–7082, 2018.
-  J. Hu, J. Lu, and Y.-P. Tan. Robust partial face recognition using instance-to-class distance. In Visual Communications and Image Processing (VCIP), 2013.
-  G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
-  M. M. Kalayeh, E. Basaran, M. Gökmen, M. E. Kamasak, and M. Shah. Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1062–1071, 2018.
-  D. Li, X. Chen, Z. Zhang, and K. Huang. Learning deep context-aware features over body and latent parts for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 384–393, 2017.
-  S. Li, S. Bak, P. Carr, and X. Wang. Diversity regularized spatiotemporal attention for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 369–378, 2018.
-  W. Li, X. Zhu, and S. Gong. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  S. Liao, A. K. Jain, and S. Z. Li. Partial face recognition: Alignment-free approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(5):1193–1205, 2013.
-  J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu. Pose transferrable person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4099–4108, 2018.
-  D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.
-  H. Neo, C. Teo, and A. B. Teoh. Development of partial face recognition framework. In International Conference on Computer Graphics, Imaging and Visualization (CGIV), pages 142–146, 2010.
-  W. Ou, X. Luan, J. Gou, Q. Zhou, W. Xiao, X. Xiong, and W. Zeng. Robust discriminative nonnegative dictionary learning for occluded face recognition. Pattern Recognition Letters, 2017.
-  K. Pan, S. Liao, Z. Zhang, S. Z. Li, and P. Zhang. Part-based face recognition using near infrared images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
-  U. Park, A. Ross, and A. K. Jain. Periocular biometrics in the visible spectrum: A feasibility study. In Biometrics: Theory, Applications, and Systems, 2009. BTAS’09. IEEE 3rd International Conference on, pages 1–6. IEEE, 2009.
-  O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
-  L. Qi, J. Huo, L. Wang, Y. Shi, and Y. Gao. Maskreid: A mask based deep ranking neural network for person re-identification. arXiv preprint arXiv:1804.03864, 2018.
-  X. Qian, Y. Fu, W. Wang, T. Xiang, Y. Wu, Y.-G. Jiang, and X. Xue. Pose-normalized image generation for person re-identification. arXiv preprint arXiv:1712.02225, 2017.
-  E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pages 17–35. Springer, 2016.
-  M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  K. Sato, S. Shah, and J. Aggarwal. Partial face recognition using radial basis function networks. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG), pages 288–293, 1998.
-  M. Savvides, R. Abiantun, J. Heo, S. Park, C. Xie, and B. Vijayakumar. Partial & holistic face recognition on frgc-ii data using support vector machine. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2006.
-  J. Si, H. Zhang, C.-G. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang. Dual attention matching network for context-aware feature sequence based person re-identification. arXiv preprint arXiv:1803.09937, 2018.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  C. Song, Y. Huang, W. Ouyang, and L. Wang. Mask-guided contrastive attention model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1179–1188, 2018.
-  C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian. Pose-driven deep convolutional model for person re-identification. In IEEE International Conference on Computer Vision (ICCV), pages 3980–3989, 2017.
-  Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee. Part-aligned bilinear representations for person re-identification. arXiv preprint arXiv:1804.07094, 2018.
-  Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling. arXiv preprint arXiv:1711.09349, 2017.
-  G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou. Learning discriminative features with multiple granularities for person re-identification. arXiv preprint arXiv:1804.01438, 2018.
-  R. Weng, J. Lu, and Y.-P. Tan. Robust point set matching for partial face recognition. IEEE Transactions on Image Processing (TIP), 25(3):1163–1176, 2016.
-  J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 31(2):210–227, 2009.
-  J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang. Attention-aware compositional network for person re-identification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
-  L. Zhang, M. Yang, and X. Feng. Sparse representation or collaborative representation: Which helps face recognition? In IEEE International Conference on Computer Vision (ICCV), 2011.
-  X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun. Alignedreid: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.
-  H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1077–1085, 2017.
-  L. Zhao, X. Li, J. Wang, and Y. Zhuang. Deeply-learned part-aligned representations for person re-identification. In IEEE International Conference on Computer Vision (ICCV), 2017.
-  L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In IEEE International Conference on Computer Vision (ICCV), 2015.
-  W.-S. Zheng, S. Gong, and T. Xiang. Person re-identification by probabilistic relative distance comparison. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
-  W.-S. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, and S. Gong. Partial person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 4678–4686, 2015.
-  Z. Zheng, L. Zheng, and Y. Yang. Pedestrian alignment network for large-scale person re-identification. arXiv preprint arXiv:1707.00408, 2017.
-  Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 3, 2017.
-  Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan. See the forest for the trees: Joint spatial and temporal recurrent neural networks for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6776–6785, 2017.