Average Biased ReLU Based CNN Descriptor for Improved Face Retrieval
The convolutional neural networks (CNN) like AlexNet, GoogLeNet, VGGNet, etc. have proven to be very discriminative feature descriptors for many computer vision problems. A CNN model trained over one dataset performs reasonably well over another dataset of similar type and outperforms hand-designed feature descriptors. The Rectified Linear Unit (ReLU) layer discards some information in order to introduce non-linearity. In this paper, it is proposed that the discriminative ability of deep image representation using a trained model can be improved by applying the Average Biased ReLU (AB-ReLU) at the last few layers. Basically, AB-ReLU improves the discriminative ability in two ways: 1) it exploits some of the discriminative negative information that ReLU discards, and 2) it blocks the irrelevant positive information that ReLU passes. The VGGFace model, already trained in MatConvNet over the VGG-Face dataset, is used as the feature descriptor for face retrieval over other face datasets. The proposed approach is tested over six challenging unconstrained and robust face datasets, such as PubFig, LFW, PaSC, AR, etc., in a retrieval framework. It is observed that AB-ReLU consistently performs better than ReLU using the VGGFace pre-trained model over face datasets.
The image descriptors are the fundamental signature for image matching. Most of the research in the early days was focused on the design of hand-crafted descriptors such as the Scale Invariant Feature Transform (SIFT) [1], the Local Binary Pattern (LBP) [2], etc. The hand-designed descriptors have shown very promising performance in several computer vision problems such as image matching [3],
[4], [5], [6], texture classification [7], [8], [9], [10], [11], biomedical image analysis [12], [13], [14], object detection [15], [16], etc. Several descriptors have also been proposed for face retrieval, such as [5], [17], [18], [19], [20], [21]. The main drawback of the hand-designed descriptors is their lower discriminative power, which stems from their data-independent nature. Over the last few years, deep convolutional neural networks have attracted the full attention of researchers in the computer vision community. The first remarkable work was done in 2012 by Krizhevsky et al. with the AlexNet [22]
for the ImageNet classification task
[23]. After AlexNet, several CNN models were proposed for ImageNet classification, such as VGGNet [24], GoogLeNet [25], and ResNet [26]. Over time, the networks became deeper and deeper, from AlexNet (8 stages) to VGGNet (16 and 19 stages) to GoogLeNet (22 stages) to ResNet (152 stages). Deep neural networks have also been proposed for the face recognition task. Some recent and renowned deep learning based approaches are DeepFace
[27], FaceNet [28], VGGFace [29], Bilinear CNN (BCNN) [30], Deep CNN (DCNN) [31], and All-in-One CNN [32], among others, for face recognition. DeepFace used a nine-layer deep neural network for face representation [27]. The number of parameters in DeepFace is very high as it does not use weight sharing. DeepFace reported an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) database [27], [33]. FaceNet is also proposed as a feature extractor for face recognition and clustering [28]. It uses a deep convolutional network as the feature embedding. FaceNet reported 99.63% accuracy over the LFW face database. VGGFace utilized convolutional neural network (CNN) based end-to-end learning for face recognition [29]. It is trained over the very large scale VGGFace database with 2.6M images from 2.6K subjects. RoyChowdhury et al. used the Bilinear CNN (BCNN) [34] for the face recognition task [30]. They converted the standard pre-trained VGGFace model into a BCNN without any extra training cost. They reported 89.5% rank-1 recall using BCNN over the IJB-A benchmark [30], [35]
. The DCNN is made with 18 layers consisting of 10 convolution layers, 5 pooling layers, 1 dropout layer, 1 fully connected layer, and 1 softmax layer
[31]. It is trained over the CASIA-WebFace dataset and evaluated over the IJB-A (97.70% rank-10 accuracy) and the LFW (97.45% accuracy) datasets [31], [35], [33]. Very recently, Ranjan et al. proposed the All-in-One CNN for facial analysis [32]. It is a multi-purpose network tackling face detection, face alignment, pose estimation, gender recognition, smile detection, age estimation, and face recognition through a single network. The All-in-One CNN utilizes a multi-task learning framework by regularizing the shared parameters of the CNN
[32]. In this work, the VGGFace model is used as the feature extractor for the face retrieval experiments. Pre-trained models have also been used for several other tasks in computer vision. Marmanis et al. used a pre-trained CNN model (trained over the ImageNet database) as the initial feature extractor for an Earth observation classification task [36]. They observed 92.4% accuracy over the UC Merced Land Use benchmark, which is far better than the hand-designed approaches [36]. Liu et al. fused CNN features with hand-designed features and experimented with content-based image retrieval [37]. It is also reported that if a pre-trained CNN model trained over photos is directly applied at a more abstract level, such as sketches, the performance degrades drastically [38]. Very recently, Bansal et al. claimed that a network trained over still face images can also be used effectively for face verification in videos [39]. CNN models pre-trained over the ImageNet database have also been successfully applied in medical image applications for mammogram analysis [40]. Schwarz et al. also used pre-trained CNN features for RGB-D object recognition and pose estimation [41]. The CNN has also shown promising performance for event detection in videos, even though it is actually trained over an image classification database [42]. Karpathy and Fei-Fei used a CNN pre-trained on ImageNet [23] for sentence generation from images [43]. A trained CNN model is fine-tuned for cross-scene crowd counting by Zhang et al. [44]. Pre-trained CNN models are also used for content-based image retrieval [45]
. Some researchers have also adopted transfer learning to utilize a network trained in one domain in another domain, such as Deep transfer
[46] and Residual transfer [47]. Very recently, Ge et al. used pre-trained VGG convolutional neural networks for remote-sensing image retrieval [48]. In this paper also, a pre-trained network is used for the face retrieval task. Some researchers have also focused on different layers of the CNN model. Wen et al. used the center loss function instead of the softmax loss function for face recognition
[49]. The ReLU discards the negative values, which actually represent the absence of events and might be useful to improve the discriminative ability. In order to get rid of the negative values of ReLU, a Rectified Factor Network is introduced in [50]. A Parametric Rectified Linear Unit (PReLU) is used by He et al. as a generalization of the Rectified Linear Unit (ReLU) by turning the slope of the negative region into a parameter of each neuron
[51]. The ReLU also has the "dying gradient" problem, where the gradient flow through a unit can be zero forever [22]. Leaky ReLU (LReLU) tries to fix the dying gradient problem during training by considering a small negative slope [52]. The LReLU is extended to the randomized leaky rectified linear unit (RReLU) by considering a random small negative slope [53]. An exponential linear unit (ELU) is proposed by Clevert et al., which also considers the ReLU's negative values [54]. Most of the existing rectifier units do not consider the negative values, which might be important. These rectifier units are also not dependent upon the input data. In this paper, a new data dependent rectifier unit is proposed to boost the discriminative power of the VGGFace descriptor at testing time. The main contributions of this paper are as follows:
The suitability of using a pre-trained CNN model over other databases of similar type is explored.
A new data dependent Average Biased Rectified Linear Unit (AB-ReLU) is proposed to boost the discriminative power of the pre-trained network at testing time.
The suitability of the proposed AB-ReLU is tested at different layers of the network.
The image retrieval experiments are conducted over six challenging face datasets.
The rest of the paper is organized as follows: Section 2 reviews the VGGFace model and the rectified linear unit; Section 3 proposes a new data dependent rectified linear unit and the modified VGGFace descriptor; Section 4 presents the experimental setup; Section 5 presents the results and discussions; and finally, Section 6 presents the concluding remarks.
In this section, first the original VGGFace model used in this work is described in detail and then the original rectified linear unit is presented.
Table I: Layers of the VGGFace model [29].

No. | Layer Name | Layer Type | Filter | Volume Size |
---|---|---|---|---|
0 | input | Image | n/a | 224,3 |
1 | conv1_1 | Conv | f:3,3,64, s:1, p:1 | 224,64 |
2 | relu1_1 | Relu | n/a | 224,64 |
3 | conv1_2 | Conv | f:3,64,64, s:1, p:1 | 224,64 |
4 | relu1_2 | Relu | n/a | 224,64 |
5 | pool1 | Pool | f:2, s:2, p:0 | 112,64 |
6 | conv2_1 | Conv | f:3,64,128, s:1, p:1 | 112,128 |
7 | relu2_1 | Relu | n/a | 112,128 |
8 | conv2_2 | Conv | f:3,128,128, s:1, p:1 | 112,128 |
9 | relu2_2 | Relu | n/a | 112,128 |
10 | pool2 | Pool | f:2, s:2, p:0 | 56,128 |
11 | conv3_1 | Conv | f:3,128,256, s:1, p:1 | 56,256 |
12 | relu3_1 | Relu | n/a | 56,256 |
13 | conv3_2 | Conv | f:3,256,256, s:1, p:1 | 56,256 |
14 | relu3_2 | Relu | n/a | 56,256 |
15 | conv3_3 | Conv | f:3,256,256, s:1, p:1 | 56,256 |
16 | relu3_3 | Relu | n/a | 56,256 |
17 | pool3 | Pool | f:2, s:2, p:0 | 28,256 |
18 | conv4_1 | Conv | f:3,256,512, s:1, p:1 | 28,512 |
19 | relu4_1 | Relu | n/a | 28,512 |
20 | conv4_2 | Conv | f:3,512,512, s:1, p:1 | 28,512 |
21 | relu4_2 | Relu | n/a | 28,512 |
22 | conv4_3 | Conv | f:3,512,512, s:1, p:1 | 28,512 |
23 | relu4_3 | Relu | n/a | 28,512 |
24 | pool4 | Pool | f:2, s:2, p:0 | 14,512 |
25 | conv5_1 | Conv | f:3,512,512, s:1, p:1 | 14,512 |
26 | relu5_1 | Relu | n/a | 14,512 |
27 | conv5_2 | Conv | f:3,512,512, s:1, p:1 | 14,512 |
28 | relu5_2 | Relu | n/a | 14,512 |
29 | conv5_3 | Conv | f:3,512,512, s:1, p:1 | 14,512 |
30 | relu5_3 | Relu | n/a | 14,512 |
31 | pool5 | Pool | f:2, s:2, p:0 | 7,512 |
32 | fc6 | Conv | f:7,512,4096, s:1, p:0 | 1,4096 |
33 | relu6 | Relu | n/a | 1,4096 |
34 | fc7 | Conv | f:1,4096,4096, s:1, p:0 | 1,4096 |
35 | relu7 | Relu | n/a | 1,4096 |
In the Filter column, f, s, and p represent the filter size, stride, and padding, respectively. In the Volume Size column, the first value is the spatial dimension of the volume and the second value is the depth of the volume, i.e., 224,3 represents a volume of size 224×224×3. The last fully connected layer and softmax layer are not shown because the output of 'relu7' is considered as the 4096-dimensional feature vector in this work.
In this work, the original pre-trained VGGFace model is taken from the MatConvNet library [55] released by the University of Oxford (http://www.robots.ox.ac.uk/~vgg/software/vgg_face/). This model is based on the CNN implementation of the VGG-Very-Deep-16 architecture as described in [29]. It is trained over the VGGFace database (http://www.robots.ox.ac.uk/~vgg/data/vgg_face/), which consists of 2.6M face images from 2,622 subjects. The layers of the VGGFace model are summarized in Table I. In this table, the last fully connected layer and softmax layer of VGGFace are not listed as they are not required in this work. The output of 'relu7' is considered as the VGGFace feature descriptor. The filter size, stride, and padding are mentioned in the Filter column with fields f, s, and p, respectively. A filter size f:3,128,256 means a total of 256 filters of dimension 3×3 and depth 128. Similarly, a volume size 112,64 means a 3-D volume of spatial dimension 112×112 with depth 64. In this work, the changes are made in selected rectified linear unit (ReLU) layers, especially in the last few layers, as described in the next section.
The rectified linear unit (ReLU) in a neural network is used to introduce the non-linearity [22]. The ReLU simply works like a filter: it blocks the negative signals and passes the positive signals. Consider $X^l$ is the input volume to ReLU at layer $l$ of a network and $Y^l$ is the output volume of ReLU for layer $l$. Suppose the input volume is $d$-dimensional and $s_k$ is the size of the input volume in dimension $k$. Then, an element at position $(i_1, i_2, \ldots, i_d)$ of the output volume is computed from the corresponding element of the input volume as follows,
$$Y^l(i_1, i_2, \ldots, i_d) = \begin{cases} X^l(i_1, i_2, \ldots, i_d), & \text{if } X^l(i_1, i_2, \ldots, i_d) > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
where $Y^l$ is $d$-dimensional, $s_k$ is the size of $Y^l$ in dimension $k$, and $i_k \in [1, s_k]$. The ReLU function is illustrated in Fig. 1. It is linear in the positive range, whereas zero in the negative range. The main drawback with ReLU is that it passes all positive values even if they might not be important and blocks all negative values even if they might be important. This problem is addressed in the next section by introducing a data dependent ReLU.
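For reference, the following is a minimal NumPy sketch of the element-wise ReLU of Equation (1); the array name x is a placeholder for the input volume $X^l$ and the example values are illustrative only.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Element-wise ReLU: pass positive activations, zero out the rest (Eq. (1))."""
    return np.where(x > 0.0, x, 0.0)

# Toy 1-D "volume" of activations.
x = np.array([-1.5, -0.2, 0.0, 0.3, 2.1])
print(relu(x))  # [0.  0.  0.  0.3 2.1]
```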
In this section, first a data dependent average biased rectified linear unit (AB-ReLU) is proposed, then it is applied to the existing pre-trained VGGFace model [29] to create a more discriminative face descriptor, and finally the AB-ReLU based VGGFace descriptor is used for face retrieval.
It can be noticed from the ReLU in the previous section that it is not data dependent; it vanishes all the negative signals and passes all the positive signals, which can lead to less discriminative features. In this section, this problem is resolved by introducing a new data dependent ReLU named the average biased rectified linear unit (AB-ReLU). The AB-ReLU is data dependent as it exploits the average property of the input volume. It also works like a filter and passes only those signals which satisfy the average biased criterion. The average biased criterion ensures that only important features get passed, irrespective of their sign. Suppose AB-ReLU is used in a network at layer $l$, and $X^l$ and $Y^l$ are the input volume and output volume for this layer, respectively. Then, an element of the output volume is given by the following equation,
$$Y^l(i_1, i_2, \ldots, i_d) = \begin{cases} X^l(i_1, i_2, \ldots, i_d) - \beta, & \text{if } X^l(i_1, i_2, \ldots, i_d) - \beta > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
where $(i_1, i_2, \ldots, i_d)$ represents the position of an element, $d$ is the dimension of $X^l$, $s_k$ is the size of $X^l$ in dimension $k$, $i_k \in [1, s_k]$, and $\beta$ is the average biased factor defined as follows,
$$\beta = \frac{\Lambda}{p} \qquad (3)$$
where $p$ is a parameter to be set empirically and $\Lambda$ is the average of the input volume, computed as follows,
$$\Lambda = \frac{1}{\prod_{k=1}^{d} s_k} \sum_{i_1=1}^{s_1} \sum_{i_2=1}^{s_2} \cdots \sum_{i_d=1}^{s_d} X^l(i_1, i_2, \ldots, i_d) \qquad (4)$$
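A minimal NumPy sketch of AB-ReLU following Equations (2)-(4) is given below. The function name ab_relu is arbitrary, and the scaling of the average by the parameter p (here $\Lambda/p$) reflects one reading of Equation (3); treat this as an illustrative sketch rather than a reference implementation.

```python
import numpy as np

def ab_relu(x: np.ndarray, p: float = 1.0) -> np.ndarray:
    """Average Biased ReLU: shift the input volume by an average-derived bias
    before rectifying, so that prominent negative activations can survive when
    the volume is negative-dominated and weak positive activations are
    suppressed when it is positive-dominated."""
    avg = x.mean()        # Lambda in Eq. (4): average over the whole input volume
    beta = avg / p        # average biased factor (assumed form of Eq. (3))
    shifted = x - beta    # Eq. (2): bias the activations by beta
    return np.where(shifted > 0.0, shifted, 0.0)

# Toy example: a negative-dominated volume lets strong negative values pass.
x = np.array([-3.0, -2.0, -1.0, 0.5])
print(ab_relu(x, p=1.0))  # average is -1.375, so -1.0 and 0.5 survive (shifted)
```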
[Fig. 2: Illustration of (a) the ReLU function and (b) the AB-ReLU function.]
The AB-ReLU leads to two AB-ReLUs, i.e., AB-ReLU$^-$ and AB-ReLU$^+$, based upon the input data. This behavior of AB-ReLU is illustrated in Fig. 2, where Fig. 2(a) shows the ReLU function and Fig. 2(b) depicts the AB-ReLU function. The AB-ReLU$^-$ signifies the average biased scenario when the input data volume has a negative majority, i.e., $\Lambda < 0$, and allows some prominent negative signals by converting them into positive signals with the addition of the average biased factor of the input volume ($|\beta|$). Similarly, if the input data volume has a positive majority, i.e., $\Lambda > 0$, then AB-ReLU$^+$ blocks even some inferior positive signals along with all negative signals by subtracting the average biased factor of the input volume ($\beta$). The default value of $p$ is set to $1$. In the next subsection, AB-ReLU is used to construct the descriptor.
In this subsection, the VGGFace model is used with AB-ReLU to construct the improved VGGFace descriptors. The AB-ReLU is applied directly over the pre-trained VGGFace model at some layers instead of the simple ReLU. The output of layer35 (i.e., ReLU) of the original pre-trained VGGFace model, after reshaping into a 1-D array, is used as the VGGFace descriptor and is represented by VGGFace35ReLU (or just 35R as shorthand notation). The first descriptor is proposed by simply replacing the last ReLU, i.e., at layer35, with AB-ReLU and converting its output into a 1-D array. This descriptor is represented by VGGFace35AB-ReLU (i.e., 35AR) for p = 1. The other variants of this descriptor are VGGFace35AB-ReLU2 (i.e., 35AR2) and VGGFace35AB-ReLU5 (i.e., 35AR5) for p = 2 and p = 5, respectively. Similarly, other descriptors are generated by replacing some ReLU of VGGFace with AB-ReLU. In the second descriptor, i.e., VGGFace33AB-ReLU (i.e., 33AR) for p = 1, layer34 and layer35 are removed, the ReLU at layer33 is replaced with AB-ReLU, and the output of layer33 is considered as the descriptor after reshaping into a 1-D array. Its other variants are VGGFace33AB-ReLU2 (i.e., 33AR2) and VGGFace33AB-ReLU5 (i.e., 33AR5) for p = 2 and p = 5, respectively. In the VGGFace33AB-ReLU_35 (i.e., 33AR_35) descriptor, the ReLU at layer33 is replaced with AB-ReLU, while the output of layer35 using ReLU is considered as the descriptor. AB-ReLU is applied at multiple layers, i.e., at layer33 and layer35, in VGGFace33,35AB-ReLU (i.e., 33,35AR). The AB-ReLU is also applied at layer30. Two descriptors, namely VGGFace30AB-ReLU (i.e., 30AR) and VGGFace30AB-ReLU_35 (i.e., 30AR_35), are considered for the experiments. In VGGFace30AB-ReLU, the output of layer30 (i.e., AB-ReLU) is taken as the descriptor, whereas in VGGFace30AB-ReLU_35, the AB-ReLU is used at layer30 and the output of the last layer (i.e., layer35) is taken as the descriptor. In the experiments section, the shorthand notations of the descriptors are used.
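To make the descriptor construction concrete, the following hedged sketch forms the 35AR descriptor by rectifying the fc7 (layer34) output with AB-ReLU in place of relu7. The callable forward_to_fc7 is a hypothetical stand-in for evaluating the pre-trained VGGFace model up to layer34; it is not part of any specific library, and the bias form again follows the assumed Equation (3).

```python
import numpy as np

def vggface_35ar_descriptor(face_image, forward_to_fc7, p=1.0):
    """Compute the 35AR descriptor: the fc7 (layer34) pre-activations of the
    pre-trained VGGFace model, rectified by AB-ReLU instead of the usual relu7,
    and flattened into a 4096-dimensional feature vector.

    `forward_to_fc7` is a hypothetical callable returning the fc7 output for a
    preprocessed 224x224x3 face image."""
    fc7 = np.asarray(forward_to_fc7(face_image), dtype=np.float64)
    beta = fc7.mean() / p                         # average biased factor (assumed form)
    shifted = fc7 - beta                          # AB-ReLU replaces relu7 (layer35)
    return np.where(shifted > 0.0, shifted, 0.0).reshape(-1)   # 1-D, length 4096
```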
[Fig. 3: (a) An example face image from the LFW database; (b) the input signal to layer35; (c) the output of ReLU at layer35; (d)-(f) the output of AB-ReLU at layer35 for p = 1, 2, and 5, respectively.]
The effect of AB-ReLU with the pre-trained VGGFace (VGGFace35AB-ReLU) is illustrated with an example face image in Fig. 3. The example face image displayed in Fig. 3(a) is taken from the LFW database [33]. This example face image is used as the input to the pre-trained VGGFace model and the features are computed before and after layer35. Fig. 3(b) shows the input signal to the last layer (i.e., layer35). The output signal of ReLU at layer35 is displayed in Fig. 3(c). In Fig. 3(d), 3(e), and 3(f), the output signals of AB-ReLU for p = 1, 2, and 5, respectively, are illustrated. For this example, at layer35, it can also be observed from Fig. 3 that AB-ReLU passes more signal as compared to ReLU.
In this paper, the image retrieval framework is adapted for the experiments. The face retrieval is done using the introduced AB-ReLU based VGGFace descriptor. In face retrieval, the top matching faces are returned from a database for a given query face based on the description of the faces. The best matching faces are decided based on the similarity scores between the query face and the database faces. In this work, the similarity scores are computed as the distances between the descriptor of the query face and the descriptors of the database faces. A lower distance between two feature descriptors represents a higher similarity between the corresponding face images and vice versa.
In image retrieval, the performance also depends upon the distance measure used for finding the similarity scores. In order to compute the performance, the top few matching faces are retrieved. The Chi-square (Chisq) distance is used in most of the experiments in this work. The Euclidean, Cosine, Earth Mover Distance (Emd), L1, and D1 distances are also adapted to find the most suitable distance in the current scenario [56], [6].
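As an illustration of the matching step, the sketch below computes one common form of the Chi-square distance between two descriptors and ranks a gallery by increasing distance. The exact Chi-square formulation used in the experiments is not spelled out here, so this particular variant is an assumption.

```python
import numpy as np

def chi_square_distance(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> float:
    """One common Chi-square distance between two non-negative descriptors."""
    return 0.5 * float(np.sum((a - b) ** 2 / (a + b + eps)))

def retrieve(query: np.ndarray, gallery: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Return indices of the top_k gallery descriptors closest to the query."""
    distances = np.array([chi_square_distance(query, g) for g in gallery])
    return np.argsort(distances)[:top_k]
```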
In order to present the results of face retrieval and the comparison, the standard evaluation metrics are used in this paper, such as precision, recall, F-score, and retrieval rank. All the images of a database are treated as the query image (i.e., probe) one by one, with the rest of the images as the gallery, to report the average performance over the full database. The average retrieval precision (ARP) and average retrieval rate (ARR) over the full database are computed as the average of the mean precisions (MP) and mean recalls (MR), respectively, over all categories. The MP and MR for a category are calculated as the mean of the precisions and recalls, respectively, obtained by turning each image of that category into the query one by one. The precision ($P$) and recall ($R$) for a query image are calculated as follows,

$$P = \frac{\#\,\text{correctly retrieved images}}{\#\,\text{retrieved images}}, \qquad R = \frac{\#\,\text{correctly retrieved images}}{\#\,\text{images in the query's category}} \qquad (5)$$
The F-Score is calculated from the ARP and ARR values with the help of the following equation,
$$\text{F-Score} = \frac{2 \times ARP \times ARR}{ARP + ARR} \qquad (6)$$
In order to test the effective rank of correctly retrieved faces, the average normalized modified retrieval rank (ANMRR) metric is adapted [57]. Better retrieval performance is indicated by higher values of ARP, ARR, and F-Score, and by a lower value of ANMRR, and vice versa.
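The following sketch illustrates the per-query precision and recall of Equation (5) and the F-Score of Equation (6), under the standard interpretation that a retrieved face is correct when it belongs to the query's category; the function and variable names are placeholders, and values are reported in percent to match the tables below.

```python
def precision_recall(retrieved_labels, query_label, category_size):
    """Precision and recall (in %) for one query, per Eq. (5)."""
    correct = sum(1 for lbl in retrieved_labels if lbl == query_label)
    precision = 100.0 * correct / len(retrieved_labels)
    recall = 100.0 * correct / category_size
    return precision, recall

def f_score(arp, arr):
    """F-Score from average retrieval precision and recall, per Eq. (6)."""
    return 2.0 * arp * arr / (arp + arr)
```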
Six challenging, unconstrained and robust face databases are used to demonstrate the effectiveness of the proposed AB-ReLU based VGGFace descriptor: PaSC [58], LFW [33], PubFig [59], FERET [60], [61], AR [62], [63], and ExYaleB [64], [65]. The Viola-Jones object detection method [66] is adapted to detect and crop the face regions in the images. The faces are resized to 224×224 and 'zerocenter' normalization is applied before feeding them to the proposed AB-ReLU based VGGFace model.
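A hedged sketch of this preprocessing step using OpenCV's Viola-Jones cascade is shown below; the cascade file, the choice of the first detected face, and the mean image used for the 'zerocenter' normalization are assumptions for illustration, since these details come from the detector configuration and the pre-trained model's metadata.

```python
import cv2
import numpy as np

# Haar cascade shipped with the opencv-python package (assumed available).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_face(image_bgr, mean_image):
    """Detect a face, crop it, resize to 224x224, and zero-center it."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                   # no face detected; skip image
    x, y, w, h = faces[0]                             # first detection (assumption)
    crop = cv2.resize(image_bgr[y:y + h, x:x + w], (224, 224))
    return crop.astype(np.float32) - mean_image       # 'zerocenter' normalization
```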
The PaSC still-images face database consists of 9376 images from 293 individuals with 32 images per individual [58]. The PaSC database has effects like blur, pose, and illumination and is regarded as one of the more difficult databases. This database finally has 8718 faces after face detection using the Viola-Jones detector. In the current scenario, unconstrained face retrieval is in high demand due to the increasing number of faces over the Internet. In this paper, the LFW [33] and PubFig [59] databases are considered for this purpose. These two databases collected images from the Internet in an unconstrained way, without the subjects' cooperation and with several variations such as pose, lighting, expression, scene, camera, etc. In the image retrieval framework, it is required to retrieve more than one (typically 5, 10, etc.) top matching images. In that case, a sufficient number of images should be available for each category in the database. Considering this fact, all the individuals having at least 20 images are taken in the LFW database (i.e., 2984 faces from 62 individuals) [33]. The Public Figures database (i.e., PubFig) consists of 6472 faces from 60 individuals [59]. Following the URLs given in the PubFig face database, the images are downloaded directly from the Internet after removing the dead URLs.
In order to test the robustness of the descriptor, the FERET, AR, and Extended Yale B face databases are used. "Portions of the research in this paper use the FERET database of facial images collected under the FERET program, sponsored by the DOD Counterdrug Technology Development Program Office" [60], [61]. The cropped version of the Color-FERET database having 4053 faces from 141 people (only subjects having at least 20 faces) is considered in this work. Several variations like expression and pose (13 different poses) are present in the FERET database. The cropped version of the AR face database is also used for the experiments [62], [63]. The AR database has the masking effect, where some portions of the face are occluded, along with illumination and color effects. In total, 2600 face images are available from 100 people in the AR database. The Extended Yale B (ExYaleB) database contains severe illumination variations (i.e., 64 types of illumination) [64], [65]. In total, 2432 cropped faces from 38 persons, with 64 faces per person, are present in the ExYaleB database for face retrieval.
Table II: ARP (%) for the topmost retrieved face (Rank-1) using the Chi-square distance.

Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PaSC | 93.06 | 93.88 | 93.89 | 93.83 | 93.36 | 93.82 | 93.88 | 93.84 | 92.96 | 93.04 | 93.02 | 86.98 |
LFW | 99.10 | 99.53 | 99.31 | 99.24 | 99.21 | 99.36 | 99.32 | 99.37 | 99.22 | 99.14 | 99.30 | 94.82 |
PubFig | 98.22 | 98.35 | 98.59 | 98.54 | 98.25 | 98.32 | 98.52 | 98.43 | 98.08 | 98.09 | 97.63 | 91.76 |
FERET | 95.64 | 95.87 | 95.56 | 95.42 | 94.79 | 94.35 | 94.22 | 93.57 | 95.74 | 95.74 | 95.70 | 92.94 |
AR | 99.73 | 99.77 | 99.81 | 99.81 | 99.85 | 99.81 | 99.81 | 99.81 | 99.77 | 99.77 | 99.77 | 99.96 |
ExYaleB | 85.77 | 86.39 | 86.27 | 85.90 | 86.92 | 86.55 | 86.18 | 85.53 | 85.90 | 85.81 | 86.72 | 92.52 |
Table III: ARP (%) for 5 retrieved faces using the Chi-square distance.

Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PaSC | 87.91 | 89.33 | 89.60 | 89.37 | 87.83 | 88.79 | 89.07 | 89.27 | 87.79 | 87.80 | 86.87 | 68.46 |
LFW | 98.33 | 98.49 | 98.44 | 98.39 | 98.39 | 98.50 | 98.51 | 98.51 | 98.21 | 98.17 | 98.11 | 86.31 |
PubFig | 96.51 | 97.04 | 97.37 | 97.30 | 96.84 | 97.19 | 97.19 | 97.13 | 96.53 | 96.53 | 95.84 | 84.19 |
FERET | 88.04 | 88.46 | 88.01 | 87.94 | 84.65 | 84.73 | 84.85 | 84.75 | 87.98 | 87.98 | 86.91 | 66.75 |
AR | 94.85 | 95.15 | 95.21 | 95.12 | 95.35 | 95.56 | 95.65 | 95.63 | 94.61 | 94.57 | 94.78 | 90.83 |
ExYaleB | 77.31 | 77.97 | 77.71 | 77.29 | 76.28 | 76.43 | 76.17 | 75.84 | 76.97 | 76.98 | 77.58 | 81.60 |
Table IV: ARP (%) for 10 retrieved faces using the Chi-square distance.

Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PaSC | 83.11 | 85.08 | 85.39 | 85.13 | 82.79 | 83.94 | 84.37 | 84.10 | 82.92 | 82.89 | 81.48 | 54.21 |
LFW | 97.34 | 97.69 | 97.52 | 97.34 | 97.45 | 97.54 | 97.63 | 97.34 | 97.17 | 97.12 | 96.87 | 77.52 |
PubFig | 95.06 | 95.71 | 95.90 | 95.83 | 95.41 | 95.75 | 95.72 | 95.54 | 94.91 | 94.92 | 94.11 | 77.28 |
FERET | 80.28 | 81.22 | 80.92 | 80.64 | 75.90 | 75.83 | 76.18 | 75.77 | 80.14 | 80.16 | 77.80 | 45.26 |
AR | 80.93 | 81.95 | 82.05 | 82.07 | 80.47 | 81.63 | 81.95 | 82.31 | 80.47 | 80.45 | 79.29 | 73.83 |
ExYaleB | 70.64 | 71.54 | 71.43 | 71.05 | 68.05 | 68.17 | 68.40 | 68.68 | 70.27 | 70.26 | 70.49 | 68.51 |
Table V: ARR (%) for 10 retrieved faces using the Chi-square distance.

Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PaSC | 28.06 | 28.74 | 28.83 | 28.74 | 27.95 | 28.33 | 28.48 | 28.38 | 27.99 | 27.98 | 27.50 | 18.29 |
LFW | 31.48 | 31.61 | 31.53 | 31.46 | 31.51 | 31.53 | 31.57 | 31.44 | 31.42 | 31.39 | 31.31 | 24.04 |
PubFig | 17.44 | 17.57 | 17.64 | 17.62 | 17.53 | 17.61 | 17.60 | 17.58 | 17.41 | 17.42 | 17.21 | 13.05 |
FERET | 30.32 | 30.67 | 30.54 | 30.43 | 28.63 | 28.57 | 28.73 | 28.59 | 30.27 | 30.27 | 29.32 | 17.10 |
AR | 31.13 | 31.52 | 31.56 | 31.56 | 30.95 | 31.40 | 31.52 | 31.66 | 30.95 | 30.94 | 30.50 | 28.40 |
ExYaleB | 11.04 | 11.18 | 11.16 | 11.10 | 10.63 | 10.65 | 10.69 | 10.73 | 10.98 | 10.98 | 11.01 | 10.70 |
Table VI: F-Score for 10 retrieved faces using the Chi-square distance.

Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PaSC | 41.89 | 42.89 | 43.04 | 42.90 | 41.72 | 42.30 | 42.52 | 42.37 | 41.79 | 41.77 | 41.06 | 27.31 |
LFW | 46.05 | 46.23 | 46.13 | 46.02 | 46.10 | 46.13 | 46.18 | 46.00 | 45.96 | 45.92 | 45.80 | 35.42 |
PubFig | 26.86 | 27.06 | 27.17 | 27.14 | 26.98 | 27.11 | 27.09 | 27.06 | 26.82 | 26.83 | 26.52 | 20.52 |
FERET | 43.46 | 43.97 | 43.79 | 43.63 | 41.05 | 40.97 | 41.20 | 40.99 | 43.38 | 43.40 | 42.05 | 24.50 |
AR | 44.96 | 45.53 | 45.58 | 45.59 | 44.71 | 45.35 | 45.53 | 45.73 | 44.70 | 44.69 | 44.05 | 41.02 |
ExYaleB | 19.09 | 19.33 | 19.30 | 19.20 | 18.39 | 18.42 | 18.49 | 18.56 | 18.99 | 18.99 | 19.05 | 18.52 |
Table VII: ANMRR for 10 retrieved faces using the Chi-square distance.

Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PaSC | 4.40 | 3.40 | 3.43 | 3.61 | 4.43 | 3.98 | 3.83 | 4.23 | 4.51 | 4.54 | 5.28 | 33.97 |
LFW | 0.42 | 0.38 | 0.44 | 0.49 | 0.41 | 0.41 | 0.40 | 0.55 | 0.46 | 0.49 | 0.32 | 16.33 |
PubFig | 0.85 | 0.66 | 0.58 | 0.60 | 0.72 | 0.58 | 0.62 | 0.64 | 0.91 | 0.89 | 1.25 | 14.44 |
FERET | 13.01 | 12.15 | 12.50 | 12.80 | 17.24 | 17.43 | 17.15 | 17.69 | 13.12 | 13.11 | 15.31 | 49.86 |
AR | 4.13 | 3.40 | 3.42 | 3.41 | 4.25 | 3.53 | 3.32 | 3.30 | 4.51 | 4.52 | 5.24 | 10.54 |
ExYaleB | 10.82 | 9.98 | 9.88 | 10.22 | 13.68 | 13.59 | 13.36 | 13.12 | 11.18 | 11.17 | 10.96 | 12.93 |
Table VIII: ARP (%) for 10 retrieved faces using the VGGFace35AB-ReLU (35AR) descriptor with different distance measures.

Database | Euclidean | Cosine | L1 | D1 | Chi-square |
---|---|---|---|---|---|
PaSC | 84.89 | 84.84 | 85.01 | 85.01 | 85.08 |
LFW | 97.65 | 97.64 | 97.66 | 97.66 | 97.69 |
PubFig | 95.64 | 95.63 | 95.67 | 95.67 | 95.71 |
FERET | 81.32 | 81.18 | 81.25 | 81.25 | 81.22 |
AR | 81.80 | 81.78 | 81.93 | 81.93 | 81.95 |
ExYaleB | 71.45 | 71.34 | 71.48 | 71.48 | 71.54 |
In this work, the content-based image retrieval framework is adapted for the experiments and comparison. In this section, first the result comparison is presented by fixing the similarity measure as the Chi-square distance, and then the performance of the proposed VGGFace35AB-ReLU descriptor is tested with different similarity measures.
Several VGGFace descriptors with AB-ReLU at different layers, such as VGGFace35ReLU (35R), VGGFace35AB-ReLU (35AR), VGGFace35AB-ReLU2 (35AR2), VGGFace35AB-ReLU5 (35AR5), VGGFace33ReLU (33R), VGGFace33AB-ReLU (33AR), VGGFace33AB-ReLU2 (33AR2), VGGFace33AB-ReLU5 (33AR5), VGGFace33AB-ReLU_35 (33AR_35), VGGFace33,35AB-ReLU (33,35AR), VGGFace30AB-ReLU_35 (30AR_35), and VGGFace30AB-ReLU (30AR), are used for the experiments. The average retrieval precision (ARP) for the topmost match (i.e., Rank-1 Accuracy) is illustrated in Table II over the PaSC, LFW, PubFig, FERET, AR, and ExYaleB databases. It is observed from Table II that the performance of 35AR and 35AR2 is better, mainly over the unconstrained databases, whereas the performance of 30AR is better over robust databases like AR and ExYaleB. It is also noted that the performance of AB-ReLU (35AR) is improved as compared to ReLU (35R). Table III lists the ARP values when the 5 best faces are retrieved. In this result, the performance is generally better for parameter p = 2, i.e., 35AR2 and 33AR2. The picture becomes clearer in Table IV, where ARP is reported for 10 retrieved images. Descriptors constructed at the last layer (i.e., layer35) are superior except over the AR database. One possible reason is that the training faces of the VGGFace database are not masked. The result in Table IV confirms that AB-ReLU is better suited for the descriptor as compared to ReLU at both layer35 and layer33.
The ARR and F-Score are summarized in Table V and Table VI, respectively, for 10 retrieved images. A similar trend is observed in the results of ARR and F-Score: 35AR and 35AR2 are the best performing VGGFace based descriptors. Some variations can be seen in the ANMRR results for the same 10 best matching retrieved images in Table VII as compared to ARP, ARR, and F-Score, because ANMRR penalizes the rank heavily for false positive retrieved images. Still, 35AR is better over the PaSC and FERET databases and 35AR2 is better over the PubFig and ExYaleB databases. It can be noticed that the F-Score over the LFW database is highest for the 35AR descriptor, whereas the ANMRR is best (lowest) for the 30AR_35 descriptor. This means that while the true positive rate of the 30AR_35 descriptor over the LFW database is lower as compared to the 35AR descriptor, the faces retrieved using 30AR_35 are closer to the query face in terms of their ranks.
In the comparison results of the previous subsection, the Chi-square distance was adapted as the similarity measure. This experiment is conducted to reveal the most suitable similarity measure for the proposed descriptor. The ARP values using the VGGFace35AB-ReLU (i.e., 35AR) descriptor over each database are presented in Table VIII. In this experiment, the 10 top matching images are retrieved with different distances. The Euclidean, Cosine, L1, D1, and Chi-square distances are experimented with and reported in Table VIII. It is noticed that the Chi-square distance based similarity measure is better suited for each database except the FERET database.
In this paper, an average biased rectified linear unit (AB-ReLU) is proposed for image representation using a CNN model. The AB-ReLU is data dependent and adjusts the threshold based on whether the data is positive- or negative-dominated. It considers the average of the input volume to adjust the input volume itself. The advantage of AB-ReLU is that it allows the important negative signals and blocks the irrelevant positive signals based on the nature of the input volume. The AB-ReLU is applied over the pre-trained VGGFace model at the last few layers by replacing the conventional ReLU layers. Face retrieval experiments are conducted to test the performance of the AB-ReLU based VGGFace descriptor. Six challenging face databases are considered, including three unconstrained and three robust databases. Based on the experimental analysis, it is concluded that the AB-ReLU layer is better suited at the last layer than the simple ReLU layer for pre-trained CNN model based feature description. Favorable performance is reported in both unconstrained and robust scenarios. It is also found that the Chi-square distance is better suited to the proposed descriptor for face retrieval.
I gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce Titan X Pascal used for this research.