SELM: Siamese Extreme Learning Machine with Application to Face Biometrics

08/06/2021 ∙ by Wasu Kudisthalert, et al. ∙ Universidad Autónoma de Madrid ∙ King Mongkut's Institute of Technology Ladkrabang

Extreme Learning Machine is a powerful classification method that is very competitive with existing classification methods. It is extremely fast at training. Nevertheless, it cannot perform face verification tasks properly, because face verification requires comparing the facial images of two individuals at the same time and deciding whether the two faces belong to the same person. The structure of Extreme Learning Machine was not designed to be fed two input data streams simultaneously; thus, in 2-input scenarios, Extreme Learning Machine methods are normally applied using concatenated inputs. However, this setup consumes twice as many computational resources, and it is not optimized for recognition tasks where learning a separable distance metric is critical. For these reasons, we propose and develop a Siamese Extreme Learning Machine (SELM). SELM was designed to be fed with two data streams in parallel. It utilizes a dual-stream Siamese condition in an extra Siamese layer to transform the data before passing it along to the hidden layer. Moreover, we propose a Gender-Ethnicity-Dependent triplet feature exclusively trained on a variety of specific demographic groups. This feature enables learning and extracting useful facial features of each group. Experiments were conducted to evaluate and compare the performances of SELM, Extreme Learning Machine, and DCNN. The experimental results showed that the proposed feature was able to perform correct classification at 97.87% accuracy and 99.45% AUC. [...] the proposed feature provided 98.31 [...] well-known DCNN and Extreme Learning Machine methods by a wide margin.




1 Introduction

During the COVID-19 pandemic, a "new normal" was introduced. People all around the world had to change their daily habits. They had to be constantly aware of their surroundings and keep everything around them free of the virus. When an infected person was detected in an area, the travel history of every potential COVID-19 carrier who had been in that area at the same time had to be retraced, e.g., everyone arriving at or leaving a building or community. To retrace travel history, accurate personal identification is of utmost importance. At the time of writing, some communities required visitors to identify themselves correctly before they were permitted access. There are several ways to identify an individual, such as by ID card, passport, fingerprint, iris, or DNA fierrez06phd; jain16years, but one of the most convenient ways in many setups (such as the moving travellers during COVID-19 discussed above) is facial identification. Numerous monitoring cameras have already been installed almost everywhere, such as in department stores, airports, border crossing facilities, cities, and transportation stations, as a security and surveillance measure. An accurate and reliable face identification algorithm is required to identify individuals by their facial features patel18spm; fierrez21faceq. Identification from facial features is a one-to-many mapping process, i.e., an unknown face is identified among multiple faces already registered in a database. The identification is assisted by taking into account demographic information: identity, age, gender, and ethnicity tome15soft; sosa18cots; guo2020learning; ter21bias. On the other hand, face verification is a one-to-one mapping process. The task verifies whether the individual with the recognized face is the same person registered in a system sun2013hybrid. This task is often used for authorizing access to a system, for example, to a mobile device or a laptop patel20qid. The advantage of this method over others like fingerprint recognition alonso09finger is that it does not require anyone to touch anything fierrez18touch.

Face recognition techniques have been developed for decades galbally2019study, e.g., geometric-based approaches shi2006effective, local feature analysis arca2003face, dictionary-based learning chen2012dictionary; patel2012dictionary, hand-crafted features jin2014hand; antipov2015learned and, recently, Deep Convolutional Neural Networks (DCNN) yuan2017convolutional. Many large-scale face datasets containing millions of images have recently become available 8599059; kemelmacher2016megaface; cao2018vggface2 for training deep learning models. Nevertheless, the class distributions of some variates in those datasets are rather imbalanced, causing statistical bias ter21bias. This issue is associated with an imbalanced representation of classes in a dataset. An effect of the bias was reported in phillips2011other: algorithms developed by Asian researchers were able to distinguish Asian subjects better than Caucasian subjects, whereas algorithms from the West performed better on Caucasian subjects. Along the same line, a study by buolamwini2018gender reported that a commercial face recognition system yielded better outcomes on male individuals and lighter individuals, but worse outcomes on darker females. Therefore, bias in class proportion and demographic variates can strongly affect the performance of a biometric system serna21insidebias. This concern could be alleviated by utilizing datasets evenly distributed across demographics klare2012face; serna2020sensitiveloss. Training a model on a specific group reduces data diversity and allows the model to better learn the characteristics of each class. Interestingly, the performance of a model that was intensely trained on a very specific group, such as a single gender or a single ethnic group in an area, might be superior to the performance of a conventionally trained model acien18bias.

Face representation is an important part of the face verification task. Historically, different representation techniques have been used to extract facial information from face images. In the past, hand-crafted computer vision techniques were employed to transform face images into useful features, such as geometry-based features that use the shape of a face and its landmarks to represent the appearance of the face and its components. At the time of writing, the most competitive face representations are obtained using Deep Convolutional Neural Networks (DCNN) optimized according to different loss functions TripletLoss; SphereFace; arcFace. Among the different loss functions, triplet loss (a triplet network) is a distance-metric approach designed as a type of Siamese network hoffer2015deep. A triplet network has a hierarchy that learns from low-level features to high-level features, i.e., from pixels to classes. It can be fed with two inputs in parallel: a pair of faces can be fed into a triplet network to output a similarity/distance coefficient between the two input face images. The value of this coefficient is then usually compared against a threshold. An identity match is declared when the similarity exceeds the threshold (or, equivalently, when the distance falls below it); otherwise, it is a mismatch. Several machine learning algorithms can be employed to enhance the performance of the face verification task. They learn the pattern of the data and separate it into classes instead of measuring the similarity/distance coefficient between two faces. Nevertheless, most of them cannot deal with this task without some modification, because their architecture was designed to be fed with one input at a time. This can be solved by linking the two inputs into a concatenated input, but an unavoidable bias is then introduced: the order of concatenation matters, since a different order yields a different output. In this work, we restructured a well-known classification algorithm, Extreme Learning Machine (ELM) huang2004extreme, to accept twin inputs simultaneously and eliminate this kind of bias. The restructured algorithm is based on a single hidden-layer feedforward neural network (SLFN).

The following are the main contributions of the present paper:

  • We propose a novel classification method for verification tasks named Siamese Extreme Learning Machine (SELM). The proposed method adapts standard Extreme Learning Machine architectures in order to process parallel inputs in an efficient way.

  • We develop a demographic-dependent triplet model that is shown to improve the performance in face verification.

  • The proposed framework is demonstrated to distinguish gender, ethnicity, and face accurately.

  • We carry out a performance comparison in face biometrics between biased and unbiased triplet models under different setups: subject-independent, gender-dependent, and gender-ethnicity-dependent.

  • We carry out a performance comparison between Siamese and Non-Siamese algorithms.

2 Related works

Some of the key challenges in face recognition are the following: 1) inadequate quality of facial images deteriorates the performance of face detection and verification fierrez21faceq; and 2) biases between cohorts of people, especially with respect to privileged ones, deteriorate the performance of face recognition in general and introduce undesired discrimination between population groups sixta2020fairface; serna2019algorithmic. There are many powerful and well-known techniques for face recognition patel18spm. In this section, we first discuss the strengths and weaknesses of key techniques for face recognition, with emphasis on the two challenges indicated above. We then position our proposed machine learning methods in context.

2.1 Demographic variates in face recognition

Gender and race are two important demographic variates representing subject-specific characteristics of the human face. Other variates have also been proven useful for face recognition. For example, skin tone can help improve face recognition performance. Regarding demographic variates, Cook et al. cook2019demographic examined their effects on face recognition through leading commercial face biometric systems. They investigated the effects with a dataset of 363 subjects in a controlled environment and found that many demographic covariates significantly affected the face recognition performance, including gender, age, eyewear, height, and especially skin reflectance. Lower skin reflectance (darker skin tone) was associated with lower efficiency (longer transaction time) and lower accuracy, in terms of mated similarity score. The study also revealed that skin reflectance was a significantly better predictor than self-identified race variates. Buolamwini and Gebru buolamwini2018gender reported a significant bias in well-known commercial gender classification systems, i.e., Microsoft del2018introducing, IBM high2012era, and Face++. They found that darker-skinned females were the most misclassified group, whereas lighter-skinned males had the lowest misclassification rate. They concluded that these three classification systems yielded the best accuracy for lighter-skinned individuals and males but the worst accuracy for darker-skinned females due to the mentioned bias. Several studies have reported that Caucasian and male individuals are easier to distinguish by face recognition algorithms klare2012face; buolamwini2018gender; cook2019demographic. Recently, Lu et al. 8599059 investigated the effects of demographic groups on face recognition and found that the difficulty of unconstrained face verification varies significantly with different demographic variates. Males are easier to verify than females, and old subjects are recognized better than young individuals. In addition, light-pink skin tone is recognized with the best performance. Moreover, gender and skin tone variates are not significantly correlated.

On the other hand, some works have exploited the inherent differences between population groups for stronger and fairer recognition. Phillips et al. phillips2011other and O’Toole et al. OTOOLE2012169 showed the importance of demographic composition and modeling. They reported that recognition of face identities from a homogeneous population (same-race distribution) was easier than recognition from a heterogeneous population. Lui et al. lui2009meta showed that the recognition performance using a training set that contained facial images of Caucasians and East Asians at a ratio of 3:1 was better at identifying East Asians in every case. Klare et al. klare2012face and Vera-Rodriguez et al. Vera-Rodriguez_2019_CVPR_Workshops improved face-matching accuracy by training exclusively on specific demographic cohorts in which demographic variates were evenly distributed. This solution could reduce face bias and offer higher accuracy across all demographic cohorts. Vera-Rodriguez et al. Vera-Rodriguez_2019_CVPR_Workshops proposed a gender-dependent training approach to improve face verification performance that reduced the effect of gender as a recognition covariate. The approach improved AUC performance from 94.0 to 95.2. Vera-Rodriguez et al. Vera-Rodriguez_2019_CVPR_Workshops and Serna et al. serna2019algorithmic; serna2020sensitiveloss applied deep learning methods to train face recognition models and benchmarked the models over multiple privileged classes. Conventional methods (not exploiting data diversity) resulted in poor performance when demographic diversity was large. Their experimental results showed a big performance gap between the best class (Male-White) and the worst class (Female-Black). The above studies also demonstrated that training the models on specific demographic cohorts can be a possible solution to those large performance differences between cohorts. For example, useful features for distinguishing black individuals may be different from those for white individuals. Thus, training a model with specific groups of individuals may direct the model to better learn the special characteristics of those groups.

Many well-known large-scale face recognition datasets have been published, such as MS-Celeb-1M 8599059, Megaface kemelmacher2016megaface, and VGGFace2 cao2018vggface2. These datasets contain more than a million face images, but most of them are highly biased datasets, composed mainly of Caucasian people, in particular from the Male-Caucasian cohort. Recently, Wang et al. wang2019racial; wang2020mitigating introduced diverse and discrimination-aware face databases with evenly distributed populations: Asian, Black, Caucasian, and Indian. However, they did not balance the gender distribution. Along the same line, Morales et al. morales21sensitivenets introduced the DiveFace database with equal distribution for six demographic groups: Female-Asian, Male-Asian, Female-Black, Male-Black, Female-Caucasian, and Male-Caucasian. The dataset was designed to be unbiased in terms of gender and ethnicity, which is useful both for training fair recognizers and for evaluating them in terms of fairness across population groups.

2.2 Machine learning architectures for face recognition

Figure 1: Siamese network concept.

Machine learning classification techniques have been popular for face recognition tasks. Successful algorithms include, for example, Support Vector Machines (SVM) dadi2016improved, Extreme Learning Machines (ELM) gurpinar2016kernel, Random Forests liu2018conditional, and Deep Convolutional Neural Networks wen2016discriminative, the last one now dominating the field. Goswami et al. goswami2017face summarized the performances of features extracted by deep and shallow feature extractor approaches. The experimental results clearly showed the superiority of deep features. Other works such as Liu et al. liu2018conditional, Bianco bianco2017large, and Wong et al. wong2019realization have also shown the robustness and improved recognition of face biometrics based on features extracted from DCNNs. However, the typical classification architecture in those works was designed to be fed with one input image at a time. For comparing two input faces (e.g., for authentication), the basic DCNN architecture needs to be extended to process two inputs.

One popular approach to exploit a DCNN backbone for comparing two inputs is the Siamese architecture. The concept is to train a feature representation by comparing pairs of facial images. The conceptual diagram is shown in Figure 1. In the present paper, we adopt this architecture in combination with an Extreme Learning Machine (cf. Section 3.1 for an introduction to this type of network). ELMs have been shown to be quite successful in various tasks related to face biometrics, but so far Siamese architectures have not been explored for enhancing basic ELM methods.

As an example of ELMs for face biometrics, Laiadi et al. laiadi2019kinship predicted kinship relationships by comparing facial appearances. They used three different types of features: deep features using the VGG-Face model, and BSIF-Tensor and LPQ-Tensor features using MSIDA. For each pair of face images, the three features were compared by cosine similarity, and the resulting similarities were concatenated into a vector from which ELM computed a kinship score. The proposed approach was more accurate than a baseline ResNet-based method. Wong et al. wong2019realization adopted ELM to tackle face verification. They used ELM as the classification layer on top of DeepID sun2014deep instead of a soft-max layer. This approach improved the accuracy over both a conventional DeepID and ELM.

In the present paper, we develop and explore a novel Siamese classification algorithm for face verification with ELM backbone. The proposed algorithm compares pairs of facial images based on demographic traits. The traits are used as factors for selecting feature extraction models. The main aim of this work is to boost the performance of the algorithm by decreasing the verification errors. A secondary aim of this work is to investigate the dependency of the performance on demographic variates.

3 Methods

3.1 Extreme Learning Machine

Extreme Learning Machines (ELMs) were first introduced by Huang et al. huang2004extreme. They are based on a single hidden-layer feedforward neural network (SLFN) architecture whose weights are obtained by the closed-form solution of an inverse problem, instead of the typical iterative back-propagation optimization. It has been demonstrated that this closed-form solution yields a small classification error and extremely fast learning. The ELM architecture consists of $d$ input neurons ($d$ input dimensions). The input neurons are fully connected with $L$ hidden neurons, each one with weighted inputs according to $\mathbf{w}_j \in \mathbb{R}^{d}$, with $j = 1, \dots, L$. The weights between the hidden layer and the output layer are defined as hidden layer output weights $\boldsymbol{\beta} \in \mathbb{R}^{L}$ that are used to determine the prediction outputs $\hat{\mathbf{y}}$. The model is expressed mathematically as (scalars in italics, column vectors in bold lowercase, matrices in bold uppercase, $^\top$ denotes transpose):

$$\hat{y}_i = \sum_{j=1}^{L} \beta_j \, \phi(\mathbf{w}_j^\top \mathbf{x}_i + b_j) \tag{1}$$

The output from the hidden layer is processed by an activation function $\phi(\cdot)$ with a linear combination of input $\mathbf{x}_i$ and synaptic weights $\mathbf{w}_j$ as well as bias $b_j$, where $i = 1, \dots, N$ and $N$ is the number of input samples. It should be noted that the sets of $\mathbf{w}_j$ and $b_j$ are randomly generated once to speed up the training process. Therefore, the activity of the hidden node can be written as:

$$h_{ij} = \phi(\mathbf{w}_j^\top \mathbf{x}_i + b_j) \tag{2}$$

The prediction score is then expressed by:

$$\hat{\mathbf{y}} = \mathbf{H}\boldsymbol{\beta}, \qquad \mathbf{H} = [h_{ij}] \in \mathbb{R}^{N \times L} \tag{3}$$

ELM minimizes the mean square error between true target labels $\mathbf{y}$ and predicted targets $\hat{\mathbf{y}}$ by using the following objective function:

$$\min_{\boldsymbol{\beta}} \; \| \mathbf{H}\boldsymbol{\beta} - \mathbf{y} \|_2^2 \tag{4}$$

The optimal solution of the hidden layer output weights is finally calculated by the Moore-Penrose pseudo-inverse $\mathbf{H}^{\dagger}$:

$$\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger} \mathbf{y} \tag{5}$$
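The closed-form training described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sigmoid activation, the Gaussian random weights, the toy data, and the function names `elm_train` and `elm_predict` are all assumptions made for illustration.

```python
import numpy as np

def elm_train(X, y, n_hidden, rng):
    """Fit a basic ELM: random hidden layer, closed-form output weights."""
    d = X.shape[1]
    W = rng.standard_normal((d, n_hidden))   # random input weights, drawn once (assumed Gaussian)
    b = rng.standard_normal(n_hidden)        # random biases, drawn once
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid hidden-layer activity
    beta = np.linalg.pinv(H) @ y             # Moore-Penrose pseudo-inverse solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Prediction scores: hidden-layer activity times output weights."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy binary problem (hypothetical data, only to exercise the code).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

W, b, beta = elm_train(X, y, n_hidden=50, rng=rng)
scores = elm_predict(X, W, b, beta)
train_acc = np.mean((scores > 0.5) == (y > 0.5))
```

No iterative optimization is involved: the only learned quantity is `beta`, obtained in one linear-algebra step, which is why training is fast.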


3.2 Weighted Similarity Extreme Learning Machine

Figure 2: Weighted Similarity Extreme Learning Machine architecture.

The Weighted Similarity Extreme Learning Machine (WELM) architecture is shown in Figure 2, where the conventional activation function $\phi(\cdot)$, e.g., sigmoid or radial basis function, is replaced with a similarity-based activation function $\phi_{\mathrm{sim}}(\cdot, \cdot)$, e.g., cosine similarity or Euclidean distance. WELM can reduce training time because it does not need any tuning of the kernel parameters. It yields better performance especially when dealing with similarity-based tasks pasupa2018virtual; kudisthalert2020counting. In WELM, the matrix $\mathbf{H}$ in conventional activation is replaced by:

$$h_{ij} = \phi_{\mathrm{sim}}(\mathbf{x}_i, \mathbf{w}_j) \tag{6}$$

The weights $\mathbf{w}_j$ are randomly selected from the training set $\mathbf{X}$; thus, every $\mathbf{w}_j$ is itself a training sample.
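As a rough illustration of a similarity-based hidden layer, the sketch below computes cosine similarity between every sample and hidden weights drawn from the training data. The function name `welm_hidden`, the choice of cosine similarity, and the toy data are assumptions for this example.

```python
import numpy as np

def welm_hidden(X, W):
    """Cosine similarity between each row of X (samples) and each row of W
    (hidden weights). No kernel parameter has to be tuned."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Xn @ Wn.T

rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 8))                   # hypothetical training features

# Hidden weights are not random numbers: they are training samples picked at random.
idx = rng.choice(len(X_train), size=20, replace=False)
W = X_train[idx]

H = welm_hidden(X_train, W)                               # shape (100, 20)
```

Each column of `H` measures how similar the samples are to one selected training sample, so a sample compared with itself gets similarity 1.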

3.3 Siamese Extreme Learning Machine

Figure 3: Siamese Extreme Learning Machine architecture.

This paper proposes a novel Siamese Extreme Learning Machine (SELM) architecture to handle verification tasks that require simultaneous comparison of two identities. SELM is developed on a WELM network backbone. Input vectors $\mathbf{x}_A$ and $\mathbf{x}_B$ from identities A and B, respectively, are fed into WELM after a Siamese input layer, turning the conventional WELM architecture into a SELM architecture capable of taking two inputs simultaneously and in parallel, as shown in Figure 3.

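A Siamese condition function in the Siamese layer is the core of SELM. The function combines the two input vectors $\mathbf{x}_A$ and $\mathbf{x}_B$ into a single vector $\mathbf{s}$ using one of the following equations:

  • Summation condition function: $\mathbf{s} = \mathbf{x}_A + \mathbf{x}_B$

  • Distance condition function: $\mathbf{s} = |\mathbf{x}_A - \mathbf{x}_B|$ (element-wise absolute difference)

  • Multiply (Hadamard product) condition function: $\mathbf{s} = \mathbf{x}_A \odot \mathbf{x}_B$

  • Mean condition function: $\mathbf{s} = (\mathbf{x}_A + \mathbf{x}_B) / 2$

Note that this Siamese layer can also be interpreted as an initial feature-level information fusion stage fierrez18fusion.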
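The four condition functions can be sketched in a few lines of NumPy. The function name `siamese_condition` and the mode strings are hypothetical, and the distance condition is taken here to be the element-wise absolute difference, which is an assumption. The example also contrasts the conditions with plain concatenation, whose output depends on input order.

```python
import numpy as np

def siamese_condition(xa, xb, mode):
    """Combine two feature vectors into one; every mode is order-independent."""
    if mode == "sum":
        return xa + xb
    if mode == "distance":
        return np.abs(xa - xb)          # assumed: element-wise absolute difference
    if mode == "multiply":
        return xa * xb                  # Hadamard (element-wise) product
    if mode == "mean":
        return (xa + xb) / 2.0
    raise ValueError(f"unknown mode: {mode}")

xa = np.array([0.2, -1.0, 3.0])
xb = np.array([1.5, 0.5, -2.0])

modes = ["sum", "distance", "multiply", "mean"]
symmetric = {m: np.array_equal(siamese_condition(xa, xb, m),
                               siamese_condition(xb, xa, m)) for m in modes}

# Concatenation, by contrast, changes when the inputs are swapped
# and doubles the input dimension.
concat_ab = np.concatenate([xa, xb])
concat_ba = np.concatenate([xb, xa])
```

All four conditions keep the dimension of a single input, whereas concatenation doubles it, which matches the cost argument made for concatenated inputs.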

The pseudocode of the training and prediction processes of SELM is shown in Algorithm 1.

function SELM_Train(X, y)
    N ← number of samples in X
    P ← total number of pairs of samples chosen from the N available samples
    P+ ← number of pairs in the Positive class out of P
    P- ← number of pairs in the Negative class out of P
    (X_A, X_B) ← pairs selected from X after passing them through the Embedding Layer (see Fig. 3)
    W ← randomly selected subset of L rows of X (which has N rows in total)
    for each pair do        ▷ balance the imbalanced dataset
        ...
    end for
    H ← Eq. 6, computed from X_A, X_B, and W
    β ← output weights solved from H and y (cf. Eq. 5)
    return W, β
end function

function SELM_Predict(X, W, β)
    H ← as in Eq. 6, making use of W
    ŷ ← Hβ (cf. Eq. 3)
    return ŷ
end function

Algorithm 1 Siamese Extreme Learning Machine
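A runnable sketch of the overall idea follows. It is an illustration under several assumptions rather than a reproduction of Algorithm 1: the distance condition is used in the Siamese layer, the hidden weights are drawn from the combined training pairs, Euclidean distance serves as the similarity activation, and the class-balancing step and any regularization are omitted. The function names and toy data are hypothetical.

```python
import numpy as np

def siamese_distance(XA, XB):
    """Siamese layer: element-wise absolute difference of each pair."""
    return np.abs(XA - XB)

def euclid_hidden(S, W):
    """Hidden layer: Euclidean distance from each combined pair to each hidden weight."""
    diff = S[:, None, :] - W[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def selm_train(XA, XB, y, n_hidden, rng):
    S = siamese_distance(XA, XB)
    idx = rng.choice(len(S), size=n_hidden, replace=False)
    W = S[idx]                                   # hidden weights taken from training pairs
    beta = np.linalg.pinv(euclid_hidden(S, W)) @ y
    return W, beta

def selm_predict(XA, XB, W, beta):
    return euclid_hidden(siamese_distance(XA, XB), W) @ beta

# Toy embeddings: 30 identities as cluster centres, noisy samples around them.
rng = np.random.default_rng(0)
centers = 3.0 * rng.standard_normal((30, 16))

def sample(ids):
    return centers[ids] + 0.3 * rng.standard_normal((len(ids), 16))

ids_a = rng.integers(0, 30, size=400)
same = rng.random(400) < 0.5                     # True for genuine pairs
shift = rng.integers(1, 30, size=400)            # non-zero shift gives a different identity
ids_b = np.where(same, ids_a, (ids_a + shift) % 30)

XA, XB, y = sample(ids_a), sample(ids_b), same.astype(float)
W, beta = selm_train(XA, XB, y, n_hidden=40, rng=rng)
scores = selm_predict(XA, XB, W, beta)
train_acc = np.mean((scores > 0.5) == same)
```

Because the Siamese layer is symmetric, swapping the two inputs of a pair leaves the score unchanged, which is the order-independence that concatenated inputs lack.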

3.4 Triplet Convolutional Neural Networks

Figure 4: Triplet network structure (Net can be a DCNN).

The triplet network model was proposed for learning useful representations by distance comparisons hoffer2015deep between three samples: an anchor sample $\mathbf{x}^{a}$, a positive sample $\mathbf{x}^{p}$, and a negative sample $\mathbf{x}^{n}$. The triplet network structure is shown in Figure 4. As can be seen, the network employs DCNNs as backbone and optimizes the weights of the model with back-propagation. These core networks are identical, sharing the same weights. The aim of the triplet network is to minimize the distance between the anchor and the positive sample and to maximize the distance between the anchor and the negative sample. The anchor and the positive sample come from the same identity, while the negative sample comes from a different identity. The Euclidean distance between two samples $\mathbf{x}$ and $\mathbf{x}'$, embedded by the network $f(\cdot)$, is expressed as

$$d(\mathbf{x}, \mathbf{x}') = \| f(\mathbf{x}) - f(\mathbf{x}') \|_2$$

The triplet loss is then calculated as the loss function of the network as follows:

$$\mathcal{L} = \max\big( d(\mathbf{x}^{a}, \mathbf{x}^{p}) - d(\mathbf{x}^{a}, \mathbf{x}^{n}) + \alpha,\; 0 \big)$$

where the parameter $\alpha$ is a soft margin. The objective of the learning function is to satisfy $d(\mathbf{x}^{a}, \mathbf{x}^{p}) + \alpha \le d(\mathbf{x}^{a}, \mathbf{x}^{n})$. In this study, we trained a number of triplet networks with several demographic groups so that they could learn population-specific facial information.
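A small numeric example may help make the margin-based triplet loss concrete. The sketch assumes the common hinge form with plain (non-squared) Euclidean distances on already-embedded vectors; the function names and the margin value are chosen only for illustration.

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance between corresponding rows of u and v."""
    return np.linalg.norm(u - v, axis=-1)

def triplet_loss(anchor, positive, negative, margin):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return np.maximum(d_ap - d_an + margin, 0.0)

a = np.array([[0.0, 0.0]])
p = np.array([[0.0, 1.0]])    # distance 1 from the anchor
n = np.array([[3.0, 4.0]])    # distance 5 from the anchor

loss_satisfied = triplet_loss(a, p, n, margin=0.5)   # 1 - 5 + 0.5 < 0, so the loss is 0
loss_violated = triplet_loss(a, n, p, margin=0.5)    # roles swapped: 5 - 1 + 0.5 = 4.5
```

Only triplets that violate the margin contribute to the gradient, so training effort concentrates on identities that are still too close in the embedding space.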

4 Proposed framework

Figure 5: The workflow of the proposed framework.

The proposed framework is shown in Figure 5. It consists of five stages. The framework was designed to verify the identity of two input facial images. The input images are first classified into gender and ethnicity to select a gender- and ethnicity-dependent triplet model for each input. The details of each stage are explained below.

  1. First stage (Input): input color facial images were first cropped and aligned properly Deng_2020_CVPR before being fed into the next stage. It should be noted that the two images passed in parallel through every process in the framework simultaneously. The input direction is shown by black or white line with arrow for Image A and B, respectively.

  2. Second stage (Feature Extraction): ResNet-50 is a 50-layer-deep CNN with skip connections. It is one of the most robust methods for face recognition among existing deep architectures, such as VGG-16, Inception-3 and DenseNet-121 vera2019facegenderid; serna2019algorithmic; wang2019benchmarking. ResNet-50 was used as the feature extraction model. It was trained with a large-scale face dataset, VGGFace2 cao2018vggface2. ResNet-50 requires a color image of 224×224 pixels as input. The output is a vector of 2,048 features.

  3. Third stage (Gender-Ethnicity Prediction). This stage consists of two classification tasks: gender and ethnicity classification.

  4. Fourth stage (Gender-Ethnicity-Dependent Triplet Model): the extracted facial features from the second stage are used by one of six models to extract the triplets. Each triplet model was specially trained only with data in its Gender-Ethnicity-dependent class because, for example, a Female-Black person may have distinctive features different from those in the other classes. Thus, letting the model learn only in a specific class would make it better in recognizing the distinctive characteristics of the data in that class. In this work, we used the DiveFace dataset for training the triplet models because it is a discrimination-aware face dataset that provides the same distribution from the six different demographic groups considered here. Details of DiveFace are described in Section 5.1.1.

  5. Fifth stage (Identity Verification): there are two steps in this verification task. First, the pair of images A and B is classified as an impostor match if both images result in different Gender-Ethnicity classes in the third stage. Second, machine learning models are applied to verify if both images come from the same identity. In this work, we compare the proposed approach SELM to the performance of standard ELM and ResNet (which is now one of the most common DCNNs used for face recognition laiadi2019kinship). Incidentally, ResNet is also a core component of our proposed approach for training the triplet models.
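The control flow of stages three to five can be sketched as follows. All names here (`verify_pair`, `toy_predict_group`, `toy_models`) are hypothetical stand-ins, and a simple distance threshold replaces the SELM, ELM, or ResNet verifier purely to keep the example self-contained.

```python
import math

def verify_pair(feat_a, feat_b, predict_group, triplet_models, threshold):
    """Return True for a genuine match, False for an impostor match."""
    group_a = predict_group(feat_a)
    group_b = predict_group(feat_b)
    # Step 1: different Gender-Ethnicity classes are an impostor match straight away.
    if group_a != group_b:
        return False
    # Step 2: use the group-specific model, then decide on the embedded pair.
    embed = triplet_models[group_a]
    distance = math.dist(embed(feat_a), embed(feat_b))
    return distance <= threshold

# Toy stand-ins for the gender-ethnicity classifier and the group-dependent models.
def toy_predict_group(feat):
    return "Female-Asian" if feat[0] < 0 else "Male-Asian"

toy_models = {"Female-Asian": lambda f: f, "Male-Asian": lambda f: f}

same_person = verify_pair([1.0, 0.0], [1.1, 0.0], toy_predict_group, toy_models, threshold=0.5)
cross_group = verify_pair([1.0, 0.0], [-1.0, 0.0], toy_predict_group, toy_models, threshold=0.5)
far_apart = verify_pair([1.0, 0.0], [3.0, 0.0], toy_predict_group, toy_models, threshold=0.5)
```

The early exit means that an error in the gender-ethnicity classifier propagates directly into the verification result, so the accuracy of the third stage bounds the accuracy of the whole framework.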

5 Experimental protocol

5.1 Dataset

In this study, we used two datasets: DiveFace and Labeled Faces in the Wild. DiveFace is a diversity-aware face recognition dataset for training models such as Gender classification, Ethnicity classification, and Gender-Ethnicity-dependent triplet models. Labeled Faces in the Wild dataset is a well-known large scale face dataset in the face recognition domain for performance evaluation.

5.1.1 DiveFace: a diversity-aware face recognition dataset

Figure 6: Data distribution of the DiveFace dataset generated by t-SNE.

DiveFace was constructed to be an unbiased face recognition dataset. It was carefully constructed with images from Megaface MF2 training dataset kemelmacher2016megaface that contained 4.7 million faces from 672K identities from Flickr Yahoo’s dataset thomee2015new.

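Ethnicity    Female             Male               Total
Asian        12,000 (16.67%)    12,000 (16.67%)    24,000 (33.33%)
Black        12,000 (16.67%)    12,000 (16.67%)    24,000 (33.33%)
Caucasian    12,000 (16.67%)    12,000 (16.67%)    24,000 (33.33%)
Total        36,000 (50.00%)    36,000 (50.00%)    72,000 (100%)

Table 1: Proportions of face images from different ethnicities and genders in the DiveFace dataset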

DiveFace was designed to be evenly distributed in six demographic groups. There are 24,000 identities from six demographic groups, 4,000 identities for each group and three poses for each identity. Thus, each demographic group contained 12,000 faces for a total of 72,000 faces in the whole dataset. The DiveFace proportions for every class in the dataset are shown in Table 1. Three ethnicity categories are available, related to the physical characteristics of each ethnic group:

  • Group 1: people with ancestral origin in Japan, China, Korea, and other countries in that region.

  • Group 2: people with ancestral origins in Sub-Saharan Africa, India, Bangladesh, Bhutan, and others.

  • Group 3: people with ancestral origins in Europe, North-America, and Latin-America with European origin.

In this study, we denoted Group 1, 2, and 3 as Asian, Black, and Caucasian, respectively. A t-distributed Stochastic Neighbor Embedding (t-SNE) maaten2008visualizing of dimension 2 from ResNet-50 descriptors of the full DiveFace dataset is shown in Figure 6. As can be seen, the six clusters separated from each other clearly. However, a few data points in the Male-Black category are also in the clusters of Male-Asian and Male-Caucasian.

5.1.2 Labeled Faces in the Wild

Labeled Faces in the Wild (LFW) database was introduced to evaluate the performance of face verification algorithms with unconstrained parameters, such as position, pose, lighting, background, camera quality, and gender LFWTech. The database contains 13,233 faces collected from the web from 5,749 unique individuals.

LFW was published in 2007. It has been a very popular database in the face recognition field. LFW has already been split properly into standard training and test sets. In this work, we used the test set for evaluating the performance of our framework. It contains a balanced set of 1,000 sample pairs (500 pairs of genuine facial images and 500 pairs of impostor images).

5.2 Experimental settings

In this study, we divided the DiveFace dataset into training, validation, and test sets. The size of the training set was 60% of the whole dataset; the size of the validation set was 10%; and the size of the testing set was 30%. The training set was used to train the triplet models and gender-ethnicity classifier models; the validation set was used to select optimal models; and the testing set was used to evaluate the prediction performances of all tested models. On the other hand, the performance of our full framework was evaluated with the LFW database. The average and standard deviation of the metrics of ten experimental runs, each with a different random split, are reported. For the image pairing, the set of positive samples was constructed by pairing all pose images in all possible ways within each identity. On the other hand, the set of negative samples was constructed by randomly pairing different identities.
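The pairing scheme can be sketched with the standard library. This is an assumed reading of the description: "all possible ways" is taken to mean unordered pairs of pose images within an identity, and negatives are drawn by sampling two distinct identities at random. The function name `build_pairs` and the toy data are hypothetical.

```python
import itertools
import random

def build_pairs(images_by_identity, n_negative, rng):
    """Positive pairs: every unordered pair of images within one identity.
    Negative pairs: one random image from each of two different identities."""
    positives = []
    for images in images_by_identity.values():
        positives.extend(itertools.combinations(images, 2))

    identities = list(images_by_identity)
    negatives = []
    while len(negatives) < n_negative:
        id_a, id_b = rng.sample(identities, 2)          # two distinct identities
        negatives.append((rng.choice(images_by_identity[id_a]),
                          rng.choice(images_by_identity[id_b])))
    return positives, negatives

# Toy data: 4 identities with 3 pose images each, encoded as (identity, pose).
images_by_identity = {i: [(i, pose) for pose in range(3)] for i in range(4)}
positives, negatives = build_pairs(images_by_identity, n_negative=12, rng=random.Random(0))
```

With three poses per identity, each identity yields three positive pairs, so the number of positives grows linearly with the number of identities while the pool of possible negatives grows quadratically.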

The performance of the proposed SELM is evaluated in comparison to ResNet and ELM. ResNet is one of the most well-known DCNN methods. As comparison baseline, we used a ResNet-50 architecture pre-trained for face recognition with VGGFace2 (millions of images). The pre-trained ResNet-50 was then used to train our triplet models. These triplet models classify input image pairs into two classes (genuine or impostor match) based on Euclidean distance. The ELM and SELM methods have a similar architecture, based on a single-layer feedforward neural network that can be trained much faster than common artificial neural networks. SELM, however, has one additional layer (the Siamese layer). Both ELM and SELM use a kernel trick together with a pseudoinverse technique to generate the weights of the model that provide the lowest error rate. Moreover, we evaluate the performance when using four different types of Siamese conditions to improve the classification outcome.

As for parameter settings, the parameters of the three methods are tuned to obtain the best result. False Acceptance Rate (FAR) and False Rejection Rate (FRR) were used to find an optimal threshold, which is considered to be at the Equal Error Rate (EER), i.e., the operating point where FAR equals FRR. For ELM, three parameters were tuned: the regularization parameter, the percentage of hidden nodes, and gamma in the RBF kernel. For SELM, two parameters were tuned: the regularization parameter and the percentage of hidden nodes. As for ResNet, we used the same Euclidean coefficients for calculating the loss function as those used in the kernel trick in SELM. Hence, for the kernel trick, no parameters needed to be tuned.
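The threshold selection at the EER can be sketched as below. The sketch assumes that higher scores mean "more likely genuine" and that candidate thresholds are the observed scores themselves; the function name `eer_threshold` and the toy scores are made up for illustration.

```python
import numpy as np

def eer_threshold(genuine_scores, impostor_scores):
    """Scan candidate thresholds and return the one where FAR and FRR are closest,
    together with the error rate at that point."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    # FAR: fraction of impostor pairs accepted (score at or above the threshold).
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    # FRR: fraction of genuine pairs rejected (score below the threshold).
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return thresholds[i], (far[i] + frr[i]) / 2.0

genuine = np.array([0.9, 0.8, 0.7, 0.4])
impostor = np.array([0.6, 0.3, 0.2, 0.1])
threshold, eer = eer_threshold(genuine, impostor)
# At threshold 0.6, one of four impostors is accepted and one of four genuine pairs is rejected.
```

On finite data FAR and FRR rarely coincide exactly, so the sketch picks the threshold that minimizes their absolute difference and reports their average as the EER estimate.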

6 Results and discussion

In this section, we report the experimental results on the following types of evaluation: evaluation of feature performance, evaluation of classifiers, evaluation of Siamese and non-Siamese architectures, and evaluation of the performance of the whole framework. Two types of evaluation metrics are employed: verification accuracy and AUC (Area Under the Curve). For each experiment, the average and standard deviation of ten runs are reported.

6.1 Evaluation of feature performance

| Feature | Female-Asian | Female-Black | Female-Caucasian | Male-Asian | Male-Black | Male-Caucasian | Average |
|---|---|---|---|---|---|---|---|
| SI | 88.8917 ± 0.40 | 92.9042 ± 0.56 | 96.5625 ± 0.34 | 90.6625 ± 0.67 | 95.4958 ± 0.42 | 97.3875 ± 0.26 | 93.6507 ± 0.44 |
| GD | 96.4375 ± 0.37 | 95.5167 ± 0.40 | 97.0667 ± 0.44 | 95.8333 ± 0.49 | 95.6333 ± 0.20 | 96.7500 ± 0.28 | 96.2063 ± 0.36 |
| GED | 98.3792 ± 0.28 | 97.2375 ± 0.33 | 98.6708 ± 0.24 | 97.7208 ± 0.27 | 97.0917 ± 0.30 | 98.1083 ± 0.16 | 97.8681 ± 0.26 |

(a) Accuracy (%)

| Feature | Female-Asian | Female-Black | Female-Caucasian | Male-Asian | Male-Black | Male-Caucasian | Average |
|---|---|---|---|---|---|---|---|
| SI | 98.2868 ± 0.40 | 96.7622 ± 0.35 | 98.4575 ± 0.25 | 98.5000 ± 0.29 | 96.9556 ± 0.24 | 98.2676 ± 0.32 | 97.8716 ± 0.31 |
| GD | 99.3332 ± 0.23 | 98.5432 ± 0.19 | 99.3619 ± 0.15 | 99.1686 ± 0.20 | 98.7926 ± 0.20 | 99.0539 ± 0.20 | 99.0422 ± 0.20 |
| GED | 99.6405 ± 0.18 | 99.0981 ± 0.28 | 99.6645 ± 0.15 | 99.5362 ± 0.14 | 99.2382 ± 0.24 | 99.5476 ± 0.11 | 99.4542 ± 0.18 |

(b) AUC (%)

Table 2: Performance metrics (mean ± standard deviation over ten runs) achieved by ResNet for each feature type on the DiveFace dataset.

The performances of all features used in the experiment are presented in this section. ResNet-50 was used to train three different feature-extraction models, which were trained differently as follows.

  • Subject-Independent (SI) feature model: this model was trained on randomly paired individuals (no pattern), i.e., with no pre-assigned proportions of gender and ethnicity classes. This kind of model training is conventional in face recognition.

  • Gender-Dependent (GD) feature model: this model was trained independently for Males and Females.

  • Gender-Ethnicity-Dependent (GED) feature model: this model focused on facial characteristics of each cohort, thus the model was trained independently on each of the six considered cohorts. The number of training samples from every cohort was assigned to be the same.

The experimental results on the DiveFace dataset are shown in Table 2(a) and 2(b). The best features among all types of features in every cohort are marked in bold.

As can be seen in Table 2, Accuracy and AUC track each other: the higher the Accuracy, the higher the AUC, and vice versa. The SI feature, the baseline, performed worst, but it still reached 93.65% overall accuracy and 97.87% overall AUC, which made it a challenge to improve on those metrics. Nevertheless, GED and GD yielded better AUC values: 99.45% and 99.04%, respectively. GED proved to be the best among the tested feature types, followed by GD. Furthermore, GED exhibits better metrics than SI and GD for every cohort. This result confirms our hypothesis that training on specific, distinctive groups can induce the model to learn more useful facial features. GD outperformed SI because it learned intensively and independently on each gender group, and GED outperformed GD because it learned in the same way but on each combined gender and ethnicity group.

Nevertheless, the performance of GED was only slightly better than that of GD. To check whether that difference was significant, we used one-way ANOVA to test the null hypothesis that SI, GD, and GED have the same population mean, $H_0: \mu_{SI} = \mu_{GD} = \mu_{GED}$ [siegel1956nonparametric]. The statistical result, , indicates that the difference is statistically significant at a level of , hence the null hypothesis was rejected. GED is the best feature type among the three models tested in this work.
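For reference, the F statistic of a one-way ANOVA can be computed as in the sketch below. Which samples were fed to the test is not stated here, so the per-cohort AUC means from Table 2(b) are used purely as illustrative input; the function name `one_way_anova_f` is hypothetical, and the resulting value should not be read as the statistic reported by the authors.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Illustrative input only: per-cohort AUC means from Table 2(b).
si = [98.2868, 96.7622, 98.4575, 98.5000, 96.9556, 98.2676]
gd = [99.3332, 98.5432, 99.3619, 99.1686, 98.7926, 99.0539]
ged = [99.6405, 99.0981, 99.6645, 99.5362, 99.2382, 99.5476]

f_stat, df_b, df_w = one_way_anova_f([si, gd, ged])
print(f_stat, df_b, df_w)
```

The p-value follows from the F distribution with `df_b` and `df_w` degrees of freedom; `scipy.stats.f_oneway` returns both the statistic and the p-value if SciPy is available.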

6.2 Evaluation of classifier performance

| Classifier | Female-Asian | Female-Black | Female-Caucasian | Male-Asian | Male-Black | Male-Caucasian | Average |
|---|---|---|---|---|---|---|---|
| ResNet | 98.3792 ± 0.28 | 97.2375 ± 0.33 | 98.6708 ± 0.24 | 97.7208 ± 0.27 | 97.0917 ± 0.30 | 98.1083 ± 0.16 | 97.8681 ± 0.26 |
| ELM | 97.8917 ± 0.32 | 96.9000 ± 0.36 | 98.0667 ± 0.35 | 96.0583 ± 0.56 | 97.0750 ± 0.36 | 98.0000 ± 0.41 | 97.3319 ± 0.39 |
| SELMSum | 98.7042 ± 0.27 | 97.7625 ± 0.58 | 98.9083 ± 0.23 | 97.8917 ± 0.37 | 98.0042 ± 0.33 | 98.6167 ± 0.22 | 98.3146 ± 0.33 |
| SELMDist | 97.6125 ± 1.33 | 97.5208 ± 0.55 | 98.8208 ± 0.33 | 98.0333 ± 0.17 | 97.6833 ± 0.42 | 98.6250 ± 0.19 | 98.0493 ± 0.50 |
| SELMMult | 98.5125 ± 0.32 | 97.4750 ± 0.60 | 98.7333 ± 0.30 | 98.3750 ± 0.28 | 97.6208 ± 0.36 | 98.4708 ± 0.20 | 98.1979 ± 0.34 |
| SELMMean | 98.7125 ± 0.28 | 97.7708 ± 0.58 | 98.9083 ± 0.23 | 97.9083 ± 0.34 | 98.0042 ± 0.33 | 98.6208 ± 0.22 | 98.3208 ± 0.33 |

(a) Accuracy (%)

| Classifier | Female-Asian | Female-Black | Female-Caucasian | Male-Asian | Male-Black | Male-Caucasian | Average |
|---|---|---|---|---|---|---|---|
| ResNet | 99.6405 ± 0.18 | 99.0981 ± 0.28 | 99.6645 ± 0.15 | 99.5362 ± 0.14 | 99.2382 ± 0.24 | 99.5476 ± 0.11 | 99.4542 ± 0.18 |
| ELM | 99.6763 ± 0.15 | 99.2188 ± 0.23 | 99.7290 ± 0.11 | 99.5469 ± 0.14 | 99.4825 ± 0.14 | 99.7391 ± 0.09 | 99.5654 ± 0.14 |
| SELMSum | 99.7911 ± 0.09 | 99.4315 ± 0.20 | 99.8274 ± 0.10 | 99.7653 ± 0.09 | 99.6747 ± 0.11 | 99.8497 ± 0.06 | 99.7233 ± 0.11 |
| SELMDist | 99.7322 ± 0.11 | 99.3970 ± 0.20 | 99.7763 ± 0.10 | 99.7031 ± 0.11 | 99.6437 ± 0.13 | 99.8311 ± 0.05 | 99.6806 ± 0.12 |
| SELMMult | 99.6263 ± 0.19 | 99.1449 ± 0.28 | 99.6647 ± 0.15 | 99.7915 ± 0.08 | 99.3344 ± 0.23 | 99.5760 ± 0.12 | 99.5230 ± 0.17 |
| SELMMean | 99.7913 ± 0.09 | 99.4286 ± 0.20 | 99.8280 ± 0.10 | 99.7659 ± 0.09 | 99.6751 ± 0.11 | 99.8503 ± 0.06 | 99.7232 ± 0.11 |

(b) AUC (%)

Table 3: Performance metrics (mean ± standard deviation over ten runs) on the DiveFace dataset achieved by the proposed SELM in comparison to standard ELM and the ResNet baseline using the most robust feature (GED).
Figure 7: Comparison of number of wins accomplished by SELMSum and SELMMean in terms of Accuracy and AUC evaluation metrics.
Figure 8: Summation of ranked order in terms of AUC performance, reported as stacked bars in descending order of ten experiments.

The performances of ResNet, ELM, and SELM embedded with four different types of Siamese conditions (summation, distance, multiplication, and mean, denoted as Sum, Dist, Mult, and Mean, respectively) are shown in Table 3(a) and 3(b). We used the best feature, GED, obtained from the previous experiment (Section 6.1). Table 3 lists the performance metrics, Accuracy and AUC, achieved by the proposed SELM in comparison to standard ELM and the ResNet baseline. The best metric achieved by the best classifier candidate for each demographic cohort is marked in bold.
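To make the four Siamese conditions concrete, the sketch below fuses two feature vectors element-wise. The exact definitions are given earlier in the paper and are not repeated in this section, so the operations used here (element-wise sum, absolute difference, product, and average) are assumptions, as is the function name `siamese_layer`.

```python
def siamese_layer(x_a, x_b, condition):
    """Fuse two equal-length feature vectors into one (assumed element-wise rules)."""
    if len(x_a) != len(x_b):
        raise ValueError("both streams must have the same dimensionality")
    ops = {
        "sum": lambda a, b: a + b,
        "dist": lambda a, b: abs(a - b),  # assumed: absolute difference
        "mult": lambda a, b: a * b,
        "mean": lambda a, b: (a + b) / 2,
    }
    op = ops[condition]
    return [op(a, b) for a, b in zip(x_a, x_b)]


x_a = [1.0, 2.0, 3.0]  # toy feature vector of the first image
x_b = [3.0, 2.0, 1.0]  # toy feature vector of the second image

fused = {name: siamese_layer(x_a, x_b, name) for name in ("sum", "dist", "mult", "mean")}
concatenated = x_a + x_b  # what a non-Siamese ELM would receive instead

print(fused["sum"])       # [4.0, 4.0, 4.0]
print(fused["dist"])      # [2.0, 0.0, 2.0]
print(len(fused["sum"]), len(concatenated))  # 3 6
```

Every fused vector keeps the dimensionality of a single stream, whereas concatenation doubles it. Under these assumed definitions the mean is the sum scaled by one half, which would be consistent with the nearly identical Sum and Mean rows in Table 3.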

The experimental results in Table 3(a) show that SELMMean is the best classification method in terms of overall accuracy, followed by SELMSum, SELMMult, SELMDist, ResNet, and ELM. SELMMean yields the highest accuracy for four out of the six demographic groups; SELMSum yields the highest accuracy for two out of the six demographic groups; and SELMMult yields the highest accuracy for one group. Nevertheless, the overall accuracy scores of the two best methods, SELMMean and SELMSum, differ by only 0.0062 percentage points (98.3208% versus 98.3146%). Furthermore, SELMSum achieves the highest AUC metric () for only one out of the six groups, whereas SELMMean () achieves the highest AUC for four out of the six groups. SELMDist, ELM, SELMMult, and ResNet follow those two, in this order. Figure 7 compares the number of wins of SELMSum and SELMMean in terms of both Accuracy and AUC. Since the graphs are built from ten experimental runs on six demographic cohorts, the ideal score is 60. Figure 7 shows clearly that SELMSum is better than SELMMean in 51 out of 60 cases in terms of accuracy and in 45 out of 60 cases in terms of AUC.

In addition, as a way to rank the methods, we show in Figure 8 the accumulated AUC rank of each method across the ten experimental runs. The ideal summation would be 1st rank in all 60 cases (ten runs times six cohorts), i.e., 60 is the lowest summation possible (best method). At the other extreme, a method ranked 6th in all 60 cases would accumulate 360, the highest summation possible (worst method). We then used Kendall’s Coefficient of Concordance $W$, a statistical technique, to calculate the degree of reliability of the ranked order:

$$W = \frac{12 \sum_{i=1}^{N} \bar{R}_i^2 - 3N(N+1)^2}{N(N^2-1)}$$

where $\bar{R}_i$ is the average ranked order assigned to the $i$-th candidate; $N$ is the number of candidate methods (six); and $k$, the number of runs times the number of cohort groups, is 60. The value of $W$ was found to be . The critical value in the $\chi^2$ distribution was converted from $W$ by the following equation:

$$\chi^2 = k(N-1)W$$
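Kendall's coefficient of concordance and its chi-square conversion can be computed as in the following sketch, which uses the standard average-rank form of W and assumes rankings without ties. The function name `kendalls_w` and the toy rankings are illustrative; they are not the rankings behind Figure 8.

```python
def kendalls_w(rank_rows):
    """Return (W, chi2) for k rankings of the same N candidates (no ties assumed).

    rank_rows: list of k lists; rank_rows[j][i] is the rank (1..N) that
    ranking j assigns to candidate i.
    """
    k = len(rank_rows)
    n = len(rank_rows[0])
    # Average rank received by each candidate over the k rankings.
    avg_rank = [sum(row[i] for row in rank_rows) / k for i in range(n)]
    w = (12 * sum(r * r for r in avg_rank) - 3 * n * (n + 1) ** 2) / (n * (n * n - 1))
    chi2 = k * (n - 1) * w  # compared against chi-square with n - 1 degrees of freedom
    return w, chi2


# Three rankings that fully agree on three candidates: W = 1.
w_agree, chi2_agree = kendalls_w([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
# Two rankings that are exact opposites: W = 0.
w_oppose, chi2_oppose = kendalls_w([[1, 2, 3], [3, 2, 1]])
print(w_agree, chi2_agree)    # 1.0 6.0
print(w_oppose, chi2_oppose)  # 0.0 0.0
```

In the setting described here, `rank_rows` would hold 60 rankings (ten runs times six cohorts) of the six candidate methods, so the chi-square value would be compared against a distribution with five degrees of freedom.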
We acquired which indicates that the ranked order shown in Figure 8 is reliable at a confidence level of 99.9%. The rank order is as follows:

6.3 Evaluation of Siamese and non-Siamese architectures performance

(a) Accuracy
(b) AUC
Figure 9: The performances of Siamese (SELM) vs. non-Siamese (WELM) Extreme Learning Machines.

In this section, we compare the performance of the most robust Siamese architecture (SELMSum) to that of WELM, an ELM with a non-Siamese architecture. Their backbone architectures are identical except for the additional Siamese layer in SELM. For WELM, the two simultaneous inputs were concatenated before training the network; for SELM, they were not concatenated but passed through the Siamese layer instead. All subsequent procedural steps of the two architectures are the same.

Figure 9 shows the Accuracy and AUC performances of WELM and SELM as an increasing number of hidden nodes is used to train a model (Figure 9(a) and 9(b), respectively). The performance values are averaged across the six available demographic cohorts. It can be seen that WELM has to use a large number of hidden nodes, up to of the training samples, in order to compete with SELM, while SELM needs less than in order to achieve excellent results. The optimal model of WELM achieves Accuracy when the number of its hidden nodes is , while SELM achieves Accuracy with a number of hidden nodes of only . It should also be noted that SELM was able to achieve Accuracy and AUC with a number of hidden nodes of only . We used a two-sample t-test to check the statistical significance of the difference between the mean scores of both methods at and found that the t-values for Accuracy and AUC are and , respectively. Hence, we conclude that the proposed Siamese ELM performs significantly better than the standard non-Siamese ELM.
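A two-sample t statistic of the kind mentioned above can be computed as sketched below. The text does not say whether a pooled-variance (Student) or unequal-variance (Welch) test was used; the sketch assumes the pooled form, and the function name and toy samples are illustrative only.

```python
import math


def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    # Unbiased sample variances of each group.
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    # Pooled variance (assumes both groups share one population variance).
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    t_stat = (mean_a - mean_b) / math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return t_stat, n_a + n_b - 2


# Toy scores standing in for the per-run metrics of two methods.
t_stat, dof = two_sample_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(round(t_stat, 4), dof)  # 1.5492 4
```

The statistic is then compared with the critical value of the t distribution for `dof` degrees of freedom at the chosen significance level.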

6.4 Evaluation of the whole system performance

Figure 10: System’s face verification Accuracy and ranked order for each demographic group of the LFW database.
Figure 11: False Acceptance Rate and False Rejection Rate of the proposed SELM systems for the LFW database.

We evaluated the proposed system, described in Section 4, in conjunction with the most robust feature, GED, described in Section 6.1, and the most robust classification method, SELMSum, described in Section 6.2. The whole system is termed SELM. It should be noted that the proposed system first classifies individuals according to their respective Gender-Ethnicity class so that a proper feature-extraction model could be selected for that purpose, and the input image pairs that are not in the same Gender-Ethnicity class are classified as impostor comparisons. SELM is similar to SELM but without the initial Gender-Ethnicity classification. In Figure 10 we show the performances of ResNet (baseline), SELM, and SELM tested on the standard test set of the LFW database.
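The two-stage decision flow described above can be sketched as follows. All callables (`predict_group`, `extract_features`, `classify_pair`) and the toy stand-ins are hypothetical placeholders for the Gender-Ethnicity classifier, the per-cohort feature-extraction models, and the SELM classifier; they are not the authors' implementations.

```python
import math


def verify_pair(img_a, img_b, predict_group, extract_features, classify_pair):
    """Return True for a genuine match, False for an impostor comparison."""
    group_a, group_b = predict_group(img_a), predict_group(img_b)
    # Stage 1: pairs from different Gender-Ethnicity classes are rejected outright.
    if group_a != group_b:
        return False
    # Stage 2: use the feature model of the shared cohort, then classify the pair.
    feat_a = extract_features(group_a, img_a)
    feat_b = extract_features(group_a, img_b)
    return classify_pair(feat_a, feat_b)


# Toy stand-ins so the sketch runs end to end.
def toy_predict_group(img):
    return img["group"]


def toy_extract_features(group, img):
    return img["vec"]


def toy_classify_pair(feat_a, feat_b):
    return math.dist(feat_a, feat_b) < 0.5  # arbitrary illustrative threshold


a = {"group": "Female-Asian", "vec": [0.10, 0.20]}
b = {"group": "Female-Asian", "vec": [0.10, 0.25]}
c = {"group": "Male-Black", "vec": [0.10, 0.20]}

same_cohort = verify_pair(a, b, toy_predict_group, toy_extract_features, toy_classify_pair)
cross_cohort = verify_pair(a, c, toy_predict_group, toy_extract_features, toy_classify_pair)
print(same_cohort, cross_cohort)  # True False
```

The cross-cohort pair is rejected even though its toy feature vectors are identical, which shows why the accuracy of the final decision depends on the accuracy of the first-stage cohort prediction.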

The ranked order of each demographic is shown on top of the bar representing that group in Figure 10. It can be seen that SELM is the best method, producing the smallest sum of ranked order (), followed by ResNet (), and SELM (). The performances of both SELM and SELM for the Black demographic class are lower than the performance obtained for the Asian and Caucasian classes. This is because the systems were trained on DiveFace, which contains data of individuals in the Black group whose origin is in the Sub-Saharan region, Africa, India, Bangladesh, and Bhutan, while the tested LFW dataset contains data of individuals of the Black group not well represented by those regions. Regarding the performance of SELM, it works like a two-stage prediction system, and the accuracy of the final prediction in the second stage depends highly on the performance of the first stage, the Gender-Ethnicity prediction model. In this study, SELM yielded very accurate outcomes when the first stage provided an ideal classification of the Gender-Ethnicity group. Figure 11 shows bar graphs of two evaluation metrics, False Acceptance Rate (FAR) and False Rejection Rate (FRR), produced by ResNet, SELM, and SELM. FAR was considered the most important metric for this kind of task. It represents the rate at which the wrong persons are given access to the system. The performance results show that both SELM and SELM provided a very low FAR (), 12 times lower than that provided by ResNet (), indicating that they would execute with far less error in face recognition tasks.

7 Conclusion

A framework for face verification is proposed. The framework employs a new classification method called Siamese Extreme Learning Machine (SELM), an improved version of a powerful classification method called Extreme Learning Machine that can accept two image inputs in parallel and process them concurrently. In our performance evaluation, SELM was studied in conjunction with several features that were trained on unbiased demographic-dependent groups. With this training, the feature-extraction model in our proposed SELM was able to recognize distinct features of individuals in demographic groups better than a conventional feature-extraction model. In an evaluation experiment, four different types of Siamese conditions embedded in the Siamese layer were compared. The SELM with summation and mean conditions provided the highest overall performance scores. Furthermore, in another experiment, SELM with the ‘sum’ Siamese condition was demonstrated to be more robust than the baseline methods ResNet and ELM. In particular, the proposed method was able to perform the verification task better than the other methods, with 98.31% Accuracy and 99.72% AUC (Table 3). More importantly, SELM provided a very low false acceptance rate, which was 12 times lower than that provided by ResNet (), a considerable improvement.

For future work, we aim to do the following: (i) train our own face recognition model from scratch to eliminate any bias from the beginning [ter21bias], (ii) explore other architectures for processing multiple inputs on top of ELM backbones beyond Siamese settings using recent advances from the information fusion field [fierrez18fusion2], and (iii) apply SELM to other types of image comparison tasks in addition to human face verification.



This work was supported by the Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang and project BIBECA (RTI2018-101248-B-I00 MINECO/FEDER).

Conflicts of interest/Competing interests

The authors declare that they have no competing interests.