Pixel-Level Alignment of Facial Images for High Accuracy Recognition Using Ensemble of Patches

02/07/2018 ∙ by Hoda Mohammadzade, et al. ∙ Sharif Accelerator 0

The variation of pose, illumination, and expression still makes face recognition a challenging problem. As a pre-processing step in holistic approaches, faces are usually aligned by the eyes. The proposed method instead performs a pixel-level alignment by mapping the geometry of each face to a reference face while keeping its own texture. The proposed geometry alignment not only creates a meaningful correspondence among every pixel of all faces, but also removes expression and pose variations effectively. The geometry alignment is performed pixel-wise, i.e., every pixel of the face is mapped to a pixel of the reference face. In the proposed method, the intensity and geometry information of faces are separated, used to train separate classifiers, and finally fused together to recognize human faces. Experimental results show a great improvement using the proposed method in comparison to eye-aligned recognition. For instance, at the false acceptance rate of 0.001, the recognition rates are respectively improved by 24 in Yale and AT&T datasets. In LFW dataset, which is a challenging big dataset, improvement is 20




I Introduction

Face recognition is one of the most attractive and practical fields of research in pattern analysis and image processing, receiving much attention from different research communities including pattern recognition, computer vision, image processing, statistical learning, neural networks, and computer graphics. According to [1], face recognition methods can be categorized into two main categories: feature-based and holistic (whole-pixel) methods. Feature-based methods create a feature vector from the face for the learning process. Holistic methods use all pixels of the face region as raw data for recognition and learning.

Feature-based methods utilize the geometrical and structural features of the face [1]. For instance, in [2], features such as head width and the distances between the eyes and from the eyes to the mouth are compared. In [3], angles and distances between eye corners, mouth hole, chin top, and the nostrils are used. In [4], facial features such as the mouth, nose, eyebrows, and face outline are detected using horizontal and vertical gradients. In this method, template matching using correlation is also proposed. In [5, 6], a Hidden Markov Model (HMM) is used on pixel strips of different parts of the face. More recently, a patch-based representation is used in [7], in which each patch learns a transformation dictionary in order to transform the features onto a discriminative subspace. The method in [8] is another feature-based approach, in which a pyramid of the facial image is created and the patches around five key landmarks in different pyramid levels are concatenated to form a high-dimensional feature vector.

Some feature-based methods use both features and whole pixels together in order to enhance the recognition performance [1]. One example is eigenmodules [9], in which eigenfaces are combined with eigenmodules of the face such as eigeneyes, eigenmouth, and eigennose. In [10], Principal Component Analysis (PCA) is used in combination with Local Feature Analysis (LFA). Some of the more promising methods in this category are based on the “shape-free” face concept. In [11, 12], the Active Appearance Model (AAM) has been proposed as a method of warping textures of image patches to a specific geometry in an iterative manner. In this method [13, 14, 15, 16], the patch of the face is labeled by several landmarks (model points), the texture of the face is projected onto the texture model frame by applying scale and offset to the intensities, and the residual (error) between the projected and previous image patches is iteratively reduced. In [13, 16], the shape of the face is also modeled using the Active Shape Model (ASM) [17]. The authors have shown that different weights of eigenvalues can vary the different aspects and parts of face shape models. In [18], several shape-free (neutral) faces create an ensemble and all the faces are approximated by a linear combination of the eigenfaces of the ensemble.

Despite significant advances in feature-based methods, holistic methods still receive much attention as they use the information of all pixels in the face region. Holistic methods detect and crop the face out of the image and use it as a raw input for classification. Eigenfaces [19, 20], Fisherfaces [21], and kernel faces [22, 23] are several well-known examples of this category, which respectively create a feature space using Principal Component Analysis (PCA), Fisher Linear Discriminant Analysis (LDA), and Kernel Direct Discriminant Analysis (KDDA) for face classification and recognition. Face recognition using a support vector machine (SVM) [24] is another method from this category, which formulates face recognition as a two-class problem: one class contains dissimilarities between faces of the same person and the other class contains dissimilarities between faces of different individuals. The Bayesian classifier [25] can also be mentioned in this category, which takes a probabilistic approach toward the similarity of faces. Some other holistic methods of face recognition have used artificial neural networks [1, 26], for instance the Probabilistic Decision-Based Neural Network (PDBNN) [27] and Convolutional Neural Networks (CNN) [28, 29, 30, 31, 32]. Recently, Sparse Representation based Classification (SRC) [33] has been used to create a recognition system that is robust to illumination and occlusion.

Both geometrical and intensity features exist in a 2D image of a face, and both help the human eye to recognize people from their images. For example, both eye color and the distance between the eyes and the nose can be inferred from a facial image. Accordingly, a successful face recognition system should appropriately use both of these categories of features, i.e., geometry and intensity. However, whenever eye alignment is used in a holistic method or other approaches, the correspondence between organs other than the eyes is disturbed; neither the intensities of the lips in different faces nor their positions can be compared with each other. The main contribution of this work is to introduce an alignment method by which the intensity and geometry information are separated from each other. Each of these pieces of information is then used to train its corresponding classification module, and finally their results are fused together to recognize human faces. Moreover, note that any classification algorithm, such as the Fisher-LDA-based ensemble of patches used in this paper, can be used as the classifier in the proposed method. In particular, the proposed method provides appropriate inputs for methods such as Convolutional Neural Networks (CNN), which automatically design specialized filters and require aligned raw features as inputs.

In more detail, two major tasks are accomplished by the proposed method:

  1. The proposed alignment method places the intensity of similar organs in the same positions in the warped faces.

  2. When intensities are properly aligned, using the proposed geometry extraction method, the coordinate of the aligned pixels can be extracted to be used as the corresponding geometry information.

As a result, the proposed pixel alignment provides both intensity and geometry information useful for recognition.

The remainder of this paper is organized as follows. Section II details the geometrical alignment as the first part of the proposed method. Geometrical information is then discussed further in Section III. Creating feature vectors using an ensemble of patches, Fisher Linear Discriminant Analysis (LDA), and decision fusion are explained in Section IV, which also sums up the proposed method by illustrating its overall structure. The utilized datasets and experimental results are reported in Section V. Alignment of features and the ensemble of patches are discussed in Section VI. Finally, Sections VII and VIII conclude the article and mention potential future work, respectively.

II Geometrical Alignment

Geometrical alignment can be defined as aligning the geometries of faces to a unique geometry while preserving their own textures. In the proposed geometrical alignment method, a reference geometry is defined and the geometries of all training and test faces are transformed to this geometry. Here, the geometry of a face is defined as the location of the contours of the facial landmarks. Therefore, geometrical alignment is performed by warping a face such that its facial contours coincide with the reference contours.

In the proposed method, any landmark detection method can be used to detect facial landmarks, such as the Active Shape Model (ASM) [17] or Constrained Local Neural Fields (CLNF) [34]. In this work, CLNF is utilized for this purpose (the code of the CLNF method can be found at https://github.com/TadasBaltrusaitis/OpenFace). The landmarks in this work are as follows: 17 landmarks around the face region, 14 landmarks for the lips, three landmarks for each of the upper and lower teeth, six landmarks for each eye, nine landmarks for the whole nose, and five landmarks for each eyebrow, resulting in 68 landmarks in total.

In the following sections, the different steps of the proposed method are explained in detail.

II-A Fitting Face Contours

The CLNF method [34], which is an enhanced Constrained Local Model (CLM) [35], is used for detecting landmarks of each train and test face. This method is briefly described in the following. Interested readers are encouraged to refer to [34] for more details.

The CLNF method consists of two main parts: (I) probabilistic patch expert (landmark detector), and (II) non-uniform regularized landmark mean-shift optimization technique.

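First, the face or faces are detected with a tree-based method. The CLNF method introduces patch experts, which operate on small regions of pixels around points of interest such as the face edge, eyes, eyebrows, nose, and lips. The initial patch experts are placed on the image. The pixels which fall in a patch are denoted by $\mathbf{x}_i$, where $\mathbf{x}_i$ is a two-dimensional vector representing the coordinates of a pixel. This method uses a one-layer neural network with $\mathbf{x}_i$ as inputs and $y_i$ as outputs, where $y_i$ is a scalar [34]. A potential function $\Psi$ is defined, which is a function of the vertex features $f_k$ and the edge features $g_k$ and $l_k$. The features are defined as [34],

$$f_k(y_i, \mathbf{X}, \boldsymbol{\theta}_k) = -\big(y_i - h(\boldsymbol{\theta}_k, \mathbf{x}_i)\big)^2, \qquad h(\boldsymbol{\theta}, \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^\top \mathbf{x}}},$$

$$g_k(y_i, y_j) = -\tfrac{1}{2}\, S^{(g_k)}_{i,j}\, (y_i - y_j)^2, \qquad l_k(y_i, y_j) = -\tfrac{1}{2}\, S^{(l_k)}_{i,j}\, (y_i + y_j)^2,$$

where the $\boldsymbol{\theta}_k$'s are the weights of the neurons and $h$ is the sigmoid activation function of the neural network. The edge features $g_k$ and $l_k$ control the smoothness (similarity) and sparsity, respectively. This method attempts to maximize the probability

$$P(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\Psi)}{\int_{-\infty}^{\infty} \exp(\Psi)\, d\mathbf{y}}.$$

The non-uniform regularized landmark mean-shift optimization technique takes into account that the variances of different patches are not similar and therefore assigns weights $\mathbf{W}$ to them. The contour of patches is updated as [34],

$$\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p},$$

in which the step is defined as,

$$\Delta\mathbf{p} = -\left(\mathbf{J}^\top \mathbf{W} \mathbf{J} + r \boldsymbol{\Lambda}^{-1}\right)^{-1} \left(r \boldsymbol{\Lambda}^{-1} \mathbf{p} - \mathbf{J}^\top \mathbf{W} \mathbf{v}\right),$$

where $\mathbf{J}$ is the Jacobian of the landmark locations, $r$ is the regularization factor, $\boldsymbol{\Lambda}^{-1}$ is the matrix describing the prior on the parameter $\mathbf{p}$, and $\mathbf{v}$ is the mean-shift vector over the patch responses [34].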

II-B Reference Contour

Reference contours are obtained by averaging the contours of landmarks of several neutral faces from the training set. Figure 1 shows an example of reference contours.
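The averaging step can be illustrated with a minimal sketch. It assumes the landmarks of each neutral face are already available as an array of (x, y) points in a common coordinate frame; the function name `reference_contour` and the toy coordinates are hypothetical and not taken from the paper.

```python
import numpy as np

def reference_contour(landmark_sets):
    """Average landmark coordinates over several faces.

    landmark_sets: array-like of shape (n_faces, n_landmarks, 2).
    Returns an array of shape (n_landmarks, 2).
    """
    stacked = np.asarray(landmark_sets, dtype=float)
    if stacked.ndim != 3 or stacked.shape[2] != 2:
        raise ValueError("expected shape (n_faces, n_landmarks, 2)")
    # Mean over the face axis gives one averaged point per landmark.
    return stacked.mean(axis=0)

# Toy example: three "faces" with two landmarks each (made-up values).
faces = [
    [[10.0, 20.0], [30.0, 40.0]],
    [[12.0, 22.0], [32.0, 38.0]],
    [[14.0, 24.0], [28.0, 42.0]],
]
ref = reference_contour(faces)
```

In practice the input would hold 68 landmarks per face, as listed in Section II.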

Fig. 1: Obtaining reference contour by averaging landmark contours of several neutral faces.
Fig. 2: Procedure of geometrical transformation and pixel-to-pixel face warping.

II-C Transformation and Pixel-to-pixel Warping

After fitting the contours of landmarks to the input face, the face is geometrically transformed and warped to take the geometry of the reference face. This step is detailed in this section.

For the geometrical transformation and pixel-to-pixel warping, three interpolations are performed, as depicted in Fig. 2 and detailed next. As a result of these interpolations, the intensity of each pixel is transferred to its corresponding location on the warped face. This transfer is guided by the transformation between the locations of the landmarks on the input face and their locations on the warped face.

It is important to note that the proposed face warping method differs from the conventional one in that the target coordinates of every single pixel of the input face are calculated using the described interpolation procedures.

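II-C1 Affine Interpolation

In this work, affine transformation is used to perform the coordinate interpolations, i.e., the $x'$- and $y'$-interpolations. Affine interpolation uses the three surrounding points to calculate the interpolated value at a new point. Assume that the points are two-dimensional. In the affine interpolation method [36], the value $f$ at each point $(x, y)$ is approximated as,

$$f(x, y) = a_1 x + a_2 y + a_3,$$

where the coefficients $a_1$ to $a_3$ are calculated by solving the following linear system, according to Fig. 3,

$$\begin{bmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ x_3 & y_3 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} = \begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix}.$$

If this matrix equation is denoted as $\mathbf{A}\mathbf{a} = \mathbf{f}$, then by solving it using the least squares method, the coefficients are found as,

$$\mathbf{a} = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top \mathbf{f}.$$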

Fig. 3: Affine interpolation.
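The following is a minimal sketch of affine interpolation over one triangle, assuming the model "value = a1*x + a2*y + a3" fitted through three known points. The helper names and the toy triangle are hypothetical; `numpy.linalg.lstsq` is used as the least squares solver.

```python
import numpy as np

def affine_coefficients(points, values):
    """Fit value = a1*x + a2*y + a3 through the given points (least squares)."""
    pts = np.asarray(points, dtype=float)
    # Each row of A is [x_i, y_i, 1].
    A = np.column_stack([pts, np.ones(len(pts))])
    a, *_ = np.linalg.lstsq(A, np.asarray(values, dtype=float), rcond=None)
    return a

def affine_interpolate(a, x, y):
    """Evaluate the fitted affine function at (x, y)."""
    return a[0] * x + a[1] * y + a[2]

# Toy triangle with known values at its three vertices (made-up numbers).
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
vertex_values = [1.0, 5.0, 9.0]
coeffs = affine_coefficients(triangle, vertex_values)
value_inside = affine_interpolate(coeffs, 1.0, 1.0)
```

With exactly three non-collinear points the system is square, so the least squares solution coincides with the exact solution of the linear system.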

II-C2 Delaunay Triangulation of Landmarks

Fig. 4: Delaunay triangulation of face landmarks.

According to [37], a triangulation of a finite point set $P$ is called a Delaunay triangulation if the circumcircle of every triangle is empty, that is, there is no point of $P$ inside the circumcircle of any triangle.

Each face is triangulated using the Delaunay method, as depicted in Fig. 4. The triangulation provides the triangles needed for the affine interpolations, which are used in the geometrical transformation described next.
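The empty-circumcircle property can be checked with a small sketch. It uses the standard in-circle determinant test and only verifies the Delaunay condition for a given list of triangles; it does not build the triangulation and does not check that the triangles cover the point set. The function names and the four toy points are hypothetical.

```python
import numpy as np

def in_circumcircle(tri, p):
    """Return True if point p lies strictly inside the circumcircle of tri."""
    a, b, c = np.asarray(tri, dtype=float)
    p = np.asarray(p, dtype=float)
    # Orientation sign makes the determinant test independent of vertex order.
    orient = np.sign((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    m = np.array([[*(v - p), np.dot(v - p, v - p)] for v in (a, b, c)])
    return bool(orient * np.linalg.det(m) > 1e-12)

def is_delaunay(points, triangles):
    """Check that no point falls inside the circumcircle of any triangle."""
    pts = np.asarray(points, dtype=float)
    for tri in triangles:
        others = [i for i in range(len(pts)) if i not in tri]
        if any(in_circumcircle(pts[list(tri)], pts[i]) for i in others):
            return False
    return True

# Four points forming a kite; two possible ways to split it into triangles.
kite = [(0.0, 0.0), (2.0, -1.0), (4.0, 0.0), (2.0, 1.0)]
short_diagonal = [(0, 1, 3), (1, 2, 3)]   # split along the short diagonal
long_diagonal = [(0, 1, 2), (0, 2, 3)]    # split along the long diagonal
```

For these points, only the split along the short diagonal satisfies the empty-circumcircle condition.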

II-C3 Geometrical Transformation

Let $(x, y)$ and $(x', y')$ denote the coordinates of pixels on the input and warped face, respectively, and let $I(x, y)$ and $I'(x', y')$ denote their corresponding intensities.

For the $x'$-interpolation, an auxiliary matrix is created with the same size as the input face. In this matrix, the $x'$ values of the landmarks are placed at the same entries as the landmarks occupy in the input face matrix. The other entries of this matrix are found using affine interpolation, resulting in the $x'$ coordinates of the other pixels. This procedure is depicted in Fig. 5. The $y'$-interpolation is performed similarly, as shown in Fig. 6. After these two steps, the target coordinates of all input pixels are known, i.e., it is known where each input pixel should be transferred.

Fig. 5: $x'$-interpolation.
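As a hedged illustration of the coordinate interpolation, the sketch below maps one pixel inside a landmark triangle of the input face to its target location, given the corresponding triangle of reference landmarks. It uses barycentric weights, which is equivalent to affine interpolation of the two target coordinates over the triangle. The function name `target_coordinate` and the example triangles are hypothetical.

```python
import numpy as np

def target_coordinate(pixel, src_tri, dst_tri):
    """Map a pixel inside src_tri to its target location inside dst_tri.

    src_tri: three landmark positions on the input face, shape (3, 2).
    dst_tri: the matching three reference landmark positions, shape (3, 2).
    """
    src = np.asarray(src_tri, dtype=float)
    dst = np.asarray(dst_tri, dtype=float)
    # Solve for barycentric weights w with sum(w) = 1 and sum(w_i * v_i) = pixel.
    A = np.vstack([src.T, np.ones(3)])
    b = np.array([pixel[0], pixel[1], 1.0])
    w = np.linalg.solve(A, b)
    # The same weights applied to the reference vertices give the target point.
    return w @ dst

src = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]   # input-face landmarks (toy)
dst = [(0.0, 0.0), (20.0, 0.0), (0.0, 10.0)]   # reference landmarks (toy)
mapped = target_coordinate((5.0, 5.0), src, dst)
```

A landmark itself maps exactly onto its reference position, which matches the requirement that input landmarks coincide with reference landmarks after warping.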

II-C4 Pixel-to-pixel Warping

After the $x'$- and $y'$-interpolations, each coordinate $(x', y')$ gets the intensity of its corresponding $(x, y)$ from the input face, i.e.,

$$I'(x', y') = I(x, y).$$

The $I'$ values are then resampled on a uniform grid of pixels to create the warped face (see Fig. 7).
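The resampling step can be sketched as follows. This is a deliberately simplified stand-in, not the interpolation used in the paper: each sample is assigned to its nearest grid cell and samples sharing a cell are averaged, whereas the paper describes an intensity interpolation (Fig. 7). The function name, grid size, and sample values are hypothetical.

```python
import numpy as np

def resample_to_grid(target_xy, intensities, height, width):
    """Scatter intensities at non-integer target positions onto a pixel grid.

    Nearest-cell averaging; cells that receive no sample stay NaN.
    """
    xy = np.asarray(target_xy, dtype=float)
    vals = np.asarray(intensities, dtype=float)
    cols = np.rint(xy[:, 0]).astype(int)   # x maps to the column index
    rows = np.rint(xy[:, 1]).astype(int)   # y maps to the row index
    keep = (rows >= 0) & (rows < height) & (cols >= 0) & (cols < width)
    total = np.zeros((height, width))
    count = np.zeros((height, width))
    # np.add.at accumulates correctly when several samples hit the same cell.
    np.add.at(total, (rows[keep], cols[keep]), vals[keep])
    np.add.at(count, (rows[keep], cols[keep]), 1)
    out = np.full((height, width), np.nan)
    hit = count > 0
    out[hit] = total[hit] / count[hit]
    return out

targets = [(0.2, 0.1), (0.4, -0.2), (1.9, 1.1)]   # (x', y') positions (toy)
warped = resample_to_grid(targets, [10.0, 20.0, 30.0], height=2, width=3)
```

A fuller implementation would fill the empty cells by interpolating between neighboring samples rather than leaving them undefined.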

Fig. 6: $y'$-interpolation.
Fig. 7: Intensity interpolation.
Fig. 8: An example of geometrical transformation and pixel-to-pixel warping.
Fig. 9: Illustration of $\Delta x$ and $\Delta y$ information for a sample warped face. (a) $\Delta x$ information, (b) $\Delta y$ information.

For the sake of demonstration, an example of geometrical transformation and pixel-to-pixel warping on a sample face with a small number of landmarks is depicted in Fig. 8. In this figure, the yellow diamond points and red square points are respectively the input and reference landmarks. The face is warped so that the input landmarks are located precisely at the positions of the reference landmarks, which is the goal of the geometrical transformation. The other pixels are interpolated as explained previously.

III Geometrical Information

Geometrical information appears to be useful in addition to the intensity information of the warped face. The geometry information of each face exists in its unwarped (input) face. By finding the original coordinates (i.e., the coordinates in the unwarped face image) of each pixel of the warped face, the geometry information can be gathered. However, as the $x'$ and $y'$ coordinates have already been resampled once, their original coordinates cannot be found directly. These coordinates can be obtained by performing two further resamplings on the same grid as before: one for the original $x$ values and one for the original $y$ values. In other words, two further interpolations are performed in which the source $x$ and $y$ coordinates of each pixel in the warped face are found. These two interpolations are exactly the same as the previous intensity interpolation (Fig. 7), but with $I$ replaced by $x$ and $y$.

For better visualization, the difference between the original coordinates $x$ and $y$ of every pixel and those of its previous pixel is calculated. These differences are denoted as $\Delta x$ and $\Delta y$, respectively. Figure 9 illustrates the $\Delta x$ and $\Delta y$ information for a sample face from the Yale dataset [38]. The amounts of vertical and horizontal displacement of each pixel after warping can be seen in this figure. The figure shows that, for this specific face, warping has changed the face more in the horizontal direction than in the vertical one.
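The difference maps can be sketched with a few lines of NumPy. The text does not state which neighbor counts as the "previous pixel", so this sketch assumes the left neighbor for the horizontal source coordinate and the upper neighbor for the vertical one; that choice, the function name, and the toy maps are all assumptions.

```python
import numpy as np

def coordinate_differences(orig_x, orig_y):
    """Difference of source coordinates between each pixel and its predecessor.

    orig_x, orig_y: maps of the source coordinates for every warped pixel.
    Assumption: the left neighbor is used for x and the upper neighbor for y.
    """
    ox = np.asarray(orig_x, dtype=float)
    oy = np.asarray(orig_y, dtype=float)
    # Prepending the first column/row keeps the output the same shape as the
    # input and makes the first difference zero.
    dx = np.diff(ox, axis=1, prepend=ox[:, :1])
    dy = np.diff(oy, axis=0, prepend=oy[:1, :])
    return dx, dy

orig_x = [[0.0, 1.0, 3.0], [0.0, 2.0, 4.0]]   # toy source x coordinates
orig_y = [[0.0, 0.0, 0.0], [1.5, 1.0, 2.0]]   # toy source y coordinates
dx, dy = coordinate_differences(orig_x, orig_y)
```

On an unwarped face the differences would be constant (one pixel per step), so deviations from that constant show where the warp stretched or compressed the face.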

IV Classification Using Ensemble of Patches

IV-A Ensemble of Patches and Feature Vectors

Instead of using the whole face, a patch-based approach is used in this work. To do this, an ensemble of patches are created in the limit of face frame. The location of patches are selected randomly once, and for all faces of dataset, the same patches are used in both training and testing phases. The optimum number and size of patches were found through trial and error to be 80 and pixels, respectively, over various different datasets.

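For every face, the ensemble of patches is applied to the intensity matrix of its warped face, its $\Delta x$ matrix, and its $\Delta y$ matrix. An example of applying the ensemble of patches to these three matrices is depicted in Fig. 10. Note that $\Delta x$ and $\Delta y$ carry the same information as $x$ and $y$. To obtain the feature vector of each patch, the matrix entries that fall in the patch are reshaped into a vector. In other words, for the $p$-th patch, if the size of the patch is $m \times n$, the feature vectors are obtained as,

$$\mathbf{f}^{I}_{p} = [I_{p,1}, \dots, I_{p,mn}]^\top, \qquad \mathbf{f}^{\Delta x}_{p} = [\Delta x_{p,1}, \dots, \Delta x_{p,mn}]^\top, \qquad \mathbf{f}^{\Delta y}_{p} = [\Delta y_{p,1}, \dots, \Delta y_{p,mn}]^\top,$$

where $\mathbf{f}^{I}_{p}$, $\mathbf{f}^{\Delta x}_{p}$, and $\mathbf{f}^{\Delta y}_{p}$ are respectively the feature vectors of the $p$-th patch with respect to the intensity, $\Delta x$, and $\Delta y$ matrices. Moreover, $I_{p,j}$, $\Delta x_{p,j}$, and $\Delta y_{p,j}$ denote the entries of the intensity, $\Delta x$, and $\Delta y$ matrices which fall in the $j$-th pixel of the $p$-th patch.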
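A minimal sketch of the patch ensemble is given below. Patch locations are drawn once with a fixed seed and then reused for every matrix, as the text describes. The image size (32 by 32) and the patch size (8 by 8) are placeholders chosen only for illustration, and the function names are hypothetical.

```python
import numpy as np

def sample_patches(height, width, patch_h, patch_w, n_patches, seed=0):
    """Draw random top-left corners once; reuse them for all faces."""
    rng = np.random.default_rng(seed)
    tops = rng.integers(0, height - patch_h + 1, size=n_patches)
    lefts = rng.integers(0, width - patch_w + 1, size=n_patches)
    return list(zip(tops.tolist(), lefts.tolist()))

def patch_features(matrix, patches, patch_h, patch_w):
    """Reshape the entries under each patch into one feature vector."""
    m = np.asarray(matrix, dtype=float)
    return [m[t:t + patch_h, l:l + patch_w].reshape(-1) for t, l in patches]

# Placeholder sizes: 80 patches of 8x8 on a 32x32 matrix.
patches = sample_patches(32, 32, 8, 8, n_patches=80, seed=0)
intensity = np.arange(32 * 32, dtype=float).reshape(32, 32)  # stand-in matrix
features = patch_features(intensity, patches, 8, 8)
```

The same `patches` list would be applied to the intensity matrix and to the two geometry matrices of every face, giving three feature vectors per patch.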

Fig. 10: Classification using ensemble of patches.

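IV-B Fisher Linear Discriminant Analysis

After constructing the feature vectors of the ensemble of patches, three separate Fisher Linear Discriminant Analysis (LDA) subspaces are trained for every patch. More precisely, for the $p$-th patch over the whole training set of faces, one Fisher LDA subspace is trained using the feature vectors $\mathbf{f}^{I}_{p}$, one using the feature vectors $\mathbf{f}^{\Delta x}_{p}$, and one using the feature vectors $\mathbf{f}^{\Delta y}_{p}$. In this work, the Fisherface method [21] is used for classification of each patch; however, more sophisticated learning methods can be used in future work.

The goal of Fisher LDA is to maximize the ratio

$$J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}},$$

where $\mathbf{S}_B$ and $\mathbf{S}_W$ are the between- and within-class scatter matrices, respectively [39, 40], formulated as,

$$\mathbf{S}_B = \sum_{i=1}^{C} N_i (\boldsymbol{\mu}_i - \boldsymbol{\mu})(\boldsymbol{\mu}_i - \boldsymbol{\mu})^\top, \qquad \mathbf{S}_W = \sum_{i=1}^{C} \sum_{j=1}^{N_i} (\mathbf{x}_{i,j} - \boldsymbol{\mu}_i)(\mathbf{x}_{i,j} - \boldsymbol{\mu}_i)^\top,$$

where $\boldsymbol{\mu}_i$ is the mean of the $i$-th class and $\boldsymbol{\mu}$ is the mean of the class means. $N_i$ is the number of samples of the $i$-th class, and $\mathbf{x}_{i,j}$ is the $j$-th sample of the $i$-th class ($j \in \{1, \dots, N_i\}$).

After finding the scatter matrices, a discriminative subspace is created using the eigenvectors of the matrix $\mathbf{S}_W^{-1} \mathbf{S}_B$. To extract the discriminative features from a feature vector, it is projected onto this subspace. If $C$ denotes the number of classes, this projection also reduces the dimension of the data to $C - 1$ [39, 40].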
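A compact sketch of Fisher LDA in NumPy follows. It assumes the standard scatter-matrix definitions with per-class sample counts as weights, and it uses a pseudo-inverse of the within-class scatter to sidestep singularity. That is a simplification: the Fisherface method [21] applies PCA first, which is omitted here. The function name and the toy data are hypothetical.

```python
import numpy as np

def fisher_lda(X, labels):
    """Return a projection matrix with (number of classes - 1) columns."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    means = np.array([X[labels == c].mean(axis=0) for c in classes])
    mu = means.mean(axis=0)                      # mean of the class means
    d = X.shape[1]
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for c, m in zip(classes, means):
        Xc = X[labels == c]
        diff = (m - mu)[:, None]
        S_b += len(Xc) * (diff @ diff.T)         # between-class scatter
        S_w += (Xc - m).T @ (Xc - m)             # within-class scatter
    # Pseudo-inverse used instead of a plain inverse (simplifying assumption).
    evals, evecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(-evals.real)[: len(classes) - 1]
    return evecs[:, order].real

# Two toy classes separated along the first axis.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 0], [5, 1], [6, 0], [6, 1]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
W = fisher_lda(X, y)
Z = X @ W   # projected (discriminative) features
```

For two classes the projection is one-dimensional, and for this toy data it aligns with the axis along which the classes differ.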

Fig. 11: The overall structure of the proposed method.

IV-C Decision Making

Several different features are available rather than one, i.e., the intensity, $\Delta x$, and $\Delta y$ features of all patches. Hence, in order to obtain the final similarity/distance score between two face images, a fusion must be performed. The fusion can be performed before, during, or after classification, which are respectively known as data-, feature-, and decision-level fusion. In data and feature fusion, the feature vectors are concatenated before and after projection onto the discriminative subspace, respectively; in decision fusion, the resulting scores are fused. Decision fusion is found to perform better in this work.

For the $p$-th patch in every face image, each of the feature vectors $\mathbf{f}^{I}_{p}$, $\mathbf{f}^{\Delta x}_{p}$, and $\mathbf{f}^{\Delta y}_{p}$ is projected onto its corresponding discriminative LDA subspace, obtained as described in Section IV-B. The projections result in the projected feature vectors $\mathbf{z}^{I}_{p}$, $\mathbf{z}^{\Delta x}_{p}$, and $\mathbf{z}^{\Delta y}_{p}$. In the context of face recognition, it has been shown that the cosine of the angle between two discriminative feature vectors, which is a similarity score, results in better recognition than distance measures such as the Euclidean distance [41, 42]. Hence, the cosine is used in this work for matching purposes. The similarity score between two face images $a$ and $b$ is then calculated as follows. First, the similarity scores in the discriminative subspaces related to the $p$-th patch are obtained as,

$$s^{I}_{p}(a, b) = \frac{\mathbf{z}^{I}_{p,a} \cdot \mathbf{z}^{I}_{p,b}}{\|\mathbf{z}^{I}_{p,a}\|\, \|\mathbf{z}^{I}_{p,b}\|}, \qquad s^{\Delta x}_{p}(a, b) = \frac{\mathbf{z}^{\Delta x}_{p,a} \cdot \mathbf{z}^{\Delta x}_{p,b}}{\|\mathbf{z}^{\Delta x}_{p,a}\|\, \|\mathbf{z}^{\Delta x}_{p,b}\|}, \qquad s^{\Delta y}_{p}(a, b) = \frac{\mathbf{z}^{\Delta y}_{p,a} \cdot \mathbf{z}^{\Delta y}_{p,b}}{\|\mathbf{z}^{\Delta y}_{p,a}\|\, \|\mathbf{z}^{\Delta y}_{p,b}\|},$$

where $\mathbf{z}^{I}_{p,a}$, $\mathbf{z}^{\Delta x}_{p,a}$, and $\mathbf{z}^{\Delta y}_{p,a}$ are respectively the projected feature vectors $\mathbf{z}^{I}_{p}$, $\mathbf{z}^{\Delta x}_{p}$, and $\mathbf{z}^{\Delta y}_{p}$ in the $a$-th face image.

Then, the final similarity score is simply obtained by a weighted summation of the scores of all $P$ patches (decision fusion),

$$s(a, b) = \sum_{p=1}^{P} \left[ s^{I}_{p}(a, b) + w \left( s^{\Delta x}_{p}(a, b) + s^{\Delta y}_{p}(a, b) \right) \right],$$

where $w$ is the weight associated with the geometrical information, and the weight of the intensity information is set to one for simplicity. The classification using the ensemble of patches is summarized in Fig. 10.
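The decision-level fusion can be sketched as below. The sketch assumes one geometry weight shared by both geometry scores and an intensity weight of one, following the description in the text; the default weight of 0.2 is the value reported later for the AT&T experiments. Function names and the toy vectors are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_score(proj_a, proj_b, geometry_weight=0.2):
    """Sum per-patch cosine scores; geometry scores are down-weighted.

    proj_a, proj_b: one (intensity, geometry_x, geometry_y) tuple of projected
    feature vectors per patch, for face a and face b respectively.
    """
    total = 0.0
    for (ia, xa, ya), (ib, xb, yb) in zip(proj_a, proj_b):
        total += cosine(ia, ib)
        total += geometry_weight * (cosine(xa, xb) + cosine(ya, yb))
    return total

# One patch per face, with made-up projected vectors.
face_a = [([1.0, 0.0], [1.0, 2.0], [0.5, 0.5])]
face_b = [([0.0, 1.0], [1.0, 2.0], [0.5, 0.5])]
same_score = fused_score(face_a, face_a)
diff_score = fused_score(face_a, face_b)
```

Comparing a face with itself gives the maximum score per patch (one for intensity plus twice the geometry weight), while orthogonal intensity vectors remove the intensity contribution entirely.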

IV-D Overall Structure of the Proposed Face Recognition Framework

The proposed method is summarized in Fig. 11. In this method, a set of reference contours is constructed, the landmarks of each training/test face are detected using the CLNF method, the faces are transformed geometrically to the reference, warping is performed, and feature vectors are created for classification. In preparing the feature vectors, the ensemble of patches is applied to the warped intensity, $\Delta x$, and $\Delta y$ matrices. A separate Fisher LDA is trained for every patch in each of these matrices. Finally, in the test phase, the feature vectors are projected onto the corresponding LDA subspaces and the similarity scores are summed to obtain the total score.

V Experimental Results

V-A Datasets

Four different datasets are used for evaluating the recognition performance of the proposed alignment method: Yale [38], AT&T [43], Cohn-Kanade [44, 45], and LFW [46, 47], which are detailed in the following. In this work, as a pre-processing step, the datasets are eye-aligned and then cropped using the CLNF method [34]. More precisely, the locations of the eyes are found using CLNF, and the faces are eye-aligned by a translation and rotation.
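Eye alignment by translation and rotation can be sketched as a rigid transform computed from the two eye locations. The sketch only transforms point coordinates; resampling the image itself is omitted. The function names, the `target_center` parameter, and the example eye positions are hypothetical.

```python
import numpy as np

def eye_alignment_transform(left_eye, right_eye, target_center):
    """Rotation R and translation t that level the eye line and move its
    midpoint to target_center."""
    left = np.asarray(left_eye, dtype=float)
    right = np.asarray(right_eye, dtype=float)
    dx, dy = right - left
    angle = np.arctan2(dy, dx)               # current tilt of the eye line
    c, s = np.cos(-angle), np.sin(-angle)    # rotate by -angle to level it
    R = np.array([[c, -s], [s, c]])
    eye_center = (left + right) / 2.0
    t = np.asarray(target_center, dtype=float) - R @ eye_center
    return R, t

def apply_transform(R, t, points):
    """Apply the rigid transform to an array of (x, y) points."""
    return np.asarray(points, dtype=float) @ R.T + t

R, t = eye_alignment_transform((10.0, 10.0), (30.0, 30.0), (50.0, 40.0))
aligned_eyes = apply_transform(R, t, [(10.0, 10.0), (30.0, 30.0)])
```

Because the transform is rigid, the distance between the eyes is preserved; only the tilt and position change.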

V-A1 Yale face dataset

The Yale face dataset [38] was created by the Center of Computational Vision and Control at Yale University, New Haven. It consists of 165 grayscale face images of 15 different persons. There exist 11 images per person depicting different facial expressions.

V-A2 AT&T face dataset

The AT&T face dataset [43] was created by the AT&T Laboratories Cambridge in 2002. There are pictures of 40 different persons with 10 different facial expressions.

Fig. 12: Several samples of pixel alignment in Yale dataset
Fig. 13: Several samples of pixel alignment in AT&T dataset

V-A3 Cohn-Kanade face dataset

The Cohn-Kanade dataset [44, 45] includes 486 face sequences from 97 persons. Every sequence starts with a neutral face and ends with an extreme version of an expression. Different expressions exist in this dataset, such as laughter and surprise. The first version of this dataset is used here. For every person in this dataset, only one neutral face, all middle expressions, and all extreme expressions are utilized in the experiments of this work.

V-A4 LFW face dataset

The Labeled Faces in the Wild (LFW) dataset [46, 47] is a very large and challenging dataset including 13,233 images of faces collected from the web. The faces have various poses, expressions, and locations in the images. The distance between the camera and the person is not necessarily the same across images. The number of images differs from subject to subject, from one to sometimes 10. The uncropped and unprocessed version of this dataset is used in the experiments of this work.

V-B Warped Faces

In this section, for the sake of visualization, several pixel-aligned warped faces are shown and analyzed. Several samples of warped faces from the Yale and AT&T datasets are illustrated in Figs. 12 and 13, respectively. In these figures, the first and second rows show the faces before and after warping, respectively. The reference contours are shown at the right-hand side of these figures.

As seen in Fig. 12, faces (a), (c), and (f) are originally smiling, but the mouths in the corresponding warped faces are closed and the teeth are largely removed. Face (d), in contrast, originally shows a surprised expression, while its mouth is completely closed after warping. Similarly, in Fig. 13, faces (b), (d), (e), and (f) have different expressions, while their corresponding warped faces have a neutral expression with closed mouths. As shown in these figures, removing the facial expression is one of the effects of the proposed pixel-by-pixel alignment method, which can greatly benefit the recognition task. Moreover, faces (b), (c), and (e) in Fig. 12 show that this method can also change the pose of a face to the pose of the reference contours. Similarly, in Fig. 13, faces (a), (b), (c), and (d) have a frontal pose after warping. In all the warped faces in Figs. 12 and 13, not only are the different organs of the face aligned, but other features of the face are also almost aligned. However, because the landmark detection method does not always converge to the exact landmark points, some features may not become well aligned. For instance, in Fig. 12, the eyes are not completely open in the warped faces (a), (b), (c), (e), and (f).

V-C Experiments

In all the experiments reported in this section, the dataset is first shuffled randomly and then 5-fold cross validation is performed. In the following, the impact of patch size is reported and analyzed first. Thereafter, classification using the ensemble of patches is compared to classification using the whole face. Finally, the proposed method is examined and compared to eye-aligned classification.
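The shuffle-then-split protocol can be sketched as follows. The sketch only produces index folds; how the paper assigns gallery and probe images within each fold is not specified here and is left out. The function name and the seed are hypothetical; the sample count of 165 matches the size of the Yale dataset.

```python
import numpy as np

def shuffled_folds(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices once, then split them into n_folds parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    return list(np.array_split(order, n_folds))

folds = shuffled_folds(165, n_folds=5, seed=0)
# In each round one fold is held out for testing and the rest are used
# for training.
test_idx = folds[0]
train_idx = np.concatenate(folds[1:])
```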

V-C1 Experiment on Size of Patches

In this experiment, the effect of patch sizes are mentioned and reported. Different experiments on AT&T dataset were performed with different sizes of patches, which are , , , , , and random-sized patches each with one of the mentioned sizes. In these experiments, 80 random patches were utilized, and the optimum weight of geometrical information was found to be 0.2 through trial and error.

In each iteration of the experiments, the similarity score between every pair of gallery and probe images is calculated and the Receiver Operating Characteristic (ROC) curve using all the scores are plotted. The ROC curves of experiments are depicted in Fig. 14. As is obvious in this figure, the size of patches has important impact on the recognition rate. According to the curves, patches perform better; therefore, merely patches are used in the next experiments.
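One point of the ROC curve, the verification rate at a fixed false acceptance rate (FAR), can be computed with the sketch below. It assumes that higher scores mean more similar faces and that a pair is accepted when its score is strictly above a threshold chosen from the impostor scores. The function name and the toy scores are hypothetical.

```python
import numpy as np

def verification_rate_at_far(genuine, impostor, far):
    """Fraction of genuine pairs accepted when at most `far` of the
    impostor pairs are accepted."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.sort(np.asarray(impostor, dtype=float))[::-1]  # descending
    # Number of impostor pairs allowed to score above the threshold.
    n_accept = int(np.floor(far * len(impostor) + 1e-9))
    if n_accept >= len(impostor):
        threshold = -np.inf
    else:
        threshold = impostor[n_accept]
    return float(np.mean(genuine > threshold))

impostor_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
genuine_scores = [0.85, 0.99, 0.7, 0.92]
rate = verification_rate_at_far(genuine_scores, impostor_scores, far=0.2)
```

Sweeping `far` over a range of values and plotting the resulting rates gives the full ROC curve.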

Fig. 14: Effect of size of patches in classification using ensemble of patches

V-C2 Patch-based Recognition Using Eye-aligned and Pixel-aligned Faces

Several other experiments were performed evaluating the effect of using ensemble of patches for both eye-aligned and pixel-aligned face images. In these experiments, 80 random patches were utilized with size . First, classification using ensemble of patches and not using patches were tested on eye-aligned images. Note that in classification using ensemble of patches for eye-aligned faces, the ensemble was solely applied on intensity matrix of eye-aligned images because warping does not exist anymore and thus there is no geometrical information. Figure 15 shows ROC curves of the two experiments performed on AT&T dataset. As can be seen in this figure, using ensemble of patches results in overall worse performance than not using patches when eye-aligned method is utilized.

The same two experiments were also performed using pixel-aligned faces rather than eye-aligned ones. The ROC curves of these experiments on the AT&T dataset are also depicted in Fig. 15. As this figure shows, when pixel-aligned faces are used, patch-based recognition produces superior results compared to not using patches. For instance, at an FAR of 0.001, the verification rates are roughly 99% and 94% with and without the ensemble of patches, respectively. This result verifies the effectiveness of using the ensemble of patches together with pixel-to-pixel warped faces.

Fig. 15: Comparison of classification using the whole face or ensemble of patches

V-C3 Eye-aligned Versus Proposed Method

Fig. 16: Comparison of proposed method with eye-aligned face recognition on (a) Yale dataset, (b) AT&T dataset, (c) Cohn-Kanade dataset, (d) LFW dataset.

Eye-aligned face recognition is compared with the proposed pixel-aligned classification method in Fig. 16 and Table I. This comparison is performed on four datasets: Yale [38], AT&T [43], Cohn-Kanade [44, 45], and LFW [46, 47]. The LFW dataset is very challenging and large, and it includes images which might contain more than one face although only one subject label is associated with each image. For this dataset, the faces in each image were detected using the CLNF method [34]. If several faces were detected in an image, the face with the biggest area (the product of the height and width of the face) and the minimum distance from the center of the image was extracted as the main face. Thereafter, the main face was cropped out of the image.
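The main-face selection can be sketched as below. The text names two criteria (largest area, smallest distance from the image center) but not how they are combined, so this sketch assumes the largest area is chosen first and distance to the center breaks ties. That ordering, the `(x, y, w, h)` box format, and the function name are assumptions.

```python
def select_main_face(boxes, image_w, image_h):
    """Pick the main face among detected boxes given as (x, y, w, h).

    Assumption: largest area wins; among equal areas, the box whose center
    is closest to the image center wins.
    """
    cx, cy = image_w / 2.0, image_h / 2.0

    def key(box):
        x, y, w, h = box
        area = w * h
        dist = ((x + w / 2.0 - cx) ** 2 + (y + h / 2.0 - cy) ** 2) ** 0.5
        return (-area, dist)

    return min(boxes, key=key)

# Three made-up detections in a 400x300 image; two share the largest area.
detections = [(10, 10, 50, 50), (100, 80, 80, 80), (300, 20, 80, 80)]
main_face = select_main_face(detections, image_w=400, image_h=300)
```

Other combinations, such as a weighted score of area and distance, would be equally consistent with the description.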

| Dataset | Image | FAR¹ | Verification Rate |
|---|---|---|---|
| Yale | Eye-aligned | 0.001 | 74% |
| Yale | Proposed alignment | 0.001 | 99% |
| AT&T | Eye-aligned | 0.001 | 62% |
| AT&T | Proposed alignment | 0.001 | 94% |
| Cohn-Kanade | Eye-aligned | 0.001 | 95.6% |
| Cohn-Kanade | Proposed alignment | 0.001 | 95.5% |
| LFW | Eye-aligned | 0.1 | 64% |
| LFW | Proposed alignment | 0.1 | 84% |

¹ A False Acceptance Rate (FAR) of 0.1, rather than 0.001, is chosen for the LFW dataset because this dataset is very challenging, which makes this easier false acceptance rate more meaningful.

TABLE I: Results of the proposed method and eye-aligned face recognition in specific false alarm rates.

As can be seen in the ROC curves of Fig. 16 and in Table I, the proposed method significantly outperforms eye-aligned face recognition on the Yale, AT&T, and LFW datasets; for example, on Yale the verification rate at an FAR of 0.001 rises from 74% to 99%. Note that LFW is a very challenging and large dataset, whereas Yale and AT&T are two well-known medium-sized datasets. The proposed method performs well on both large and small datasets, which shows its effectiveness across different types of datasets.

On the Cohn-Kanade dataset, however, the eye-aligned method performs slightly better than the proposed method, although the ROC curves show that the rates of the proposed method are very close to those of eye-aligned face recognition. The reason is that the CLNF method [34], which was used to detect the landmarks for warping, did not detect very open mouths precisely in faces with extreme surprise. Thus, warping could not be performed successfully because of the imprecisely detected landmarks. This shortfall is therefore attributed not to a weakness of the proposed alignment itself but to the lack of correct and accurate landmarks as input.

Vi Discussions

Vi-a Discussion on Alignment of Features

The most important contribution of this work was to introduce a method for fine alignment of intensity features of the face which as a by-product also results in accurate extraction of geometry information of the face. In other words, the proposed alignment method places the intensity of similar organs in the same positions in the warped faces. On the other hand, when intensities are properly aligned, using the proposed geometry extraction method, the coordinates of the aligned pixels can be extracted as the corresponding geometry information. As a result, pixel alignment provides both intensity and geometry information useful for recognition. Moreover, it is important to note that the key finding of this work, i.e., finer alignment results in better classification, is not limited to face recognition problem and is applicable to any other pattern recognition/classification challenge where features are badly corresponded. However, in such pattern recognition problems, what is required to be considered first is to construct an alignment framework, i.e., defining proper correspondences and creating a method for aligning to a reference assortment of features.

VI-B Discussion on Classification Using Ensemble of Patches

As the experiment in Section V-C2 and Fig. 15 show, for warped faces, classification using an ensemble of patches improves performance in comparison to classification using the whole face. However, this improvement is not seen for eye-aligned (not warped) faces. This may be because in warped faces every pixel corresponds to a specific facial region (such as the lip corners) in all faces, which is not true of eye-aligned images. Therefore, in warped faces every patch covers similar and corresponding pixels in all faces, whereas in eye-aligned faces a patch may cover unrelated pixels. This would explain why using patches degrades the results for eye-aligned faces while improving them for warped faces.
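The idea of classifying with an ensemble of patches can be sketched as follows. This is a simplified illustration under stated assumptions: each patch is classified with a nearest-neighbour rule on raw pixels instead of the Fisherfaces classifier used in the experiments, patches are non-overlapping squares, and the per-patch decisions are fused by majority vote. The function names and the fusion rule are hypothetical.

```python
import numpy as np

def extract_patches(face, patch):
    """Split an (H, W) face into non-overlapping patch x patch blocks,
    each flattened to a vector. Border pixels that do not fill a block
    are discarded."""
    H, W = face.shape
    rows, cols = H // patch, W // patch
    face = face[:rows * patch, :cols * patch]
    blocks = face.reshape(rows, patch, cols, patch).swapaxes(1, 2)
    return blocks.reshape(rows * cols, patch * patch)

def classify_by_patches(train_faces, train_labels, probe, patch=8):
    """Classify `probe` by majority vote over per-patch nearest neighbours.

    Assumption: nearest neighbour replaces the per-patch classifier of the
    paper, and majority voting replaces its fusion step.
    """
    train = np.stack([extract_patches(f, patch) for f in train_faces])  # (N, P, D)
    probe_patches = extract_patches(probe, patch)                       # (P, D)
    dists = np.linalg.norm(train - probe_patches[None], axis=2)         # (N, P)
    nearest = dists.argmin(axis=0)                # best training face per patch
    votes = np.asarray(train_labels)[nearest]     # one vote per patch
    labels, counts = np.unique(votes, return_counts=True)
    return labels[counts.argmax()]
```

The sketch only makes sense when patch k covers the same facial region in every image, which is the property that pixel-level alignment provides and eye alignment does not.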

VII Conclusion

In this article, a pixel-level facial alignment method, i.e., a method that aligns all pixels of faces, was proposed. This alignment is achieved by mapping the face geometry onto a reference geometry, where the mapping is guided by contours of facial landmarks that are fitted to each face using a landmark detection method such as CLNF [34]. The proposed alignment method provides both the aligned intensity information and the corresponding geometry information. The resulting aligned intensity and geometry features yield superior recognition results when they are used in a patch-based recognition framework.

The experiments were performed on four well-known datasets, and Fisherfaces [21] was used as an instance of a holistic face recognition method. The results showed significantly better performance of patch-based pixel-aligned face recognition in comparison to eye-aligned face recognition on all utilized datasets except Cohn-Kanade, where the rates differ only slightly. The proposed method does not outperform the eye-aligned method on this dataset because the landmark detection, which is not a contribution of this work, did not work properly on extreme expressions and therefore led to poor warping. The results indicate that the proposed method outperforms eye-aligned face recognition when the landmarks are detected properly.

VIII Future Work

In the proposed warping, all faces, regardless of expression and pose, are warped to a single neutral reference face. The geometrical information obtained from the warped and original faces therefore contains three types of information: the face itself, the expression, and the pose. Among these, only the face itself is important for recognition, because it reflects which pixel has moved where. The other two, expression and pose, contaminate the geometrical information, because two different expressions or poses of the same person result in different geometrical information. One solution to this problem is to use not a single reference face, but one reference face per expression or pose. This can be done by learning a regression for every expression or pose, with the landmarks of the non-neutral face as input and the landmarks of the neutral image as output. We leave this as future work.
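The regression mentioned above could, for instance, take the following form. This is only a sketch of one possible realization: it assumes a linear (ridge) regression fitted separately for each expression or pose, and paired training landmarks of the same persons in the non-neutral and neutral states. The function names, the `ridge` parameter, and the choice of a linear model are assumptions and are not part of the presented work.

```python
import numpy as np

def fit_landmark_regressor(expr_landmarks, neutral_landmarks, ridge=1e-3):
    """Fit a ridge regression from non-neutral to neutral landmarks.

    expr_landmarks, neutral_landmarks: (N, K, 2) arrays holding K landmarks
    for N training faces of one expression or pose (hypothetical data).
    Returns a weight matrix of shape (2K + 1, 2K); the last row is the bias.
    """
    n = len(expr_landmarks)
    X = expr_landmarks.reshape(n, -1)
    Y = neutral_landmarks.reshape(n, -1)
    X1 = np.hstack([X, np.ones((n, 1))])           # append a bias column
    A = X1.T @ X1 + ridge * np.eye(X1.shape[1])    # regularized normal equations
    return np.linalg.solve(A, X1.T @ Y)

def predict_neutral(weights, landmarks):
    """Map the (K, 2) landmarks of one non-neutral face to neutral landmarks."""
    x1 = np.append(landmarks.reshape(-1), 1.0)
    return (x1 @ weights).reshape(-1, 2)
```

One such regressor would be trained per expression or pose, so that the geometry of a non-neutral face can be compared against a reference that shares its expression or pose.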


  • [1] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, “Face recognition: A literature survey,” ACM computing surveys (CSUR), vol. 35, no. 4, pp. 399–458, 2003.
  • [2] M. D. Kelly, “Visual identification of people by computer.” STANFORD UNIV CALIF DEPT OF COMPUTER SCIENCE, Tech. Rep., 1970.
  • [3] T. Kanade, Computer recognition of human faces.   Birkhäuser Basel, 1977, vol. 47.
  • [4] R. Brunelli and T. Poggio, “Face recognition: Features versus templates,” IEEE transactions on pattern analysis and machine intelligence, vol. 15, no. 10, pp. 1042–1052, 1993.
  • [5] A. V. Nefian and M. H. Hayes, “Hidden markov models for face recognition,” in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, vol. 5.   IEEE, 1998, pp. 2721–2724.
  • [6] F. Samaria and S. Young, “Hmm-based architecture for face identification,” Image and vision computing, vol. 12, no. 8, pp. 537–543, 1994.
  • [7] C. Ding, C. Xu, and D. Tao, “Multi-task pose-invariant face recognition,” IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 980–993, 2015.
  • [8] D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3025–3032.
  • [9] A. Pentland, B. Moghaddam, T. Starner et al., “View-based and modular eigenspaces for face recognition,” in CVPR, vol. 94, 1994, pp. 84–91.
  • [10] P. S. Penev and J. J. Atick, “Local feature analysis: A general statistical theory for object representation,” Network: computation in neural systems, vol. 7, no. 3, pp. 477–500, 1996.
  • [11] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” in Computer Vision, 1998. Proceedings of the 1998 European Conference on, vol. 2.   Springer, 1998, pp. 484–498.
  • [12] ——, “Active appearance models,” IEEE Transactions on pattern analysis and machine intelligence, vol. 23, no. 6, pp. 681–685, 2001.
  • [13] A. Lanitis, C. J. Taylor, and T. F. Cootes, “Automatic face identification system using flexible appearance models,” Image and vision computing, vol. 13, no. 5, pp. 393–401, 1995.
  • [14] M. B. Stegmann, “Analysis and segmentation of face images using point annotations and linear subspace techniques,” Tech. Rep., 2002.
  • [15] G. J. Edwards, T. F. Cootes, and C. J. Taylor, “Face recognition using active appearance models,” in European conference on computer vision.   Springer, 1998, pp. 581–595.
  • [16] A. Lanitis, C. J. Taylor, and T. F. Cootes, “A unified approach to coding and interpreting face images,” in Computer Vision, 1995. Proceedings., Fifth International Conference on.   IEEE, 1995, pp. 368–373.
  • [17] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active shape models-their training and application,” Computer vision and image understanding, vol. 61, no. 1, pp. 38–59, 1995.
  • [18] I. Craw and P. Cameron, “Face recognition by computer.” in BMVC, 1992, pp. 1–10.
  • [19] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of cognitive neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
  • [20] M. A. Turk and A. P. Pentland, “Face recognition using eigenfaces,” in Computer Vision and Pattern Recognition, 1991. Proceedings CVPR’91., IEEE Computer Society Conference on.   IEEE, 1991, pp. 586–591.
  • [21] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE Transactions on pattern analysis and machine intelligence, vol. 19, no. 7, pp. 711–720, 1997.
  • [22] J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “Face recognition using kernel direct discriminant analysis algorithms,” IEEE Transactions on Neural Networks, vol. 14, no. 1, pp. 117–126, 2003.
  • [23] J. Lu, K. Plataniotis, and A. Venetsanopoulos, “Kernel discriminant learning with application to face recognition,” in Support Vector Machines: Theory and Applications.   Springer, 2005, pp. 275–296.
  • [24] P. J. Phillips, “Support vector machines applied to face recognition,” in Advances in Neural Information Processing Systems, 1999, pp. 803–809.
  • [25] B. Moghaddam, T. Jebara, and A. Pentland, “Bayesian face recognition,” Pattern Recognition, vol. 33, no. 11, pp. 1771–1782, 2000.
  • [26] M. M. Kasar, D. Bhattacharyya, and T.-h. Kim, “Face recognition using neural network: a review,” International Journal of Security and Its Applications, vol. 10, no. 3, pp. 81–100, 2016.
  • [27] S.-H. Lin, S.-Y. Kung, and L.-J. Lin, “Face recognition/detection by probabilistic decision-based neural network,” IEEE transactions on neural networks, vol. 8, no. 1, pp. 114–132, 1997.
  • [28] M. O. Simón, C. Corneanu, K. Nasrollahi, O. Nikisins, S. Escalera, Y. Sun, H. Li, Z. Sun, T. B. Moeslund, and M. Greitans, “Improved rgb-dt based face recognition,” Iet Biometrics, vol. 5, no. 4, pp. 297–303, 2016.
  • [29] W. AbdAlmageed, Y. Wu, S. Rawls, S. Harel, T. Hassner, I. Masi, J. Choi, J. Lekust, J. Kim, P. Natarajan et al., “Face recognition using deep multi-pose representations,” in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on.   IEEE, 2016, pp. 1–9.
  • [30] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., “Deep face recognition.” in BMVC, vol. 1, no. 3, 2015, p. 6.
  • [31] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1701–1708.
  • [32] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
  • [33] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 2, pp. 210–227, 2009.
  • [34] T. Baltrusaitis, P. Robinson, and L.-P. Morency, “Constrained local neural fields for robust facial landmark detection in the wild,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 354–361.
  • [35] D. Cristinacce and T. F. Cootes, “Feature detection and tracking with constrained local models.” in BMVC, vol. 1, no. 2, 2006, p. 3.
  • [36] S. J. Prince, Computer vision: models, learning, and inference.   Cambridge University Press, 2012.
  • [37] B. Gärtner and M. Hoffmann, “Computational geometry – lecture notes hs 2013, chapter 6,” ETH Zürich University, Tech. Rep., 2014.
  • [38] “Yale face dataset,” http://vision.ucsd.edu/content/yale-face-database, accessed: 2017-17-1.
  • [39] T. Hastie, R. Tibshirani, and J. Friedman, “The elements of statistical learning: Data mining, inference, and prediction,” Biometrics, 2002.
  • [40] C. Bishop, “Pattern recognition and machine learning (information science and statistics), 1st edn. 2006. corr. 2nd printing edn,” Springer, New York, 2007.
  • [41] V. Perlibakas, “Distance measures for pca-based face recognition,” Pattern Recognition Letters, vol. 25, no. 6, pp. 711–724, 2004.
  • [42] H. Mohammadzade and D. Hatzinakos, “Projection into expression subspaces for face recognition from single sample per person,” IEEE Transactions on Affective Computing, vol. 4, no. 1, pp. 69–82, 2013.
  • [43] “At&t face dataset,” http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html, accessed: 2017-17-1.
  • [44] “Cohn-kanade face dataset,” http://www.pitt.edu/~emotion/ck-spread.htm, accessed: 2017-17-1.
  • [45] T. Kanade, J. F. Cohn, and Y. Tian, “Comprehensive database for facial expression analysis,” in Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on.   IEEE, 2000, pp. 46–53.
  • [46] “Lfw face dataset,” http://vis-www.cs.umass.edu/lfw/, accessed: 2017-17-1.
  • [47] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” Technical Report 07-49, University of Massachusetts, Amherst, Tech. Rep., 2007.