As noted in the review on the eye localization topic, “eye detection and tracking remains challenging due to the individuality of eyes, occlusion, variability in scale, location, and light conditions”. Eye data and details of eye movements have numerous applications in face detection, biometric identification, and particularly in human-computer interaction tasks. Among the various applications of eye localization, we are particularly interested in facial expression analysis. Thus, while any method is expected to perform accurately enough on real-life cases and be fast enough for real-time applications, we show an additional interest in the cases where eye center localization is challenged by facial expression. We will show that the proposed method, which uses an MLP to discriminate between encoded normalized image projections from patches centered on the eye and those from patches shifted from the eye, is both accurate and fast.
1.1 Related work
The problem of eye localization has been well investigated in the literature and has a long history. Methods for eye center (or iris or pupil) localization in passive, remote imaging may approach the problem either as a particular case of a pattern recognition application or by using the physical particularities of the eye, like the high contrast to the neighboring skin or the circular shape of the iris. The proposed method combines a pattern recognition approach with features that make use of the eye’s high contrast.
One of the first eye localization attempts is the work of Kanade, who used image projections for this purpose. Since our method also uses image projections for localization, in the next paragraphs we present state of the art methods, going from the conceptually closest to the wider categories. Namely, we start with solutions based on projections and follow with general eye localization methods and face fiducial point localization algorithms.
While earlier methods also tried to estimate the face position, since the appearance of the Viola-Jones face detection solution the eye center search has been limited to a subarea within the upper face square.
Projections based methods
The same image projections as in the work of Kanade are used to extract information for eye localization in a plethora of methods , , .  start with a snake based head localization followed by anthropometric reduction (relying on the measurements from 
) to the so-called eye-images and introduce variance projections for localization. The key points of the eye model are particular values of the projections, while the conditions are manually crafted.
 describe convex combinations between integral image projections and variance projections that are named generalized projection functions. These are filtered and analyzed to determine the center of the eye. The analysis is also manually crafted and requires identification of minima and maxima on the computed projection functions. Yet in specific conditions, such as intense expression or side illumination, the eye center does not correspond to a minimum or a maximum of the projection functions.  use conditions similar to the ones used in  but applied solely on the integral projections to detect if an eye is open or closed.
introduce the edge projections and use them to roughly determine the eye position. Given the eye region, a feature is computed by concatenating the horizontal and vertical edge image projections. Subsequently, an SVM-based identification of the region with the highest probability is used for marking the eye. The method from
is, to the best of our knowledge, the only one besides ours that uses image projections coupled with machine learning. Yet we differ by using supplementary data coupled with efficient computation techniques and elaborate pre- and post-processing steps to keep the accuracy high and the running time low.
General eye localization methods
There are many other approaches to the problem of eye localization. 2000 propose a face matching method based on the Hausdorff distance followed by an MLP eye finder. 2003 even reversed the order of the typical procedure: they use eye contrast specific measures to validate possible face candidates.
2005 refine with SVM the Gabor filtered faces, for locating 10 points of interest; yet the overall approach is different from the face fiducial points approach that is discussed in the next paragraph. 2006 use an iteratively bootstrapped boosted cascade of classifiers based on Haar wavelets. 2007 use multi-scale Gabor jets to construct an Eye Model Bunch. 2009 use the distance to the closest edge to describe the eye area. 2008,  use isophote properties to gain invariance and follow with subsequent filtering with Mean Shift (MS) or nearest neighbor on a SIFT feature representation for higher accuracy. 2010 relies on thresholding the cumulative histogram for segmenting the eyes. 2010 train a set of classifiers to detect multiple face landmarks, explicitly including the pupil center, by using a sliding window approach; they test all possible locations and interconnect them to estimate the overall shape. 2011 base their eye localizer on gradient techniques and search for circular shapes. 2013 use an exhaustive set of similarity measures over basic features such as histograms, projections or contours to extract the eye center location, having in mind the specific scenario of driver assistance.
Face fiducial points localization
More recently, motivated by the introduction by 2001 of the active appearance models (AAM), which simultaneously determine a multitude of face feature points, a new class of solutions appeared, namely the localization of face fiducial points. In this category we include the algorithm of 2005, who use a GentleBoost algorithm for combining features extracted with Gabor filters; 2008, who extend the original active shape models with more landmark points and stack two such models; 2012, who model shapes using the Markov Random Field and classify them using SVM in the so-called Borman algorithm; 2011, who use Bayesian inference on SIFT extracted features; and, most recently, 2012, who use a combination of regularized boosted classifiers and a mixture of complex Bingham distributions over texture and shape related features.
1.2 Paper Structure
In this paper we propose a system for eye center localization that starts with face detection and illumination type detection, followed by a novel feature extractor, an MLP classifier for discriminating among possible candidates and a post-processing step that determines the eye centers. We contribute by:
Describing a procedure for fast computation of image projections. This step is critical for having the solution run in real time.
Introducing a new encoding technique to the image analysis domain.
The combination of normalized image projections with zero-crossing based encoding results in image description features named Zero-crossing based Encoded image Projections (ZEP). They are fast, simple, robust and easy to compute and therefore are applicable to a wider variety of problems.
The integration of the features in a framework for the problem of eye localization. We will show that the description of the eye area using ZEP leads to significantly better results than state of the art methods in real-life cases represented by the extensive and very difficult Labeled Faces in the Wild database. Furthermore, the complete system is the fastest known in the literature among the ones reporting high performance.
The remainder of this paper is organized as follows: Section 2 reviews the concepts related to Integral Projections and describes a fast computation method for them; Section 3 summarizes the encoding procedure and the combination with image projections to form the ZEP features. The paper ends with implementation details, with a discussion on the achieved results in the field of eye localization and proposals for further developments.
2 Image Projections
2.1 Integral Image Projections
The integral projections, also named integral projection functions (IPF) or amplitude projections, are tools that have been previously used in face analysis. They appeared as “amplitude projections”  or as “integral projections”  for face recognition. For a gray-level image sub–window with and , the projection on the horizontal axis is the average gray–level along the columns (1), while the vertical axis projection is the average gray–level along the rows (2):
The integral projections reduce the dimensionality of the data from 2D to 1D, describing it up to a certain level of detail. Also, the projections can be computed on any orthogonal pair of axes, not necessarily rows and columns. This will be further discussed in subsection 5.2.
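As an illustration only, the two projections described above can be sketched in plain Python. The function name, the argument layout (top-left corner plus height and width) and the nested-list image representation are assumptions made for this sketch and do not come from the paper.

```python
def integral_projections(image, top, left, height, width):
    """Return (horizontal, vertical) integral projections of a sub-window.

    horizontal[j] is the mean gray level of column j of the window;
    vertical[i] is the mean gray level of row i of the window.
    The image is assumed to be a list of rows of gray levels.
    """
    rows = [row[left:left + width] for row in image[top:top + height]]
    horizontal = [sum(r[j] for r in rows) / height for j in range(width)]
    vertical = [sum(r) / width for r in rows]
    return horizontal, vertical


image = [
    [10, 20, 30],
    [40, 50, 60],
]
h_proj, v_proj = integral_projections(image, 0, 0, 2, 3)
print(h_proj)  # [25.0, 35.0, 45.0]
print(v_proj)  # [20.0, 50.0]
```

The 2 x 3 window is reduced to one vector of 3 values and one of 2 values, which is the 2D to 1D reduction mentioned in the text.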
2.2 Edge Projections
Instead of determining edges with wavelet transform as in the case of , we use a different approach for computing the edge projections. First, the classical horizontal and vertical Sobel contour operators (for details see  sect. 3.7) are applied, resulting in and which are combined in the image used to extract edges:
The edge projections are computed on the corresponding image rectangle :
As the Sobel operator is invariant to additive changes, the edge projections are significantly more stable with respect to illumination changes than other types of projections.
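The invariance to additive changes can be checked with a small sketch. The way the horizontal and vertical Sobel responses are combined is not recoverable from the text here, so the gradient magnitude used below is an assumption, as are the function name and the choice of leaving border pixels at zero.

```python
import math


def sobel_edge_image(image):
    """Edge image from the 3x3 Sobel operators (assumed: gradient magnitude).

    Border pixels, where the 3x3 mask does not fit, are left at 0.
    """
    h, w = len(image), len(image[0])
    edges = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = (image[i - 1][j + 1] + 2 * image[i][j + 1] + image[i + 1][j + 1]
                  - image[i - 1][j - 1] - 2 * image[i][j - 1] - image[i + 1][j - 1])
            gy = (image[i + 1][j - 1] + 2 * image[i + 1][j] + image[i + 1][j + 1]
                  - image[i - 1][j - 1] - 2 * image[i - 1][j] - image[i - 1][j + 1])
            edges[i][j] = math.hypot(gx, gy)
    return edges


# A vertical step edge, and the same image made uniformly brighter.
image = [[0, 0, 10, 10] for _ in range(4)]
brighter = [[v + 50 for v in row] for row in image]
print(sobel_edge_image(image) == sobel_edge_image(brighter))  # True
```

Because each Sobel mask sums to zero, a constant added to every pixel cancels out, so the edge image, and any projection computed on it, is unchanged.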
2.3 Fast Computation of Projections
While sums over rectangular image sub-windows may be easily computed using the concept of summed area tables  or integral image , a fast computation of the integral image projections may be achieved using the prefix sums  on rows and, respectively, on columns. A prefix sum is a cumulative array in which each element is the sum of all elements to the left of it, inclusive, in the original array. Prefix sums are the 1D equivalent of the integral image, but they predate it, as the recurrence has been known for many years.
For the fast computation of image projections, two tables are required: one holds the prefix sums on rows (a table which, to keep the analogy with the integral image, will be named the horizontal 1D integral image) and one, the vertical 1D integral image, contains the prefix sums on columns. It should be noted that the computation on each row/column is performed separately. Thus, if the image has pixels, the 1D horizontal integral image, on the column , , is:
Thus, the horizontal integral projection corresponding to the rectangle is:
The procedure is visually exemplified in figure 1.
Using the oriented integral images, the determination of the integral projections functions on all sub-windows of size in an image of pixels requires one pass through the image and additions, subtractions and two circular buffers of locations, while the classical determination requires additions. Hence, the time to extract the projections associated with a sub-window, where many sub-windows are considered in an image, is greatly reduced.
The edge projections require the computation of the oriented integral images over the Sobel edge image, . This image needs to be computed on the areas of interest.
In conclusion, the fast computation of projections opens the direction of real-time feature localization on high resolution images.
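The prefix-sum idea of this subsection can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the table below accumulates down each column and carries an extra leading row of zeros, and the names `column_prefix_sums` and `horizontal_projection` are hypothetical.

```python
def column_prefix_sums(image):
    """prefix[i][j] = image[0][j] + ... + image[i-1][j] (row 0 is all zeros)."""
    n_cols = len(image[0])
    prefix = [[0] * n_cols]
    for row in image:
        last = prefix[-1]
        prefix.append([last[j] + row[j] for j in range(n_cols)])
    return prefix


def horizontal_projection(prefix, top, left, height, width):
    """Mean gray level of each window column, using one subtraction per column."""
    bottom = top + height
    return [(prefix[bottom][j] - prefix[top][j]) / height
            for j in range(left, left + width)]


image = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
prefix = column_prefix_sums(image)           # built once per image
print(horizontal_projection(prefix, 1, 1, 2, 2))  # [65.0, 75.0]
```

Once the table is built, the projection of any sub-window costs one subtraction per projection sample, independent of the window height, which is where the saving over re-summing every window comes from. The vertical projection follows the same pattern with a table accumulated along rows.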
3 Encoding and ZEP Feature
To reduce the complexity (and computation time), the projections are compressed using a zero-crossing based encoding technique. After ensuring that the projection values are in a symmetrical range with respect to zero, we describe, independently, each interval between two consecutive zero-crossings. Such an interval is called an epoch, and three parameters are considered for its description (as presented in figure 2):
Duration - the number of samples in the epoch;
Amplitude - the maximal signed deviation of the signal with respect to zero;
Shape - the number of local extremes in the epoch.
The proposed encoding is similar to the TESPAR (Time-Encoded Signal Processing and Recognition) technique that is used in the representation and recognition of 1D, band-limited, speech signals. Depending on the problem specifics, additional parameters of the epochs may be considered (e.g. the difference between the highest and the lowest mode from the given epoch). Further extensions are at hand if an epoch is considered the approximation of a probability density function and the extracted parameters are the statistical moments of the said distribution. In such a case the shape parameter corresponds to the number of modes of the distribution.
The reason for choosing this specific encoding is two-fold. First, the determination of the zero-crossings and the computation of the parameters can be done in a single pass through the target 1D signal, and, secondly, the epochs have a specific meaning when describing the eye region, as discussed in the next subsection.
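A minimal sketch of the epoch encoding, under stated assumptions: the signal is already centred on zero, a sample equal to zero is treated as positive, and only interior samples of an epoch are counted as local extremes. These conventions and the function name `encode_epochs` are choices made for the illustration, not details given in the text.

```python
def encode_epochs(signal):
    """Split a zero-centred 1D signal into epochs and describe each one.

    Returns a list of (duration, amplitude, shape) tuples, where amplitude is
    the signed sample of largest magnitude in the epoch and shape is the
    number of interior local extremes.
    """
    epochs = []
    start = 0
    for k in range(1, len(signal) + 1):
        # An epoch ends at the end of the signal or where the sign flips.
        if k == len(signal) or (signal[k] >= 0) != (signal[start] >= 0):
            segment = signal[start:k]
            amplitude = max(segment, key=abs)
            shape = sum(
                1
                for a, b, c in zip(segment, segment[1:], segment[2:])
                if (b - a) * (c - b) < 0
            )
            epochs.append((len(segment), amplitude, shape))
            start = k
    return epochs


signal = [0.2, 0.8, 0.3, -0.1, -0.6, -0.2, -0.7, -0.1, 0.4]
print(encode_epochs(signal))
# [(3, 0.8, 1), (5, -0.7, 3), (1, 0.4, 0)]
```

In the example, nine samples are summarised by three epochs: a one-mode positive epoch, a three-extreme negative epoch and a short positive one.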
Given an image sub-window, the ZEP feature is determined by the concatenation of four encoded projections as described in the following:
Compute both the integral and the edge projection functions (, , , );
Independently normalize each projection within a symmetrical interval. For instance, in our application we normalized each of the projections to the interval. This will normalize the amplitude of the projection;
Encode each projection as described; allocate for each projection a maximum number of epochs;
Normalize all other (i.e. duration and shape) encoding parameters;
Form the final Zero-crossing based Encoded image Projections (ZEP) feature by concatenation of the encoded projections. Given an image rectangle, the ZEP feature consists of the epochs from all the 4 projections: (, , , ).
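The normalization and concatenation steps listed above can be sketched as below. Several details are assumptions made only for this illustration: the symmetric interval is taken as [-1, 1], durations are normalized by the projection length, the shape count is left as a raw number, and missing epochs are zero-padded. The helper names are hypothetical, and each epoch is assumed to be a (duration, amplitude, shape) tuple.

```python
def normalize_symmetric(projection, half_range=1.0):
    """Linearly map a projection onto [-half_range, +half_range]."""
    lo, hi = min(projection), max(projection)
    if hi == lo:
        # A flat projection carries no zero-crossing information.
        return [0.0] * len(projection)
    scale = 2.0 * half_range / (hi - lo)
    return [(v - lo) * scale - half_range for v in projection]


def pack_epochs(epochs, max_epochs, signal_length):
    """Flatten (duration, amplitude, shape) epochs into a fixed-length vector.

    Surplus epochs are dropped and missing ones are zero-padded, so the
    result always has 3 * max_epochs elements.
    """
    vector = []
    for duration, amplitude, shape in epochs[:max_epochs]:
        vector.extend([duration / signal_length, amplitude, float(shape)])
    vector.extend([0.0] * (3 * max_epochs - len(vector)))
    return vector


print(normalize_symmetric([10, 20, 30]))
packed = pack_epochs([(3, 0.8, 1), (5, -0.7, 3)], max_epochs=5, signal_length=9)
print(len(packed))  # 15
```

Concatenating four such vectors, one per projection, gives a feature of 4 x 15 = 60 elements when five epochs are allocated per projection.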
Image projections are simplified representations of the original image, each of them carrying specific information; the encoding simplifies the image representation even more. The normalization of the image projections, and thus of the epoch amplitudes, ensures independence of the ZEP feature with respect to uniform variation of the illumination. The normalization with respect to the number of elements in the image sub-window leads to partial scale invariance: horizontal projections are invariant to stretching in the vertical direction and vice versa. The scale invariance property of the ZEP feature is achieved by completely normalizing the encoded durations to a specific range (e.g. the encoded horizontal projection becomes invariant to horizontal stretching after duration normalization). We stress that, when compared with previous methods based on projections, which lack the normalization steps, the proposed algorithm increases the overall stability to various influences.
3.1 ZEP on Eye Localization
As noted, image projections have been used in multiple ways for the problem of eye localization. In an exploratory work, 1973 determined the potential of image projections for face description. More recently, 1998, 2004 and 2004 presented the use of the integral projections and/or their extensions for the specific task of eye localization. Especially in  it was noted that image projections in the eye region have a specific sequence of relative minima and maxima assigned to skin (relative minimum), sclera (relative maximum), iris (relative minimum), etc.
Considering a rectangle from the eye region including the eyebrow (as shown in figure 3 (a) ), the associated integral projections have specific epochs, as shown in figure 3 (c) and (d). The particular succession of positive and negative modes is precisely encoded by the proposed technique. On the horizontal integral projection there will be a large (one-mode) epoch that is assigned to skin, followed by an epoch for sclera, a triple-mode, negative epoch corresponding to the eye center and another positive epoch for the sclera and skin. On the vertical integral projection, one expects a positive epoch above the eyebrow, followed by a negative epoch on the eyebrow, a positive epoch between the eyebrow and eye, a negative epoch (with three modes) on the eye and a positive epoch below the eye.
The ZEP feature, due to the invariance properties already discussed, achieves consistent performance under various stresses and is able to discriminate between eyes (patches centered on the pupil) and non-eyes (patches centered on locations at a distance from the pupil center). As explained in section 4.2, on the validation set, using the Fisher linear discriminant, a correct eye detection rate of over 90% is achieved when separating patches that are centered on the pupil from the ones that are shifted.
The block schematic of our eye center localization algorithm is summarized in figure 4. In the first step, a face detector (the cascade of Haar features  delivered with OpenCV) automatically determines the face square. Next the regions of interest are set in the upper third of the detected face: from 26% to 50% of the face square on rows, respectively from 25% to 37% on columns for the left eye and from 63% to 75% on columns for the right eye.
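As a small illustration of the percentages quoted above, the two search regions can be derived from a detected face square as follows. The function name, the (x, y, size) description of the face square and the rounding to whole pixels are assumptions of this sketch; "left" and "right" refer to the position in the image.

```python
def eye_regions(face_x, face_y, face_size):
    """Return the left and right eye search regions as (x0, y0, x1, y1).

    face_x, face_y: top-left corner of the detected face square, in pixels.
    face_size: side of the face square, in pixels.
    """
    def span(lo, hi, origin):
        return origin + round(lo * face_size), origin + round(hi * face_size)

    y0, y1 = span(0.26, 0.50, face_y)      # rows: 26% to 50% of the face
    lx0, lx1 = span(0.25, 0.37, face_x)    # columns for the left eye
    rx0, rx1 = span(0.63, 0.75, face_x)    # columns for the right eye
    return (lx0, y0, lx1, y1), (rx0, y0, rx1, y1)


left, right = eye_regions(face_x=10, face_y=20, face_size=100)
print(left)   # (35, 46, 47, 70)
print(right)  # (73, 46, 85, 70)
```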
Noting that image projections are susceptible to changing their shape under lateral illumination, we introduce a simple method for detecting such a case and we adapt the algorithm to the type of illumination found. After a very simple preprocessing, the ZEP features for each possible location are computed and fed to a classifier to identify the possible eye locations. The possible eyes are then post-processed and the best positions are located as discussed in subsection 4.3.
Regarding the face detection, recent solutions use multiple cascades not only to identify the face rectangle, but also to determine the in-plane (roll) and yaw (frontal/profile) angles of the head. Such a procedure follows the Viola and Jones extension of the initial face detector work , . Thus, it is customary to limit the analysis of “frontal faces” to a maximum rotation of .
4.1 Lateral Illumination Detection
To increase the solution robustness to lateral illumination, we automatically separate such cases. The motivation for the split lies in the fact that side illumination significantly alters the shape of the projections in the eye region, thus decreasing the performance of the classification part.
The lateral illumination detection relies on computing the average values on the eye patch previously selected. The following ratios are considered:
where () and () are the average gray levels on the upper and lower halves of the left eye and ( and ) are their correspondents on the right eye.
The lateral illumination case is considered if any of the computed ratios, , , , is outside the range.
We designed this block such that an illumination that does not produce significant shadows on the eye region is detected as frontal, and as lateral otherwise. In terms of illumination angle, the cases with shadows on the eye region imply an absolute value of the azimuth angle higher than or an elevation angle higher than . Negative elevation (light from below) with a low azimuth value does not produce shadows on the eye region. The interval mentioned above was found by matching the mentioned cases with the ratio values on the training database.
Indeed, 98.54% of the images from the BioID database are detected as frontally illuminated, while the results on Extended Yale B are presented in table 1. The Extended Yale B database has images with various illumination angles, as will be discussed in section 5.5.
Once the ZEP features are determined, the extracted data is fed into a Multi-Layer Perceptron (MLP), having one input layer, one hidden layer and one output layer, trained with the back-propagation algorithm. In our implementation, the number of neurons in the hidden layer is chosen to be half the size of the ZEP feature, as this was empirically determined to be a reasonable trade-off between performance (higher number of hidden neurons) and speed.
In the preferred implementation, each projection is encoded with 5 epochs, leading to 60 elements in the ZEP feature (and 60 inputs to the MLP). If more epochs are provided by projections (which is very unlikely for eye localization - less than 0.1% in the tested cases), the last ones are simply removed.
The training of the MLP is performed with crops of eyes and non-eyes of pixels, as shown in figure 5, while the preferred face size is pixels. The positive examples are taken near the eye ground truth: the eye rectangle overlaps more than 75% with the true eye rectangle. The patches corresponding to the negative examples overlap with the true eye between 50% and 75%, thus leading to a total of 25 positive examples and 100 negative ones from each eye in a single face image. Positive and negative locations are shown in figure 6. This specific choice of positive and negative examples yields high performance in localization.
In total there were 10,000 positive examples and as many negative ones, taken from the authors’ Eye Chimera database, from the Georgia Tech database  and from the neutral poses selected from the YaleB database . We have considered two training variants corresponding to the two types of illumination (frontal or lateral).
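The overlap rule for choosing training examples can be sketched as below. Two assumptions are made for the illustration: the candidate patch and the true eye patch are squares of the same side, and the overlap is measured as the shared area divided by the patch area. The patch side of 20 pixels in the example is hypothetical, as are the function names.

```python
def overlap_fraction(size, dx, dy):
    """Area fraction shared by two size x size squares offset by (dx, dy)."""
    w = max(0, size - abs(dx))
    h = max(0, size - abs(dy))
    return (w * h) / (size * size)


def label_patch(size, dx, dy):
    """Return 'eye', 'non-eye', or None when the patch is not used for training."""
    overlap = overlap_fraction(size, dx, dy)
    if overlap > 0.75:
        return "eye"        # positive example: more than 75% overlap
    if overlap >= 0.50:
        return "non-eye"    # negative example: between 50% and 75% overlap
    return None


print(label_patch(20, 2, 2))    # eye      (overlap 0.81)
print(label_patch(20, 5, 5))    # non-eye  (overlap 0.5625)
print(label_patch(20, 10, 10))  # None     (overlap 0.25)
```

Negatives chosen this way lie close to the true eye, so the classifier is trained to separate well-centred patches from slightly shifted ones rather than from arbitrary background.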
One training procedure uses images from our data set (40%) and from Georgia Tech (60%) and focuses on frontal illumination, eye expressions and occlusions. In this case, the MLP is trained to return the L2 distance from a specific patch center to the true eye center. Thus the MLP performs regression.
The second training procedure (for lateral illumination) uses images only from the frontal pose of the Yale B database and it is used for improving performance against illumination. In this case the training set was labelled with (non-eye – between 50% and 75% overlapping onto the centered eye patch) or (eye – more than 75% overlapping).
As many machine learning algorithms are available, we have performed a short study on examples extracted from the training databases. Given the number of images in the databases, 20,000 examples were used for training the networks and approximately 200,000 were used for classifier validation. For the classification problem, a Support Vector Machine (SVM) produced a 93.7% correct detection rate, the MLP used 92.6% and an ensemble of 50 bagged decision trees 91.5%. For the regression case, an SVM for regression led to an approximation error of 0.090, the regression MLP to 0.096 and a bagged ensemble of regression trees to 0.115. Taking into account the achieved values, there is no significant performance difference among the various machine learning systems tested (a conclusion which matches the findings from), thus our decision to use the MLP was based more on speed.
4.3 Preprocessing and Postprocessing
The conceptual steps in both illumination cases of the actual eye localization procedure are the same: preprocessing, machine learning and postprocessing.
A simple preprocessing is applied for each eye candidate region to accelerate the localization process. Following 2003, we note that the eye center (associated with the pupil) is significantly darker than the surrounding; thus the pixels that are too bright with respect to the eye region (and are not plausible to be eye centers) are discarded. The “too bright” characteristic is encoded as gray–levels higher than a percentage (so called darkness preprocessing threshold in table 2) from the maximum value of the eye region. In the lateral illumination case, this threshold is higher due to the deep shadows that can be found on the skin area surrounding the eye.
In the area of interest, using a step of 2 over a sliding image patch of pixels, we examine all the plausible locations with the proposed ZEP+MLP. We consider as positive results the locations where the value given by the MLP is higher than an experimentally found threshold (see table 2). These positive results are recorded in a separate image (the ZEP image, shown in figure 7) which is further post-processed for eye center extraction.
Since closed eyes (which were included in the training set) are similar to eyebrows, one may get false eye regions given by the eyebrow in the ZEP image. Thus the ZEP image is segmented and labelled, and the lowest and largest regions are associated with the eye. This step will discard, for instance, the regions given by the eyebrow in figure 7 (c).
For the frontal illumination case, due to training with L2 distance as objective, one expects a symmetrical shape around the true eye center. Thus the final eye location is taken as the weighted center of mass of the previously selected eye regions. For the lateral illumination, the binary trained MLP is supposed to localize the area surrounding the eye center and the final eye center is the geometrical center of the rectangle circumscribed to the selected region. We note that in both cases, the specific way of selecting the final eye center is able to deal with holes (caused by reflections or glasses) in the eye region.
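The two ways of extracting the final eye center can be sketched as follows. The representation of a selected region as a list of (x, y, weight) tuples is an assumption of this illustration, as is the use of the classifier response as the weight; the function names are hypothetical.

```python
def weighted_center(region):
    """Weighted centre of mass of a list of (x, y, weight) points."""
    total = sum(w for _, _, w in region)
    cx = sum(x * w for x, _, w in region) / total
    cy = sum(y * w for _, y, w in region) / total
    return cx, cy


def bounding_box_center(region):
    """Centre of the rectangle circumscribed to a list of (x, y, ...) points."""
    xs = [p[0] for p in region]
    ys = [p[1] for p in region]
    return (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2


region = [(10, 10, 1.0), (12, 10, 3.0), (10, 14, 1.0)]
print(weighted_center(region))      # approximately (11.2, 10.8)
print(bounding_box_center(region))  # (11.0, 12.0)
```

Neither estimate requires the region to be filled: missing interior points, such as those caused by reflections, change the weighted mean only slightly and do not change the bounding rectangle at all.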
An overview of how each step is implemented in the two illumination cases considered is shown in table 2.
5 Results and Discussions
We will first discuss the influence of various system parameters on the overall results. For this purpose we will use the BioID database (http://www.bioid.com/downloads/software/bioid-face-database.html). This database contains 1521 gray-scale, frontal facial images of size , acquired with frontal illumination conditions in a complex background. The database contains 16 tilted and rotated faces, people that wear eye-glasses and, in very few cases, people that have their eyes shut or pose various expressions. The database was released with annotations for iris centers. Being one of the first databases that provided facial annotations, BioID became the most used database for face landmark localization accuracy tests, even though it provides limited variability and reduced resemblance with real-life cases. We will use BioID as a starting point in discussing the achieved results (for giving an insight into the system’s various parameters and selecting the best performing state of the art systems) and later continue the evaluation under other stresses like eye expression, illumination angle or pose. Yet the most relevant test is on real-life cases, which are collected in the Labeled Faces in the Wild database that will be presented later on.
The localization performance is evaluated according to the stringent localization criterion . The eyes are considered to be correctly determined if the specific localization error , defined in equation (9) is smaller than a predefined value.
In the equation above, is the Euclidean distance between the ground truth left eye center and determined left eye center, is the corresponding value for the right eye, while is the distance between the ground truth eyes centers. Typical error thresholds are corresponding to eyes centers found inside the true pupils, corresponding to eyes centers found inside the true irises, and corresponding to eyes centers found inside the true sclera. This criterion identifies the worst case scenario.
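The criterion described above, read as the worse of the two eye errors normalized by the true inter-ocular distance, can be sketched in a few lines. The function name and the (x, y) tuple representation of the eye centers are assumptions of this illustration; `math.dist` requires Python 3.8 or newer.

```python
import math


def localization_error(gt_left, gt_right, det_left, det_right):
    """Worst-case eye localization error, normalized by the true eye distance."""
    d_left = math.dist(gt_left, det_left)     # error on the left eye
    d_right = math.dist(gt_right, det_right)  # error on the right eye
    d_eyes = math.dist(gt_left, gt_right)     # ground-truth inter-ocular distance
    return max(d_left, d_right) / d_eyes


error = localization_error((100, 100), (160, 100), (103, 104), (160, 103))
print(round(error, 4))  # 0.0833
```

In the example the left eye is off by 5 pixels and the right eye by 3 pixels for an inter-ocular distance of 60 pixels, so the face is scored by the left eye alone. Replacing `max` with the mean or `min` would give the average and best-eye variants plotted alongside it.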
We note that, while the BioID image size leads to approximately a size for the eye patch, because our targets are HD video frames (for which we will also provide the duration), we upscale the face square to , thus having an eye square of .
The results on the BioID database are shown in figure 8 (a), where we represented the maximum (better localized eye), average and minimum (worst localized eye) accuracy with respect to various values of the threshold.
5.1 The Influence of ZEP Parameters
We investigated the performance of the proposed system when only one type of projection is used. The results are presented in table 3. The computation time dropped to of the full algorithm time if only one projection type is used. The performance drops by in the case of integral projections and by in the case of edge projections. Using the proposed encoding, it is possible to keep both the dimensionality of the feature and the computation time low enough to use more than one type of projection. This supplementary information helps to increase the accuracy when compared with the method in .
Alternatives to the eye crop size and resulting values are presented in table 4. The experiment was performed by re-training the MLP with eye crops of the target size. As one can notice, the results are similar, thus proving the scale invariance of the ZEP feature. Slight variation is due to the pre- and post processing.
5.2 The Dimensionality Reduction
The main visible effect of the proposed encoding is the reduction of the size of the concatenated projections. Yet, as we have adapted the encoding technique to the specifics of the projection functions applied on the eye area, its performance is higher than that of other methods. To see the influence of this encoding technique, we compared the achieved results with the ones obtained by reducing the dimensionality with PCA (the best known such technique) by the same amount as the proposed one. The rest of the algorithm remains the same. The comparative results may be seen in table 5. We also report the results when no reduction was performed.
The results indicate that both methods are lossy compression techniques and lead to decreased accuracy. The proposed method is able to extract the specifics of the eye from the image projections, as discussed in subsection 3.1, being marginally better than the PCA compression.
Furthermore, we take into account that the dimensionality reduction with PCA requires, for each considered location, a matrix multiplication to project the initial vector of size () onto the final space (with size ), thus having the complexity . In comparison, the determination of the epoch parameters is done in a single pass over the initial vector (i.e. with complexity ), thus we expect the proposed method to be significantly faster.
Indeed, the average value for computation time increases from 6 msec (using the proposed method) to 11 msec (almost double) using PCA on a face square. The lack of compression increases the duration to 24 msec per face square.
5.3 Robustness to Noise
An image projection represents a gray-scale average, hence it is reasonable to expect that the proposed method is very robust to noise. To study robustness to noise we have artificially added Gaussian noise to the BioID images and we subsequently measured the localization performance for accuracy. Indeed, while the noise variance increases from 0 to 30, the average accuracy decreases from to only
. The variation of the accuracy with respect to the added noise standard deviation may be seen in figure 9. Examples of images degraded by noise may be seen in figure 10.
5.4 Results on the BioID
To give an initial overview of the state of the art, we consider the results reported by other methods on the BioID database. Other solutions for eye localization are summarized in table 6. The results of the methods for localization of face fiducial points are shown in table 7. Visual examples of images with localized eyes produced by our method are shown in figure 11.
Analyzing the performance, first we note that our method significantly outperforms in both time and accuracy other methods relying on image projections (, ). The explanation lies in the normalization procedure implied when constructing the ZEP feature.
Comparison with face feature fiducial points localization is not straightforward. While such methods localize significantly more points than simple eye centers localization, they also rely strongly on the inter-spatial relation among them to boost the overall performance. Furthermore, they often do not localize eye centers, but eye corners and the top/bottom of the eye, which in many cases are more stable than the eye center (i.e. not occluded or influenced by gaze). And yet we note that our method is comparable in terms of accuracy and significantly faster (if one normalizes the reported time by the number of detected points).
Regarding other methods for eye localization, the proposed method ranks among the top methods in all accuracy tests, always staying close to the best solution. Furthermore, since only 3% of the images in the BioID database show closed eyes, methods that search for circular (symmetrical) shapes are favored on this database. Because we targeted images with expressions, we specifically included closed eyes in our training data set. To validate this assumption, we tested on the Cohn-Kanade database, with very good results, showing that our method is more robust in that case, as shown in figure 8 (b).
Considering as the most important criterion the accuracy at , we note that Timm & Barth (2011) and Valenti & Gevers (2012) provide higher accuracy. Yet we must also note that the highest performance, achieved by a variation of the method described in , namely Val.&Gev.+SIFT, relies on a 10-fold testing scheme, thus using 9 parts of the BioID database for training. Furthermore, taking into account that the BioID database has been used for more than 10 years and provides limited variation, it has been concluded that other tests are also required to validate a method.
Valenti & Gevers (2012) provide results on other datasets and made public the associated code for their baseline system (Val.&Gev.+MS), which is not database dependent. Timm & Barth (2011) do not provide source code or results on any database other than BioID, yet there is publicly available code developed with author involvement (at http://thume.ca/projects/2012/11/04/simple-accurate-eye-center-tracking-in-opencv/). Thus, in what follows, we compare the proposed method against these two on other datasets. Additionally, we include the comparison against the eye detector developed by Ding & Martinez (2010), which has also been trained and tested on other databases and thus is not BioID dependent.
5.5 Robustness to Eye Expressions
As mentioned in the introduction, we are specifically interested in the performance of the eye localization with respect to facial expressions, as real-life cases with fully opened eyes looking straight ahead are rare. We tested the performance of the proposed method on the Cohn-Kanade database. This database was developed for the study of emotions, contains frontally illuminated portraits and is challenging because the eyes are in various poses (near-closed, half-open, wide-open). We tested only on the neutral pose and on the expression apex image from each example. The correct eye location rates, at the standard precisions, are shown in table 8. Typical localization results are presented in figure 12, while the maximum, average and minimum errors are plotted in figure 8 (b).
We note that solutions that try to fit a circular or a symmetrical shape over the iris, and thus perform well on open eyes, encounter significant problems when facing eyes in expressions (as shown in table 8). The achieved results, which are comparable on neutral pose and expression apex images, show that our method performs very well under such complex conditions. They indicate approximately a doubled accuracy when compared with the foremost state of the art method.
5.6 Robustness to Illumination and Pose
We systematically evaluated the robustness of the proposed algorithm with respect to lighting and pose changes. This was tested on the Extended Yale Face Database B (B+). We stress that part of the Yale B database was used for training the MLP for lateral illumination, thus the training and testing sets are completely different.
The Extended Yale B database contains 16128 gray-scale images of 28 subjects, each seen under 576 viewing conditions (9 poses × 64 illuminations). The size of each image is . The robustness with respect to pose and with respect to illumination was evaluated separately.
For evaluating the robustness to illumination, we tested the system on 28 faces, in neutral pose, under changing illumination (64 cases). The results are summarized in table 9.
The system achieves reasonable results even in cases where a human observer is not able to identify the eyes. As long as the illumination is constant over the eye, the system performs very well, supporting the claim that the ZEP feature is invariant to uniform illumination. Examples of localization while illumination varies are presented in figure 13.
For larger illumination angles, due to the uneven distribution of the shadows, the shape of the projections is significantly altered and the accuracy decreases. Examples of cases where the shadows are too strong or unfavorably placed, and where we obtain lower results, are shown in figure 14.
To evaluate the robustness of the algorithm with respect to the face pose, we consider each of the 28 persons with frontal illumination, but under varying poses (9 poses for each person). Pose angles are in the set , thus spanning the typical range for “frontal face”. The results are shown in table 10 and visual examples in figure 15. Taking into account that the maximum number of images that have the worst eye less accurate than is 2, we may conclude that the proposed method is robust to face pose.
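The "worst eye" criterion used above can be sketched as follows. This is a minimal illustration that assumes the commonly used normalized error (the larger of the two eye-center errors divided by the ground-truth inter-ocular distance); the function names, the coordinates and the 0.05 threshold are hypothetical, and counting a face as correct when the error is less than or equal to the threshold is likewise an assumption.

```python
import math

def worst_eye_error(left_est, right_est, left_gt, right_gt):
    """Largest eye-center error, normalized by the ground-truth inter-ocular distance."""
    d_left = math.dist(left_est, left_gt)
    d_right = math.dist(right_est, right_gt)
    inter_ocular = math.dist(left_gt, right_gt)
    return max(d_left, d_right) / inter_ocular

def accuracy_at(errors, threshold):
    """Percentage of faces whose worst-eye error does not exceed the threshold."""
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)

# Hypothetical example: eyes 60 pixels apart, left estimate off by 5 pixels.
err = worst_eye_error((103, 104), (160, 100), (100, 100), (160, 100))
print(round(err, 4))                                   # 5 / 60

# Hypothetical error list; two of the four faces are within the 0.05 threshold.
print(accuracy_at([0.02, 0.04, 0.08, 0.30], 0.05))
```

Because the maximum of the two errors is used, a face counts as correctly localized only when both eyes fall within the threshold.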
When compared with the method proposed in , our solution performs marginally better. Considering the results reported in the mentioned paper, the average accuracy at is 88.07%, computed on the smaller YaleB database, while our method reaches 89.85% on the same subset of azimuth and elevation illumination angles on the larger Extended YaleB database. If we compare the full results on the entire Extended YaleB database (including extreme illumination cases), our method outperforms it by a small margin at high accuracies, as shown in table 11. Our method performs significantly better than the ones proposed by Timm & Barth (2011) and Ding & Martinez (2010), respectively.
5.7 Accuracy in Real-Life Scenarios
Although the BioID, Cohn-Kanade and Extended YaleB databases include specific variations, they are acquired under controlled lighting conditions with only frontal faces and therefore cannot be considered to closely resemble real-life applications. In contrast, databases like the Labeled Face Parts in the Wild (LFPW) and the Labeled Faces in the Wild (LFW), which are randomly gathered from the Internet, contain large variations in the imaging conditions. While LFPW is annotated with facial point locations, only a subset of about 1500 images is made available, and it contains high resolution, rather good quality images. By comparison, the LFW database contains more than 12000 facial images of 5700 individuals, having the resolution pixels, that have been collected “in the wild” and vary in pose, lighting conditions, resolution, quality, expression, gender, race, occlusion and make-up.
The difficulty of the images is confirmed by the human evaluation error reported in , which also provided annotations. While the ground truth is taken as the average of human markings for each point, normalized to the inter-ocular distance, the human evaluation error is considered to be the averaged displacement of one marker.
Examples of the results achieved on the LFW database may be seen in figure 16. Numerical results, compared with the solution from , ,  and with human evaluation error are presented in figure 17.
Regarding the achieved results, we note that even though our method was designed to work on large resolution faces, it provides accurate results when applied to smaller ones. As one can see in figure 17, we significantly outperform the state of the art solutions  and from  by almost 50% at accuracy, on a database of more than 12000 images that presents cases as close to real life as possible, and, by a larger margin, the method from .
5.8 Algorithm Complexity
The entire algorithm requires only four divisions for the normalization of the projections and two for the determination of the region weighting center, with a variable denominator per eye crop, and no high precision operations; it therefore needs only limited fixed point precision. The ZEP+MLP combination is linear with respect to the size of the scanned eye rectangle . The method was implemented in C around OpenCV functionality and run on an Intel i7 at 2.7 GHz, on a single thread; it takes 6 msec for both eyes on a face square of pixels, which is a typical face size for the HD 720p ( pixels) format. We note that an additional 7 msec are required for face detection.
The code can run in real time, including face detection and further face expression analysis. A comparison with state of the art methods is given in table 12 for other eye localization methods and on the right hand side of table 7 for face fiducial points localization solutions; for some works the authors have not reported speed performance but, taking into account the algorithm complexity, it is reasonable to presume that the computation time is too large for real-time operation.
To overcome the difficulty of comparing implementations that run on different platforms, we rely, as a unifying factor, on the single thread benchmarking score provided by PassMark Software Pty Ltd for the specific CPU; this score will be denoted by . It must be noted that such numbers should be considered with caution, since several CPUs may correspond to the description provided by the authors (and we always took the best case) and the benchmark test may not be very relevant for the specific processing required by a solution.
To aggregate the overall time performance of a method we used the following formula:
where is the frame size used for reported results. Note that the formula uses only one of the two dimensions that describe an image to cope with different aspect ratios.
The results for eye localization aggregated with the measure in equation (10) are shown in table 12, in comparison with other eye localization methods. Our method ranks second, following the one proposed by 2000, but it gives consistently better results in terms of accuracy.
It should be noted that while initially only Valenti & Gevers (2012) reported a comparable computational time, after accounting for the larger frame size and the processing power, our method turns out to be 1.5 times faster. Furthermore, to be able to directly compare computation times, we modified the size of the input face to be , which corresponds to a pixels image, leaving everything else the same, and found that our method requires 1.6 msec to localize both eyes; given the additional 7 msec for face detection, we get a total time of 8.6 msec, which is equivalent to a frame rate of 116 frames/sec, showing that we clearly outperform the method from .
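The frame-rate figure follows directly from the reported timings, as the short check below shows. It uses only the numbers given in this section (6 msec and 1.6 msec for eye localization, 7 msec for face detection); the function name is hypothetical and the computation assumes that the two stages run sequentially on each frame.

```python
def frames_per_second(eye_ms, face_ms):
    """Frame rate when face detection and eye localization run back to back."""
    return 1000.0 / (eye_ms + face_ms)

# Reduced face size: 1.6 msec for both eyes plus 7 msec for face detection.
total_small_ms = 1.6 + 7.0
fps_small = frames_per_second(1.6, 7.0)

# Full face size: 6 msec for both eyes plus 7 msec for face detection.
fps_full = frames_per_second(6.0, 7.0)

print(f"{total_small_ms:.1f} msec per frame, {fps_small:.1f} frames/sec")
print(f"{6.0 + 7.0:.1f} msec per frame, {fps_full:.1f} frames/sec")
```

With the reduced face size the total of 8.6 msec corresponds to about 116 frames/sec, while the full face size gives 13 msec per frame, or roughly 77 frames/sec.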
The previous subsections within this “Results and discussions” part have walked through various experiments and measurements that present a thorough comparison of the eye localization performance of the proposed ZEP method, which is shown to perform remarkably well across a wide variety of conditions and datasets. Some of the presented numbers and experiments deserve further emphasis and clarification.
A first issue of discussion is related to the experimental setup, namely the databases that are currently used in the assessment of algorithm accuracy. BioID has gained widespread recognition through the years, as it was one of the earliest face image databases to contain facial landmark ground truth annotations. As such, BioID has been intensively used for accuracy comparisons, with a clear tendency over time to concentrate the efforts on getting top results on BioID alone. As one may have noticed in table 6, the proposed method is outperformed on BioID by the algorithms proposed by Timm & Barth (2011) and Valenti & Gevers (2012).
We can notice that most of the methods are overtrained in standard conditions and thus perform very well within their over-learned domain. As such, we claim that these approaches are not relevant in a broader, real-life testing scenario. The approaches proposed by Valenti & Gevers (2012) and Timm & Barth (2011) are thus retained as significant eye localization methods and we tested them further; we also included the solution from  as a high profile method built outside the BioID database. The results showed that we outperform these methods by a wide margin.
As is well known in the field, the BioID database contains mostly frontal pose, frontal illumination and neutral expression faces, and captures only a small part of the problems related to eye localization. As such, intensive performance comparison must be carried out outside these standard conditions, as Valenti & Gevers (2012) do in the case of varying illumination and pose and as we do in the case of noise, variable illumination, expression and pose variations. Several tests that are reviewed again here show the superior performance of the proposed ZEP eye localization method in these extreme conditions.
Non-frontal illumination and subject pose variations are key issues in real-life, unconstrained applications. Typically these are tested within the Extended YaleB database, where ZEP performs marginally better (+2%) than the method in  and significantly better than  and , as shown in table 11. Subject emotional expressions strongly affect the eye shape and its surroundings. The Cohn-Kanade database is the state of the art testbed in emotion-related tasks; in this case, the ZEP eye localization outperforms  by 10%,  by some 30%, and  by 60%, as shown in table 8.
Within all databases, closed eyes present an independent challenge. As noticed by the authors in , their method is prone to errors in detecting the closed eye center (which is confirmed by the experiments across all databases). The proposed ZEP method is much more robust to closed eyes, due to the way in which the eye profile is described within the proposed encoding of the luminance profiles.
Finally, we consider that the most relevant test is performed on the LFW database, taking into account its size (more than 12000 images), the image resolution (extremely low) and especially the fact that the images were acquired “in the wild”. Yet, on the LFW database, which is currently one of the most challenging tasks, we outperform the method in  by at least 5%, the one in  by a wide margin (+13%) and the method from  by nearly 30%. Furthermore, we are much closer to human accuracy (as shown in figure 17 (a)).
Regarding the computational complexity, the proposed method requires a computational time which is inferior to the time required by the method from ; yet the accuracy of the proposed method is significantly higher. If we compare only the computation time, without considering the image size, one may consider the method from  to be faster. Yet tests showed that the proposed solution is still faster than the implementation from  at equal image resolution (namely ). We thus claim that the proposed method is the fastest solution from the select group of high accuracy methods.
In this paper, we proposed a new method to estimate the location of the eye centers using a combination of Zero-based Encoded image Projections and an MLP. The eye location is determined by discriminating between eyes and non-eyes through the analysis of the normalized image projections encoded with the zero-crossing based method. The extensive evaluation of the proposed approach showed that it can achieve high accuracy in real time. While the ZEP feature was used here for eye description, we consider that it is general enough to be used in numerous other problems.
- Asadifard & Shanbezadeh  Asadifard, M. and Shanbezadeh, J. Automatic adaptive center pupil detection using face detection and cdf analysis. In IMECS, pp. 130–133, 2010.
- Asteriadis et al.  Asteriadis, Stylianos, Nikolaidis, Nikos, and Pitas, Ioannis. Facial feature detection using distance vector fields. Pattern Recognition, 42(7):1388 – 1398, 2009.
- Becker et al.  Becker, H. C., Nettleton, W. J., Meyers, P. H., Sweeney, J. W., and Nice, C. M. Digital computer determination of a medical diagnostic index directly from chest X-ray images. IEEE Trans. on Biomedical Engineering, 11(3):62 – 72, July 1964.
- Belhumeur et al.  Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., and Kumar, N. Localizing parts of faces using a consensus of exemplars. In IEEE CVPR, pp. 545 – 552, 2011.
- Blelloch  Blelloch, G. E. Prefix sums and their applications. synthesis of parallel algorithms. Technical report, University of Massachusetts, 1990.
- Campadelli et al.  Campadelli, P., Lanzarotti, R., and Lipori, G. Precise eye localization through a general-to-specific model definition. In BMVC, pp. I – 187, 2006.
- Ciesla & Koziol  Ciesla, M. and Koziol, P. Eye pupil location using webcam. Human-Computer Interaction, 2012.
- Cootes et al.  Cootes, T. F., Edwards, G. J., and Taylor, C. J. Active appearance models. IEEE Transactions on PAMI, 23(6):681 – 685, 2001.
- Cristinacce et al.  Cristinacce, D., Cootes, T., and Scott, I. A multi-stage approach to facial feature detection. In BMVC, pp. 277 – 286, 2004.
- Crow  Crow, F. Summed-area tables for texture mapping. Proceedings of SIGGRAPH, 18(3):207 – 212, 1984.
- Dantone et al.  Dantone, M., Gall, J., Fanelli, G., and Gool, L. Van. Real-time facial feature detection using conditional regression forests. In IEEE CVPR, pp. 2578 – 2585, 2012.
- Ding & Martinez  Ding, Liya and Martinez, Aleix M. Features versus context: An approach for precise and detailed detection and delineation of faces and facial features. IEEE Trans Pattern Anal Mach Intell., 32(11):2022 – 2038, 2010.
- Everingham & Zisserman  Everingham, Mark and Zisserman, Andrew. Regression and classification approaches to eye localization in face images. In IEEE FGR, pp. 441 – 446, 2006.
- Feng & Yuen  Feng, G. C. and Yuen, P. C. Variance projection function and its application to eye detection for human face recognition. Pattern Recognition Letters, 19(9):899 – 906, July 1998.
- Florea et al.  Florea, Laura, Florea, Corneliu, Vertan, Constantin, and Vrânceanu, Ruxandra. Zero-crossing based image projections encoding for eye localization. In EUSIPCO, pp. 150 – 154, 2012.
- Georghiades et al.  Georghiades, A., Belhumeur, P., and Kriegman, D. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. on PAMI, 23(6):643 – 660, 2001.
- Gonzalez & Woods  Gonzalez, Rafael and Woods, Richard E. Digital Image Processing. Prentice Hall, New Jersey, 2-nd edition, 2001.
- Gonzalez-Ortega et al.  Gonzalez-Ortega, D., Díaz-Pernas, F. J., Anton-Rodríguez, M., Martínez-Zarzuela, M., and Díez-Higuera, J. F. Real-time vision-based eye state detection for driver alertness monitoring. Pattern Analysis and Applications, 16(3):285 – 306, 2013.
- Hamouz et al.  Hamouz, M., Kittler, J., Kamarainen, J. K., Paalanen, P., Kalviainen, H., and Matas, J. Feature-based affine-invariant localization of faces. IEEE Trans. on PAMI, 27(9):643 – 660, 2005.
- Hansen & Ji  Hansen, Dan Witzner and Ji, Qiang. In the eye of the beholder: A survey of models for eyes and gaze. IEEE Trans. on PAMI, 32(3):478 – 500, March 2010.
- Huang et al.  Huang, G., Ramesh, M., Berg, T., and Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, University of Massachusetts, 2007.
- Jesorsky et al.  Jesorsky, O., Kirchberg, K., and Frischholz, R. Robust face detection using the Hausdorff distance. In Bigun, J. and Smeraldi, F. (eds.), AVBPA, pp. 90–95. Springer, 2000.
- Jones & Viola  Jones, M. and Viola, P. Fast multi-view face detection. Technical Report 096, Mitsubishi Electric Research Laboratories, 2003.
- Kanade  Kanade, T. Picture processing by computer complex and recognition of human faces. Technical Report, Kyoto University, Department of Information Science, 1973.
- Kanade et al.  Kanade, T., Cohn, J. F., and Tian, Y. Comprehensive database for facial expression analysis. In IEEE FG, pp. 46–53, 2000.
- Kim et al.  Kim, S., Chung, S.-T., Jung, S., Oh, D., Kim, J., and Cho., S. Multi-scale Gabor feature based eye localization. In WASET, volume 21, pp. 483–487, 2007.
- King & Phipps  King, R. A. and Phipps, T. C. Shannon, TESPAR and approximation strategies. Computers & Security, 18(5):445 – 453, 1999.
- Kroon et al.  Kroon, B., Hanjalic, A., and Maas, S. M. Eye localization for face matching: is it always useful and under what conditions. In CIVR, pp. 379 – 387, 2008.
- Lee et al.  Lee, K.C., Ho, J., and Kriegman, D. Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. PAMI, 27(5):684–698, 2005.
- Liu et al.  Liu, Weifeng, Wang, Yanjiang, and Jia, Lu. An effective eye states detection method based on projection. In ICSP, pp. 829 – 831, 2010.
- Milborrow & Nicolls  Milborrow, S. and Nicolls, F. Locating facial features with an extended active shape model. In ECCV, pp. 504 – 513, 2008.
- Mostafa & Farag  Mostafa, Eslam and Farag, Aly. Complex bingham distribution for facial feature detection. In ECCV, Workshops and Demonstrations, pp. 330–339, 2012.
- Nefian & Hayes  Nefian, Ara V. and Hayes, Monson H. Maximum likelihood training of the embedded HMM for face detection and recognition. In IEEE ICIP, pp. 33 – 36, 2000.
- Niu et al.  Niu, Z., Shan, S., Yan, S., Chen, X., and Gao, W. 2D cascaded adaboost for eye localization. In IEEE ICPR, pp. 1216 – 1219, 2006.
- PassMark Software Pty Ltd [retrieved January 2015] PassMark Software Pty Ltd. CPU Mark - high–mid range, retrieved January 2015. from http://www.cpubenchmark.net/singleThread.html.
- Ramirez & Fuentes  Ramirez, G.A. and Fuentes, O. Multi-pose face detection with asymmetric haar features. In IEEE Workshop on Applications of Computer Vision, WACV 2008, pp. 1 – 6, 2008.
- Timm & Barth  Timm, F. and Barth, E. Accurate eye centre localisation by means of gradients. In VISAPP, pp. 125–130, 2011.
- Turkan et al.  Turkan, Mehmet, Pardas, Montse, and Cetin, A. Enis. Edge projections for eye localization. Optical Engineering, 47:047007, 2004.
- Valenti & Gevers  Valenti, R. and Gevers, T. Accurate eye center location and tracking using isophote curvature. In IEEE CVPR, pp. 1–8, 2008.
- Valenti & Gevers  Valenti, R. and Gevers, T. Accurate eye center location through invariant isocentric patterns. IEEE Trans. PAMI, 34(9):1785–1798, 2012.
- Valstar et al.  Valstar, M., Martinez, B., Binefa, X., and Pantic, M. Facial point detection using boosted regression and graph models. In IEEE CVPR, pp. 2729 – 2736, 2012.
- Verjak & Stephancic  Verjak, M. and Stephancic, M. An anthropological model for automatic recognition of the male human face. Ann. Human Biology, 21:363–380, 1994.
- Viola & Jones  Viola, P. and Jones, M. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
- Vukadinovic & Pantic  Vukadinovic, D. and Pantic, M. Fully automatic facial feature point detection using gabor feature based boosted classifiers. In IEEE Conf. on Systems, Man and Cybernetics, pp. 1692 – 1698, 2005.
- Wu & Zhou  Wu, J. and Zhou, Zhi-Hua. Efficient face candidates selector for face detection. Pattern Recognition, 36(5):1175–1186, 2003.
- Zhou & Geng  Zhou, Zhi-Hua and Geng, Xin. Projection functions for eye detection. Pattern Recognition, 37(5):1049 – 1056, 2004.