Finalists in the OpenCV AI Competition 2021
Semantic segmentation of eyes has long been a vital pre-processing step in many biometric applications. The majority of existing works focus only on high-resolution eye images, while little has been done to segment the eyes from low-quality images in the wild. However, this is a particularly interesting and meaningful topic, as eyes play a crucial role in conveying the emotional state and mental well-being of a person. In this work, we take two steps toward solving this problem: (1) We collect and annotate a challenging eye segmentation dataset containing 8882 eye patches from 4461 facial images of different resolutions, illumination conditions and head poses; (2) We develop a novel eye segmentation method, Shape Constrained Network (SCN), that incorporates shape prior into the segmentation network training procedure. Specifically, we learn the shape prior from our dataset using VAE-GAN, and leverage the pre-trained encoder and discriminator to regularise the training of SegNet. To improve the accuracy and quality of predicted masks, we replace the loss of SegNet with three new losses: Intersection-over-Union (IoU) loss, shape discriminator loss and shape embedding loss. Extensive experiments show that our method outperforms state-of-the-art segmentation and landmark detection methods in terms of mean IoU (mIoU) accuracy and the quality of segmentation masks. The eye segmentation database is available at https://www.dropbox.com/s/yvveouvxsvti08x/Eye_Segmentation_Database.zip?dl=0.
Eyes are not only the most vital sensory organ but also play a crucial role in conveying a person’s emotional state and mental well-being. Although there have been numerous works on blink detection [31, 5, 40], we argue that accurate segmentation of sclera and iris can provide much more information than blinks alone, thus allowing us to study the finer details of eye movement such as saccade, fixation, and other gaze patterns. As a pre-processing step in iris recognition, iris segmentation in high-resolution, expressionless, frontal face images has been well studied by the biometric community. However, the commonly used Hough-transform-based method does not work well on low-resolution images captured under normal Human-Computer Interaction (HCI) and/or video-chat scenarios. This is particularly evident when the boundaries of the eyes and iris are blurry, and when the shape of the eye differs greatly due to pose variation and facial expression. To our knowledge, this work presents the first effort in solving the eye segmentation problem under such challenging conditions.
To investigate the topic of eye segmentation in the wild, the first problem we need to address is the lack of data. Although both the biometric community and the facial analysis community have published an abundance of eye datasets over the years, none can be used as is for our purpose, because the former category only contains high-resolution eye scans while the latter category lacks annotation of segmentation masks for sclera and iris. SSERBC 2017 provided a sclera segmentation database that separates the sclera from the other parts; however, sclera annotation alone is not sufficient for our research. OpenEDS introduced an eye segmentation database annotated with the background, the sclera, the iris and the pupil regions. The database was captured using a virtual-reality (VR) head-mounted display fitted with two synchronized eye-facing cameras at a frame rate of 200 Hz under controlled illumination. However, this grey-scale database lacks variety in pose and resolution, so it cannot be used for eye segmentation in the wild. In fact, existing databases were collected in controlled environments (and mainly in high resolution), and there is no in-the-wild eye database that contains eye images in a wide range of resolutions. As a step towards the solution, we create a sizable eye segmentation dataset of 8882 eye patches by manually annotating 4461 face images selected from HELEN, 300VW, 300W, CVL, IMDB-WIKI, Utdallas Face database, and Columbia Gaze database.
To solve the segmentation problem, we propose a novel method, Shape Constrained Network (SCN), that incorporates shape prior into the segmentation model. Specifically, we first pre-train a VAE-GAN  on the ground truth segmentation masks to learn the latent distribution of eye shapes. The encoder and discriminator are then utilised to regularise the training of the base segmentation network through the introduction of shape embedding loss and shape discriminator loss. This approach not only enables the model to produce accurate eye segmentation masks, but also helps suppress artifacts, especially on low-quality images in which the fine details are missing. In addition, since the regularisation is applied during the training, SCN does not incur additional computational cost to the base segmentation network during inference. Through extensive experiments, we demonstrate that SCN outperforms state-of-the-art segmentation and landmark localisation methods in terms of mean mIoU metric.
The main contributions of this work are as follows:
We collect and annotate a large eye segmentation dataset consisting of 8882 eye patches from 4461 face images in the wild. This is the first of its kind and a significant step towards solving the problem of eye segmentation.
We propose Shape Constrained Network (SCN), a novel segmentation method that utilises shape prior to increase accuracy on low quality images and to suppress artifacts.
We redesign the objective function of SegNet with three new losses: Intersection-over-Union (IoU) loss, shape discriminator loss and shape embedding loss.
Eye localisation. Early methods [13, 12] often rely on edge information of the original image or on handcrafted feature maps when locating eyes and iris. In one such approach, the eye is modelled as two parabolic curves (lids) and an ellipse (iris), whose parameters are determined by the Hough transform. Even though this method has been widely used in many iris recognition systems, it is very sensitive to image noise and pose changes. Moreover, these algorithms are designed to work on eye scans of high quality (i.e., a minimum of 70 pixels in iris radius), and they do not perform well on in-the-wild images captured with consumer-grade cameras.
Everingham and Zisserman attempted to solve this problem with three different approaches: (a) ridge regression that minimizes errors in the predicted eye positions; (b) a Bayesian model of eye and non-eye appearance; (c) a discriminative detector trained using AdaBoost. This is one of the earliest detectors that achieved some degree of success in detecting eyes in low-resolution images. However, it still fell short of detecting eyes in extreme poses and illumination conditions, partly because it utilized image intensities rather than robust image features (e.g., HoG). Moreover, these detectors only locate two landmarks, which are insufficient for dense segmentation.
In fact, many existing 2D/3D facial landmark detection methods [41, 21, 4, 1, 2] are able to provide significantly better localisation of eyes than the aforementioned methods, owing to the tremendous efforts in collecting and annotating large facial image databases [35, 37, 25, 14]. Unfortunately, the majority of these works only provide a small number of landmarks for each eye (e.g., 6 landmarks in the 68-point markup), which is barely enough for describing the full structure of the eye (i.e., iris, pupil and sclera) in a 2D image. Moreover, a significant portion of those annotated images do not display a clear eye structure. To the best of our knowledge, there is no large-scale database for dense eye landmark localisation or eye segmentation. In this paper, we take a step forward by collecting the first in-the-wild eye database that is annotated with landmarks and fine-grained segmentation masks.
Deep semantic segmentation of image.
The above methods are all condition-sensitive algorithms, as they are meticulously designed around a predefined setting (e.g., the number of points, shapes or curves), and thus may not suit our specific purpose. More recently, various deep learning techniques have achieved impressive results in semantic segmentation of images. Fully Convolutional Networks (FCN) is one of the most influential deep learning methods for image segmentation. FCN is an encoder-decoder network that predicts the segmentation mask in an end-to-end manner. It adopts VGG-16 as the backbone of the encoder, and utilises transposed convolution for upsampling and generating the mask. SegNet also adopts VGG-16 in its encoder network; however, compared with FCN, it removes the fully connected layers, which leads to a more lightweight model. Additionally, inspired by unsupervised feature learning, the decoder of SegNet employs max-unpooling layers, which reuse the indices of the corresponding max-pooling operations of the encoder. The reuse of indices not only improves boundary delineation but also helps reduce the number of trainable parameters. DeepLab proposed to use an Atrous Convolutional Neural Network (Atrous-CNN) to generate the segmentation mask directly from the input image. The mask is further refined by a fully-connected Conditional Random Field (CRF) layer with mean-field approximation for fast inference.
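To make the index-reuse idea concrete, the following is a minimal one-dimensional sketch in plain Python. It is not SegNet's implementation: the function names, the non-overlapping window and the zero fill value are assumptions chosen for illustration only.

```python
def max_pool_with_indices(values, window=2):
    """Non-overlapping 1D max-pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for start in range(0, len(values) - window + 1, window):
        block = values[start:start + window]
        # Position of the maximum inside the current window (first one wins on ties).
        offset = max(range(window), key=lambda k: block[k])
        pooled.append(block[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool(pooled, indices, output_length):
    """Put each pooled value back at its recorded position; every other position stays zero."""
    output = [0.0] * output_length
    for value, index in zip(pooled, indices):
        output[index] = value
    return output


signal = [0.1, 0.9, 0.4, 0.3, 0.7, 0.2]
pooled, indices = max_pool_with_indices(signal)
restored = max_unpool(pooled, indices, len(signal))
```

The decoder needs no learned upsampling weights in this step: the stored indices alone determine where each value is placed, which is why boundaries are preserved at no extra parameter cost.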
One drawback of these methods is that they need to learn the shape prior from the input images from scratch, which is often an inefficient procedure. Since the shapes of sclera and iris are highly regular, shape information can be exploited for eye segmentation. Moreover, in low-resolution images that do not display many details (such as prominent edges), not using a shape prior can produce sub-optimal results for this task because the pixel intensities alone do not provide sufficient contextual information.
Deep generative models with shape constraint.
Several deep generative models that take advantage of shape prior have been developed. The Shape Boltzmann Machine (ShapeBM) provides a good way to construct a strong model of binary shape using Deep Boltzmann Machines (DBMs). ShapeBM is an inference-based generative model that can generate realistic examples that differ from the training data. Nonetheless, ShapeBM is quite sensitive to appearance changes of the object across views, which makes it less appealing for the task of eye segmentation in the wild. More recently, Anatomically Constrained Neural Networks (ACNN) incorporated shape prior knowledge into semantic segmentation and super-resolution models. Since the shape prior of ACNN is learned by an auto-encoder, the reconstructed segmentation masks are often blurry and lack sharp edges. Shape prior can also be modelled with a Variational Auto-Encoder (VAE). VAE tries to learn a latent representation of training examples by mapping them to a posterior distribution. Unfortunately, VAE still fails to produce clear and sharp segmentation masks. To address this problem, Larsen et al. presented VAE-GAN, which combines VAE and GAN with a shared generator. The element-wise reconstruction error of VAE is replaced by feature-wise errors to better capture the data distribution. VAE-GAN can thus balance the similarity and variation between the inputs and outputs.
Due to the lack of available data for eye segmentation in the wild, we create a new dataset by annotating 4461 facial images found in HELEN, 300VW, 300W, CVL, IMDB-WIKI, Utdallas Face database, and Columbia Gaze database. The images were selected to ensure that a variety of head poses, image qualities, resolutions, eye shapes and gaze directions are represented in this dataset.
Once the face images are collected, we first use a facial landmark detector to find an approximate location of the eyes in each image. For each eye patch, we then manually annotate the segmentation mask. Each pixel in the patch is labelled as either background, sclera, or iris. Based on the annotated segmentation mask, the bounding box of the eye patch is then adjusted so that it is always centred on the eye with a fixed aspect ratio of 2:1. Some examples of the eye patches and their corresponding segmentation masks are illustrated in Figure 1.
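The bounding-box adjustment can be pictured with the short sketch below. It is only one plausible reading of the step: the function name, the label encoding (0 for background, non-zero for sclera or iris), the absence of any margin and the choice to grow the shorter side are all assumptions, not details given in the text.

```python
def eye_bounding_box(mask, aspect_ratio=2.0):
    """Return (left, top, width, height) of a box centred on the labelled eye pixels.

    `mask` is a list of rows; 0 is background and any other value is sclera or iris
    (assumed encoding). The box may extend beyond the mask borders.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        raise ValueError("mask contains no eye pixels")
    top, bottom = min(rows), max(rows)
    left, right = min(cols), max(cols)
    # Centre and size of the tight box around the annotated eye region.
    centre_x = (left + right + 1) / 2.0
    centre_y = (top + bottom + 1) / 2.0
    width = float(right - left + 1)
    height = float(bottom - top + 1)
    # Grow the shorter side so that width / height equals the target ratio.
    if width / height < aspect_ratio:
        width = height * aspect_ratio
    else:
        height = width / aspect_ratio
    return (centre_x - width / 2.0, centre_y - height / 2.0, width, height)


example_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 2, 2, 1, 0],
    [0, 1, 2, 2, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
box = eye_bounding_box(example_mask)
```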
Each eye patch is further tagged with 3 discrete attributes: head pose (near-frontal or non-frontal), resolution (high resolution or low resolution), and occlusion. The ’head pose’ attribute is manually annotated following the guideline that a head yaw within 30 degrees is considered ’near-frontal’, while the rest are considered ’non-frontal’. The ’resolution’ tag is derived by comparing the eye patch’s area to a fixed threshold of 4900 pixels, which is typically the number of pixels one can expect from a face image captured by a 720p HD webcam during video chat. The distribution of eye patch sizes in our dataset is shown in Figure 2. The ’occlusion’ attribute labels whether the image contains hair, glasses, or a profile view of the face (namely, self-occlusion). Detailed statistics of the dataset are given in Table 1.
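As an illustration of the two numeric rules above, the sketch below encodes the 4900-pixel area threshold and the 30-degree yaw guideline. The function names are hypothetical, and the handling of values exactly at either boundary is an assumption, since the text does not say which side they fall on. In the dataset itself the head pose tag was assigned manually; the second function only mirrors the guideline.

```python
LOW_RES_AREA_THRESHOLD = 4900   # pixels, as stated in the text
NEAR_FRONTAL_MAX_YAW = 30.0     # degrees, as stated in the text


def resolution_tag(patch_width, patch_height):
    """Tag an eye patch by comparing its area with the fixed threshold."""
    area = patch_width * patch_height
    # Assumption: a patch exactly at the threshold counts as high resolution.
    return "high" if area >= LOW_RES_AREA_THRESHOLD else "low"


def head_pose_tag(yaw_degrees):
    """Mirror the annotation guideline for head yaw."""
    # Assumption: "within 30 degrees" includes exactly 30 degrees, in either direction.
    return "near-frontal" if abs(yaw_degrees) <= NEAR_FRONTAL_MAX_YAW else "non-frontal"
```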
Table 1: Statistics of the dataset.
| Statistic | Value |
| --- | --- |
| Total number of faces | 4461 |
| Total number of eye patches | 8882 |
| Non-frontal faces proportion | 18.35% |
| Low-resolution eye patches proportion | 57.58% |
| Proportion of images with any kind of occlusion | 16.05% |
In this section, we describe the proposed Shape Constrained Network (SCN), which mainly consists of a segmentation network and a shape regularisation network. We design the loss functions for each part of the network and explain the training of SCN in detail.
We adapt SegNet for our front-end segmentation network, and employ VAE-GAN to regularise the predicted shape as well as to discriminate between real and fake examples. Our network is depicted in Figure 3. The training of SCN is divided into two steps: we first pre-train the shape regularisation network (i.e., VAE-GAN) using the ground truth eye segmentation masks; afterwards, we borrow its encoder and discriminator for training our main segmentation network. Inference with SCN is the same as with SegNet, since we do not alter its encoder-decoder structure; instead, we reformulate the losses and improve the training by adding shape regularisation.
We utilise VAE-GAN to learn the shape prior from ground truth segmentation masks. Simply put, VAE-GAN is a combination of a Variational Auto-Encoder (VAE) and a Generative Adversarial Network (GAN) that share a common decoder/generator. Specifically, in the VAE part, the encoder learns the parameters that map segmentation masks to the latent space, while the generator decodes the latent vector to synthesise a segmentation mask. In the GAN part, the discriminator takes the generated mask and the ground truth mask, and learns to distinguish real from fake. Given a training example, the training loss of VAE-GAN combines three terms: a KL divergence term, a reconstruction term and an adversarial term.
Two kinds of masks are generated during training: one decoded from the feature embedding of the ground truth data, and one decoded from a randomly sampled latent vector. The KL divergence term constrains the distribution of the latent vector given the input to a Gaussian (normal) distribution. The reconstruction loss measures the Euclidean distance between the outputs of a hidden layer of the discriminator for the original image and for the image reconstructed by the auto-encoder. In other words, in VAE-GAN the similarity of the ground truth and the reconstructed image is not evaluated directly. Instead, both are first fed into the discriminator, and the distance between their feature maps is used to measure the similarity. The adversarial loss plays the minimax game between three candidates: original images, reconstructed images and images randomly sampled from the latent space. The original images provide the discriminator with true examples, while the other two candidates aim at fooling the discriminator. The authors of VAE-GAN did not indicate any method for choosing the hidden layer. Theoretically, it can be any hidden convolutional layer in the discriminator. In this paper, we empirically chose the hidden layer with index 1.
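The three loss terms described above can be sketched as follows. This is a reconstruction in assumed notation, following the usual VAE-GAN formulation rather than symbols taken from this text: $x$ is a ground truth mask, $q(z \mid x)$ the encoder distribution, $\mathrm{Gen}$ the shared generator, $\mathrm{Dis}$ the discriminator, $\mathrm{Dis}_l$ the output of its $l$-th hidden layer, $\tilde{x}$ the reconstructed mask and $\hat{x}$ the mask decoded from a random latent sample. Signs and weighting between the terms are also assumptions.

```latex
% Assumed notation; a sketch of the standard VAE-GAN terms, not the authors' exact equations.
\begin{align}
\tilde{x} &= \mathrm{Gen}(z), \quad z \sim q(z \mid x), \qquad
\hat{x} = \mathrm{Gen}(z_p), \quad z_p \sim \mathcal{N}(0, I) \\
\mathcal{L}_{\mathrm{KL}} &= \mathrm{KL}\!\left( q(z \mid x) \,\Vert\, \mathcal{N}(0, I) \right) \\
\mathcal{L}_{\mathrm{rec}} &= \left\lVert \mathrm{Dis}_l(x) - \mathrm{Dis}_l(\tilde{x}) \right\rVert_2^2 \\
\mathcal{L}_{\mathrm{gan}} &= \log \mathrm{Dis}(x) + \log\!\left(1 - \mathrm{Dis}(\tilde{x})\right) + \log\!\left(1 - \mathrm{Dis}(\hat{x})\right)
\end{align}
```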
We borrow the architecture of SegNet for our eye segmentation network, but reformulate the loss function to improve the segmentation accuracy and robustness. As mentioned previously, SegNet is an encoder-decoder network without fully connected layers; its decoder reuses the pooling indices computed in the max-pooling steps of the encoder to perform non-linear upsampling. Owing to this, our segmentation network has fewer trainable parameters while maintaining good performance.
Shape reconstruction loss. Based on VGG-16, SegNet employs softmax cross entropy as the loss function. However, as Intersection-over-Union (IoU) is quite effective in evaluating the segmentation accuracy, we replace the original loss with the differentiable IoU loss. Moreover, compared with the cross entropy loss, the IoU loss can better balance the contribution from different regions, thus avoiding the domination of one particular category (e.g., the background pixels, especially when the eye is nearly closed). The loss compares the reconstructed feature map with the ground truth, and includes a very small constant to avoid division by zero.
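For readers who want a concrete picture of a differentiable IoU loss, the sketch below shows one common soft-IoU form on flattened per-pixel values in [0, 1]. It is an illustration under assumptions, not necessarily the exact formula used here: the function names, the placement of the small constant and the per-class averaging are all choices made for this example, and a practical version would use a tensor library instead of Python lists.

```python
def soft_iou_loss(predicted, target, epsilon=1e-7):
    """1 - soft IoU for one class; inputs are flat sequences of values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(predicted, target))
    # Soft union: |P| + |T| - |P and T|, computed per pixel.
    union = sum(p + t - p * t for p, t in zip(predicted, target))
    # epsilon keeps the ratio defined when a class is absent from both inputs.
    return 1.0 - (intersection + epsilon) / (union + epsilon)


def mean_class_iou_loss(predicted_by_class, target_by_class, epsilon=1e-7):
    """Average the per-class loss so that no single class (e.g. background) dominates."""
    losses = [
        soft_iou_loss(p, t, epsilon)
        for p, t in zip(predicted_by_class, target_by_class)
    ]
    return sum(losses) / len(losses)
```

Because each class contributes one ratio regardless of how many pixels it covers, a nearly closed eye with very few sclera pixels still carries the same weight as the background class.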
Shape embedding loss. Regularisation of the eye shape is important for producing a good segmentation mask. Inspired by ACNN, we regularise the shape prediction in the latent space of the pre-trained VAE-GAN. Given a training image, the segmentation network predicts a mask, which is encoded into a latent distribution by the VAE encoder. The ground truth mask is encoded in the same way. To ensure that the feature embedding of the predicted mask lies close to that of the ground truth, we minimise the expected L2 distance between the two latent vectors. Since the variance of the ground truth mask's feature embedding does not depend on any parameter of the segmentation model, it can be left out of the resulting latent loss. The remaining shape embedding loss includes a weighting factor that balances precision and error tolerance.
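The reasoning behind dropping the ground truth variance can be checked with a short derivation. The notation below is assumed for illustration: the ground truth mask is encoded as $z \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, the predicted mask as $\hat{z} \sim \mathcal{N}(\hat{\mu}, \mathrm{diag}(\hat{\sigma}^2))$, and the two are taken to be independent. The first line is a standard identity for independent Gaussians; the second line is only one plausible way of introducing the balancing weight $\lambda$ and may differ from the authors' exact form.

```latex
% Assumed notation; the placement of \lambda in the second line is a guess.
\begin{align}
\mathbb{E}\,\lVert z - \hat{z} \rVert_2^2
  &= \lVert \mu - \hat{\mu} \rVert_2^2 + \sum_i \sigma_i^2 + \sum_i \hat{\sigma}_i^2 \\
\mathcal{L}_{\mathrm{embed}}
  &= \lVert \mu - \hat{\mu} \rVert_2^2 + \lambda \sum_i \hat{\sigma}_i^2
\end{align}
```

The term $\sum_i \sigma_i^2$ depends only on the frozen encoder and the ground truth mask, so its gradient with respect to the segmentation parameters is zero and removing it does not change the optimum.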
Shape discriminator loss. The discriminative power of VAE is usually not strong enough to single out hard negative examples. Hence, we propose a discriminator loss, computed with the pre-trained discriminator on the predicted mask, to further regularise the generated mask.
Although the discriminator loss can improve the quality of the segmentation result, it might also slow down the convergence of training. Therefore, it is important to weight the contribution of this loss.
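A minimal sketch of how such a loss and its weighting could look is given below. It assumes that the frozen discriminator outputs the probability that a mask is real and that the loss takes the common -log D form; neither detail is confirmed by the text. The function names and the weighted-sum combination are likewise hypothetical, and the weights are left as required arguments because their values are not given here.

```python
import math


def shape_discriminator_loss(discriminator_scores, epsilon=1e-7):
    """Mean of -log D(predicted mask) over a batch.

    `discriminator_scores` are assumed to be probabilities of "real" in [0, 1],
    produced by the frozen, pre-trained discriminator.
    """
    clipped = [min(max(score, epsilon), 1.0) for score in discriminator_scores]
    return -sum(math.log(score) for score in clipped) / len(clipped)


def weighted_total_loss(iou_loss, embedding_loss, discriminator_loss,
                        embedding_weight, discriminator_weight):
    """Assumed combination: IoU loss plus the two weighted shape terms."""
    return (iou_loss
            + embedding_weight * embedding_loss
            + discriminator_weight * discriminator_loss)
```

Keeping the discriminator weight small limits how strongly the adversarial signal competes with the IoU term early in training, which is one way to address the slower convergence mentioned above.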
The segmentation network and the shape regularisation network need to be trained separately. First, we train the VAE-GAN using only ground truth segmentation masks. Our objective is to obtain a discriminative latent space that represents the underlying shape distribution, which is modelled jointly by the encoder, the generator and the discriminator through their parameters and the latent vector.
Next, we freeze all the parameters of VAE-GAN, and connect the encoder and discriminator to the end of an untrained segmentation network. These two modules are only used to compute the shape embedding and discriminator losses as defined in Eq. 3 and 4, whilst their parameters will not be altered. Last, we train the segmentation network using the loss function Eq. 5.
Algorithm 1 shows the step-by-step training procedure of SCN. Its notation covers the segmentation network and its parameters, the encoder, and the generator.
All experiments were performed on the aforementioned eye dataset, which is further divided into separate train, validation, and test sets with the ratio of 8:1:1. The three subsets were constructed in a subject-independent manner such that images of the same subject (as extracted from the meta data) are always put into the same subset.
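A subject-independent split of this kind can be sketched as below. The function name, the input format (pairs of sample ID and subject ID) and the random seed are assumptions for illustration. Note that the 8:1:1 ratio is applied to subjects in this sketch, so the resulting sample counts follow the ratio only approximately when subjects contribute different numbers of images.

```python
import random
from collections import defaultdict


def subject_independent_split(samples, ratios=(8, 1, 1), seed=0):
    """Split (sample_id, subject_id) pairs so that no subject spans two subsets."""
    by_subject = defaultdict(list)
    for sample_id, subject_id in samples:
        by_subject[subject_id].append(sample_id)

    # Shuffle subjects, not samples, so that all images of a subject stay together.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    total = sum(ratios)
    train_end = len(subjects) * ratios[0] // total
    val_end = len(subjects) * (ratios[0] + ratios[1]) // total
    groups = {
        "train": subjects[:train_end],
        "val": subjects[train_end:val_end],
        "test": subjects[val_end:],
    }
    return {
        name: [sample for subject in members for sample in by_subject[subject]]
        for name, members in groups.items()
    }
```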
During the experiments, the mean IoU metric is used to evaluate segmentation accuracy on sclera (S-mIoU), iris (I-mIoU), and the combined foreground classes (Mean mIoU). To ensure a fair comparison, all methods under comparison were re-trained on the same training set as ours using their publicly available implementations. A paired t-test with Bonferroni correction was applied to all results to test whether the performance difference between our proposed approach and each compared method is statistically significant.
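The three reported scores can be computed along the following lines. This is a sketch under assumptions: the label encoding, the function names, the per-image averaging and the rule of skipping images in which a class is absent from both prediction and ground truth are not specified in the text. Taking Mean mIoU as the average of S-mIoU and I-mIoU is consistent with the values reported in the result tables.

```python
BACKGROUND, SCLERA, IRIS = 0, 1, 2  # assumed label encoding


def class_iou(predicted, target, label):
    """IoU of one class for a single pair of 2D label masks (lists of rows)."""
    intersection = union = 0
    for predicted_row, target_row in zip(predicted, target):
        for p, t in zip(predicted_row, target_row):
            p_hit, t_hit = p == label, t == label
            intersection += p_hit and t_hit
            union += p_hit or t_hit
    # None marks images where the class appears in neither mask.
    return intersection / union if union else None


def mean_iou_scores(predictions, targets):
    """Average per-image IoU for sclera and iris, plus their mean."""
    scores = {}
    for name, label in (("S-mIoU", SCLERA), ("I-mIoU", IRIS)):
        values = [class_iou(p, t, label) for p, t in zip(predictions, targets)]
        values = [v for v in values if v is not None]
        scores[name] = sum(values) / len(values) if values else float("nan")
    scores["Mean mIoU"] = (scores["S-mIoU"] + scores["I-mIoU"]) / 2.0
    return scores
```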
The shape regularisation network was trained for a fixed number of 100 epochs. For the segmentation network, early stopping was used to prevent over-fitting, with the number of tolerance steps set to 50. The weights of the two shape loss terms were both set to the same value.
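Early stopping with a tolerance of 50 can be sketched as follows. The helper name and its callbacks are hypothetical, and two details are assumptions: that tolerance steps are counted in epochs, and that the monitored quantity is a validation score where higher is better.

```python
def train_with_early_stopping(run_epoch, validate, patience=50, max_epochs=10000):
    """Train until the validation score has not improved for `patience` epochs.

    `run_epoch(epoch)` performs one training epoch and `validate(epoch)` returns
    a score where higher is better. Both callbacks are supplied by the caller.
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        run_epoch(epoch)
        score = validate(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_score
```

In practice the model weights from `best_epoch` would be checkpointed and restored after the loop ends; that bookkeeping is omitted here.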
An ablation study was performed to verify that both the shape embedding loss and the shape discriminator loss helped to significantly improve segmentation accuracy in terms of Mean mIoU. The results are shown in Table 2. As can be seen, adding shape embedding loss increased the Mean mIoU by 2%, while further adding the shape discriminator loss brought an additional 1.5% improvement.
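Table 2: Results of the ablation study.
| Method | S-mIoU | I-mIoU | Mean mIoU |
| --- | --- | --- | --- |
| SCN (full loss) | 71.86% | 86.18% | 79.02% |
| SCN (with and ) | 70.26% | 84.69% | 77.47% |
| SCN (only with ) | 68.42% | 84.20% | 76.31% |
| SCN (only with ) | 67.78% | 84.54% | 76.16% |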
Table 3: Comparison with other segmentation and landmark localisation methods.
| Method | S-mIoU | I-mIoU | Mean mIoU | Inference Time |
| --- | --- | --- | --- | --- |
| DeepLab V3+ | 69.78% | 85.46% | 77.62% | 0.041s |
| ERT (implementation: https://github.com/davisking/dlib) | 66.42% | 83.57% | 74.99% | 0.003s |
| DeepLab V2 | 63.41% | 82.01% | 72.71% | 0.110s |
| SDM (implementation: https://github.com/FengZhenhua/Supervised-Descent-Method) | 61.37% | 78.70% | 70.03% | 0.037s |
All compared methods were re-trained on the same training set during this experiment. The segmentation methods were trained and tested in the same setting as SCN. For the landmark localisation methods, the control points created during the annotation process were used as the training targets. During testing, we interpolated the predicted landmark positions (cubic spline for eyelids and ellipse for iris) to create the segmentation mask for comparison. The results of this experiment are shown in Table 3. SCN achieved a higher Mean mIoU than all other methods. Through a paired t-test with Bonferroni correction, we further found that the differences are all statistically significant (95% confidence). Visualisations of some random examples for the best-performing methods are shown in Figure 4. It can be clearly seen that SCN is quite robust and less likely to produce artifacts, which we attribute to the shape constraint.
In addition to accuracy, we also report the inference time of each method in Table 3. Although ERT  has the shortest inference time, it is less accurate than most deep methods. Among all deep methods, SCN runs the fastest (0.033s per image), achieving the same speed as that of SegNet . This is because the VAE-GAN is only used during training, thus does not incur additional computational cost during inference.
In this experiment, we investigate how a change of image resolution affects the segmentation performance of our method. Different from previous experiments, we ensure that the train set only contains high-resolution images (selected according to the area of the eye patch in pixels), while the test set only contains low-resolution images. The ratio is roughly 5:1. All samples are resized to a fixed size for training and testing. We compared with six state-of-the-art segmentation methods in this experiment, and the results are shown in Table 4. Here, S-mIoU and I-mIoU denote the intersection over union metric for sclera and iris, respectively. It is clear that SCN is consistently better than the other methods in S-mIoU and I-mIoU (at least 0.7% better in Mean mIoU), despite the fact that fewer details are present in the low-resolution images. This suggests that the shape prior knowledge learnt by VAE-GAN from only high-resolution data can also benefit low-resolution eye segmentation.
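Table 4: Results when training on high-resolution and testing on low-resolution images.
| Method | S-mIoU | I-mIoU | Mean mIoU |
| --- | --- | --- | --- |
| DeepLab V3+ | 61.59% | 78.54% | 70.07% |
| DeepLab V2 | 57.57% | 76.79% | 67.18% |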
In this paper, we aimed at solving the problem of low-resolution eye segmentation. First, we proposed an in-the-wild eye dataset that includes 8882 eye patches from frontal and profile faces, the majority of which are captured in low resolution. We collected a significant number of samples that exhibit occlusion, weak/strong illumination and glasses. Then, we developed the Shape Constrained Network (SCN), which employs SegNet as the back-end segmentation network, and we introduced shape prior into the training of SegNet by integrating the pre-trained encoder and discriminator from VAE-GAN. Based on the new training paradigm, we designed three new losses: Intersection-over-Union (IoU) loss, shape discriminator loss and shape embedding loss.
We demonstrated in ablation studies that adding shape prior information is beneficial when training the segmentation network. SCN outperformed several state-of-the-art segmentation methods as well as landmark alignment methods in subject-independent experiments. Last, we evaluated SCN’s performance on low-resolution images with a cross-resolution experiment in which the model is trained on high-resolution data and tested on low-resolution data. The results show that SCN generalises well across variations in image resolution.
Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in neural information processing systems, pages 402–408, 2001.
Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
2007 IEEE conference on computer vision and pattern recognition, pages 1–8. IEEE, 2007.