RITnet: Real-time Semantic Segmentation of the Eye for Gaze Tracking

10/01/2019 ∙ by Aayush K. Chaudhary, et al.

Accurate eye segmentation can improve eye-gaze estimation and support interactive computing based on visual attention; however, existing eye segmentation methods suffer from issues such as person-dependent accuracy, lack of robustness, and an inability to run in real time. Here, we present the RITnet model, a deep neural network that combines U-Net and DenseNet. RITnet is under 1 MB and achieves 95.3% accuracy on the 2019 OpenEDS Semantic Segmentation challenge. Using a GeForce GTX 1080 Ti, RITnet tracks at > 300 Hz, enabling real-time gaze tracking applications. Pre-trained models and source code are available at https://bitbucket.org/eye-ush/ritnet/.




1 Introduction

Robust, accurate, and efficient gaze estimation is required to support a number of critical applications such as foveated rendering, human-machine and human-environment interactions, as well as inter-saccadic manipulations, such as redirected walking [Sun2018TowardsWalking]. Recent non-intrusive, video-based eye-tracking methods involve localization of eye features such as the pupil [Kassner2014Pupil:Interaction] and/or iris [Wood2014EyeTab:Computers]. These features are then regressed onto some meaningful representation of an individual's gaze. Convolutional neural networks (CNNs) have demonstrated high accuracy [Kim2019NVGaze:Estimation, Wu2019EyeNet:Understanding], robustness in unconstrained lighting conditions [B2019500000Segmentation], and an ability to generalize under low-resolution constraints [Park2019Few-shotEstimation, Park2018DeepEstimation].

Figure 1: Comparison of model performance on difficult samples in the OpenEDS test-set. Top-row left to right shows eyes obstructed due to prescription glasses, heavy mascara, dim light and partial eyelid closure. Rows from top to bottom show input test images, ground truth labels, predictions from mSegNet w/BR [Garbin2019OpenEDS:Dataset] and predictions from RITnet, respectively.

In an effort to engage the machine learning and eye-tracking communities in the field of eye-tracking for head-mounted displays (HMD), Facebook Reality Labs issued the Open Eye Dataset (OpenEDS) Semantic Segmentation challenge, which addresses part of the gaze estimation pipeline: identifying different regions of interest (e.g., pupil, iris, sclera, skin) in close-up images of the eye. Such semantic segmentation of these regions enables the extraction of region-specific features (e.g., iridial feature tracking [Chaudhary2019MotionMovements]) and mathematical models which summarize the region structures (e.g., iris ellipse [Wood2014EyeTab:Computers, B2019500000Segmentation, Park2018DeepEstimation], or pupil ellipse [Kassner2014Pupil:Interaction]) used to derive a measure of gaze orientation.

The major contributions of this paper are as follows:

  1. We present RITnet, a semantic segmentation architecture that obtains state-of-the-art results on the 2019 OpenEDS Semantic Segmentation Challenge with a model size of only 0.98 MB. Our model performs segmentation at 301 Hz for 640x400 images on an NVIDIA 1080Ti GPU.

  2. We propose domain-specific augmentation schemes which help in generalization under a variety of challenging conditions.

  3. We present boundary-aware loss functions with a loss scheduling strategy to train deep semantic segmentation models. This helps in producing coherent regions with crisp region boundaries.

Figure 2: Architecture details of RITnet. DB refers to Down-Block, UB refers to Up-Block, and BN stands for batch normalization. Similarly, m refers to the number of input channels (m = 1 for a grayscale image), c refers to the number of output labels, and p refers to the number of model parameters. Dashed lines denote the skip connections from the corresponding Down-Blocks. All of the Blocks output tensors of channel size 32.


2 Previous Works

Recently developed solutions for end-to-end segmentation use deep CNNs to produce a labeled output irrespective of the size of the input image. Such architectures consist of convolution layers with a series of downsampling layers followed by progressive upsampling layers. Downsampling operations strip away finer information that is crucial for accurate pixel-level semantic masks. This limitation was mitigated by Ronneberger et al., who introduced skip connections between the encoder and decoder [Ronneberger2015U-net:Segmentation]. Jégou et al. proposed TiramisuNet [Jegou2017TheSegmentation], a progression of dense blocks [Huang2017DenselyNetworks] with skip connections between the up- and down-sampling pathways. TiramisuNet demonstrated reuse of previously computed feature maps to minimize the required number of parameters. Dangi et al. proposed the DenseUNet-K architecture [Dangi2019] for image-to-image translation based on simplified densely connected feature maps with skip connections. The RITnet model presented in this paper is based on the DenseUNet-K architecture.


3 Proposed Model: RITnet

Recently, segmentation models based on Fully Convolutional Networks (FCN) have performed well across many datasets [Jegou2017TheSegmentation, Ronneberger2015U-net:Segmentation]. That success, however, often comes at the cost of computational complexity, restricting their feasibility for real-time applications where rapid computation and robustness to illumination conditions are paramount [Garbin2019OpenEDS:Dataset]. In contrast, RITnet has 248,900 trainable parameters, which require less than 1 MB of storage at 32-bit precision (see Figure 2), and has been benchmarked at over 300 Hz.
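The storage claim above can be checked with a few lines of arithmetic. The sketch below assumes that every trainable parameter is stored as one 4-byte float and that file-format overhead is ignored; the variable names are illustrative only.

```python
# Storage needed for the trainable parameters alone, assuming each one is
# stored as a 32-bit (4-byte) float and ignoring any file-format overhead.
NUM_PARAMS = 248_900      # trainable parameters reported for RITnet
BYTES_PER_PARAM = 4       # 32-bit precision

size_bytes = NUM_PARAMS * BYTES_PER_PARAM
size_mb = size_bytes / 1e6        # decimal megabytes
size_mib = size_bytes / 2**20     # binary mebibytes

print(f"{size_bytes} bytes = {size_mb:.3f} MB = {size_mib:.3f} MiB")
```

Under either definition of a megabyte the result stays below 1, which is consistent with the stated footprint.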

RITnet has five Down-Blocks and four Up-Blocks, which downsample and upsample the input, respectively. The last Down-Block, also referred to as the bottleneck layer, condenses the overall information into a small tensor at a fraction of the input resolution. Each Down-Block consists of five convolution layers with LeakyReLU activation. All convolution layers receive connections from previous layers, in the style of DenseNet [Huang2017DenselyNetworks]. We maintain a constant channel size as in DenseUNet-K [Dangi2019], with K=32 channels, to reduce the number of parameters. All Down-Blocks are followed by an average pooling layer of size 2x2. The Up-Block layer upsamples its input by a factor of two using the nearest neighbor approach. Each Up-Block consists of four convolution layers with LeakyReLU activation. All Up-Blocks receive extra information from their corresponding Down-Block via skip connections, an effective strategy which provides the model with representations of varying spatial granularity.
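The two resampling operations named above can be illustrated on a toy feature map. This is a minimal, dependency-free sketch of 2x2 average pooling and nearest-neighbor upsampling on a 2D list; the function names are hypothetical and the code is not taken from the RITnet implementation, which operates on multi-channel tensors.

```python
def avg_pool_2x2(img):
    """2x2 average pooling with stride 2 on a 2D list (even height and width)."""
    return [
        [
            (img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
            for c in range(0, len(img[0]), 2)
        ]
        for r in range(0, len(img), 2)
    ]


def upsample_nearest_2x(img):
    """Nearest-neighbor upsampling by two: each value becomes a 2x2 block."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


feature_map = [
    [1.0, 3.0, 5.0, 7.0],
    [1.0, 3.0, 5.0, 7.0],
    [2.0, 2.0, 8.0, 8.0],
    [4.0, 4.0, 0.0, 0.0],
]
pooled = avg_pool_2x2(feature_map)        # 4x4 -> 2x2
restored = upsample_nearest_2x(pooled)    # 2x2 -> 4x4, blocky copy
```

The restored map has the original size but only block-constant values, which is why the skip connections from the Down-Blocks are needed to recover fine detail.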

3.1 Loss functions

Each pixel is classified into one of four semantic categories: background, iris, sclera, or pupil. Standard cross-entropy loss (CEL) is the default choice for applications with a balanced class distribution. However, the class distribution in eye images is imbalanced, with the fewest pixels representing pupil regions. While CEL aims to maximize the output probability at a pixel location, it remains agnostic to the structure inherent to eye images. To mitigate these issues, we implemented the following loss functions:

Generalized Dice Loss (GDL): The Dice score coefficient measures the overlap between the ground truth pixels and their predicted values. In cases of class imbalance [Milletari2016V-Net:Segmentation], weighting the Dice score by the squared inverse of class frequency [Sudre2017GeneralisedSegmentations] showed increased performance when combined with CEL.
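A minimal sketch of the generalized Dice loss, following the formulation in [Sudre2017GeneralisedSegmentations], is given below. It works on flat per-class lists rather than tensors, and the function name, argument layout, and `eps` smoothing term are assumptions made for illustration.

```python
def generalized_dice_loss(probs, onehot, eps=1e-8):
    """probs[c][i]: predicted probability of class c at pixel i.
    onehot[c][i]: 1.0 if pixel i belongs to class c in the ground truth, else 0.0.
    """
    numerator = 0.0
    denominator = 0.0
    for p_c, r_c in zip(probs, onehot):
        # Weight each class by the squared inverse of its pixel count, so
        # that small regions such as the pupil are not swamped by large ones.
        w_c = 1.0 / (sum(r_c) ** 2 + eps)
        numerator += w_c * sum(p * r for p, r in zip(p_c, r_c))
        denominator += w_c * sum(p + r for p, r in zip(p_c, r_c))
    return 1.0 - 2.0 * numerator / (denominator + eps)


# Toy example: five pixels of a frequent class, one pixel of a rare class.
truth = [[1.0, 1.0, 1.0, 1.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
perfect = [row[:] for row in truth]
misses_rare = [[1.0] * 6, [0.0] * 6]   # predicts the frequent class everywhere

loss_perfect = generalized_dice_loss(perfect, truth)
loss_miss = generalized_dice_loss(misses_rare, truth)
```

In this toy case, missing a single rare-class pixel produces a large loss even though five of six pixels are labeled correctly, which is the behavior the class weighting is meant to provide.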

Boundary Aware Loss (BAL): Semantic boundaries separate regions based on class labels. Weighting the loss for each pixel by its distance to the two nearest segments introduces edge awareness [Ronneberger2015U-net:Segmentation]. We generate boundary pixels using a Canny edge detector and dilate them by two pixels to minimize confusion at the boundary. We use these edges to mask the CEL.
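The boundary masking step can be sketched as follows. Two simplifications are assumed here: label changes between 4-connected neighbors stand in for the Canny edge detector, and the mask is applied by averaging the cross-entropy over masked pixels only. Neither detail is confirmed by the text, and all names are hypothetical.

```python
import math


def boundary_mask(labels, dilation=2):
    """Mark pixels within `dilation` pixels of a change in class label.

    Label changes between 4-connected neighbours stand in for the Canny
    edge detector named in the text (an assumption, to avoid dependencies).
    """
    h, w = len(labels), len(labels[0])
    edges = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and labels[rr][cc] != labels[r][c]:
                    edges[r][c] = True
    # Dilate every edge pixel with a square window of radius `dilation`.
    mask = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if edges[r][c]:
                for rr in range(max(0, r - dilation), min(h, r + dilation + 1)):
                    for cc in range(max(0, c - dilation), min(w, c + dilation + 1)):
                        mask[rr][cc] = True
    return mask


def masked_cross_entropy(true_class_prob, mask, eps=1e-12):
    """Mean of -log p(true class) over the masked pixels only."""
    losses = [
        -math.log(true_class_prob[r][c] + eps)
        for r in range(len(mask))
        for c in range(len(mask[0]))
        if mask[r][c]
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Toy label map: left half is class 0, right half is class 1.
labels = [[0] * 6 + [1] * 6 for _ in range(3)]
mask = boundary_mask(labels)
uniform_probs = [[0.5] * 12 for _ in range(3)]
edge_loss = masked_cross_entropy(uniform_probs, mask)
```

Only the band of pixels around the class transition contributes to `edge_loss`, so errors near boundaries are emphasized relative to errors deep inside a region.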

Surface Loss (SL): SL is based on a distance metric in the space of image contours which preserves small, infrequent structures of high semantic value [Kervadec2018BoundarySegmentation]. BAL attempts to maximize the correct pixel probabilities near boundaries while GDL provides stable gradients for imbalanced conditions. Contrary to both, SL scales the loss at each pixel based on its distance from the ground truth boundary for each class. It is effective in recovering smaller regions which are ignored by region based losses [Kervadec2018BoundarySegmentation].
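The distance-based scaling described above can be illustrated with a simplified, single-class version of the boundary loss in [Kervadec2018BoundarySegmentation]. The sketch assumes a signed distance that is negative inside the ground truth region and positive outside, computed by brute force on a toy grid; the function names and the sign convention are illustrative choices rather than the authors' implementation.

```python
import math


def signed_distance_map(region):
    """Signed distance to the region boundary: negative inside, positive outside.

    Brute force, quadratic in the number of pixels; suitable for toy grids only.
    """
    h, w = len(region), len(region[0])
    inside = [(r, c) for r in range(h) for c in range(w) if region[r][c]]
    outside = [(r, c) for r in range(h) for c in range(w) if not region[r][c]]
    dist = [[0.0] * w for _ in range(h)]
    if not inside or not outside:
        return dist  # no boundary exists for this class
    for r in range(h):
        for c in range(w):
            others = outside if region[r][c] else inside
            d = min(math.hypot(r - rr, c - cc) for rr, cc in others)
            dist[r][c] = -d if region[r][c] else d
    return dist


def surface_loss(prob, region):
    """Mean over pixels of predicted probability times signed distance."""
    dist = signed_distance_map(region)
    h, w = len(region), len(region[0])
    return sum(prob[r][c] * dist[r][c] for r in range(h) for c in range(w)) / (h * w)


# One-row toy example: the true region covers the three middle pixels.
region = [[False, True, True, True, False]]
perfect = [[0.0, 1.0, 1.0, 1.0, 0.0]]
shifted = [[1.0, 1.0, 0.0, 0.0, 0.0]]   # prediction displaced to the left

loss_perfect = surface_loss(perfect, region)
loss_shifted = surface_loss(shifted, region)
```

Probability mass placed outside the true region is penalized in proportion to its distance from the boundary, so a displaced or stray prediction scores worse than one that fills the region.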

The total loss is given by a weighted combination of these losses.

4 Experimental Details

4.1 Dataset and Evaluation

We train and evaluate our model on the OpenEDS Semantic Segmentation dataset [Garbin2019OpenEDS:Dataset], consisting of 12,759 images split into train (8,916), validation (2,403) and test (1,440) subsets. Each image had been hand-annotated with four semantic labels: background, sclera, pupil, and iris.

Per the OpenEDS challenge guidelines, the overall score averages the mean Intersection over Union (mIoU) across all classes with a model-size term, where the model size S is computed from the number of trainable parameters and expressed in megabytes (MB). The overall score is given as $\text{Overall Score} = \frac{1}{2}\left(\text{mIoU} + \min\left(\frac{1}{S}, 1\right)\right)$.
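The metric can be sketched in a few lines. The form of the overall score used below, the average of mIoU and min(1/S, 1), is an assumption that reproduces the Overall Score column of Table 1 to within rounding; the function names and the choice to skip classes absent from both prediction and ground truth are illustrative.

```python
def mean_iou(pred, target, num_classes=4):
    """Mean Intersection over Union across classes for flat label lists."""
    ious = []
    for k in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        union = sum(1 for p, t in zip(pred, target) if p == k or t == k)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)


def overall_score(miou, size_mb):
    """Assumed challenge score: average of mIoU and min(1/S, 1)."""
    return 0.5 * (miou + min(1.0 / size_mb, 1.0))


# Six-pixel toy example with one mislabeled pixel.
toy_miou = mean_iou([0, 0, 1, 1, 2, 3], [0, 1, 1, 1, 2, 3])

# Values taken from Table 1 (mIoU as a fraction, model size in MB).
score_ritnet = overall_score(0.953, 0.98)
score_msegnet = overall_score(0.907, 13.3)
```

Because the size term saturates at 1 for any model under 1 MB, shrinking RITnet further would not raise its score; only mIoU gains would.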

4.2 Training

We trained our model using Adam [Kingma2014Adam:Optimization] with a learning rate of 0.001 and a batch size of 8 images for 175 epochs on a TITAN 1080 Ti GPU. We reduced the learning rate by a factor of 10 when the validation loss plateaued for more than 5 epochs. The selected model with the best validation score was found at the epoch. In our experiments, we used and , where for epoch125 otherwise 0. This loss scheduling scheme gives prominence to GDL during initial iterations until a steady state is achieved, following which SL begins penalizing stray patches.
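The learning-rate rule described above (divide by 10 once the validation loss has failed to improve for more than 5 epochs) can be sketched as a small class. This is a hypothetical stand-in written for illustration; the text does not say which scheduler implementation was used, and the counter reset after each reduction is an assumption.

```python
class PlateauLR:
    """Divide the learning rate by `factor` after more than `patience`
    consecutive epochs without a new best validation loss."""

    def __init__(self, lr=1e-3, factor=10.0, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr /= self.factor
                self.bad_epochs = 0   # assumed: start counting afresh
        return self.lr


scheduler = PlateauLR()
history = []
# Three improving epochs followed by six stalled ones.
for val_loss in [1.0, 0.8, 0.7] + [0.7] * 6:
    history.append(scheduler.step(val_loss))
```

In PyTorch, `torch.optim.lr_scheduler.ReduceLROnPlateau` with `factor=0.1` and `patience=5` offers comparable behavior.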

4.3 Data Pre-processing

To accommodate variation in individual reflectance properties (e.g., iris pigmentation, eye makeup, skin tone or eyelids/eyelashes) [Garbin2019OpenEDS:Dataset] and HMD-specific illumination (the position of infrared LEDs with respect to the eye), we performed two pre-processing steps. These steps were motivated by the differences between the train, validation and test distributions of mean image brightness (Figure 11 in Garbin et al. [Garbin2019OpenEDS:Dataset]). Pre-processing reduced these differences and also increased the separability of certain eye features. First, a fixed gamma correction with an exponent of 0.8 was applied to all input images. Second, we applied local Contrast Limited Adaptive Histogram Equalization (CLAHE) with a grid size of 8x8 and a clip limit value of 1.5 [Zuiderveld1994ContrastEqualization]. Figure 3 shows an image before and after pre-processing.

Figure 3: Left to right: Original image, image after gamma correction, image after CLAHE is applied. Note that in the rightmost image, it is comparatively easier to distinguish iris and pupil.
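The first pre-processing step can be sketched with a lookup table. The code below assumes 8-bit grayscale input and the common convention out = 255 * (in / 255)^0.8; the names are illustrative. The CLAHE step is omitted to keep the sketch dependency-free; with OpenCV it would typically be configured through `cv2.createCLAHE(clipLimit=1.5, tileGridSize=(8, 8))`, although the text does not state which library was used.

```python
GAMMA = 0.8  # exponent stated in the text

# Lookup table mapping each 8-bit intensity to its gamma-corrected value.
# An exponent below 1 brightens mid-tones while leaving 0 and 255 fixed.
GAMMA_LUT = [round(255.0 * (v / 255.0) ** GAMMA) for v in range(256)]


def gamma_correct(image):
    """Apply the lookup table to a 2D list of 8-bit grayscale values."""
    return [[GAMMA_LUT[v] for v in row] for row in image]


dark_patch = [[0, 32, 64], [128, 192, 255]]
corrected = gamma_correct(dark_patch)
```

Dark and mid-range intensities are lifted the most, which is in line with the goal of reducing brightness differences between the data splits.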

To increase the robustness of the model to variations in image properties, training data was augmented with the following modifications:

  • Reflection about the vertical axis.

  • Gaussian blur with a fixed kernel size of 7x7 and standard deviation


  • Image translation of 0-20 pixels in both axes.

  • Image corruption using 2-9 thin lines drawn around a random center ()

  • Image corruption with a structured starburst pattern (Figure 4) to reduce segmentation errors caused by reflections from the IR illuminators on eyeglasses. Note that the starburst image is translated by 0-40 pixels in both directions.

Each image received at least one of the above-mentioned augmentations with a probability of 0.2 on each iteration. The probability that an image would be flipped horizontally was 0.5.
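Two of the geometric augmentations listed above can be sketched as follows. The code assumes one reading of the stated probabilities (flip with probability 0.5, translation with probability 0.2), non-negative shifts, and a fill value of 0 for exposed pixels; these choices and all names are illustrative. The image and its label map receive identical transformations so that they stay aligned.

```python
import random


def hflip(grid):
    """Reflect a 2D list about the vertical axis (left-right flip)."""
    return [row[::-1] for row in grid]


def translate(grid, dx, dy, fill=0):
    """Shift a 2D list by (dx, dy) pixels, padding exposed areas with `fill`."""
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rr, cc = r + dy, c + dx
            if 0 <= rr < h and 0 <= cc < w:
                out[rr][cc] = grid[r][c]
    return out


def augment(image, label, rng, p_flip=0.5, p_shift=0.2, max_shift=20):
    """Apply identical geometric changes to an image and its label map."""
    if rng.random() < p_flip:
        image, label = hflip(image), hflip(label)
    if rng.random() < p_shift:
        dx, dy = rng.randint(0, max_shift), rng.randint(0, max_shift)
        image, label = translate(image, dx, dy), translate(label, dx, dy)
    return image, label


rng = random.Random(0)  # seeded for repeatability
image = [[10, 20, 30], [40, 50, 60]]
label = [[0, 1, 1], [0, 2, 1]]
aug_image, aug_label = augment(image, label, rng)
```

Photometric augmentations such as blur or line corruption would be applied to the image only, since they do not move any class boundary.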

Figure 4: Generation of a starburst pattern from the training image 000000240768. Left to right: original image, selected reflections, concatenation with its 180° rotation, final pattern mask (best viewed in color).

5 Results

We compare our results against SegNet [Garbin2019OpenEDS:Dataset], another fully convolutional encoder-decoder architecture. mSegNet refers to the modified SegNet with four layers of encoder and decoder. mSegNet w/BR refers to mSegNet with Boundary Refinement as a residual structure, and mSegNet w/SC is a lightweight mSegNet with depthwise Separable Convolutions [Garbin2019OpenEDS:Dataset]. As shown in Table 1, our model improves mIoU by roughly 6 percentage points (95.3% vs. 89.5%) while reducing complexity by 38% compared to the baseline model mSegNet w/SC. Figure 1 demonstrates that our model generalizes to some challenging cases where other models fail to produce coherent results. However, our model's segmentation quality was impacted at higher values of motion blur and image defocus (Figure 5).

Model          Mean F1 (%)  mIoU (%)  Model size S (MB)  No. of parameters (million)  Overall score
mSegNet*       97.9         90.7      13.3               3.5                          0.491
mSegNet w/BR*  98.3         91.4      13.3               3.5                          0.495
mSegNet w/SC*  97.4         89.5      1.6                0.4                          0.762
Ours           99.3         95.3      0.98               0.25                         0.976

Table 1: Performance comparison on the test split of the OpenEDS dataset. The metrics and comparison models (*) are used as reported in [Garbin2019OpenEDS:Dataset].
Figure 5: Our model struggles to do an accurate segmentation when eye masks are heavily blurred or defocused.

6 Discussion

Our model achieves state-of-the-art performance with a small model footprint. The final architecture was arrived at after exploring a number of architectural variations. Reducing the channel size from 32 to 24 and increasing the number of convolution layers in the Down-Block did not affect the results. Surprisingly, increasing the channel size to 40 and removing one convolutional layer in the Down-Block degraded performance, resulting in spurious patches in output regions. Performance was influenced by the choice of loss functions and the adjustment of their relative weights. By setting the boundary-aware loss at a relatively higher weight, we observed sharp boundary edges and consequently improved our test mIoU from 94.8% to 95.3%.

We speculate that some aspects of our model were successful because they accounted for labeling artifacts in the OpenEDS dataset. For example, although pupil-to-iris boundaries were defined using ellipse fits to multiple points selected on the boundaries [Garbin2019OpenEDS:Dataset], sclera-to-eyelid boundaries were created using a linear fit between adjacent points marked on the eyelids. It is perhaps for this reason that nearest-neighbor interpolation outperformed bilinear interpolation in the process of upsampling. Although the smoother curves produced by bilinear interpolation yielded more accurate detection of the iris and pupil, they yielded less accurate segmentation of the sclera.

Finally, data pre-processing had a significant impact on model performance. Introduction of CLAHE and gamma correction resulted in an overall improvement of 0.2% in the validation mIoU score. Augmentation helped in noisy cases such as reflections from eyeglasses, varying contrast, eye makeup, and other image distortions.

7 Conclusion

We designed a computationally efficient model for the segmentation of eye images. We also presented methods for implementing multiple loss functions that can tackle class imbalance and ensure crisp semantic boundaries. We showed several methods for incorporating pre-processing and augmentation techniques that can help mitigate image distortions. RITnet attained 95.3% mIoU on the OpenEDS test set with a model size under 1 MB and runs at 301 Hz on an NVIDIA 1080Ti.


We thank Anjali Jogeshwar, Kishan KC, Zhizhuo Yang, and Sanketh Moudgalya for providing valuable input and feedback. We would also like to thank the Research Computing group at RIT for providing access to GPU clusters.