Real-time Pupil Tracking from Monocular Video for Digital Puppetry

06/19/2020 ∙ by Artsiom Ablavatski, et al. ∙ Google

We present a simple, real-time approach for pupil tracking from live video on mobile devices. Our method extends a state-of-the-art face mesh detector with two new components: a tiny neural network that predicts positions of the pupils in 2D, and a displacement-based estimation of the pupil blend shape coefficients. Our technique can be used to accurately control the pupil movements of a virtual puppet, and lends liveliness and energy to it. The proposed approach runs at over 50 FPS on modern phones, and enables its usage in any real-time puppeteering pipeline.


1 Introduction

The task of animating a virtual puppet in real time using live footage of a human is well studied. Broadly speaking, these techniques can be classified by their choice of input data (monocular video, multi-view, depth images) and their methodology (direct optimization, prediction using neural networks, heuristics). For instance, Ichim et al. [3] use monocular videos with predefined camera movements to obtain dense registration of person-specific facial features and create a dynamic face model on the fly via optimization. Wu et al. [10] leverage multi-view data to align a 3D Face Morphable Model (3DMM [2]) using a bundle of neural networks and produce person-specific blend shapes. We refer the reader to [9] for a review of related work on 3D face alignment and blend shape computation. In this paper, we focus on puppeteering on mobile devices, without the use of extra sensors or a person-specific calibration step.

Despite the success of these techniques, the resulting avatars tend to lack a certain liveliness or expressivity because they do not track the position of the pupils. For instance, the prior approach of [4] leverages a 3DMM that does not include pupils in its internal representation. We address this problem with a two-stage pipeline that combines a neural network for predicting the position of the pupils (Section 2) and a displacement-based algorithm for estimating the pupil blend shapes (Section 3). We build this pipeline on top of a state-of-the-art face mesh prediction model [4], but our approach generalizes to other face meshes.

For each eye, our network detects 5 points covering the pupil center and outer iris circle, as well as the eye contour. Based on the positions of these points, we apply carefully devised heuristics to obtain blend shape coefficients in the [-1, 1] range, where 1 and -1 represent full blend shape activation (e.g., the eye looks up or down, respectively) and 0 is the neutral position (the eye looks straight ahead). We follow this with post-processing to reduce jitter from the detection stage and make the final rendering smooth and appealing.

Figure 1: Final rendering of eye blend shape tracking on a virtual avatar. Left: the original image; right: the image with an overlaid avatar driven by the acquired blend shapes.

Our approach only requires a single frame at a time and does not rely on any additional sensors such as a depth camera. Figure 1 shows an example of a virtual puppet animated with our technique.

2 Neural network based eye landmarks

Figure 2: Overview of the pupil blend shapes acquisition. See text for details.

We start with a modern face mesh estimation pipeline that predicts a 468 vertex mesh for the human face [4]. We then compute the bounding boxes for eye regions and pass the corresponding cropped image regions to a smaller landmark regression network that produces additional higher quality landmarks.

Specifically, we extract the corresponding eye region by cropping around the eye center landmark of the face mesh estimator. This cropped region is fed into a tiny neural network with a structure similar to that described in Bazarevsky et al. [1]. This subsequent network predicts 21 locations in 2D (the pupil center, 4 points of the outer iris circle, and 16 points of the eye contour) in the coordinate system of the cropped image, with the origin at its upper left corner. We combine the corresponding landmarks (the 16 eye contour points) from the face estimation pipeline with those from the eye refinement network by replacing the x and y coordinates of the former while leaving their z coordinates untouched. We then extend the face mesh with 5 pupil landmarks per eye (the pupil center and the 4 points of the outer iris circle), with their z coordinate set to the average of the z coordinates of the eye corners. The final refined facial mesh contains 478 vertices (468 landmarks + 5 left pupil landmarks + 5 right pupil landmarks) and is used in the second stage of this pipeline.
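To make the merge step concrete, the following is a minimal NumPy sketch of how refined 2D eye landmarks could be folded back into the face mesh for one eye. The array shapes follow the counts above, but the landmark indices and the helper name are hypothetical placeholders rather than the actual mesh topology.

```python
import numpy as np

# Hypothetical landmark indices; the real face mesh uses different vertex ids.
LEFT_EYE_CONTOUR_IDS = np.arange(33, 49)   # 16 eye contour vertices (placeholder ids)
LEFT_EYE_CORNER_IDS = (33, 133)            # two eye corner vertices (placeholder ids)

def merge_eye_refinement(face_mesh, eye_contour_2d, pupil_2d,
                         contour_ids=LEFT_EYE_CONTOUR_IDS,
                         corner_ids=LEFT_EYE_CORNER_IDS):
    """face_mesh: (468, 3) x/y/z landmarks; eye_contour_2d: (16, 2); pupil_2d: (5, 2)."""
    refined = face_mesh.copy()
    # Replace x and y of the 16 eye contour vertices; keep their z untouched.
    refined[contour_ids, :2] = eye_contour_2d
    # Append the 5 pupil/iris vertices with z set to the mean z of the eye corners.
    z = face_mesh[list(corner_ids), 2].mean()
    pupil_3d = np.hstack([pupil_2d, np.full((5, 1), z)])
    return np.vstack([refined, pupil_3d])

# Applying the same merge for the right eye yields the 478-vertex refined mesh.
```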

To execute the neural network on mobile devices, we employ TensorFlow Lite with the GPU backend [6], coupled with MediaPipe [7], a framework for building perception pipelines.
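For reference, a model of this kind can be exercised from Python with the standard TensorFlow Lite Interpreter API, as in the sketch below; on-device deployment instead goes through the TFLite GPU delegate inside a MediaPipe graph. The model filename and input contents are assumptions.

```python
import numpy as np
import tensorflow as tf

# Hypothetical model file; the released MediaPipe graphs bundle their own .tflite assets.
interpreter = tf.lite.Interpreter(model_path="eye_landmark.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Assumed input: a normalized float32 RGB eye crop matching the model's input shape.
eye_crop = np.zeros(inp["shape"], dtype=np.float32)
interpreter.set_tensor(inp["index"], eye_crop)
interpreter.invoke()
landmarks = interpreter.get_tensor(out["index"])  # flat (x, y) coordinates per landmark
```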

2.1 Model architecture

The neural network for predicting eye and iris landmarks contains a number of bottlenecks, similar to the recent work of Tan et al. [8] and Bazarevsky et al. [1]. The model ends with a fully connected layer that outputs a tensor of 2D coordinates for each landmark (defined in the cropped image coordinate system). This design allows the network to learn a rich feature representation that achieves low per-landmark error rates (see Section 5). Further, the proposed model architecture has a small memory footprint (due to the small input resolution) and a low number of FLOPs (due to the compression and depthwise convolutions). This enables real-time performance of the network on CPU and super real-time performance on GPU on modern phones. The run-time measurements and error rates are shown in Table 1.
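The exact layer configuration is not spelled out here, so the Keras sketch below only illustrates the general shape of such a network: a stack of depthwise-separable bottlenecks followed by a fully connected regression head. The input size, channel widths, block count, and the 21-landmark output are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck(x, filters, stride=1):
    # Depthwise-separable bottleneck, loosely in the spirit of MnasNet/BlazeFace blocks.
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 1, activation="relu")(x)

def build_eye_landmark_net(input_size=64, num_landmarks=21):
    # Assumed: a 64x64 RGB eye crop in, 2D coordinates for 21 landmarks out.
    inputs = tf.keras.Input((input_size, input_size, 3))
    x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inputs)
    for filters, stride in [(32, 2), (64, 2), (64, 1), (128, 2), (128, 1)]:
        x = bottleneck(x, filters, stride)
    x = layers.Flatten()(x)
    outputs = layers.Dense(num_landmarks * 2)(x)   # (x, y) per landmark in crop coordinates
    return tf.keras.Model(inputs, outputs)

model = build_eye_landmark_net()
```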

3 Displacement-based pupil blend shape estimation

We use the refined mesh to predict 4 blend shapes for the pupils: pupils pointing outwards, inwards, upwards, and downwards, respectively. We compute the activation of these shapes using a simple yet powerful displacement-based approach. Specifically, for every blend shape we choose a pair of vertices on the refined mesh that robustly captures that blend shape; e.g., for the pupil pointing inwards, we use the pupil vertex and the eye corner vertex. Next, we measure the displacement between these two vertices and compare it to two empirically derived displacements: d_min, the displacement at minimum activation of the blend shape, and d_max, the displacement measured at maximum activation of the blend shape. Based on this comparison, we obtain a scalar value in the range [0, 1] for each pupil blend shape.
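A minimal sketch of this normalization, assuming NumPy and placeholder values for the vertex pair and the empirically derived d_min and d_max:

```python
import numpy as np

def blend_shape_activation(mesh, vertex_pair, d_min, d_max):
    """Map the displacement between two refined-mesh vertices to a [0, 1] activation."""
    a, b = vertex_pair
    d = np.linalg.norm(mesh[a, :2] - mesh[b, :2])  # e.g. pupil center vs. eye corner
    # Linear interpolation between the empirically derived extremes, clamped to [0, 1].
    return float(np.clip((d - d_min) / (d_max - d_min), 0.0, 1.0))
```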

Next, we merge pairs of opposite blend shapes into two aggregate blend shapes. Finally, we apply smoothing and couple the estimated blend shape values between both eyes.
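One plausible way to realize the merge, coupling, and smoothing steps is sketched below; the sign convention, the equal-weight coupling of the two eyes, and the smoothing factor are assumptions rather than values from the paper.

```python
def merge_and_smooth(act_up, act_down, prev_value, other_eye=None, alpha=0.5):
    """Collapse a pair of opposite activations into one signed coefficient in [-1, 1]."""
    value = act_up - act_down            # 1: fully up, -1: fully down, 0: neutral
    if other_eye is not None:            # couple both eyes by averaging (assumed)
        value = 0.5 * (value + other_eye)
    # Exponential smoothing against frame-to-frame jitter from the detection stage.
    return alpha * value + (1.0 - alpha) * prev_value
```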

An overview of the pipeline is presented in Figure 2.

3.1 Real-time heuristics calibration

Heuristics play a vital part in the proposed blend shape pipeline. The algorithm described in Section 3 requires the two displacements d_min and d_max to be defined. The initial displacements are empirically estimated on a representative face mesh dataset. However, these initial values cannot model all person-specific variations. For a visual reference, Figure 3 shows the variation in displacements over time for different subjects (drawn in different colors) and the initial estimate (drawn in black). The solid lines represent measurements of the actual displacements on a per-frame basis, while the dotted lines indicate the smoothed trend for each subject.

Figure 3: Variation of displacements over time for 3 subjects (red, green, blue) and the initial estimate (black). Vertical axis: displacement values from the 25th to the 75th percentile. Horizontal axis: frame index.

To address the challenge of person-specific displacements and to make the system reliable, we propose to enhance the displacement estimation with a real-time calibration step. We employ the standard score [11] calculation with a few modifications. The main idea of the filter is to check the displacement on every iteration and add it to a circular buffer of trusted displacements if it falls within the specified confidence interval. The calibrated displacement is then calculated as the average of the trusted displacements, and the standard deviation of these trusted displacements is used as the confidence interval in the next iteration. Details of the Standard Score Filter algorithm are presented in Algorithm 1.

Algorithm 1: Standard Score Filter
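Since only the caption of Algorithm 1 survives above, the following is a minimal Python reconstruction based solely on the description in this subsection; the buffer size, the z-score threshold, and the bootstrapping behavior are assumptions rather than the authors' exact parameters.

```python
from collections import deque
import statistics

class StandardScoreFilter:
    """Real-time calibration of a person-specific displacement (reconstructed sketch)."""

    def __init__(self, initial_displacement, buffer_size=100, z_threshold=2.0):
        self.trusted = deque([initial_displacement], maxlen=buffer_size)  # circular buffer
        self.z_threshold = z_threshold

    def update(self, displacement):
        mean = statistics.fmean(self.trusted)
        std = statistics.pstdev(self.trusted) if len(self.trusted) > 1 else 0.0
        # Trust the new displacement only if it falls within the confidence interval.
        if std == 0.0 or abs(displacement - mean) <= self.z_threshold * std:
            self.trusted.append(displacement)
        # The calibrated displacement is the average of the trusted displacements; the
        # standard deviation of the buffer defines the confidence interval next time.
        return statistics.fmean(self.trusted)
```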

4 Datasets and training

In order to train the neural network to infer 2D positions of points around the eye, we use manually annotated images from a globally sourced dataset. We apply a set of augmentations to these images, such as affine transformations (rotation, flip) and color transformations (hue, saturation, non-linear mapping, realistic camera noise injection). The network is trained for 250 epochs using the Adam optimizer [5]. Similar to [4], we use the Mean Squared Distance normalized by the Inter-Eye Distance (MSE IED) as the loss function. This normalization avoids factoring in the scale of the eyes.
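A sketch of such a loss term is shown below, assuming landmark tensors of shape (batch, num_points, 2) and hypothetical indices for the two reference points used to measure the inter-eye distance; the exact normalization used by the authors may differ.

```python
import tensorflow as tf

LEFT_EYE_IDX, RIGHT_EYE_IDX = 0, 8   # hypothetical indices of the two eye-center landmarks

def mse_ied_loss(y_true, y_pred):
    """Mean squared landmark distance normalized by the inter-eye distance."""
    sq_dist = tf.reduce_sum(tf.square(y_true - y_pred), axis=-1)                 # (batch, points)
    ied = tf.norm(y_true[:, LEFT_EYE_IDX] - y_true[:, RIGHT_EYE_IDX], axis=-1)   # (batch,)
    # Squaring the IED keeps the ratio dimensionless; the paper's exact form may differ.
    return tf.reduce_mean(sq_dist / (tf.square(ied)[:, None] + 1e-6))
```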

5 Results

To quantitatively estimate the accuracy of the trained model, we use manually annotated images. The trained model achieves 7.16% MAD IED (Mean Absolute Distance normalized by the Inter-Eye Distance) on the collected dataset. The baseline error of the manual annotation is 5.73% for simple cases (the face in the image is frontal) and 7.04% for hard cases. This baseline error was measured on the same images annotated by different subjects. The inference speeds of the face mesh and eye refinement models on a number of phones are shown in Table 1.

Phone         Face mesh, GPU (ms)   Eye refinement, CPU (ms)   Eye refinement, GPU (ms)
Pixel XL      14                    16                         12
Pixel 2 XL    12                    20                         8
Pixel 3 XL    10                    12                         5
Samsung S9    10                    12                         5
iPhone X      4                     7                          2.6
Table 1: Inference speeds (ms) of the face mesh and eye refinement models on a number of phones.

6 Conclusion

We present a novel pipeline for real-time pupil tracking from live video on mobile devices. The approach defines a full end-to-end pipeline for pupil blend shape estimation from monocular images without any pre-calibration and can be combined with any existing blend shape implementation. It can be used as an out-of-the-box solution for accurate control of the eye movements of a virtual puppet.

References

  • [1] V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran, and M. Grundmann (2019) BlazeFace: sub-millisecond neural face detection on mobile GPUs. arXiv preprint arXiv:1907.05047.
  • [2] V. Blanz, T. Vetter, et al. (1999) A morphable model for the synthesis of 3D faces. In SIGGRAPH, Vol. 99, pp. 187–194.
  • [3] A. E. Ichim, S. Bouaziz, and M. Pauly (2015) Dynamic 3D avatar creation from hand-held video input. ACM Transactions on Graphics (ToG) 34 (4), pp. 45.
  • [4] Y. Kartynnik, A. Ablavatski, I. Grishchenko, and M. Grundmann (2019) Real-time facial surface geometry from monocular video on mobile GPUs. arXiv preprint arXiv:1907.06724.
  • [5] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [6] J. Lee, N. Chirkov, E. Ignasheva, Y. Pisarchyk, M. Shieh, F. Riccardi, R. Sarokin, A. Kulik, and M. Grundmann (2019) On-device neural net inference with mobile GPUs. arXiv preprint arXiv:1907.01989.
  • [7] C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C. Chang, M. G. Yong, J. Lee, et al. (2019) MediaPipe: a framework for building perception pipelines. arXiv preprint arXiv:1906.08172.
  • [8] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) MnasNet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828.
  • [9] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2Face: real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395.
  • [10] F. Wu, L. Bao, Y. Chen, Y. Ling, Y. Song, S. Li, K. N. Ngan, and W. Liu (2019) MVF-Net: multi-view 3D face morphable model regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 959–968.
  • [11] R. K. Yin (2017) Case study research and applications: design and methods. Sage Publications.