Research on head-mounted eye tracking has traditionally focused on estimating gaze in screen coordinate space, e.g., that of a public display. Estimating gaze in scene or world coordinates enables gaze analysis on 3D objects and scenes and has the potential for new applications, such as real-world attention analysis [Bulling 2016]. This approach requires two key components: 3D scene reconstruction and 3D gaze estimation.
In prior work, 3D gaze estimation was approximated by projecting estimated 2D gaze positions in the scene camera image onto the corresponding 3D scene [Munn and Pelz 2008, Takemura et al. 2014, Pfeiffer and Renner 2014]. However, without proper 3D gaze estimation, gaze mapping suffers from parallax error caused by the offset between the scene camera origin and the eyeball position [Mardanbegi and Hansen 2012, Duchowski et al. 2014]. To fully utilize the 3D scene information, it is essential to estimate 3D gaze vectors in the scene coordinate system.
While 3D gaze estimation has been widely studied in remote gaze estimation, there have been very few studies in head-mounted eye tracking. This is mainly because 3D gaze estimation typically requires model-based approaches with special hardware, such as multiple IR light sources and/or stereo cameras [Beymer and Flickner 2003, Nagamatsu et al. 2010]. Hence, it remains unclear whether 3D gaze estimation can be done properly with only a lightweight head-mounted eye tracker. Świrski and Dodgson proposed a method to recover 3D eyeball poses from a monocular eye camera [Świrski and Dodgson 2013]. While it can be applied to lightweight mobile eye trackers, their method has only been evaluated with synthetic eye images, and its real-world performance, including the eye-to-scene camera mapping accuracy, has never been quantified.
We present a novel 3D gaze estimation method for monocular head-mounted eye trackers. Contrary to existing approaches, we formulate 3D gaze estimation as a direct mapping task from 2D pupil positions in the eye camera image to 3D gaze directions in the scene camera. Therefore, for the calibration we collect the 2D pupil positions as well as 3D target points, and finally minimize the distance between the 3D targets and the estimated gaze rays.
The contributions of this work are threefold. First, we summarize and analyze different 3D gaze estimation approaches for a head-mounted setup. We discuss potential error sources and technical difficulties in these approaches, and provide clear guidelines for designing lightweight 3D gaze estimation systems. Second, following from this discussion, we propose a novel 3D gaze estimation method. Our method directly maps 2D pupil positions in the eye camera to 3D gaze directions, and does not require 3D observation from the eye camera. Third, we provide a detailed comparison of our method with state-of-the-art methods in terms of 3D gaze estimation accuracy. The open-source simulation environment and the dataset are available at http://mpii.de/3DGazeSim.
2 3D Gaze Estimation
3D gaze estimation is the task of inferring 3D gaze vectors to the target objects in the environment. Gaze vectors in scene camera coordinates can then be intersected with the reconstructed 3D scene. We discuss three mapping approaches in this paper: 2D-to-2D, 3D-to-3D, and our novel 2D-to-3D mapping approach. In this section, we briefly summarize these three approaches. For more details, please refer to the appendix.
Standard 2D gaze estimation methods assume 2D pupil positions $\mathbf{p}$ in the eye camera images as input. The task is to find the mapping function from $\mathbf{p}$ to 2D gaze positions $\mathbf{s}$ in the scene camera images (Figure 1 (a)). Given a set of calibration data items $\{(\mathbf{p}_i, \mathbf{s}_i)\}$, the mapping function is typically formulated as a polynomial regression. 2D pupil positions are first converted into their polynomial representations $\mathbf{q}(\mathbf{p})$, and the regression weight $\mathbf{w}$ is obtained via linear regression methods. Following Kassner et al. [Kassner et al. 2014], we did not include cubic terms and used an anisotropic polynomial representation.
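As an illustration of the polynomial regression described above, the following is a minimal sketch in Python with NumPy. The exact term set of the anisotropic representation is not recoverable from the text, so the seven terms used here (constant, linear, bilinear, squared, and a single $u^2v^2$ term) are an assumption, and the function names `poly_features`, `fit_2d_to_2d`, and `predict_2d` are hypothetical.

```python
import numpy as np

def poly_features(pupil_xy):
    """Map (N, 2) pupil positions to (N, 7) polynomial features.

    Assumed term set (not confirmed by the text): 1, u, v, uv, u^2, v^2, u^2 v^2.
    """
    p = np.asarray(pupil_xy, dtype=float)
    u, v = p[:, 0], p[:, 1]
    return np.stack(
        [np.ones_like(u), u, v, u * v, u**2, v**2, u**2 * v**2], axis=1
    )

def fit_2d_to_2d(pupil_xy, gaze_xy):
    """Least-squares weights W of shape (7, 2) with poly_features(p) @ W ~ gaze."""
    Q = poly_features(pupil_xy)
    W, *_ = np.linalg.lstsq(Q, np.asarray(gaze_xy, dtype=float), rcond=None)
    return W

def predict_2d(W, pupil_xy):
    """Apply the fitted mapping to new pupil positions."""
    return poly_features(pupil_xy) @ W
```

With 25 calibration points per depth, the system is overdetermined for seven coefficients per output coordinate, which is why an ordinary least-squares solve is sufficient in this sketch.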
In order to obtain 3D gaze vectors, most of the prior work assumes that the 3D gaze vectors are originating from the origin of the scene camera coordinate system. In this case, estimated 2D gaze positions can be simply back-projected to 3D vectors in the scene camera coordinate system. This is equivalent to assuming that the eyeball center position is exactly the same as the origin of the scene camera coordinate system. However, in practice there is always an offset between the scene camera origin and the eyeball position, and this offset causes the parallax error.
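The back-projection step and the geometric origin of the parallax error can be sketched as follows. This is only an illustration under a pinhole model: the intrinsics passed to `back_project` and the eyeball offset `EYE` are hypothetical values, and `viewing_disparity_deg` is a helper introduced here to show that the angle between the camera-to-target and eye-to-target directions changes with target depth, which a single back-projected ray from the camera origin cannot capture.

```python
import math

def back_project(gaze_px, fx, fy, cx, cy):
    """Unit direction of the scene-camera ray through pixel gaze_px (pinhole model)."""
    x = (gaze_px[0] - cx) / fx
    y = (gaze_px[1] - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)

def angle_deg(a, b):
    """Angle in degrees between two 3D vectors (atan2 form for numerical stability)."""
    cross = (a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0])
    dot = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    return math.degrees(math.atan2(math.sqrt(sum(c * c for c in cross)), dot))

def viewing_disparity_deg(target, eye_center):
    """Angle between the camera-to-target and eye-to-target directions."""
    from_eye = tuple(t - e for t, e in zip(target, eye_center))
    return angle_deg(target, from_eye)

# Hypothetical eyeball offset from the scene camera origin, in metres.
EYE = (0.03, 0.03, -0.02)
near = viewing_disparity_deg((0.0, 0.0, 1.0), EYE)  # target at 1 m
far = viewing_disparity_deg((0.0, 0.0, 2.0), EYE)   # target at 2 m
```

Because `near` and `far` differ, a mapping that is exact at one depth is necessarily off at the other when the gaze ray is assumed to start at the camera origin.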
If we can estimate a 3D pupil pose $\mathbf{n}$ (unit normal vector of the pupil disc) from the eye camera as done in Świrski and Dodgson [Świrski and Dodgson 2013], we can instead take a direct 3D-to-3D mapping approach (Figure 1 (b)). Instead of the 2D calibration targets $\mathbf{s}$, we assume 3D calibration targets $\mathbf{t}$ in this case.
With the calibration data $\{(\mathbf{n}_i, \mathbf{t}_i)\}$, the task is to find the rotation $\mathbf{R}$ and translation $\mathbf{T}$ between the scene and eye camera coordinate systems. This can be done by minimizing distances between the 3D gaze targets $\mathbf{t}_i$ and the 3D gaze rays, which are rotated and translated to the scene camera coordinate system. In the implementation, we further parameterize the rotation $\mathbf{R}$ by a 3D angle vector with the rotation angles constrained to a bounded range, and we initialize $\mathbf{R}$ assuming that the eye camera and the scene camera are facing opposite directions.
Estimating 3D pupil pose is not a trivial task in real-world settings. Another potential approach is to directly map 2D pupil positions to 3D gaze directions (Figure 1 (c)).
In this case, we need to map the polynomial feature $\mathbf{q}$ of the pupil position to unit gaze vectors $\mathbf{g}$ originating from an eyeball center $\mathbf{e}$. $\mathbf{g}$ can be parameterized in a polar coordinate system, and we assume a linear mapping from the polynomial feature to the angle vector. The regression weight $\mathbf{w}$ is obtained by minimizing distances between the 3D calibration targets and the mapped 3D gaze rays, as in the 3D-to-3D approach. In the implementation, we used the same polynomial representation as in the 2D-to-2D method to provide a fair comparison with the baseline.
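The forward model and cost of the 2D-to-3D approach can be sketched as below. The angle convention (yaw about the vertical axis, pitch as elevation, with z along the scene camera's optical axis) is an assumption, since the text only states that a polar parameterization is used, and the names `angles_to_unit`, `gaze_rays`, and `cost_2d_to_3d` are hypothetical. The feature matrix can be any per-sample feature representation, for example the polynomial features of the pupil positions.

```python
import numpy as np

def angles_to_unit(yaw, pitch):
    """Unit vectors for the assumed convention (z is the optical axis)."""
    return np.stack(
        [np.cos(pitch) * np.sin(yaw), np.sin(pitch), np.cos(pitch) * np.cos(yaw)],
        axis=-1,
    )

def gaze_rays(W, features):
    """Linear map from (N, F) features to (yaw, pitch), then to (N, 3) unit vectors."""
    angles = features @ W  # shape (N, 2)
    return angles_to_unit(angles[:, 0], angles[:, 1])

def cost_2d_to_3d(W, eye_center, features, targets):
    """Sum of squared distances between 3D targets and the mapped gaze lines."""
    g = gaze_rays(W, features)
    # For a unit direction g, |(t - e) x g| is the distance from t to the line.
    residual = np.cross(targets - eye_center, g)
    return float(np.sum(residual**2))
```

In this sketch both the weight matrix and the eyeball center are free parameters of the cost, so a nonlinear least-squares solver would optimize them jointly.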
3 Data Collection
In order to evaluate the potential and limitations of the introduced mapping approaches, we conducted two studies. The first used data obtained from a simulation environment, whereas the second used real-world data collected from 14 participants.
We first analyzed the different mapping approaches in a simulation environment. Our simulation environment is based on a basic model of the human eye consisting of a pair of spheres [Lefohn et al. 2003] and the scene and eye camera models. The eye model and a screenshot of the simulation environment are illustrated in Figure 1. We used human average anatomical parameters for the eye model. The pupil center is taken to be the center of the circle formed by the intersection of the two spheres. For both eye and scene cameras, we used the pinhole camera model. Intrinsic parameters were set to values similar to those of the actual eye tracking headset we used in the real-world environment.
One of the key questions about 3D gaze estimation is whether calibration at a single depth is sufficient or not. Intuitively, obtaining calibration data at different depths from the scene camera can improve the 3D mapping performance. We set the calibration and test plane depths to 1 m, 1.25 m, 1.5 m, 1.75 m, and 2 m. At each depth, points are selected from two grids: a 5 by 5 grid, which gives us 25 calibration points (blue), and an inner 4 by 4 grid for 16 test points (red), displayed on the white plane of Figure 1. Both grids are symmetric with respect to the scene camera’s principal axis. From the eye model used, we are able to compute the corresponding gaze ray for each point.
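The layout of calibration and test points can be sketched as follows. The five depths and the 5 by 5 and 4 by 4 grid sizes come from the text; the physical extent of the grids and the placement of the inner grid at the cell centres of the outer grid are assumptions made for this illustration, and `plane_grid` is a hypothetical helper.

```python
def plane_grid(n, half_width, half_height, depth):
    """n x n grid of 3D points on a fronto-parallel plane, centred on the principal axis."""
    points = []
    for i in range(n):
        for j in range(n):
            x = -half_width + 2.0 * half_width * j / (n - 1)
            y = -half_height + 2.0 * half_height * i / (n - 1)
            points.append((x, y, depth))
    return points

DEPTHS = [1.0, 1.25, 1.5, 1.75, 2.0]  # metres, as in the text

# Assumed plane extent: 1.0 m x 0.6 m (hypothetical values).
calibration = {d: plane_grid(5, 0.5, 0.3, d) for d in DEPTHS}

# Inner 4 x 4 grid, assumed here to sit at the cell centres of the 5 x 5 grid,
# so that no test point coincides with a calibration point.
test_points = {d: plane_grid(4, 0.375, 0.225, d) for d in DEPTHS}
```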
We also present an evaluation of the gaze estimation approaches on a real-world dataset to show the validity of the 3D gaze estimation approaches.
The recording system consisted of a Lenovo G580 laptop and a Phex Wall 55” display (121.5 cm × 68.7 cm) with a resolution of 1920 × 1080. Gaze data was collected using a PUPIL head-mounted eye tracker connected to the laptop via USB [Kassner et al. 2014] (see Figure 1(a)). The eye tracker has two cameras: one eye camera with a resolution of pixels recording a video of the right eye from close proximity, as well as an egocentric (scene) camera with a resolution of pixels. Both cameras recorded videos at 30 Hz. Pupil positions in the eye camera were detected using the PUPIL eye tracker’s implementation.
We implemented remote recording software which runs the calibration and test recordings and shows the targets on the display to the participants. As shown in Figure 1(b), the target markers were designed so that their 3D positions can be obtained using the ArUco library [Garrido-Jurado et al. 2014]. Intrinsic parameters of the scene and eye cameras were calibrated before recording, and used for computing 3D fixation target positions and 3D pupil poses.
We recruited 14 participants aged between 22 and 29 years. The majority of them had little or no previous experience with eye tracking. Every participant had to perform two recordings, a calibration and a test recording, at each of five different distances from the display. Recording distances were marked by red stripes on the ground (see Figure 1(b)). The stripes were aligned parallel to the display, with the first at a distance of 1 meter and the following ones spaced 25 cm apart (1.0, 1.25, 1.5, 1.75, and 2.0 m). For every participant we recorded 10 videos.
As in the simulation environment, the participants were instructed to look at 25 fixation target points from the grid pattern in Figure 1(b). After this step, the participants had to perform the same procedure again while looking at 16 fixation targets placed at different positions from those in the initial calibration, to collect the test data for our evaluation. This procedure was then repeated for the other four distances. The only restriction we imposed was that the participants should not move their head during the recording.
Since the ground-truth eyeball position is not available in the real-world study, we evaluate the estimation accuracy using an angular error observed from the scene camera. For the case where 2D gaze positions are estimated (2D-to-2D mapping), we back-projected the estimated 2D gaze position into the scene, and directly measured the angle between this line and the line from the origin of the scene camera to the measured fixation target $\mathbf{t}$. For the cases where 3D gaze vectors are estimated, we first determined the estimated 3D fixation target position $\hat{\mathbf{t}}$ assuming the same depth as the ground-truth target $\mathbf{t}$. Then the angle between the lines from the origin to $\hat{\mathbf{t}}$ and to $\mathbf{t}$ was measured.
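The two error measures described above can be sketched as follows. This is an illustration rather than the evaluation code used in the study: the pinhole intrinsics are hypothetical, "same depth" is interpreted here as the same z coordinate in scene camera coordinates (an assumption), and the function names are introduced only for this example.

```python
import math

def angle_deg(a, b):
    """Angle in degrees between two 3D vectors (atan2 form for numerical stability)."""
    cross = (a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0])
    dot = a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    return math.degrees(math.atan2(math.sqrt(sum(c * c for c in cross)), dot))

def error_from_2d(gaze_px, target, fx, fy, cx, cy):
    """2D-to-2D case: back-project the pixel and compare with the target direction."""
    ray = ((gaze_px[0] - cx) / fx, (gaze_px[1] - cy) / fy, 1.0)
    return angle_deg(ray, target)

def error_from_3d(eye_center, gaze_dir, target):
    """3D case: intersect the gaze ray with the plane z = target depth,
    then measure the angle from the scene camera origin."""
    tau = (target[2] - eye_center[2]) / gaze_dir[2]
    estimate = tuple(e + tau * g for e, g in zip(eye_center, gaze_dir))
    return angle_deg(estimate, target)
```

Both functions return an angle seen from the scene camera origin, so errors from the 2D and 3D approaches are directly comparable.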
We compared the different mapping approaches in Figure 3 using an increasing number of calibration depths in both simulation and real-world environments. Each plot corresponds to mean estimation errors over all test planes and all combinations of calibration planes. Angular error is evaluated from the ground-truth eyeball position. It can be seen that in all cases the estimation performance can be improved by using more calibration planes. Even the 2D-to-2D mapping approach performs slightly better overall with multiple calibration depths in both environments. The 2D-to-3D mapping approach performed better than the 2D-to-2D mapping in all cases in the simulation environment. For the 3D-to-3D mapping approach, a parallax error close to zero can be achieved.
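Averaging over all test planes and all combinations of calibration planes can be sketched as below. The `evaluate` callback stands in for calibrating on a set of depths and measuring the angular error on one test depth; it is a hypothetical placeholder, as is the function name `mean_error_by_num_depths`.

```python
from itertools import combinations
from statistics import mean

def mean_error_by_num_depths(depths, evaluate):
    """Mean error for each number k of calibration depths.

    evaluate(calib_depths, test_depth) must return an angular error in degrees.
    """
    result = {}
    for k in range(1, len(depths) + 1):
        errors = [
            evaluate(calib, test_depth)
            for calib in combinations(depths, k)
            for test_depth in depths
        ]
        result[k] = mean(errors)
    return result

def toy_evaluate(calib, test_depth):
    """Toy stand-in: error grows with the distance to the nearest calibration depth."""
    return min(abs(c - test_depth) for c in calib)

curve = mean_error_by_num_depths([1.0, 1.25, 1.5, 1.75, 2.0], toy_evaluate)
```

With five depths there are 5, 10, 10, 5, and 1 calibration combinations for k = 1 to 5, each evaluated on all five test depths.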
Similarly to the simulation case, we first compare the 2D-to-3D mapping with the 2D-to-2D mapping in terms of the influence of different calibration depths, displayed as solid lines in Figure 3. Since it turned out that the 3D-to-3D mapping on real-world data has a much larger angular error (over 10°) than the 2D-to-3D mapping, we omit its results in the following analysis.
Contrary to the simulation result, with a lower number of calibration depths the 2D-to-2D approach performs better than the 2D-to-3D approach on real-world data. However, with an increasing number of calibration depths, the 2D-to-3D approach outperforms the 2D-to-2D approach in terms of angular error in degrees of visual angle. With five calibration depths, the 2D-to-3D approach achieves an overall mean error of less than 1.3 degrees over all test depths and all participants. A more detailed analysis and discussion with corresponding performance plots are available in the appendix.
We discussed three different approaches for 3D gaze estimation using head-mounted eye trackers. Although it was shown that the 3D-to-3D mapping is not a trivial task, the 2D-to-3D mapping approach was shown to perform better than the standard 2D-to-2D mapping approach using simulation data. One of the key observations from the simulation study is that the 2D-to-3D mapping approach requires at least two calibration depths. Given more than two calibration depths, the 2D-to-3D mapping can significantly reduce the parallax error.
On the real data, we observed a decreasing error for the 2D-to-3D mapping with an increasing number of calibration depths, and it could outperform the 2D-to-2D mapping. However, the performance of the 2D-to-3D mapping was worse than in the simulation environment. The reasons for the different performance of the mapping approaches in the simulation and real-world environments are manifold and reveal their limitations. Our simulation environment considers an ideal setting and does not include noise that occurs in the real world. This noise is mainly produced by potential errors in the pupil and marker detection, as well as head movements of the participants.
In future work it will be important to investigate how the 3D-to-3D mapping approach can work in practice. The fundamental difference from the 2D-to-3D mapping is that the mapping function has to explicitly handle the rotation between eye and scene camera coordinate systems. In addition to the fundamental estimation inaccuracy of the 3D pupil pose estimation technique without modeling real-world factors such as corneal refraction, we did not consider the difference between optical and visual axes. A more appropriate mapping function could be a potential solution for the 3D-to-3D mapping, and another option could be to use more general regression techniques considering the 2D-to-3D results.
Throughout the experimental validation, this research also illustrated the fundamental difficulty of the 3D gaze estimation task. It has been shown that the design of the calibration procedure is also quite important, and it is essential to address the issue from the standpoint of both calibration design and mapping formulation. Since the importance of different calibration depths has been shown, the design of automatic calibration procedures, e.g., how to obtain calibration data at different depths using only digital displays, is another important HCI research issue.
Finally, it is also important to combine the 3D gaze estimation approach with 3D scene reconstruction methods and evaluate the overall performance of 3D gaze mapping. In this sense, it is also necessary to evaluate performance with respect to scene reconstruction error.
In this work, we provided an extensive discussion on different approaches for 3D gaze estimation using head-mounted eye trackers. In addition to the standard 2D-to-2D mapping approach, we discussed two potential 3D mapping approaches using either 3D or 2D observation from the eye camera. We conducted a detailed analysis of 3D gaze estimation approaches using both simulation and real data.
Experimental results showed the advantage of the proposed 2D-to-3D estimation method, but its complexity and technical challenges were also revealed. Together with the dataset and simulation environment, this study provides a solid basis for future research on 3D gaze estimation with lightweight head-mounted devices.
We would like to thank all participants for their help with the data collection. This work was funded, in part, by the Cluster of Excellence on Multimodal Computing and Interaction (MMCI) at Saarland University, the Alexander von Humboldt Foundation, and a JST CREST research grant.
- [Beymer and Flickner 2003] Beymer, D., and Flickner, M. 2003. Eye gaze tracking using an active stereo head. In Proc. CVPR.
- [Bulling 2016] Bulling, A. 2016. Pervasive attentive user interfaces. IEEE Computer 49, 1, 94–98.
- [Duchowski et al. 2014] Duchowski, A. T., House, D. H., Gestring, J., Congdon, R., Świrski, L., Dodgson, N. A., Krejtz, K., and Krejtz, I. 2014. Comparing estimated gaze depth in virtual and physical environments. In Proc. ETRA, 103–110.
- [Garrido-Jurado et al. 2014] Garrido-Jurado, S., Muñoz-Salinas, R., Madrid-Cuevas, F. J., and Marín-Jiménez, M. J. 2014. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition 47, 6, 2280–2292.
- [Kassner et al. 2014] Kassner, M., Patera, W., and Bulling, A. 2014. Pupil: an open source platform for pervasive eye tracking and mobile gaze-based interaction. In Adj. Proc. UbiComp, 1151–1160.
- [Lefohn et al. 2003] Lefohn, A., Budge, B., Shirley, P., Caruso, R., and Reinhard, E. 2003. An ocularist’s approach to human iris synthesis. IEEE Trans. Computer Graphics and Applications 23, 6, 70–75.
- [Mardanbegi and Hansen 2012] Mardanbegi, D., and Hansen, D. W. 2012. Parallax error in the monocular head-mounted eye trackers. In Proc. UbiComp, ACM, 689–694.
- [Munn and Pelz 2008] Munn, S. M., and Pelz, J. B. 2008. 3d point-of-regard, position and head orientation from a portable monocular video-based eye tracker. In Proc. ETRA, ACM, 181–188.
- [Nagamatsu et al. 2010] Nagamatsu, T., Sugano, R., Iwamoto, Y., Kamahara, J., and Tanaka, N. 2010. User-calibration-free gaze tracking with estimation of the horizontal angles between the visual and the optical axes of both eyes. In Proc. ETRA, ACM, 251–254.
- [Pfeiffer and Renner 2014] Pfeiffer, T., and Renner, P. 2014. Eyesee3d: A low-cost approach for analyzing mobile 3d eye tracking data using computer vision and augmented reality technology. In Proc. ETRA, ACM, 369–376.
- [Świrski and Dodgson 2013] Świrski, L., and Dodgson, N. A. 2013. A fully-automatic, temporal approach to single camera, glint-free 3d eye model fitting [abstract]. In Proc. PETMEI.
- [Takemura et al. 2014] Takemura, K., Takahashi, K., Takamatsu, J., and Ogasawara, T. 2014. Estimating 3-d point-of-regard in a real environment using a head-mounted eye-tracking system. IEEE Trans. on Human-Machine Systems 44, 4, 531–536.
3D Gaze Estimation Approaches
We first introduce detailed formulations of three approaches that are briefly presented above.
2D-to-2D Mapping Approach
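As briefly described above, standard 2D gaze estimation methods assume 2D pupil positions $\mathbf{p}$ in the eye camera images as input, and the task is to find the polynomial mapping function from $\mathbf{p}$ to 2D gaze positions $\mathbf{s}$ in the scene camera images. 2D pupil positions are first converted into their polynomial representations $\mathbf{q}(\mathbf{p})$, and a coefficient vector $\mathbf{w}$ which minimizes a cost function

$$E_{2D}(\mathbf{w}) = \sum_i \left\| \mathbf{s}_i - \mathbf{w}\,\mathbf{q}(\mathbf{p}_i) \right\|^2 \qquad (1)$$

is obtained via linear regression methods. Then any pupil position $\mathbf{p}$ can be mapped to a 2D gaze position as $\mathbf{s} = \mathbf{w}\,\mathbf{q}(\mathbf{p})$.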
3D-to-3D Mapping Approach
In this case, the input to the mapping function is 3D pupil pose unit vectors $\mathbf{n}$. Given the calibration data $\{(\mathbf{n}_i, \mathbf{t}_i)\}$ with 3D calibration targets $\mathbf{t}$, the task is to find the rotation $\mathbf{R}$ and translation $\mathbf{T}$ between the scene and eye camera coordinate systems.
If we denote the origin of the pupil pose vectors as $\mathbf{e}_0$, 3D gaze rays after the rotation and translation are defined as a line $\mathbf{R}\mathbf{e}_0 + \mathbf{T} + \tau\,\mathbf{R}\mathbf{n}$, where $\tau$ parameterizes the gaze line. (Please note that $\tau$ is the parameter required to determine the 3D gaze point by intersecting the gaze ray with the scene, and does not have to be obtained during the calibration stage.) Given the calibration data, $\mathbf{R}$ and $\mathbf{T}$ are obtained by minimizing distances between 3D gaze targets and the 3D gaze rays. In a vector form, the squared distance can be written as

$$d^2 = \left\| (\mathbf{t} - \mathbf{R}\mathbf{e}_0 - \mathbf{T}) \times \mathbf{R}\mathbf{n} \right\|^2. \qquad (2)$$

Since $\mathbf{e} = \mathbf{R}\mathbf{e}_0 + \mathbf{T}$ denotes the eyeball center position in the scene camera coordinate system, the cost function can be defined as

$$E_{3D}(\mathbf{R}, \mathbf{T}) = \sum_i \left\| (\mathbf{t}_i - \mathbf{e}) \times \mathbf{R}\mathbf{n}_i \right\|^2. \qquad (3)$$

Minimization of Eq. (3) can be done using nonlinear optimization methods such as the Levenberg-Marquardt algorithm. At the initialization step of the nonlinear optimization, we choose $\mathbf{R}$ and $\mathbf{T}$ considering the opposite direction of the scene and eye cameras in the world coordinate system.
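The cost of the 3D-to-3D approach can be sketched as follows, assuming the point-to-line distance formulation described above. The Euler-angle convention in `rotation_from_angles` is an assumption (the text only mentions a 3D angle vector), and all function names are hypothetical. The cost returned by `cost_3d_to_3d` is what a nonlinear least-squares solver would minimize over the angles and the translation.

```python
import numpy as np

def rotation_from_angles(angles):
    """Rotation matrix from an angle vector (assumed convention: R = Rz @ Ry @ Rx)."""
    ax, ay, az = angles
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

def line_distances(targets, origin, directions):
    """Distance from each 3D target to the line origin + tau * direction."""
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return np.linalg.norm(np.cross(targets - origin, d), axis=1)

def cost_3d_to_3d(angles, translation, eye_origin, pupil_normals, targets):
    """Sum of squared target-to-gaze-line distances in scene camera coordinates."""
    R = rotation_from_angles(angles)
    origin = R @ eye_origin + translation  # eyeball center in scene coordinates
    directions = pupil_normals @ R.T       # each row is R @ n_i
    return float(np.sum(line_distances(targets, origin, directions) ** 2))
```

A starting point with a half-turn about the vertical axis, such as `angles = (0, pi, 0)`, would be one way to encode cameras facing opposite directions; that specific choice is an assumption of this sketch, not a value taken from the text.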
2D-to-3D Mapping Approach
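Another potential approach is to directly map 2D pupil positions $\mathbf{p}$ to 3D gaze directions $\mathbf{g}$. In this case, we map the polynomial feature $\mathbf{q}$ to unit gaze vectors $\mathbf{g}$ originating from the eyeball center $\mathbf{e}$ in the scene camera coordinate system. $\mathbf{g}$ can be parameterized by two angles $(\theta, \phi)$ in a polar coordinate system, and we assume a linear mapping from the polynomial feature to the angle vector as

$$(\theta, \phi)^\top = \mathbf{w}\,\mathbf{q}.$$

Given the 3D calibration data $\{(\mathbf{p}_i, \mathbf{t}_i)\}$, $\mathbf{w}$ and $\mathbf{e}$ can be obtained by minimizing distances between 3D gaze targets and the gaze rays. Therefore, similarly to the 3D-to-3D mapping case, the target cost function to be minimized is

$$E(\mathbf{w}, \mathbf{e}) = \sum_i \left\| (\mathbf{t}_i - \mathbf{e}) \times \mathbf{g}(\mathbf{q}_i) \right\|^2.$$

In order to initialize the parameters for nonlinear optimization, we first set an initial value for $\mathbf{e}$. Then, using the polar coordinates $(\theta_i, \phi_i)$ of the gaze targets $\mathbf{t}_i$, the initial $\mathbf{w}$ can be obtained by solving the linear regression problem

$$\min_{\mathbf{w}} \sum_i \left\| (\theta_i, \phi_i)^\top - \mathbf{w}\,\mathbf{q}_i \right\|^2.$$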
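The initialization step can be sketched as a linear regression from polynomial pupil features to the polar angles of the calibration targets. Several choices here are assumptions: the initial eyeball center is placed at the scene camera origin, the angles follow a yaw/pitch convention with z along the optical axis, and the seven-term feature set mirrors the 2D-to-2D baseline. The names `poly_features`, `target_angles`, and `init_weights` are hypothetical.

```python
import numpy as np

def poly_features(pupil_xy):
    """(N, 2) pupil positions to (N, 7) features (assumed term set)."""
    p = np.asarray(pupil_xy, dtype=float)
    u, v = p[:, 0], p[:, 1]
    return np.stack(
        [np.ones_like(u), u, v, u * v, u**2, v**2, u**2 * v**2], axis=1
    )

def target_angles(targets, eye_center):
    """(yaw, pitch) of each target as seen from eye_center (assumed convention)."""
    d = np.asarray(targets, dtype=float) - eye_center
    r = np.linalg.norm(d, axis=1)
    yaw = np.arctan2(d[:, 0], d[:, 2])
    pitch = np.arcsin(d[:, 1] / r)
    return np.stack([yaw, pitch], axis=1)

def init_weights(pupil_xy, targets, eye_center=None):
    """Initial weights W0 (7, 2) from a linear least-squares fit to target angles."""
    if eye_center is None:
        eye_center = np.zeros(3)  # assumed initial eyeball center: camera origin
    Q = poly_features(pupil_xy)
    A = target_angles(targets, eye_center)
    W0, *_ = np.linalg.lstsq(Q, A, rcond=None)
    return W0
```

The resulting `W0`, together with the initial eyeball center, would then serve as the starting point for the nonlinear minimization of the ray-distance cost.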
In this section, we provide an extended analysis of the performance of the different approaches, taking single and multiple calibration depth combinations into account.
Figure 4 shows the error for all three mapping approaches on the simulation data when the calibration depth is fixed, in a similar manner to Mardanbegi and Hansen’s work [Mardanbegi and Hansen 2012]. Panels (a) and (b) correspond to the performance using one and three calibration depths, respectively. Each plot shows the mean angular error distribution over test depths, and each color corresponds to a certain calibration depth. The error bars describe the corresponding standard deviations. Dashed lines correspond to the 2D-to-3D mapping, dotted lines correspond to the 3D-to-3D mapping, and solid lines correspond to the 2D-to-2D mapping.
With one calibration depth (Figure 4(a)), the performance of the 2D-to-3D mapping is always better than the 2D-to-2D case. However, we can observe that the parallax error is still present in the 2D-to-3D case, which indicates the fundamental limitations of the approximated mapping approach. With three calibration depths (Figure 4(b)), the 2D-to-3D mapping approach performs significantly better than in (a) and the parallax error reaches a near-zero level. However, there is a tendency for the error to become larger as the test depth becomes closer to the camera, which indicates the limitations of the proposed mapping function. The performance of the 2D-to-2D mapping is also improved, but we can see that increasing the number of calibration depths cannot be a fundamental solution to the parallax error issue. For the 3D-to-3D mapping, the angular error is close to zero even for only one calibration depth. Taking more calibration depths into account does not lead to a further improvement.
Similarly, we show a detailed comparison of the 2D-to-2D and 2D-to-3D mapping approaches using the real-world data. Panel (a) displays the mean angular error for both approaches over all 14 participants when taking only one calibration depth, in the same manner as panel (a) of the simulation results. For both mapping approaches, each calibration depth setting performed best for the corresponding test depth, and the error increased with increasing distance of the test depth from the calibration depth. However, for the 2D-to-2D approach the angular error values over all distances are smaller than for the 2D-to-3D case, except for the case where the calibration depth and test depth are the same.
This behavior changes for an increasing number of calibration depths, as can be seen in Figure 4(b), where we used three different calibration depths as in panel (b) of the simulation results. The 2D-to-3D mapping approach performs better than the 2D-to-2D mapping for nearly all combinations, except for the test depth D1, since it exploits the additional 3D information collected during calibration to improve the gaze direction estimation.
Figure 6 shows the mean angular errors with respect to the offset between the calibration and test depths for the one calibration depth setting. Negative distance values on the horizontal axis indicate cases where the test depth is closer than the calibration depth, and positive values indicate cases where it is farther away. As can be seen, the 2D-to-3D mapping approach tends to produce a higher error as the distance between the test depth and the calibration depth increases.