With an ever-increasing number of devices competing for users’ limited visual attention, developing attentive user interfaces that adapt to it has emerged as a key challenge in human-computer interaction (HCI) [43, 6]. This challenge has become particularly important in mobile HCI, i.e. for mobile devices used on the go, in which attention allocation is subject to a variety of external influences and highly fragmented [29, 40]. Consequently, the ability to robustly sense attentive behaviour has emerged as a fundamental requirement for predicting interruptibility (i.e. identifying opportune moments to interrupt a user) [7, 13, 30], estimating the noticeability of user interface content such as notifications, and measuring fatigue, boredom, or user engagement.
Previous methods to sense mobile user attention either required special-purpose eye tracking equipment or limited users’ mobility [34, 8], thereby limiting the ecological validity of the obtained findings. The need to study attention during everyday interactions has triggered research on using device interactions or events as a proxy for user attention, i.e. assuming that attention is on the device whenever the screen is on (e.g. Apple Screen Time) or whenever touch events [45, 27], notifications, or messages occur. While proxy methods facilitate daily-life recordings, it is impossible to know whether users actually looked at their device, and the resulting attention metrics are therefore unreliable. One solution to this problem is manual annotation of attentive behaviour using video recordings, but this approach is tedious, time-consuming, and impractical for large-scale studies.
In this work we instead study mobile attention sensing using off-the-shelf cameras and appearance-based gaze estimation based on machine learning. This approach has a number of advantages. First, it does not require special-purpose eye tracking equipment given that front-facing cameras are readily integrated into an ever-increasing number of mobile devices and offer increasingly high-resolution images. Second, our approach enables recordings of attention allocation in-situ, i.e. during interactions that users naturally perform in their daily life. Third, in combination with recent advances in machine learning methods for appearance-based gaze estimation [56, 53] and device-specific model adaptation, our approach not only promises a new generation of mobile gaze-based interfaces but also a direct (no proxy required) and fully automatic (no manual annotation required) means to sense user attention during mobile interactions.
We extend a recent method for unsupervised eye contact detection in stationary settings and address challenges specific to mobile interaction scenarios: We use a multi-task CNN for robust face detection even on partially visible faces, which is a key challenge in mobile settings. We further combine a state-of-the-art hourglass neural network architecture with a Kalman filter for more accurate facial landmark detection and head pose estimation. Reliable head pose estimates are particularly critical in mobile settings given the large variability in head poses. We finally normalize the images and train an appearance-based gaze estimator on the large-scale GazeCapture dataset.
The specific contributions of this work are threefold. First, we present the first method to quantify human attention allocation during everyday mobile phone interactions. Our method addresses key challenges specific to mobile settings. Second, we evaluate our method on the sample use case of eye contact detection and show, on two publicly available datasets [20, 14], that our method significantly outperforms the state of the art and is robust to the significant variability caused by mobile settings with respect to users, mobile devices, and daily-life situations. Third, we present a set of attention metrics enabled by our method and discuss how our method can be used as a general tool to study and quantify attention allocation on mobile devices in-situ.
2 Related Work
Our work is related to previous works on (1) user behavior modeling on mobile devices, (2) attention analysis in mobile settings, and (3) eye contact detection.
2.1 User Behavior Modeling on Mobile Devices
Over the years, smartphones have become more powerful, feature-rich, miniaturised computers. Having such devices with us all the time has implications and our usage patterns have changed significantly. A study shows that the nature of attentional resources on mobile devices has become highly fragmented and can last for as little as 4 seconds . This conclusion is similar to what Karlson et al. identified by looking at task disruption and the barriers faced when performing tasks on their mobile device . Smartphone overuse can have negative consequences in young adults and could lead to sleep deprivation and attention deficits . With such changes in the interaction patterns, it has become highly relevant to study and model user behaviour and visual attention. Sensor-rich mobile devices enable us to collect data and build models with applications in many different domains. A significant area of research is concerned with interruptibility or predicting the opportune moments to deliver messages and notifications. Mehrotra et al. investigated people’s receptivity to mobile notifications . A different study measured the effects of interrupting a user while performing a task and then evaluated task performance, emotional state, and social attribution . While many approaches only look at the immediate past for predicting interruptibility, Choy et al. proposed a method which also looks at a longer history of up to one day . In a Wizard of Oz study, Hudson et al. analysed which sensors are useful in predicting interruptibility . The way users interact with a certain device does not only depend on the content or the application used, but could be affected by the environment. Smartphone usage and interruptibility can also depend on the user’s location or social context [13, 12]. Besides looking at interruptibility, others have used attention to model different behavioural traits. Pielot et al. tried to predict attentiveness to mobile instant messages . 
User engagement can be analysed by collecting, for example, EEG data  or by looking at visual saliency and how this affects different engagement metrics . Toker et al. investigate engagement metrics in visualisation systems which can adapt to individual user characteristics . Alertness, another indicator for attention, can be monitored continuously and unobtrusively . Such characteristics can be used to better understand user attention patterns.
2.2 Attention Analysis in Mobile Settings
Previous research on user behaviour, human attention, or modelling behavioural traits has not focused on mobile devices. Given their popularity, understanding, detecting, modelling, and predicting human attention has developed into a new area of research. VADS explored the possibility of smartphone-based detection of the user’s visual attention. Users had to look at the intended object and hold the device so that the object, as well as the user’s face, could be simultaneously captured by the front and rear cameras. In an analysis task, knowing where users direct their attention might be sufficient. However, to fulfill the vision of pervasive attentive user interfaces, a system needs to predict where the user’s attention will be. Steil et al. proposed an approach to forecast visual attention by leveraging a wearable eye tracker and device-integrated sensors. Another approach is to anticipate the user’s gaze with Generative Adversarial Networks (GANs) applied to egocentric videos. Attention allocation and modelling user attention go beyond research and lab studies. With iOS version 12, Apple has released a built-in feature called Screen Time which measures the amount of time the screen is on and presents usage statistics. A similar app from Google for Android is Digital Wellbeing. Such applications provide interesting insights into one’s own usage; however, they are rather naive in that they always assume the user is attending to the device whenever the screen is on.
2.3 Eye Contact Detection
Unlike gaze estimation, which regresses the gaze direction, eye contact detection is a binary classification task, i.e. detecting whether someone is looking at a target object or not. The first works in this direction used LEDs attached to the target object to detect whether users were looking at the camera or not [43, 38, 10]. Selker et al. proposed a glass-mounted device which transmitted the user ID to the gaze target object. These methods require dedicated eye contact sensors and cannot be used with unmodified mobile devices. Recent works focused on using only off-the-shelf cameras for eye contact detection. Smith et al. proposed GazeLocking, a simple supervised appearance-based classification method for sensing eye contact. Ye et al. used head-mounted wearable cameras and a learning-based approach to detect eye contact. Building on recent advances in appearance-based gaze estimation [21, 55, 35, 54, 56], Zhang et al. proposed a full-face gaze estimation method and introduced an unsupervised approach to eye contact detection in stationary settings based on it. In their approach, during training, the gaze samples in the camera plane were clustered to automatically infer eye contact labels. Extending this method, Mueller et al. proposed an eye contact detection method which additionally correlates people’s gaze with their speaking behaviour by leveraging the fact that people often tend to look at the person who is speaking. All these methods were limited to stationary settings and assumed that the camera always has a clear view of the user. Only a few previous works focused on gaze estimation and interaction on mobile devices, but either their performance and robustness were severely limited [46, 15] or they were studied in highly controlled and simplified laboratory settings. As demonstrated in previous works, these assumptions no longer hold when using the front-facing camera of mobile devices.
To detect whether users are looking at their mobile device or not, we extended the unsupervised eye contact detection method proposed by Zhang et al. to address challenges specific to mobile interactive scenarios. The main advantage of this method is the ability to automatically detect the gaze target in an unsupervised fashion, which eliminates the need for manual data annotation. The only assumption of this approach is that the camera needs to be mounted next to the target object. This assumption is still valid in our use case, since the front-facing camera is always next to the device’s display.
Figure 2 illustrates our method. During training, the pipeline first detects the face and facial landmarks in the input image. Afterwards, the image is normalized by warping it to a normalized space with fixed camera parameters and is fed into an appearance-based gaze estimation CNN. The CNN infers the 3D gaze direction and the on-plane gaze location. The gaze locations of the different images are then clustered: the samples belonging to the cluster closest to the origin of the camera coordinate system are labeled as positive (eye contact) and all other samples are labeled as negative (non eye contact). The labeled samples are then used to train a binary support vector machine (SVM), which uses 4096-dimensional face feature vectors to predict eye contact. For inference, the gaze estimation CNN extracts features from the normalized images, which are then classified by the trained SVM.
3.1 Face Detection and Alignment
Images taken from the front-facing camera of mobile devices in the wild often contain large variation in head pose, and only parts of the face or facial landmarks may be visible. To address this challenge specific to mobile scenarios, we use a more robust face detection approach which consists of three multi-task deep convolutional networks. In case of multiple faces, we only keep the face with the largest bounding box, since we assume that only one user at a time is using the mobile device. If the detector fails to detect any face, we automatically predict this image to have no eye contact. After detecting the face bounding box, it is particularly important to accurately locate the facial landmarks since these are used for head pose estimation and image normalization. For additional robustness, we use a state-of-the-art hourglass model which estimates the 2D position of 68 different facial landmarks.
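The two selection rules described above (keep the largest face; predict no eye contact when no face is found) can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the `(x, y, width, height)` box layout and the function names `select_primary_face` and `predict_or_continue` are assumptions made for the example, and the face detector itself is outside its scope.

```python
from typing import List, Optional, Tuple

# Assumed box layout for this sketch: (x, y, width, height) in pixels.
Box = Tuple[float, float, float, float]


def select_primary_face(boxes: List[Box]) -> Optional[Box]:
    """Return the detection with the largest bounding-box area, or None."""
    if not boxes:
        return None
    return max(boxes, key=lambda b: b[2] * b[3])


def predict_or_continue(boxes: List[Box]) -> Optional[bool]:
    """Return False (no eye contact) when no face was detected.

    None means "a face is available, continue with the rest of the
    pipeline" (landmarks, head pose, gaze estimation, classification).
    """
    if select_primary_face(boxes) is None:
        return False
    return None


if __name__ == "__main__":
    detections = [(0, 0, 10, 10), (5, 5, 40, 30)]
    print(select_primary_face(detections))  # the 40 x 30 box
    print(predict_or_continue([]))          # False
```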
3.2 Head Pose Estimation and Data Normalization
The facial landmarks obtained from the previous step are used to estimate the 3D head pose of the detected face by fitting a generic 3D facial shape model. In contrast to Zhang et al., who used a facial shape model with six 3D points (four from the two eye corners and two from the mouth), we instead used a model with all 68 3D points, which is more robust for extreme head poses, which are common in mobile settings. We first estimate an initial solution by fitting the model using the EPnP algorithm and then further refine this solution with a Levenberg-Marquardt optimization. The final estimate is stabilized with a Kalman filter. The PnP problem typically assumes that the camera which captured the image is calibrated. However, since we do not know the calibration parameters of the different front-facing cameras of the mobile devices (nor do we want to enforce this requirement, given the overhead of calibrating every camera), we approximated the intrinsic camera parameters with default values. Once the 3D head pose is estimated, the face image is warped and cropped as proposed by Zhang et al. to a normalized space with fixed parameters. The benefit of this normalization is the ability to handle variations in hardware setups as well as variations due to different shapes and appearance of the face. For this, we define the head coordinate system in the same way as proposed by the authors: The head is defined based on a triangle connecting the three midpoints of the eyes and mouth. The x-axis is defined to be the direction of the line connecting the midpoint of the left eye with the midpoint of the right eye. The y-axis is defined to be the direction from the eyes to the midpoint of the mouth and lies perpendicular to the x-axis within the triangle plane. The remaining z-axis is perpendicular to the triangle plane and points towards the back of the face.
In our implementation, we chose the focal length of the normalized camera to be 960, the normalized distance to the camera to be 300 mm and the normalized face image size to be 448 x 488 pixels.
3.3 Gaze Estimation
We use a state-of-the-art gaze estimator based on a convolutional neural network (CNN) to estimate the 3D gaze direction. Besides the gaze vector, the CNN also outputs a 4096-dimensional feature vector, which comes from the last fully-connected layer of the CNN. This face feature vector will later be used as input for the eye contact detector. Given that our method was designed for robustness on images captured with mobile devices, we trained our model on the large-scale GazeCapture dataset. This dataset consists of 1,474 different users and around 2.5 million images captured using smartphone and tablet devices. Our trained model achieves a within-dataset angular error of 4.3° and a cross-dataset angular error of 5.3° on the MPIIFaceGaze dataset (which is comparable to current gaze estimation approaches). To overcome inaccurate or incorrect gaze estimates caused by extreme head poses, we propose the following adaptive thresholding mechanism: Whenever the pitch or the yaw of the estimated head pose falls outside a symmetric range around zero, defined by a pitch threshold and a yaw threshold respectively, we use the head pose instead of the estimated gaze vector as a proxy for gaze direction. More specifically, we assume that the gaze direction is the z-axis of the head pose. In practice, we set a value of 40° for both thresholds. Together with the estimated 3D head pose, the gaze direction can be converted to a 2D gaze location in the camera image plane. We assume that each gaze vector in the scene originates from the midpoint of the two eyes. This midpoint can easily be computed in the camera coordinate system, since the 3D head pose has already been estimated in an earlier step of the pipeline. Given that the image plane is equivalent to the xy-plane of the camera coordinate system, the on-plane gaze location can be calculated by intersecting the gaze direction with the image plane.
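The thresholding rule and the ray-plane intersection can be illustrated with a short sketch. It assumes that angles are given in degrees, that both directions are expressed in the camera coordinate system, and that `head_dir` already follows a sign convention in which it points from the face towards the camera plane (z = 0); the function names are hypothetical.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def choose_direction(gaze_dir: Vec3, head_dir: Vec3,
                     pitch_deg: float, yaw_deg: float,
                     pitch_thr: float = 40.0,
                     yaw_thr: float = 40.0) -> Vec3:
    """Fall back to the head direction for extreme head poses."""
    if abs(pitch_deg) > pitch_thr or abs(yaw_deg) > yaw_thr:
        return head_dir
    return gaze_dir


def on_plane_gaze_location(origin: Vec3,
                           direction: Vec3) -> Optional[Tuple[float, float]]:
    """Intersect the ray origin + t * direction (t > 0) with the plane z = 0.

    `origin` is the midpoint of the two eyes in camera coordinates.
    Returns None if the ray is parallel to or points away from the plane.
    """
    if direction[2] == 0.0:
        return None
    t = -origin[2] / direction[2]
    if t <= 0.0:
        return None
    return (origin[0] + t * direction[0], origin[1] + t * direction[1])


if __name__ == "__main__":
    eye_mid = (0.0, 0.0, 300.0)          # 300 mm in front of the camera
    d = choose_direction((0.5, 0.0, -1.0), (0.0, 0.0, -1.0),
                         pitch_deg=10.0, yaw_deg=15.0)
    print(on_plane_gaze_location(eye_mid, d))
```

Because the fallback replaces only the direction, the same intersection routine serves both the gaze-based and the head-pose-based case.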
3.4 Clustering and Eye Contact Detection
After estimating the on-plane gaze locations for the provided face images, these 2D locations are sampled for clustering. As proposed by Zhang et al., we assume that each cluster corresponds to a different eye contact target object. In our case, the cluster closest to the camera (i.e., closest to the origin) corresponds to looking at the mobile device. To filter out unreliable samples, we skip images for which the confidence value reported by the face detector is below a threshold of 0.9. Clustering of the remaining samples is done using the OPTICS algorithm. As a result of clustering, all the images which belong to the cluster closest to the camera are labeled as positive eye contact samples. Finally, taking the labeled samples from the clustering step, we train a weighted SVM based on the feature vector extracted from the gaze estimation CNN. To reduce the dimensionality of these high-dimensional feature vectors, we first apply a principal component analysis (PCA) to the entire training data and reduce the dimension so that the new subspace still retains 95% of the variance of the data. At test time, the clustering phase is no longer necessary. In this case, the 4096-dimensional feature vector is extracted from the appearance-based gaze estimation model and projected into the low-dimensional PCA subspace. With the trained SVM, we can then classify the resulting feature vector as eye contact or non eye contact.
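The labelling step that follows clustering can be sketched as below. The sketch assumes that cluster assignments are already available (for example from an OPTICS implementation), that noise samples carry the id `-1` (an assumed convention), and that "closest to the origin" is measured from the cluster centroid, which the text does not specify. Clustering, PCA, and SVM training are deliberately left out.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple


def eye_contact_labels(points: Sequence[Tuple[float, float]],
                       cluster_ids: Sequence[int],
                       noise_id: int = -1) -> List[int]:
    """Label samples of the cluster nearest to the origin with 1, others 0."""
    # Accumulate per-cluster coordinate sums and counts, ignoring noise.
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), cid in zip(points, cluster_ids):
        if cid == noise_id:
            continue
        acc = sums[cid]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    if not sums:
        return [0] * len(cluster_ids)

    def centroid_distance(cid: int) -> float:
        sx, sy, n = sums[cid]
        return math.hypot(sx / n, sy / n)

    target = min(sums, key=centroid_distance)
    return [1 if cid == target else 0 for cid in cluster_ids]


if __name__ == "__main__":
    pts = [(1, 1), (2, 0), (100, 100), (110, 90), (500, 500)]
    ids = [0, 0, 1, 1, -1]
    print(eye_contact_labels(pts, ids))
```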
4.0.1 Mobile Face Video Dataset (MFV)
This dataset includes 750 face videos from 50 users captured using the front-facing camera of an iPhone 5s. During data collection, users had to perform five different tasks under different lighting conditions (well-lit, dim light, and daylight). From the five tasks available in the dataset, we selected the “enrollment” task where users were asked to turn their heads in four different directions (left, right, up, and down). We picked this task because it enabled us to collect both eye contact and non eye contact data. From this subset (1 video per task × 3 sessions × 50 users), we randomly sampled 4,363 frames that we manually annotated with positive eye contact or negative non eye contact labels. 58% of the frames were labeled as positive and 42% were labeled as negative samples. This dataset is challenging because it contains large variations between users, head pose angles, and illumination conditions.
4.0.2 Understanding Face and Eye Visibility Dataset (UFEV)
This dataset consists of 25,726 in the wild images taken using the front-facing camera of different smartphones of ten participants. The images were collected during everyday activities in an unobtrusive way using an application running in the background. We randomly sampled 5,065 images from this dataset and manually annotated them with eye contact labels. We only sampled images where at least parts of the face were visible (which was the case for 14,833 photos). Around 17% of the frames were labeled as negative and 83% were labeled as positive eye contact samples. In contrast to the previous dataset, these samples exhibit a class imbalance between positive and negative labels. This dataset is challenging because the full face is only visible in about 29% of the images.
There are different ways to detect eye contact, such as GazeLocking  which is fully supervised, or methods that infer the coarse gaze direction  or leverage head orientation for visual attention estimation . However, all of these methods are inferior to the state-of-the-art eye contact detector proposed by Zhang et al. . We therefore opted to only compare our method (Ours) to two variants of the latter:
Zhang et al. + FA. Here, we replace the dlib face and landmark detector. For face detection, we used the more robust approach which leverages three multi-task CNNs  which can detect partially visible faces, a challenge and a requirement in mobile gaze estimation. Similarly, we replaced the landmark detector with a newer approach which uses a state-of-the-art hourglass model  to estimate the 2D location of the facial landmarks. The CNN architecture and trained model were the same as in the first baseline.
In all experiments that follow, we evaluated performance in terms of the Matthews Correlation Coefficient (MCC). The MCC score is commonly used as a performance measure for binary (two-class) classification problems. The MCC is more informative than other metrics (such as accuracy) because it takes into account the balance ratios of the four confusion matrix categories (true positives TP, true negatives TN, false positives FP, false negatives FN). This is particularly important for eye contact detection on mobile devices, where the classes can be highly imbalanced. For example, in the UFEV dataset, 83% of the manually annotated images are positive eye contact samples and only 17% represent non eye contact. An MCC of +1.0 indicates perfect predictions, -1.0 indicates total disagreement between predictions and observations, and 0 is the equivalent of random guessing.
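For reference, the MCC can be computed directly from the four confusion matrix counts using its standard definition. The sketch below returns 0 when the denominator is zero, which is a common convention rather than something stated in the text; the function names are chosen for the example.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: undefined cases (an empty row or column) map to 0.
        return 0.0
    return numerator / denominator


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # A classifier that always predicts "eye contact" on an 83%/17% split.
    counts = dict(tp=83, tn=0, fp=17, fn=0)
    print(accuracy(**counts), matthews_corrcoef(**counts))
```

On a class split like that of UFEV, a classifier that always predicts eye contact reaches an accuracy of 0.83 while its MCC is 0, which is why accuracy alone would be misleading here.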
4.2 Eye Contact Detection Performance
Figure 4 shows the performance comparison of the three methods. Our evaluation was conducted on the two datasets using a leave-one-participant-out cross validation. The bars represent the mean MCC value and the error bars represent the standard deviation across the different runs. As can be seen from the figure, on the MFV dataset our method (MCC 0.84) significantly outperforms both baselines (MCC 0.52 and 0.41). The same holds for the UFEV dataset where Ours (MCC 0.75) shows significantly increased robustness in comparison to Zhang et al. + FA (0.18) and Zhang et al. (0.42). The differences between Ours and the other baselines are significant (t-test).
To better understand the limitations of the clustering and the potential for further improvements, we also analysed the impact of the unsupervised clustering approach on the eye contact classification performance. To eliminate the influence of wrong labels resulting from incorrect clustering, we replaced the estimated labels with the manual ground truth annotations. As such, this defines an upper bound on the classification accuracy given perfect labels. The transparent bars in Figure 4 show the result of this analysis, i.e. the potential performance increase when using ground truth labels. Despite the improvement of the two baselines, our proposed method is still able to outperform them (an MCC score of 0.88 in comparison to 0.72 for both baselines on the MFV dataset and 0.73 in comparison to 0.45 and 0.56 on the UFEV dataset). Furthermore, our proposed method is close to the upper bound performance with ground truth information. We believe this difference can be attributed to our gaze estimation pipeline. Due to the improved training steps of the gaze estimator (face and landmark detection, head pose estimation, and data normalization) combined with the GazeCapture dataset, our model can extract more meaningful features from the last fully connected layer of the CNN which, in turn, improves the weighted SVM binary classifier.
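A leave-one-participant-out protocol of the kind used above can be sketched as a generator over folds, with per-fold scores summarised by their mean and standard deviation. The function names are hypothetical, the example scores are made up for illustration, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics
from typing import Hashable, Iterator, List, Sequence, Tuple


def leave_one_participant_out(
    participant_ids: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out, train_indices, test_indices), one fold per participant."""
    # dict.fromkeys keeps first-seen order while removing duplicates.
    for held_out in dict.fromkeys(participant_ids):
        train = [i for i, p in enumerate(participant_ids) if p != held_out]
        test = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train, test


def summarise(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    ids = ["a", "a", "b", "c", "b"]
    for held_out, train, test in leave_one_participant_out(ids):
        print(held_out, train, test)
    print(summarise([0.8, 0.9, 0.7]))  # made-up per-fold scores
```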
4.3 Performance of Detecting Non-Eye Contact
The complementary problem to eye contact detection is to identify non eye contact or when the users look away from the device. In some datasets, there are only a few non eye contact samples (e.g. only 17% in the UFEV dataset). Accurately detecting such events is equally, if not more important and at the same time significantly more challenging than detecting eye contact events due to their sparsity. One performance indicator in such cases is the true negative rate (TNR). These events are critical in determining whether there was an attention shift from the device to the environment or the other way around. As seen in previous work , these events are not only relevant attention metrics but they can be used as part of approaches to forecast user attention (i.e. predict an attention shift before it actually happens).
|Zhang et al.||3,663||32.0%||7.6%||40%|
|Zhang et al. + FA||3,960||36.4%||8.8%||50%|
|Zhang et al.||3,517||16.4%||4.8%||51%|
|Zhang et al. + FA||4,909||16.3%||9.1%||41%|
Table 1 summarises the results of comparing the TNR of the three methods. The TNR measures the proportion of non eye contact (negative) samples correctly identified. On the MFV dataset, our method is able to outperform the two baselines and identify more than twice as many non eye contact samples (21.5% in comparison to 7.6% or 8.8%) and more accurately (TNR of 88%). On the UFEV dataset the number of predicted samples is comparable for all three methods but our method again significantly outperforms the other two in terms of robustness of identifying non eye contact events (TNR of 74% compared to 51% and 41% achieved by the other methods).
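The TNR follows directly from the predictions on the negative class. A minimal sketch, assuming labels are encoded as 1 for eye contact and 0 for non eye contact (the encoding and the function name are assumptions for the example):

```python
from typing import Sequence


def true_negative_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """TNR = TN / (TN + FP), computed over samples whose true label is 0."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tn + fp == 0:
        raise ValueError("no negative (non eye contact) samples in y_true")
    return tn / (tn + fp)


if __name__ == "__main__":
    truth = [0, 0, 0, 0, 1, 1]
    preds = [0, 0, 0, 1, 1, 0]
    print(true_negative_rate(truth, preds))  # 3 of 4 negatives recovered
```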
4.4 Cross-Dataset Performance
In order to realistically assess performance for eye contact detection with a view to practical applications and actual deployments, it is particularly interesting to evaluate the cross-dataset performance. Cross-dataset performance evaluations have only recently started to be investigated in gaze estimation research and, to the best of our knowledge, never for the eye contact detection task. To this end, we first trained on one dataset, either UFEV or MFV, and then evaluated on the other one. Figure 5 summarises the results of this analysis and shows that our method is able to outperform both baselines by a significant margin both when training on UFEV and testing on MFV, and vice versa. When training on the UFEV dataset, our method (MCC 0.83) performs better than the two baselines (MCC 0.38 and 0.47). The other way around, training on MFV and testing on UFEV, Ours (0.57) still outperforms Zhang et al. + FA (0.04) and Zhang et al. (0.14). Taken together, these results demonstrate that our method, which we specifically optimised for mobile interaction scenarios, is able to abstract away dataset-specific biases and to generalize well to other datasets. As such, this result is particularly important for HCI practitioners who want to use such a method for real-world experiments on unseen data.
4.5 The Influence of Head Pose Thresholding
In order to reduce the impact of incorrect or inaccurate gaze estimates on eye contact detection performance, in our method we introduced a thresholding step based on the head pose angle. Current datasets [21, 55] have improved the state of the art in appearance-based gaze estimation significantly; however, they offer limited head pose variability when compared to data collected in the wild (see Figure 6). As in many other areas in computer vision, this fundamentally limits the performance of learning-based methods. In our method, we train our model on the GazeCapture dataset which, currently, is the largest publicly available dataset for gaze estimation. Still, both the MFV and UFEV datasets show larger head pose variability. To overcome the above limitation, we apply the following adaptive thresholding technique: Whenever the horizontal or vertical head pose angle is below or above a certain threshold, we replace the gaze estimates by the head pose angles. This adaptive thresholding technique happens in the normalized space, thus only two threshold values are necessary, one vertical and one horizontal.
Given the distribution of the GazeCapture training data (see Figure 6), in our approach we empirically determined a threshold of 40° for the head pose in the normalized camera space. That is, whenever the head pose angle of either component (vertical or horizontal) is above or below this threshold, we use the head pose angles as a proxy for gaze estimates. Table 2 shows the results of an ablation study with two other versions of our pipeline: The Gaze only (MCC 0.74 on MFV and 0.30 on UFEV) baseline does not use any thresholding. The Head pose only (MCC 0.37 on MFV and 0.73 on UFEV) baseline replaces all the gaze estimates with head pose estimates. The results show that Gaze only or Head pose only can yield reasonable performance for individual datasets. However, only our method (MCC 0.86 on MFV and 0.76 on UFEV) is able to perform well on both datasets, outperforming both baselines. This result also shows that, since this value is set in the normalized camera space, the same threshold value is effective across datasets.
|Method||MFV (MCC)||UFEV (MCC)|
|Gaze only||0.74||0.30|
|Head pose only||0.37||0.73|
|Ours (Pitch = Yaw = 40°)||0.86||0.76|
4.6 Robustness to Variability in Illumination
Given that unconstrained mobile eye contact detection implies different environments and conditions, we analysed how varying illumination affected our method’s performance in comparison to the two baselines (see Figure 7). In this evaluation, we trained all three methods on the UFEV dataset and evaluated their performance in three different scenarios on a subset of the MFV dataset for which we had both eye contact and illumination labels: dim light (986 images), well-lit (2157 images), and daylight (1221 images). Our method clearly outperforms the two baselines in all three scenarios (0.67 vs. 0.50 for dim light, 0.88 vs. 0.46 for well-lit, and 0.86 vs. 0.47 for daylight against the best baseline, Zhang et al.). The Zhang et al. + FA baseline is inferior to Zhang et al. because it uses the improved face and landmark detector, which detects more challenging images that are otherwise skipped in the evaluation with the Zhang et al. baseline.
Our evaluations show that our method not only significantly outperforms the state of the art in terms of mobile eye contact detection performance both within- and cross-dataset (see Figure 4, Figure 5, and Table 1) but also in terms of robustness to variability in illumination conditions (see Figure 7). These results, combined with the evaluations on head pose thresholding, also demonstrate the unique challenges of the mobile setting as well as the effectiveness of the proposed improvements to the method by Zhang et al. One of the most important applications enabled by our eye contact detection method on mobile devices is attention quantification. In contrast to previous works that leveraged device interactions or other events as a proxy for user attention, our method can quantify attention allocation unobtrusively and robustly, only requiring the front-facing cameras readily integrated in an ever-increasing number of devices. Being able to accurately and robustly sense when users look at their device, or when they look away, is a key building block for future pervasive attentive user interfaces (see Figure 8). As a first step, in this work we focused on the sample task of eye contact detection. It is important to note that our method allows additional mobile attention metrics to be calculated automatically (see Figure 8), which paves the way for a number of exciting new applications in mobile HCI. The first metric that can be calculated is the number of glances, which indicates how often a user has looked briefly at their mobile device. A metric which considers how long users look at their device is the average attention span. In Figure 8, the average attention span towards the device is given by the average duration of the black boxes and the average attention span towards the environment is given by the average duration of the white boxes. Other attention metrics were recently introduced by Steil et al. in the context of attention forecasting.
One such metric is the primary attentional focus: By aggregating and comparing the duration of all attention spans towards the mobile device as well as the environment, we can decide whether the users’ attention during the analyzed time interval is primarily towards the device or towards the environment. Besides aggregating, the shortest or the longest attention span might also reveal insights into users’ behaviour. Finally, the number of attention shifts can capture the users’ interaction with the environment. An attention shift occurs when users shift their attention from the device to the environment or the other way around.
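Given a per-frame sequence of eye contact predictions, the metrics described above can be derived by grouping consecutive frames into attention spans. The following is a minimal sketch under stated assumptions: frames are sampled at a constant rate `fps`, every span of device attention is counted as a glance (the text gives no duration cutoff for "briefly"), and ties in total duration are resolved in favour of the device. The function and key names are hypothetical.

```python
from itertools import groupby
from typing import Dict, Sequence, Union


def attention_metrics(eye_contact: Sequence[int],
                      fps: float) -> Dict[str, Union[int, float, str]]:
    """Compute simple attention metrics from per-frame eye contact flags."""
    # Each span is (on_device, duration_in_seconds) for one run of equal flags.
    spans = [(state, sum(1 for _ in run) / fps)
             for state, run in groupby(bool(v) for v in eye_contact)]
    device = [d for on_device, d in spans if on_device]
    environment = [d for on_device, d in spans if not on_device]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "num_glances": len(device),
        "avg_span_device_s": mean(device),
        "avg_span_environment_s": mean(environment),
        "primary_focus": ("device" if sum(device) >= sum(environment)
                          else "environment"),
        "num_attention_shifts": max(len(spans) - 1, 0),
    }


if __name__ == "__main__":
    flags = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
    print(attention_metrics(flags, fps=1.0))
```

The shortest and longest attention spans mentioned above could be read off the same `device` and `environment` lists with `min` and `max`.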
An analysis that quantifies attentive behaviour with some of the metrics described above is only a first step. Mobile devices are powerful sensing platforms equipped with a wide range of sensors besides the front-facing camera, and a user’s context might provide additional behavioural insights. Future work could compare attention allocation relative to the application running in the foreground on the mobile device. Such an analysis could reveal, for example, differences (or similarities) in attentive behaviour when messaging, when using social media, or when browsing the internet. A different analysis could factor in the user’s current activity (attention allocation while taking the train, walking, or standing) or the user’s location. Going beyond user context, attention allocation could even be analysed conditional on demographic factors such as age, sex, profession, or ethnicity.
5.1 Limitations and Future Work
While we have demonstrated significant improvements in terms of performance and robustness for mobile eye contact detection, our method also has several limitations. One of the key components in our pipeline is the appearance-based gaze estimator, and our method’s performance is directly influenced by it. In our experiments, we highlighted a limitation of current gaze estimation datasets, namely the limited variability in head pose angles in comparison to data collected in the wild. As a result, gaze estimates tend to be inaccurate for head poses outside the training range, which harms the performance of our method. We addressed this limitation by introducing adaptive thresholding which, for extreme head poses, uses the head pose as a proxy for the unreliable gaze estimates. Overall, this improved performance, but it may miss cases in which the head is turned away from the device while the user still looks at it. One possibility to address this problem is to collect new gaze estimation datasets with more realistic head pose distributions to improve model training.
Besides further performance improvements, runtime improvements will broaden our method’s applicability and practical usefulness. In its current implementation, our approach is only suited for offline attention analysis, i.e. for processing image data post-hoc. While such post-hoc analysis will already be sufficient for many applications, real-time eye contact detection on mobile devices will pave the way for a whole new range of applications that are not feasible today. In particular, we see significant potential of real-time eye contact detection for mobile HCI tasks such as predicting user interruptibility, estimating the noticeability of user interface content, or measuring user engagement. Additionally, a real-time algorithm could process the recorded video directly on the device and would not need to store it externally, potentially even in the cloud, which would likely raise serious privacy concerns.
In this work, we proposed a novel method to sense and analyse users’ visual attention on mobile devices during everyday interactions. Through in-depth evaluations on two current datasets, we demonstrated significant performance improvements over the state of the art for the sample task of eye contact detection across mobile devices, users, and environmental conditions. We further discussed a number of additional attention metrics that can be extracted using our method and that are applicable to a wide range of applications in attentive user interfaces and beyond. Taken together, these results are significant in that they, for the first time, enable researchers and practitioners to unobtrusively study and robustly quantify attention allocation during mobile interactions in daily life.
-  Saeed Abdullah, Elizabeth L. Murnane, Mark Matthews, Matthew Kay, Julie A. Kientz, Geri Gay, and Tanzeem Choudhury. 2016. Cognitive Rhythms: Unobtrusive and Continuous Sensing of Alertness Using a Mobile Phone. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp ’16). ACM, New York, NY, USA, 178–189. DOI:http://dx.doi.org/10.1145/2971648.2971712
-  Piotr D. Adamczyk and Brian P. Bailey. 2004. If Not Now, when?: The Effects of Interruption at Different Moments Within Task Execution. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’04). ACM, New York, NY, USA, 271–278. DOI:http://dx.doi.org/10.1145/985692.985727
-  Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, and Joerg Sander. 1999. OPTICS: Ordering Points to Identify the Clustering Structure. Sigmod Record 28 (6 1999), 49–60. DOI:http://dx.doi.org/10.1145/304182.304187
-  Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: An open source facial behavior analysis toolkit. In WACV. IEEE Computer Society, 1–10.
-  Andreas Bulling. 2016. Pervasive Attentive User Interfaces. IEEE Computer 1 (2016), 94–98.
-  Minsoo Choy, Daehoon Kim, Jae-Gil Lee, Heeyoung Kim, and Hiroshi Motoda. 2016. Looking Back on the Current Day: Interruptibility Prediction Using Daily Behavioral Features. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp ’16). ACM, New York, NY, USA, 1004–1015. DOI:http://dx.doi.org/10.1145/2971648.2971649
-  Alexandre De Masi and Katarzyna Wac. 2018. You’re Using This App for What?: A mQoL Living Lab Study. In Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers (UbiComp ’18). ACM, New York, NY, USA, 612–617. DOI:http://dx.doi.org/10.1145/3267305.3267544
-  Jiankang Deng, Yuxiang Zhou, Shiyang Cheng, and Stefanos P. Zafeiriou. 2018. Cascade Multi-View Hourglass Model for Robust 3D Face Alignment. 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) (2018), 399–403.
-  Connor Dickie, Roel Vertegaal, Jeffrey S. Shell, Changuk Sohn, Daniel Cheng, and Omar Aoudeh. 2004. Eye Contact Sensing Glasses for Attention-sensitive Wearable Video Blogging. In CHI ’04 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’04). ACM, New York, NY, USA, 769–770. DOI:http://dx.doi.org/10.1145/985921.985927
-  Tilman Dingler and Martin Pielot. 2015. I’ll Be There for You: Quantifying Attentiveness Towards Mobile Messaging. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI ’15). ACM, New York, NY, USA, 1–5. DOI:http://dx.doi.org/10.1145/2785830.2785840
-  Trinh Minh Tri Do, Jan Blom, and Daniel Gatica-Perez. 2011. Smartphone Usage in the Wild: A Large-scale Analysis of Applications and Context. In Proceedings of the 13th International Conference on Multimodal Interfaces (ICMI ’11). ACM, New York, NY, USA, 353–360. DOI:http://dx.doi.org/10.1145/2070481.2070550
-  Anja Exler, Marcel Braith, Andrea Schankin, and Michael Beigl. 2016. Preliminary Investigations About Interruptibility of Smartphone Users at Specific Place Types. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct (UbiComp ’16). ACM, New York, NY, USA, 1590–1595. DOI:http://dx.doi.org/10.1145/2968219.2968554
-  Mohammed E. Fathy, Vishal M. Patel, and Rama Chellappa. 2015. Face-based Active Authentication on mobile devices. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015), 1687–1691.
-  Corey Holland and Oleg Komogortsev. 2012. Eye tracking on unmodified common tablets: challenges and solutions. In Proceedings of the Symposium on Eye Tracking Research and Applications. ACM, 277–280.
-  Scott Hudson, James Fogarty, Christopher Atkeson, Daniel Avrahami, Jodi Forlizzi, Sara Kiesler, Johnny Lee, and Jie Yang. 2003. Predicting Human Interruptibility with Sensors: A Wizard of Oz Feasibility Study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’03). ACM, New York, NY, USA, 257–264. DOI:http://dx.doi.org/10.1145/642611.642657
-  Zhiping Jiang, Jinsong Han, Chen Qian, Wei Xi, Kun Zhao, Han Ding, Shaojie Tang, Jizhong Zhao, and Panlong Yang. 2016. VADS: Visual attention detection with a smartphone. In Computer Communications, IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on. IEEE, 1–9.
-  Amy K. Karlson, Shamsi T. Iqbal, Brian Meyers, Gonzalo Ramos, Kathy Lee, and John C. Tang. 2010. Mobile Taskflow in Context: A Screenshot Study of Smartphone Usage. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’10). ACM, New York, NY, USA, 2009–2018. DOI:http://dx.doi.org/10.1145/1753326.1753631
-  Mohamed Khamis, Florian Alt, and Andreas Bulling. 2018a. The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned. In Proc. International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI). 38:1–38:17. DOI:http://dx.doi.org/10.1145/3229434.3229452
-  Mohamed Khamis, Anita Baier, Niels Henze, Florian Alt, and Andreas Bulling. 2018b. Understanding Face and Eye Visibility in Front-Facing Cameras of Smartphones Used in the Wild. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). ACM, New York, NY, USA, Article 280, 12 pages. DOI:http://dx.doi.org/10.1145/3173574.3173854
-  Kyle Krafka, Aditya Khosla, Petr Kellnhofer, Harini Kannan, Suchendra Bhandarkar, Wojciech Matusik, and Antonio Torralba. 2016. Eye Tracking for Everyone. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
-  Uichin Lee, Joonwon Lee, Minsam Ko, Changhun Lee, Yuhwan Kim, Subin Yang, Koji Yatani, Gahgene Gweon, Kyong-Mee Chung, and Junehwa Song. 2014. Hooked on Smartphones: An Exploratory Study on Smartphone Overuse Among College Students. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’14). ACM, New York, NY, USA, 2327–2336. DOI:http://dx.doi.org/10.1145/2556288.2557366
-  Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. 2009. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision 81 (2 2009). DOI:http://dx.doi.org/10.1007/s11263-008-0152-6
-  Akhil Mathur, Nicholas D. Lane, and Fahim Kawsar. 2016. Engagement-aware Computing: Modelling User Engagement from Mobile Contexts. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp ’16). ACM, New York, NY, USA, 622–633. DOI:http://dx.doi.org/10.1145/2971648.2971760
-  Lori McCay-Peet, Mounia Lalmas, and Vidhya Navalpakkam. 2012. On Saliency, Affect and Focused Attention. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’12). ACM, New York, NY, USA, 541–550. DOI:http://dx.doi.org/10.1145/2207676.2207751
-  Abhinav Mehrotra, Veljko Pejovic, Jo Vermeulen, Robert Hendley, and Mirco Musolesi. 2016. My Phone and Me: Understanding People’s Receptivity to Mobile Notifications. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). ACM, New York, NY, USA, 1021–1032. DOI:http://dx.doi.org/10.1145/2858036.2858566
-  Philipp Müller, Daniel Buschek, Michael Xuelin Huang, and Andreas Bulling. 2019. Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage. In Proc. International Symposium on Eye Tracking Research and Applications (ETRA). DOI:http://dx.doi.org/10.1145/3314111.3319918
-  Philipp Müller, Michael Xuelin Huang, Xucong Zhang, and Andreas Bulling. 2018. Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour. In Proc. International Symposium on Eye Tracking Research and Applications (ETRA). 31:1–31:10. DOI:http://dx.doi.org/10.1145/3204493.3204549
-  Antti Oulasvirta, Sakari Tamminen, Virpi Roto, and Jaana Kuorelahti. 2005. Interaction in 4-second Bursts: The Fragmented Nature of Attentional Resources in Mobile HCI. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’05). ACM, New York, NY, USA, 919–928. DOI:http://dx.doi.org/10.1145/1054972.1055101
-  Martin Pielot, Bruno Cardoso, Kleomenis Katevas, Joan Serrà, Aleksandar Matic, and Nuria Oliver. 2017. Beyond Interruptibility: Predicting Opportune Moments to Engage Mobile Phone Users. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1, 3, Article 91 (Sept. 2017), 25 pages. DOI:http://dx.doi.org/10.1145/3130956
-  Martin Pielot, Karen Church, and Rodrigo de Oliveira. 2014a. An In-situ Study of Mobile Phone Notifications. In Proceedings of the 16th International Conference on Human-computer Interaction with Mobile Devices & Services (MobileHCI ’14). ACM, New York, NY, USA, 233–242. DOI:http://dx.doi.org/10.1145/2628363.2628364
-  Martin Pielot, Rodrigo de Oliveira, Haewoon Kwak, and Nuria Oliver. 2014b. Didn’t You See My Message?: Predicting Attentiveness to Mobile Instant Messages. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’14). ACM, New York, NY, USA, 3319–3328. DOI:http://dx.doi.org/10.1145/2556288.2556973
-  Martin Pielot, Tilman Dingler, Jose San Pedro, and Nuria Oliver. 2015. When Attention is Not Scarce - Detecting Boredom from Mobile Phone Usage. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp ’15). ACM, New York, NY, USA, 825–836. DOI:http://dx.doi.org/10.1145/2750858.2804252
-  Qing-Xing Qu, Le Zhang, Wen-Yu Chao, and Vincent Duffy. 2017. User Experience Design Based on Eye-Tracking Technology: A Case Study on Smartphone APPs. In Advances in Applied Digital Human Modeling and Simulation, Vincent G. Duffy (Ed.). Springer International Publishing, Cham, 303–315.
-  Rajeev Ranjan, Shalini De Mello, and Jan Kautz. 2018. Light-Weight Head Pose Invariant Gaze Tracking. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2018), 2237–22378.
-  Adria Recasens, Aditya Khosla, Carl Vondrick, and Antonio Torralba. 2015. Where are they looking? In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 199–207. http://papers.nips.cc/paper/5848-where-are-they-looking.pdf
-  Ted Selker, Andrea Lockerd, and Jorge Martinez. 2001. Eye-R, a Glasses-mounted Eye Motion Detection Interface. In CHI ’01 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’01). ACM, New York, NY, USA, 179–180. DOI:http://dx.doi.org/10.1145/634067.634176
-  Jeffrey S. Shell, Roel Vertegaal, and Alexander W. Skaburskis. 2003. EyePliances: Attention-seeking Devices That Respond to Visual Attention. In CHI ’03 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’03). ACM, New York, NY, USA, 770–771. DOI:http://dx.doi.org/10.1145/765891.765981
-  Brian A. Smith, Qi Yin, Steven K. Feiner, and Shree K. Nayar. 2013. Gaze Locking: Passive Eye Contact Detection for Human-object Interaction. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology (UIST ’13). ACM, New York, NY, USA, 271–280. DOI:http://dx.doi.org/10.1145/2501988.2501994
-  Julian Steil, Philipp Müller, Yusuke Sugano, and Andreas Bulling. 2018. Forecasting User Attention During Everyday Mobile Interactions Using Device-integrated and Wearable Sensors. In Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI ’18). ACM, New York, NY, USA, Article 1, 13 pages. DOI:http://dx.doi.org/10.1145/3229434.3229439
-  Dereck Toker, Cristina Conati, Ben Steichen, and Giuseppe Carenini. 2013. Individual User Characteristics and Information Visualization: Connecting the Dots Through Eye Tracking. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’13). ACM, New York, NY, USA, 295–304. DOI:http://dx.doi.org/10.1145/2470654.2470696
-  Vytautas Vaitukaitis and Andreas Bulling. 2012. Eye gesture recognition on portable devices. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 711–714.
-  Roel Vertegaal and others. 2003. Attentive user interfaces. Commun. ACM 46, 3 (2003), 30–33.
-  Michael Voit and Rainer Stiefelhagen. 2008. Deducing the Visual Focus of Attention from Head Pose Estimation in Dynamic Multi-view Meeting Scenarios. In Proceedings of the 10th International Conference on Multimodal Interfaces (ICMI ’08). ACM, New York, NY, USA, 173–180. DOI:http://dx.doi.org/10.1145/1452392.1452425
-  Pierre Weill-Tessier, Jayson Turner, and Hans Gellersen. 2016. How do you look at what you touch?: a study of touch interaction and gaze correlation on tablets. In Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications. ACM, 329–330.
-  Erroll Wood and Andreas Bulling. 2014. Eyetab: Model-based gaze estimation on unmodified tablet computers. In Proceedings of the Symposium on Eye Tracking Research and Applications. ACM, 207–210.
-  Z. Ye, Y. Li, Y. Liu, C. Bridges, A. Rozga, and J. M. Rehg. 2015. Detecting bids for eye contact using a wearable camera. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 1. 1–8. DOI:http://dx.doi.org/10.1109/FG.2015.7163095
-  K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. 2016. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Processing Letters 23, 10 (Oct. 2016), 1499–1503. DOI:http://dx.doi.org/10.1109/LSP.2016.2603342
-  Mengmi Zhang, Keng Teck Ma, Joo-Hwee Lim, Qi Zhao, and Jiashi Feng. 2017a. Deep Future Gaze: Gaze Anticipation on Egocentric Videos Using Adversarial Networks. In CVPR. 3539–3548.
-  Xucong Zhang, Michael Xuelin Huang, Yusuke Sugano, and Andreas Bulling. 2018. Training Person-Specific Gaze Estimators from Interactions with Multiple Devices. In Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). 624:1–624:12. DOI:http://dx.doi.org/10.1145/3173574.3174198
-  Xucong Zhang, Yusuke Sugano, and Andreas Bulling. 2017b. Everyday eye contact detection using unsupervised gaze target discovery. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology. ACM, 193–203.
-  Xucong Zhang, Yusuke Sugano, and Andreas Bulling. 2018. Revisiting Data Normalization for Appearance-based Gaze Estimation. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications (ETRA ’18). ACM, New York, NY, USA, Article 12, 9 pages. DOI:http://dx.doi.org/10.1145/3204493.3204548
-  Xucong Zhang, Yusuke Sugano, and Andreas Bulling. 2019. Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications. In Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). DOI:http://dx.doi.org/10.1145/3290605.3300646
-  X. Zhang, Y. Sugano, M. Fritz, and A. Bulling. 2015. Appearance-based gaze estimation in the wild. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4511–4520. DOI:http://dx.doi.org/10.1109/CVPR.2015.7299081
-  Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. 2017. It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation. In 1st International Workshop on Deep Affective Learning and Context Modelling. IEEE, 2299–2308.
-  Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. 2019. MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 41, 1 (2019), 162–175. DOI:http://dx.doi.org/10.1109/TPAMI.2017.2778103