Driver Glance Classification In-the-wild: Towards Generalization Across Domains and Subjects

12/05/2020 ∙ by Sandipan Banerjee, et al. ∙ MIT Affectiva 1

Distracted drivers are dangerous drivers. Equipping advanced driver assistance systems (ADAS) with the ability to detect driver distraction can help prevent accidents and improve driver safety. In order to detect driver distraction, an ADAS must be able to monitor the driver's visual attention. We propose a model that takes as input a patch of the driver's face along with a crop of the eye-region and classifies their glance into 6 coarse regions-of-interest (ROIs) in the vehicle. We demonstrate that an hourglass network, trained with an additional reconstruction loss, allows the model to learn stronger contextual feature representations than a traditional encoder-only classification module. To make the system robust to subject-specific variations in appearance and behavior, we design a personalized hourglass model tuned with an auxiliary input representing the driver's baseline glance behavior. Finally, we present a weakly supervised multi-domain training regimen that enables the hourglass to jointly learn representations from different domains (varying in camera type and angle), utilizing unlabeled samples and thereby reducing annotation cost.




1 Introduction

Driver distraction has been shown to be a leading cause of vehicular accidents [fitch2013impact]. Anything that competes for a driver’s attention, such as talking or texting on the phone, using the car’s navigation system or eating, can be a cause of distraction. A distracted individual often directs their visual attention away from driving, which has been shown to increase accident risk [liang2012dangerous]. Therefore, driver glance behavior can be an important signal in determining their level of distraction. A system that can accurately detect where the driver is looking can then be used to alert drivers when their attention shifts away from the road. Such systems can also monitor driver attention to manage and motivate improved awareness [coughlin2011monitoring]. For example, the system can decide whether a driver’s attention needs to be cued back to the road before control is safely handed back to them.

Figure 1: Our model takes as input the driver’s face and eye patch and generates glance predictions over 6 coarse regions-of-interest (ROIs): 1. Instrument Cluster, 2. Rearview Mirror, 3. Right, 4. Left, 5. Centerstack, 6. Road. While requiring little labeled data, it can jointly predict the glance ROI on samples from different domains (car interior) that vary in camera type, angle and lighting. In the figure, both subjects are looking at the road, but appear different due to mismatch in camera angle and lighting.

A real-time system that can classify driver attention into a set of ROIs can be used to infer their overall attentiveness and offer predictive indication of attention failures associated with crashes and near-crashes [seppelt2017glass]. Real-time tracking of driver gaze from video is attractive because of the low equipment cost, but challenging due to variations in illumination, eye occlusions caused by eyeglasses/sunglasses, and poor video quality due to vehicular movements and sensor noise. In this paper, we propose a model that can predict driver glance ROI, given a patch of the driver’s face along with a crop of their eye-region (Figure 1). We show that an hourglass network [UNet, StackedHourGlass], composed of encoder-decoder modules and trained with a reconstruction loss on top of the classification task, performs better than a vanilla CNN. The reconstruction task serves as a regularizer [DecoderTTIC], helping the model learn robust representations of the input by implicitly leveraging useful information around its context [Context3, PJP_FG].

However, a model that makes predictions based on only a single static frame may struggle to deal with variations in subject characteristics not well represented in the training set (a shorter or taller-than average driver may have different appearances for the default on-the-road driving behavior). To address this challenge, we add an auxiliary input stream representing the subject’s baseline glance behavior, yielding improved performance over a rigid network.

Another challenge associated with an end-to-end glance classification system is the variation in camera type (RGB/NIR) and placement (on the steering wheel or rearview mirror). Due to variations in cabin configuration, it is impossible to place the camera in the same location with a consistent view of the car interior and the driver. Therefore, a model trained on driver head-poses associated with a specific camera-view may not generalize. To overcome this domain-mismatch challenge, we present a framework to jointly train models in the presence of data from multiple domains (camera types and views). Leveraging our backbone hourglass’ reconstruction objective, this framework can utilize unlabeled samples from multiple domains along with weak supervision to jointly learn stronger domain-invariant representations for improved glance classification while effectively reducing labeling cost.

In summary, we make the following contributions: (1) we propose an hourglass architecture that can predict driver glance ROI from static images, illustrating the utility of adding a reconstruction loss to learn more robust representations even for classification tasks; (2) we design a personalized version of our hourglass model, that additionally learns residuals in feature space from the driver’s default ‘eyes-on-the-road’ behavior, to better tune output mappings with respect to the subject’s default; (3) we formulate a weakly supervised multi-domain training approach that utilizes unlabeled samples for classification and allows for model adaptation to novel camera types and angles, while reducing the associated labeling cost.

2 Related Work

Computer-vision based driver monitoring systems [chhabra2017survey] have been used to estimate a driver’s state of fatigue [joshi2020inthewild], cognitive load [fridman2018cognitive] or whether the driver’s eyes are off the road [vicente2015driver].

Gaze estimation: The problem of tracking gaze from video has been studied extensively [hansen2009eye, cazzato2020look]. Professional gaze tracking systems do exist (e.g. Tobii); however, they typically require user or session-specific calibration to achieve good performance. Appearance-based, calibration-free gaze estimation has numerous applications in computer vision, from gaze-based human-computer interaction to analysis of visual behavior. Researchers have utilized both real [zhang2017mpiigaze] and synthetic data [wood2015rendering, wood2016learning] to model gaze behavior, with generative approaches used to bridge the gap between synthetic and real distributions, so that models trained on one domain work well on another [shrivastava2017learning, kim2019nvgaze].

Glance Classification: In the case of driver distraction, classifying where the driver is looking from an estimated gaze vector involves finding the intersection between the gaze vector and the 3D car geometry. A simpler alternative is to directly classify the driver image into a set of car ROIs using head pose [jha2018probabilistic], as well as eye region appearance [fridman2016owl]. Rangesh et al. focused on estimating driver gaze in the presence of eye-occluding glasses by synthetically removing eyeglasses from input images before feeding them to a classification network [rangesh2020driver]. Ghosh et al. recently introduced the Driver Gaze in the Wild (DGW) dataset to further encourage research in this area [ghosh2020speak2label].

Figure 2: Sample frames from the datasets used in our experiments: MIT2013 (left), AVT (middle) and In-house (right). For each dataset, we present an example raw frame captured by the camera, and one example of a driver’s cropped face for each driver glance region-of-interest class.

Personalization: Personalized training has been applied to other domains (facial action unit [chu2013selective] and gesture recognition [yao2014gesture, joshi2017personalizing]) but not yet to vehicular glance classification. In the context of eye tracking, personalization is usually achieved through a priori user calibration. [krafka2016eye] reported results for unconstrained (calibration-free) eye tracking from mobile devices and showed that calibration significantly improves performance. Gaze models have been personalized by learning a latent representation for each eye [linden2019learning], by utilizing saliency information in visual content [chang2019salgaze], or by adapting a generic model using a few training samples [yu2019improving].

Domain Invariance: Domain adaptation has been used in a variety of applications, e.g. object recognition [DA2010]. Researchers have trained shared networks with samples from different domains, regularized via an adaptation loss between their embeddings [FineGrainedDom, AdvDiscDA], or trained models with domain confusion to learn domain-agnostic embeddings [DomainConfusion], implemented by reducing distance between the domain embeddings [SimDeepTrans, Peng2019DomainAL] or reversing the gradient specific to domain classification during backpropagation [GradRev]. Another popular approach towards domain adaptation is to selectively fine-tune task specific models from pre-trained weights [TransferLearning, ImgNetTransfer] by freezing pre-trained weights that are tuned to specific tasks or domains [PiggyBack, PackNet] or selectively pruning weights [PruningGuide] to prudently adapt to new domains [NetTailor]. Specific to head pose, lighting and expression agnostic face recognition, approaches like feature normalization using class centers [CenterLoss, RingLoss] and class separation using angular margins [SphereFace, CosFace, ArcFace] have been proposed. Such recognition tasks have also benefitted from mixing samples from different domains, like real and synthetic [MasiAug, SREFI2].

While most research on gaze estimation proposes models that predict gaze vectors, our glance classification model directly predicts the actual ROI of the driver’s gaze inside the vehicle. Unlike previous work, our multi-domain training approach tunes the model’s ROI prediction to jointly work on multiple domains (car interiors), varying in camera type, angle and lighting, while requiring very little labeled data. Our model can be personalized for continual tuning based on the driver’s behavior and anatomy as well.

3 Dataset Description and Data Analysis

MIT-2013: The dataset was extracted from a corpus of driver-facing videos, which were collected as part of a large driving study that took place on a local interstate highway [mehler2016multi]. For each participant in the study, videos of the drivers were collected either in a 2013 Chevrolet Equinox or a Volvo XC60. The participants performed a number of tasks, such as using the voice interface to enter addresses or combining it with manual controls to select phone numbers, while driving. Frames with the frontal face of the drivers were then annotated with the following ROIs: ‘road’, ‘center stack’, ‘instrument cluster’, ‘rearview mirror’, ‘left’, ‘right’, ‘left blindspot’, ‘right blindspot’, ‘passenger’, ‘uncodable’, and ‘other’. The data of interest was independently coded by two evaluators and mediated according to standards described by [smith2005methodology]. Following Fridman et al. [fridman2016owl], frames labeled ‘left’ and ‘left blindspot’ were given a single generic ‘left’ label and frames labeled ‘right’, ‘right blindspot’ and ‘passenger’ were given a generic ‘right’ label, while frames labeled ‘uncodable’ and ‘other’ were ignored. We used a subset of the data with 97 unique subjects, which was split into 60 train, 17 validation and 20 test subjects.
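The label merging described above can be sketched as a simple lookup. This is a minimal illustration, not the authors' preprocessing code: the names `MERGE` and `merge_labels` and the example frame ids are hypothetical.

```python
# Fine-grained annotation labels mapped to the coarse classes described above.
# Labels absent from the mapping ('uncodable', 'other') are dropped.
MERGE = {
    "road": "road",
    "center stack": "center stack",
    "instrument cluster": "instrument cluster",
    "rearview mirror": "rearview mirror",
    "left": "left",
    "left blindspot": "left",
    "right": "right",
    "right blindspot": "right",
    "passenger": "right",
}


def merge_labels(frames):
    """Return (frame_id, coarse_label) pairs, skipping ignored labels."""
    return [(fid, MERGE[label]) for fid, label in frames if label in MERGE]


# Hypothetical frame ids, for illustration only.
example = [(0, "road"), (1, "left blindspot"), (2, "passenger"), (3, "uncodable")]
print(merge_labels(example))
```

After merging, exactly six coarse classes remain, matching the 6 ROIs the model predicts.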

AVT: This dataset contains driver-initiated, non-critical disengagement events of Tesla Autopilot in naturalistic driving [morando2020driver] and was extracted from a large corpus of naturalistic driving data, collected from an instrumented fleet of 29 vehicles, each of which records the IMU, GPS, CAN messages, and video streams of the driver face, the vehicle cabin and the forward roadway [fridman2019advanced]. The MIT Advanced Vehicle Technology (MIT-AVT) study was designed to collect large-scale naturalistic driving data for a better understanding of how drivers interact with modern cars, to aid better design and interfaces as vehicles transition into increasingly automated systems. Each video in the AVT dataset was processed by a single coder with inter-rater reliability assessments as detailed in [morando2020driver].


In-house: This dataset was collected to train machine learning models to estimate gaze from the RGB and NIR camera types and a challenging camera angle. A camera with a wide-angle lens was placed under the rear-view mirror for this collection, the focus of which was to capture data from a position where the entire cabin was visible. Participants followed instructions from a protocol inside a static/parked car, where they glanced at various ROIs using 3 behavior types: ‘owl’, ‘lizard’ and ‘natural’ [fridman2016owl]. In our experiments, we used samples from 85 participants - 50 for training, 18 for validation and 17 for testing. Videos of each participant were manually annotated by 3 human labelers. Example frames from all three datasets are shown in Figure 2.

Figure 3: Illustration of our (a) two channel hourglass and (b) multi-stream personalization models described in Sections 4.1 and 4.2 respectively.

4 Proposed Models

4.1 Two-channel Hourglass

While a standalone classification (encoder with prediction head) or reconstruction module (encoder-decoder) can produce high performance numbers for recognition, semantic segmentation or super-resolution tasks, combining them has been shown to further boost model performance [FB_Reg, AdvAuto, DecoderTTIC, HarisTTIC, SharmaTask]. The auxiliary module’s (prediction or reconstruction) loss acts as a regularizer [DecoderTTIC] and boosts model performance on the primary task. For our specific task of driver glance classification, adding a reconstruction element can tune the model weights to implicitly pay close attention to contextual pixels while making a decision. Thus, instead of using a feed forward neural network, as traditionally done for classification tasks [AlexNet, vgg_sim, ResNet, ILSVRC15], we use an hourglass structure consisting of a pair of encoder and decoder modules [UNet].

In our model, the encoder takes as input the cropped face and eye patch images, concatenated together as a two-channel tensor, and produces a feature vector as its encoded representation. This feature vector is then passed through a prediction head to extract the estimated glance vector, before being sent to the decoder to generate the face and eye patch reconstructions, as shown in Figure 3.a. The encoder is composed of a dilated convolution layer [atrous] followed by a set of n downsampling residual blocks [ResNet] and a dense layer for encoding. The decoder takes the encoded feature and passes it through n upsampling pixel shuffling blocks [pixshuff] followed by a convolution layer with tanh activation for image reconstruction [DCGAN, salimans]. For better signal propagation, we add skip connections [UNet] between corresponding layers in the encoder and decoder [SREFI3]. The encoded feature is also passed through the prediction head, composed of two densely connected layers followed by softmax activation, to produce the glance prediction vector (model architecture details can be found in Section 7).
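The pixel shuffling used for upsampling in the decoder can be illustrated in isolation. The sketch below is a NumPy stand-in rather than the authors' layer: the function name `pixel_shuffle`, the channels-last layout and the channel ordering are assumptions chosen for illustration.

```python
import numpy as np


def pixel_shuffle(x, r):
    """Rearrange an (H, W, C*r*r) feature map into (H*r, W*r, C).

    Channel index (i*r + j)*C + c is moved to spatial offset (i, j) and
    output channel c. This ordering is one common convention and is an
    assumption here.
    """
    h, w, crr = x.shape
    c = crr // (r * r)
    x = x.reshape(h, w, r, r, c)       # split channels into (i, j, c)
    x = x.transpose(0, 2, 1, 3, 4)     # reorder to (h, i, w, j, c)
    return x.reshape(h * r, w * r, c)  # fold the offsets into space


feat = np.arange(3 * 3 * 8, dtype=np.float32).reshape(3, 3, 8)
up = pixel_shuffle(feat, r=2)
print(up.shape)
```

Each such block trades channel depth for spatial resolution, so stacking n of them can undo n stride-2 downsampling steps without transposed convolutions.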

The hourglass model is trained using a categorical cross entropy based classification loss between the ground truth glance vector and the predicted glance vector, and a pixelwise reconstruction loss between the input tensor and its reconstruction. Both losses are computed over the samples of a training batch, with the classification loss using their ground truth classes. The overall objective is defined as a weighted sum of the two losses.
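One way to write the two losses and the overall objective, consistent with the description above, is sketched below. The notation is assumed rather than taken from the authors: $E$, $D$ and $P$ denote the encoder, decoder and prediction head, $x_i$ the two-channel input, $y_i$ its one-hot label over $K = 6$ classes, $B$ the training batch and $\lambda_1$ the weighting scalar. The L1 form follows the mean absolute error mentioned in Section 5.4 (up to normalization by the number of pixels), and attaching the weight to the reconstruction term is an assumption.

```latex
\begin{align}
L_{cls} &= -\frac{1}{|B|} \sum_{i \in B} \sum_{k=1}^{K} y_{i,k} \, \log P(E(x_i))_k \\
L_{rec} &= \frac{1}{|B|} \sum_{i \in B} \lVert x_i - D(E(x_i)) \rVert_1 \\
L &= L_{cls} + \lambda_1 L_{rec}
\end{align}
```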

Figure 4: Our multi-domain training pipeline: For every iteration, the model is trained with mini-batches consisting of labeled input samples from the richly labeled domain and from the second domain, and unlabeled input from the second domain. The model weights are updated based on the overall loss accumulated over the mini-batches. Note that all three subjects in this figure are looking at the road, but appear very different due to different camera angles.

4.2 Personalized Training

As mentioned earlier, introducing an auxiliary channel of baseline information can better tune the classification model to specific driver anatomy and behaviors. To this end, we also propose a personalized version of our hourglass framework, composed of the same encoder and decoder modules. For each driver (subject) in the training dataset, we extract their mean baseline face crop and eye patch, computed over all frames where the driver is looking forward at the road. The baseline face crop and eye patch images are calculated offline prior to training.

During training, we extract the representation of the current frame by passing the face crop and eye patch images through the encoder. Additionally, the baseline representation of the driver is computed by passing the baseline images through the same encoder. The residual between these tensors is computed in the representation space as the difference between the two encoded features. This residual acts as a measure of how far the driver’s glance behavior deviates from looking forward, and is concatenated with the current frame representation. This concatenated tensor is then passed through the prediction head to get the glance prediction. During training, two weight-sharing streams each of the encoder and decoder are deployed, as depicted in Figure 3.b.
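The baseline averaging and the residual-plus-concatenation step can be sketched as follows. This is a toy illustration under stated assumptions: the encoder is replaced by a fixed random linear map (`W_enc`, hypothetical), the feature size `FEAT_DIM` is shrunk for brevity, and the sign of the residual (current minus baseline) is a guess, since only "the residual between these tensors" is specified.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM = 32  # illustrative size only, smaller than the paper's encoding

# Stand-in for the shared encoder: a fixed random linear map from a flattened
# 96x96x2 input to a feature vector (hypothetical; the real encoder is a CNN).
W_enc = rng.standard_normal((96 * 96 * 2, FEAT_DIM)) * 0.01


def encode(x):
    return x.reshape(-1) @ W_enc


def mean_baseline(road_frames):
    """Average the two-channel inputs of frames labeled as looking at the road."""
    return np.mean(np.stack(road_frames), axis=0)


def personalized_feature(current, baseline):
    f_cur = encode(current)
    f_base = encode(baseline)
    residual = f_cur - f_base  # sign convention is an assumption
    return np.concatenate([f_cur, residual])


road_frames = [rng.random((96, 96, 2)) for _ in range(4)]
baseline = mean_baseline(road_frames)  # computed offline, once per driver
current = rng.random((96, 96, 2))
feat = personalized_feature(current, baseline)
print(feat.shape)
```

The prediction head then sees a vector twice the size of the plain encoding, and a frame identical to the driver's baseline yields an all-zero residual.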

The classification loss is then calculated as the categorical cross entropy between the ground truth classes and the glance predictions over the training batch.

The reconstruction loss is calculated for both the current frame and baseline tensors.

The overall objective is a weighted sum of these two losses.
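A possible formalization of the personalized objective, consistent with the description above, is given here. The notation is assumed rather than the authors': $x_i$ is the current two-channel input, $\bar{x}_i$ the mean baseline input of the same driver, $f_i = E(x_i)$ and $\bar{f}_i = E(\bar{x}_i)$ their encodings, $r_i = f_i - \bar{f}_i$ the residual, $[\cdot\,;\cdot]$ concatenation, and $\lambda_2$ the weighting scalar. The sign of the residual and the placement of the weight are assumptions.

```latex
\begin{align}
L_{cls} &= -\frac{1}{|B|} \sum_{i \in B} \sum_{k=1}^{K} y_{i,k} \, \log P([f_i \, ; r_i])_k \\
L_{rec} &= \frac{1}{|B|} \sum_{i \in B} \left( \lVert x_i - D(f_i) \rVert_1 + \lVert \bar{x}_i - D(\bar{f}_i) \rVert_1 \right) \\
L &= L_{cls} + \lambda_2 L_{rec}
\end{align}
```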


4.3 Domain Invariance

As can be seen in Figure 2, driver glance can look significantly different when the camera type (RGB or NIR), its placement (steering wheel or rear-view mirror) and the car interior change. Such a domain mismatch can result in a considerable decrease in performance when the classification model is trained on one dataset and tested on another, as experimentally shown in Section 5. To mitigate this domain inconsistency problem, we propose a multi-domain training regime for our two-channel hourglass model. This regime leverages a rich set of labeled training images from one domain (MIT2013 dataset) to learn domain invariant features for glance estimation from training samples from a second domain (AVT or In-house dataset), only some of which are labeled. The hourglass structure of our model provides an advantage, as the unlabeled samples from the second domain can also be utilized during training through the decoder’s reconstruction error.

Model | ROC-AUC per glance class (6 columns) | Macro Average
Landmarks + MLP [fridman2016owl] | 0.907 0.892 0.960 0.919 0.886 0.835 | 0.900
Baseline CNN [AlexNet] | 0.977 0.939 0.970 0.978 0.911 0.948 | 0.954
One-Channel Hourglass | 0.979 0.945 0.976 0.983 0.927 0.946 | 0.960
Two-Channel Hourglass (proposed) | 0.983 0.956 0.978 0.980 0.930 0.961 | 0.965
Personalized Hourglass (proposed) | 0.983 0.953 0.981 0.982 0.941 0.959 | 0.967

Table 1: Class-wise performance (ROC-AUC) of the different glance classification models on the MIT2013 dataset.
Each entry is a (MIT2013, AVT) ROC-AUC pair; in rows with seven pairs, the last pair is the macro average.
Mixed Training: (0.977, 0.969) (0.956, 0.731) (0.976, 0.961) (0.980, 0.951) (0.924, 0.977) (0.944, 0.949) (0.959, 0.923)
Fine-tuning [ImageGPT]: (0.891, 0.970) (0.689, 0.732) (0.930, 0.942) (0.868, 0.953) (0.880, 0.950) (0.866, 0.918)
Gradient Reversal [GradRev]: (0.977, 0.964) (0.946, 0.734) (0.974, 0.945) (0.979, 0.950) (0.930, 0.976) (0.944, 0.938) (0.958, 0.918)
Ours: (0.974, 0.966) (0.953, 0.793) (0.972, 0.966) (0.976, 0.955) (0.934, 0.970) (0.945, 0.935) (0.959, 0.930)
Table 2: Multi-domain performance (ROC-AUC) of our hourglass model, trained using different regimes, on the MIT2013 and AVT datasets.
Each entry is a (MIT2013, In-house) ROC-AUC pair; the last pair in each row is the macro average.
Mixed Training: (0.977, 0.893) (0.952, 0.779) (0.978, 0.930) (0.980, 0.933) (0.933, 0.946) (0.949, 0.838) (0.962, 0.887)
Fine-tuning [ImageGPT]: (0.706, 0.912) (0.737, 0.799) (0.877, 0.909) (0.784, 0.915) (0.727, 0.938) (0.718, 0.835) (0.758, 0.885)
Gradient Reversal [GradRev]: (0.975, 0.888) (0.931, 0.774) (0.967, 0.894) (0.978, 0.922) (0.918, 0.929) (0.932, 0.803) (0.950, 0.868)
Ours: (0.975, 0.903) (0.944, 0.810) (0.974, 0.920) (0.976, 0.926) (0.925, 0.925) (0.947, 0.835) (0.957, 0.887)
Table 3: Multi-domain performance (ROC-AUC) of our hourglass model, trained using different regimes, on the MIT2013 and the In-house dataset.

Our multi-domain training starts with three input tensors:
(1) the labeled face crop and eye patch images from the richly labeled domain,
(2) the labeled face crop and eye patch images from the sparsely labeled second domain,
(3) the unlabeled face crop and eye patch images from the second domain.

Each tensor is passed through the encoder to generate its embedding, which is then passed through the decoder to reconstruct the input. For the two input tensors with glance labels, the encoded feature is also passed through the prediction head to get the corresponding glance predictions. We share weights across the multiple streams of the encoder, decoder and prediction head during training, as shown in Figure 4.
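How the three mini-batches contribute to a single training loss can be sketched with plain NumPy. This is a simplified illustration, not the authors' training code: the function names, the dictionary layout of the batches and the placement of the weight on the reconstruction term are assumptions, and the network outputs are replaced by made-up arrays.

```python
import numpy as np


def cross_entropy(probs, labels):
    """Mean categorical cross entropy; labels are integer class ids."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


def l1_reconstruction(x, x_hat):
    """Mean absolute error between inputs and their reconstructions."""
    return float(np.mean(np.abs(x - x_hat)))


def multi_domain_loss(batches, weight=10.0):
    """Combine losses over 'rich', 'sparse' and 'unlabeled' mini-batches.

    Every batch has 'x' and 'x_hat'; the two labeled batches also carry
    'probs' (head outputs) and 'y' (labels). The unlabeled batch only
    contributes to the reconstruction term.
    """
    cls = sum(cross_entropy(batches[k]["probs"], batches[k]["y"])
              for k in ("rich", "sparse"))
    rec = sum(l1_reconstruction(b["x"], b["x_hat"]) for b in batches.values())
    return cls + weight * rec


rng = np.random.default_rng(0)


def fake_batch(n, labeled):
    # Made-up data: reconstructions are off by 0.1, predictions are uniform.
    x = rng.random((n, 96, 96, 2))
    batch = {"x": x, "x_hat": x + 0.1}
    if labeled:
        batch["probs"] = np.full((n, 6), 1.0 / 6)
        batch["y"] = rng.integers(0, 6, size=n)
    return batch


batches = {
    "rich": fake_batch(8, labeled=True),
    "sparse": fake_batch(8, labeled=True),
    "unlabeled": fake_batch(8, labeled=False),
}
loss = multi_domain_loss(batches)
print(loss)
```

Because the unlabeled batch still produces a gradient through the encoder and decoder, second-domain frames without annotations can shape the shared representation.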

The classification loss for the multi-domain training is computed over the labeled training batches from both domains, against their respective ground truth glance labels.

Similarly, the reconstruction error is calculated over all three batches, including the unlabeled training batch from the second domain.

The full multi-domain loss is a weighted combination of the classification and reconstruction terms.

The weighing scalars in the overall objectives (3), (6) and (9) are hyper-parameters that are tuned experimentally.
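The multi-domain objective can be sketched in the same spirit. The notation is assumed rather than the authors': $B_1$ is the labeled batch from the richly labeled domain, $B_2^{l}$ and $B_2^{u}$ the labeled and unlabeled batches from the second domain, $\mathrm{CE}(B)$ the mean categorical cross entropy of the predictions $P(E(x))$ over a labeled batch, $\mathrm{REC}(B)$ the mean absolute reconstruction error of $D(E(x))$ over a batch, and $\lambda_3$ the weighting scalar, whose placement on the reconstruction term is an assumption.

```latex
\begin{align}
L_{cls} &= \mathrm{CE}(B_1) + \mathrm{CE}(B_2^{l}) \\
L_{rec} &= \mathrm{REC}(B_1) + \mathrm{REC}(B_2^{l}) + \mathrm{REC}(B_2^{u}) \\
L &= L_{cls} + \lambda_3 L_{rec}
\end{align}
```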

Figure 5: tSNE [tsne] visualization of the encoded features: Our multi-domain training more compactly packs together the feature samples from the richly labeled domain (MIT2013) and the second domain (In-house dataset) than mixed training, especially for the critical ‘Road’ class (in black).

5 Experiments

5.1 Training Details

To train our models we use 235K video frames from the MIT2013 dataset, and 153K and 163K for validation and testing respectively. The videos were split offline into training, validation and testing buckets. Due to the large number of labeled samples, we also use this dataset to represent the richly labeled domain for our domain invariant experiments, while using the AVT or In-house datasets as the second domain (see Section 4.3). We randomly sample 204K frames (Training: 162K, Validation: 22K, Testing: 20K) from the AVT and 377K video frames (Training: 240K, Validation: 65K, Testing: 72K) from the In-house datasets for these experiments (see Section 8 for the class-wise breakdown). All frames were downsampled to 96×96×1 to generate the facial image, and the eye patch (also 96×96×1 in size) was cropped out using the eye landmarks extracted with the FAN network from [Bulat3D], as can be seen in Figure 1. Any frame with no detected faces was removed.

During training, we use the Adam optimizer [Adam], with a Dropout [Dropout] layer (rate=0.7) between the dense layers of the prediction head in the two-channel hourglass network (Section 4.1). The weighing scalars of the three overall objectives are empirically set as 1, 1 and 10 respectively. We train all models using TensorFlow coupled with Keras [chollet2015keras] on a single NVIDIA Tesla V100 card with the batch size set as 8. For the personalized model, however, we find it optimal to train with a batch size of 16 and a learning rate tuned separately for it. To reduce computation cost and further prevent overfitting, we stop model training once the validation loss plateaus across three epochs and save the model snapshot for testing. We only use the trained encoder and prediction head during inference.
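The stopping rule can be sketched as a small helper. What exactly counts as a plateau is not specified, so the criterion below (no improvement over the best earlier validation loss for three consecutive epochs) and the names `should_stop`, `patience` and `min_delta` are assumptions, and the loss values are made up.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """True once the last `patience` epochs all fail to improve on the
    best validation loss seen before them."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(v >= best_before - min_delta for v in recent)


history = []
stopped_at = None
# Made-up validation losses, one per epoch.
for epoch, loss in enumerate([0.90, 0.70, 0.62, 0.63, 0.62, 0.64, 0.61]):
    history.append(loss)
    if should_stop(history):
        stopped_at = epoch
        print("stopping after epoch", epoch)
        break
```

In Keras, the `EarlyStopping` callback with `monitor="val_loss"` and `patience=3` provides comparable behavior, though whether it was used here is not stated.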

For training the personalization framework, we prepare multiple mini-batches for every iteration with the current frame (face crop and eye patch) and baseline frame (mean face crop and mean eye patch) inputs. For the domain invariant regimen, the mini-batches are prepared with labeled inputs from the richly labeled domain, labeled inputs from the second domain and unlabeled inputs from the second domain. The overall loss is computed from the mini-batches before updating model weights.

Computation Overhead: In terms of model size, the encoder and prediction head together consist of 24M parameters, while adding the decoder for reconstruction increases the number to 54M. While the decoder does add computational load during training, only the encoder and prediction head are required for inference. Thus, turning the typical classifier into an hourglass does not introduce additional overhead when deployed in production. The personalized version of the model has the same number of trainable parameters but does require an additional stream of baseline driver information.

5.2 Performance on the MIT2013 Dataset

Post training, we test our two-channel hourglass and personalization models for glance estimation on the test frames from the MIT2013 dataset [mehler2016multi]. To gauge their effectiveness, we compare our models with the following:
(1) Landmarks + MLP. Following [fridman2016owl], we train a baseline MLP model with 3 dense layers on a flattened representation of facial landmarks extracted using [Bulat3D].
(2) Baseline CNN. We also train a baseline CNN with 4 convolutional and max pooling layers followed by 3 dense layers, similar to AlexNet [AlexNet]. The baseline CNN takes as input the 96×96×1 cropped face image.
(3) One-Channel Hourglass. This model only receives the cropped face image, without the eye-patch channel. The hyper-parameters and losses however remain the same.

As can be seen in Table 1, increasing the input quality (landmarks vs. actual pixels) and model complexity (baseline CNN vs. residual encoder) improves classification performance, with both the personalized multi-stream and two-channel hourglass models outperforming the other approaches and the former producing the best macro average ROC-AUC. This suggests that providing the model with an additional stream of subject-specific information (personalization) can better tune the model with respect to movement of the driver’s head, as validated by the improved performance on the ‘Instrument Cluster’, ‘Left’, ‘Right’ and ‘Road’ classes. Alternatively, adding an auxiliary reconstruction task (adding a decoder) can also boost the primary classification accuracy by learning useful contextual information while requiring no extra stream of data.
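As a quick sanity check on Table 1, the macro average is the unweighted mean of the six per-class ROC-AUC values. The sketch below copies the rows from the table; the names `TABLE1` and `macro_average` are hypothetical, and small differences from the reported averages are expected because the per-class values are already rounded.

```python
# Per-class ROC-AUC values copied from Table 1 (six classes), with the
# reported macro average as the last entry of each row.
TABLE1 = {
    "Landmarks + MLP": [0.907, 0.892, 0.960, 0.919, 0.886, 0.835, 0.900],
    "Baseline CNN": [0.977, 0.939, 0.970, 0.978, 0.911, 0.948, 0.954],
    "One-Channel Hourglass": [0.979, 0.945, 0.976, 0.983, 0.927, 0.946, 0.960],
    "Two-Channel Hourglass": [0.983, 0.956, 0.978, 0.980, 0.930, 0.961, 0.965],
    "Personalized Hourglass": [0.983, 0.953, 0.981, 0.982, 0.941, 0.959, 0.967],
}


def macro_average(per_class):
    """Unweighted mean over classes."""
    return sum(per_class) / len(per_class)


for name, row in TABLE1.items():
    computed = macro_average(row[:-1])
    print(f"{name}: computed {computed:.4f}, reported {row[-1]:.3f}")

best = max(TABLE1, key=lambda name: TABLE1[name][-1])
print("best reported macro average:", best)
```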

Model (labeled data) | ROC-AUC per glance class (6 columns) | Macro Average
Mixed Training (50%) | 0.876 0.785 0.920 0.936 0.922 0.828 | 0.877
Ours (50%) | 0.897 0.818 0.944 0.943 0.938 0.842 | 0.897
Mixed Training (10%) | 0.821 0.734 0.882 0.867 0.924 0.798 | 0.838
Ours (10%) | 0.881 0.814 0.900 0.898 0.893 0.845 | 0.872
Mixed Training (1%) | 0.776 0.704 0.859 0.775 0.854 0.696 | 0.777
Ours (1%) | 0.830 0.790 0.904 0.847 0.852 0.777 | 0.833
Table 4: Performance (ROC-AUC) of our two-channel hourglass model with mixed training and our multi-domain regime, on the In-house dataset with different amounts of labeled samples. The utility of unlabeled samples paired with reconstruction loss is evident as the percentage of labeled data from the second domain decreases.
Model variant | ROC-AUC per glance class (6 columns) | Macro Average
w/ MSE | 0.977 0.944 0.975 0.974 0.934 0.941 | 0.957
wo/ skip connections | 0.980 0.942 0.974 0.981 0.935 0.943 | 0.959
wo/ reconstruction loss | 0.977 0.945 0.976 0.978 0.926 0.946 | 0.958
wo/ classification loss [ImageGPT] | 0.979 0.936 0.974 0.976 0.933 0.952 | 0.958
Full Model | 0.983 0.956 0.978 0.980 0.930 0.961 | 0.965
Table 5: Performance (ROC-AUC) of our two-channel hourglass model with different components ablated on the MIT2013 dataset.

5.3 Domain Invariance

For the domain invariance task, as described in Section 4.3, we assign the MIT2013 dataset as the richly labeled domain, as it has a large number of video frames with human-annotated glance labels, and use the AVT and our In-house datasets interchangeably as the new domain. To evaluate its effectiveness, we compare our multi-domain training approach with the following regimes while keeping the backbone network (two-channel hourglass) the same:
(1) Mixed Training. Only labeled data from the two domains are pooled together based on their glance labels for training.
(2) Fine-tuning [ImageGPT]. We train the model on labeled data from the richly labeled domain and then fine-tune the saved snapshot on labeled data from the new domain, a strategy similar to [ImageGPT].
(3) Gradient Reversal [GradRev]. We add a domain classification block on top of the encoder output to predict the domain of each input. However, its gradient is reversed during backpropagation to confuse the model and shift its representations towards a common manifold, similar to [GradRev] (we use the implementation from [GradRevDrive]).
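The gradient reversal idea in regime (3) can be illustrated with a toy layer that is the identity in the forward pass and flips the sign of the gradient in the backward pass. This is a framework-free sketch, not the implementation referenced above: the class name, the explicit `backward` method and the `scale` parameter are assumptions.

```python
import numpy as np


class GradientReversal:
    """Identity in the forward pass; negates (and scales) the gradient
    flowing back from the domain classifier in the backward pass."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def forward(self, features):
        return features

    def backward(self, grad_from_domain_head):
        return -self.scale * grad_from_domain_head


layer = GradientReversal(scale=1.0)
feats = np.array([0.2, -0.5, 1.0])  # made-up encoder output
grad = np.array([0.1, 0.3, -0.2])   # made-up gradient of the domain loss
print(layer.forward(feats), layer.backward(grad))
```

The domain classifier itself is still trained to separate the domains, while the encoder receives the negated gradient and is pushed towards features that make the domains harder to tell apart.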

Although our multi-domain training approach can utilize the unlabeled samples from the new domain, for our first experiment we use 100% of the annotated images from both domains to level the playing field. The same model snapshot is used for testing on both the MIT2013 dataset (richly labeled domain) and the AVT or In-house datasets (new domain). The results can be seen in Tables 2 and 3 respectively. In both cases, the fine-tuning approach fails to generalize to both domains, essentially “forgetting” details of the initial task (the richly labeled domain). Adding the gradient reversal head does generate a boost over fine-tuning; however, it overfits slightly on the training set and takes almost twice as long as the other approaches to converge. The mixed training and our multi-domain approaches perform competitively and generate the best ROC-AUC numbers on the MIT2013-Inhouse and MIT2013-AVT pairs respectively.

However, using all labeled data from the new domain does not fairly evaluate the full potential of our approach. Unlike the other approaches, our training regimen can utilize the unlabeled data from the new domain via the reconstruction loss, as proposed in Section 4.3. To put this functionality into effect, we use different amounts of labeled samples (50%, 10% and 1%) from the new domain when training the hourglass model with the mixed training and multi-domain regimes. As shown in Table 4, our approach significantly outperforms mixed training as the amount of labeled data in the new domain diminishes. Interestingly, our multi-domain hourglass trained with 50% labeled data generalizes better than when trained with 100% labeled data, suggesting that more generalizable global features are learned when an unsupervised component is added to a classification task. This is further validated when visualizing the encoded features using tSNE [tsne], as depicted in Figure 5. Our multi-domain training more compactly packs together the feature samples from the two domains than mixed training, especially for the critical ‘Road’ class. Thus, this technique can be used to gauge the amount of labeling required when adapting models to new domains and consequently reduce annotation cost.
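The widening gap can be read directly off the macro averages in Table 4. The sketch below copies those values; the names `TABLE4_MACRO` and `gains` are hypothetical.

```python
# Macro average ROC-AUC from Table 4, keyed by the percentage of labeled
# second-domain samples: (mixed training, multi-domain training).
TABLE4_MACRO = {
    50: (0.877, 0.897),
    10: (0.838, 0.872),
    1: (0.777, 0.833),
}

gains = {pct: ours - mixed for pct, (mixed, ours) in TABLE4_MACRO.items()}

for pct in sorted(gains, reverse=True):
    print(f"{pct:>3}% labeled: gain over mixed training = {gains[pct]:.3f}")
```

The margin of the multi-domain regime over mixed training grows as the labeled fraction shrinks, which is the trend the paragraph above describes.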

5.4 Ablation Studies

To check the contribution of each component of our two-channel hourglass network, we train the following variations of our model:
(1) w/ MSE. Instead of mean absolute error, the reconstruction loss is computed with mean squared error.
(2) wo/ skip connections. We remove skip connections between the encoder and decoder layers.
(3) wo/ reconstruction loss. The reconstruction loss is removed, essentially making the model a traditional classification module with residual layers.
(4) wo/ classification loss. Taking inspiration from [ImageGPT], we first train the hourglass solely with the reconstruction task (no classification loss) and then use the encoder module as a feature extractor to train the prediction block.
For all the model variations, we keep everything else the same for consistency.

As presented in Table 5, ablating the different components generates slightly different results. Because pixel values are normalized before training, MSE based reconstruction slightly dampens the error: squaring a difference smaller than one in magnitude makes it smaller still. The skip connections help in propagating stronger signals across the network [UNet], hence removing them negatively affects model performance. Removing the reconstruction loss altogether deteriorates model performance as contextual information gets overlooked. Surprisingly, unsupervised pre-training performs quite well, suggesting the reconstruction task can teach the model features useful for classification. This reconstruction element, present in our full model, helps it achieve the best overall performance.

6 Conclusion

Advanced driver assistance systems that can detect whether a driver is distracted can help improve driver safety but pose many research challenges. In this work, we proposed a model that takes as input a patch of the driver’s face along with a crop of the eye-region and provides a classification into 6 coarse ROIs in the vehicle. We demonstrated that an hourglass network consisting of encoder-decoder modules, trained with a secondary reconstruction loss, allows the model to learn strong feature representations and perform better in the primary glance classification task. In order to make the system more robust to subject-specific variations in appearance and driving behavior, we proposed a multi-stream model that takes a representation of a driver’s baseline glance behavior as an auxiliary input for learning residuals. Results indicate that such personalized training improves model performance for multiple glance ROIs over rigid models.

Finally, we designed a multi-domain training regime to jointly train our hourglass model on data collected from multiple camera views. Leveraging the hourglass’ auxiliary reconstruction objective, this approach can learn domain invariant representations from very little labeled data in a weakly supervised manner, and consequently reduce labeling cost. As future work, we plan to use our hourglass model as a proxy for annotating unlabeled data from new domains and actively learn from high confidence samples.


7 Detailed Model Architecture

Here we describe in detail the architecture of the encoder and decoder modules, and the prediction head of our two-channel hourglass model. As discussed in Section 4.1 of the main text, the encoder takes a 96×96×2 input and passes it through a dilated convolution layer [atrous], followed by 5 residual blocks [ResNet] with stride = 2 for downsampling. This downsampled output is fed to a densely connected layer with 512 neurons and linear activation to generate the encoded feature representation of the input.
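The downsampling arithmetic can be traced with a short sketch. It follows the strides and filter counts listed in Table 6 and assumes 'same' padding, so that a stride-2 layer halves the spatial size; the padding scheme and the helper name `trace_encoder` are assumptions made for the example.

```python
import math

# (name, stride, filters) for the encoder's convolution layers, read off Table 6.
ENCODER_CONVS = [
    ("conv1", 1, 128), ("conv2", 2, 64), ("conv3", 2, 128),
    ("conv4", 2, 256), ("conv5", 2, 512), ("conv6", 2, 1024),
]

def trace_encoder(size=96):
    """Return (name, height, width, channels) after each convolution layer."""
    shapes = []
    for name, stride, filters in ENCODER_CONVS:
        size = math.ceil(size / stride)  # 'same' padding assumed
        shapes.append((name, size, size, filters))
    return shapes

shapes = trace_encoder()
_, h, w, c = shapes[-1]
flattened = h * w * c  # number of values fed to the 512-unit dense layer
```

Under these assumptions the five stride-2 layers reduce the 96×96 input to 3×3 with 1,024 channels, which matches the 3*3*1024 size that the decoder's first dense layer restores in Table 7.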

The decoder is designed like a mirror image of the encoder: it takes this dense 512-D input and feeds it through 5 upsampling pixel shuffling layers [pixshuff]. We also add skip connections [UNet] between encoder and decoder layers with the same feature map resolution for stronger signal propagation. The final upsampled output is passed through a convolution layer with tanh activation to reconstruct the 96×96×2 input [DCGAN, salimans]. The prediction head is composed of two dense layers with a dropout [Dropout] layer in between for regularization. We apply softmax activation for the second dense layer to get the final glance prediction. Unless stated otherwise, all layers use a leaky ReLU activation.
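A pixel shuffling layer can be sketched in plain Python as a pure rearrangement of values. The sketch assumes an upscale factor of 2 and a channels-first layout; both are illustrative choices, with the factor suggested by the 4*N filter counts in Table 7.

```python
def pixel_shuffle(x, r=2):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                # Each group of r*r input channels fills one r-by-r output cell.
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out

# Four 1x1 channels become one 2x2 channel.
x = [[[1]], [[2]], [[3]], [[4]]]
y = pixel_shuffle(x)
```

With a factor of 2, each layer trades 4 channels for a doubling of height and width, so five such layers take a 3×3 map to 96×96.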

The detailed layers of the encoder, decoder and prediction head are listed in Tables 6, 7, and 8 respectively. The convolution layers, dense layers, residual blocks and pixel shuffling blocks are represented as ‘conv’, ‘fc’, ‘RB’, and ‘PS’ respectively in the tables.

Layer    Filter/Stride/Dilation    # of filters
conv1    3×3/1/2                   128
conv2    3×3/2/1                   64
RB1      3×3/1/1                   64
conv3    3×3/2/1                   128
RB2      3×3/1/1                   128
conv4    3×3/2/1                   256
RB3      3×3/1/1                   256
conv5    3×3/2/1                   512
RB4      3×3/1/1                   512
conv6    3×3/2/1                   1,024
RB5      3×3/1/1                   1,024
fc1      512                       -
Table 6: Encoder architecture (input size is 96×96×2)
Layer    Filter/Stride/Dilation    # of filters
fc2      3*3*1024                  -
conv7    3×3/1/1                   4*512
PS1      -                         -
conv8    3×3/1/1                   4*256
PS2      -                         -
conv9    3×3/1/1                   4*128
PS3      -                         -
conv10   3×3/1/1                   4*64
PS4      -                         -
conv11   3×3/1/1                   4*64
PS5      -                         -
conv12   5×5/1/1                   2
Table 7: Decoder architecture (input size is (512,))
Layer    Filter/Stride/Dilation    # of filters
fc3      256                       -
fc4      6                         -
Table 8: Prediction head architecture (input size is (512,))

8 Classwise Data Distribution

Figure 6: Class wise distribution of samples in the Train, Validation and Test splits in the MIT2013 [mehler2016multi], AVT [fridman2019advanced] and our In-house datasets. The ‘Centerstack’, ‘Instrument Cluster’ and ‘Rearview Mirror’ are abbreviated as ‘CS’, ‘IC’ and ‘RVM’.
Figure 7: Randomly sampled cropped face images (input) and their reconstructions produced by our hourglass model. Subtle differences between the two sets can be observed by zooming in. All images are 96×96×1 in resolution.

Here we present the class wise distribution of samples for the MIT2013 [mehler2016multi], AVT [fridman2019advanced] and our In-house collected datasets in Figure 6. As can be seen, the predominant class (ROI) is the driver actually looking at the road (‘Road’), especially for the MIT2013 and In-house datasets. This imbalance can cause the trained model’s representations to be skewed towards the heavily populated ROIs and perform poorly for the sparse classes. However, as presented in the results from the main text, our model does not exhibit such bias and performs competitively for all the ROI classes.

9 Reconstruction Example

We present a random set of face samples from our In-house dataset and their reconstructions generated by our hourglass model in Figure 7. Except for some noise and grid-like artifacts in some cases, we find little difference between the input and reconstructed images.

Figure 8: Normalized confusion matrices across the different classes on the MIT2013 [mehler2016multi] test samples using: (a) Landmarks + MLP model, (b) Baseline CNN [AlexNet], (c) One-Channel Hourglass, (d) Two-Channel Hourglass, and (e) Personalized Hourglass model.

10 Confusion Matrices

In the main text, we utilize ROC-AUC as the metric to report the performance of the different models. Since the ROC-AUC metric is threshold agnostic, it can be used to gauge model performance while sweeping through different thresholds. To complement it, we also present the performance of each of our candidate models on the MIT2013 [mehler2016multi] test set in Figure 8, using confusion matrices normalized by the total number of samples in each class.
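Normalizing a confusion matrix by the number of samples in each true class can be sketched as below. The labels are toy values chosen for illustration, and the function name is hypothetical; rows index the true class and columns the predicted class.

```python
def normalized_confusion(y_true, y_pred, num_classes):
    """Confusion matrix whose rows (true classes) each sum to 1."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # Classes with no samples are left as all zeros.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Toy labels for three classes; not taken from any dataset.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2]
cm = normalized_confusion(y_true, y_pred, 3)
```

Because each row is divided by its own class count, the diagonal reads as per-class recall, which keeps sparse classes comparable with the heavily populated ‘Road’ class.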