Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition

01/17/2021 ∙ by Zachary Wharton, et al. ∙ Edge Hill University

There is significant progress in recognizing traditional human activities from videos, focusing on highly distinctive actions involving discriminative body movements, body-object and/or human-human interactions. Driver's activities are different, since they are executed by the same subject with similar body-part movements, resulting in subtle changes. To address this, we propose a novel framework that exploits spatiotemporal attention to model these subtle changes. Our model, named the Coarse Temporal Attention Network (CTA-Net), introduces coarse temporal branches within a trainable glimpse network. The goal is to allow the glimpse to capture high-level temporal relationships, such as 'during', 'before' and 'after', by focusing on a specific part of a video. These branches also respect the topology of the temporal dynamics in the video, ensuring that different branches learn meaningful spatial and temporal changes. The model then uses an innovative attention mechanism to generate high-level action-specific contextual information for activity recognition by exploring the hidden states of an LSTM. The attention mechanism helps the model learn to decide the importance of each hidden state for the recognition task by weighing the hidden states when constructing the representation of the video. Our approach is evaluated on four publicly accessible datasets and outperforms the state-of-the-art by a considerable margin using only RGB video as input.







1 Introduction

Recognizing human/driver activities while driving is a key ingredient not only for the development of Advanced Driver Assistance Systems (ADAS) but also for many intelligent transportation systems. These include autonomous driving [merat2014transition, Kim17], driving safety monitoring [prat17, Kaplan15], and Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) [talebpour2016modeling] systems, just to name a few. The rise of automation and a growing interest in fully autonomous vehicles encourage more non-driving or distracting behaviors from the driver. Therefore, understanding human drivers' behavior is crucial for accurately predicting Take-Over Requests and surrounding vehicles' activities, which in turn supports the development of control strategies and human-like planning. Moreover, understanding drivers' behavior, such as human drivers' interaction with each other and with transportation infrastructure, provides significant insight into the efficient design of V2V and V2I systems. Similarly, real-time monitoring of drivers' activities and body language helps build a safe-driving profile for each driver, which is vital for emerging vehicle/ride sharing industries and fleet management platforms.

Real-world driving is a multi-agent system in which diverse participants interact with each other and with infrastructure. Moreover, each driver has their own driving style, and driving often depends on sophisticated multi-tasking human intelligence, including the perception of traffic situations, reasoning about surrounding road-users' intentions, paying attention to potential hazards, planning the ego-trajectory, and finally executing the driving task. Therefore, it is a complex problem involving a large diversity of daily driving scenarios, driving behaviors, and granularities of activity, resulting in significant challenges in understanding and representing driving behaviors. Recent research on recognizing fundamental fine-grained driver's actions, such as eating, drinking, and interacting with the vehicle controls, is only the first step in addressing this [martin2019drive, behera2018context, abouelnaga2017real, eraqi2019driver].

Driver behavior recognition is closely linked to the broader field of human action recognition, which has rapidly gained attention due to the rise of deep learning [hussein2019timeception, hussein2019videograph, piergiovanni2019evolving, wang2018non, carreira2017quo, girdhar2017attentional, tran2018closer]. These approaches are data-intensive and are trained on large-scale video datasets, usually originating from YouTube [carreira2017quo, karpathy2014large], that consist of highly discriminative actions often executed by different subjects. In contrast, driving behavior commonly involves various driving/non-driving activities executed by the same driver with very similar body-part movements, resulting in subtle changes: for example, talking vs texting on a mobile phone, or eating vs drinking, where many actions share a similar upper-body pose and the only difference is the object of interest. Furthermore, in such scenarios, only part of the body (e.g. the upper body) is visible, making the problem even harder. Therefore, the above-mentioned conventional human action recognition models might not be suitable for drivers' activities.

Our work: Our CTA-Net uses visual attention in an innovative way to capture both subtle spatiotemporal changes and coarse temporal relationships. It attends to visual cues specific to temporal segments to preserve the temporal ordering in a given video, and then applies a temporal attention mechanism that dictates how much to attend to the current visual cues conditioned on their temporal neighborhood context. It is a recurrent model (an LSTM) in which a visual representation of a video frame is learned using a residual network [He16] (ResNet-50). The last convolutional block (CONV5) of our model focuses on a segment of the input video, allowing our novel attention to assign an estimated importance to each segment of the video by considering the knowledge of the coarse temporal range. For example, such a coarse temporal range might indicate the driver's hand moving towards an object of interest (e.g. phone, bottle), carrying out the required task (e.g. talking, drinking), and then the hand moving away. Many different activities exhibit the same spatiotemporal pattern of the hand moving toward the object and away from it. However, the proposed coarse temporal ranges, their temporal ordering, and the appearance of specific object(s) in a given activity allow the model to discriminate between different activities. Moreover, our novel temporal attention learns to attend to different parts of the hidden states of the LSTM to discriminate fine-grained activities.

Our contributions: They can be summarized as: 1) a driver activity recognition model is proposed with a residual CNN-based glimpse sensor and a novel attention mechanism; 2) our novel attention mechanism is designed to learn how to emphasize the hidden states of an LSTM in an adaptive way; 3) to capture task-specific high-level features, a spatial attention mechanism conditioned on coarse temporal segments is developed by introducing branches in the last convolutional layer; and 4) extensive validation of the proposed model on four datasets, obtaining state-of-the-art results.

(a) Glimpse Sensor (b) Self-Attention layer in glimpse sensor (c) LSTM and our novel temporal attention
Figure 1: The proposed CTA-Net consists of: a) Glimpse sensor: given an input video of T frames, the sensor extracts the feature g_t of the t-th frame, where t = 1, …, T. b) Self-Attention: it captures important cues on activity-specific spatial changes. c) Temporal Attention: the module uses the internal states of an LSTM (unrolled) that takes g_t as input and selectively focuses on the hidden states h_t to infer the activity.

2 Related Work and Motivation

Traditional Human Activity Recognition: The recent surge of deep learning has significantly influenced the advancement of recognizing human activities from videos. Most attempts in this genre are derived from image-based networks, which are used to extract features from individual frames and are extended to perform temporal integration by forming a fixed-size descriptor using statistical pooling such as max and average pooling [hussein2017unified, habibian2016video2vec], attentional pooling [girdhar2017attentional], rank pooling [fernando2016rank], context gating [miech2017learnable], and high-dimensional feature encoding [girdhar2017actionvlad, xu2015discriminative]. However, an important visual cue representing the temporal pattern is overlooked in such statistical pooling and high-dimensional encoding. On the other hand, recurrent networks [behera2018context, Xu_2019_ICCV], Temporal Convolutional Networks (TCN) [lea2017temporal], and spatiotemporal features learned through 3D convolutions [tran2018closer, piergiovanni2019evolving, carreira2017quo] are used to capture temporal dependencies. Recurrent networks such as LSTMs are capable of modeling long-term dependencies and are thus adopted for the activity recognition problem. To the best of our knowledge, no substantial improvements have been reported recently.

To learn long-term temporal dependencies, Hussein et al. propose Timeception [hussein2019timeception], which uses multi-scale temporal convolutions to reduce the complexity of 3D convolutions. In [wang2018non], Wang et al. present non-local operations as a generic family of building blocks for capturing long-range dependencies. Zhou et al. [zhou2018temporal] introduce a Temporal Relation Network (TRN) to learn and reason about temporal dependencies between video frames at multiple time scales. Similarly, Wang et al. [wang2016temporal] propose a Temporal Segment Network (TSN) with a sparse temporal sampling strategy. A Long-term Temporal Convolution (LTC) is proposed in [varol2017long] to consider different temporal resolutions as a substitute for bigger temporal windows. Another influential approach is the use of 3D CNNs for action recognition. Carreira and Zisserman [carreira2017quo] propose a model (I3D) that inflates 2D CNNs pre-trained on images to 3D for video classification. Tran et al. [tran2018closer] describe a spatiotemporal convolution that factorizes the 3D convolutional filters into separate spatial and temporal components to recognize actions.

Attention in Activity Recognition: Attention mechanisms in machine learning have drawn increasing interest in areas such as video question answering [li2019beyond], video captioning [pei2019memory, song2017hierarchical], and video recognition [girdhar2017attentional, girdhar2019video, baradel2018human, song2017end, sharma2015action]. This is influenced by human perception, which focuses selectively on parts of the scene to acquire information at specific places and times. It has been explored by Girdhar and Ramanan [girdhar2017attentional] for action recognition via bottom-up and top-down attention. Similarly, a recurrent mechanism is proposed in [sharma2015action] that focuses selectively on parts of the video frames, both spatially and temporally. Girdhar et al. [girdhar2019video] propose an attention mechanism that learns to emphasize hands and faces to discriminate an action. An LSTM-based temporal attention mechanism is proposed by Baradel et al. [baradel2018human] to emphasize features representing hands. Song et al. [song2017hierarchical] propose end-to-end spatial and temporal attention to selectively focus on discriminative skeleton joints in each frame and pay different levels of attention to different frames.

Driver Activity Recognition: Driver activities are a subset of conventional human activities [martin2019drive, behera2018context, abouelnaga2017real, eraqi2019driver, martin2018body, oliver2000graphical, doshi2011tactical, ramanishka2018toward]. They can be categorized into two sub-classes: 1) primary maneuvering activities (e.g. passing, changing lanes, starting, stopping) [oliver2000graphical, doshi2011tactical, ramanishka2018toward] and 2) secondary non-driving activities (e.g. eating, drinking, talking) [martin2019drive, behera2018context, abouelnaga2017real, eraqi2019driver, martin2018body]. In this work, we focus on secondary activities, which are crucial for safe driving and take-over requests. Moreover, they will become more frequent during the autonomous driving mode. Martin et al. [martin2018body] propose a method to combine multiple streams involving body pose and contextual information. Behera et al. [behera2018context] advocate a multi-stream LSTM for recognizing driver's activities by combining high-level body pose and body-object interaction with CNN features. A genetically weighted ensemble approach is used in [abouelnaga2017real]. The VGG-16 [simonyan2014very] network is modified by Baheti et al. [Baheti18] to reduce the number of parameters for faster execution. Similarly, Li et al. [li2019learning] propose a tactical behavior model that explores egocentric spatial-temporal interactions to understand how humans drive and interact with road users.
Motivation: It is evident that traditional activity recognition models are developed to recognize highly distinctive actions. Lately, attention mechanisms have been brought in to improve the recognition accuracy of these models. The conventional models are adapted for drivers' activity monitoring by tweaking a few layers or simply evaluating on the target driving datasets. In this work, we move a step forward by innovating within-frame self-attention and between-frame coarse and fine-grained temporal attention to recognize the driver's secondary activities. These activities differ from traditional human activities since they are executed by the same subject, resulting in subtle changes among various activities. Our coarse temporal attention introduces three branches to model the high-level temporal relationships 'during', 'before', and 'after', with the assumption that the main action is performed 'during' (e.g. drinking), 'before' focuses on the pre-action event (e.g. taking the bottle), and 'after' emphasizes the post-action episode (e.g. putting the bottle back). The self-attention within each branch selectively focuses on capturing spatial changes. Finally, we introduce a novel temporal attention that focuses on the distribution of hidden states of an LSTM instead of image feature maps [girdhar2017attentional] or hard attention involving the subject's hands [baradel2018human]. We argue that our contribution includes not only the design of the CTA-Net but also an empirical study on the role of attention in improving accuracy.

3 Proposed End-to-End CTA-Net

3.1 Problem formulation

For video-based activity recognition, we are given N training videos X_i (i = 1, …, N) and the activity label y_i for each video X_i. The aim is to find a function F that predicts ŷ_i = F(X_i; θ) matching the actual activity y_i of a given video as closely as possible. We learn F by minimizing the categorical cross-entropy between the predicted activity ŷ_i and the actual activity y_i:

L(θ) = −(1/N) Σ_{i=1}^{N} y_i log ŷ_i
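As a concrete illustration, the objective above can be sketched as follows (a minimal numpy sketch; the function name and the toy labels/predictions are ours for illustration, not from the paper):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy between one-hot labels and
    predicted class probabilities, averaged over the videos."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred)) / y_true.shape[0])

# Toy example: two videos, three activity classes.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
loss = categorical_cross_entropy(y_true, y_pred)
```

The loss goes to zero as the predicted probability mass concentrates on the correct class for every video.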
3.2 Glimpse sensor

The CTA-Net is built around the glimpse sensor for visual attention [mnih2014recurrent], in which information in an image is adaptively selected by encoding regions progressively around a given location in the image. Inspired by this, our approach encodes information at temporal locations within a video. The proposed glimpse receives an image x_t at time t from a video X = (x_1, …, x_T). It produces the glimpse feature vector g_t = f_g(x_t; θ_g) from x_t by limiting the temporal bandwidth around t, where b is the coarse temporal bandwidth of the video and θ_g is the model parameter.

Our glimpse is implemented using ResNet-50 [He16] (Fig. 1(a)). We modify this network by introducing two essential ingredients: 1) a coarse temporal bandwidth b and 2) a Self-Attention layer (Fig. 1(b)). The bandwidth b limits the glimpse to focus on certain temporal positions in X. If it is limited to a single frame (i.e. b = 1), the sensor complexity will increase. To address this, we use a coarse bandwidth (b > 1). It allows the glimpse to focus on different temporal parts of a video, motivated by [behera2014real], which uses before, during, and after to capture the temporal relationships in a video. Moreover, driver secondary activities often involve human-object interactions (e.g. phones, car controls) and consist of spatiotemporal dynamics such as: i) the hand approaching the object, ii) object manipulation, and iii) the hand moving away. This involves three distinctive sub-activities. Our approach explores this by introducing three branches in the last block of ResNet-50 (Fig. 1(a)). The reason is that CNNs learn features from general (e.g. color blobs, Gabor filters) to more specific (e.g. shapes, complex structures) as we move from the input to the output layer. Thus, we share the parameters of the lower layers (CONV1 to CONV4) among frames to produce a generic representation that is then processed by the bandwidth-specific layers (CONV5_k, where k ∈ {before, during, after}) to generate the required outputs.
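The shared-trunk, three-branch structure described above can be sketched roughly as follows (a PyTorch sketch under our own assumptions; the class name, channel sizes, and small stand-in convolutions are illustrative placeholders for the ResNet-50 stages, not the authors' implementation):

```python
import torch
import torch.nn as nn

class CoarseTemporalGlimpse(nn.Module):
    """Shared lower layers across all frames, plus one branch per coarse
    temporal segment ('before', 'during', 'after')."""
    def __init__(self, trunk_ch=64, branch_ch=128, n_branches=3):
        super().__init__()
        # Shared trunk: stand-in for ResNet-50 CONV1-CONV4.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, trunk_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(trunk_ch, trunk_ch, 3, stride=2, padding=1), nn.ReLU())
        # Bandwidth-specific branches: stand-in for the three CONV5 blocks.
        self.branches = nn.ModuleList([
            nn.Conv2d(trunk_ch, branch_ch, 3, stride=2, padding=1)
            for _ in range(n_branches)])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x, segment):
        """x: (B, 3, H, W) frames; segment: 0=before, 1=during, 2=after."""
        feat = self.trunk(x)                         # generic representation
        feat = torch.relu(self.branches[segment](feat))  # segment-specific
        return self.pool(feat).flatten(1)            # glimpse vector g_t

model = CoarseTemporalGlimpse()
g = model(torch.randn(2, 3, 64, 64), segment=1)      # (2, 128) glimpse vectors
```

In the paper the trunk is ResNet-50 pre-trained on ImageNet; here a two-layer stack keeps the sketch self-contained while preserving the parameter-sharing idea.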

Within each branch of CONV5 (Fig. 1(a)), we also add an attention map (Fig. 1(b)) to capture bandwidth-specific important cues focusing on spatial changes. The aim is to model long-range, multi-level dependencies across image regions, complementary to the convolutions that capture the spatial structure within the image. Our model explicitly learns the relationship between features located at positions i and j in the feature map F, represented as an attention map A, where A_{i,j} conveys how much to focus on location j when synthesizing position i in F. To achieve this, we compute the attention map by adapting the self-attention in SAGAN [zhang2018self], where the query Q, the key K, and the value V are computed from the feature map F (Fig. 1(b)) via three separate convolutions. The key is multiplied with the query, and a softmax is then used to create the attention map A. The value V is multiplied with A to get the desired output O. Afterwards, for each frame at time t, O is multiplied with a learnable scalar γ (initialized as zero) and added back to the input as a residual connection, i.e. F′ = γO + F. The feature map F′ passes through the rest of the CONV5 block (Fig. 1(a)) to produce the desired glimpse feature vector g_t.
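The SAGAN-style self-attention with the zero-initialized residual scalar can be sketched as follows (a PyTorch sketch; the class name and the channel-reduction factor are our assumptions, following the common SAGAN formulation rather than the authors' exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Self-attention over a spatial feature map; gamma is a learnable
    scalar initialized to zero, so the block starts as an identity
    residual and gradually learns how much attention to inject."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key   = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2)                 # (B, C/r, N), N = H*W
        k = self.key(x).flatten(2)                   # (B, C/r, N)
        v = self.value(x).flatten(2)                 # (B, C,   N)
        # A[i, j]: how much location j contributes when synthesizing i.
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (B, N, N)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                  # residual connection

layer = SelfAttention2d(16)
x = torch.randn(2, 16, 8, 8)
y = layer(x)
```

Because γ starts at zero, the untrained layer returns its input unchanged, which keeps early training stable.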

(a) Drive&Act [martin2019drive]
(b) Distracted Driver V1 [abouelnaga2017real]
(c) Distracted Driver V2 [eraqi2019driver]
Figure 2: Examples from the datasets used to evaluate our model.

3.3 Temporal attention architecture

The temporal attention sub-module receives a sequence of glimpse vectors (g_1, …, g_T). The goal is to encode this sequence using an internal state that summarizes information extracted from the history of past observations. Such a state encodes the sequence knowledge and is instrumental in deciding how to act. A common approach to model this state is to use the hidden units h_t of a recurrent network, updated over time as h_t = f_h(h_{t-1}, g_t; θ_h), where f_h is a nonlinear function with parameters θ_h. It provides a prediction at each time step t, and sequence recognition is generally carried out by considering the prediction at the last time step based on the associated feature and the previous context vector involving hidden states. This is an inherent flaw of the LSTM, since the model relies solely on recurrent connections to maintain and communicate temporal information. Therefore, researchers have recently explored temporal pooling (e.g. sum, average) [sharma2015action] and temporal attention for dynamic pooling [yeung2018every] as additional direct pathways for referencing previously seen frames. Our temporal attention is inspired by [yeung2018every] and focuses only on the hidden states of the LSTM. The novelty is to allow the model to learn to attend automatically to different parts of the hidden states at each step of the output generation. We achieve this by introducing an attention-focused weighted summation whose parameters θ_a consist of learnable weight matrices and biases to compute the attention-focused hidden state representation h̃_t at time t.

h̃_t = h_t + s_t ⊙ h_t

The element h̃_t is computed as a residual connection over the hidden state representation h_t of the input feature g_t at time t. The similarity map s_t is computed as

s_t = σ(W_s tanh(W_p h_{t-1} + W_c h_t + b_1) + b_2)

using the element-wise sigmoid function σ, capturing the similarity between the LSTM's hidden state responses h_{t-1} and h_t. Basically, s_t dictates how much to attend to the LSTM's current response conditioned on its neighborhood context. W_p and W_c are the weight matrices for the corresponding hidden states h_{t-1} and h_t; W_s is the weight matrix for their nonlinear combination; and b_1 and b_2 are the bias vectors.

The sequence of attention-focused residual activations (h̃_1, …, h̃_T) is then used to compute the activity probability, as shown in Fig. 1(c). We achieve this with a simple weighted summation:

v = Σ_{t=1}^{T} β_t h̃_t, with β_t = softmax_t(w_β^T h̃_t + b_β)

Here, β_t provides the score (probability) for each attention-focused residual activation h̃_t and is computed using the weight w_β and bias b_β. Finally, the weighted summation v is used by a softmax layer to estimate the activity probability of a given input video. The parameters θ_a are learned during training.
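The whole temporal-attention pathway can be sketched as follows (a PyTorch sketch that follows our reading of the description above; the layer sizes, module names, and the exact gating form are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """LSTM over glimpse vectors, an element-wise sigmoid similarity gate
    between consecutive hidden states, and a score-weighted summation of
    the attention-focused states before the softmax classifier."""
    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.w_prev = nn.Linear(hidden_dim, hidden_dim)   # W_p
        self.w_curr = nn.Linear(hidden_dim, hidden_dim)   # W_c
        self.w_comb = nn.Linear(hidden_dim, hidden_dim)   # W_s
        self.score = nn.Linear(hidden_dim, 1)             # w_beta
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, glimpses):                    # (B, T, feat_dim)
        h, _ = self.lstm(glimpses)                  # (B, T, hidden_dim)
        # Shift hidden states by one step; no predecessor at t = 0.
        h_prev = torch.cat([torch.zeros_like(h[:, :1]), h[:, :-1]], dim=1)
        # s_t: element-wise sigmoid similarity between h_{t-1} and h_t.
        sim = torch.sigmoid(self.w_comb(torch.tanh(
            self.w_prev(h_prev) + self.w_curr(h))))
        h_att = h + sim * h                         # residual: h_t + s_t * h_t
        # beta_t: softmax-normalized score per attention-focused state.
        beta = torch.softmax(self.score(h_att).squeeze(-1), dim=1)  # (B, T)
        video_repr = (beta.unsqueeze(-1) * h_att).sum(dim=1)
        return torch.softmax(self.classifier(video_repr), dim=-1)

torch.manual_seed(0)
module = TemporalAttention(feat_dim=32, hidden_dim=64, num_classes=10)
probs = module(torch.randn(2, 12, 32))              # 12 sampled frames
```

The returned rows are class distributions, so each sums to one.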

3.4 Training

The parameters of our model consist of the glimpse network θ_g, the LSTM network θ_h, and the temporal attention network θ_a. The glimpse is implemented with ResNet-50 [He16] (Fig. 1(a), Section 3.2) and initialized with ImageNet pre-trained weights. We use the standard implementation of the fully-gated LSTM network [hochreiter1997long] with parameters θ_h. These are learned via end-to-end training.

We uniformly sample 12 frames from each video segment. The frames are resized to a fixed resolution, and we use the standard evaluation metric of top-1 accuracy. Our model is trained using the Adam optimizer [kingma2014adam] with an initial learning rate of 0.001 and its momentum parameters β_1 and β_2. The learning rate is reduced by a fixed factor after every 25 epochs. The experiments are performed on an Ubuntu PC with an Intel Core i9 9820X CPU and a Titan V GPU (12 GB). A batch size of 4 videos is used.
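The optimizer schedule described above can be sketched as follows (a PyTorch sketch; the decay factor is not specified in the text, so the 0.1 here is a placeholder assumption, and the linear layer merely stands in for the full CTA-Net):

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(2048, 34)   # stand-in for CTA-Net (34 fine-grained classes)
optimizer = Adam(model.parameters(), lr=0.001)
# Decay every 25 epochs; gamma=0.1 is our placeholder, not the paper's value.
scheduler = StepLR(optimizer, step_size=25, gamma=0.1)

for epoch in range(60):
    optimizer.step()      # placeholder for one pass over 4-video batches
    scheduler.step()      # lr: 1e-3 -> 1e-4 at epoch 25 -> 1e-5 at epoch 50
```

With this schedule, the learning rate after 60 epochs has decayed twice, to 1e-5.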

4 Experimental Results

4.1 Datasets and evaluation metric

We evaluate our model on three popular driving datasets: 1) Drive&Act [martin2019drive], 2) Distracted Driver V1 [abouelnaga2017real], and 3) Distracted Driver V2 [eraqi2019driver]. To the best of our knowledge, these are the only available video datasets for secondary driving activity recognition (Fig. 2). We further evaluate our model on the SBU Kinect Interaction dataset [yun2012two], which consists of traditional human activities.

Drive&Act [martin2019drive]: This is a large-scale video dataset (over 9.6 million frames) of various driver activities. Annotations are provided for 12 classes (full scene actions) of top-level activities (e.g. eating and drinking), 34 categories (semantic actions) of fine-grained activities (e.g. opening a bottle, preparing food), and 372 classes (object interactions) of atomic action units involving triplets of action, object, and location. There are 5 types of actions, 17 object classes, and 14 location annotations. We follow the same three splits based on participant identity and use the same train, test, and validation sets in each split as [martin2019drive]. The final result is the average over the three splits.

Distracted Driver V1 [abouelnaga2017real]: It contains 12977 train and 4331 test images from 31 drivers (22 male and 9 female) from 7 different countries. There are 10 activity classes (e.g. safe driving, texting). The dataset consists of videos of each subject; frame-based evaluation is carried out in [abouelnaga2017real], whereas subject-wise video-based evaluation is done in [behera2018context]. We follow the evaluation protocol in [behera2018context], which uses the videos of 22 participants for training and the rest of the videos for testing.

Distracted Driver V2 [eraqi2019driver]: This is a newer iteration of the V1 dataset [abouelnaga2017real], containing 14478 images from 44 drivers (29 male and 15 female) with the same 10 activities. The dataset is split into 12555 training (36 drivers) and 1923 testing (8 drivers) images. The approach associated with the dataset [eraqi2019driver] used frame-wise evaluation; in this work, we are the first to provide a video-based evaluation. A total of 360 videos from 36 participants are used for training, and the rest of the videos are used for testing.

SBU Kinect Interaction [yun2012two]: This dataset is used to demonstrate our model's wider applicability. It consists of 282 videos with 8 different activity classes. It contains interactions between two subjects and is close in nature to the driver's secondary activities involving human-object and human-car interactions. We follow the same train/test split as [yun2012two].

Figure 3: Timeline of a video example from the Drive&Act dataset displaying 12 different coarse activities executed by a subject. The duration of each activity is represented by the respective color bar. A visual explanation of the classification decision is overlaid using the class activation map [selvarajuvisual], representing salient regions of various activities over the video sequence. Scenarios are: (0) fasten seat belt (and get in vehicle); (1) hand over (turn on autonomous vehicle); (2) eat and drink; (3) read newspaper; (4) put on sunglasses; (5) take off sunglasses; (6) put on jacket; (7) take off jacket; (8) read magazine; (9) watch video (on vehicle display); (10) work (type on laptop); and (11) final task (get out of vehicle). Best viewed in color.
| Model | Fine-grained (Val / Test) | Coarse task (Val / Test) | Action (Val / Test) | Object (Val / Test) | Location (Val / Test) | All (Val / Test) |
|---|---|---|---|---|---|---|
| Pose [martin2019drive] | 53.17 / 44.36 | 37.18 / 32.96 | 57.62 / 47.74 | 51.45 / 41.72 | 53.31 / 52.64 | 9.18 / 7.07 |
| Interior [martin2019drive] | 45.23 / 40.30 | 35.76 / 29.75 | 54.23 / 49.03 | 49.90 / 40.73 | 53.76 / 53.33 | 8.76 / 6.85 |
| 2-Stream [wang2017modeling] | 53.76 / 45.39 | 39.37 / 34.81 | 57.86 / 48.83 | 52.72 / 42.79 | 53.99 / 54.73 | 10.31 / 7.11 |
| 3-Stream [martin2018body] | 55.67 / 46.95 | 41.70 / 35.45 | 59.29 / 50.65 | 55.59 / 45.25 | 59.54 / 56.50 | 11.57 / 8.09 |
| C3D [tran2015learning] | 49.54 / 43.41 | - | - | - | - | - |
| P3D Net [qiu2017learning] | 55.04 / 45.32 | - | - | - | - | - |
| I3D Net [carreira2017quo] | 69.57 / 63.64 | 44.66 / 31.80 | 62.81 / 56.07 | 61.81 / 56.15 | 47.70 / 51.12 | 15.56 / 12.12 |
| CTA-Net | 72.42 / 65.25 | 62.82 / 52.31 | 57.59 / 56.41 | 63.37 / 59.19 | 56.41 / 63.01 | 46.44 / 49.41 |

Table 1: Recognition results (Validation and Testing accuracy in %) of the fine-grained and coarse tasks, as well as Atomic Action Units defined as {Action, Object, Location} triplets and their combinations, in the Drive&Act dataset [martin2019drive]. There are 34 fine-grained and 12 coarse tasks, as well as 5 actions, 17 object categories, 14 locations, and 372 ('All') possible combinations.

4.2 Results and comparative studies

We first compare the CTA-Net with the state-of-the-art on the Drive&Act dataset. An example video containing the 12 coarse activities, with a duration of 27 minutes, is shown in Fig. 3. In this figure, we also show the class activation map [selvarajuvisual] representing the visual explanation of our model's classification decisions for various coarse scenarios. The accuracy (%) of our model and state-of-the-art approaches for recognizing the 12 coarse and 34 fine-grained activities is presented in Table 1. The CTA-Net outperforms the state-of-the-art on both the validation and testing sets by a significantly large margin. For example, in coarse activity recognition, CTA-Net (62.82%) is 18.2% higher than the best model (I3D Net [carreira2017quo]: 44.66%) on the validation set, and 16.9% higher than the three-stream model [martin2018body] (35.45%) on the test set. Similarly, I3D Net is the best prior performer (Val: 69.57%, Test: 63.64%) in recognizing fine-grained activities; our CTA-Net outperforms it by margins of 2.85% (Val) and 1.61% (Test). The margin of improvement in recognizing coarse activities (Val: 18.2%, Test: 16.9%) is significantly larger than for the fine-grained ones, suggesting that our model can effectively capture long-term dependencies. This is due to the introduction of novel coarse temporal branches that explicitly model the 'during', 'before', and 'after' temporal relationships in videos. Moreover, I3D Net was developed to recognize distinctive human activities and is used here to recognize the driver's activities involving subtle changes, which suggests it might not be suitable for such applications. The visual explanation using the class activation map [selvarajuvisual], representing our coarse temporal relationships in the 'reading magazine' and 'exiting vehicle' activities, is shown in Fig. 4(b) and Fig. 4(c), respectively. More examples are included in the supplementary material.

The confusion matrix of our CTA-Net for the coarse tasks in the Drive&Act dataset is shown in Fig. 4(a). It is clear that the performance on the activities 'watching videos' (class 9), 'final task' (class 11, getting out of the vehicle), 'take off sunglasses' (class 5), and 'turn on AV feature' (class 1) is low. This is mainly due to the very little action involved in the 'watching videos' and 'turn on AV feature' activities beyond pressing a button; thus, 'watching videos' is confused with 'turn on AV feature'. The 'take off sunglasses' activity is confused with 'put on sunglasses' and 'turn on AV feature', since sunglasses are a small object carrying very little visual information. Moreover, the sunglasses are kept in a holder close to the vehicle's touch screen, causing further confusion with 'turn on AV feature'. Similarly, 'getting in' is confused with 'getting out' since there are no significant visual changes between them, although motion direction information would help discriminate such activities. The confusion matrix for the fine-grained activities and the split-wise confusion matrices of both coarse and fine-grained activities are included in the supplementary material.

The accuracy on the Atomic Action Units {Action, Object, Location} is provided in Table 1. As with the coarse and fine-grained activities, the CTA-Net outperforms on each triplet, as well as on their 372 unique combinations ('All' in Table 1). The performance of our model on these combinations is notable: the best prior performer is I3D Net [carreira2017quo] at 15.56% (Val) and 12.12% (Test), whereas the proposed approach is significantly better (Val: 46.44%, Test: 49.41%). This is mainly due to our self-attention module (Fig. 1(b)), which explicitly learns the relationships between pixels at the CONV4 output (Fig. 1(a)). It allows the model to capture the subtle changes within a video frame needed to discriminate the unique combinations of action-object-location. This suggests that our model is not only suitable for recognizing long-term dependencies in videos, but also appropriate for classifying atomic action units involving actions, locations, objects, and their distinct combinations. This is due to the design, which considers both coarse temporal attention to model high-level temporal dependencies (glimpse in Section 3.2) and fine-grained temporal attention for each frame by weighing the frames (Section 3.3) when constructing the representation of an input video. The proposed approach also performs better than the state-of-the-art for the individual atomic action units, except on the location and action validation sets. For location, our accuracy (56.41%) is not far from the best (59.54%) [martin2018body], which combines three streams, whereas our approach uses only the RGB video stream. For action, I3D Net [carreira2017quo] performed better (62.81%) on the validation set, but on the testing set ours is slightly better. This could be because actions consist of atomic verbs such as opening, closing, and reaching for, which have very short durations; thus, inflated 2D convolution is appropriate for capturing 3D spatiotemporal information, resulting in higher accuracy.

Table 2 presents our CTA-Net's accuracy on the Distracted Driver V1 [abouelnaga2017real] and V2 [eraqi2019driver] datasets. Both datasets consist of video sequences. The existing approaches use frame-wise evaluation on V2 [eraqi2019driver], and we are the first to provide a video-based evaluation on it. The Multi-stream LSTM [behera2018context] used video-based evaluation on V1 [abouelnaga2017real], and we follow it to evaluate our CTA-Net. In [behera2018context], multiple streams focusing on body pose, body-object interactions, and CNN features are used by an LSTM to recognize various activities, whereas we only use RGB video. The accuracy of our approach is significantly better at 84.09%. Similarly, the accuracy of our model is 92.5% on V2 [eraqi2019driver].

(a) Confusion matrix (Coarse tasks)
(b) Reading: ‘before’, ‘during’, and ‘after’
(c) Exiting car: ‘before’, ‘during’, and ‘after’
Figure 4: a) Our CTA-Net's confusion matrix showing 12 coarse tasks in the Drive&Act test set. A visual explanation of the decision using the class activation map [selvarajuvisual], representing our coarse temporal attention for the 'before' (left), 'during' (middle), and 'after' (right) segments of an input video with b) the reading activity and c) exiting the vehicle. Best viewed in color.
Distracted Driver V1 [abouelnaga2017real] Distracted Driver V2 [eraqi2019driver]
Model ACC Model ACC
One-stream [behera2018context] 42.22 Incep. V3 [szegedy2016rethinking] 90.07
Two-streams [behera2018context] 44.44 ResNet-50 [He16] 81.70
Three-streams [behera2018context] 52.22 VGG-16 [simonyan2014very] 76.13
Four-streams [behera2018context] 37.78
CTA-Net 84.09 CTA-Net 92.50
Table 2: Recognition accuracy (%) of 10 different driver's activities on the Distracted Driver datasets. The Inception V3, ResNet-50 and VGG-16 baselines use frame-wise evaluation.
Approaches Pose RGB Depth ACC
Raw Skeleton [yun2012two] ✓ - - 49.7
Joint Feature [yun2012two] ✓ - - 80.3
Raw Skeleton [ji2014interactive] ✓ - - 79.4
Joint Feature [ji2014interactive] ✓ - - 86.9
Co-occ. RNN [zhu2016co] ✓ - - 90.4
STA-LSTM [song2017end] ✓ - - 91.5
ST-LSTM [liu2016spatio] ✓ - - 93.3
DSPM [lin2016deep] - ✓ ✓ 93.4
Ijjina [ijjina2017human] - - ✓ 82.2
Ijjina [ijjina2017human] - ✓ ✓ 85.1
Baradel [baradel2018human] ✓ - - 90.5
Baradel [baradel2018human] ✓ ✓ - 94.1
Ijjina [ijjina2017human] - ✓ - 75.5
Baradel [baradel2018human] - ✓ - 72.0
CTA-Net - ✓ - 92.9
Table 3: CTA-Net’s accuracy (%) and its comparison to the state-of-the-art using SBU Kinect Interaction dataset [yun2012two].
Annotation Split Without during, before and after With during, before and after
No Attention Attention No Attention Attention
Val Test Val Test Val Test Val Test
Fine-grained 0 56.05 52.35 51.71 53.76 50.36 44.74 76.97 71.43
1 49.71 39.41 50.59 45.07 48.82 41.50 72.94 67.94
2 55.30 43.67 56.41 43.98 53.75 38.07 67.34 56.85
Avg 53.69 45.14 52.90 47.60 50.98 41.44 72.42 65.41
Coarse scenarios 0 47.41 43.92 46.55 39.80 43.34 44.29 63.09 61.13
1 41.94 44.43 41.12 44.91 38.77 49.39 55.34 54.34
2 53.66 31.23 60.28 33.60 45.73 30.05 70.02 41.47
Avg 47.67 39.86 49.32 39.44 42.61 41.24 62.82 52.31
Table 4: Split-wise accuracy (%) of fine-grained and coarse scenario activities with and without temporal relationships (‘before’, ‘during’, and ‘after’), as well as with and without our novel attention mechanism using Drive&Act dataset [martin2019drive].
Split Fine-Grained Coarse Action Object Location All
Val Test Val Test Val Test Val Test Val Test Val Test
0 76.97 71.43 63.09 61.13 57.82 60.94 63.01 57.94 46.50 57.01 42.95 52.07
1 72.94 67.94 55.34 54.34 56.74 54.88 62.87 64.86 68.78 64.10 52.79 49.89
2 67.34 56.85 70.02 41.47 58.20 53.40 64.23 54.77 53.94 67.92 43.57 46.27
Avg 72.42 65.41 62.82 52.31 57.59 56.41 63.37 59.19 56.41 63.01 46.44 49.41
Table 5: Split-wise accuracy (%) of fine-grained, coarse activities and atomic action units using our model on Drive&Act.

On the SBU Kinect dataset [yun2012two], our model (92.9%) significantly outperforms the state of the art using RGB only (72% [baradel2018human], 75.5% [ijjina2017human]), as shown in Table 3. Moreover, its accuracy is close to that of existing approaches that use multi-modal input (RGB+Depth: 93.4% [lin2016deep], RGB+Pose: 94.1% [baradel2018human]) and even better than the approach in [ijjina2017human], which uses RGB+Depth (85.1%). However, such multi-modal information is not always available, or requires additional devices for data capture. This demonstrates that our CTA-Net is not only suitable for recognizing drivers' activities but is also appropriate for classifying traditional human activities.

4.3 Ablation studies

We have conducted ablation studies to understand the impact of the proposed high-level temporal relationships ('before', 'during', and 'after'), as well as of our novel attention mechanism (see Section 3.3), on the performance of our model on each individual split. The results are shown in Table 4. It is evident that performance with the combined high-level temporal relationships and attention mechanism is significantly higher than with the rest of the combinations. Moreover, the average accuracy (fine-grained: Val 72.42%, Test 65.41%; coarse scenario: Val 62.82%, Test 52.31%) with the 'before', 'during', and 'after' relationships is considerably higher than without them (fine-grained: Val 52.9%, Test 47.6%; coarse: Val 49.32%, Test 39.44%). This justifies the inclusion of the proposed coarse temporal relationships. Similarly, performance is higher with our attention mechanism than without it, confirming its significance in our model.
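For reference, the coarse 'before'/'during'/'after' split used in these ablations can be sketched as a contiguous, order-preserving partition of the frame sequence. Equal thirds are an assumption made here for illustration; what matters is that each branch sees an ordered, non-overlapping segment of the video:

```python
def coarse_segments(frames):
    """Partition a frame sequence into contiguous 'before', 'during' and
    'after' segments (equal thirds assumed), respecting temporal order."""
    t = len(frames)
    a, b = t // 3, 2 * t // 3
    return {"before": frames[:a], "during": frames[a:b], "after": frames[b:]}
```

Because the segments are contiguous and non-overlapping, concatenating them recovers the original sequence, so the three branches jointly cover the whole video.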

We have also provided our model's accuracy for each individual split of Drive&Act (Table 5). There is no significant difference in accuracy among the splits, suggesting the splits are balanced. We have also included additional confusion matrices in the supplementary document.
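The confusion matrices reported here and in the supplementary material follow the standard construction, with rows indexing the ground-truth class and columns the predicted class; a minimal version:

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions per (true, predicted) class pair; entry (i, j)
    holds how often class i was predicted as class j."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m
```

With this convention, diagonal entries are correct predictions, and each row sums to the number of test samples of that class, which makes per-class accuracy easy to read off.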

5 Conclusion

In this paper, we have proposed a novel end-to-end network (CTA-Net) for driver's activity recognition and monitoring that employs an innovative attention mechanism. The proposed attention generates a high-dimensional contextual feature encoding for activity recognition by learning to decide the importance of the hidden states of an LSTM that takes inputs from a learnable glimpse sensor. We have shown that capturing coarse temporal relationships ('before', 'during', and 'after') by focusing on certain segments of videos, and learning meaningful temporal and spatial changes, has a significant impact on recognition accuracy. Our proposed architecture notably outperforms existing methods and obtains state-of-the-art accuracy on four major publicly accessible datasets: Drive&Act, Distracted Driver V1, Distracted Driver V2, and SBU Kinect Interaction. We have demonstrated that the proposed end-to-end network is not only suitable for monitoring drivers' activities but is also applicable to traditional human activity recognition problems. Finally, our model's state-of-the-art results on benchmark datasets and the ablation studies justify the design of our approach. Future work will apply the proposed technique to the development of driver assistance systems.

Acknowledgements: This research was supported by the UKIERI (CHARM) under grant DST UKIERI-2018-19-10. The GPU is kindly donated by the NVIDIA Corporation.

Figure 5: Our CTA-Net’s confusion matrix showing 34 fine-grained activities in the Drive&Act test set.
(a) Reading Magazine (Before, During, After)
(b) Reading Newspaper (Before, During, After)
(c) Put on Sunglasses (Before, During, After)
(d) Put on Seat-belt (Before, During, After)
(e) Working (Before, During, After)
Figure 6: More examples demonstrating visual explanations of decisions from our CTA-Net using the class activation map, representing our coarse temporal attention for the 'before' (left), 'during' (middle) and 'after' (right) segments of 5 different actions.
(a) Validation set
(b) Test set
Figure 7: CTA-Net’s confusion matrix of the 34 fine-grained activities using split 0 in Drive&Act dataset
(a) Validation set
(b) Test set
Figure 8: CTA-Net’s confusion matrix of the 34 fine-grained activities using split 1 in Drive&Act dataset
(a) Validation set
(b) Test set
Figure 9: CTA-Net’s confusion matrix of the 34 fine-grained activities using split 2 in Drive&Act dataset
(a) Validation set
(b) Test set
Figure 10: CTA-Net’s confusion matrix of the 12 coarse/scenario tasks using split 0 in Drive&Act dataset
(a) Validation set
(b) Test set
Figure 11: CTA-Net’s confusion matrix of the 12 coarse/scenario tasks using split 1 in Drive&Act dataset
(a) Validation set
(b) Test set
Figure 12: CTA-Net’s confusion matrix of the 12 coarse/scenario tasks using split 2 in Drive&Act dataset