Channel-Temporal Attention for First-Person Video Domain Adaptation

08/17/2021 ∙ by Xianyuan Liu, et al. ∙ The University of Sheffield

Unsupervised Domain Adaptation (UDA) can transfer knowledge from labeled source data to unlabeled target data of the same categories. However, UDA for first-person action recognition is an under-explored problem, with lack of datasets and limited consideration of first-person video characteristics. This paper focuses on addressing this problem. Firstly, we propose two small-scale first-person video domain adaptation datasets: ADL_small and GTEA-KITCHEN. Secondly, we introduce channel-temporal attention blocks to capture the channel-wise and temporal-wise relationships and model their inter-dependencies important to first-person vision. Finally, we propose a Channel-Temporal Attention Network (CTAN) to integrate these blocks into existing architectures. CTAN outperforms baselines on the two proposed datasets and one existing dataset EPIC_cvpr20.




1 Introduction

Video-based action recognition is a challenging computer vision problem. Great progress has been made on several benchmark datasets [Kuehne11, soomro2012ucf101, kay2017kinetics, Damen2018EPICKITCHENS] using architectures based on Convolutional Neural Networks (CNNs) [Feichtenhofer_2016_CVPR, Carreira_2017_CVPR, tran2018closer, 8454294]. However, it remains a major obstacle to generalize models learned on one domain to another domain due to the distribution mismatch between them, i.e., the domain shift. Moreover, domain discrepancy exists not only between different datasets but also between different videos in the same dataset, as shown in Table 1.

Figure 1: First, we learn common channel-wise and temporal-wise attention for both source and target videos to focus on interactions important for actions in first-person videos (e.g. by the hands). Then we adapt the network to align source and target distributions.

Unsupervised Domain Adaptation (UDA) can address the domain shift problem by designing networks that take domain discrepancy into account to learn common features. Such networks can explore a shared distribution for the two domains without labels from the target domain. Recently, UDA has shown remarkable progress on still images such as object detection [Kim_2019_ICCV], person re-identification [Fu_2019_ICCV], and semantic segmentation [Jaritz_2020_CVPR]. While UDA methods for still images focus on minimizing the distribution distance between domains in the spatial feature space [Yan_2017_CVPR, sun2016return, ganin2016domain, long2018conditional], UDA methods for videos aim to minimize the distribution distance between domains in both spatial and temporal feature spaces.

Test \ Train   Video 1   Video 2   Video 3
Video 1        95.8      25.0      27.4
Video 2        41.1      93.5      37.5
Video 3        28.6      24.8      95.1
Table 1: Action recognition accuracy (%) across three videos from the same ADL dataset [pirsiavash2012detecting]. Rows give the test video and columns give the training video.

We can categorize existing video datasets for action recognition into third-person vision and first-person vision datasets [ryoo2013first]. This paper focuses on UDA for first-person action recognition, where videos are collected by a wearable camera, so their availability is limited compared to the abundant third-person datasets. To the best of our knowledge, UDA for first-person action recognition has only been studied on subsets of the EPIC-KITCHEN (EPIC) dataset [Damen2020RESCALING], where the first version in [munro20multi] (EPIC_cvpr20) took 9 hours to train on 8 V100 GPUs with a batch size of 128. For researchers with limited computing resources, it is difficult or infeasible to develop and study UDA models on such a large-scale dataset. This motivates us to build datasets that are smaller in scale yet still meaningful for UDA, to lower the bar and accelerate development in this area. Therefore, we first propose two small-scale, first-person action recognition datasets for UDA: 1) ADL_small: We collect three long-duration videos from the ADL dataset [pirsiavash2012detecting] and restructure them for UDA. 2) GTEA-KITCHEN: We combine two first-person video datasets, GTEA [fathi2011learning] and KITCHEN [de2009guide], for UDA by restructuring the GTEA dataset labels and re-labeling the KITCHEN dataset manually. These two datasets are small-scale but provide large domain shift for UDA research. See more details in Section 4 and Table 4 and Table 5.

There are two categories of UDA networks for action recognition, with different temporal feature extraction mechanisms. The first category [munro20multi] takes a two-stream approach to 1) extract spatial information and temporal information separately from video data, 2) transfer spatial and temporal knowledge separately between datasets, and 3) fuse both in the end. For example, the MM-SADA method [munro20multi] utilizes RGB frames and optical flow as the two streams to boost UDA performance on an EPIC subset (EPIC_cvpr20). However, it has a high space and time complexity. Moreover, although MM-SADA includes a self-supervised module to capture the relationship between RGB frames and optical flow, the extraction processes for spatial and temporal information are separated.

The second category extracts spatio-temporal information from videos directly and transfers spatio-temporal knowledge between domains [Chen_2019_ICCV, jamal2018deep]. [Chen_2019_ICCV] extends image-based UDA to video by adding an attentive temporal alignment to increase UDA performance on third-person videos. However, [Chen_2019_ICCV] needs features extracted by a 2D CNN as the input and does not extract spatio-temporal features from videos directly, so it cannot learn task-specific features in an end-to-end fashion. Moreover, the attention mechanism in [Chen_2019_ICCV] is for the extraction and adaptation of temporal domain features only, rather than spatio-temporal information.

Compared with third-person videos, first-person videos have some unique characteristics. For example, actions tend to occur in local areas, particularly where the hands and objects interact. Thus, we hypothesize that networks paying more attention to such areas can leverage these characteristics to benefit common feature learning in UDA, as shown in Figure 1. In addition, different channels in different layers of CNNs capture different characteristics. The Squeeze-Excitation (SE) block in [hu2018squeeze] can generate attention scores to excite important channels. This inspires us to design excitation attention approaches that weigh the channel-wise and temporal-wise features in the CNN layers to reveal the channel-temporal relationships for first-person videos. To this end, we propose a Channel-Temporal Attention block (CTA) that makes the network pay more attention to action-related features in first-person videos. Moreover, for UDA, the network should focus not only on these important features but also on the common features across domains. Therefore, we utilize an adversarial approach at the video level for alignment to minimize the discrepancy between important channels in the source and target domains.

In summary, our contributions are three-fold:

  • First-person Video Domain Adaptation Dataset Collection: We collect two small-scale first-person datasets for UDA, ADL_small and GTEA-KITCHEN. They both have sufficient domain shift and can lower the bar for researchers to enter this field, stimulate more studies, and accelerate research in this area. To our knowledge, these two datasets are the only datasets besides the EPIC dataset [munro20multi, Damen2020RESCALING] for studying first-person video UDA problems.

  • Channel-Temporal Excitation Attention for First-person Video UDA: We explore different excitation attention approaches for UDA by modeling channel-wise and temporal-wise inter-dependencies. The results show that these approaches can make the network better focus on the key features from first-person videos and benefit the video UDA.

  • Channel-Temporal Attention Network: We propose a new adversarial Channel-Temporal Attention Network (CTAN) with the channel-temporal excitation attention above. We train our network end-to-end based on I3D [Carreira_2017_CVPR] and test on our proposed small-scale datasets and the large-scale EPIC dataset. Our network outperforms all the baselines on the three datasets.

2 Related Works

Figure 2: The proposed Channel-Temporal Attention Network (CTAN) model for first-person action recognition. Source and target domains share the same feature extractor, which is composed of Inception modules and the proposed Channel-Temporal Attention (CTA) blocks. In training, the feature extractor takes labeled source videos and unlabeled target videos as the input and generates source and target features as the output. Source features are fed into both the action and domain classifiers, while target features are only fed to the domain classifier. In test, only target videos are the input to the feature extractor and then the action classifier.

2.1 Action Recognition

There are three categories of action recognition networks, distinguished by how they extract temporal features. The first category takes a two-stream approach that extracts temporal features, typically optical flow, directly and then combines them with spatial features, via late fusion [simonyan2014two], constant fusion [Feichtenhofer_2016_CVPR], and sparse sampling [8454294].

The second category takes spatial features as inputs to extract spatio-temporal features directly with 3D CNNs. C3D [tran2015learning] constructs 3D kernels to extract short-term information from the RGB frame input. R(2+1)D [tran2017convnet] applies skip connections to C3D and explores different 3D and 2D convolution combinations. I3D [Carreira_2017_CVPR] inflates the 2D convolutional and pooling kernels of a 2D CNN trained on image datasets into 3D to reuse well-trained 2D CNN parameters. In general, I3D is considered to be superior to C3D and R(2+1)D.

The third category utilizes temporal modeling to extract spatio-temporal features from spatial inputs, via recurrent neural networks [7558228], multi-scale temporal relation pooling [zhou2017temporalrelation], or spatial feature channel shifting [lin2019tsm].

We choose I3D as our backbone considering the trade-off between performance and model size. However, our proposed method is applicable to the two-stream and temporal modeling networks as well, with potentially better performance but higher computational cost.

2.2 Unsupervised Domain Adaptation

Unsupervised domain adaptation aims to find a common feature space between the labeled source data and unlabeled target data, with three approaches. The first approach is discrepancy-based. It aligns source-target distributions by minimizing a divergence that measures the distance between them, e.g. via weighted Maximum Mean Discrepancy (MMD) [Yan_2017_CVPR] or Correlation Alignment (CORAL) [sun2016return].

The second approach is adversarial-based. It utilizes domain discriminators and conducts adversarial training to reduce the discrepancy. DANN [ganin2016domain] utilizes discriminators and gradient reversal layer (GRL) to accomplish alignment through standard back-propagation training. CDAN [long2018conditional] leverages multilinear and entropy conditioning on discriminative information to enable alignment of multi-modal distributions.

The third approach is reconstruction-based. DRCN [ghifary2016deep] uses a pair-wise squared reconstruction loss to reconstruct the target data, while DSN [bousmalis2016domain] uses scale-invariant mean squared error reconstruction loss to reconstruct both the source and target data.

We consider the first two approaches, specifically, their extensions from image UDA to video UDA.

2.3 Domain Adaptation for Action Recognition

Most domain adaptation models for action recognition consider third-person videos. DAAA [jamal2018deep] utilizes 3D CNN to extract spatio-temporal feature, projects them to a latent subspace, and then uses discriminators to reduce the discrepancy in the subspace. TAN [Chen_2019_ICCV] utilizes temporal relation module from [zhou2017temporalrelation] to extract spatio-temporal features and extends image-based domain adaptation to videos by adding temporal attentive alignment. TCoN [pan2020adversarial] applies a co-attention mechanism to CNN features to guide the network to focus on common keyframes across domains on RGB and optical flow.

For domain adaptation on first-person videos, MM-SADA [munro20multi] adopts two-stream networks and utilizes self-supervised multi-modal UDA to learn the relationship between RGB and optical flow. It has a high computational cost due to the use of optical flow for creating multi-modal inputs. It also does not consider the unique characteristics of first-person videos.

In contrast to the above approaches, this paper will utilize the excitation attention approach to take the characteristics of first-person videos into account in a channel-wise and temporal-wise manner.

3 Proposed Method

Figure 2 shows the proposed unsupervised domain adaptation model for first-person action recognition, named Channel-Temporal Attention Network (CTAN). In training, source and target videos are fed into a feature extractor that modifies the I3D [Carreira_2017_CVPR] pretrained on Kinetics by adding multiple channel-temporal attention (CTA) blocks. Each proposed CTA block contains a channel attention module and a temporal attention module, and it is inserted into I3D to re-calibrate both channel-wise and temporal-wise features. After feature extraction, source features are fed into an action classifier, and both source and target features are fed into a discriminator for adversarial domain discrimination. In test, only target videos are taken as the input to the feature extractor, whose features the action classifier uses to predict the action class.

3.1 Channel-Temporal Attention Module

In first-person video action recognition, different channels of CNN layers capture different spatio-temporal information related to the action. Such spatio-temporal information can benefit domain adaptation for action recognition. Firstly, inspired by the SE block [hu2018squeeze] that excites informative features in input image channels, we extend the SE block to channel-wise attention (CA) module for video input, as shown in Figure 3. Secondly, people can usually recognize an action at a glance as long as they see a small but informative part of this action. This phenomenon inspires us to extend the previous CA module to the temporal-wise attention (TA) module so that our network can focus on informative features in temporal dimensions. Finally, we integrate the CA and TA modules into the channel-temporal attention (CTA) module described in detail below.

(a) Channel-wise attention module
(b) temporal-wise attention module
Figure 3: Architecture of the channel-temporal attention module.

Consider a 5D video feature $X \in \mathbb{R}^{N \times T \times C \times H \times W}$, where $N$, $T$ and $C$ denote batch size, temporal dimension and feature channel size, and $H$ and $W$ correspond to height and width. First, we utilize 3D average pooling to extract channel-wise information over dimensions $T$, $H$ and $W$. Then, for efficiency, we capture the channel-reduced feature by a linear layer with parameters $\mathbf{W}_1$ and a reduction ratio $r$:

$$F_c = \mathrm{AvgPool}_{T,H,W}(X), \qquad F_c^{r} = \mathbf{W}_1 F_c,$$

where $F_c \in \mathbb{R}^{N \times C}$ denotes the output of pooling and $F_c^{r} \in \mathbb{R}^{N \times C/r}$ denotes the squeezed channel-reduced feature.

Another linear layer with parameters $\mathbf{W}_2$ is used to restore the channel dimension of the feature, and a sigmoid function $\sigma$ is used to capture the channel-attentive weights $A_c$. In order to excite the informative channels, we compute a Hadamard product between these weights and the video feature as

$$A_c = \sigma(\mathbf{W}_2 F_c^{r}), \qquad X_c = X + X \odot A_c.$$

$X_c$ denotes the output of the channel attention module with the excited and enhanced channel-wise informative features. Because wrong channel attention may hurt the performance to some degree and some channel attention may suppress other information, we add the residual term $X$ to mitigate these negative effects.

We then feed the output $X_c$ into the temporal module and conduct 3D average pooling on the video feature over $C$, $H$ and $W$ to extract the temporal-wise information. We adopt two linear layers to model the temporal feature. The sigmoid function is again adopted to get the temporal-attentive weights $A_t$. The Hadamard product is computed to excite temporal features on the input $X_c$. A skip connection is also applied to prevent temporal attention from suppressing other information.

$$F_t = \mathrm{AvgPool}_{C,H,W}(X_c), \qquad F_t^{r} = \mathbf{W}_3 F_t, \qquad A_t = \sigma(\mathbf{W}_4 F_t^{r}), \qquad X_{ct} = X_c + X_c \odot A_t,$$

where $F_t$ and $F_t^{r}$ denote the squeezed temporal feature and the temporal-reduced feature, $\mathbf{W}_3$ and $\mathbf{W}_4$ are the parameters of the linear layers, and $X_{ct}$ is the excited output.
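The channel-then-temporal excitation described in this section can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the weight names (`w1`, `w2`, `v1`, `v2`), the (N, T, C, H, W) layout and the omission of biases are all assumptions. The sketch follows the text literally and uses two plain linear layers; the original SE block places a ReLU between them, and whether CTA keeps it is not stated here.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """Channel excitation with a residual connection.

    x:  video feature of shape (N, T, C, H, W) (assumed layout)
    w1: (C, C // r) reduction weights, w2: (C // r, C) restoring weights
    """
    squeezed = x.mean(axis=(1, 3, 4))           # pool over T, H, W -> (N, C)
    reduced = squeezed @ w1                      # (N, C // r)
    weights = sigmoid(reduced @ w2)              # (N, C), values in (0, 1)
    weights = weights[:, None, :, None, None]    # broadcast over T, H, W
    return x + x * weights                       # residual + excited feature


def temporal_attention(x, v1, v2):
    """Temporal excitation with a skip connection (same pattern over T)."""
    squeezed = x.mean(axis=(2, 3, 4))           # pool over C, H, W -> (N, T)
    reduced = squeezed @ v1                      # (N, reduced temporal size)
    weights = sigmoid(reduced @ v2)              # (N, T)
    weights = weights[:, :, None, None, None]    # broadcast over C, H, W
    return x + x * weights


def cta_block(x, w1, w2, v1, v2):
    """Channel attention followed by temporal attention (CT order)."""
    return temporal_attention(channel_attention(x, w1, w2), v1, v2)
```

Because each module adds the input back to the excited feature, every element is scaled by a factor between 1 and 2 per module, so attention can emphasize a channel or time step but never zero it out.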

3.2 Adversarial UDA

After introducing the CTA for extracting important channel-temporal features from the source domain, we need to consider how to design the network to learn the common features across domains. For UDA, the network needs to learn common features across domains while focusing on the features important for target domain classification. Discrepancy-based and adversarial-based approaches such as DAN [long2015learning] and DANN [ganin2016domain] are easier to adapt to our task than reconstruction-based UDA. Between these two, the linear version of DAN needs a large batch size to avoid a negative MMD loss, so it requires more computation resources than DANN.

In this paper, considering the limited computation resources, we utilize the domain adversarial neural network (DANN) [ganin2016domain], in which a two-player mini-max game is constructed. The main idea of DANN is to add one domain classifier to discriminate whether the data is from the source or target domain. The parameters of the domain classifier are trained by minimizing the discriminator loss $\mathcal{L}_d$, while the feature extractor maximizes this discriminator loss to train the extractor parameters. The aim is to confuse the discriminator so that the feature extractor is guided to learn common features between the source and target domains. Here, we utilize a discriminator as in DANN to align features extracted by the feature extractor across domains. The domain loss is defined for each video input $x_i$ as:

$$\mathcal{L}_d = -\frac{1}{n} \sum_{x_i \in \mathcal{S} \cup \mathcal{T}} \left[ d_i \log G_d(G_f(x_i)) + (1 - d_i) \log\left(1 - G_d(G_f(x_i))\right) \right],$$

where $\mathcal{S}$ and $\mathcal{T}$ are the source and target domains respectively, $n$ is the number of samples from both domains, $d_i$ is the domain label of $x_i$, and $G_f$ and $G_d$ denote the feature extractor and the domain discriminator. If $x_i$ is from the source (target) domain, $d_i$ is set as 1 (0).
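As a small worked example of the domain loss, the sketch below computes a mean binary cross-entropy over discriminator outputs, with domain label 1 for source and 0 for target as stated above. The function name `domain_loss`, the averaging over samples, and the clamping constant `eps` are assumptions made for illustration.

```python
import math


def domain_loss(probs, domain_labels, eps=1e-12):
    """Mean binary cross-entropy between discriminator outputs and domain labels.

    probs:         predicted probability that each sample is from the source domain
    domain_labels: 1 for source samples, 0 for target samples
    eps:           clamp to keep log() finite (illustrative choice)
    """
    total = 0.0
    for p, d in zip(probs, domain_labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= d * math.log(p) + (1 - d) * math.log(1.0 - p)
    return total / len(probs)
```

A discriminator that cannot tell the domains apart outputs 0.5 everywhere, which gives a loss of log 2; this is the state the feature extractor pushes towards when it maximizes the loss.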

3.3 Integration with I3D Network

Finally, we integrate the proposed modules and adversarial UDA into I3D, as illustrated in Figure 2. The channel-temporal attention modules are integrated into I3D after Inception modules. Following the finding in [hu2018squeeze] that lower layer features are typically more general, while higher layer features have greater specificity, we integrate our proposed modules into Inception modules “Mixed4b” to “Mixed4f” in the I3D architecture, rather than into earlier or later modules. The adversarial discriminator and a two-layer action classifier are integrated after the average pooling layers of I3D. A gradient reversal layer (GRL) is also used between the discriminator and the feature extractor to invert the gradient. The overall loss can be expressed as follows:

$$\mathcal{L} = \mathcal{L}_y(x_i, y_i) - \lambda \mathcal{L}_d,$$

where $\lambda$ is a hyper-parameter that trades off domain adaptation against classification, $y_i$ refers to the action label of input $x_i$, $\mathcal{L}_y$ is the action classification loss on labeled source inputs, and $\mathcal{L}_d$ is the domain loss of Section 3.2. The whole network is trained with two cross-entropy losses, $\mathcal{L}_y$ and $\mathcal{L}_d$.
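To make the role of the gradient reversal layer concrete, the parameter updates can be written out in the standard DANN formulation [ganin2016domain]. The symbols here are assumptions introduced for illustration: $\theta_f$, $\theta_y$ and $\theta_d$ are the parameters of the feature extractor, action classifier and discriminator, $\mu$ is the learning rate, and $\mathcal{L}_y$ and $\mathcal{L}_d$ are the action classification loss and the domain loss.

$$\theta_f \leftarrow \theta_f - \mu \left( \frac{\partial \mathcal{L}_y}{\partial \theta_f} - \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_f} \right), \qquad \theta_y \leftarrow \theta_y - \mu \frac{\partial \mathcal{L}_y}{\partial \theta_y}, \qquad \theta_d \leftarrow \theta_d - \mu \frac{\partial \mathcal{L}_d}{\partial \theta_d}.$$

The GRL acts as the identity in the forward pass and multiplies the gradient by $-\lambda$ in the backward pass, which is what produces the minus sign in the first update. The discriminator descends on $\mathcal{L}_d$ while the feature extractor ascends on it, so all three updates can be carried out with ordinary back-propagation.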

4 Proposed First-Person UDA Datasets

Figure 4: The distribution of classes in the proposed datasets. Panels: (a) ADL, (b) GTEA-KITCHEN.

                              EPIC                GTEA-KITCHEN                        ADL
Resolution                    640x480             456x256 (GTEA), 342x256 (KITCHEN)   342x256
Frame rate                    60                  15 (GTEA), 30 (KITCHEN)             30
Number of classes             8                   6                                   7
Number of action videos       10094               454                                 222
Domains                       D1, D2, D3          D1, D2                              D1, D2, D3
Number of training segments   1543, 2495, 3897    1166, 2582                          570, 633, 421
Number of test segments       435, 750, 974       291, 646                            142, 159, 106
Table 2: The comparison of the first-person cross-domain video datasets. Note that our proposed datasets adopt a different sampling method from EPIC due to data augmentation.

The EPIC-KITCHEN dataset (EPIC) is the largest dataset for first-person action recognition, with daily activities captured in the kitchen [Damen2020RESCALING]. In this paper, we follow the same setting as [munro20multi] and refer to P08, P01 and P22 from EPIC as D1, D2 and D3 to build EPIC_cvpr20 for UDA. The eight verb classes in EPIC_cvpr20 are put, take, open, close, mix, pour, wash and cut. Despite being a subset of EPIC, EPIC_cvpr20 is still large-scale: the training in [munro20multi] took 9 hours on 8 NVIDIA V100 GPUs with a batch size of 128. This sets a high bar for researchers with limited computing resources to do research in this area. Therefore, we propose two small-scale but useful datasets to provide more options for evaluating first-person DA approaches. For both datasets, we follow the EPIC_cvpr20 settings and select the same verb classes from our datasets. ADL_small covers the first seven classes, while GTEA-KITCHEN covers the first six, as shown in Figure 4.

Figure 5: Example frames in the class mix/stir from the proposed datasets: top: D1, D2 and D3 from ADL; bottom: GTEA and KITCHEN.

4.1 ADL

The ADL dataset is an activity dataset of daily living in first-person camera views [pirsiavash2012detecting], containing 10 hours of action videos by 20 persons in 20 different apartments. Each video records similar actions by different persons in different environments, which causes significant domain shift as shown in Table 1. The original annotation provides action labels. To be consistent with EPIC, we extract the verbs from the action annotations and reorganize them as verb annotations. However, ADL only contains the first seven classes. Therefore, we select videos containing these seven classes. These videos are P4, P6 and P11, which we refer to as D1, D2 and D3 respectively. Considering that the whole dataset has only 222 action videos, we augment the data by extracting 16-frame segments from each action video, with a 4-frame overlap between adjacent segments, and treat each segment as an action sample. After the augmentation, we finally present a new small-scale dataset for UDA on first-person videos, ADL_small. As the class distribution in Fig. 4(a) shows, each domain has a distinct distribution. Although D3 has all seven classes, it contains only a small number of samples in each class, and it remains the smallest domain after data augmentation.
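The segment-based augmentation can be sketched as a sliding window over frame indices. This is a minimal illustration, assuming that segments start at frame 0, that adjacent segments share exactly 4 frames (a stride of 12), and that trailing frames which do not fill a complete segment are dropped; the function name `extract_segments` is hypothetical and the handling of leftover frames is not specified in the text.

```python
def extract_segments(num_frames, segment_len=16, overlap=4):
    """Return (start, end) frame index pairs, end exclusive, for one action video.

    Adjacent segments overlap by `overlap` frames, so the window advances by
    segment_len - overlap frames each step. Leftover frames at the end that
    cannot fill a whole segment are dropped (an assumption of this sketch).
    """
    stride = segment_len - overlap
    last_start = num_frames - segment_len
    return [(start, start + segment_len) for start in range(0, last_start + 1, stride)]
```

For example, a 40-frame action video yields three segments covering frames 0 to 15, 12 to 27 and 24 to 39, whereas a video shorter than 16 frames yields none under these assumptions.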

4.2 GTEA-KITCHEN

The GTEA [li2015delving, fathi2011learning] and KITCHEN [de2009guide] datasets are both first-person video datasets recording actions in the kitchen. However, GTEA videos are recorded in a real kitchen, while KITCHEN videos are recorded in a temporarily built kitchen in the lab. As the dataset names suggest, most actions in these two datasets are about cooking, resulting in six overlapping verbs. We use the whole GTEA and KITCHEN datasets as two domains. As in ADL, we extract the verb annotation from the action annotation in GTEA. However, the KITCHEN annotation contains many useless segments. Therefore, we re-annotate the verbs in KITCHEN to make these two datasets match each other. This leads to another new first-person video dataset for UDA, GTEA-KITCHEN. The class distribution is shown in Figure 4(b).

Table 2 summarizes the key statistics for the three datasets. GTEA-KITCHEN has more samples than ADL but fewer than EPIC. For our proposed datasets, we show some sample frames from the class mix/stir in Figure 5. We will study the source-only and target-only recognition accuracy as well. The target-only setting means training and test are both on the target dataset, while the source-only setting means the model trained on the source dataset is directly tested on the target dataset without UDA. These two results serve as the upper and lower bounds of UDA on these datasets.

5 Experiments

We evaluate our proposed method on the three datasets: ADL, GTEA-KITCHEN and EPIC against other image-based DA networks including DAN [long2015learning], DANN [ganin2016domain] and CDAN [long2018conditional].

5.1 Experimental Setup

Datasets. For EPIC, we follow the dataset settings in MM-SADA [munro20multi] to split EPIC into training and test sets and randomly sample a 16-frame segment from each action video as the input for both training and test. For our proposed datasets, we follow a similar experimental protocol. We extract 16-frame segments from each action video as samples, with adjacent samples overlapping by 4 frames for data augmentation. All the segments are divided into training and test sets at a ratio of 8:2, with details in Table 2.

Implementation details. We utilize I3D as our backbone for feature extraction and train all our networks end-to-end. Each domain discriminator and classifier is composed of 2 fully connected layers with a dimension of 100 and a ReLU activation function. The only difference from the I3D setting is that a dropout rate of 0.5 and a soft-max activation are applied to the classifier to avoid over-fitting and to predict class labels. We select the outputs of the final average pooling layer of I3D, with a dimension of 1024, as the inputs for the discriminator and classifier.

In the training process, we use the labeled source data and unlabeled target data, while in the test process we only utilize the unlabeled target data. For both source and target, the input data is the 16-frame segments sampled from the action video, and each frame is resized to 256×256 and randomly cropped to 224×224. Optimisation is performed using SGD with a momentum of 0.9 and a batch size of 16. A weight decay of 5e-4 is applied to all parameters. The training process is divided into two stages. First, we set $\lambda$ to 0 and train the feature extractor and classifier at a learning rate of 1e-2 for 10 epochs. Second, we follow the same strategy as [ganin2016domain] to increase $\lambda$ from 0 to 1 and reduce the learning rate, and train the overall network for a further 20 epochs.
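The second training stage refers to the schedule of [ganin2016domain], which ramps the adaptation weight up and anneals the learning rate as a function of training progress p in [0, 1]. The sketch below uses the constants reported in that reference (gamma = 10, alpha = 10, beta = 0.75); whether this paper uses exactly these constants, and the choice of 1e-2 as the base learning rate for the second stage, are assumptions, as are the function names.

```python
import math


def adaptation_weight(progress, gamma=10.0):
    """Adaptation weight rising smoothly from 0 (progress=0) towards 1 (progress=1)."""
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


def annealed_learning_rate(progress, base_lr=1e-2, alpha=10.0, beta=0.75):
    """Learning rate decaying from base_lr as training progresses."""
    return base_lr / (1.0 + alpha * progress) ** beta


def progress_at(epoch, step, steps_per_epoch, total_epochs=20):
    """Fraction of the adaptation stage completed (20 epochs in this paper)."""
    return (epoch * steps_per_epoch + step) / (total_epochs * steps_per_epoch)
```

Starting the weight at 0 keeps the noisy early discriminator signal from disturbing the features learned in the first stage, and the weight is already close to 1 halfway through the adaptation stage.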

5.2 Experimental Results

Baselines. Considering that MM-SADA [munro20multi] has a two-stream architecture and needs extracted features, we do not include it as a baseline. We extend three state-of-the-art image-based UDA models, DAN [long2015learning], DANN [ganin2016domain] and CDAN [long2018conditional], to videos as our baselines. For fair comparison, we follow their default settings except for the feature extraction. We conduct each experiment three times with different random seeds and report the average accuracy on the target test set. The best result for each task is highlighted in bold, and the second best is underlined.

Task     Source   DANN   CDAN   DAN    CTAN   Target
D1→D2    39.4     39.4   40.7   36.3   41.3   52.8
D1→D3    32.0     32.9   30.3   34.1   35.0   52.8
Mean     35.7     36.2   35.5   35.2   38.1   52.8
Gain     -        0.5    -0.2   -0.5   2.4    -
Table 3: The comparison of accuracy (%) with other approaches on EPIC. On average, we outperform the source-only performance by 2.4%. Source refers to source-only and Target refers to target-only. Gain represents the absolute difference from the source-only accuracy. The best result for each task is in bold, and the second best is underlined.

EPIC. We first evaluate CTAN on EPIC. We select the two hardest UDA tasks, D1→D2 and D1→D3, according to [munro20multi]. The results are shown in Table 3. CTAN achieves higher recognition accuracy than all other baselines on the selected domains. Both CTAN and DANN improve over the source-only baseline on average. Specifically, DANN slightly improves on the source-only baseline by 0.5%, while CTAN improves on it by 2.4% on average, which supports the efficacy of our proposed method.

GTEA-KITCHEN. We then evaluate our network on our proposed GTEA-KITCHEN dataset, and the results are shown in Table 4. For the G→K task, CTAN outperforms the discrepancy-based approach DAN [long2015learning] by 3.3%, and the adversarial-based approaches CDAN [long2018conditional] by 7.1% and DANN by 2.9%. For the K→G task, though the performance of CTAN is slightly lower than that of CDAN, it is still better than the other networks. On average, CTAN achieves the highest accuracy. In comparison, although CDAN performs best in the K→G task, it gets the worst result (34.0%) in the G→K task.

ADL. We finally compare CTAN to the baselines in Table 5. In general, our network improves on the source-only baseline in 5 out of 6 cases, by 2.5% on average. For the D1→D2 and D1→D3 tasks, only CTAN achieves better recognition accuracy than the source-only baseline. For the D2→D1 and D2→D3 tasks, though CTAN does not outperform CDAN (D2→D1) and DAN (D2→D3), it still performs much better than DANN (by 3.9% and 5.1% respectively) and source-only (by 6.0% and 4.1% respectively).

Overall, CTAN performs strongly on the three datasets. CTAN considerably improves on the source-only baseline and DANN, which again supports its efficacy. It improves on the source-only baseline in 9 out of 10 domain pairs. In contrast, DAN, CDAN and DANN improve on the source-only baseline in 5, 6 and 7 out of 10 cases respectively. The only exception is the D3→D1 task in ADL, which is reasonable because D3 is the smallest domain in this dataset and is imbalanced, as shown in Fig. 4(a). In UDA, transferring knowledge from a small dataset with imbalanced classes to a large dataset is still challenging. Similarly, the D3→D2 task is improved only slightly by CTAN, as shown in Table 5.

Task     Source   DANN   CDAN   DAN    CTAN   Target
G→K      36.8     38.2   34.0   37.8   41.1   95.9
K→G      45.9     46.5   48.4   43.4   47.6   94.5
Mean     41.4     42.4   41.2   40.6   44.3   95.2
Gain     -        1.0    -0.2   -0.8   2.9    -
Table 4: The comparison of accuracy (%) with other approaches on GTEA-KITCHEN. Source refers to source-only and Target refers to target-only.

5.3 Ablation Study and Analysis

Saturation level of Proposed Datasets. We examine the saturation level of our proposed datasets. As shown in Table 5, the average accuracy gap between source-only and target-only results for ADL is 64.8%. From Table 4, the average gap for GTEA-KITCHEN is over 50%. CTAN narrows the gap by 2.9%, but there is still considerable room for further improvement. This shows that our proposed small-scale datasets are challenging and far from saturation, so they can support further research in this area.

Domain Discrepancy in Datasets. We then investigate why the datasets have a large discrepancy. We take ADL as the example and visualize the distribution of the output of the feature extractor’s last average pooling layer in both the target-only setting and CTAN, as shown in Figure 6. The features from the classes pour, wash and mix/stir are easy to separate in both Figure 6(a) and Figure 6(b). In comparison, the features from the two classes take and put are mixed together. Pour, wash and mix/stir always occur with specific objects such as a kettle or a faucet, whereas take and put can occur with many objects in many environments. Even in Figure 6(a), which is the upper bound performance, the features from the take and put classes are not fully separated. This helps us to better understand the challenges of our proposed dataset, and inspires us to tackle them in the future.

Task     Source   DANN   CDAN   DAN    CTAN   Target
D1→D2    41.1     40.6   41.1   35.9   43.2   95.8
D1→D3    28.6     28.1   27.3   26.6   31.5   95.8
D2→D1    25.0     27.1   34.0   25.7   31.0   93.5
D2→D3    24.8     23.8   26.2   31.1   28.9   93.5
D3→D1    27.4     29.5   23.6   31.2   26.7   95.1
D3→D2    37.5     37.5   42.7   36.5   38.2   95.1
Mean     30.7     31.1   32.5   31.2   33.3   94.8
Gain     -        0.4    1.8    0.5    2.5    -
Table 5: The comparison of accuracy (%) with other approaches on ADL. Source refers to source-only and Target refers to target-only.

Figure 6: The comparison of t-SNE visualization for the feature extractor output on ADL, with D1 as the source domain and D2 as the target domain. Panels: (a) Target-only, (b) CTAN.

Inter-dependencies. We also evaluate different variants of the CT-block to explore the channel-wise and temporal-wise inter-dependencies. We construct four different blocks. C-block contains only channel-wise attention, while T-block contains only temporal-wise attention. CT-block and TC-block both contain channel-wise and temporal-wise attention but in different orders. As shown in Table 6, firstly, CT-block achieves the best accuracy among the four structures and C-block outperforms T-block. This suggests that making the network focus on important channels benefits feature extraction and UDA. Secondly, channels seem to benefit the network more than temporal dimensions, because channels carry spatio-temporal information while temporal dimensions do not seem to carry spatial information. Thirdly, the performance of T-block is poorer than that of the baseline DANN in most pairs. This suggests that simply paying more attention to temporal information may suppress spatial information.

D1→D2   40.6   44.8   40.9   40.6   43.2
D1→D3   28.1   27.6   27.9   29.4   31.5
D2→D1   27.1   30.6   25.7   30.3   31.0
D2→D3   23.8   27.8   28.9   28.9   28.9
D3→D1   29.5   19.8   24.0   31.9   26.7
D3→D2   37.5   39.9   37.2   31.3   38.2
Mean    31.1   31.8   30.8   32.1   33.2
Gain       -    0.7   -0.3    1.0    2.1
Table 6: The comparison of recognition accuracy (%) with different blocks on ADL.

6 Conclusion and Future Work

This paper proposed two small-scale action recognition datasets for first-person video domain adaptation, ADL_small and GTEA-KITCHEN, both of which have large domain discrepancy. We use these datasets to explore channel-wise and temporal-wise relationships, and propose channel-wise and temporal-wise excitation attention modules for video that make the network focus on the important channels and temporal dimensions of the CNN features. Finally, we propose the Channel-Temporal Attention Network (CTAN), which applies attention across channels and temporal dimensions. Our network outperforms the baselines on our proposed small-scale datasets and on an existing large-scale dataset.

Future work falls into two directions. First, we will continue to focus on the ADL and GTEA-KITCHEN datasets because of their large domain discrepancy. We will extend more image-based UDA methods to these two datasets and explore how to better recognize put and take. Second, we plan to integrate CTAN with more baselines to test its robustness. Since CTAN only uses video-level alignment, we will also integrate local alignment methods such as channel-wise alignment so that CTAN focuses on more common and important information.


1 Datasets

This section introduces the details of our small-scale datasets. We follow a setting similar to that of EPIC [munro20multi] to create the datasets. Note that mix and stir are both categorised as mix in EPIC, and we keep the same setting.

1.1 ADL

We created the ADL_small dataset by collecting three videos, P4, P6 and P11, from the original ADL dataset [pirsiavash2012detecting]. All the videos record real daily-life activities. The total length of the videos is one hour and 22 minutes. We restructured the verb annotation in [singh2016first] by removing unclear and non-overlapping labels, and segmented all the untrimmed videos into action video clips according to the annotation. We selected seven categories from these videos: put, take, open, close, mix, pour and wash. The shortest action video is one second long, while the longest is 46 seconds. After restructuring, we extracted 16-frame action segments from each action video, with a 4-frame overlap between consecutive segments. We then split the action segments of each category equidistantly into training and test sets with a ratio of 8:2. During training, we further split the training set randomly into training and validation sets with a ratio of 9:1.
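The segmentation and split described above can be sketched as follows. This is a minimal illustration rather than the authors' code, and it rests on several assumptions: that the 4-frame overlap means consecutive 16-frame windows advance by 12 frames, that trailing frames too short for a full window are dropped, that the per-category training count is 80% rounded to the nearest integer (which agrees with the per-category counts in Table 7), and that "equidistantly" means the test indices are spread evenly over the temporally ordered segments. The function names are hypothetical.

```python
def extract_segments(num_frames, window=16, overlap=4):
    """Return (start, end) frame index pairs of fixed-length segments.

    Assumption: consecutive windows share `overlap` frames, so the stride
    is window - overlap; leftover frames at the end are dropped.
    """
    stride = window - overlap
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


def equidistant_split(items):
    """Split one category's ordered segments into train/test at 8:2.

    Assumption: n_train = round(0.8 * n), computed in integer arithmetic,
    and test items are taken at evenly spaced positions.
    """
    n = len(items)
    n_train = (8 * n + 5) // 10
    n_test = n - n_train
    # Evenly spaced test positions (midpoints of n_test equal-width bins).
    test_idx = {((2 * k + 1) * n) // (2 * n_test) for k in range(n_test)}
    train = [x for i, x in enumerate(items) if i not in test_idx]
    test = [x for i, x in enumerate(items) if i in test_idx]
    return train, test
```

Under these assumptions, the 89 put segments of D2 in Table 7 would yield 71 training and 18 test segments, matching the table.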

As shown in Table 7, ADL contains three domains, D1, D2 and D3, which refer to P4, P6 and P11, respectively. D1 includes 570 training segments and 142 test segments. D2 includes 633 training segments and 159 test segments. D3 includes 421 training segments and 106 test segments. In total, ADL includes 222 original action videos and 2031 extracted action segments. ADL has two characteristics. First, compared with EPIC, ADL covers daily life rather than cooking. This leads to larger variation between actions in the same category. For example, put toothpaste on toothbrush and put computer on table both belong to put. Second, some categories are imbalanced: mix in D2 and wash in D3 each have only one action video. This increases the difficulty of UDA on this dataset because the network needs to learn common features from only one action video.

                             Verb category
                      Total   put  take  open close   mix  pour  wash
D1  Action video         80     9    19    11     9     4    13    10
    Action segment      712    46    73    45    41   108   137   262
    Training segment    570    37    58    36    33    86   110   210
    Test segment        142     9    15     9     8    22    27    52
D2  Action video         97    22    32    10    12     1    11     3
    Action segment      792    89   198    59    49    24   252   121
    Training segment    633    71   158    47    39    19   202    97
    Test segment        159    18    40    12    10     5    50    24
D3  Action video         45     3    20     7     6     2     5     1
    Action segment      527    42   164    50    58   109    65    39
    Training segment    421    34   131    40    46    87    52    31
    Test segment        106     8    33    10    12    22    13     8
Table 7: The summary of ADL dataset.

1.2 GTEA-KITCHEN

GTEA: S1_Cheese_C1, S1_Coffee_C1, S1_CofHoney_C1, S1_Hotdog_C1, S1_Pealate_C1, S1_Peanut_C1, S1_Tea_C1, S2_Cheese_C1, S2_Coffee_C1, S2_CofHoney_C1, S2_Hotdog_C1, S2_Pealate_C1, S2_Peanut_C1, S2_Tea_C1, S3_Cheese_C1, S3_Coffee_C1, S3_CofHoney_C1, S3_Hotdog_C1, S3_Pealate_C1, S3_Peanut_C1, S3_Tea_C1
KITCHEN: S07_Brownie, S09_Brownie, S12_Brownie, S13_Brownie, S14_Brownie, S16_Brownie, S17_Brownie, S18_Brownie, S19_Brownie, S20_Brownie, S22_Brownie, S24_Brownie
Table 8: The lists of all collected videos in GTEA and KITCHEN dataset.
Figure 1: Example frames of all categories in GTEA-KITCHEN dataset. (a) Example frames of GTEA dataset. (b) Example frames of KITCHEN dataset. From left to right: put, take, open, close, mix, pour.
put: put cheese, put mayonnaise, put mustard, put bread, put coffee, put sugar, put water, put honey, put hotdog, put ketchup, put jam, put peanut, put tea
take: take bread, take cheese, take mayonnaise, take mustard, take cup, take coffee, take spoon, take sugar, take water, take honey, take hotdog, take jam, take peanut
open: open cheese, open mayonnaise, open mustard, open coffee, open sugar, open water, open ketchup, open honey, open chocolate, open peanut, open tea
close: close mayonnaise, close mustard, close coffee, close sugar, close water, close honey, close jam, close chocolate
mix: stir spoon, stir cup
pour: pour mayonnaise, pour mustard, pour water, pour coffee, pour sugar, pour ketchup, pour chocolate, pour honey
Table 9: The lists of all collected verb categories in GTEA dataset.
                                   Verb category
                           Total   put  take  open close   mix  pour
GTEA     Action video        115    20    31    16    15    17    16
         Action segment     1457   174   387   267   146   105   378
         Training segment   1166   139   310   214   117    84   302
         Test segment        291    35    77    53    29    21    76
KITCHEN  Action video        339    39   109    49    15    51    76
         Action segment     3228   304   442   433   180   818  1051
         Training segment   2582   243   354   346   144   654   841
         Test segment        646    61    88    87    36   164   210
Table 10: The summary of GTEA-KITCHEN dataset.

We first collected all 28 videos in the original GTEA dataset [fathi2011learning], as shown in Table 8. The total length of the videos is around 35 minutes. The shortest action video is one second long, while the longest is about 10 seconds. Note that all videos in GTEA contain continuous actions without interruption. Therefore, GTEA has a shorter video length but more action videos. We extracted the verb from each action category and collected all the relevant verb categories that overlap between GTEA and EPIC, which results in six categories: put, take, open, close, mix and pour. Each verb category corresponds to multiple action categories in the original GTEA dataset, as shown in Table 9. We then created action video clips, action segments, and the training, validation and test sets with the same setting as in ADL, as shown in Table 10.
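The verb extraction step can be illustrated with a short sketch. It assumes that each GTEA action label is a "verb object" string as listed in Table 9 and that stir is merged into mix, as stated at the start of this section. The names `VERB_CATEGORIES`, `VERB_ALIASES` and `verb_category` are hypothetical and do not come from the authors' code.

```python
# The six verb categories shared by GTEA and EPIC (from the text above).
VERB_CATEGORIES = ("put", "take", "open", "close", "mix", "pour")

# Assumption: "stir" actions are folded into "mix", following the EPIC setting.
VERB_ALIASES = {"stir": "mix"}


def verb_category(action_label):
    """Map an action label such as 'put cheese' to its verb category.

    Returns None for labels whose verb is not one of the six categories,
    which corresponds to dropping non-overlapping labels.
    """
    parts = action_label.strip().lower().split()
    if not parts:
        return None
    verb = VERB_ALIASES.get(parts[0], parts[0])
    return verb if verb in VERB_CATEGORIES else None
```

For instance, `verb_category("stir cup")` gives `"mix"`, while a label whose verb falls outside the six categories is mapped to `None` and can be discarded.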

We collected 12 videos from the KITCHEN dataset [de2009guide], as shown in Table 8. The total length is about one hour and 36 minutes. The minimum and maximum lengths are one second and 80 seconds, respectively. We manually annotated the verb categories for our dataset according to the overlapping categories of GTEA. We followed the same setting to create the training, validation and test sets, as shown in Table 10. Note that the number of mix action segments is 3272, which is larger than the sum of all other categories. In our experiments, we randomly selected a quarter of them so that mix has a sample size similar to that of the second-largest category, pour, because we do not explore extremely imbalanced UDA in this paper.
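A minimal sketch of the random subsampling of the mix category is given below. It only illustrates the idea of keeping a quarter of one category; the fixed seed, the function name `downsample_category` and the list-based data layout are assumptions made for this example.

```python
import random


def downsample_category(segments, labels, category="mix",
                        keep_fraction=0.25, seed=0):
    """Keep a random fraction of one category and all other segments.

    Returns (segment, label) pairs in their original order.
    The seed is an assumption added for reproducibility.
    """
    rng = random.Random(seed)
    idx = [i for i, y in enumerate(labels) if y == category]
    keep = set(rng.sample(idx, int(len(idx) * keep_fraction)))
    return [(s, y) for i, (s, y) in enumerate(zip(segments, labels))
            if y != category or i in keep]
```

Applied to 3272 mix segments, this keeps 818 of them, which is the mix count reported for KITCHEN in Table 10.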

We then built the GTEA-KITCHEN dataset, which is the first first-person video dataset across different datasets for UDA on action recognition. It includes 1166 training segments and 291 test segments from GTEA, and 2582 training segments and 646 test segments from KITCHEN. GTEA-KITCHEN has three characteristics that provide more options for exploring UDA networks. First, the resolution difference between GTEA and KITCHEN is a challenge for UDA. The resolution of GTEA data is 456 × 256, while that of KITCHEN is 342 × 256, as shown in Figure 1. Second, illumination change in KITCHEN increases the difficulty of UDA. The brightness in GTEA is nearly constant, while the brightness in KITCHEN is sometimes very low, as shown in Figure 1. Third, the extremely imbalanced mix category mentioned above leads to another challenge worth studying: class-imbalanced domain adaptation.

1.3 Implementation details

First, we downloaded all videos of our datasets from their official websites. The videos occupy 4 GB for ADL, 122 MB for GTEA and 887 MB for KITCHEN. Second, we extracted frames from the videos at their respective sampling rates, which are 15 fps for GTEA and 30 fps for both ADL and KITCHEN. Third, we restructured the annotations following the details in Sections 1.1 and 1.2. Finally, we will also publish our annotation files on GitHub.
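The frame extraction step could be scripted along the following lines. The text does not say which tool was used, so the use of ffmpeg here is an assumption, as are the output file pattern, the video file extension and the helper name `ffmpeg_command`. The sketch only builds the command and does not run it.

```python
# Sampling rates stated in the text above.
FPS = {"GTEA": 15, "ADL": 30, "KITCHEN": 30}


def ffmpeg_command(video_path, out_dir, dataset):
    """Build an ffmpeg command that writes frames at the dataset's rate.

    Assumptions: ffmpeg is installed, out_dir already exists, and frames
    are saved as numbered JPEG files.
    """
    pattern = f"{out_dir}/frame_%06d.jpg"
    return ["ffmpeg", "-i", str(video_path),
            "-vf", f"fps={FPS[dataset]}", pattern]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the output directory has been created.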

2 Testing Time for Our Proposed Datasets

Figure 2: Training time of source-only setting with 30 epochs on ADL, GTEA-KITCHEN and EPIC.

Our implementation is based on the PyTorch [paszke2017automatic] and PyTorch Lightning [falcon2019pytorch] frameworks. We trained the DANN [ganin2016domain] baseline in the source-only setting for 30 epochs on one V100 GPU and report the time cost. As shown in Figure 2, training takes on average about three hours on ADL and nine hours on GTEA-KITCHEN. On EPIC, the average time is over 16 hours. Note that we use full sampling to extract action segments from action videos in our datasets, whereas we use random sampling to extract only one segment from each action video in EPIC. Therefore, the training time would be reduced further if random sampling were applied to segment extraction in our proposed datasets. This shows that our datasets make it feasible for researchers with limited computing resources to develop UDA networks.
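The difference between the two sampling strategies can be made concrete with a small sketch. It assumes 16-frame windows with a 12-frame stride for full sampling and a uniformly random start frame for random sampling; the function name `epoch_items`, the `mode` argument and the seed are hypothetical and only serve this illustration.

```python
import random


def epoch_items(clip_lengths, mode, window=16, overlap=4, seed=0):
    """List the (clip_index, start_frame) pairs seen in one epoch.

    mode="full":   every overlapping window of every action video.
    mode="random": one randomly placed window per action video.
    Clips shorter than one window are skipped (an assumption).
    """
    rng = random.Random(seed)
    stride = window - overlap
    items = []
    for clip, n in enumerate(clip_lengths):
        if n < window:
            continue
        if mode == "full":
            items += [(clip, s) for s in range(0, n - window + 1, stride)]
        elif mode == "random":
            items.append((clip, rng.randint(0, n - window)))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return items
```

With full sampling, ADL contributes 2031 segments from 222 action videos (Table 7), so sampling one segment per action video would cut the number of items per epoch by roughly a factor of nine.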