Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

01/27/2020 ∙ by Jonathan Munro, et al. ∙ University of Bristol

Fine-grained action recognition datasets exhibit environmental bias, where multiple video sequences are captured from a limited number of environments. Training a model in one environment and deploying in another results in a drop in performance due to an unavoidable domain shift. Unsupervised Domain Adaptation (UDA) approaches have frequently utilised adversarial training between the source and target domains. However, these approaches have not explored the multi-modal nature of video within each domain. In this work we exploit the correspondence of modalities as a self-supervised alignment approach for UDA in addition to adversarial alignment. We test our approach on three kitchens from our large-scale dataset, EPIC-Kitchens, using two modalities commonly employed for action recognition: RGB and Optical Flow. We show that multi-modal self-supervision alone improves the performance over source-only training by 2.4% on average. We then combine adversarial training with multi-modal self-supervision, showing that our approach outperforms other UDA methods by 3%.


1 Introduction

Figure 1: Our proposed UDA approach for multi-modal action recognition. Improved target domain performance is achieved via multi-modal self-supervision on source and target domains simultaneously, jointly optimised with multiple domain discriminators, one per modality.

Fine-grained action recognition is the problem of recognising actions and interactions such as “cutting a tomato” or “tightening a bolt” compared to coarse-grained actions such as “preparing a meal”. This has a wide range of applications in assistive technologies in homes as well as in industry. Supervised approaches rely on collecting a large number of labelled examples to train discriminative models. However, due to the difficulty in collecting and annotating such fine-grained actions, many datasets collect long untrimmed sequences. These contain several fine-grained actions from a single [41, 48] or few [7, 45] environments.

Figure 2 shows the recent surge in large-scale fine-grained action datasets. Two approaches have been attempted to achieve scalability: crowd-sourcing scripted actions [16, 45, 44], and long-term collections of natural interactions in homes [7, 41, 36]. While the latter offers more realistic videos, many actions are collected in only a few environments. This leads to learned representations which do not generalise well [52].

Transferring a model learned on a labelled source domain to an unlabelled target domain is known as Unsupervised Domain Adaptation (UDA). Recently, significant attention has been given to deep UDA in other vision tasks [50, 14, 31, 32, 13, 54]. However, very few works have attempted deep UDA for video data [18, 6]. Surprisingly, none have tested on videos of fine-grained actions and all these approaches only consider the video as images (i.e. RGB modality). This is in contrast with self-supervised approaches that have successfully utilised multiple modalities within video when labels are not present during training [1].

Figure 2: Fine-grained action datasets [16, 45, 44, 7, 35, 25, 49, 28, 40], x-axis: number of action segments per environment (ape), y-axis: dataset size divided by ape. EPIC-Kitchens [7] offers the largest ape relative to its size.

To the best of our knowledge, no prior work has explored the multi-modal nature of video data for UDA in action recognition. We summarise our contributions as follows:


  • We show that multi-modal self-supervision, applied to both source and unlabelled target data, can be used for domain adaptation in video.

  • We propose a multi-modal UDA strategy, which we name MM-SADA, to adapt fine-grained action recognition models to unlabelled target environments, using both adversarial alignment and multi-modal self-supervision.

  • We test our approach on three domains from EPIC-Kitchens [7], trained end-to-end using I3D [5], providing the first benchmark of UDA for fine-grained action recognition. Our results show that MM-SADA outperforms source-only generalisation as well as alternative domain adaptation strategies such as batch-based normalisation [27], distribution discrepancy minimisation [31] and classifier discrepancy [43].

2 Related Works

This section discusses related literature starting with general UDA approaches, then supervised and self-supervised learning for action recognition, concluding with works on domain adaptation for action recognition.

Unsupervised Domain Adaptation (UDA) outside of Action Recognition. UDA has been extensively studied for vision tasks including object recognition [50, 14, 31, 32, 13, 54], semantic segmentation [64, 17, 59] and person re-identification [47, 61, 9]. Typical approaches adapt neural networks by minimising a discrepancy measure [50, 14], thus matching mid-level representations of source and target domains. Maximum Mean Discrepancy (MMD) [14, 31, 32] minimises the distance between the means of the projected domain distributions in Reproducing Kernel Hilbert Space. More recently, domain adaptation has been influenced by adversarial training [13, 54]. Simultaneously learning a domain discriminator, whilst maximising its loss with respect to the feature extractor, minimises the domain discrepancy between source and target. In [54], a GAN-like loss function allows separate weights for source and target domains, while in [13] shared weights are used, efficiently removing domain specific information by inverting the gradient produced by the domain discriminator with a Gradient Reversal Layer (GRL).
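As a concrete illustration of the discrepancy measures discussed above, the sketch below estimates squared MMD on a mini-batch with a single RBF kernel. It is our own simplified sketch under stated assumptions (the cited methods [14, 31, 32] use multiple kernels over deep features), not code from those works.

```python
# Minimal sketch: biased mini-batch estimate of squared MMD with one RBF kernel.
# Minimising this between source and target features pulls the two feature
# distributions together; the cited methods use multi-kernel variants.
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float) -> torch.Tensor:
    # k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * sigma^2)) for all pairs.
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(source_feats: torch.Tensor, target_feats: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # MMD^2 = E[k(s, s')] - 2 E[k(s, t)] + E[k(t, t')], estimated on the batch.
    k_ss = rbf_kernel(source_feats, source_feats, sigma).mean()
    k_tt = rbf_kernel(target_feats, target_feats, sigma).mean()
    k_st = rbf_kernel(source_feats, target_feats, sigma).mean()
    return k_ss - 2.0 * k_st + k_tt
```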

Utilising multiple modalities (image and audio) for UDA has been recently investigated for bird classification [37]. Adversarial discriminators are trained on each individual modality as well as on a mid-level fusion of the modalities, and a cross-modality attention is learnt. The work shows the advantages of multi-modal domain adaptation in contrast to single-modality adaptation, though in their case both modalities demonstrate similar robustness to the domain shift.

Very recently, self-supervised learning has been proposed as a domain adaptation approach for semantic segmentation and object recognition [51]. For object recognition, self-supervision on tasks, such as rotation and translation, replaces adversarial training. For semantic segmentation, self-supervision was shown to benefit adversarial training when jointly trained. Both tasks only use a single image. Our work utilises the multiple modalities offered by video, showing that self-supervision can be used to adapt action recognition models to target domains.

Figure 3: Proposed architecture: feature extractors F^RGB and F^Flow are shared for both target and source domains. Domain discriminators, D^RGB and D^Flow, are applied to each modality. The self-supervised correspondence of modalities, C, is trained from both source and unlabelled target data. Classifiers, G^RGB and G^Flow, are trained using source domain examples only, from the average-pooled classification scores of each modality. During inference, multi-modal target data is classified.

Supervised Action Recognition. Convolutional networks are state of the art for action recognition, with the first seminal works using either 3D [19] or 2D convolutions [21]. Both works utilise a single modality—appearance information from RGB frames. Simonyan and Zisserman [46] address the lack of motion features captured by these architectures, proposing two-stream late fusion that learns separate features from the Optical Flow and RGB modalities, outperforming single modality approaches.

Recent architectures have focused on modelling longer temporal structure, through consensus of predictions over time [57, 62, 29] as well as inflating traditional CNNs to 3D convolutions [5], all using the two-stream approach of late-fusing RGB and Flow. The latest architectures have focused on reducing the high computational cost of 3D convolutions [11, 20, 60], yet still show improvements when reporting results of two-stream fusion [60].

Self-supervision for Action Recognition. Self-supervision methods learn representations from the temporal [12, 58] and multi-modal structure of video [1, 24], leveraging pretraining on a large corpus of unlabelled videos. Methods exploiting the temporal consistency of video have predicted the order of a sequence of frames [12] or the arrow of time [58]. Alternatively, the correspondence between multiple modalities has been exploited for self-supervision, particularly with audio and RGB [1, 24, 34]; these works predict whether modalities correspond or are temporally synchronised. We test both approaches for multi-modal self-supervision in our UDA approach.

Domain Adaptation for Action Recognition. Of the several domain shifts in action recognition, only one has received significant research attention, that is the problem of cross-viewpoint (or viewpoint-invariant) action recognition [38, 23, 30, 44, 26]. These works focus on adapting to the geometric transformations of a camera but do little to combat other shifts, like changes in environment. Works utilise supervisory signals such as skeleton or pose [30] and corresponding frames from multiple viewpoints [44, 23]. Recent works have used GRLs to create a view-invariant representation [26]. Though several modalities (RGB, flow and depth) have been investigated, these were aligned and evaluated independently.

On the contrary, UDA for changes in environment has received limited recent attention. Before deep learning, UDA for action recognition used shallow models to align source and target distributions of handcrafted features [4, 10, 63]. Two recent works have attempted deep UDA for action recognition [18, 6]. Both apply adversarial training using GRLs [13] to either C3D [53] or TRN [62] architectures using RGB only. Jamal et al. [18] conclude that their approach outperforms shallow methods that use subspace alignment, and Chen et al. [6] show that attending to the temporal dynamics of videos can improve alignment. In [6], the method is evaluated on 4 pairs of source/target domains, while [18] evaluate their method on 6 pairs of domains from subsets of coarse-grained action datasets such as UCF [39], Olympics [33] and the KMS dataset [18]. We also evaluate our method on 6 pairs of domains; however, our domains are larger than those in [18]. On average, we use 3.8× more training and 2× more testing video clips. Additionally, we focus on fine-grained actions previously unexplored for UDA.

The EPIC-Kitchens [7] dataset for fine-grained action recognition released two distinct test sets—one with seen and another with unseen/novel kitchens. In the 2019 challenges report, all participating entries exhibit a drop in action recognition accuracy of 12-20% when testing their models on novel environments compared to seen environments [8]. To the best of our knowledge, there has been no previous effort to apply UDA on this or any other fine-grained action recognition dataset. In this work, we present the first approach to multi-modal UDA for action recognition, tested on fine-grained actions. We combine adversarial training on multiple modalities with a modality correspondence self-supervision task. This utilises the differing robustness to domain shifts between the modalities. We show that jointly training for both objectives outperforms either adversarial or self-supervised alignment alone. Our method is detailed next.

3 Proposed Method

This section outlines our proposed action recognition domain adaptation approach, which we call Multi-Modal Self-Supervised Adversarial Domain Adaptation (MM-SADA). In Fig. 3, we visualise MM-SADA for two-stream action recognition using two modalities: RGB and Optical Flow, although any modalities could be used. We incorporate a self-supervision alignment classifier, C, that determines whether modalities are sampled from the same or different actions, to learn modality correspondence. This takes in the concatenated features from both modalities, without any labels. Learning the correspondence on source and target encourages features that generalise to both domains. Aligning the domain statistics is achieved by adversarial training, with a domain discriminator per modality that predicts which domain a given example is sampled from. A Gradient Reversal Layer (GRL) reverses and backpropagates the gradient to the features. Both alignment techniques are trained on source and unlabelled target data, whereas the action classifier is only trained with labelled source data.

We next detail MM-SADA, generalised to any two or more modalities. We start by revisiting the problem of domain adaptation and outlining multi-stream late fusion, then we describe our adaptation approach.

3.1 Unsupervised Domain Adaptation (UDA)

A domain is a distribution over the input population X and the corresponding label space Y. The aim of supervised learning, given labelled samples (x, y), is to find a representation, G(F(x)), over some learnt features, F(x), that minimises the empirical risk, ℓ_y(G(F(x)), y). The empirical risk is optimised over the labelled source domain, E_{(x,y)∼S}[ℓ_y(G(F(x)), y)], where S is the distribution of source domain samples. The goal of domain adaptation is to minimise the risk on a target domain, E_{(x,y)∼T}[ℓ_y(G(F(x)), y)], where the distributions of the source and target domains are distinct, S ≠ T. In UDA, target labels are unavailable, thus methods minimise both the source risk and the distribution discrepancy between the source and target domains [3].

3.2 Multi-modal Action Recognition

When the input is multi-modal, i.e. x = (x^1, …, x^M), where x^m is the m-th modality of the input, fusion of modalities can be employed. Most commonly, late fusion is implemented, where we sum prediction scores from modalities and backpropagate the error to all modalities, i.e.:

L_y = ℓ_y( σ( Σ_m G^m(F^m(x^m)) ), y )     (1)

where G^m is the task classifier and F^m the learnt feature extractor for modality m. The consensus of modality classifiers is trained by a cross-entropy loss, ℓ_y, between the task label, y, and the prediction, σ( Σ_m G^m(F^m(x^m)) ), where σ is defined as the softmax function. Training for classification expects the presence of labels and thus can only be applied to the labelled source input.
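A minimal sketch of this late-fusion classification loss (our own illustration of Eq. (1), not the authors' released code; the dictionary-based interface and names are assumptions):

```python
# Sum per-modality prediction scores, then apply a single softmax cross-entropy
# against the source label, as in Eq. (1).
import torch
import torch.nn.functional as F

def late_fusion_loss(feature_extractors: dict, classifiers: dict,
                     inputs: dict, labels: torch.Tensor) -> torch.Tensor:
    """inputs: modality -> tensor of clips (B, ...); labels: (B,) class indices."""
    summed_logits = None
    for m, x_m in inputs.items():
        logits_m = classifiers[m](feature_extractors[m](x_m))   # G^m(F^m(x^m))
        summed_logits = logits_m if summed_logits is None else summed_logits + logits_m
    # cross_entropy applies the softmax (sigma) internally.
    return F.cross_entropy(summed_logits, labels)
```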

3.3 Within-Modal Adversarial Alignment

Both generative and discriminative adversarial approaches have been proposed for bridging the distribution discrepancy between source and target domains. Discriminative approaches are most appropriate for the high-dimensional input data present in video: generative adversarial approaches require a huge amount of training data, and temporal dynamics are often difficult to reconstruct. Discriminative methods train a discriminator, D, to predict the domain of an input (i.e. source or target) from the learnt features, F(x). By maximising the discriminator loss with respect to the features, the network learns a feature representation that is invariant to both domains.

For aligning multi-modal video data, we propose using a domain discriminator per modality that penalises domain-specific features in each modality's stream. Aligning modalities separately avoids the easier solution of the network focusing only on the less robust modality when classifying the domain. Each separate domain discriminator, D^m, is thus used to train the modality's feature representation F^m. Given a binary domain label, d, indicating whether an example x ∈ S or x ∈ T, the domain discriminator loss for modality m is defined as,

L_d^m = −[ d log D^m(F^m(x^m)) + (1 − d) log(1 − D^m(F^m(x^m))) ]     (2)
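Below is a minimal sketch of this per-modality adversarial loss with a Gradient Reversal Layer, in the spirit of [13]; it is our illustration under stated assumptions, not the authors' implementation, and the function names are ours.

```python
# A GRL is the identity on the forward pass and negates gradients on the
# backward pass, so minimising the discriminator loss trains D^m while the
# reversed gradient pushes F^m towards domain-invariant features.
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def modality_domain_loss(f_m, d_m, x_src_m, x_tgt_m):
    """Binary cross-entropy of the domain discriminator D^m for one modality m."""
    feats = torch.cat([f_m(x_src_m), f_m(x_tgt_m)])                 # F^m on both domains
    domain = torch.cat([torch.zeros(x_src_m.shape[0]),              # d = 0: source
                        torch.ones(x_tgt_m.shape[0])]).to(feats.device)
    logits = d_m(GradReverse.apply(feats)).squeeze(-1)              # D^m after the GRL
    return F.binary_cross_entropy_with_logits(logits, domain)
```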

3.4 Multi-Modal Self-Supervised Alignment

Prior approaches to domain adaptation have mostly focused on images and thus have not explored the multi-modal nature of the input data. Videos are multi-modal, where corresponding modalities are present in both source and target. We thus propose a multi-modal self-supervised task to align domains. Multi-modal self-supervision has been successfully exploited as a pretraining strategy [1, 2]. However, we show that self-supervision for both source and target domains can also align domains.

We learn the temporal correspondence between modalities as a self-supervised task. For positive examples, we sample the modalities from the same action, randomly at different temporal locations. For negative examples, each modality is sampled from a different action. The network is thus trained to determine whether the modalities correspond, and this is optimised over both domains. A self-supervised correspondence classifier head, C, is used to predict whether modalities correspond. It shares the same modality feature extractors, F^m, as the action classifier. It is important that C is as shallow as possible so that most of the self-supervised representation is learned in the feature extractors. Given a binary label, c, defining whether the modalities correspond, and the concatenated features of the multiple modalities, we calculate the multi-modal self-supervision loss as follows:

L_c = −[ c log C([F^1(x^1), …, F^M(x^M)]) + (1 − c) log(1 − C([F^1(x^1), …, F^M(x^M)])) ]     (3)
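A minimal sketch of the correspondence task and loss of Eq. (3), based on our reading of this section; the sampling interface and names are assumptions, not the released code.

```python
# Sample corresponding (same action) or non-corresponding (different actions)
# RGB/Flow windows, then train a shallow head C on concatenated features.
import random
import torch
import torch.nn.functional as F

def sample_correspondence_pair(actions: list):
    """actions: each item is a dict {'rgb': [...], 'flow': [...]} of temporal
    windows from one action segment. Returns (rgb_clip, flow_clip, label)."""
    if random.random() < 0.5:                       # positive: same action,
        a = random.choice(actions)                  # windows at different times
        return random.choice(a['rgb']), random.choice(a['flow']), 1.0
    a, b = random.sample(actions, 2)                # negative: different actions
    return random.choice(a['rgb']), random.choice(b['flow']), 0.0

def correspondence_loss(f_rgb, f_flow, corr_head, rgb, flow, labels):
    """rgb, flow: batched clips; labels: (B,) float tensor of correspondence."""
    feats = torch.cat([f_rgb(rgb), f_flow(flow)], dim=-1)        # channel-wise concat
    logits = corr_head(feats).squeeze(-1)                        # shallow head C
    return F.binary_cross_entropy_with_logits(logits, labels)    # Eq. (3)
```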

3.5 Proposed MM-SADA

We define the Multi-Modal Self-Supervised Adversarial Domain Adaptation (MM-SADA) approach as follows. The classification loss, L_y, is jointly optimised with the adversarial and self-supervised alignment losses. The within-modal adversarial alignment is weighted by λ_d, and the multi-modal self-supervised alignment is weighted by λ_c. Optimising both alignment strategies achieves benefits in matching source and target statistics and learning cross-modal relationships transferable to the target domain.

L = L_y + λ_c L_c + λ_d Σ_m L_d^m     (4)

Note that the first loss, L_y, is only optimised for labelled source data, while the alignment losses L_d^m and L_c are optimised for both source and unlabelled target data; the GRL reverses the gradient of L_d^m with respect to the feature extractors.
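Putting the pieces together, a minimal sketch of the joint objective in Eq. (4), reusing the helper functions from the previous snippets; this is our illustration with assumed names and interfaces, and the default weights shown are placeholders, not the paper's values.

```python
# Joint MM-SADA objective: source-only classification plus per-modality
# adversarial alignment and the multi-modal correspondence loss.
def mm_sada_loss(f, g, d, corr_head, src_x, src_y, tgt_x, corr_batch,
                 lambda_c=1.0, lambda_d=1.0):   # placeholder weights
    """f, g, d: dicts modality -> feature extractor F^m / classifier G^m /
    discriminator D^m. corr_batch: (rgb, flow, labels) pairs sampled from
    both source and target."""
    loss_y = late_fusion_loss(f, g, src_x, src_y)                        # Eq. (1), source only
    loss_d = sum(modality_domain_loss(f[m], d[m], src_x[m], tgt_x[m])    # Eq. (2), per modality
                 for m in f)
    rgb, flow, labels = corr_batch
    loss_c = correspondence_loss(f['rgb'], f['flow'], corr_head,         # Eq. (3), both domains
                                 rgb, flow, labels)
    return loss_y + lambda_c * loss_c + lambda_d * loss_d                # Eq. (4)
```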

4 Experiments and Results

This section first discusses the dataset, architecture, and implementation details in Sec. 4.1. We compare against baseline methods noted in Sec. 4.2. Results are presented in Sec. 4.3, followed by an ablation study of the method’s components in Sec. 4.4 and qualitative results including feature space visualisations in Sec. 4.5.

4.1 Implementation Details

Dataset. Our previous work, EPIC-Kitchens [7], offers a unique opportunity to test domain adaptation for fine-grained action recognition thanks to its large number of domains and classes. Similar to previous works for action recognition [13, 18], we evaluate on pairs of domains. We select the three largest kitchens, in number of training action segments from [7], to form our domains. These are P01, P22 and P08, which we refer to as D1, D2 and D3, respectively (Fig. 4). We analyse the performance for the 8 largest action classes (‘put’, ‘take’, ‘open’, ‘close’, ‘wash’, ‘cut’, ‘mix’ and ‘pour’), which form 80% of the training action segments for these domains. This ensures sufficient examples per domain and class, without balancing the training set. The label imbalance of these 8 classes is depicted in Fig. 4 (middle), which also shows the differing distribution of classes between the domains. Most domain adaptation works evaluate on balanced datasets [42, 13, 15], with few using imbalanced datasets [56]. EPIC-Kitchens has a large class imbalance, offering additional challenges for domain adaptation. The number of action segments in each domain is specified in Fig. 4 (bottom), where a segment is a labelled start/end time with an action label.

Figure 4: Top: Three kitchens from EPIC-Kitchens selected as domains to evaluate our method. Middle: Class distribution per domain, for the 8 classes in the legend. Bottom: Number of action segments per domain.
                    D2→D1  D3→D1  D1→D2  D3→D2  D1→D3  D2→D3  Mean
MM Source-only       42.5   44.3   42.0   56.3   41.2   46.5   45.5
AdaBN [27]           44.6   47.8   47.0   54.7   40.3   48.8   47.2
MMD [31]             43.1   48.3   46.6   55.2   39.2   48.5   46.8
MCD [43]             42.1   47.9   46.5   52.7   43.5   51.0   47.3
MM-SADA              48.2   50.9   49.5   56.1   44.1   52.7   50.3
Supervised target    62.8   62.8   71.7   71.7   74.0   74.0   69.5
Table 1: Top-1 Accuracy on the target domain (columns are source→target pairs), for our proposed MM-SADA, compared to different alignment approaches. On average, we outperform the source-only performance by 4.8%.

Architecture. We train all our models end-to-end. We use the inflated 3D convolutional architecture (I3D) [5] as our backbone for feature extraction, one per modality (F^m). In this work, F^m convolves over a temporal window of 16 frames. In training, a single temporal window is randomly sampled from within the action segment each epoch. In testing, as in [57], we use an average over 5 temporal windows, equidistant within the segment. We use the RGB and Optical Flow frames provided publicly [7]. The output of F^m is the result of the final average pooling layer of I3D, with 1024 dimensions. G^m is a single fully connected layer with softmax activation to predict class labels. Each domain discriminator D^m is composed of 2 fully connected layers with a hidden layer of 100 dimensions and a ReLU activation function. A dropout rate of 0.5 was used on the output of F^m, and weight decay is applied to all parameters. Batch normalisation layers are used in F^m and are updated with target statistics for testing, as in AdaBN [27].

The self-supervised correspondence function C (Eq. 3) is implemented as 2 fully connected layers of 100 dimensions with a ReLU activation function. The features from both modalities are concatenated along the channel dimension and used as input to C.
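A minimal sketch of these heads, under one reading of the dimensions stated above (ours, not the released code):

```python
# Heads attached to the 1024-D pooled I3D features: a one-layer classifier G^m,
# a two-layer domain discriminator D^m, and the correspondence head C on the
# concatenated RGB and Flow features.
import torch.nn as nn

FEAT_DIM = 1024   # output of I3D's final average pooling layer

def make_classifier(num_classes: int) -> nn.Module:
    # G^m: a single fully connected layer; softmax is applied inside the loss.
    return nn.Linear(FEAT_DIM, num_classes)

def make_domain_discriminator() -> nn.Module:
    # D^m: 2 FC layers with a 100-D hidden layer and ReLU, source vs target.
    return nn.Sequential(nn.Linear(FEAT_DIM, 100), nn.ReLU(), nn.Linear(100, 1))

def make_correspondence_head() -> nn.Module:
    # C: FC layers of 100 dimensions over concatenated RGB + Flow features.
    return nn.Sequential(nn.Linear(2 * FEAT_DIM, 100), nn.ReLU(), nn.Linear(100, 1))
```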

Training and Hyper-parameter Choice. Training occurs in two stages. First, the network is trained with only the classification and self-supervision losses for 3K steps. Then, the overall loss function (Eq. 4) is optimised, applying the domain adversarial losses and reducing the learning rate, for a further 6K steps. The self-supervision hyper-parameter, λ_c, was chosen by observing the performance on the labelled source domain only, i.e. it has not been optimised for the target domain. Note that while training with self-supervision, half the batch contains corresponding modalities and the other half non-corresponding modalities. Only examples with corresponding modalities are used to train for action classification. The domain adversarial hyper-parameter, λ_d, was chosen arbitrarily; we show that the results are robust to some variations in this hyper-parameter in an ablation study.

Batch size was set to 128, split equally for source and target samples. All models were trained using the Adam optimiser [22], on an NVIDIA DGX-1 with 8 V100 GPUs. On average, training takes 9 hours. We report the top-1 target accuracy averaged over the last 9 epochs of training, for robustness. We also show that the performance is consistent over training epochs, through accuracy curves.
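The two-stage schedule can be sketched as follows. This is our illustration reusing mm_sada_loss from the Sec. 3.5 snippet; the learning-rate values are left as arguments rather than reproduced, and the detail that only corresponding-modality examples feed the classification loss is omitted for brevity.

```python
# Stage 1 (3K steps): classification + self-supervision only.
# Stage 2 (6K steps): full objective with the adversarial losses, lower LR.
import itertools
import torch

def train_mm_sada(f, g, d, corr_head, batches, lr_stage1, lr_stage2,
                  lambda_c=1.0, lambda_d=1.0):
    """batches yields (src_x, src_y, tgt_x, corr_batch); each batch holds 128
    clips split equally between source and target."""
    modules = list(f.values()) + list(g.values()) + list(d.values()) + [corr_head]
    params = itertools.chain(*(m.parameters() for m in modules))
    opt = torch.optim.Adam(params, lr=lr_stage1)
    for step, (src_x, src_y, tgt_x, corr_batch) in enumerate(batches):
        if step == 3000:                                # enter stage 2
            for group in opt.param_groups:
                group['lr'] = lr_stage2                 # reduced learning rate
        adv_weight = lambda_d if step >= 3000 else 0.0  # adversarial off in stage 1
        loss = mm_sada_loss(f, g, d, corr_head, src_x, src_y, tgt_x, corr_batch,
                            lambda_c=lambda_c, lambda_d=adv_weight)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step + 1 == 3000 + 6000:                     # 3K + 6K steps in total
            break
```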

Figure 5: Accuracy on target during training epochs for (a) target D1, (b) target D2 and (c) target D3. Solid line is MM-SADA and dotted line is source-only performance.

4.2 Baselines

We first evaluate the impact of domain shift between source and target by testing using a multi-modal source-only model (MM source-only), trained with no access to unlabelled target data. Additionally, we compare to 3 baselines for unsupervised domain adaptation as follows:


  • AdaBN [27]: Batch Normalisation layers are updated with target domain statistics.

  • Maximum Mean Discrepancy (MMD): The multiple kernel implementation of the commonly used domain discrepancy measure MMD is used as a baseline [31]. This directly replaces the adversarial alignment with separate discrepancy measures applied to individual modalities.

  • Maximum Classifier Discrepancy (MCD) [43]: Alignment through classifier disagreement is used. We use two multi-modal classification heads as separate classifiers. The classifiers are trained to maximise prediction disagreement on the target domain, implemented as an L1 loss, finding examples outside the support of the source domain. We use a GRL to optimise the feature extractors (see the sketch after this list).
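A minimal sketch of the discrepancy term in this MCD baseline as we describe it above (our illustration, not the original implementation); it reuses GradReverse from the Sec. 3.3 snippet, and both heads are additionally trained with the usual source classification loss.

```python
# Two multi-modal classification heads are pushed to disagree on target data
# (L1 distance between their softmax outputs); the GRL reverses the gradient so
# that the shared feature extractors instead minimise the disagreement.
import torch.nn.functional as F

def mcd_discrepancy_loss(f, g1, g2, tgt_x):
    """f: dict modality -> feature extractor; g1, g2: dicts modality -> head."""
    feats = {m: GradReverse.apply(f[m](tgt_x[m])) for m in f}
    p1 = F.softmax(sum(g1[m](feats[m]) for m in f), dim=-1)
    p2 = F.softmax(sum(g2[m](feats[m]) for m in f), dim=-1)
    disagreement = (p1 - p2).abs().sum(dim=-1).mean()
    # Minimising the negative maximises the heads' disagreement, while the GRL
    # makes the feature extractors reduce it.
    return -disagreement
```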

Additionally, as an upper limit, we also report the supervised target domain results. This model is trained on labelled target data and serves only as an indication of the upper limit of the various approaches. We highlight these results in the tables to avoid confusion.

4.3 Results

                                  D2→D1  D3→D1  D1→D2  D3→D2  D1→D3  D2→D3  Mean
Source-only                        42.5   44.3   42.0   56.3   41.2   46.5   45.5
MM-SADA (Self-Supervised only)     41.8   49.7   47.7   57.4   40.3   50.6   47.9
MM-SADA (Adversarial only)         46.5   51.0   50.0   53.7   43.5   51.5   49.4
MM-SADA (Adversarial only)         46.9   50.2   50.2   53.6   44.7   50.8   49.4
MM-SADA (λ_d halved)               45.8   52.1   50.4   56.9   43.5   51.9   50.1
MM-SADA                            48.2   50.9   49.5   56.1   44.2   52.7   50.3
Table 2: Ablation of our method, showing the contribution of the various loss functions (Eq. 4). When λ_d = 0, the per-modality adversarial alignment is not utilised. When λ_c = 0, self-supervision is not utilised.
Self-Supervision   D2→D1  D3→D1  D1→D2  D3→D2  D1→D3  D2→D3  Mean
Sync.               44.2   50.2   48.0   54.6   41.0   49.4   47.9
Seg. Corr.          41.8   49.7   47.7   57.4   40.3   50.6   47.9
Table 3: Comparison of different self-supervision tasks for learning modality correspondence: determining modality synchrony vs. determining whether modality samples come from the same action segment. The two approaches perform comparably.

First we compare our proposed method MM-SADA to the various domain alignment techniques in Table 1. We show that our method outperforms batch-based [27] (by 3.1%), classifier discrepancy [43] (by 3%) and discrepancy minimisation alignment [31] (by 3.5%) methods. The improvement is consistent for all pairs of domains. Additionally, it significantly improves on the source-only baseline, by up to 7.5%, in 5 out of 6 cases. For a single pair, D3 → D2, all baselines under-perform compared to source-only. Ours has a slight drop (-0.2%) but outperforms other alignment approaches. We will revisit this case in the ablation study.

Figure 6: Robustness of the average top-1 accuracy on the target domain, over all pairs of domains, for various values of λ_d.

Figure 5 shows the top-1 accuracy on the target during training (solid lines) vs source-only training without domain adaptation (dotted lines). Training without adaptation has consistently lower accuracy, except for our failure case D3 → D2, showing the stability and robustness of our method during training, with minimal fluctuations due to stochastic optimisation on batches. This is essential for UDA as no target labels can be used for early stopping.

Figure 7: Qualitative results for MM-SADA and source-only, showing success and failure cases.
Figure 8: t-SNE plots of the (a) RGB and (b) Flow feature spaces produced by source-only, self-supervised alignment and our proposed model MM-SADA. Target is shown in red and source in blue. Our method better aligns both modalities.

4.4 Ablation Study

Next, we compare the individual contributions of different components of MM-SADA. We report these results in Table 2. The self-supervised component on its own gives a 2.4% improvement over no adaptation. This shows that self-supervision can learn features common to both source and target domains, adapting between the domains. Importantly, this on average outperforms the three baselines in Table 1. Adversarial alignment per modality gives a further improvement as this encourages the source and target distributions to overlap, removing domain-specific features from each modality. Compared to adversarial alignment only, our method improves in 5 of the 6 domains and by up to 3.2%.

For the single pair noted earlier, D3 → D2, self-supervision alone outperforms ‘source-only’ and all other methods reported in Table 1 by 1.1%. However, when combined with adversarial alignment, the overall performance of MM-SADA reported in Table 1 cannot beat the source-only baseline. In Table 2, we show that when halving the contribution of the adversarial component, λ_d, MM-SADA achieves 56.9%, outperforming the source-only baseline. Therefore self-supervision can improve performance where marginal alignment domain adaptation techniques fail.

Figure 6 plots the performance of MM-SADA as λ_d changes. Note that λ_c can be chosen by observing the performance of self-supervision on source-domain labels, while tuning λ_d would require access to target data. We show that our approach is robust to various values of λ_d, with some values achieving even higher accuracy than those reported in Table 2.

We also compare two approaches for multi-modal self-supervision (Table 3). The first, which has been used to report all results above, learns the correspondence of RGB and Flow within the same action segment. We refer to this as ‘Seg. Corr.’. The second learns whether RGB and Flow samples are time-synchronised, which we call ‘Sync.’. The two approaches are comparable in performance overall, with no difference on average over the domain pairs. This shows the potential of the self-supervision component to utilise a number of possible correspondence tasks for training.

4.5 Qualitative Results

Figure 7 shows qualitative results of our method relative to source-only performance, with three success cases and one failure case for two pairs of domains. Without adaptation, models cannot utilise appropriate visual cues in the target environment, e.g. the appearance of the chopping board and knife, or the sink and tap, and therefore fail to predict ‘cut’ and ‘wash’. Both adapted and non-adapted models struggle with ambiguous examples where different actions are occurring using both hands.

Figure 8 shows the t-SNE [55] visualisation of the RGB (left) and Flow (right) feature spaces. Several observations are worth noting from this figure. First, Flow shows higher overlap between source and target features pre-alignment (first row). This shows that Flow is more robust to environmental changes. Second, self-supervision alone (second row) changes the feature space by separating the features into clusters that are potentially class-relevant. This is most evident on the RGB modality (second row, third column). However, alone, this feature space still shows domain gaps, particularly for RGB features. Third, our proposed MM-SADA (third row) aligns the marginal distributions of source and target domains.
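Plots like those in Fig. 8 can be produced with a few lines; the sketch below is our own illustration (not the authors' plotting code), embedding pooled features of one modality with t-SNE [55] and colouring points by domain.

```python
# Embed source and target features of one modality in 2-D with t-SNE and
# colour by domain (source in blue, target in red, as in Fig. 8).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_domain_tsne(source_feats: np.ndarray, target_feats: np.ndarray, title: str):
    """source_feats, target_feats: (N, 1024) pooled features for one modality."""
    feats = np.concatenate([source_feats, target_feats])
    emb = TSNE(n_components=2, perplexity=30).fit_transform(feats)
    n_src = len(source_feats)
    plt.scatter(emb[:n_src, 0], emb[:n_src, 1], s=3, c='blue', label='source')
    plt.scatter(emb[n_src:, 0], emb[n_src:, 1], s=3, c='red', label='target')
    plt.title(title)
    plt.legend()
    plt.axis('off')
    plt.show()
```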

5 Conclusion and Future Work

We proposed a multi-modal domain adaptation approach for fine-grained action recognition utilising multi-modal self-supervision and adversarial training per modality. We show that the self-supervision task of predicting the correspondence of multiple modalities is an effective domain adaptation method. On its own, this can outperform domain alignment methods [31, 43], by jointly optimising for the self-supervised task over both domains. Together with adversarial training, the proposed approach outperforms non-adapted models by 4.8% on average. We conclude that aligning individual modalities whilst learning a self-supervision task on source and target domains can improve the ability of action recognition models to transfer to unlabelled environments.

Future work will focus on utilising more modalities, such as audio, to aid domain adaptation as well as exploring additional self-supervised tasks for adaptation, trained individually as well as for multi-task self-supervision.

Acknowledgement Research supported by EPSRC LOCATE (EP/N033779/1) and an EPSRC Doctoral Training Partnership (DTP). The authors acknowledge and value the use of the EPSRC-funded Tier 2 facility, JADE.

References

  • [1] R. Arandjelovic and A. Zisserman (2017) Look, listen and learn. In International Conference on Computer Vision (ICCV), Cited by: §1, §2, §3.4.
  • [2] R. Arandjelovic and A. Zisserman (2018) Objects that sound. In European Conference on Computer Vision (ECCV), Cited by: §3.4.
  • [3] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira (2006) Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems (NIPS), Cited by: §3.1.
  • [4] L. Cao, Z. Liu, and T. S. Huang (2010) Cross-dataset action detection. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [5] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Computer Vision and Pattern Recognition (CVPR), Cited by: 3rd item, §2, §4.1.
  • [6] M. Chen, Z. Kira, G. AlRegib, J. Yoo, R. Chen, and J. Zheng (2019-10) Temporal attentive alignment for large-scale video domain adaptation. In International Conference on Computer Vision (ICCV), Cited by: §1, §2.
  • [7] D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, and W. Price (2018) Scaling egocentric vision: the epic-kitchens dataset. In European Conference on Computer Vision (ECCV), Cited by: Multi-Modal Domain Adaptation for Fine-Grained Action Recognition, Figure 2, 3rd item, §1, §1, §2, §4.1, §4.1.
  • [8] D. Damen, W. Price, E. Kazakos, G. M. Farinella, and A. Furnari (2019) EPIC-kitchens - 2019 challenges report. Online Report. Cited by: §2.
  • [9] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [10] N. Faraji Davar, T. de Campos, D. Windridge, J. Kittler, and W. Christmas (2011) Domain adaptation in the context of sport video action recognition. In Domain Adaptation Workshop, in conjunction with NIPS, Cited by: §2.
  • [11] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019-10) SlowFast networks for video recognition. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [12] B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [13] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §1, §2, §2, §4.1.
  • [14] M. Ghifary, W. B. Kleijn, and M. Zhang (2014) Domain adaptive neural networks for object recognition. In Pacific Rim International Conference on Artificial Intelligence, Cited by: §1, §2.
  • [15] B. Gong, Y. Shi, F. Sha, and K. Grauman (2012) Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1.
  • [16] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, and M. Mueller-Freitag (2017) The "something something" video database for learning and evaluating visual common sense. In International Conference on Computer Vision (ICCV), Cited by: Figure 2, §1.
  • [17] H. Huang, Q. Huang, and P. Krahenbuhl (2018-09) Domain transfer through deep activation matching. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • [18] A. Jamal, V. P. Namboodiri, D. Deodhare, and K. Venkatesh (2018) Deep domain adaptation in action space. In British Machine Vision Conference (BMVC), Cited by: §1, §2, §4.1.
  • [19] S. Ji, W. Xu, M. Yang, and K. Yu (2013) 3D convolutional neural networks for human action recognition. Pattern Analysis and Machine Intelligence (PAMI) 35 (1), pp. 221–231. Cited by: §2.
  • [20] B. Jiang, M. Wang, W. Gan, W. Wu, and J. Yan (2019-10) STM: spatiotemporal and motion encoding for action recognition. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [21] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [22] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • [23] Y. Kong, Z. Ding, J. Li, and Y. Fu (2017) Deeply learned view-invariant features for cross-view action recognition. IEEE Transactions on Image Processing 26 (6). Cited by: §2.
  • [24] B. Korbar, D. Tran, and L. Torresani (2018) Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems (Neurips), Cited by: §2.
  • [25] H. Kuehne, A. Arslan, and T. Serre (2014) The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities. In Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 2.
  • [26] J. Li, Y. Wong, Q. Zhao, and M. Kankanhalli (2018) Unsupervised learning of view-invariant action representations. In Advances in Neural Information Processing Systems (Neurips), Cited by: §2.
  • [27] Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu (2018) Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80, pp. 109–117. Cited by: 3rd item, 1st item, §4.1, §4.3, Table 1.
  • [28] Y. Li, M. Liu, and J. M. Rehg (2018-09) In the eye of beholder: joint learning of gaze and actions in first person video. In European Conference on Computer Vision (ECCV), Cited by: Figure 2.
  • [29] J. Lin, C. Gan, and S. Han (2019-10) TSM: temporal shift module for efficient video understanding. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [30] M. Liu, H. Liu, and C. Chen (2017) Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition 68, pp. 346–362. Cited by: §2.
  • [31] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, Cited by: 3rd item, §1, §2, 2nd item, §4.3, Table 1, §5.
  • [32] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2017) Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, Cited by: §1, §2.
  • [33] J. C. Niebles, C. Chen, and L. Fei-Fei (2010) Modeling temporal structure of decomposable motion segments for activity classification. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • [34] A. Owens and A. A. Efros (2018) Audio-visual scene analysis with self-supervised multisensory features. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [35] H. Pirsiavash and D. Ramanan (2012) Detecting activities of daily living in first-person camera views. In Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 2.
  • [36] H. Pirsiavash and D. Ramanan (2012) Detecting activities of daily living in first-person camera views. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [37] F. Qi, X. Yang, and C. Xu (2018) A unified framework for multimodal domain adaptation. In ACM Multimedia Conference on Multimedia Conference, Cited by: §2.
  • [38] H. Rahmani and A. Mian (2015) Learning a non-linear knowledge transfer model for cross-view action recognition. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [39] K. K. Reddy and M. Shah (2013) Recognizing 50 human action categories of web videos. Machine Vision and Applications. Cited by: §2.
  • [40] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele (2012) A Database for Fine Grained Activity Detection of Cooking Activities. In Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 2.
  • [41] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele (2015) Recognizing fine-grained and composite activities using hand-centric features and script data. International Journal of Computer Vision 119 (3), pp. 346–373. Cited by: §1, §1.
  • [42] K. Saenko, B. Kulis, M. Fritz, and T. Darrell (2010) Adapting visual category models to new domains. In The European Conference on Computer Vision (ECCV), Cited by: §4.1.
  • [43] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), Cited by: 3rd item, 3rd item, §4.3, Table 1, §5.
  • [44] G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari (2018) Actor and observer: joint modeling of first and third-person videos. In Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 2, §1, §2.
  • [45] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In European Conference on Computer Vision (ECCV), Cited by: Figure 2, §1, §1.
  • [46] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), Cited by: §2.
  • [47] K. Sohn, S. Liu, G. Zhong, X. Yu, M. Yang, and M. Chandraker (2017-10) Unsupervised domain adaptation for face recognition in unlabeled videos. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [48] S. Stein and S. J. McKenna (2013) Combining embedded accelerometers with computer vision for recognizing food preparation activities. In ACM International Joint Conference on Pervasive and Ubiquitous Computing, Cited by: §1.
  • [49] S. Stein and S. McKenna (2013) Combining Embedded Accelerometers with Computer Vision for Recognizing Food Preparation Activities. In ACM International Joint Conference on Pervasive and Ubiquitous Computing, Cited by: Figure 2.
  • [50] B. Sun and K. Saenko (2016) Deep coral: correlation alignment for deep domain adaptation. In European Conference on Computer Vision (ECCV), Cited by: §1, §2.
  • [51] Y. Sun, E. Tzeng, T. Darrell, and A. A. Efros (2019) Unsupervised domain adaptation through self-supervision. arXiv preprint arXiv:1909.11825. Cited by: §2.
  • [52] A. Torralba and A. A. Efros (2011) Unbiased look at dataset bias.. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [53] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [54] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [55] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: §4.5.
  • [56] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1.
  • [57] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), Cited by: §2, §4.1.
  • [58] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman (2018) Learning and using the arrow of time. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [59] Y. Zhang, P. David, and B. Gong (2017-10) Curriculum domain adaptation for semantic segmentation of urban scenes. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [60] J. Zhao and C. Snoek (2019) Dance with flow: two-in-one stream action detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [61] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang (2018) Camera style adaptation for person re-identification. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [62] B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In European Conference on Computer Vision (ECCV), Cited by: §2, §2.
  • [63] F. Zhu and L. Shao (2013) Enhancing action recognition by cross-domain dictionary learning.. In British Machine Vision Conference (BMVC), Cited by: §2.
  • [64] Y. Zou, Z. Yu, B.V.K. Vijaya Kumar, and J. Wang (2018-09) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In European Conference on Computer Vision (ECCV), Cited by: §2.