Adversarial Pyramid Network for Video Domain Generalization

by   Zhiyu Yao, et al.
Tsinghua University

This paper introduces a new research problem of video domain generalization (video DG) where most state-of-the-art action recognition networks degenerate due to the lack of exposure to the target domains of divergent distributions. While recent advances in video understanding focus on capturing the temporal relations of the long-term video context, we observe that the global temporal features are less generalizable in the video DG settings. The reason is that videos from other unseen domains may have unexpected absence, misalignment, or scale transformation of the temporal relations, which is known as the temporal domain shift. Therefore, the video DG is even more challenging than the image DG, which is also under-explored, because of the entanglement of the spatial and temporal domain shifts. This finding has led us to view the key to video DG as how to effectively learn the local-relation features of different time scales that are more generalizable, and how to exploit them along with the global-relation features to maintain the discriminability. This paper presents the Adversarial Pyramid Network (APN), which captures the local-relation, global-relation, and multilayer cross-relation features progressively. This pyramid network not only improves the feature transferability from the view of representation learning, but also enhances the diversity and quality of the new data points that can bridge different domains when it is integrated with an improved version of the image DG adversarial data augmentation method. We construct four video DG benchmarks: UCF-HMDB, Something-Something, PKU-MMD, and NTU, in which the source and target domains are divided according to different datasets, different consequences of actions, or different camera views. The APN consistently outperforms previous action recognition models over all benchmarks.



1 Introduction

Figure 1: We view the key to solving video DG as how to exploit local and global temporal cues to overcome the space-time domain shift. While the global temporal relations appear to be less generalizable, the local temporal relations can increase feature transferability, but may also lead to incorrect generalizations without the help of global temporal relations. In this example, we may recognize a video of playing basketball from the unseen target domain via a shared sub-action dribbling (bottom right). However, through the local temporal feature of running, which is less discriminative from a classification perspective, we may misrecognize a video of playing football (bottom left) as playing basketball. This work balances the transferability against discriminability between the local and global temporal features.

Improving the transferability of video recognition models has not been fully investigated before, but it is significant in real-world applications where datasets consist of a limited number of population sources. Previous deep networks achieve competitive results in the intra-domain setting [32, 35, 2, 33, 39], assuming that training and test videos are independently and identically distributed (i.i.d.). We find that the performance of these models degrades under domain shift, where the i.i.d. assumption is violated. In this paper, we refer to this inter-domain experimental setting as the video domain generalization (video DG) problem, where models are trained on one source domain and evaluated on unseen target domains with the same label set. Unlike the extensively studied video domain adaptation problem [3, 11], in video DG, neither the labels nor the data of the target domains are available during training.

The essential question of video DG is how to mitigate the domain shift when the target distributions are totally unknown. The overall domain shift can be viewed as an entanglement of the spatial and temporal domain shifts. The spatial domain shift is caused by the variations of the appearance in video frames, which can be partly solved by previous image DG methods [31, 1, 5]. This paper focuses more on the entangled spatial and temporal domain shifts, which are more complex, since we need to consider the misalignment of temporal features across domains further.

We claim that the key to this problem is to effectively capture the local temporal relations between frames and leverage them as the generalization bridges. Our motivation is that the sub-actions of videos under the same category might not be identical in different domains, but there must be an intersection of several local temporal features. For example, in Figure 1, videos of playing basketball from different domains may share multiple sub-actions such as layup and dribbling (strong local-relation features). But we notice that different categories may also share intersections of local temporal cues, e.g. playing basketball and playing football may share the same sub-action running (weak local-relation features). Although we don’t know what the strong local-relation features are, we know that they are necessarily highly correlated with the discriminative representations of the video, so we can cover the unknown target distributions by augmenting the training dataset via these shared local temporal cues. Otherwise, if we augment the dataset through the less discriminative weak local-relation features, it may lead to false generalization results.

In this paper, we present the Adversarial Pyramid Network (APN) to solve video DG, which is empirically established on the observation that the local-relation features improve the transferability while the global-relation features ensure the discriminability. The APN trades off the impact of these features in the processes of representation learning and the cross-domain data augmentation. At different pyramid levels, the APN captures local-relation features at different time scales and then aligns them to the global-relation features, thereby weakening the impact of less discriminative local-relation features that might lead to false generalization results. As a result, the APN learns multilayer cross-relation features with different levels of transferability and discriminability. Further, the multilayer cross-relation features can be used to expand the original Adversarial Data Augmentation (ADA) method [31] specifically to video data, so as to solve the spatial and temporal domain shifts together. We observe that only if the multilayer cross-relation features are combined, can the ADA method achieve a notable improvement on video DG.

To sum up, this paper has the following contributions:

  • A new problem: We introduce a new research problem of video DG and show that most previous video recognition networks degrade in such settings, where the spatiotemporal domain shift is present.

  • A deep network that learns more generalizable video representation: We provide a pilot study of video DG and propose the new APN model that learns a pyramid of temporally cross-relation features that transfer well to other unseen domains. It is established on our key findings that the local-relation features are more generalizable while the global-relation features are more discriminative.

  • An extended ADA method for video data: Tightly depending on the pyramid of cross-relation features, we adapt the image ADA method [31] to space-time, along with a different minimax training procedure. We show that only with the relational feature pyramid, can the ADA be effectively used to improve video DG.

  • New video DG benchmarks: We design four video DG benchmarks, based on five existing video datasets that are widely used in the standard, early and multi-view action recognition. These benchmarks cover different video DG scenarios with (1) only spatial domain shift; (2) entangled spatial and temporal domain shifts. Our approach achieves the best results consistently.

2 Related Work

This work is closely related to the deep learning methods in video action recognition and image DG, and is also related to the recent advances in video domain adaptation.

Video Action Recognition.

Many deep models for video action recognition are based on 2D CNNs [23, 13, 6, 32, 39, 25, 12, 18] and 3D CNNs [26, 27, 2, 20, 36, 33, 28]. Among all these models, our APN model is most related to the Temporal Relation Network (TRN) [39], which learns local temporal features from different lengths and different combinations of short video snippets and then ensembles them together to get a global sequence-level feature. However, we find that the performance of TRN deteriorates in the video DG settings. We also find that TRN cannot further benefit from the modern DG method [31].

Our work differs from TRN in two respects. First, our approach is fully motivated from the view of domain generalization. Thus we consider the cross-relation features in addition to the local relational features to trade off the discriminability and transferability. Second, we introduce a new approach that effectively adapts the previous ADA method to video DG. Different from TRN, our approach leverages a pyramid architecture based on the Transformer self-attention mechanism [30] to progressively fuse temporal relations at different time scales. From this perspective, our work is also related to recognition models that incorporate attention blocks [8, 33, 38, 34]. Our model organizes the attention blocks in a pyramid framework, which is driven by a clear motivation and validated by extensive experiments.

Image Domain Generalization.

Previous approaches in DG are mostly designed for image data [16, 17, 22, 31, 1, 5], which can be divided into two groups: feature-based methods and data-based methods. Feature-based methods focus on extracting invariant cross-domain representations. Li et al. [16] introduced a prior distribution on the feature representation via adversarial learning and a Maximum Mean Discrepancy (MMD) regularizer. Li et al. [17] proposed a meta-learning approach to train a domain-invariant feature extractor. Data-based methods connect the source domain distributions and the unseen target distributions by expanding the training dataset. Shankar et al. [22] augmented the source domain with domain-guided perturbations of the input instances. Volpi et al. [31] proposed the ADA method, which augments the source domain with adversarial examples. Unlike all the above models, our approach is an early work for video DG, which extends the basic ADA method by particularly considering how to mitigate the temporal domain shift that is entangled with the spatial domain shift.

Image and Video Domain Adaptation.

Different from our DG problem, in domain adaptation (DA), the target distributions (without labels) are still available during training. While most previous work mainly focuses on solving the image DA problem [37, 7, 29, 19], some recent models were proposed to solve the video DA problem [11, 3]. They are related to our work because the domain shift exists in both the video DA and DG problems. However, these models close the representation distances across domains given the target distributions, and thus are not applicable to the DG settings. In contrast, the video DG problem is more challenging and has not been well explored.

3 Preliminary: Adv. Data Augmentation

In this section, we will show how the Adversarial Data Augmentation (ADA) method [31] solves the image DG problem. The ADA method works in an iterative minimax training procedure: it generates adversarial examples that aim to fool the current discriminative network, and then appends these examples to the original data minibatch for image recognition. The ADA focuses on the following worst-case problem around the source distribution $P_0$:

$$\min_{\theta} \; \sup_{P:\, D(P, P_0) \le \rho} \; \mathbb{E}_{P}\big[\ell(\theta; (X, Y))\big] \tag{1}$$

where $\theta$ is the set of weights of the entire model, $(X, Y)$ indicates a source data point with its label, and $\ell$ is the categorical cross-entropy loss.

The ADA method denotes by $D$ the distance metric around the source distribution that characterizes the set of unknown populations we wish to generalize to. The perturbed new data distribution $P$ should be diverse enough but not deviate far from $P_0$, i.e. $D(P, P_0) \le \rho$. $D$ is defined by the Wasserstein distance on the semantic space. Consider the transportation cost $c$ from $(z, y)$ to $(z', y')$:

$$c\big((z, y), (z', y')\big) = \frac{1}{2}\,\|z - z'\|_2^2 + \infty \cdot \mathbf{1}\{y \ne y'\} \tag{2}$$

which is denoted by $\infty$ if $y \ne y'$. By taking $z$ and $z'$ as the outputs of the last hidden layer for the inputs $x$ and $x'$, the distance of two data points from the original space is defined as

$$c_\theta\big((x, y), (x', y')\big) = c\big((z, y), (z', y')\big) \tag{3}$$

Thus, the worst-case formulation over data distributions can be defined as a surrogate loss [31]:

$$\phi_\gamma\big(\theta; (x_0, y_0)\big) = \sup_{x} \Big\{ \ell\big(\theta; (x, y_0)\big) - \gamma\, c_\theta\big((x, y_0), (x_0, y_0)\big) \Big\} \tag{4}$$

where $\gamma$ is a hyper-parameter of the transport cost penalty and $(x_0, y_0)$ is a data point from the source distribution $P_0$.

The training procedure of ADA has two separate stages: a data augmentation stage and a minimization stage with respect to $\theta$. The data augmentation stage has two alternated training phases: a maximization phase with respect to Eq. (4) and an online minimization phase of $\ell$ on the augmented dataset. In the $k$-th maximization phase, the new data point $x^k$ is generated by maximizing the perturbation over the source data with a factor of $\eta$:

$$x^{k} \leftarrow x^{k} + \eta\, \nabla_{x} \Big\{ \ell\big(\theta; (x^{k}, y_0)\big) - \gamma\, c_\theta\big((x^{k}, y_0), (x_0, y_0)\big) \Big\} \tag{5}$$

The original image ADA method is not suitable for video DG in two ways. First, it defines the transportation cost at the activation of the last hidden layer according to Eq. (3), which could be less generalizable on video data. In contrast, we define the transportation cost at different network levels to trade off between the transferability of the local temporal features and the discriminability of the global temporal features, thereby leading to more diverse new data points. Second, the training procedure of image ADA no longer fits the video DG problem. See Section 4.3 for a more detailed discussion.
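To make the maximization phase concrete, the following is a minimal sketch on a toy scalar model, not the implementation of [31]. The names `gamma` (cost penalty), `eta` (step factor), and all function names are illustrative assumptions; the "last hidden layer" is reduced to a single linear feature, and a finite difference stands in for backpropagation to the input.

```python
import math

def feature(w, x):
    """Toy 'last hidden layer': a scalar linear feature z = w * x."""
    return w * x

def loss(w, x, y):
    """Binary cross-entropy of a logistic classifier on the feature, y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-feature(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def cost(w, x, x0):
    """Transportation cost between x and the source point x0 in feature space."""
    return 0.5 * (feature(w, x) - feature(w, x0)) ** 2

def surrogate(w, x, x0, y, gamma):
    """Objective of the maximization phase: loss minus the penalized cost."""
    return loss(w, x, y) - gamma * cost(w, x, x0)

def maximization_phase(w, x0, y, gamma=1.0, eta=0.1, steps=20, eps=1e-5):
    """Gradient ascent on the input, starting from the source point x0."""
    x = x0
    for _ in range(steps):
        # Central finite difference approximates the gradient w.r.t. the input.
        grad = (surrogate(w, x + eps, x0, y, gamma)
                - surrogate(w, x - eps, x0, y, gamma)) / (2 * eps)
        x = x + eta * grad
    return x
```

Calling `maximization_phase(w=2.0, x0=1.0, y=1)` moves the input toward a higher loss, while the cost term keeps it close to the source point in feature space. A larger `gamma` keeps the adversarial example closer to the source.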

4 Adversarial Pyramid Network

Figure 2: Our proposed APN model. It captures the local-relation, global-relation, and multilayer cross-relation features progressively. It then uses the cross-relation features, which strike a compromise between transferability and discriminability, to generate new spatiotemporal adversarial examples. Our model improves video DG from both representation learning and data augmentation.

Our Adversarial Pyramid Network (APN) consists of three end-to-end trainable components (Figure 2): a CNN-based frame encoder, an attention-based relational feature pyramid, and a new video ADA training procedure. In this section, we will first introduce the building blocks of these network components, and then describe the details of the relational feature pyramid. Finally, we will present the extended ADA method for video data based on the relational feature pyramid, along with its minimax training procedure.

4.1 Building Blocks

Frame Encoder.

Given a video sequence, we divide it uniformly into segments, and then randomly sample one frame from each segment. We use a CNN to extract a feature from each sampled frame. Here, the frame feature is the activation of the last hidden layer of the ResNet-50 model [10] (other network backbones are also applicable), followed by a fully-connected layer and a Dropout layer. We then concatenate the frame features along the time dimension to obtain the input of the relational feature pyramid.
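The segment-based sampling described above can be sketched as follows. This is an illustrative helper, not code from the paper; the function name and its parameters are assumptions, and it returns frame indices only, leaving decoding and the CNN encoder aside.

```python
import random

def sample_segment_frames(num_frames, num_segments, rng=random):
    """Split [0, num_frames) into equal segments and draw one frame index from each."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    indices = []
    for s in range(num_segments):
        # Integer boundaries of segment s; each segment holds at least one frame.
        start = (s * num_frames) // num_segments
        end = ((s + 1) * num_frames) // num_segments
        indices.append(rng.randrange(start, end))
    return indices
```

Because a fresh index is drawn from every segment on each call, repeated passes over the same video see different frame combinations, which is the property the training procedure in Section 4.3 relies on.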

Multi-Scale Temporal Relations.

The idea of capturing multi-scale temporal relations is initially inspired by TRN [39], which has been proven to be effective for video understanding. But different from TRN, we innovatively propose an attention-based feature pyramid to progressively calculate frame relations within and across frame features of different time scales, which will be shown essential for generating effective spatiotemporal adversarial examples.

Attention Block.

Our pyramid architecture consists of multiple Transformer attention blocks [30]. Each block has three multi-head attention layers and two fully-connected layers followed by layer normalization [15] (more network details are in the supplementary materials). Typically, an attention block takes three inputs: the queries, the keys, and the values. Here, we share common inputs for the keys and the values, so each block takes two inputs: the queries and the shared keys/values.

4.2 Relational Feature Pyramid

By stacking the attention blocks, the APN learns the relational feature pyramid. It has three pyramid levels. At the first level, we learn the within-relation features, including the global temporal cues to summarize the overall patterns across the entire video, and the local temporal cues to bridge different video domains. At the second level, we learn cross-relation features by aligning the more domain-generalizable local relations to the more category-specific global relations, as we want to balance the feature transferability against the discriminability. At the last level, we fuse the cross-relation features and the global features to make the final predictions. We then use the level-II and level-III features to generate spatiotemporal adversarial examples.

Pyramid Level I: Within-Relation Attention.

The first pyramid level is applied to the output of the frame encoder to extract the within-relation features, which can be divided into the global one and the local ones. By taking as inputs the concatenated frame features over all video segments, we have the global feature that reflects the overall temporal cues of the video:

We then learn local relational features, which are assumed to represent the common knowledge that can mitigate the temporal domain shift. Concretely, for each time scale, we first randomly sample the corresponding number of consecutive items from the set of frame features, and concatenate them along the time dimension. Next, we use the attention block to obtain the local relational feature from the resulting multi-frame video snippet:

As a result, the first pyramid level provides a set of local relational features in addition to the global relational feature.

Pyramid Level II: Cross-Relation Attention.

Recalling the example of basketball (vs. football) in Figure 1, we may realize that local temporal cues can be more transferable but may also lead to false generalization. Thus, before generating the spatiotemporal adversarial examples to augment the training dataset, we need to constrain the learned local relations to more category-specific latent distributions. To this end, we use the second pyramid level to align the local features to the global feature, and conduct the cross-relation attention such that


where we use the global feature as the queries and each local feature as the keys and the values. The global feature spans all video segments, whereas each local feature spans only the frames of its snippet. In other words, the cross-relation features have the same size as the global feature in the spatiotemporal latent space. Then we ensemble all these cross-relation features of different time scales together into the overall level-II cross-relation feature:


which not only contains a variety of transferable knowledge but also keeps the category-specific knowledge. We thus use this level-II feature as the first component to generate new data points.

Pyramid Level III: Cross-Relation Sum.

In the last level, we aggregate the cross-relation feature and the global relational feature for the final classification and further data augmentation. We explored several aggregation approaches, such as attention and concatenation. Through experiments, we find that the element-wise sum of the two features is the most effective:

While the cross-relation attention block at the previous pyramid level focuses on finding more discriminative local-relation features by aligning them with the global-relation feature, the cross-relation sum operation here strengthens the influence of the global-relation feature. Thus, these two types of features are complementary to each other, leading the subsequent video ADA method to generate more diverse and representative adversarial examples.
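The data flow through the three pyramid levels can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the APN architecture itself: the attention here is single-head scaled dot-product attention without learned projections, layer normalization, or fully-connected layers; the cross-relation features of different time scales are combined by summation (one possible reading of "ensemble"); and the names `attention`, `add`, and `pyramid` are hypothetical.

```python
import math
import random

def attention(queries, keys_values):
    """Scaled dot-product attention with shared keys and values.
    Returns one output vector per query."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values)) / total
                    for j in range(d)])
    return out

def add(a, b):
    """Element-wise sum of two equally sized feature sequences."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def pyramid(frames, scales, rng=random):
    # Level I: global feature over all frames, local features over short snippets.
    global_feat = attention(frames, frames)
    level2 = None
    for m in scales:
        start = rng.randrange(len(frames) - m + 1)
        snippet = frames[start:start + m]      # m consecutive frame features
        local_feat = attention(snippet, snippet)
        # Level II: global feature as queries, local feature as keys and values.
        cross = attention(global_feat, local_feat)
        level2 = cross if level2 is None else add(level2, cross)
    # Level III: element-wise sum of the level-II feature and the global feature.
    level3 = add(level2, global_feat)
    return global_feat, level2, level3
```

Since the global feature supplies the queries, every cross-relation feature has as many time steps as the global feature regardless of the snippet length, which is what allows the different time scales to be combined and then summed with the global feature.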

1: Input: source video dataset
2: Output: learned APN weights
4: repeat (for each training minibatch, until convergence)
5:     Sample a minibatch uniformly from the video dataset
6:     Extract pyramid features from the source data
7:     Minimize the classification loss of the source data
9:     for each maximization phase do
10:          Extract the level-II feature from the current level-II examples
11:          Generate adversarial examples at pyramid level-II
12:          Extract the level-III feature from the current level-III examples
13:          Generate adversarial examples at pyramid level-III
14:     end for
15:     Minimize the classification loss of level-II adv. examples
16:     Minimize the classification loss of level-III adv. examples
17: until convergence
Algorithm 1 Training Procedure of Adversarial Pyramid Network (APN)

4.3 Video ADA with Relational Feature Pyramid

Video ADA has its own challenge in spatiotemporal modeling compared with the original image ADA method. Besides using the within-relation and cross-relation feature pyramid to learn more generalizable video representations, we also use the multilayer cross-relation features to generate new perturbed data points that help bridge distributions across domains. First, we use the cross-relation feature of pyramid level-II to generate adversarial examples by maximizing the following loss with respect to the input data:


where the adversarial example is generated from the source data point, and the classifier can be a multilayer perceptron (MLP) or a couple of convolutional layers. Here, we simply use a fully-connected layer. As in the original ADA method, the classification loss is the categorical cross-entropy loss, and the transportation cost is measured by the mean squared loss. We obtain the level-II adversarial examples by Eq. (5).

As we have mentioned, the level-II feature reflects more local temporal relations, being computed from cross-relation attention. We suppose that the local temporal relations are more generalizable but may lead to overly diverse data augmentation results. To compensate, we use the level-III feature, which reflects more global temporal relations through a residual connection from the global-relation feature, to generate the level-III adversarial examples. Similarly, we maximize the following loss function:


Notably, unlike in the original image ADA method, these spatiotemporal adversarial examples have a time dimension, whose length is equal to the number of video segments.
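For orientation, the two maximization objectives described in this subsection can be written in the style of the ADA surrogate loss. This is a sketch under assumed notation rather than the paper's own formulas: $f^{\mathrm{II}}_\theta(x)$ and $f^{\mathrm{III}}_\theta(x)$ are hypothetical names for the level-II and level-III features of an input video $x$, $g_\theta(x)$ for the global-relation feature, $(x_0, y_0)$ for a source video with its label, $\ell$ for the cross-entropy loss of the full model, $\gamma$ for the penalty weight, and $\mathrm{MSE}$ for the mean squared error.

```latex
\begin{aligned}
x^{\mathrm{II}}  &\in \arg\max_{x}\; \ell\big(\theta; (x, y_0)\big)
  - \gamma\, \mathrm{MSE}\big(f^{\mathrm{II}}_\theta(x),\, f^{\mathrm{II}}_\theta(x_0)\big), \\
x^{\mathrm{III}} &\in \arg\max_{x}\; \ell\big(\theta; (x, y_0)\big)
  - \gamma\, \mathrm{MSE}\big(f^{\mathrm{III}}_\theta(x),\, f^{\mathrm{III}}_\theta(x_0)\big), \\
f^{\mathrm{III}}_\theta(x) &= f^{\mathrm{II}}_\theta(x) + g_\theta(x).
\end{aligned}
```

The two objectives differ only in the feature at which the transportation cost is measured. In practice, the maximum is approached by a few gradient ascent steps on the input rather than solved exactly.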

Training Procedure.

We show the minimax training procedure of our video ADA method that is based on the APN model in Algorithm 1. It has two major differences from the image ADA method. First, they have different frameworks. The image ADA method has two separate training stages: it first generates new data over multiple minimax training phases, each of which augments the dataset, and subsequently learns the model by minimizing the classification loss over the augmented dataset (see [31]). Such a two-stage training strategy is not suitable for video data. Deep networks for video recognition commonly use frame sampling approaches to obtain different combinations of frames over the whole video, thus enabling ensemble learning and effectively avoiding over-fitting. However, as the number of maximization phases grows, the original ADA method becomes more and more likely to sample previously augmented examples to perturb, whose anchor frames are fixed during the entire training period. To solve this problem, we combine the two separate training stages of image ADA, generating adversarial examples and optimizing the classification loss over the augmented data on-the-fly (line 13-14 in Alg. 1).

The second characteristic of our video ADA method is that we use the relational feature pyramid, rather than only the features of the last hidden layer, to generate new data points (line 7-12 in Alg. 1). Empirically, we find that the level-II adversarial examples reflect more local relations and are thus more diverse, covering other population sources. On the other hand, the level-III adversarial examples are generated from features that are closer to the global relations due to the element-wise sum operation, and are thus more akin to the category-specific distributions. That is why we use the pyramid of cross-relation features for video ADA.
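The on-the-fly control flow described above can be sketched as a single training iteration. This is a structural illustration only: `extract`, `minimize`, and `perturb` are hypothetical callables standing in for the feature pyramid, one optimizer step on the classification loss, and one gradient ascent step on the input, and starting both adversarial chains from the source minibatch is an assumption.

```python
def train_step(batch, extract, minimize, perturb, num_phases):
    """One iteration: source update, adversarial generation, adversarial updates."""
    minimize(extract(batch))                  # classification loss on source data
    adv2, adv3 = batch, batch                 # both chains start at the source batch
    for _ in range(num_phases):
        adv2 = perturb(adv2, batch, level=2)  # ascend the level-II objective
        adv3 = perturb(adv3, batch, level=3)  # ascend the level-III objective
    minimize(extract(adv2))                   # loss on level-II adversarial examples
    minimize(extract(adv3))                   # loss on level-III adversarial examples
    return adv2, adv3
```

Because the adversarial examples are created and consumed within the same iteration, each minibatch can be built from freshly sampled frames, instead of perturbing a stored set of augmented clips whose frames never change.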

5 Experiments

We evaluate the APN on four video DG benchmarks. The first two benchmarks represent entangled spatial and temporal domain shifts. In UCF-HMDB, the source and target domains are divided according to different datasets that are collected from diverse sources. In Something-Something, domains are divided according to different consequences of the actions, such as “Doing something” in the source and “Pretending to do something” in the target. The other two benchmarks, PKU-MMD and NTU, represent a remarkable spatial domain shift since the source and target domains are naturally divided according to different camera views. In this section, we will show the overall implementation details and then analyze the experiment results on each benchmark.

5.1 Implementation Details

We divide the source domain videos into a training set () and a validation set (), following the training and validation protocol from image DG [7]. For the network hyper-parameters, the number of video segments is set to for UCF-HMDB and Something-Something, and for PKU-MMD and NTU. The dimension of the frame feature is set to . As for the hyper-parameters of the data augmentation process, we set the adversarial perturbation factor to , the number of maximization phases to , and train models with to make the final predictions at test time using the ensemble strategy of the original ADA method [31]. We use the SGD optimizer with a base learning rate and decay it by every epochs. We use the random cropping and horizontal flipping for the input frames at training time, and use the center cropping at test time. Our models converge in less than hours on GPUs for epochs with a minibatch size of video clips.

5.2 Compared Models

We compare our models with well-known or state-of-the-art video action recognition models: TSN [32], TRN [39], TSM [18], I3D [2], and Non-local [33]. For a fair comparison, TSN, TSM, TRN, and our models all use the ResNet-50 [10] backbone that is pretrained on ImageNet. I3D and Non-local use the 3D ResNet-50 backbone with input frames and are pretrained on the Kinetics dataset. We also include baseline models by combining the above networks with the original image ADA method [31], whose details can be found in the supplementary materials.

5.3 UCF-HMDB

In this benchmark, we select overlapping categories shared by UCF101 [24] and HMDB51 [14], consisting of videos. We take one dataset as the training/validation set and the other as the test set. A detailed list of the selected categories can be found in the supplementary material.


Table 1 compares the video DG results of all compared models on UCF-HMDB. Our final APN (the one trained with space-time adversarial examples) achieves the best average accuracy of 65.74%, exceeding the second place (Non-local without ADA, 62.98%) by 2.76%.

Model | U → H | H → U | Avg
TSN | 51.68 / 51.61 | 68.80 / 68.50 | 60.24 / 60.06
TRN | 52.83 / 53.28 | 70.13 / 70.36 | 61.48 / 61.82
TSM | 52.61 / 52.06 | 69.39 / 68.64 | 61.00 / 60.35
I3D | 52.50 / 52.11 | 68.94 / 66.80 | 60.72 / 59.46
Non-local | 54.38 / 53.83 | 71.58 / 70.83 | 62.98 / 62.33
APN | 54.21 / 57.90 | 71.43 / 73.57 | 62.82 / 65.74
Table 1: Video DG results in classification accuracy (%) on UCF-HMDB. U → H denotes training on UCF and testing on HMDB, H → U the reverse, and Avg the mean of the two. The two results in each column are respectively obtained with models trained without/with ADA (compare Table 2). A higher second value indicates a positive improvement by ADA, and a lower one a negative improvement.
Model | U → H | H → U
TRN | 52.83 | 70.13
TRN + ADA | 53.28 | 70.36
TRN + ADA* | 52.56 | 69.68
TRN-ATTN | 53.33 | 70.93
TRN-ATTN + ADA | 53.72 | 71.77
TRN-ATTN + ADA* | 52.17 | 70.28
APN | 54.21 | 71.43
APN + ADA | 55.61 | 72.47
APN + ADA* | 57.90 | 73.57
Table 2: Comparing APN with TRN. U means UCF and H means HMDB. TRN-ATTN is the model that applies the attention blocks to the and features of TRN (details of these features can be found in [39]). ADA indicates generating adversarial examples using the activation of the last hidden layer. ADA* is generating examples using multilayer features. APN + ADA* is our final proposed model.

Are global temporal features generalizable?

Although the TSN and TSM models have been proven competitive for the standard video classification, they underperform in video DG (see Table 1). Further, we implement a baseline model over APN by using the global-relation feature instead of the cross-relation features or for generating adversarial examples, which obtains a 53.37% accuracy from UCF to HMDB, and 71.83% in turn. We notice that it only has a slight improvement over the vanilla APN without ADA, indicating that the global temporal features are not generalizable enough and cannot work well with adversarial video data augmentation.

Are cross-relation features generalizable?

Even without ADA training, the APN model alone outperforms all other compared models except Non-local, which it trails by only 0.16% on average. Non-local works better than I3D due to the self-attention block, which adaptively learns local and global temporal relations across a video clip of frames. This result partly verifies our findings that the cross-relation features are more generalizable. However, Non-local cannot achieve further improvement from ADA training. Only APN and TRN enable effective video ADA, increasing the average accuracy by 2.92% and 0.34%, respectively. We may conclude that while video DG benefits from the cross-relation features from the representation learning perspective, it benefits more from the data perspective by using multilayer cross-relation features for video ADA.
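The gains quoted above can be re-derived from the average column of Table 1. The short check below copies those averages and reads the first value of each pair as the result without ADA and the second as the result with ADA, which is the reading consistent with Table 2; the dictionary and function names are illustrative.

```python
# Average accuracy (%) from Table 1 as (without ADA, with ADA).
table1_avg = {
    "TSN": (60.24, 60.06),
    "TRN": (61.48, 61.82),
    "TSM": (61.00, 60.35),
    "I3D": (60.72, 59.46),
    "Non-local": (62.98, 62.33),
    "APN": (62.82, 65.74),
}

def ada_gain(model):
    """Change in average accuracy when ADA training is added."""
    without_ada, with_ada = table1_avg[model]
    return round(with_ada - without_ada, 2)

# Models whose average accuracy improves with ADA training.
improved = [name for name in table1_avg if ada_gain(name) > 0]
```

Here `ada_gain("APN")` evaluates to 2.92 and `ada_gain("TRN")` to 0.34, and `improved` contains only TRN and APN, matching the statement that these are the only two models that benefit from ADA on this benchmark.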

Comparing with TRN.

TRN is our most important baseline model since it also captures local temporal relations at multiple time scales, which tend to be generalizable. To investigate the necessity of our cross-relation operations, we train a new baseline model named TRN-ATTN by applying the within-relation attention block to the and features of TRN (please look them up in [39]). As shown in Table 2, our APN models consistently and remarkably outperform the TRN and TRN-ATTN models, validating the rationality of the relational feature pyramid in both representation learning and the generation of adversarial examples.

Why use the relational feature pyramid for video ADA?

We also observe from Table 2 that both the TRN and TRN-ATTN models degenerate when multilayer features are used for ADA. On the contrary, the multilayer cross-relation features of APN enable more effective ADA training and thus further improve video DG. We argue that it is because the level-II feature represents more local relations through cross-relation attention, thus increasing the diversity of the new data points; while the level-III feature reflects global relations more directly through the cross-relation sum operation, thus resulting in more representative spatiotemporal adversarial examples. As a comparison, by using or separately for generating the spatiotemporal adversarial examples, we obtain from UCF to HMDB, and in turn. From these results, we can see that and are complementary to each other and both necessary for video ADA.

5.4 Something-Something

We construct this video DG benchmark by selecting basic categories from the Something-Something dataset [9]. Under each basic category such as “Tearing”, there are two sub-categories with different consequences of the actions such as “Tearing something” and “Pretending to be tearing something that is not tearable”. We put different sub-categories to different domains. The domain shift in this context entangles the appearance and motion variations. The full category list can be found in the supplementary material. As a result, the source domain has videos and the target domain has around videos.

| Model | Source → Target | Target → Source |
| --- | --- | --- |
| TSN | 35.47 / 35.39 | 22.96 / 22.56 |
| TRN | 37.15 / 37.24 | 23.82 / 24.10 |
| TSM | 36.74 / 36.38 | 23.22 / 23.10 |
| I3D | 31.20 / 30.54 | 22.98 / 21.73 |
| Non-local | 33.60 / 33.21 | 23.68 / 23.16 |
| APN | 38.24 / 40.60 | 24.87 / 27.37 |

Table 3: Video DG results (%) on Something-Something. In each column, the two values are obtained without/with ADA.


Due to the extremely complicated variations of object appearance and motion cues across domains, none of the compared models generalizes well on this challenging benchmark (see Table 3). Nevertheless, the APN outperforms the other models significantly, exceeding the second-best model (TRN with ADA) by 3.32% on average. Note that both I3D and Non-local perform worse on this benchmark than on UCF-HMDB. The reason is that Kinetics-pretrained models are less effective for this dataset than for sports videos. Figure 3 shows an example of the classification results on the target set.
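The reported average margin over TRN can be checked directly from the Table 3 entries. The short sketch below copies the TRN and APN rows and averages the gap over the two transfer directions. The helper name `mean_margin` and the direction keys are illustrative. The reported figure of about 3.32% is reproduced when the second value of each cell is used for both models.

```python
# Table 3 accuracies (%), stored as (first value, second value) per cell.
TABLE3 = {
    "TRN": {"source->target": (37.15, 37.24), "target->source": (23.82, 24.10)},
    "APN": {"source->target": (38.24, 40.60), "target->source": (24.87, 27.37)},
}


def mean_margin(table, model, baseline, index):
    """Average accuracy gap (model minus baseline) over all transfer directions.

    `index` selects which of the two values in each cell is compared.
    """
    gaps = [
        table[model][direction][index] - table[baseline][direction][index]
        for direction in table[model]
    ]
    return sum(gaps) / len(gaps)


# Compare the second value of each cell for APN and TRN.
margin = mean_margin(TABLE3, "APN", "TRN", index=1)
print(f"APN minus TRN, second values: {margin:.3f} points")
```

The two per-direction gaps are 3.36 and 3.27 points, whose mean matches the margin quoted in the text up to rounding.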

Figure 3: An example of the video DG results on Something-Something. The first row shows training data from the source domain, and the second row shows test data from the target domain. Green bars indicate correct predictions and orange bars incorrect ones. The length of each bar denotes the confidence.

5.5 PKU-MMD and NTU

The PKU-MMD dataset [4] and NTU dataset [21] are both large-scale benchmarks for multi-view human action understanding, where we can easily exploit the spatial domain shift across different camera views to build the video DG task. The PKU-MMD contains about trimmed short video clips of action categories such as “Drinking water” and “Sitting down”, which are recorded from three camera views (Left, Center, Right). The NTU dataset contains around videos of action categories such as “Kicking something” and “Standing up”. It has subjects in viewpoints, which can be grouped into domains according to the camera angle. From Table 4 and Table 5, the APN consistently achieves the best results, on average exceeding the second place (TRN with ADA) by 4.52% and 3.69% respectively on these benchmarks. Figure 4 gives a showcase of the classification result on the NTU target set.
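The single-source protocol used for the multi-view benchmarks (train on one camera view, test on the remaining ones) can be enumerated as below. This is a small sketch under the assumption that each view is treated as one domain. The function name `single_source_splits` is illustrative.

```python
VIEWS = ("L", "C", "R")  # left, center, and right camera views


def single_source_splits(domains):
    """Yield (source, targets) pairs: train on one domain, test on the rest."""
    for source in domains:
        targets = tuple(d for d in domains if d != source)
        yield source, targets


splits = list(single_source_splits(VIEWS))
for source, targets in splits:
    print(f"{source} -> {', '.join(targets)}")
```

Each printed line corresponds to one column of Tables 4 and 5, where the model never sees the target views during training.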

| Model | L → C, R | C → L, R | R → L, C |
| --- | --- | --- | --- |
| TSN | 38.65 / 37.37 | 31.52 / 30.61 | 39.94 / 39.36 |
| TRN | 40.10 / 40.58 | 33.65 / 34.70 | 42.08 / 41.85 |
| TSM | 39.29 / 38.74 | 32.41 / 31.49 | 40.52 / 40.23 |
| I3D | 37.91 / 36.82 | 34.21 / 34.08 | 38.55 / 38.31 |
| Non-local | 39.46 / 38.67 | 35.23 / 34.95 | 39.22 / 38.13 |
| APN | 42.73 / 44.85 | 35.29 / 39.72 | 43.30 / 46.11 |

Table 4: Video DG results (%) on PKU-MMD. Here, L (left), C (center), and R (right) indicate different viewpoint domains. In each column, the two values are obtained without/with ADA.

| Model | L → C, R | C → L, R | R → L, C |
| --- | --- | --- | --- |
| TSN | 37.48 / 35.39 | 59.63 / 57.25 | 44.36 / 42.43 |
| TRN | 40.89 / 41.64 | 60.18 / 60.74 | 48.34 / 48.95 |
| TSM | 37.67 / 37.28 | 59.69 / 58.91 | 46.17 / 45.79 |
| I3D | 41.76 / 41.67 | 51.93 / 50.50 | 42.29 / 42.10 |
| Non-local | 42.54 / 41.87 | 53.67 / 53.51 | 43.78 / 43.61 |
| APN | 41.47 / 44.33 | 60.99 / 65.11 | 49.36 / 52.96 |

Table 5: Video DG results (%) on NTU. In each column, the two values are obtained without/with ADA.

Figure 4: Two examples of the video DG results on NTU. The first row shows training data from the source domain, and the second row shows test data from the target domains. Green bars indicate correct predictions and orange bars incorrect ones. The length of each bar denotes the classification confidence.
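The average margins quoted for PKU-MMD and NTU can be recomputed from Tables 4 and 5. The sketch below copies the table entries as (first value, second value) pairs per split and reports two quantities: the mean gap between APN and TRN on the second values, and each model's mean change from the first to the second value. The helper names are illustrative. Using the second values for both models reproduces margins of roughly 4.52 and 3.69 points.

```python
# Accuracies (%) per split, stored as (first value, second value),
# in the column order of Tables 4 and 5.
PKU_MMD = {
    "TSN": [(38.65, 37.37), (31.52, 30.61), (39.94, 39.36)],
    "TRN": [(40.10, 40.58), (33.65, 34.70), (42.08, 41.85)],
    "TSM": [(39.29, 38.74), (32.41, 31.49), (40.52, 40.23)],
    "I3D": [(37.91, 36.82), (34.21, 34.08), (38.55, 38.31)],
    "Non-local": [(39.46, 38.67), (35.23, 34.95), (39.22, 38.13)],
    "APN": [(42.73, 44.85), (35.29, 39.72), (43.30, 46.11)],
}
NTU = {
    "TSN": [(37.48, 35.39), (59.63, 57.25), (44.36, 42.43)],
    "TRN": [(40.89, 41.64), (60.18, 60.74), (48.34, 48.95)],
    "TSM": [(37.67, 37.28), (59.69, 58.91), (46.17, 45.79)],
    "I3D": [(41.76, 41.67), (51.93, 50.50), (42.29, 42.10)],
    "Non-local": [(42.54, 41.87), (53.67, 53.51), (43.78, 43.61)],
    "APN": [(41.47, 44.33), (60.99, 65.11), (49.36, 52.96)],
}


def mean_second(rows):
    """Mean of the second value over all splits."""
    return sum(second for _, second in rows) / len(rows)


def mean_delta(rows):
    """Mean change from the first to the second value over all splits."""
    return sum(second - first for first, second in rows) / len(rows)


margins = {}
for name, table in (("PKU-MMD", PKU_MMD), ("NTU", NTU)):
    margins[name] = mean_second(table["APN"]) - mean_second(table["TRN"])
    print(f"{name}: APN minus TRN (second values) = {margins[name]:.2f}")
    for model, rows in table.items():
        print(f"  {model:<10} mean change between values = {mean_delta(rows):+.2f}")
```

The per-model change makes the contrast between APN and the frame-level or 3D baselines easy to inspect split by split.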

6 Conclusion

In this paper, we introduced a new problem of video DG, where models are trained on one source domain and evaluated on different unseen domains. We found that most action recognition networks underperform in such settings with the entangled spatial and temporal domain shifts. To solve this problem, we proposed the Adversarial Pyramid Network, which progressively learns generalizable and discriminative video representations at different pyramid levels. We then used the feature pyramid to generate adversarial examples in space-time, and thus derived a new adversarial video data augmentation method. We constructed four video DG benchmarks with different kinds of spatial and temporal domain shifts. Our approach was shown to consistently achieve the best results over these benchmarks.


References

  • [1] F. M. Carlucci, A. D’Innocente, S. Bucci, B. Caputo, and T. Tommasi (2019) Domain generalization by solving jigsaw puzzles. In CVPR.
  • [2] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR.
  • [3] M. Chen, Z. Kira, and G. AlRegib (2019) Temporal attentive alignment for video domain adaptation. In ICCV.
  • [4] L. Chunhui, H. Yueyu, L. Yanghao, S. Sijie, and L. Jiaying (2017) PKU-MMD: a large scale benchmark for continuous multi-modal human action understanding. arXiv preprint arXiv:1703.07475.
  • [5] Q. Dou, D. C. Castro, K. Kamnitsas, and B. Glocker (2019) Domain generalization via model-agnostic learning of semantic features. In NeurIPS.
  • [6] C. Feichtenhofer, A. Pinz, and R. Wildes (2016) Spatiotemporal residual networks for video action recognition. In NeurIPS.
  • [7] M. Ghifary, W. B. Kleijn, M. Zhang, and D. Balduzzi (2015) Domain generalization for object recognition with multi-task autoencoders. In ICCV.
  • [8] R. Girdhar and D. Ramanan (2017) Attentional pooling for action recognition. In NeurIPS.
  • [9] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. (2017) The “something something” video database for learning and evaluating visual common sense. In ICCV.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • [11] A. Jamal, V. P. Namboodiri, D. Deodhare, and K. S. Venkatesh (2018) Deep domain adaptation in action space. In BMVC.
  • [12] B. Jiang, M. Wang, W. Gan, W. Wu, and J. Yan (2019) STM: spatiotemporal and motion encoding for action recognition. In ICCV.
  • [13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In CVPR.
  • [14] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In ICCV.
  • [15] J. Lei Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
  • [16] H. Li, S. Jialin Pan, S. Wang, and A. C. Kot (2018) Domain generalization with adversarial feature learning. In CVPR.
  • [17] Y. Li, Y. Yang, W. Zhou, and T. M. Hospedales (2019) Feature-critic networks for heterogeneous domain generalization. In ICML.
  • [18] J. Lin, C. Gan, and S. Han (2019) TSM: temporal shift module for efficient video understanding. In ICCV.
  • [19] M. Long, Z. Cao, J. Wang, and M. I. Jordan (2018) Conditional adversarial domain adaptation. In NeurIPS.
  • [20] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV.
  • [21] A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In CVPR.
  • [22] S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri, P. Jyothi, and S. Sarawagi (2018) Generalizing across domains via cross-gradient training. In ICLR.
  • [23] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS.
  • [24] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR abs/1212.0402.
  • [25] S. Sun, Z. Kuang, W. Ouyang, L. Sheng, and W. Zhang (2018) Optical flow guided feature: a fast and robust motion representation for video action recognition. In CVPR.
  • [26] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2014) C3D: generic features for video analysis. In BMVC.
  • [27] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3D convolutional networks. In ICCV.
  • [28] D. Tran, H. Wang, L. Torresani, and M. Feiszli (2019) Video classification with channel-separated convolutional networks. In ICCV.
  • [29] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In CVPR.
  • [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS.
  • [31] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese (2018) Generalizing to unseen domains via adversarial data augmentation. In NeurIPS.
  • [32] L. Wang, Y. Xiong, et al. (2016) Temporal segment networks: towards good practices for deep action recognition. In ECCV.
  • [33] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR.
  • [34] Y. Wang, L. Jiang, M. Yang, L. Li, M. Long, and L. Fei-Fei (2019) Eidetic 3D LSTM: a model for video prediction and beyond. In ICLR.
  • [35] Y. Wang, M. Long, J. Wang, and P. S. Yu (2017) Spatiotemporal pyramid network for video action recognition. In CVPR.
  • [36] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning for video understanding. In ECCV.
  • [37] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks? In NeurIPS.
  • [38] L. Zhang, G. Zhu, L. Mei, P. Shen, S. A. A. Shah, and M. Bennamoun (2018) Attention in convolutional LSTM for gesture recognition. In NeurIPS.
  • [39] B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In ECCV.