With the advent of multimedia streaming (Luo et al., 2019; Chen et al., 2020; Wang et al., 2020, 2019) and gaming data, automatically recognizing and understanding human actions and events in videos has become increasingly important, especially for practical tasks such as video retrieval (Kwak et al., 2018), surveillance (Ouyang et al., 2018), and recommendation (Wei et al., 2019b, a). Over the past decades, great efforts have been made to boost recognition performance with deep learning for different purposes, including appearance and short-term motion learning (Simonyan and Zisserman, 2014; Tran et al., 2015), temporal structure modeling (Wang et al., 2016), and human skeleton and pose embedding (Rahmani and Bennamoun, 2017; Liu et al., 2017; Yan et al., 2019). While effective, deep learning enables machine recognition at the great cost of labeling large-scale data.
To relieve the burden of tedious and expensive labeling, one alternative is to transfer knowledge from the existing annotated training data (i.e. source domain) to the unlabeled or partially labeled test data (i.e. target domain). However, the source and target sets are commonly constructed under varying conditions such as illuminations, camera poses and backgrounds, leading to a huge domain shift. For instance, the Gameplay-Kinetics (Chen et al., 2019) dataset is built under the challenging “Synthetic-to-Real” protocol, where the training videos are synthesized by game engines and the test samples are collected from real scenes. In this case, the domain discrepancy between the source and target domains inevitably leads to a severe degradation of the model generalization performance.
To combat the above dilemmas, domain adaptation (DA) approaches have been investigated to mitigate the domain gap by aligning the distributions across the domains (Gretton et al., 2012; Yan et al., 2017; Baktashmotlagh et al., 2014) or learning domain-invariant representations (Ganin et al., 2016; Zhao et al., 2019). While the notion of domain adaptation has been widely exploited in the past, the resulting techniques are mostly designed to cope with still images rather than videos. These image-level DA methods can hardly achieve good performance on video recognition tasks, as they do not take into account the temporal dependency of the frames when minimizing the discrepancy between the domains.
Lately, video domain adaptation techniques (Jamal et al., 2018; Chen et al., 2019; Pan et al., 2019) have emerged to address the domain shift in videos using adversarial learning. By segmenting the source and target videos into sets of fixed-length action clips, DAAA (Jamal et al., 2018) directly matches the segment representations from different domains with the 3D-CNN (Tran et al., 2015) feature extractor. TA3N (Chen et al., 2019) weights the source and target segments with a proposed temporal attention mechanism, forcing the model to attend to the temporal features of low domain discrepancy. Different from the prior work that mainly concentrates on intra-domain interactions, TCoN (Pan et al., 2019) proposes a cross-domain co-attention module to measure the affinity of segment pairs from the source and target domains and further highlight the key segments shared by both domains.
Nevertheless, existing adversarial video domain adaptation methodologies are limited in three aspects. First, when data distributions embody complex structures like videos, there is no guarantee that the two distributions become sufficiently similar even when the discriminator is fully confused, as illustrated in Figure 1. Second, existing algorithms behave asymmetrically at the training and test stages. For instance, TCoN takes source and target pairs as input and calculates cross-domain attention scores at the training stage, but inference is performed based only on the target data at test time. This discrepancy unavoidably causes exposure bias and deteriorates the model performance. Third, utilizing a general domain classifier for adversarial learning is only able to match the marginal distributions (Gong et al., 2013), and does not align the class-conditional distributions (Long et al., 2018; Zhao et al., 2019). Video recognition models trained in this manner are thereby less likely to achieve class-wise alignment.
To address the above-mentioned issues, in this paper we take a more feasible strategy, i.e., constructing a domain-agnostic video classifier instead of pursuing domain-invariant feature learning. In the proposed Adversarial Bipartite Graph (ABG) framework, as illustrated in Figure 2, the video classifier is explicitly exposed to the mixed cross-domain representations, which preserve the temporal correlations across the domains modeled with the network topology of a bipartite graph. In particular, the source and target frames are sampled as heterogeneous vertexes of the bipartite graph, and the edges connecting the two types of nodes measure their similarity. Through message passing, each vertex aggregates the features of its heterogeneous neighbors, making those from similar source and target frames evenly mixed in the shared subspace. The proposed strategy performs symmetrically during the training and test phases, which successfully addresses the exposure bias issue.
Moreover, as the proposed framework is agnostic to the choice of frame aggregation, four different aggregation mechanisms are investigated, followed by a conditional adversarial module to preserve the class-specific consistency across the domains. The source labels and the target predictions are embedded as vectors which provide semantic cues for the domain classifier. To cope with large domain discrepancy, we additionally apply a video-level bipartite graph on top of the original model, called Hierarchical ABG (HABG). To verify the robustness of the proposed model, we further extend it to a semi-supervised domain adaptation setting (Semi-ABG) by adding partial edge supervision. Extensive experiments conducted on four benchmark datasets evidence the superiority of the proposed adversarial bipartite framework over the state-of-the-art approaches. Overall, our contributions can be briefly summarized as follows:
We introduce a new Adversarial Bipartite Graph (ABG) framework for unsupervised video domain adaptation, which focuses on recognizing domain-agnostic concepts rather than learning domain-invariant representations. It is further generalized to its hierarchical variant for challenging transfer tasks.
To address the exposure bias issue, the proposed model is trained and tested symmetrically.
The proposed ABG framework is seamlessly equipped with a conditional domain adversarial module which globally aligns the class-conditional distributions from different domains.
We have demonstrated the effectiveness of the proposed strategy through extensive experiments on four large-scale video domain adaptation datasets and released the source code for reference.
2. Related Work
2.1. Video Action Recognition
Activity recognition has been one of the core topics in computer vision, with a wide range of real-world applications including video surveillance (Ouyang et al., 2018), environment monitoring and video captioning (Krishna et al., 2017; Duan et al., 2018; Wang et al., 2018; Yang et al., 2017). A typical pipeline leverages a two-stream convolutional neural network to classify actions based on individual video frames or local motion vectors (Karpathy et al., 2014; Simonyan and Zisserman, 2014). To better capture the action dynamics and gesture changes, later work models the long-term temporal information with recurrent neural networks (Donahue et al., 2017), 3D convolutions (Tran et al., 2015), and multi-scale temporal relation networks (TRN) (Zhou et al., 2018). Another line of work augments the extracted RGB and optical flow features with multi-modal pose representations (Yan et al., 2019), complex object interactions (Ma et al., 2018), and 3D human skeletons (Rahmani and Bennamoun, 2017; Liu et al., 2017), which relieves the view dependency and the noise from different lighting conditions. However, all of the above-mentioned work requires expensive annotations and can barely generalize to unseen circumstances, which greatly hinders its feasibility in practice.
2.2. Domain Adaptation
Unsupervised Domain Adaptation (UDA) tackles this limitation by transferring knowledge from a labeled source domain to an unlabeled target domain. The discrepancy between the two domains is referred to as the domain shift (Luo et al., 2020b, a), which is addressed by minimizing a distribution distance such as Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) and its variants (Yan et al., 2017), and/or, more recently, by learning domain-invariant representations with adversarial learning (Ganin et al., 2016). Alternatively, an emerging line of work incorporates graph neural networks (GNNs) (Kipf and Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018) to bridge the domain gap at the manifold level, learning the intra-domain correlations in a transductive way. Very recently, GCAN (Ma et al., 2019) constructs a densely-connected instance graph for the source and target nodes, and assigns pseudo labels to the target samples for aligning class centroids from different domains. While effective, existing graph-based work fails to model the inter-domain interactions, which makes it far from optimal.
2.3. Video Domain Adaptation
Despite the fact that domain adaptation has made great progress in a broader set of image recognition tasks, it is barely investigated for transferring knowledge across videos. Early efforts (Tang et al., 2016; Xu et al., 2016) on video domain adaptation utilize shallow models that employ collective matrix factorization or PCA to learn a common latent semantic space for the source and target domains. Of late, the focus has shifted to deep models (Jamal et al., 2018; Chen et al., 2019; Pan et al., 2019). Jamal et al. projected the pre-extracted C3D (Tran et al., 2015) representations of the source and target videos to a Grassmann manifold, and performed domain adaptation with adaptive kernels and adversarial learning. To extend the idea of modeling actions on the latent subspace, an end-to-end Deep Adversarial Action Adaptation (DAAA) (Jamal et al., 2018) framework is derived to learn the source and target video clips in the same temporal order. Zhang et al. transferred from the trimmed video domain to the untrimmed video domain with Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) for action localization, yet without taking into consideration any frame-level feature alignment. Chen et al. proposed a temporal attentive adversarial adaptation network (TA3N), which leverages the entropy of the domain predictor to attend to the local temporal features of low domain discrepancy. Pan et al. (2019) designed a co-attention module to minimize the domain discrepancy, which concentrates on the key segments shared by both domains. Nevertheless, prior work is vulnerable and unreliable due to the overfitting and exposure bias issues. Instead, the proposed adversarial bipartite graph model is capable of learning domain-agnostic concepts and aligning class-conditional distributions both locally and globally.
In this section, we first formulate the task of unsupervised video domain adaptation, and then elaborate on the details of the derived Adversarial Bipartite Graph (ABG) framework and its hierarchical variant (HABG). To verify the robustness of the proposed model, it is further generalized to a semi-supervised setting (Semi-ABG) in Section 3.3.
3.1. Problem Formulation
Given a labeled source video collection $\mathcal{S}=\{(V_i^s, y_i^s)\}_{i=1}^{N_s}$ and an unlabeled target video set $\mathcal{T}=\{V_j^t\}_{j=1}^{N_t}$ containing $N_s$ and $N_t$ videos respectively, the aim is to design a transfer network for predicting the labels of the unlabeled target videos. The source and target domains are of different distributions, yet they share the same label space $\{1, \dots, K\}$, where $K$ is the number of classes. Each source video $V_i^s$ or target video $V_j^t$ consists of $T$ frames, i.e., $V_i^s = \{x_{i,1}^s, \dots, x_{i,T}^s\}$ and $V_j^t = \{x_{j,1}^t, \dots, x_{j,T}^t\}$, where $x_{\cdot,k} \in \mathbb{R}^{d}$ indicates the features of the $k$-th frame, and $d$ is the feature dimension of the vectors. For constructing each mini-batch, we forward $B_s$ source features and $B_t$ target features to update the proposed model.
3.2. Adversarial Bipartite Graph Learning (ABG)
To model the data affinity across the two domains, it is natural to formulate the problem with a bipartite graph, whose vertices can be divided into two disjoint and independent sets, with the edges connecting vertices from different sets. The general pipeline of ABG consists of (1) mixing the similar source and target features with the frame-level bipartite graph; (2) aggregating frame features into global video representations; (3) aligning class-conditional distributions with adversarial learning; and (4) classifying the obtained source and target representations. To enhance the model capacity for difficult transfer tasks, we design a hierarchical structure, HABG, which incorporates an additional video-level bipartite graph, as detailed later in this section.
Frame-level Bipartite Graph. Let the frame-level directed bipartite graph be $\mathcal{G} = (\{\mathcal{V}^s, \mathcal{V}^t\}, \mathcal{E})$, where the cross-domain edge feature map $E$ represents the node affinity between pairs of source and target frames. The source vertex set $\mathcal{V}^s$ and target vertex set $\mathcal{V}^t$, with $d_v$ the vertex feature dimension, are expected to dynamically aggregate information across the domains based on the learned edge features, thus closing the domain gap gradually. The propagation rules for the cross-domain edge update and node update are elaborated as follows.
Frame Edge Update. To calculate the similarity between the source and target frames, the normalized edge matrix $E$ is defined as
$$E = \mathrm{norm}\Big(\sigma\big(f_e\big(\lVert \hat{\mathcal{V}}^s - \hat{\mathcal{V}}^t \rVert ;\, \theta_e\big)\big)\Big),$$
with $\sigma(\cdot)$ the sigmoid function and $\lVert\cdot\rVert$ the norm. $\hat{\mathcal{V}}^s$ and $\hat{\mathcal{V}}^t$ are the augmented tensors of the source and target vertexes, with the dimensions being expanded by repeating. $f_e(\cdot;\theta_e)$ is the frame-level metric network parameterized by $\theta_e$, which computes the similarity scores between the source and target frames. To ease the impact of the number of cross-domain neighbors, row normalization and column normalization ($\mathrm{norm}(\cdot)$) are adopted on the edge feature map $E$.
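As a rough sketch (not the paper's exact implementation), the edge update can be written in a few lines of NumPy; a hypothetical linear scoring map `w` stands in for the convolutional metric network $f_e$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_edge_update(Vs, Vt, w, eps=1e-8):
    """Compute normalized cross-domain affinity maps of shape (Ns, Nt).

    Vs: (Ns, d) source frame features; Vt: (Nt, d) target frame features.
    w:  (d,) weights of a toy linear metric (the paper uses a small
        convolutional metric network; this is only an illustration).
    """
    # Augment by repeating so that every source/target pair is compared.
    diff = np.abs(Vs[:, None, :] - Vt[None, :, :])        # (Ns, Nt, d)
    scores = sigmoid(diff @ w)                            # (Ns, Nt), in (0, 1)
    # Row and column normalization ease the impact of the number of
    # cross-domain neighbours each vertex has.
    E_row = scores / (scores.sum(axis=1, keepdims=True) + eps)
    E_col = scores / (scores.sum(axis=0, keepdims=True) + eps)
    return E_row, E_col
```

The row-normalized map distributes target-to-source messages, and the column-normalized map the reverse direction.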
Frame Node Update. The generic rule to update the node features can be formulated as follows:
$$v_i \leftarrow f_v\Big(\big[\, v_i \,\|\, \textstyle\sum_{j} E_{ij}\, v_j \,\big];\ \theta_v\Big),$$
where $[\cdot\,\|\,\cdot]$ is the concatenation operation, and $f_v(\cdot;\theta_v)$ is a node update network shared by the source and target nodes. The node embedding is initialized with the extracted representation from the backbone embedding model, i.e., $\mathcal{V}^{s,(0)} = f_{emb}(X^s)$ and $\mathcal{V}^{t,(0)} = f_{emb}(X^t)$.
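A minimal sketch of one message-passing step, again with a toy linear map `Wv` standing in for the node update network:

```python
import numpy as np

def frame_node_update(Vs, Vt, E_row, E_col, Wv):
    """One message-passing step: each vertex aggregates its heterogeneous
    (cross-domain) neighbours and is re-embedded by a toy update network.

    Vs: (Ns, d), Vt: (Nt, d) vertex features;
    E_row, E_col: (Ns, Nt) normalized edge maps;
    Wv: (2*d, d) weights standing in for the node update network f_v.
    """
    msg_to_s = E_row @ Vt            # (Ns, d): target -> source messages
    msg_to_t = E_col.T @ Vs          # (Nt, d): source -> target messages
    # Concatenate each node's own features with its aggregated neighbours,
    # then project back to the node feature dimension.
    Vs_new = np.concatenate([Vs, msg_to_s], axis=1) @ Wv
    Vt_new = np.concatenate([Vt, msg_to_t], axis=1) @ Wv
    return Vs_new, Vt_new
```

After a few such steps, features of similar source and target frames become evenly mixed in the shared subspace, which is the intended effect described above.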
Frame Aggregation. To group the sampled frames into a unified video representation and capture appearance and temporal dynamics, the frame aggregation is applied on the learned source and target node embeddings. As the proposed framework is agnostic to the choices of frame aggregation, we examine multiple aggregation functions, including a symmetric average pooling function which is invariant to the order of frames, two memory based modules to capture the temporal information among frames, and a temporal relation network to explore the multi-scale temporal dynamics.
Mean Average Pooling. By viewing the video as a collection of key frames, the video representation can be obtained by averaging the frame features temporally. Hence, each source video representation $r_i^s$ and target video representation $r_j^t$ are computed as
$$r_i^s = \frac{1}{T}\sum_{k=1}^{T} v_{i,k}^s, \qquad r_j^t = \frac{1}{T}\sum_{k=1}^{T} v_{j,k}^t.$$
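This aggregator reduces to a single temporal mean; a sketch over a batch of videos:

```python
import numpy as np

def avg_pool_videos(frames):
    """Temporal average pooling.

    frames: (B, T, d) node embeddings of B videos with T frames each.
    Returns (B, d) order-invariant video representations.
    """
    return frames.mean(axis=1)
```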
Memory Based Aggregators. Considering the temporal characteristics of human actions and events, two memory based aggregators, i.e., LSTM and GRU, are tested to construct the $i$-th source representation $r_i^s$ and the $j$-th target representation $r_j^t$ as
$$r_i^s = h_{i,T}^s, \qquad r_j^t = h_{j,T}^t,$$
with $h$ the output hidden states and $T$ the last step.
Temporal Relation Network (TRN). Inspired by (Zhou et al., 2018; Chen et al., 2019), we further build upon the fine-grained relationships among multi-scale video segments. In particular, the temporal relation network (TRN) (Zhou et al., 2018) is able to preserve the short-term (e.g., 2-frame) and long-term (e.g., 5-frame) action dynamics, which potentially expands the temporal information that the learned video features can convey. The multi-scale temporal relations for the source data and the target data are defined as the composite functions
$$MT_N(V) = \sum_{k=2}^{N} T_k(V), \qquad T_k(V) = h_\phi^{(k)}\Big(\sum_{i_1 < \dots < i_k} g_\theta^{(k)}\big(v_{i_1}, \dots, v_{i_k}\big)\Big),$$
where $T_2(\cdot)$ indicates the 2-frame local relation function. Note that the multi-scale function is the sum of the local relation scores from the 2-frame relation up to the $N$-frame relation. The functions $g_\theta^{(k)}$ and $h_\phi^{(k)}$ are fully connected layers, fusing the features of temporally ordered frames.
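Under the assumption that the $g$/$h$ networks are simple fully connected layers (here plain linear maps with ReLU), the multi-scale relation can be sketched as:

```python
import numpy as np
from itertools import combinations

def relu(x):
    return np.maximum(x, 0.0)

def trn_multiscale(frames, g_weights, h_weights, max_scale=3):
    """Sum of k-frame relation scores for k = 2 .. max_scale.

    frames: (T, d) frame embeddings of one video;
    g_weights[k]: (k*d, d) toy weights fusing an ordered k-frame tuple;
    h_weights: (d, d) toy weights mapping the pooled relation onward.
    (Linear layers stand in for the fully connected g/h networks.)
    """
    T, d = frames.shape
    out = np.zeros(d)
    for k in range(2, max_scale + 1):
        rel = np.zeros(d)
        # Relations over temporally ordered k-frame tuples.
        for idx in combinations(range(T), k):
            tup = np.concatenate([frames[i] for i in idx])  # (k*d,)
            rel += relu(tup @ g_weights[k])
        out += relu(rel @ h_weights)
    return out
```

The combinatorial sum stays cheap because only $T = 5$ frames are sampled per video (see the implementation details).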
Video-level Bipartite Graph. For difficult transfer tasks, we additionally apply a video-level bipartite graph on top of the frame aggregation network, fusing the source and target data hierarchically. It allows video features to be grouped into tighter clusters, which improves classification performance. Similarly, we construct the video-level directed bipartite graph $\mathcal{G}'$, where the edge feature map $E'$ indicates the node affinity among the source and target videos. The source node set and target node set, with $d_{v'}$ the feature dimension of video nodes, are learned through message passing as defined below.
Video Edge Update. To calculate the similarity between the source and target videos, the normalized edge matrix $E'$ is defined as
$$E' = \mathrm{norm}\Big(\sigma\big(f_{e'}\big(\lVert \hat{\mathcal{R}}^s - \hat{\mathcal{R}}^t \rVert ;\, \theta_{e'}\big)\big)\Big),$$
with $\sigma(\cdot)$ the sigmoid function and $\lVert\cdot\rVert$ the norm. $\hat{\mathcal{R}}^s$ and $\hat{\mathcal{R}}^t$ are the augmented tensors of the source and target video vertices, with the dimensions being expanded by repeating. $f_{e'}(\cdot;\theta_{e'})$ is the video-level metric network parameterized by $\theta_{e'}$, computing the correlations among the source-target video pairs.
Video Node Update. The generic rule to update the node features can be formulated as follows:
$$r_i \leftarrow f_{v'}\Big(\big[\, r_i \,\|\, \textstyle\sum_{j} E'_{ij}\, r_j \,\big];\ \theta_{v'}\Big),$$
where $[\cdot\,\|\,\cdot]$ is the concatenation operation and $f_{v'}(\cdot;\theta_{v'})$ is a node update network shared by the source and target nodes. The node embeddings are initialized with the aggregated video features, i.e., the outputs $r_i^s$ and $r_j^t$ of the frame aggregation module.
Video Classification. To predict the labels of the source and target samples, we construct a video classifier based on the aggregated video features for the ABG structure and on the video vertex features for the HABG, respectively. Since the source data is labeled, the classifier is trained to minimize the negative log-likelihood loss for each mini-batch:
$$\mathcal{L}_{cls}^{s} = -\frac{1}{B_s}\sum_{i=1}^{B_s}\log p\big(y_i^s \mid r_i^s\big).$$
Instead of a supervised loss, for the unlabeled target data, a soft entropy based loss is adopted to alleviate the uncertainty of the predictions:
$$\mathcal{L}_{ent}^{t} = -\frac{1}{B_t}\sum_{j=1}^{B_t}\sum_{k=1}^{K} p_k\big(r_j^t\big)\,\log p_k\big(r_j^t\big).$$
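The two classification terms are the standard negative log-likelihood and prediction-entropy losses; a NumPy illustration (not the authors' code):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def source_nll_loss(logits, labels):
    """Negative log-likelihood on labeled source videos."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def target_entropy_loss(logits):
    """Soft entropy on unlabeled target predictions; minimizing it
    pushes the model towards confident (low-uncertainty) predictions."""
    p = softmax(logits)
    return -np.mean(np.sum(p * np.log(p + 1e-12), axis=1))
```

Uniform target predictions yield the maximal entropy $\log K$, so this loss is largest exactly when the classifier is most uncertain.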
Conditional Adversarial Learning. Besides leveraging bipartite graph neural networks to fuse the source and target neighbors, a conditional adversarial module is applied to align the class-conditional distributions. To achieve this, the module is composed of a label embedding function $e(\cdot)$ and a domain classifier $D$. The label embedding function projects the $i$-th source video label $y_i^s$ and the $j$-th target predictions $\hat{y}_j^t$ into the latent vectors $e(y_i^s)$ and $e(\hat{y}_j^t)$, providing domain-invariant semantic cues for the domain classifier. The domain classifier is then conditioned on the classes to which the samples may belong, and trained to discriminate between features coming from the source and target data. The bipartite graphs are viewed as the feature generator that fools the discriminator. The adversarial objective for the conditional adversarial module is formulated as:
$$\mathcal{L}_{adv} = -\frac{1}{B_s}\sum_{i=1}^{B_s}\log D\big(r_i^s, e(y_i^s)\big) - \frac{1}{B_t}\sum_{j=1}^{B_t}\log\Big(1 - D\big(r_j^t, e(\hat{y}_j^t)\big)\Big).$$
Consequently, the learned features will be more discriminative and aligned when the two-player mini-max game reaches an equilibrium.
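A CDAN-style sketch of the conditional domain loss; the paper's exact conditioning mechanism is abstracted here, and a hypothetical linear map `Wd` over the feature/prediction outer product plays the discriminator:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conditional_domain_loss(feat_s, prob_s, feat_t, prob_t, Wd):
    """Binary domain loss for a discriminator conditioned on class
    information via the feature/prediction outer product.

    feat_*: (N, d) video features; prob_*: (N, K) class probabilities
    (one-hot source labels, soft target predictions);
    Wd: (d*K,) toy discriminator weights. Source is domain 1, target 0.
    """
    def conditioned(feat, prob):
        # Outer product (N, d, K) flattened to (N, d*K): the feature is
        # modulated by the class probabilities before discrimination.
        return (feat[:, :, None] * prob[:, None, :]).reshape(len(feat), -1)
    d_s = sigmoid(conditioned(feat_s, prob_s) @ Wd)   # want -> 1
    d_t = sigmoid(conditioned(feat_t, prob_t) @ Wd)   # want -> 0
    return -np.mean(np.log(d_s + 1e-12)) - np.mean(np.log(1 - d_t + 1e-12))
```

In adversarial training, the discriminator minimizes this quantity while the graph modules maximize it, typically via a gradient reversal layer.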
3.3. Semi-supervised ABG and HABG
To verify the robustness of the proposed ABG and HABG structures, we further extend them to a semi-supervised setting. In this circumstance, part of the target labels in a mini-batch are available for training. Here, we denote $\mathcal{L}$ and $\mathcal{U}$ as the indices of the labeled and unlabeled target data, respectively. To take full advantage of the partial target supervision, the classification objective functions are modified accordingly:
$$\mathcal{L}_{cls}^{t} = -\frac{1}{|\mathcal{L}|}\sum_{j\in\mathcal{L}}\log p\big(y_j^t \mid r_j^t\big),$$
with the entropy loss now applied only to the unlabeled indices $\mathcal{U}$. Moreover, the edge maps learned from either the frame-level or video-level bipartite graphs can be partially supervised. The newly added edge supervision is a binary cross-entropy loss, which can be formulated as
$$\mathcal{L}_{edge} = -\sum_{i,j}\Big[\delta(y_i, y_j)\log E_{ij} + \big(1-\delta(y_i, y_j)\big)\log\big(1 - E_{ij}\big)\Big],$$
where $\delta(y_i, y_j)$ is the Kronecker delta function that is equal to one when $y_i = y_j$, and zero otherwise. $E_{ij}$ indicates the element from the $i$-th row and $j$-th column of the frame-level edge map, and $E'_{ij}$ represents the element from the $i$-th row and $j$-th column of the video-level edge map.
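The edge supervision term is a plain binary cross entropy against Kronecker-delta targets built from the label pairs; a sketch:

```python
import numpy as np

def edge_supervision_loss(E, ys, yt, eps=1e-12):
    """Binary cross entropy on edges whose endpoints are both labeled.

    E:  (Ns, Nt) learned edge map with entries in (0, 1);
    ys: (Ns,) source labels; yt: (Nt,) labels of the labeled targets.
    The target delta[i, j] is 1 iff ys[i] == yt[j] (Kronecker delta).
    """
    delta = (ys[:, None] == yt[None, :]).astype(float)
    return -np.mean(delta * np.log(E + eps)
                    + (1 - delta) * np.log(1 - E + eps))
```

Edges between same-class pairs are thereby pushed towards 1 and cross-class edges towards 0, directly supervising the graph structure.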
Our ultimate goal is to learn the optimal parameters for the proposed model:
$$\min_{\Theta}\ \mathcal{L}_{cls}^{s} + \mathcal{L}_{cls}^{t} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{ent}\,\mathcal{L}_{ent}^{t} + \lambda_{edge}\,\mathcal{L}_{edge},$$
with $\Theta$ denoting the learnable parameters, including those of the frame aggregation module, and $\lambda_{adv}$, $\lambda_{ent}$, and $\lambda_{edge}$ being the loss coefficients, respectively. Notably, for the UDA setting, no target labels are available, so the target classification loss $\mathcal{L}_{cls}^{t}$ and the edge supervision loss $\mathcal{L}_{edge}$ vanish. The overall algorithm is provided in the supplementary material.
| | UCF-HMDB (small) | UCF-HMDB (full) | UCF-Olympic | Kinetics-Gameplay |
|---|---|---|---|---|
| Video Length | 21 Seconds | 33 Seconds | 39 Seconds | 10 Seconds |
| Training Videos | UCF: 482 / HMDB: 350 | UCF: 1,438 / HMDB: 840 | UCF: 601 / Olympic: 250 | Kinetics: 43,378 / Gameplay: 2,625 |
| Validation Videos | UCF: 189 / HMDB: 571 | UCF: 360 / HMDB: 350 | UCF: 240 / Olympic: 54 | Kinetics: 3,246 / Gameplay: 749 |
We compare and contrast our proposed approach with the existing domain adaptation approaches on four benchmark datasets, i.e., UCF-HMDB (small), UCF-HMDB (full), UCF-Olympic and Kinetics-Gameplay. For a fair comparison, we follow the dataset partition and feature extraction strategies from (Chen et al., 2019), which utilize the ResNet-101 model pre-trained on ImageNet as the frame-level feature extractor. The statistics of the four datasets are summarized in Table 1. The UCF-HMDB (small) and UCF-HMDB (full) are the overlapped subsets of two large-scale action recognition datasets, i.e., UCF101 (Soomro et al., 2012) and HMDB51 (Kuehne et al., 2011), covering 5 and 12 highly relevant categories respectively. The UCF-Olympic selects the 6 classes shared by UCF101 and the Olympic Sports Dataset (Niebles et al., 2010), including Basketball, Clean and Jerk, Diving, Pole Vault, Tennis and Discus Throw. The Kinetics-Gameplay is the most challenging cross-domain dataset, with a large domain gap between synthetic videos and real-world videos. The dataset is built by selecting the 30 categories shared between Gameplay (Chen et al., 2019) and one of the largest public video datasets, Kinetics-600 (Kay et al., 2017). Each category may also correspond to multiple categories in both datasets, which poses the additional challenge of class imbalance.
We compare our approach with several state-of-the-art video domain adaptation methods, image domain adaptation approaches, single-domain action recognition models, and a basic ResNet-101 classification model pre-trained on the ImageNet dataset. The single-domain action recognition models include 3D ConvNets (C3D) (Tran et al., 2015) and Temporal Segment Networks (TSN) (Wang et al., 2016), which are pre-trained on the source domain and tested on the target domain. Four classical image-level domain adaptation methods, i.e., Domain-Adversarial Neural Network (DANN) (Ganin et al., 2016), Joint Adaptation Network (JAN) (Long et al., 2017), Adaptive Batch Normalization (AdaBN) (Li et al., 2018) and Maximum Classifier Discrepancy (MCD) (Saito et al., 2018), are adjusted to align the distributions of video features with the frame aggregation module. As for non-deep video domain adaptation, we compare the proposed HABG method with the Many-to-One Encoder (Xu et al., 2016) and two variants of Action Modeling on Latent Subspace (AMLS) (Jamal et al., 2018), i.e., Subspace Alignment (AMLS-SA) and Geodesic Flow Kernel (AMLS-GFK). For deep video domain adaptation methods, we adopt Deep Adversarial Action Adaptation (DAAA) (Jamal et al., 2018), the Temporal Adversarial Adaptation Network (TA2N) (Chen et al., 2019), the Temporal Attentive Adversarial Adaptation Network (TA3N) (Chen et al., 2019) and the Temporal Co-attention Network (TCoN) (Pan et al., 2019) for comparison.
4.3. Implementation Details
Our source code is based on PyTorch (Paszke et al., 2019) and is available in a GitHub repository (https://github.com/Luoyadan/MM2020_ABG) for reference. All experiments are conducted on two servers with two GeForce RTX 2080 Ti GPUs.
4.3.1. Video Pre-processing
Following the standard protocol used in (Chen et al., 2019), we sample a fixed number of frames with equal spacing from each video for training, and encode each frame into a 2048-D vector, i.e., $x \in \mathbb{R}^{2048}$, with a ResNet-101 (He et al., 2016) pre-trained on ImageNet. For a fair comparison, we set the number of sampled frames $T$ to 5 in our experiments.
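Equal-spacing sampling can be sketched as follows (the exact index rounding used by the authors may differ):

```python
import numpy as np

def sample_frames(num_frames, k=5):
    """Indices of k frames sampled with (approximately) equal spacing
    over a video of num_frames frames, always covering first and last."""
    return np.linspace(0, num_frames - 1, num=k).astype(int)
```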
4.3.2. Module Architecture.
The edge update networks $f_e$ and $f_{e'}$ project the frame-level and video-level affinity maps to 1-dim scores, and are composed of two convolutional layers, batch normalization, LeakyReLU and edge dropout. The node update networks $f_v$ and $f_{v'}$ map the concatenation of the node features and their neighbors' features back to the frame-node and video-node feature dimensions, respectively. Both $f_v$ and $f_{v'}$ include two convolutional layers, batch normalization, LeakyReLU and node dropout.
4.3.3. Parameter Settings.
The hidden size, the feature dimension of the frame nodes, and the feature dimension of the video nodes are all fixed to 512. The total number of training epochs is 60 for the Kinetics-Gameplay dataset, and 30 for the rest of the datasets. The batch sizes $B_s$ and $B_t$ for the source data and target data are set to 128. Stochastic gradient descent (SGD) is used as the optimizer with a momentum of 0.9 and weight decay. The learning rate is initialized and then decayed as the number of epochs increases, following the rule used in (Ganin et al., 2016; Chen et al., 2019). The two additional loss coefficients are empirically fixed at 0.1 and 1 for the semi-supervised experiments. The dropout rate is set to 0.2.
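The decay rule from Ganin et al. (2016) has the form $\mu_p = \mu_0 / (1 + \alpha p)^{\beta}$ with $p \in [0, 1]$ the training progress; a sketch ($\alpha = 10$, $\beta = 0.75$ are the values reported in that paper, and may differ from the settings used here):

```python
def dann_lr_schedule(lr0, p, alpha=10.0, beta=0.75):
    """Annealed learning rate: lr0 / (1 + alpha * p) ** beta,
    where p in [0, 1] is the fraction of training completed."""
    return lr0 / (1.0 + alpha * p) ** beta
```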
4.4. Comparisons with State-of-The-Art
Under the unsupervised domain adaptation protocol, we compare the proposed ABG method with multiple baseline approaches on the UCF-HMDB (small), UCF-Olympic and UCF-HMDB (full) datasets. The comparison results achieved with different backbone networks on the relatively small datasets are reported in Table 2, while Table 3 presents the results on the full UCF-HMDB dataset using various frame aggregation strategies. It is observed that the proposed ABG framework is superior to all the compared image- and video-level domain adaptation methods in most cases, achieving a particularly significant performance boost on the large-scale testbed. Notably, Source Only indicates the backbone model pre-trained on the source domain and tested on the target domain, while Target Only denotes the backbone model trained and tested on the target domain. Table 2 demonstrates that the deep video DA methods (lines 6-10) generally outperform the non-deep video DA approaches (lines 3-5) and the classification models without DA (lines 1-2); among the deep methods, TCoN and TA3N achieve clear gains over DAAA on the UCF-Olympic dataset. With the same backbone as TA3N, our ABG model performs comparably without relying on frame attention or complex frame aggregation strategies. This phenomenon is also observed on the large UCF-HMDB dataset, on which the proposed ABG with average pooling boosts the performance over the state-of-the-art TA3N on both the UCF→HMDB and HMDB→UCF tasks. As shown in Table 3, average pooling (AvgPool) and LSTM suit the proposed model better than the other frame aggregation functions. We infer the reason is that the frame bipartite graph has already fused similar frames regardless of their order, which weakens the power of the multi-scale TRN aggregation. Notably, it is observed that AdaBN surpasses most of the image domain adaptation methods. As it separates the batch normalization layers for the source and target data, AdaBN minimizes the risk of overfitting to the source domain, which strongly supports our statement in Section 1. To further investigate the detailed performance of the proposed ABG with respect to specific classes, four confusion matrices are provided in Figure 3.
| Method | Backbone | U→H | H→U | U→O | O→U |
|---|---|---|---|---|---|
| Many-to-One (Xu et al., 2016) | Action Bank | 82.00 | 82.00 | 87.00 | 75.00 |
| AMLS-SA (Jamal et al., 2018) | C3D | 90.25 | 94.40 | 83.92 | 86.07 |
| AMLS-GFK (Jamal et al., 2018) | C3D | 89.53 | 95.36 | 84.65 | 86.44 |
| DAAA (Jamal et al., 2018) | TSN | - | 88.36 | 88.37 | 86.25 |
| DAAA (Jamal et al., 2018) | C3D | - | - | 91.60 | 89.96 |
| TCoN (Pan et al., 2019) | TSN | - | 93.01 | 93.91 | 91.65 |
| TA3N (Chen et al., 2019) | ResNet-101 | 99.33 | 99.47 | 98.15 | 92.92 |
| Method | U→H AvgPool | U→H LSTM | U→H GRU | U→H TRN | H→U AvgPool | H→U LSTM | H→U GRU | H→U TRN |
|---|---|---|---|---|---|---|---|---|
| DANN (Ganin et al., 2016) | 71.11 | 70.00 | 70.83 | 75.28 | 75.13 | 75.83 | 75.13 | 76.36 |
| JAN (Long et al., 2017) | 71.39 | 70.56 | 72.50 | 74.72 | 77.58 | 77.58 | 77.75 | 79.36 |
| AdaBN (Li et al., 2018) | 75.56 | 74.17 | 74.72 | 72.22 | 76.36 | 77.41 | 74.96 | 77.41 |
| MCD (Saito et al., 2018) | 71.67 | 70.00 | 74.44 | 73.89 | 76.18 | 68.30 | 78.81 | 79.34 |
| TA2N (Chen et al., 2019) | 71.11 | 70.00 | 70.83 | 77.22 | 76.36 | 70.75 | 76.89 | 80.56 |
| TA3N (Chen et al., 2019) | 71.94 | 70.00 | 69.72 | 78.33 | 76.36 | 70.75 | 77.23 | 81.79 |
Accuracy (%) with increasing seen ratios of target labels; the left four columns are Kinetics→Gameplay and the right four are Gameplay→Kinetics.

| Method | K→G | K→G | K→G | K→G | G→K | G→K | G→K | G→K |
|---|---|---|---|---|---|---|---|---|
| DANN (Ganin et al., 2016) | 49.67 | 56.74 | 60.48 | 65.02 | 33.24 | 40.05 | 40.44 | 43.38 |
| JAN (Long et al., 2017) | 47.40 | 53.94 | 60.35 | 61.28 | 32.69 | 22.95 | 41.31 | 32.75 |
| AdaBN (Li et al., 2018) | 55.54 | 59.95 | 64.62 | 67.56 | 44.82 | 47.67 | 47.73 | 48.00 |
| MCD (Saito et al., 2018) | 47.53 | 52.60 | 57.68 | 59.95 | 36.29 | 40.08 | 41.13 | 42.05 |
| TA2N (Chen et al., 2019) | 57.14 | 60.08 | 64.09 | 65.02 | 42.39 | 44.82 | 44.39 | 45.66 |
| TA3N (Chen et al., 2019) | 56.61 | 62.35 | 63.02 | 63.95 | 43.25 | 44.27 | 44.02 | 42.05 |
4.5. Semi-supervised Learning
To study the robustness of the proposed algorithm, we extend the unsupervised domain adaptation to a semi-supervised setting, where part of the target labels are available for training. Extensive experiments are conducted on the most challenging "Synthetic-to-Real" testbed, i.e., the Kinetics-Gameplay dataset, on which the results with varying seen ratios of target labels are reported in Table 4. Similarly, Source Only / Target Only represents the backbone model trained with source / target data only. All image-level domain adaptation methods (lines 2-5), the basic classification models (lines 1 and 10) and our models (lines 8-9) utilize AvgPool as the frame aggregation function; TA2N and TA3N use the TRN aggregator, since TRN is the major part of their work. It can be seen that the overall performance is lower on Gameplay→Kinetics compared to Kinetics→Gameplay, due to the insufficient samples in the source domain. The proposed Semi-ABG and its hierarchical variant Semi-HABG achieve higher recognition accuracy on both transfer tasks, since the integrated graphs help to propagate the label information and take full advantage of the supervision. The recognition accuracy of the proposed HABG on the two transfer tasks is improved by a large margin over the state-of-the-art TA3N with only a small ratio of target labels available.
4.6. Ablation Study
To investigate the validity of the derived modules and objective functions, we compare four variants of the ABG model on the full UCF-HMDB dataset. The comparison results are summarized in Table 5. Removing the bipartite graphs, the ABG w/o Graph variant suffers a dramatic drop compared with the full model. The ABG w/o $\mathcal{L}_{adv}$ is the variant without conditional adversarial learning, which noticeably decreases the recognition accuracy on average for the UCF→HMDB and HMDB→UCF tasks. The ABG w/o $\mathcal{L}_{ent}$ refers to the variant without the entropy loss for target data, which triggers a slight decrease in the model performance. HABG is the hierarchical variant of the plain ABG model, performing better on the challenging UCF→HMDB transfer task, which is consistent with the findings in the semi-supervised experiments. The ABG model is more versatile and suitable for small datasets and easier transfer tasks such as HMDB→UCF.
| ABG w/o Graph | 71.39 | 73.89 | 74.96 | 74.61 |
4.7. Parameter Sensitivity
To study the effect of the loss coefficients, we conduct experiments on the UCF-HMDB dataset with varying values of $\lambda_{adv}$ and $\lambda_{ent}$, which are utilized to reconcile the adversarial loss and the entropy loss, respectively. We compare the proposed ABG model and its hierarchical variant HABG, both integrated with the identical AvgPool frame aggregation function. As plotted in Figure 5, the average accuracies of both the ABG and HABG models become quite stable once the loss coefficients are sufficiently large, indicating that our framework is robust with respect to the loss coefficients.
To shed qualitative light on the proposed model with various aggregation functions, we conduct experiments on the HMDB→UCF task, and visualize the features with t-SNE in Figure 4. The features are extracted from the last layer of ABG and of the baseline models, including DANN, JAN, MCD and TA3N. Different colors indicate different classes; circles represent the source videos and triangles represent the target videos. It is clearly shown that the features from ABG form tighter clusters compared to the baselines.
In this work, we propose a bipartite graph learning framework for unsupervised and semi-supervised video domain adaptation. Different from existing approaches that learn domain-invariant features, we construct a domain-agnostic classifier by leveraging bipartite graphs to combine similar source and target features at both training and test time, which helps reduce the exposure bias. Experiments evidence the effectiveness of the proposed approach over state-of-the-art methods, improving their performance by up to 39.6% in the semi-supervised setting.
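The core cross-domain aggregation idea can be sketched in a few lines: each target feature is re-expressed as a softmax-weighted mixture of its most similar source features through a soft bipartite adjacency. This is a minimal illustration under our own simplifications (dot-product affinities with a temperature `tau`), not the full ABG model.

```python
import numpy as np

def bipartite_aggregate(src, tgt, tau=1.0):
    # src: (Ns, d) source features; tgt: (Nt, d) target features.
    sim = tgt @ src.T / tau                            # (Nt, Ns) cross-domain affinities
    w = np.exp(sim - sim.max(axis=1, keepdims=True))   # numerically stable softmax
    w = w / w.sum(axis=1, keepdims=True)               # normalize over source nodes
    return w @ src                                     # source-grounded target features
```

Because the aggregated target features live in the span of source features, a classifier trained on the source side can be applied to them directly, which is the intuition behind the domain-agnostic classifier above.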
Acknowledgements. This work was partially supported by ARC DP 190102353.
- Domain adaptation on the statistical manifold. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2481–2488.
- Temporal attentive alignment for large-scale video domain adaptation. CoRR abs/1907.12743.
- CANZSL: cycle-consistent adversarial networks for zero-shot learning from natural language. In Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 863–872.
- Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 677–691.
- Weakly supervised dense event captioning in videos. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 3063–3073.
- Domain-adversarial training of neural networks. Journal of Machine Learning Research 17, pp. 59:1–59:35.
- Connecting the dots with landmarks: discriminatively learning domain-invariant features for unsupervised domain adaptation. In Proc. Int. Conference on Machine Learning (ICML), pp. 222–230.
- A kernel two-sample test. Journal of Machine Learning Research 13, pp. 723–773.
- Inductive representation learning on large graphs. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 1024–1034.
- Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- Deep domain adaptation in action space. In Proc. British Machine Vision Conference (BMVC), pp. 264.
- Large-scale video classification with convolutional neural networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1725–1732.
- The kinetics human action video dataset. CoRR abs/1705.06950.
- Semi-supervised classification with graph convolutional networks. In Proc. Int. Conference on Learning Representations (ICLR).
- Dense-captioning events in videos. In Proc. Int. Conference on Computer Vision (ICCV), pp. 706–715.
- HMDB: a large video database for human motion recognition. In Proc. Int. Conference on Computer Vision (ICCV), pp. 2556–2563.
- Interactive story maker: tagged video retrieval system for video re-creation service. In Proc. ACM International Conference on Multimedia (MM), pp. 1270–1271.
- Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80, pp. 109–117.
- Global context-aware attention LSTM networks for 3D action recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3671–3680.
- Conditional adversarial domain adaptation. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 1647–1657.
- Deep transfer learning with joint adaptation networks. In Proc. Int. Conference on Machine Learning (ICML), pp. 2208–2217.
- Learning from the past: continual meta-learning with Bayesian graph neural networks. In Proc. AAAI Conference on Artificial Intelligence, pp. 5021–5028.
- Curiosity-driven reinforcement learning for diverse visual paragraph generation. In Proc. ACM International Conference on Multimedia (MM), pp. 2341–2350.
- Progressive graph learning for open-set domain adaptation. In Proc. Int. Conference on Machine Learning (ICML).
- Attend and interact: higher-order object interactions for video understanding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6790–6800.
- GCAN: graph convolutional adversarial network for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8266–8276.
- Modeling temporal structure of decomposable motion segments for activity classification. In Proc. European Conference on Computer Vision (ECCV), pp. 392–405.
- Video-based person re-identification via self-paced learning and deep reinforcement learning framework. In Proc. ACM International Conference on Multimedia (MM), pp. 1562–1570.
- Adversarial cross-domain action recognition with co-attention. CoRR abs/1912.10405.
- PyTorch: an imperative style, high-performance deep learning library. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 8024–8035.
- Learning action recognition model from depth and skeleton videos. In Proc. Int. Conference on Computer Vision (ICCV), pp. 5833–5842.
- Maximum classifier discrepancy for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3723–3732.
- Two-stream convolutional networks for action recognition in videos. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 568–576.
- UCF101: a dataset of 101 human actions classes from videos in the wild. CoRR abs/1212.0402.
- Cross-domain action recognition via collective matrix factorization with graph Laplacian regularization. Image and Vision Computing 55, pp. 119–126.
- Learning spatiotemporal features with 3D convolutional networks. In Proc. Int. Conference on Computer Vision (ICCV), pp. 4489–4497.
- Graph attention networks. In Proc. Int. Conference on Learning Representations (ICLR).
- Hierarchical memory modelling for video captioning. In Proc. ACM International Conference on Multimedia (MM), pp. 63–71.
- Temporal segment networks: towards good practices for deep action recognition. In Proc. European Conference on Computer Vision (ECCV), pp. 20–36.
- Deep collaborative discrete hashing with semantic-invariant structure. In Proc. International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 905–908.
- Human consensus-oriented image captioning. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), pp. 659–665.
- Personalized hashtag recommendation for micro-videos. In Proc. ACM International Conference on Multimedia (MM), pp. 1446–1454.
- MMGCN: multi-modal graph convolution network for personalized recommendation of micro-video. In Proc. ACM International Conference on Multimedia (MM), pp. 1437–1445.
- Dual many-to-one-encoder-based transfer learning for cross-dataset human action recognition. Image and Vision Computing 55, pp. 127–137.
- PA3D: pose-action 3D machine for video recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7922–7931.
- Mind the class weight bias: weighted maximum mean discrepancy for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 945–954.
- Catching the temporal regions-of-interest for video captioning. In Proc. ACM International Conference on Multimedia (MM), pp. 146–153.
- On learning invariant representations for domain adaptation. In Proc. Int. Conference on Machine Learning (ICML), pp. 7523–7532.
- Temporal relational reasoning in videos. In Proc. European Conference on Computer Vision (ECCV), pp. 831–846.