Conditional Extreme Value Theory for Open Set Video Domain Adaptation

09/01/2021 ∙ by Zhuoxiao Chen, et al. ∙ The University of Queensland

With the advent of media streaming, video action recognition has become progressively important for various applications, yet at the high expense of requiring large-scale data labelling. To overcome the problem of expensive data labelling, domain adaptation techniques have been proposed that transfer knowledge from fully labelled data (i.e., the source domain) to unlabelled data (i.e., the target domain). The majority of video domain adaptation algorithms are proposed for closed-set scenarios in which all the classes are shared between the domains. In this work, we propose an open-set video domain adaptation approach to mitigate the domain discrepancy between the source and target data, allowing the target data to contain additional classes that do not belong to the source domain. Different from previous works, which only focus on improving accuracy for shared classes, we aim to jointly enhance the alignment of shared classes and the recognition of unknown samples. Towards this goal, class-conditional extreme value theory is applied to enhance unknown recognition. Specifically, the entropy values of target samples are modelled with generalised extreme value distributions, which allows separating unknown samples lying in the tail of the distribution. To alleviate the negative transfer issue, weights computed from the distance between a sample's entropy and the threshold are leveraged in adversarial learning, in the sense that confident source and target samples are aligned, while unconfident samples are pushed away. The proposed method has been thoroughly evaluated on both small-scale and large-scale cross-domain video datasets and achieves state-of-the-art performance.


1. Introduction

With the emergence of copious streaming media data, dynamically recognising and comprehending human actions and events in online videos have become progressively essential, particularly for tasks like video content recommendation (Wei et al., 2019a, b), surveillance (Ouyang et al., 2018), and video retrieval (Kwak et al., 2018). Although supervised learning techniques (Simonyan and Zisserman, 2014; Tran et al., 2015; Wang et al., 2016; Liu et al., 2017; Rahmani and Bennamoun, 2017; Yan et al., 2019) are beneficial for the tasks above, they come at the high cost of labelling massive amounts of training data. An economical solution is to use a learner trained on existing labelled datasets to directly infer the labels of a target dataset, yet there is often a domain shift between the two datasets. Caused by varying lighting conditions, camera angles, and backgrounds, the domain shift degrades the performance of the learner. For example, synthetic video clips cropped from action-adventure games can be plentifully labelled and exploited, but inevitably exhibit a huge domain shift from real-world videos such as action movie clips or sports recordings. To address the issue of domain shift, unsupervised domain adaptation (UDA) techniques are introduced to align the distributions of existing labelled data (the source domain) and unlabelled data (the target domain). To this end, existing UDA approaches either minimise the distribution distance across the domains (Gretton et al., 2012; Yan et al., 2017; Baktashmotlagh et al., 2013; Erfani et al., 2016; Baktashmotlagh et al., 2017) or learn domain-invariant representations (Chen et al., 2019; Rahman et al., 2020a; Wang et al., 2021b).

In the same vein, video-based UDA methods aim to align the features at different levels such as frame, video, or temporal relation (Chen et al., 2019; Jamal et al., 2018; Luo et al., 2020a). However, existing video-based UDA methodologies fail to address the open-set scenario in which target samples come from unknown classes not seen during training, which can cause negative transfer across domains. Thus, the ability to recognise the unknown classes and reject them from the domain alignment pipeline is essential to the open-set unsupervised domain adaptation (OUDA) task. Moreover, existing OUDA frameworks are mainly designed for and evaluated on still-image recognition datasets, and are not effective enough at identifying unknown samples when applied to video recognition benchmark datasets (Zhao et al., 2019; Luo et al., 2020b; Baktashmotlagh et al., 2019; Busto and Gall, 2017; Saito et al., 2018b; Fang et al., 2019, 2021).

To overcome the above-mentioned limitations, we propose to intensify unknown recognition in open-set video domain adaptation. Our proposed framework consists of three modules. The first module is the Class-conditional Extreme Value Theory (CEVT) module, which fits the entropy values of target samples to a set of generalised extreme value (GEV) distributions, so that unknown samples can be efficiently identified as they lie in the tail of the distribution (Geng et al., 2018; Kotz and Nadarajah, 2000; Zhang and Patel, 2017; Oza and Patel, 2019). Samples are fitted into multiple class-conditional GEVs depending on the model's confidence in predicting those samples. For example, videos predicted as "pull-up" and "golf" are fitted into different GEVs. Then, we adaptively set a threshold for each GEV to split known and unknown samples. These fitted class-conditional GEVs with their thresholds are employed in the other two modules. The second module is the class-conditional weighted domain adversarial learning pipeline, which aligns the distributions of the shared classes and separates the unknown classes. The weight of each sample is calculated from the distance between its entropy value and the threshold, which reflects the likelihood of belonging to a shared class or the unknown class. At the inference stage, the third module performs open-set recognition by classifying samples with higher entropy than the threshold as the unknown class. This module, in conjunction with the class-conditional GEVs, classifies hard classes more robustly than the typical approach of setting a single global threshold. For example, in most existing OUDA approaches, the classifier predicts difficult "push-up" samples with the highest probability as "push-up" and with a slightly lower probability as "pull-up". Consequently, the entropy values for those samples are high, resulting in all the "push-up" samples being rejected by the global threshold. In our framework, by contrast, we first fit all samples predicted as "push-up" to a GEV, and can then efficiently separate "push-up" samples from unknown samples. This framework is particularly effective for complex video datasets, where training and inference are more challenging due to the complex spatio-temporal composition of video features. In general, our contributions are summarised as follows:

  • We propose a new Class-conditional Extreme Value Theory (CEVT) based framework for unsupervised video open-set domain adaptation that concentrates on domain-invariant representation learning via weighted domain adversarial learning.

  • We investigate a new research direction of open-set video DA and introduce the CEVT model to solve the problem.

  • Our proposed framework based on class-conditional extreme value theory is effective on both open-set recognition and adversarial weight generation, and it accurately recognises the unknown samples.

  • We conducted extensive experiments to demonstrate the effectiveness of the proposed method on both small and large scale cross-domain video datasets and showed that the CEVT based framework achieves state-of-the-art performance. We released the source code of our proposed approach for reference: https://github.com/zhuoxiao-chen/CEVT.

Figure 1. The overall architecture of the proposed CEVT model. Different colours indicate different classes. The class-conditional EVT panel in the bottom right of the figure shows the probability density functions (PDFs) of the Generalised Extreme Value (GEV) distributions for three different entropy groups.

2. Related Work

2.1. Video Action Recognition

Video action recognition is becoming increasingly important in the field of computer vision, with many real-world applications such as video surveillance (Ouyang et al., 2018), video captioning, and environment monitoring (Duan et al., 2018; Krishna et al., 2017; Wang et al., 2018; Yang et al., 2017). To classify actions according to individual video frames or local motion vectors, a typical pipeline employs a two-stream convolutional neural network (Karpathy et al., 2014; Simonyan and Zisserman, 2014). Some works utilise attention (Long et al., 2018; Ma et al., 2018), 3D convolutions (Tran et al., 2015), recurrent neural networks (Donahue et al., 2017), and temporal relation modules (Zhou et al., 2018) to better extract long-term temporal features. Another branch of work, including 3D human skeleton recognition (Liu et al., 2017; Rahmani and Bennamoun, 2017), complex object interactions (Ma et al., 2018), and pose representations (Yan et al., 2019), supplements the extracted RGB and optical flow features to alleviate view dependency and the noise caused by varying lighting conditions. However, the above works necessitate costly annotations and can hardly be extended to unseen situations, which significantly impedes their practical feasibility.

2.2. Domain Adaptation

To overcome such limitations, Unsupervised Domain Adaptation (UDA) attempts to transfer knowledge from a labelled source domain to an unlabelled target domain. To tackle the domain shift, i.e., the discrepancy between the two domains, there are mainly two types of approaches. One is the discrepancy-based approach that aims to minimise the distribution distance between the two domains (Gretton et al., 2012; Yan et al., 2017; Baktashmotlagh et al., 2014, 2016; Rahman et al., 2019, 2020b). The other is the adversary-based approach that learns domain-invariant representations (Ganin et al., 2016; Wang et al., 2020; Moghimifar et al., 2020). In addition, adversarial generative and self-supervision-based methods have also been investigated (Zhao et al., 2020). Recently, existing work has extended UDA to harder video-based datasets. AMLS (Jamal et al., 2018) applies pre-extracted C3D (Tran et al., 2015) features to a Grassmann manifold derived from PCA and utilises adaptive kernels and adversarial learning to perform UDA. TA3N (Chen et al., 2019) attempts to simultaneously align and learn temporal dynamics with entropy-based attention. Using the topology of a bipartite graph network, ABG (Luo et al., 2020a) explicitly models source-target interactions to learn a domain-agnostic video classifier. Nevertheless, all the methods mentioned above assume that the source and target domains share the same label set, which is not realistic in real-world scenarios. To address this issue, OUDA, which assumes that the target domain contains unknown classes, has been studied at both the theoretical and experimental levels (Zhao et al., 2019; Luo et al., 2020b; Baktashmotlagh et al., 2019; Busto and Gall, 2017; Bucci et al., 2020; Saito et al., 2018b). DMD (Wang et al., 2021a) attempts to perform OUDA for videos, but fails to evaluate open recognition with an appropriate metric. Despite significant progress on the broader problems of video classification and OUDA, domain adaptation has received little attention for knowledge transfer across videos under the open-set setting.

3. Methodology

In this section, we first give a formal definition of the Open-set Unsupervised Video Domain Adaptation (OUVDA), and then go through the details of our proposed CEVT framework, illustrated in Figure 1.

3.1. Problem Formulation

We are given a labelled source video set $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ and an unlabelled target video collection $\mathcal{D}_t = \{x_j^t\}_{j=1}^{N_t}$, where $N_s$ and $N_t$ are the numbers of videos in each domain, respectively. Video samples in the source domain $\mathcal{D}_s$ and the target domain $\mathcal{D}_t$ are drawn from different probability distributions, i.e., $p_s \neq p_t$. The two domains share $|\mathcal{C}_s|$ common classes, referred to as the known classes. The target domain contains an additional class that is not shared with the source domain, which is regarded as the unknown class, so that $\mathcal{C}_t = \mathcal{C}_s \cup \{\text{unknown}\}$. Each source video $x_i^s$ or target video $x_j^t$ is composed of $K$ frames, i.e., $x_i^s = \{x_{i,k}^s\}_{k=1}^{K}$ and $x_j^t = \{x_{j,k}^t\}_{k=1}^{K}$, where $x_{i,k}^s, x_{j,k}^t \in \mathbb{R}^{d}$ represent the $d$-dimensional feature vectors of the $k$-th frame of the $i$-th source video and the $j$-th target video, respectively. The primary goal of our method is to learn a classifier $C_t$ for predicting the labels of unlabelled videos in the target domain, together with a collection of functions for recognising samples from the unknown class in the target domain.

3.2. Source Classification

We first feed both source videos and target videos into ResNet (He et al., 2015) to obtain the frame-level features, i.e., $\{x_{i,k}^s\}_{k=1}^{K}$ and $\{x_{j,k}^t\}_{k=1}^{K}$ for the source and target domains, respectively. The frame-level features are then transformed into video-level features, i.e., $\bar{x}_i^s$ and $\bar{x}_j^t$, by frame aggregation techniques, with $\bar{x}_i^s, \bar{x}_j^t \in \mathbb{R}^{d}$. Without loss of generality, we utilise mean Average Pooling (AvgPool), which produces a unified video representation by temporally averaging the frame features. Thus, each source video-level feature and target video-level feature is defined as

$$\bar{x}_i^s = \frac{1}{K}\sum_{k=1}^{K} x_{i,k}^s, \qquad \bar{x}_j^t = \frac{1}{K}\sum_{k=1}^{K} x_{j,k}^t. \qquad (1)$$

Next, the aggregated features with labels from the source domain are fed into the source classifier $C_s$, which is trained to minimise the cross-entropy loss

$$\mathcal{L}_{cls} = -\frac{1}{N_s}\sum_{i=1}^{N_s} \log C_s(\bar{x}_i^s)_{y_i^s}. \qquad (2)$$

The parameters of the source classifier $C_s$ are shared with the target classifier $C_t$, which is used to predict classes for target samples. Figure 1 shows the entire source classification process by a green arrow that starts from the videos on the left and ends at the classification loss $\mathcal{L}_{cls}$.
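To make the aggregation and classification step concrete, the following is a minimal PyTorch sketch of Eqs. (1) and (2). The module names (`AvgPoolAggregator`, `SourceClassifier`), the feature dimension, and the number of shared classes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed sizes: 2048-D ResNet-101 frame features, K frames per video,
# and |C_s| shared source classes. All names here are illustrative.
FEAT_DIM, NUM_SHARED_CLASSES = 2048, 6

class AvgPoolAggregator(nn.Module):
    """Eq. (1): video-level feature = temporal mean of the frame features."""
    def forward(self, frame_feats):           # (batch, K, FEAT_DIM)
        return frame_feats.mean(dim=1)        # (batch, FEAT_DIM)

class SourceClassifier(nn.Module):
    """Linear classifier over the shared classes (parameters shared with C_t)."""
    def __init__(self, feat_dim=FEAT_DIM, num_classes=NUM_SHARED_CLASSES):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)
    def forward(self, video_feats):
        return self.fc(video_feats)            # class logits

aggregator, classifier = AvgPoolAggregator(), SourceClassifier()

# Toy source batch: 4 videos, 16 frames each, with labels.
src_frames = torch.randn(4, 16, FEAT_DIM)
src_labels = torch.randint(0, NUM_SHARED_CLASSES, (4,))

src_video_feats = aggregator(src_frames)                  # Eq. (1)
cls_loss = F.cross_entropy(classifier(src_video_feats),   # Eq. (2)
                           src_labels)
```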

3.3. Entropy-based Weights for Domain Adversarial Learning

To align the distributions of the video-level source and target features, we propose a novel class-conditional EVT module that generates conditional weights for domain adversarial learning by fitting the entropy values of target samples. By assigning an instance-level weight to each sample, the weighted domain adversarial learning can effectively align the known samples from both domains while separating the unknown samples in the target domain. Samples that are likely to come from known classes are assigned a large weight; conversely, samples that are likely to be unknown are given a small weight. Finally, all the features are multiplied by their weights and fed into a standard domain adversarial learning module with a Gradient Reversal Layer (GRL), as shown in Figure 1.

Class-conditional Extreme Value Theory. To obtain the weight for each sample, we feed the target samples into the target classifier $C_t$ to obtain predictions over the shared classes. The entropy value of each target sample can then be computed from its prediction,

$$H(\bar{x}_j^t) = -\sum_{c=1}^{|\mathcal{C}_s|} C_t(\bar{x}_j^t)_c \log C_t(\bar{x}_j^t)_c, \qquad (3)$$

where $C_t(\bar{x}_j^t)_c$ denotes the predicted probability of class $c$ by the classifier. The entropy values of the target samples are then partitioned into $|\mathcal{C}_s|$ entropy groups, i.e., $E_1, \dots, E_{|\mathcal{C}_s|}$. Target samples predicted to be from the $c$-th class are allocated to the group $E_c$. The set of class-conditional entropy groups is formulated as

$$E_c = \left\{ H(\bar{x}_j^t) \;\middle|\; \arg\max_{c'} C_t(\bar{x}_j^t)_{c'} = c \right\}, \qquad c = 1, \dots, |\mathcal{C}_s|. \qquad (4)$$

Next, each group $E_c$ is fitted to a GEV distribution to obtain a set of CDFs of the GEV, i.e., $\{G_1, \dots, G_{|\mathcal{C}_s|}\}$, where $G_c$ indicates the CDF fitted on the entropy values in the $c$-th group $E_c$. The CDF of the GEV is calculated as

$$G_c(h) = \exp\!\left(-\left[1 + \xi_c\left(\frac{h - \mu_c}{\sigma_c}\right)\right]^{-1/\xi_c}\right), \qquad (5)$$

where $\mu_c$, $\sigma_c$, and $\xi_c$ are the location, scale, and shape parameters of the GEV, determined by fitting the data. Utilising the class-conditional EVT compensates for the lack of class information in the entropy values.
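As a sketch of this fitting step, the code below groups target entropy values by predicted class and fits one GEV per group with `scipy.stats.genextreme` (note that SciPy's shape parameter follows its own sign convention). The use of SciPy and the variable names are assumptions for illustration, not necessarily the authors' implementation.

```python
import numpy as np
from scipy.stats import genextreme

def entropy(probs, eps=1e-12):
    """Eq. (3): Shannon entropy of each row of an (N, C) probability matrix."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def fit_class_conditional_gev(target_probs):
    """Fit a GEV on each class-conditional entropy group (Eqs. 4-5).
    Returns {class index: frozen GEV distribution}."""
    preds = target_probs.argmax(axis=1)
    ent = entropy(target_probs)
    gevs = {}
    for c in np.unique(preds):
        group = ent[preds == c]                       # entropy group E_c
        shape, loc, scale = genextreme.fit(group)     # maximum-likelihood fit
        gevs[c] = genextreme(shape, loc=loc, scale=scale)
    return gevs

# Toy usage: softmax outputs of the target classifier for 200 target videos.
rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 6))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
gevs = fit_class_conditional_gev(probs)
```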

Class-conditional Weights. After fitting the GEVs on the entropy values of each entropy group, we set a global threshold $\tau \in (0, 1)$ on all the CDFs in $\{G_1, \dots, G_{|\mathcal{C}_s|}\}$. A set of class-conditional entropy thresholds is then computed, one per group, as

$$\eta_c = G_c^{-1}(\tau), \qquad c = 1, \dots, |\mathcal{C}_s|. \qquad (6)$$

Given a target sample $\bar{x}_j^t$ classified into the $c$-th class, if $H(\bar{x}_j^t)$ is much smaller or much greater than $\eta_c$, the sample is very likely to be known or unknown, and we assign it a weight of 1 or 0, respectively. If $H(\bar{x}_j^t)$ is close to $\eta_c$, meaning the classifier is unsure about $\bar{x}_j^t$, we assign it a weight that depends linearly on the distance from its entropy value to the corresponding class-conditional entropy threshold $\eta_c$. The interval of linear variation is named the mixture entropy interval, within which most uncertain known and unknown samples are mixed. The class-conditional weight can be formulated as

$$w_j = \begin{cases} 1, & H(\bar{x}_j^t) \le \eta_c - r_c, \\ \dfrac{(\eta_c + r_c) - H(\bar{x}_j^t)}{2 r_c}, & \eta_c - r_c < H(\bar{x}_j^t) < \eta_c + r_c, \\ 0, & H(\bar{x}_j^t) \ge \eta_c + r_c, \end{cases} \qquad (7)$$

$$r_c \propto \min\left(\eta_c,\; H_{max} - \eta_c\right), \qquad (8)$$

where $H_{max}$ is the entropy of the uniform prediction vector, with each element of the vector being $1/|\mathcal{C}_s|$, and $2 r_c$ is the length of the mixture entropy interval. The two black line segments at the bottom left of Figure 1 show the total entropy interval $[0, H_{max}]$, on which all the entropy values lie. The mixture entropy interval $[\eta_c - r_c,\, \eta_c + r_c]$ is illustrated by the dashed rectangles; its length varies depending on the distance from $\eta_c$ to either 0 or $H_{max}$, as shown by the red and yellow dashed rectangles of different lengths. An entropy threshold $\eta_c$ that is close to either 0 or $H_{max}$ indicates that the $c$-th class is easy or hard, respectively, and that the entropy values in the class group $E_c$ have small variance; thus, a small mixture interval is needed for such a dense group of entropy values to smoothly assign the weights.
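A sketch of how the class-conditional thresholds and instance weights could be derived from the fitted GEVs follows. The parametrisation of the mixture-interval half-width (here a fraction `alpha` of the distance from the threshold to the nearer end of the entropy range) is an illustrative assumption, not a detail taken from the paper.

```python
import numpy as np
from scipy.stats import genextreme

def class_thresholds(gevs, tau):
    """Eq. (6): invert each class-conditional GEV CDF at the global threshold tau."""
    return {c: dist.ppf(tau) for c, dist in gevs.items()}

def instance_weight(h, eta_c, h_max, alpha=0.5):
    """Piecewise-linear weight around the class threshold eta_c (Eqs. 7-8).
    The half-width alpha * min(eta_c, h_max - eta_c) is an assumption."""
    r = alpha * min(eta_c, h_max - eta_c)
    if h <= eta_c - r:
        return 1.0                            # confidently known
    if h >= eta_c + r:
        return 0.0                            # confidently unknown
    return (eta_c + r - h) / (2.0 * r)        # linear inside the mixture interval

# Toy usage with a single fitted GEV for class 0 and |C_s| = 6 shared classes.
gevs = {0: genextreme(-0.1, loc=0.8, scale=0.2)}
eta = class_thresholds(gevs, tau=0.4)
h_max = np.log(6)                             # entropy of the uniform prediction
w = instance_weight(h=0.9, eta_c=eta[0], h_max=h_max)
```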

Weighted Domain Adversarial Learning. After obtaining the weights $w_j$ for all the target samples via the class-conditional EVT technique, we train the domain classifier $D$ on the source video features and on the target video features multiplied by their instance-level weights. The weighted domain classification loss is calculated as

$$\mathcal{L}_{adv} = -\frac{1}{N_s}\sum_{i=1}^{N_s} \log D\!\left(\bar{x}_i^s\right) - \frac{1}{N_t}\sum_{j=1}^{N_t} \log\!\left(1 - D\!\left(w_j\,\bar{x}_j^t\right)\right). \qquad (9)$$

Gradually, with the proposed conditional EVT and weighted domain adversarial learning modules, known samples of both domains are aligned, and unknown samples of the target domain get separated from known samples.
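The following is a minimal sketch of weighted domain adversarial training with a gradient reversal layer, assuming (as described above) that the EVT weights scale the target features before the domain classifier. The domain-classifier architecture and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient Reversal Layer: identity in the forward pass, negated gradient backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Illustrative binary domain classifier over 2048-D video features.
domain_clf = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 1))

def weighted_adv_loss(src_feats, tgt_feats, tgt_weights, lambd=1.0):
    """Domain BCE loss; target features are scaled by their EVT weights so that
    likely-unknown samples contribute little to the alignment (cf. Eq. 9)."""
    tgt_feats = tgt_feats * tgt_weights.unsqueeze(1)      # instance-level weighting
    feats = grad_reverse(torch.cat([src_feats, tgt_feats], dim=0), lambd)
    logits = domain_clf(feats).squeeze(1)
    labels = torch.cat([torch.ones(len(src_feats)),       # source domain = 1
                        torch.zeros(len(tgt_feats))])     # target domain = 0
    return F.binary_cross_entropy_with_logits(logits, labels)

# Toy usage.
loss_adv = weighted_adv_loss(torch.randn(8, 2048), torch.randn(8, 2048),
                             tgt_weights=torch.rand(8))
```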

3.4. Entropy Maximisation

To further separate the unknown samples, we utilise an entropy maximisation loss to progressively increase the entropy values of the target samples, defined as

$$\mathcal{L}_{em} = \frac{1}{N_t}\sum_{j=1}^{N_t} H(\bar{x}_j^t). \qquad (10)$$

With the weighted domain adversarial learning, known target samples become similar to source samples. The entropy values of source samples gradually decrease because the source classifier is fully trained to optimise the cross-entropy loss $\mathcal{L}_{cls}$. Thus, in the target domain, the entropy values of known samples decrease as well when optimising the two losses $\mathcal{L}_{cls}$ and $\mathcal{L}_{adv}$, while the entropy values of unknown samples increase when optimising the $\mathcal{L}_{em}$ loss. Eventually, the unknown samples are optimally separated from the known samples.
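A short sketch of the entropy maximisation term on target predictions: maximising Eq. (10) is implemented here by minimising its negative, which is an implementation choice for illustration rather than a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def entropy_max_loss(target_logits):
    """Negative mean entropy of the target predictions; minimising this term
    maximises the entropy in Eq. (10)."""
    probs = F.softmax(target_logits, dim=1)
    ent = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
    return -ent.mean()

loss_em = entropy_max_loss(torch.randn(8, 6))  # toy target logits over 6 shared classes
```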

Methods ALL OS OS* UNK HOS
DANN (Ganin et al., 2016) + OSVM (Jain et al., 2014) 66.11 53.41 48.33 83.89 61.33
JAN (Long et al., 2017) + OSVM (Jain et al., 2014) 61.11 51.59 47.78 74.44 58.20
AdaBN (Li et al., 2018) + OSVM (Jain et al., 2014) 60.65 61.35 61.67 59.44 60.54
MCD (Saito et al., 2018a) + OSVM (Jain et al., 2014) 66.67 60.32 57.78 75.56 65.48
TA3N (Chen et al., 2019) + OSVM (Jain et al., 2014) 65.28 58.73 56.11 74.44 63.99
TA2N (Chen et al., 2019) + OSVM (Jain et al., 2014) 62.25 55.95 53.33 71.67 61.16
OSBP (Saito et al., 2018b) + AvgPool 67.19 55.64 50.83 84.47 63.47
Ours 75.28 61.59 56.11 94.44 70.40
Table 1. Performance comparisons on the UCF→HMDB.
Methods ALL OS OS* UNK HOS
DANN (Ganin et al., 2016) + OSVM (Jain et al., 2014) 64.62 64.61 62.94 74.67 68.31
JAN (Long et al., 2017) + OSVM (Jain et al., 2014) 61.47 64.47 62.91 73.80 67.92
AdaBN (Li et al., 2018) + OSVM (Jain et al., 2014) 62.87 60.86 58.78 73.36 65.27
MCD (Saito et al., 2018a) + OSVM (Jain et al., 2014) 66.73 64.96 63.48 73.80 68.25
TA3N (Chen et al., 2019) + OSVM (Jain et al., 2014) 63.40 63.88 61.35 79.04 69.08
TA2N (Chen et al., 2019) + OSVM (Jain et al., 2014) 60.60 61.84 58.39 82.53 68.39
OSBP (Saito et al., 2018b) + AvgPool 64.84 59.61 55.26 85.71 67.19
Ours 70.58 69.29 66.79 84.28 74.52
Table 2. Performance comparisons on the HMDB→UCF.

3.5. Optimisation

The overall objective is to learn the optimal parameters of the CEVT model by minimising

$$\mathcal{L} = \mathcal{L}_{cls} - \lambda_e\,\mathcal{L}_{em} + \lambda_d\,\mathcal{L}_{adv}, \qquad (11)$$

with $\lambda_e$ and $\lambda_d$ the coefficients of the entropy maximisation loss and the weighted adversarial loss, respectively. The GRL reverses the gradient of $\mathcal{L}_{adv}$ with respect to the feature extractor, so that the domain classifier learns to distinguish the domains while the features are learned to be domain-invariant.

3.6. Inference

In this section, we explain the inference stage of the proposed CEVT after the model parameters are optimised. The inference process is denoted by the grey arrows in Figure 1. The target videos are fed into the convolutional network, the frame aggregator, and the target classifier $C_t$. The predictions are then passed into the class-conditional EVT module for open-set recognition. The predicted label of each input target video is given by

$$\hat{y}_j = \begin{cases} c^{*} = \arg\max_{c} C_t(\bar{x}_j^t)_c, & \text{if } H(\bar{x}_j^t) \le \eta_{c^{*}}, \\ \text{unknown}, & \text{otherwise}, \end{cases} \qquad (12)$$

i.e., a target sample predicted as class $c^{*}$ is rejected as unknown if its entropy exceeds the class-conditional threshold $\eta_{c^{*}}$.
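As a sketch of this open-set inference rule, the code below assigns each target video its arg-max class unless its entropy exceeds that class's EVT threshold; the threshold dictionary is assumed to come from a fitting step like the one sketched in Section 3.3, and the unknown label used here is arbitrary.

```python
import numpy as np

UNKNOWN = -1  # label used here for the unknown class (illustrative convention)

def open_set_predict(probs, thresholds, eps=1e-12):
    """Eq. (12): arg-max prediction, rejected as unknown when the sample's
    entropy exceeds the class-conditional threshold eta_c."""
    preds = probs.argmax(axis=1)
    ent = -np.sum(probs * np.log(probs + eps), axis=1)
    out = preds.copy()
    for j, (c, h) in enumerate(zip(preds, ent)):
        if h > thresholds.get(int(c), np.inf):
            out[j] = UNKNOWN
    return out

# Toy usage: 3 target samples over 6 shared classes, threshold known only for class 0.
probs = np.array([[0.90, 0.02, 0.02, 0.02, 0.02, 0.02],
                  [0.20, 0.20, 0.15, 0.15, 0.15, 0.15],
                  [0.50, 0.10, 0.10, 0.10, 0.10, 0.10]])
labels = open_set_predict(probs, thresholds={0: 0.8})
```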

4. Experiments

In this section, we empirically evaluate the performance of the proposed CEVT model on two benchmarks, UCF-HMDB and UCF-Olympic, for open-set unsupervised video domain adaptation.

4.1. Datasets

The UCF-HMDB benchmark is the intersected subset covering 12 highly relevant categories of two large-scale video action datasets, UCF101 (Soomro et al., 2012) and HMDB51 (Kuehne et al., 2011), including Climb, Fencing, Golf, Kick Ball, Pull-up, Punch, Push-up, Ride Bike, Ride Horse, Shoot Ball, Shoot Bow, and Walk. The UCF-Olympic benchmark has six common categories from UCF101 and the Olympic Sports Dataset (Niebles et al., 2010), which involve Basketball, Clean and Jerk, Diving, Pole Vault, Tennis, and Discus Throw. These dataset partitioning strategies follow (Chen et al., 2019) to enable a fair comparison. Likewise, we utilise frame-level features pre-extracted by a ResNet-101 model pre-trained on ImageNet. In terms of known/unknown category splitting, we select the first half of the categories as known classes and label all the remaining categories as unknown, for both UCF-HMDB and UCF-Olympic.

4.2. Evaluation Metrics

To compare the performance of the proposed CEVT and the baseline methods, we adopt four widely used metrics (Bucci et al., 2020; Saito et al., 2018b) for evaluating OUDA tasks. The accuracy (ALL) is the proportion of correctly predicted target samples over all target samples. OS is the average class accuracy over all classes (the known classes plus the unknown class). OS* is the average class accuracy over the known classes only. UNK is the accuracy on the unknown class. HOS is the harmonic mean of OS* and UNK, formulated as $\text{HOS} = \frac{2 \cdot \text{OS*} \cdot \text{UNK}}{\text{OS*} + \text{UNK}}$. HOS is the most meaningful metric for evaluating OUDA tasks because it best reflects the balance between OS* and UNK.
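To make the metrics concrete, here is a small sketch that computes OS*, UNK, and HOS from per-class accuracies; the per-class accuracy computation and the unknown-label convention are assumptions for illustration.

```python
import numpy as np

def open_set_metrics(y_true, y_pred, known_classes, unknown_label=-1):
    """OS* = mean accuracy over the known classes, UNK = unknown-class accuracy,
    HOS = harmonic mean 2 * OS* * UNK / (OS* + UNK)."""
    per_class_acc = [np.mean(y_pred[y_true == c] == c) for c in known_classes]
    os_star = float(np.mean(per_class_acc))
    unk = float(np.mean(y_pred[y_true == unknown_label] == unknown_label))
    hos = 2 * os_star * unk / (os_star + unk) if (os_star + unk) > 0 else 0.0
    return os_star, unk, hos

# Toy usage with two known classes and the unknown class labelled -1.
y_true = np.array([0, 0, 1, 1, -1, -1])
y_pred = np.array([0, 1, 1, 1, -1, 0])
print(open_set_metrics(y_true, y_pred, known_classes=[0, 1]))  # (0.75, 0.5, 0.6)
```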

4.3. Baselines

We compare our proposed CEVT with three types of state-of-the-art domain adaptation methods: closed-set methods for images, closed-set methods for videos, and open-set methods for images. The closed-set domain adaptation methods for images are extended to align the distributions of aggregated frames from the source and target domains, and include the Domain-Adversarial Neural Network (DANN) (Ganin et al., 2016), the Joint Adaptation Network (JAN) (Long et al., 2017), Adaptive Batch Normalisation (AdaBN) (Li et al., 2018), and Maximum Classifier Discrepancy (MCD) (Saito et al., 2018a). In terms of closed-set approaches for videos, the Temporal Attentive Adversarial Adaptation Network (TA3N) (Chen et al., 2019) and the Temporal Adversarial Adaptation Network (TA2N) (Chen et al., 2019) are adopted for comparison. We equip the above closed-set methods with OSVM (Jain et al., 2014) for open recognition. As for open-set methods for images, OUDA by Backpropagation (OSBP) (Saito et al., 2018b), extended with a frame aggregator, is included in the comparison.

4.4. Implementation Details

All the baselines and our approach are implemented in PyTorch (Paszke et al., 2019) on one server with two GeForce RTX 2080 Ti GPUs. We follow (Chen et al., 2019) to sample a specified number of frames with uniform spacing from each video for training and extract a 2048-D feature vector from each frame with a ResNet-101 pre-trained on ImageNet. To ensure a fair comparison, we fix the number of sampled frames $K$ to 16 for all methods using average pooling, and follow the optimisation strategy in (Chen et al., 2019): we utilise stochastic gradient descent (SGD) as the optimiser together with the learning-rate-decreasing technique from DANN (Ganin et al., 2016), with learning rate, momentum, and weight decay of 0.03, 0.9 and , respectively. The scale of the datasets determines the size of the source batch: 32 for UCF-Olympic and 128 for UCF-HMDB. The size of the target batch is computed by multiplying the source batch size by the ratio between the source and target datasets. The two loss coefficients of Eq. (11) are set to 1, 10, 0.19, 0.22 and 0.9, 0.7, 1.83, 5, and the EVT threshold $\tau$ is set to 0.4, 0.45, 0.6, and 0.29, for UCF→HMDB, HMDB→UCF, UCF→Olympic and Olympic→UCF, respectively.
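As a sketch of the uniform-spacing frame sampling mentioned above (K = 16 frames per video), assuming frames are indexed from 0; this is illustrative rather than the authors' exact sampler.

```python
import numpy as np

def uniform_frame_indices(num_frames, k=16):
    """Pick k frame indices with (approximately) uniform spacing over the video."""
    return np.linspace(0, num_frames - 1, num=k).round().astype(int)

print(uniform_frame_indices(120))  # 16 indices spread evenly over a 120-frame video
```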

Methods ALL OS OS* UNK HOS
DANN (Ganin et al., 2016) + OSVM (Jain et al., 2014) 83.33 84.93 86.38 80.60 83.39
JAN (Long et al., 2017) + OSVM (Jain et al., 2014) 88.75 84.23 80.46 95.52 87.35
AdaBN (Li et al., 2018) + OSVM (Jain et al., 2014) 84.17 80.09 76.93 89.55 82.76
MCD (Saito et al., 2018a) + OSVM (Jain et al., 2014) 83.75 84.65 85.50 82.09 83.76
TA3N (Chen et al., 2019) + OSVM (Jain et al., 2014) 87.92 82.74 78.48 95.52 86.17
TA2N (Chen et al., 2019) + OSVM (Jain et al., 2014) 85.83 85.64 85.58 85.82 85.70
OSBP (Saito et al., 2018b) + AvgPool 89.06 86.23 84.31 92.00 87.98
Ours 89.17 87.54 86.38 91.04 88.65
Table 3. Performance comparisons on the Olympic→UCF.
Methods ALL OS OS* UNK HOS
DANN (Ganin et al., 2016) + OSVM (Jain et al., 2014) 94.44 95.33 96.67 91.30 93.91
JAN (Long et al., 2017) + OSVM (Jain et al., 2014) 94.44 96.74 100.00 86.96 93.02
AdaBN (Li et al., 2018) + OSVM (Jain et al., 2014) 87.04 83.86 78.48 100.00 87.95
MCD (Saito et al., 2018a) + OSVM (Jain et al., 2014) 87.04 86.74 86.67 86.96 86.81
TA3N (Chen et al., 2019) + OSVM (Jain et al., 2014) 96.30 97.83 100.00 91.30 95.45
TA2N (Chen et al., 2019) + OSVM (Jain et al., 2014) 88.89 88.74 87.88 91.03 89.56
OSBP (Saito et al., 2018b) + AvgPool 96.88 95.83 94.44 100.00 97.14
Ours 98.15 97.73 96.97 100.00 98.46
Table 4. Performance comparisons on the UCF→Olympic.

4.5. Comparisons with State-of-The-Art

We report the performance of the proposed CEVT and the baseline methods on UCF-HMDB and UCF-Olympic in Table 1, Table 2, Table 3, and Table 4. The proposed CEVT model outperforms all the compared state-of-the-art domain adaptation approaches, improving the HOS by 4.92%, 5.44%, 1.32% and 0.67% on the adaptation tasks UCF→HMDB, HMDB→UCF, UCF→Olympic and Olympic→UCF, respectively. It is worth noting that the proposed model achieves significant performance boosts on the larger-scale UCF-HMDB benchmark. Also, the outstanding performance gain of the proposed framework on the most challenging transfer task, i.e., UCF→HMDB, illustrates the better adaptation ability of our approach. Some methods achieve 100% on OS* or UNK for the UCF→Olympic task because this task is relatively easier than the other tasks and the validation set has limited samples. Also, there is usually a trade-off between these two metrics. For example, JAN achieves 100% on OS* but obtains the lowest score (86.96%) on UNK. Conversely, AdaBN achieves 100% on UNK but has the worst performance (78.48%) on OS*. The proposed CEVT is superior to all baselines, as it achieves remarkably high OS* and UNK simultaneously.

(a) DANN + OSVM
(b) OSBP + AvgPool
(c) TA3N + OSVM
(d) Ours
Figure 2. The t-SNE visualisation of the learned source and target video representations on the UCF→HMDB task.
Methods UCF→HMDB HMDB→UCF
CEVT w/o $\mathcal{L}_{adv}$ & $\mathcal{L}_{em}$ 62.09 71.42
CEVT w/o $\mathcal{L}_{em}$ 64.04 72.01
CEVT w/o $\mathcal{L}_{adv}$ 67.41 72.88
CEVT w unweighted $\mathcal{L}_{adv}$ 68.03 72.56
CEVT 70.40 74.52
Table 5. The ablation performance (HOS%) of the proposed CEVT model on the UCF-HMDB dataset. "w" indicates with and "w/o" indicates without.

4.6. Ablation Study

We use the UCF-HMDB benchmark to investigate the contribution of each proposed module of the CEVT model. Table 5 summarises the experimental results for the variants with different components removed. When both weighted domain adversarial learning and entropy maximisation learning are removed, the performance of CEVT w/o $\mathcal{L}_{adv}$ & $\mathcal{L}_{em}$ drops significantly (by 8.31%) when adapting from UCF to HMDB compared with the complete model. Although only the source classification and the EVT-based unknown recognition remain, the reduced performance (62.09% and 71.42%) is still better than most of the baseline methods equipped with OSVM, proving the robustness of EVT for unknown recognition. CEVT w/o $\mathcal{L}_{em}$ refers to the variant without the entropy maximisation loss, which causes HOS drops (6.36% and 1.09%) in the two adaptation directions. Removing the weighted adversarial learning, referred to as CEVT w/o $\mathcal{L}_{adv}$, results in a slight performance decrease in either direction. CEVT w unweighted $\mathcal{L}_{adv}$ sets the weights of all instances to 1 in $\mathcal{L}_{adv}$, and performs worse than the weighted $\mathcal{L}_{adv}$. The transfer task from UCF to HMDB is more challenging than the opposite direction; thus, the proposed modules are more capable of handling challenging adaptation tasks.

4.7. Parameter Sensitivity

To explore the sensitivity of the loss coefficients of the proposed CEVT, we run experiments on the UCF-HMDB benchmark with varying values of $\lambda_d$ and $\lambda_e$, which adjust the weighted adversarial loss and the entropy maximisation loss, respectively. Even though both coefficients vary over a wide interval, as plotted in Figure 3, the average HOS of the proposed CEVT is very steady for both the UCF→HMDB and HMDB→UCF tasks. No matter how the coefficients vary, the fluctuation of HOS remains within a narrow range for both coefficients. This demonstrates the robustness of our method to varying loss coefficients.

Figure 3. Performance (HOS) comparisons of the proposed CEVT with respect to the varying loss coefficients on the UCF→HMDB (shown in the upper row) and HMDB→UCF (shown in the bottom row) adaptation tasks.

4.8. Visualisation

To intuitively show how our model closes the domain shift between the source and target domains and effectively recognises the unknown samples, we apply t-SNE (van der Maaten and Hinton, 2008) to visualise the features extracted by the baseline models DANN, OSBP and TA3N and by our proposed CEVT on the UCF→HMDB task, as shown in Figure 2. Different colours denote different classes, and unknown samples are grey. The source videos are represented by circles, while triangles represent the target videos. Compared to the baseline methods, the features extracted by CEVT noticeably form tighter clusters, and the unknown samples are better clustered towards the centre.

5. Conclusion

In this work, we propose a CEVT framework to tackle the problem of open-set unsupervised video domain adaptation. Unlike previous works, we intensify open-set recognition to jointly improve the accuracy on both the known and unknown classes. Experiments demonstrate that the proposed algorithm outperforms state-of-the-art methods on both large- and small-scale video datasets. Future work includes testing CEVT on still images and equipping the proposed CEVT with other frame aggregators for videos.

References

  • M. Baktashmotlagh, M. Faraki, T. Drummond, and M. Salzmann (2019) Learning factorized representations for open-set domain adaptation. In Proc. International Conference on Learning Representations, ICLR 2019, Cited by: §1, §2.2.
  • M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann (2013) Unsupervised domain adaptation by domain invariant projection. In Proc. International Conference on Computer Vision, ICCV 2013, pp. 769–776. Cited by: §1.
  • M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann (2014) Domain adaptation on the statistical manifold. In Proc. Conference on Computer Vision and Pattern Recognition, CVPR 2014, pp. 2481–2488. Cited by: §2.2.
  • M. Baktashmotlagh, M. T. Harandi, and M. Salzmann (2016) Distribution-matching embedding for visual domain adaptation. Journal of Machine Learning Research 17, pp. 108:1–108:30. Cited by: §2.2.
  • M. Baktashmotlagh, M. T. Harandi, and M. Salzmann (2017) Learning domain invariant embeddings by matching distributions. In Domain Adaptation in Computer Vision Applications, Advances in Computer Vision and Pattern Recognition, pp. 95–114. Cited by: §1.
  • S. Bucci, M. R. Loghmani, and T. Tommasi (2020) On the effectiveness of image rotation for open set domain adaptation. In Proc. European Conference on Computer Vision, ECCV 2020, pp. 422–438. Cited by: §2.2, §4.2.
  • P. P. Busto and J. Gall (2017) Open set domain adaptation. In Proc. International Conference on Computer Vision, ICCV 2017, pp. 754–763. Cited by: §1, §2.2.
  • M. Chen, Z. Kira, G. Alregib, J. Yoo, R. Chen, and J. Zheng (2019) Temporal attentive alignment for large-scale video domain adaptation. In Proc. International Conference on Computer Vision, ICCV 2019, pp. 6320–6329. Cited by: §1, §1, §2.2, Table 1, Table 2, §4.1, §4.3, §4.4, Table 3, Table 4.
  • J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell (2017) Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 677–691. Cited by: §2.1.
  • X. Duan, W. Huang, C. Gan, J. Wang, W. Zhu, and J. Huang (2018) Weakly supervised dense event captioning in videos. In Proc. Conference on Neural Information Processing Systems, NeurIPS 2018, pp. 3063–3073. Cited by: §2.1.
  • S. M. Erfani, M. Baktashmotlagh, M. Moshtaghi, V. Nguyen, C. Leckie, J. Bailey, and K. Ramamohanarao (2016) Robust domain generalisation by enforcing distribution invariance. In Proc. International Joint Conference on Artificial Intelligence, IJCAI 2016, pp. 1455–1461. Cited by: §1.
  • Z. Fang, J. Lu, A. Liu, F. Liu, and G. Zhang (2021) Learning bounds for open-set learning. In Proc. of the 37th International Conference on Machine Learning, ICML 2021, pp. 3122–3132. Cited by: §1.
  • Z. Fang, J. Lu, F. Liu, J. Xuan, and G. Zhang (2019) Open set domain adaptation: theoretical bound and algorithm. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky (2016) Domain-adversarial training of neural networks. Journal of Machine Learning Research 17, pp. 59:1–59:35. Cited by: §2.2, Table 1, Table 2, §4.3, §4.4, Table 3, Table 4.
  • C. Geng, S. Huang, and S. Chen (2018) Recent advances in open set recognition: A survey. CoRR abs/1811.08581. Cited by: §1.
  • A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola (2012) A kernel two-sample test. Journal of Machine Learning Research 13, pp. 723–773. Cited by: §1, §2.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. Cited by: §3.2.
  • L. P. Jain, W. J. Scheirer, and T. E. Boult (2014) Multi-class open set recognition using probability of inclusion. In Proc. European Conference on Computer Vision, ECCV 2014, pp. 393–409. Cited by: Table 1, Table 2, §4.3, Table 3, Table 4.
  • A. Jamal, V. P. Namboodiri, D. Deodhare, and K. S. Venkatesh (2018) Deep domain adaptation in action space. In Proc. British Machine Vision Conference, BMVC 2018, pp. 264. Cited by: §1, §2.2.
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li (2014) Large-scale video classification with convolutional neural networks. In Proc. Conference on Computer Vision and Pattern Recognition, CVPR 2014, pp. 1725–1732. Cited by: §2.1.
  • S. Kotz and S. Nadarajah (2000) Extreme value distributions: theory and applications. World Scientific. Cited by: §1.
  • R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles (2017) Dense-captioning events in videos. In Proc. International Conference on Computer Vision, ICCV 2017, pp. 706–715. Cited by: §2.1.
  • H. Kuehne, H. Jhuang, E. Garrote, T. A. Poggio, and T. Serre (2011) HMDB: A large video database for human motion recognition. In Proc. International Conference on Computer Vision, ICCV 2011, pp. 2556–2563. Cited by: §4.1.
  • C. Kwak, M. Han, S. Kim, and G. Hahm (2018) Interactive story maker: tagged video retrieval system for video re-creation service. In Proc. International Conference on Multimedia, MM 2018, pp. 1270–1271. Cited by: §1.
  • Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu (2018) Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80, pp. 109–117. Cited by: Table 1, Table 2, §4.3, Table 3, Table 4.
  • J. Liu, G. Wang, P. Hu, L. Duan, and A. C. Kot (2017) Global context-aware attention LSTM networks for 3d action recognition. In Proc. Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 3671–3680. Cited by: §1, §2.1.
  • M. Long, H. Zhu, J. Wang, and M. I. Jordan (2017) Deep transfer learning with joint adaptation networks. In Proc. International Conference on Machine Learning, ICML 2017, pp. 2208–2217. Cited by: Table 1, Table 2, §4.3, Table 3, Table 4.
  • X. Long, C. Gan, G. de Melo, J. Wu, X. Liu, and S. Wen (2018) Attention clusters: purely attention based local feature integration for video classification. In Proc. Conference on Computer Vision and Pattern Recognition, CVPR 2018, pp. 7834–7843. Cited by: §2.1.
  • Y. Luo, Z. Huang, Z. Wang, Z. Zhang, and M. Baktashmotlagh (2020a) Adversarial bipartite graph learning for video domain adaptation. In Proc. International Conference on Multimedia, MM 2020, pp. 19–27. Cited by: §1, §2.2.
  • Y. Luo, Z. Wang, Z. Huang, and M. Baktashmotlagh (2020b) Progressive graph learning for open-set domain adaptation. In Proc. of the 37th International Conference on Machine Learning, ICML 2020, pp. 6468–6478. Cited by: §1, §2.2.
  • C. Ma, A. Kadav, I. Melvin, Z. Kira, G. AlRegib, and H. P. Graf (2018) Attend and interact: higher-order object interactions for video understanding. In Proc. Conference on Computer Vision and Pattern Recognition, CVPR 2018, pp. 6790–6800. Cited by: §2.1.
  • F. Moghimifar, G. Haffari, and M. Baktashmotlagh (2020) Domain adaptative causality encoder. CoRR abs/2011.13549. Cited by: §2.2.
  • J. C. Niebles, C. Chen, and F. Li (2010) Modeling temporal structure of decomposable motion segments for activity classification. In Proc. European Conference on Computer Vision, ECCV 2010, pp. 392–405. Cited by: §4.1.
  • D. Ouyang, J. Shao, Y. Zhang, Y. Yang, and H. T. Shen (2018) Video-based person re-identification via self-paced learning and deep reinforcement learning framework. In Proc. International Conference on Multimedia, MM 2018, pp. 1562–1570. Cited by: §1, §2.1.
  • P. Oza and V. M. Patel (2019) C2AE: class conditioned auto-encoder for open-set recognition. In Proc. Conference on Computer Vision and Pattern Recognition, CVPR 2019, pp. 2307–2316. Cited by: §1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Proc. Conference on Neural Information Processing Systems, NeurIPS 2019, pp. 8024–8035. Cited by: §4.4.
  • M. M. Rahman, C. Fookes, M. Baktashmotlagh, and S. Sridharan (2019) Multi-component image translation for deep domain generalization. In Proc. Winter Conference on Applications of Computer Vision, WACV 2019, pp. 579–588. Cited by: §2.2.
  • M. M. Rahman, C. Fookes, M. Baktashmotlagh, and S. Sridharan (2020a) Correlation-aware adversarial domain adaptation and generalization. Pattern Recognition 100, pp. 107124. Cited by: §1.
  • M. M. Rahman, C. Fookes, M. Baktashmotlagh, and S. Sridharan (2020b) On minimum discrepancy estimation for deep domain adaptation. In Domain Adaptation for Visual Understanding, pp. 81–94. Cited by: §2.2.
  • H. Rahmani and M. Bennamoun (2017) Learning action recognition model from depth and skeleton videos. In Proc. International Conference on Computer Vision, ICCV 2017, pp. 5833–5842. Cited by: §1, §2.1.
  • K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018a) Maximum classifier discrepancy for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, pp. 3723–3732. Cited by: Table 1, Table 2, §4.3, Table 3, Table 4.
  • K. Saito, S. Yamamoto, Y. Ushiku, and T. Harada (2018b) Open set domain adaptation by backpropagation. In Proc. European Conference on Computer Vision, ECCV 2018, pp. 156–171. Cited by: §1, §2.2, Table 1, Table 2, §4.2, §4.3, Table 3, Table 4.
  • K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Proc. Conference on Neural Information Processing Systems, NeurIPS 2014, pp. 568–576. Cited by: §1, §2.1.
  • K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR abs/1212.0402. Cited by: §4.1.
  • D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proc. International Conference on Computer Vision, ICCV 2015, pp. 4489–4497. Cited by: §1, §2.1, §2.2.
  • L. van der Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of Machine Learning Research 9 (86), pp. 2579–2605. Cited by: §4.8.
  • J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan (2018) Hierarchical memory modelling for video captioning. In Proc. International Conference on Multimedia, MM 2018, pp. 63–71. Cited by: §2.1.
  • L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In Proc. European Conference on Computer Vision, ECCV 2016, pp. 20–36. Cited by: §1.
  • Y. Wang, X. Song, Y. Wang, P. Xu, R. Hu, and H. Chai (2021a) Dual metric discriminator for open set video domain adaptation. In Proc. International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, pp. 8198–8202. Cited by: §2.2.
  • Z. Wang, Y. Luo, Z. Huang, and M. Baktashmotlagh (2020) Prototype-matching graph network for heterogeneous domain adaptation. In Proc. International Conference on Multimedia, MM 2020, pp. 2104–2112. Cited by: §2.2.
  • Z. Wang, Y. Luo, R. Qiu, Z. Huang, and M. Baktashmotlagh (2021b) Learning to diversify for single domain generalization. CoRR abs/2108.11726. Cited by: §1.
  • Y. Wei, Z. Cheng, X. Yu, Z. Zhao, L. Zhu, and L. Nie (2019a) Personalized hashtag recommendation for micro-videos. In Proc. International Conference on Multimedia, MM 2019, pp. 1446–1454. Cited by: §1.
  • Y. Wei, X. Wang, L. Nie, X. He, R. Hong, and T. Chua (2019b) MMGCN: multi-modal graph convolution network for personalized recommendation of micro-video. In Proc. International Conference on Multimedia, MM 2019, pp. 1437–1445. Cited by: §1.
  • A. Yan, Y. Wang, Z. Li, and Y. Qiao (2019) PA3D: pose-action 3d machine for video recognition. In Proc. Conference on Computer Vision and Pattern Recognition, CVPR 2019, pp. 7922–7931. Cited by: §1, §2.1.
  • H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo (2017) Mind the class weight bias: weighted maximum mean discrepancy for unsupervised domain adaptation. In Proc. Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 945–954. Cited by: §1, §2.2.
  • Z. Yang, Y. Han, and Z. Wang (2017) Catching the temporal regions-of-interest for video captioning. In Proc. International Conference on Multimedia, MM 2017, pp. 146–153. Cited by: §2.1.
  • H. Zhang and V. M. Patel (2017) Sparse representation-based open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (8), pp. 1690–1696. Cited by: §1.
  • H. Zhao, R. T. des Combes, K. Zhang, and G. J. Gordon (2019) On learning invariant representations for domain adaptation. In Proc. International Conference on Machine Learning, ICML 2019, pp. 7523–7532. Cited by: §1, §2.2.
  • S. Zhao, X. Yue, S. Zhang, B. Li, H. Zhao, B. Wu, R. Krishna, J. E. Gonzalez, A. L. Sangiovanni-Vincentelli, S. A. Seshia, and K. Keutzer (2020) A review of single-source deep unsupervised visual domain adaptation. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §2.2.
  • B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In Proc. European Conference on Computer Vision, ECCV 2018, pp. 831–846. Cited by: §2.1.