Image to Video Domain Adaptation Using Web Supervision

08/05/2019 ∙ by Andrew Kae, et al. ∙ Microsoft

Training deep neural networks typically requires large amounts of labeled data which may be scarce or expensive to obtain for a particular target domain. As an alternative, we can leverage webly-supervised data (i.e. results from a public search engine) which are relatively plentiful but may contain noisy results. In this work, we propose a novel two-stage approach to learn a video classifier using webly-supervised data. We argue that learning appearance features and then temporal features sequentially, rather than simultaneously, is an easier optimization for this task. We show this by first learning an image model from web images, which is used to initialize and train a video model. Our model applies domain adaptation to account for potential domain shift present between the source domain (webly-supervised data) and target domain and also accounts for noise by adding a novel attention component. We report results competitive with state-of-the-art for webly-supervised approaches on UCF-101 (while simplifying the training process) and also evaluate on Kinetics for comparison.




1 Introduction

Action recognition in videos is a well-studied problem in computer vision with many important applications in areas such as surveillance, search, and human-computer interaction. Training deep neural networks typically requires a large labeled dataset. However, such data may be too scarce or too expensive to obtain. We can instead leverage webly-supervised data (i.e. results from a public search engine), which are relatively plentiful but may be noisy.

Figure 1: Given webly-supervised images and videos (source domains), we learn a video classifier for the target domain. The model is learned in a two-stage process by 1) learning an image model (2D-CNN) and 2) transferring the spatial filters to the video model (3D-CNN) to continue training. The model also accounts for domain shift and noise present in the webly-supervised data.

The high-level overview of our model is shown in Figure 1. The noisy web image and web video domains are considered source domains that we want to domain adapt into the target domain. We present a two-stage approach to first learn an image model using a 2D-CNN, transfer the learned spatial weights to a 3D-CNN, and continue training a video model. Since our goal is to learn a video classifier, we can potentially learn from web videos only, but we argue that our proposed two-stage process is more appropriate for learning from noisy, webly-supervised data. Web videos are likely to be noisier than web images since web videos typically contain many frames that are irrelevant to the target concept. Thus it may be easier to learn spatial features first, based on the relatively cleaner web images, and then learn the temporal features afterward. Previous work [25] has also hypothesized that it may be difficult to learn both spatial and temporal features simultaneously. We present empirical results in Section 4 showing that our two-stage process, which separates learning appearance and temporal features, outperforms a model that learns both jointly.

Figure 2: T-SNE Plots. We randomly sampled from the web image (red points), web video (green points) and target video (blue points) (UCF-101 [23]) domains and show the T-SNE [28] plots of 4 actions: balance beam, long jump, surfing, and throw discus. The first row contains the T-SNE plot before domain adaptation using pre-trained ResNet-34 [11] and the second row shows the same actions after the network has been domain adapted. Plot best viewed in color.

In addition to the challenges of learning the appearance and motion, there are two additional issues with training on webly-supervised data. First, there is potential domain shift between the different domains. For example, comparing web images and videos, many web images are typically high-resolution and shot with high-quality cameras, while web videos are typically lower resolution and may contain motion blur and other artifacts. Second, there may be noise present in webly-supervised data that may degrade performance. For example webly-supervised data may contain either the wrong concept entirely or a mix of relevant and irrelevant concepts (i.e. only a subset of frames in a video may correspond to the target concept).

To account for domain shift, domain adaptation has been successfully used for tasks such as mapping from MNIST [14] to StreetView digits [27, 9], RGB to depth images [27] and webcam to product images [9]. In our work we incorporate an adversarial training component taken from Generative Adversarial Networks (GAN) [10]. To account for the noise present in webly-supervised data, we incorporate a novel attention component to reduce the effect of irrelevant examples, inspired by attention models for machine translation [1].


In this work, the target domain consists of curated videos, containing only a single concept or activity. We consider these curated videos to be a separate domain from web images and web videos. We assume there are relatively few irrelevant chunks from videos in the target domain compared to web videos. For example, this setting may be appropriate if the target domain was surveillance videos.

To check whether there is indeed a difference between the separate domains, we extracted embeddings from random images/frames from each domain using ResNet-34 [11] and visualized T-SNE [28] plots for four different action categories from UCF-101 [23]: Balance Beam, Long Jump, Surfing, Throw Discus. The top row corresponds to the embeddings before domain adaptation (DA) for curated video frames (blue points), web video frames (green points), and web images (red points). The bottom row corresponds to the embeddings after our DA (detail in Section 3). In the top row, before DA, there are visibly distinct regions corresponding to the three domains of web images, web videos and curated videos (we used UCF-101 [23] videos), which may indicate domain differences. After DA, the different domains are packed closer together.

To summarize, our contributions include:

  • A novel two-stage approach to first learn spatial weights from a 2D-CNN and then transfer these weights to a 3D-CNN to learn temporal weights.

  • A novel attention component to account for noise present in webly-supervised data.

  • Results competitive with state-of-the-art on UCF-101 [23], while simplifying training.

2 Related Work

Webly-Supervised Learning. Previous work using webly-supervised data includes [21, 4, 17, 29]. Gan et al. [7] jointly match images and frames in a pre-processing step before using a classifier, while LeadExceed [8] uses multiple steps to filter out noisy images and frames. In contrast, our model does not have pre-processing steps and learns to downweight noisy images as part of model training.

Li et al. [15] use web images to perform domain adaptation and learn a video classifier, but they manually filter out irrelevant web images beforehand whereas we incorporate this step into our model.

There has also been related work in using attention for weakly-supervised learning. Zhuang et al. [32] stack noisy web image features together with the assumption that at least one of the images is correctly labeled. They then learn an attention model to focus on the correctly labeled images. UntrimmedNet [29] generates clip proposals from untrimmed web videos and also incorporates an attention component for focusing on the proposals with the correct action. In contrast, our model learns from both images and videos and ties attention closely with domain adaptation.

Video Classification. 3D-CNN video models such as C3D [24], P3D [19], I3D [3], R(2+1)D  [25] are appealing for video classification since they learn appearance and motion features jointly. I3D [3] uses full 3D filters, while R(2+1)D  [25] and P3D [19] decompose the spatio-temporal convolution into a spatial convolution followed by a temporal convolution. The design of our 3D-CNN is partly inspired by these latter approaches because of this elegant decomposition, which allows us to reuse spatial filters from a conventional 2D CNN. We could potentially use the same bootstrapping technique to inflate 2D to 3D filters as in I3D [3], but initializing and fixing the 2D filters may allow for easier training (more detail in Section 3).

Domain Adaptation. There has been much work in adapting GANs [10] for domain adaptation. Models such as PixelDA [2] learn to generate realistic-looking samples from the source distribution, while others such as DANN [9] learn a domain-invariant feature representation. We adopt this latter approach in our work. Other related works include Adversarial Discriminative Domain Adaptation (ADDA) [27], which learns a piecewise model by pre-training a classifier on the source domain and then adds the adversarial component later. Tzeng et al. [26] learn domain-invariance by incorporating a domain confusion loss (similar to a discriminator loss) and transferring class correlations between domains to preserve class-specific information. Luo et al. [16] propose a similar model to ours but for the supervised setting, and add a semantic-transfer loss to encourage transfer of class-specific information.

The main difference between our model and these approaches is that we use webly-supervised data and assume the source and target domains may contain noisy labels, which is a considerably more difficult yet practical scenario. Lastly there is recent work by Zhang et al. [31] that is similar to our model in that they also have a domain-adversarial component and perform instance weighting to account for noise in the source data. However they use a dual-discriminator approach for instance weighting whereas we use an attention-based component. In addition our model is designed specifically for image to video domain adaptation and classification.

3 Model

Our goal is to learn a video classifier in the target domain by training on the webly-supervised (source) image and video domains. We propose a two-stage approach by first learning an image model using a standard 2D-CNN, transferring the learned spatial weights to a 3D-CNN and then continuing training on videos. We learn a separate model for images and videos since it may be difficult to learn appearance and motion features simultaneously.

Our model should: (1) learn appearance features in the image model and motion features in the video model, (2) transfer the learned spatial weights from the image model to the video model properly, (3) account for noise present in the webly-supervised images and videos, and (4) perform domain adaptation from the webly-supervised domain to the target domain.

The image model shown in Figure 3 is a triplet network that performs both domain adaptation and attention-based filtering of noisy images. The three branches correspond to web images, web video frames, and target video frames (without labels). The image model learns domain invariance between the different domains and also uses an attention component to downweight irrelevant web images and web video frames, with respect to the target video frames. Intuitively we would like to downweight web images/frames that look different from target video frames.

Figure 3: Image Model. Triplet network with branches corresponding to web images, web video frames and target video frames. We add discriminators to enforce domain invariance between the separate domains and add attention components to downweight irrelevant examples. $C$ corresponds to the classifiers and $\mathcal{L}$ corresponds to the losses.

The video model shown in Figure 4 is a Siamese network with branches corresponding to web videos and target videos (without labels). Note that the inputs are now video chunks rather than images.

Figure 4: Video Model. We use a Siamese model with branches corresponding to web video and curated video chunks. We initialize the spatial weights in the 3D-CNN and add an attention component to reduce the noise from irrelevant shots or incorrect labels. $C$ corresponds to the classifier and $\mathcal{L}$ corresponds to the loss.

The spatial weights in the video model are initialized from the image model spatial weights and fixed (as indicated by the dashed lines in Figures 4 and 5). Similar to the image model, the video model also contains domain adaptation and attention components.

3.1 Notation

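Let us define the following notation:

  • $E$: encoder (either a 2D or 3D CNN) that returns an embedding $E(x)$ for an input $x$

  • $C$: classifier that returns predictions among $K$ labels

  • $N_I$, $N_V$, $N_T$: number of webly-supervised images and videos, and curated (target) videos respectively

  • $\mathcal{I} = \{(x_i^I, y_i^I)\}_{i=1}^{N_I}$: set of webly-supervised images where $x_i^I$ is the $i$th image and $y_i^I$ is its corresponding label where $y_i^I \in \{1, \dots, K\}$

  • $\mathcal{V} = \{(x_j^V, y_j^V)\}_{j=1}^{N_V}$: set of webly-supervised videos where $x_j^V$ is the $j$th video and $y_j^V$ is its corresponding label and $y_j^V \in \{1, \dots, K\}$. Each video $x_j^V$ consists of frames $\{x_{j,t}^F\}_{t=1}^{M_j}$ where $M_j$ is the number of frames in video $x_j^V$

  • $\mathcal{T} = \{x_k^T\}_{k=1}^{N_T}$: set of curated videos where $x_k^T$ is the $k$th video. Each video $x_k^T$ consists of frames $\{x_{k,t}^T\}_{t=1}^{M_k}$ where $M_k$ is the number of frames in video $x_k^T$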

3.2 Classification

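We use ResNet-34 [11] as the base architecture for both our image and video models, along with the standard softmax cross-entropy loss to train a classifier for both web images and web video frames. The losses are computed as

$$\mathcal{L}_I = -\,\mathbb{E}_{(x^I, y^I)}\big[\log C(E(x^I))_{y^I}\big], \qquad \mathcal{L}_F = -\,\mathbb{E}_{(x^F, y^F)}\big[\log C(E(x^F))_{y^F}\big]$$

where the expectations are taken over examples $x^I$ and $x^F$, $y^I$ and $y^F$ are their corresponding webly-supervised labels, and $C(\cdot)_y$ denotes the predicted softmax probability of class $y$.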

3.3 Domain Adaptation

Our goal is to learn an encoder $E$ that can produce feature embeddings that are indistinguishable between different domains. To this end, we use GANs [10] as a way to perform domain adaptation. The discriminator $D$ tries to distinguish between embeddings generated from different domains (shown in Figure 3). By optimizing over a mini-max objective, $E$ learns embeddings that can eventually “fool” $D$, thus learning a domain-invariant feature representation.

We define our domain-adaptation loss as


Eqn. (1) distinguishes between web images and target frames, Eqn. (2) distinguishes between web video frames and target frames, and Eqn. (3) distinguishes between web images and web video frames. In each term, the first component corresponds to correctly distinguishing between different domains, and the second component tries to “fool” the discriminator $D$. In addition, we use a multi-layer discriminator (similar to [16]) structured as

$$d_l = D_l\big(\sigma(d_{l-1} \oplus E_l(x))\big)$$

where $D_l$ refers to the discriminator at the $l$-th layer, $d_l$ refers to the discriminator output at the $l$-th layer, $\oplus$ denotes concatenation, $E_l(x)$ is the CNN embedding from the $l$-th layer, and $\sigma$ is the (ReLU) activation function. Intuitively we take the encoder outputs from multiple layers, concatenate them and feed them into a discriminator (a binary classifier). We have empirically found this multi-layer discriminator to perform better than the single-layer version.
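As a rough illustration of the two components in each pairwise domain-adaptation term, the sketch below computes a binary cross-entropy discriminator loss for one pair of domains and takes the encoder ("fooling") objective to be its negation. The function name `domain_adversarial_losses`, the convention that source embeddings are labeled 1 and target embeddings 0, and the use of plain Python lists of discriminator outputs are all assumptions made for illustration; this is not necessarily the exact form used in the paper.

```python
import math

def domain_adversarial_losses(d_source, d_target, eps=1e-12):
    """Binary cross-entropy domain loss for one pair of domains.

    d_source / d_target are discriminator outputs in (0, 1), read as the
    predicted probability that an embedding comes from the source domain.
    (Labeling source as 1 and target as 0 is an assumed convention.)
    """
    # Discriminator objective: classify source as 1 and target as 0.
    src_term = -sum(math.log(p + eps) for p in d_source) / len(d_source)
    tgt_term = -sum(math.log(1.0 - p + eps) for p in d_target) / len(d_target)
    disc_loss = src_term + tgt_term
    # Encoder objective: make the domains indistinguishable, i.e. push the
    # discriminator loss up (equivalently, minimize its negation).
    encoder_loss = -disc_loss
    return disc_loss, encoder_loss

# A discriminator that separates the domains well has a low loss ...
confident, _ = domain_adversarial_losses([0.9, 0.8], [0.1, 0.2])
# ... while one that outputs 0.5 everywhere has been "fooled".
confused, _ = domain_adversarial_losses([0.5, 0.5], [0.5, 0.5])
```

In this toy setting the encoder is doing its job when the discriminator ends up in the `confused` regime, where its loss approaches 2 ln 2.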

3.4 Attention

Learning from webly-supervised data is difficult because it is inherently noisy. For example, if we query for a term such as “archery”, we may get some results containing the action of shooting a bow and arrow but we may also get advertisements for a sporting goods store, or product shots of archery equipment, which are likely less relevant for learning to recognize the action itself.

We present a novel approach for filtering noisy data inspired by work from machine translation [1]. The attention component learns to “filter out” or downweight irrelevant images/frames by comparing the images and frames in each source domain batch to the images from the target domain batch. Intuitively, images from the source domain batch that look very different to images in the target domain batch should be given low weight. For example, it is unlikely that an advertisement or product shot is going to look like frames from the target video which we assume is curated and contains only the action. In this way, we jointly learn the relevance of both web images and web videos by comparison to the target videos.

Note that this weighting is similar to the loss update from [31], but that update is based on scores from a discriminator, whereas our approach is based on learning a similarity function between the different domains. Unlike previous work [7] that performs pre-processing to filter out irrelevant images/frames, our approach needs no manual filtering or pre-processing: it learns a model of relevance jointly with the other components during training. Zhuang et al. [32] learn attention from stacking web image activations together. In contrast, our model learns attention through a comparison of the source and target domains, which may provide a more direct signal for inferring the attention weights.

More formally, given a batch of web images $\{x_i^I\}_{i=1}^{B}$ and their corresponding labels $\{y_i^I\}_{i=1}^{B}$, we compute an attention score $a_i$ for each image such that $\sum_{i=1}^{B} a_i = 1$ (note that we compute these attention weights per batch during training). Let $e_i^I = E(x_i^I)$ denote the CNN embedding for a given image $x_i^I$. We compute attention scores as follows

$$s_{ij} = A(e_i^I, e_j^T) = (e_i^I)^\top W e_j^T$$

where $s_{ij}$ indicates the similarity between web image $x_i^I$ and target image $x_j^T$ and $A$ is the attention model. $W$ is a matrix with dimension $d \times d$, with $d$ the embedding dimension, and parameterizes the similarity between the embeddings from different domains. The parameters for $A$ are learned along with the rest of the model parameters. We then compute

$$\bar{s}_i = \sum_{s \in \mathrm{top}_k(s_{i\cdot})} s$$

where $\mathrm{top}_k(s_{i\cdot})$ consists of the top $k$ scores along the $i$th row. In practice we observed better performance when summing over the top $k$ scores instead of all scores in the row.

We then compute the image attention weights as

$$a_i = \frac{\exp(\bar{s}_i / \tau)}{\sum_{i'=1}^{B} \exp(\bar{s}_{i'} / \tau)}$$

where $\tau$ is a temperature term. $a_i$ is then used to weight the image $x_i^I$ in the cross-entropy loss. The attention weights for video frames are computed in the same way by comparing to the target video frames.
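The three steps of the attention component (pairwise similarity through a learned matrix, summing the top scores per row, and a temperature softmax over the batch) can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions: the bilinear form of the similarity, the name `attention_weights`, and the values chosen for `top_k` and `temperature` are illustrative and are not taken from the paper.

```python
import math

def attention_weights(src_emb, tgt_emb, W, top_k=2, temperature=1.0):
    """Per-batch attention weights for source examples.

    src_emb: list of source embeddings (each a list of floats)
    tgt_emb: list of target embeddings
    W:       square matrix (list of rows) parameterizing the similarity
    """
    def bilinear(u, v):
        # Similarity u^T W v between a source and a target embedding.
        Wv = [sum(w_rc * v_c for w_rc, v_c in zip(row, v)) for row in W]
        return sum(u_r * wv_r for u_r, wv_r in zip(u, Wv))

    scores = []
    for u in src_emb:
        row = [bilinear(u, v) for v in tgt_emb]
        # Keep only the top-k similarities in this row and sum them.
        scores.append(sum(sorted(row, reverse=True)[:top_k]))

    # Softmax with temperature over the batch, so the weights sum to 1.
    # Subtracting the max is a standard numerical-stability step.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

identity = [[1.0, 0.0], [0.0, 1.0]]              # stand-in for a learned W
source = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # last one points away from the targets
target = [[1.0, 0.0], [0.9, 0.1]]
weights = attention_weights(source, target, identity, top_k=2, temperature=0.5)
```

Source examples whose embeddings resemble the target batch receive larger weights, and a smaller temperature makes the weighting more peaked.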

3.5 Image Model

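The image model loss can be rewritten as:

$$\mathcal{L}_I^{att} = -\sum_{i} a_i^I \log C(E(x_i^I))_{y_i^I}, \qquad \mathcal{L}_F^{att} = -\sum_{j} a_j^F \log C(E(x_j^F))_{y_j^F}$$

where the attention weight $a$ is used to weight the batch. We also incorporate the domain adaptation loss $\mathcal{L}_{DA}$ from Equation 4 to get a combined loss of

$$\mathcal{L}_{image} = \mathcal{L}_I^{att} + \mathcal{L}_F^{att} + \lambda\, \mathcal{L}_{DA}$$

where $\lambda$ is a tradeoff parameter between the weighted classification and domain adaptation terms. In practice we use the gradient reversal layer [9], which multiplies the gradient from the discriminator by a negative constant during backpropagation, allowing us to perform optimization in one step instead of the usual two-step optimization for GANs.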
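To make the weighted classification term and the tradeoff concrete, here is a small sketch of an attention-weighted cross-entropy and its combination with a domain adaptation term. The helper names `weighted_cross_entropy` and `combined_loss`, the example probabilities, and the tradeoff value 0.1 are hypothetical; the sketch only shows the arithmetic and does not implement gradient reversal.

```python
import math

def weighted_cross_entropy(probs, labels, weights):
    """Attention-weighted cross-entropy over one batch.

    probs:   per-example class probabilities (softmax outputs)
    labels:  integer class index per example
    weights: attention weights for the batch (assumed to sum to 1)
    """
    return -sum(w * math.log(p[y]) for p, y, w in zip(probs, labels, weights))

def combined_loss(cls_loss, da_loss, tradeoff):
    # Weighted classification term plus the domain adaptation term,
    # balanced by a scalar tradeoff parameter.
    return cls_loss + tradeoff * da_loss

probs = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 1, 0]
uniform = weighted_cross_entropy(probs, labels, [1 / 3] * 3)
# Downweighting the ambiguous third example lowers its contribution.
attended = weighted_cross_entropy(probs, labels, [0.45, 0.45, 0.10])
total = combined_loss(attended, da_loss=1.2, tradeoff=0.1)
```

With uniform weights every example contributes equally; with attention weights, examples judged irrelevant contribute little to the classification gradient.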

3.6 Video Model

The next step is to transfer the spatial filters learned from the image model to the video model. We assume the spatial filters have been learned appropriately from the images and we want the video model to focus on learning the motion filters. One natural way to capture this intuition is by sequentially arranging the spatial filter followed by the temporal/motion filter, as shown in Figure 5a. In this way we elegantly decompose the spatiotemporal kernel into a spatial filter followed by a temporal filter. Note that this formulation corresponds to the R(2+1)D  [25] architecture.

Figure 5: Spatio-temporal Block. (a) The decomposition of the spatiotemporal block into a 2D spatial filter followed by a 1D temporal filter (corresponds to the R(2+1)D [25] model). (b) Our modified block with an added residual connection. The spatial weights are initialized from the 2D CNN weights and fixed, as indicated by the dashed lines.

Unfortunately, there is a problem with simply placing the temporal block directly after the spatial block as shown in Figure 5a for our use case. The “good” spatial filters (initialized from the image model) are now interleaved with untrained temporal filters, which means the output distribution of the spatiotemporal blocks can change significantly since the temporal filters still need to be learned (this is related to the problem of covariate shift [12] in training deep networks). We can reduce this effect somewhat by initializing all the temporal filters to the identity matrix, which will reduce the video model to the image model. However, it is still possible that any slight change to the temporal weights may result in significant distribution changes to the spatiotemporal block, which can result in complicated optimization.

Similar to the motivation of ResNet [11], we propose to alleviate this issue by adding a residual connection (as shown in Figure 5b) and initializing the temporal filters to zero. This may be preferable to the previous approach since it may be easier to optimize to the residual. We have empirically found that we obtain better results by adding the residual connection, as detailed in Section 4.
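The effect of the residual connection with zero-initialized temporal filters can be checked on a toy example. The sketch below assumes that the residual connection skips the temporal filter, so the block outputs the spatial features plus a temporal correction. It uses scalar per-frame features and a 3-tap temporal kernel purely for illustration; `temporal_filter` and `residual_block` are hypothetical helper names.

```python
def temporal_filter(frames, kernel):
    """'Same'-padded 1D temporal filter (cross-correlation, as in CNN layers)
    applied to scalar per-frame features."""
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(frames):   # zero padding outside the clip
                acc += w * frames[idx]
        out.append(acc)
    return out

def residual_block(spatial_out, temporal_kernel):
    # Residual form: spatial features plus a learned temporal correction.
    temporal_out = temporal_filter(spatial_out, temporal_kernel)
    return [s + t for s, t in zip(spatial_out, temporal_out)]

spatial_features = [0.2, 0.5, 0.9, 0.4]      # stand-in for frozen 2D-CNN outputs
# Zero-initialized temporal filter: the block passes spatial features through.
at_init = residual_block(spatial_features, [0.0, 0.0, 0.0])
# After training moves the temporal weights, the output departs from the input.
after_update = residual_block(spatial_features, [0.1, 0.0, -0.1])
```

At initialization the video model therefore reproduces the image model's activations exactly, and training only has to learn a temporal residual on top of them.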

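The video model includes the same domain adaptation and attention components as earlier. The loss for the video domain in Figure 4 is

$$\mathcal{L}_V^{att} = -\sum_{j} a_j^V \log C(E(x_j^V))_{y_j^V}$$

which has the effect of ignoring or downweighting irrelevant video chunks in a soft way. The combined loss is similar to the image loss

$$\mathcal{L}_{video} = \mathcal{L}_V^{att} + \lambda_V\, \mathcal{L}_{DA}^{V}$$

where $\lambda_V$ is a tradeoff parameter.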

3.7 Training

Putting all the pieces together, we first learn an image model (shown in Figure 3) using web images, web video frames, and target video frames as inputs. Each input is fed into a 2D CNN where we extract embeddings that are used to compute the domain adaptation and weighted classification losses. We then learn a video model (shown in Figure 4) by initializing the spatial filters from the 2D CNN and continue learning temporal filters from the videos. Similar to the image model, we use a 3D CNN to extract embeddings which are used to compute the domain adaptation, and weighted classification losses.

4 Experiments

4.1 Data

We evaluate our model on a standard benchmark for video classification, UCF-101 [23], and a larger dataset, Kinetics [13]. UCF-101 contains 101 action categories (such as “golf swing” or “playing guitar”) consisting of about 13K video clips, while Kinetics is a larger dataset containing 400 action categories and about 300K video clips. (There is a more recent version of Kinetics with 600 classes, but we use the older version with 400 classes due to computational limitations.) Similar to previous webly-supervised approaches [7, 8], we used standard image search engines (e.g. Bing, Google) to collect between 800-900 images (using the “photo” filter in the query) and YouTube to collect between 25-50 videos for each category. For our UCF experiments, the whole dataset consists of about 200K images and video keyframes. We follow the same process for collecting webly-supervised data for Kinetics and used about 400K images and video keyframes. (We did not notice a significant improvement when using more data and we wanted to reduce computational overhead.)

Since UCF and Kinetics videos are drawn from YouTube, it is possible that there may be some overlap with the webly-supervised images and videos we collected. To remove any potential overlap, we extracted CNN embeddings from keyframes of the UCF/Kinetics videos and compared them to the embeddings of the web images and web video keyframes. We then removed any web image or web video containing keyframes that had cosine similarity above a threshold (we used 0.9) with any UCF/Kinetics keyframe embedding.
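The overlap-removal step can be sketched as a cosine-similarity filter. The 0.9 threshold comes from the text; the function names, the dictionary layout of `web_items`, and the tiny 3-dimensional embeddings are assumptions chosen to keep the example self-contained.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def filter_overlapping(web_items, benchmark_embeddings, threshold=0.9):
    """Keep web items none of whose keyframes is too similar to a benchmark keyframe.

    web_items: dict mapping an item id to a list of keyframe embeddings
    """
    kept = []
    for item_id, keyframes in web_items.items():
        overlaps = any(
            cosine_similarity(k, b) > threshold
            for k in keyframes
            for b in benchmark_embeddings
        )
        if not overlaps:
            kept.append(item_id)
    return kept

benchmark = [[1.0, 0.0, 0.0]]                     # stand-in for UCF/Kinetics keyframes
web = {
    "near_duplicate": [[0.99, 0.05, 0.0]],
    "unrelated_clip": [[0.0, 1.0, 0.0], [0.3, 0.3, 0.9]],
}
kept_ids = filter_overlapping(web, benchmark)
```

A single keyframe above the threshold is enough to drop the whole web image or video, which matches the conservative filtering described above.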

4.2 Implementation

We used ResNet-34 [11] as the base network for all experiments. Every image is resized such that the shorter dimension is 256 and then a random crop of 224x224 is extracted. For videos we first resize the video in a similar way and then use the Hecate [22] tool to extract keyframes and video chunks. For each video chunk, we extract 24 frames, sampling every other frame to obtain a volume size of 12x224x224x3 per chunk. We use a batch size of 32 for images and 10 for videos (note that we are limited by GPU memory in this case). All models are coded in PyTorch and trained using stochastic gradient descent with momentum. We use a held out validation set (20K webly-supervised images and frames for UCF-101 and 40K for Kinetics) to choose model hyperparameters. Both the 2D CNN encoders and classifiers in the image model (Figure 3) and the 3D CNN encoders and classifiers in the video model (Figure 4) have tied weights.
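The resizing and frame-sampling arithmetic described in this section can be illustrated as follows. The numbers (256, 224, 24 frames, every other frame) come from the text, while the helper names `resized_shape` and `sample_chunk`, the rounding choice, and the example frame indices are assumptions.

```python
def resized_shape(height, width, shorter_side=256):
    """Scale so the shorter dimension equals `shorter_side`, keeping aspect ratio."""
    scale = shorter_side / min(height, width)
    return round(height * scale), round(width * scale)

def sample_chunk(frame_indices, num_frames=24, stride=2):
    """Take `num_frames` consecutive frames and keep every `stride`-th one."""
    return frame_indices[:num_frames][::stride]

new_h, new_w = resized_shape(480, 640)           # e.g. a 480x640 input frame
chunk = sample_chunk(list(range(100, 130)))      # hypothetical frame indices of one chunk
volume_shape = (len(chunk), 224, 224, 3)         # frames x height x width x channels
```

Sampling every other frame halves the temporal extent, so 24 extracted frames yield the 12-frame volume quoted above.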

4.3 Results

Our initial hypothesis was there may be a domain difference between images and videos that may be reducing the effectiveness of the model. To test this hypothesis, we trained a binary classifier (using ResNet-34 [11]) to distinguish between web images and web video frames and found that the classifier was over 99% accurate. We hypothesize that compared to web images, web videos tend to be lower resolution and may contain motion blur and compression artifacts not typically found in web images.

Next, we show an example of the weights learned by attention using a batch size of 32 in Figure 6. The attention weight (defined in Section 3.4) is shown for each web image along with the category of the image (the weights sum to 1). Images that are cartoon-like or contain excessive text tend to receive lower weight since they appear less similar to images from the target domain (UCF-101 [23]). A failure case of the attention component can be observed for the “CliffDiving” and “Drumming” images in the last row. These images look reasonable but may have received lower batch score due to the extreme perspective and odd color palette, respectively. Also, attention does not help for images with the wrong semantic category (e.g. “CliffDiving” in the first row).

Figure 6: Attention Weighting. For a web image batch, we show the attention weight (defined in Section 3.4) for each image and the category of the image (the weights sum to 1). Images with lower weight in the last row tend to be more cartoon-like or contain excessive text while images with higher weight tend to be more representative of the action. Images such as “CliffDiving” and “Drumming” in the last row appear reasonable and can be considered failure cases since they are assigned lower weight.

Table 1 shows ablation study results on UCF-101 [23]. For each row, the accuracy is averaged over the 3 splits of UCF-101.

Input          | Arch | Features   | Accuracy (%)
I              | S    | App        | 62.984
F              | S    | App        | 57.955
I + F          | S    | App        | 70.117
I + F (A)      | T    | App        | 71.365
I + F (DA)     | T    | App        | 72.233
I + F (A + DA) | T    | App        | 72.608
V              | S    | App + Temp | 72.563
V (A)          | D    | App + Temp | 74.012
V (DA)         | D    | App + Temp | 74.288
V (A + DA)     | D    | App + Temp | 74.876
Table 1: Ablation study on UCF-101. We evaluate the image and video models as well as the DA and attention components. We show the top-1 accuracy (%) of each model averaged over 3 splits of UCF-101 [23]. Abbreviations are I: web image, F: web video frame, A: attention component, DA: domain adaptation component, V: web video, S: single branch (standard 2D CNN), T: triplet branch, D: dual branch (Siamese), App: appearance, Temp: temporal.

Input          | Arch | Features   | Accuracy (%)
I + F          | S    | App        | 39.632
I + F (A)      | T    | App        | 41.808
I + F (DA)     | T    | App        | 41.946
I + F (A + DA) | T    | App        | 42.263
V              | S    | App + Temp | 42.220
V (A)          | D    | App + Temp | 42.527
V (DA)         | D    | App + Temp | 42.506
V (A + DA)     | D    | App + Temp | 42.817
Table 2: Ablation study on Kinetics. We evaluate different model components and show the top-1 accuracy (%) of each model. The abbreviations are the same as in Table 1.

The inputs correspond to I: web images, F: web video frames, I + F: web images and video frames together, V: web video chunks. In addition, we train on the different model components A: the attention component, DA: the domain adaptation (adversarial) component, A + DA: both components. The model architectures correspond to S: single branch (i.e. a standard 2D CNN), T: triplet branch corresponding to the image model, D: dual branch (i.e. a Siamese network) corresponding to the video model. Lastly the features correspond to App: appearance (image) features, Temp: temporal (video) features, App + Temp: both appearance and temporal features.

We can see that a model trained with images and video frames together (I+F) outperforms models trained with images (I) or video frames (F) separately. We verified that simply adding more images or video frames did not improve performance. Next, we can see that adding the domain adaptation (DA) and attention (A) components separately helps improve performance by a small amount. Adding both components leads to the best image model, I+F(A+DA), at 72.61% top-1 accuracy.

The next step is to take the spatial weights of the image model, initialize the video model and then continue training on web videos. The video model V has an accuracy of 72.56%, which is slightly lower than the image model accuracy of 72.61%. This may be due to irrelevant frames and noise present in the web videos that are unaccounted for. Adding both the domain adaptation and attention components leads to our best performance of 74.88% top-1 accuracy on UCF-101 for the video model V(A+DA).

We also explored a couple of variations of training the video model V. We first initialized V from ImageNet [5] spatial weights rather than the image model, which resulted in an accuracy of 59.12%. This drop in performance compared to model V (from 72.56% to 59.12%) may vindicate our two-step procedure of training an image model based on web images first, since training on web videos directly led to worse performance. In addition, we explored a variation of the video model V which does not use the residual connection (corresponding to Figure 5a). This model achieves an accuracy of 70.27%, which is lower than the 72.56% of V and may indicate that adding the residual connection leads to an easier optimization.

We compare our approach to previous work on UCF-101 in Table 3. Among webly-supervised approaches, we are competitive with the state-of-the-art LeadExceed [8] model at 76.3%, vs 74.9% for our model. LeadExceed requires 5 stages of model training/refinement steps, while our model unifies classification and filtering and requires only 2 stages. Our model simplifies the training procedure at the cost of a small drop in accuracy (about 1.4%). We note there is still a large gap between webly-supervised methods and the state-of-the-art methods which directly use the UCF training data.

We also evaluate on the larger Kinetics [13] dataset. The results in Table 2 show similar improvements from adding the attention and domain adaptation components, leading to the best performance of 42.8% accuracy. We also compare against leading methods in Table 4 and note that there is a large gap between our webly-supervised approach and state-of-the-art. We are not aware of other webly-supervised approaches evaluated on Kinetics.

Approach       | Type       | Pre  | Train | Acc (%)
UnAtt [15]     | App        | IN   | Web   | 66.4
Webly [7]      | App        | IN   | Web   | 69.3
LeadExceed [8] | App + Temp | IN   | Web   | 76.3
Our model      | App + Temp | IN   | Web   | 74.9
C3D [24]       | App + Temp | K    | UCF   | 82.3
2S [20]        | App + Temp | IN   | UCF   | 88.0
R2D-2S [25]    | App + Temp | K    | UCF   | 97.3
I3D-2S [3]     | App + Temp | IN+K | UCF   | 98.0
Table 3: UCF-101 Results. Comparison to several top approaches on UCF-101 [23]. Abbreviations are App: appearance, Temp: temporal, Pre: pretraining data, Train: training data, Acc: top-1 accuracy, IN: ImageNet, K: Kinetics.

Approach        | Pretrained | Training | Acc (%)
C3D [24]        | ImageNet   | Kinetics | 57.0
2S [20]         | ImageNet   | Kinetics | 61.0
R2D-RGB [25] 2S | Sports-1M  | Kinetics | 75.4
I3D-2S [3]      | ImageNet   | Kinetics | 75.7
NL I3D [30]     | ImageNet   | Kinetics | 77.7
SlowFast [6]    | None       | Kinetics | 79.8
Our model       | ImageNet   | Web      | 42.8
Table 4: Kinetics-400 Results. Comparison to popular approaches on Kinetics [13] for top-1 accuracy (%) on the validation set. The Two-Stream model is abbreviated as 2S.

5 Conclusion

We have presented a new model for video classification using only webly-supervised data. Our model proceeds in two stages by first learning an image model, transferring the spatial weights to the video model, and continuing training with videos. Our model also incorporates an adversarial component to learn a domain-invariant feature representation between source and target domains and accounts for noise using a novel attention component. We demonstrated performance competitive with state-of-the-art for webly-supervised approaches on UCF-101 [23] while simplifying the training procedure, and also evaluated on the larger Kinetics [13] for comparison.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • [2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Computer Vision and Pattern Recognition, 2017.
  • [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
  • [4] J. Chen, Y. Cui, G. Ye, D. Liu, and S.-F. Chang. Event-driven semantic concept discovery by exploiting weakly tagged internet images. In Proceedings of International Conference on Multimedia Retrieval, 2014.
  • [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [6] C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. CoRR, 2018.
  • [7] C. Gan, C. Sun, L. Duan, and B. Gong. Webly-supervised video recognition by mutually voting for relevant web images and web video frames. In European Conference on Computer Vision (ECCV), 2016.
  • [8] C. Gan, T. Yao, K. Yang, Y. Yang, and T. Mei. You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. In Computer Vision and Pattern Recognition (CVPR), 2016.
  • [9] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
  • [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, 2014.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [13] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, A. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  • [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [15] J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli. Attention transfer from web images for video recognition. ACM Multimedia, 2017.
  • [16] Z. Luo, Y. Zou, J. Hoffman, and L. Fei-Fei. Label efficient learning of transferable representations across domains and tasks. In Conference on Neural Information Processing Systems (NIPS), 2017.
  • [17] S. Ma, S. A. Bargal, J. Zhang, L. Sigal, and S. Sclaroff. Do less and achieve more: Training cnns for action recognition utilizing action images from the web. Pattern Recognition, 2017.
  • [18] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
  • [19] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, 2017.
  • [20] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [21] B. Singh, X. Han, Z. Wu, V. I. Morariu, and L. S. Davis. Selecting relevant web trained concepts for automated event retrieval. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [22] Y. Song, M. Redi, J. Vallmitjana, and A. Jaimes. To click or not to click: Automatic selection of beautiful thumbnails from videos. In CIKM, 2016.
  • [23] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.
  • [24] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
  • [25] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
  • [26] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
  • [27] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2017.
  • [28] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
  • [29] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. Untrimmednets for weakly supervised action recognition and detection. In CVPR, 2017.
  • [30] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
  • [31] J. Zhang, Z. Ding, W. Li, and P. Ogunbona. Importance weighted adversarial nets for partial domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2018.
  • [32] B. Zhuang, L. Liu, Y. Li, C. Shen, and I. D. Reid. Attend in groups: a weakly-supervised deep learning framework for learning from web data. In CVPR, 2017.