Being able to recognize every concept in the world is one of the ultimate goals of computer vision. For tasks such as image classification [9, 15], action recognition [12, 23], or semantic segmentation, the performance on many benchmark datasets has practically saturated [25, 8]. However, these tasks can only be considered solved in the closed world of these datasets, where the distribution of examples per category is often artificially balanced in both the training and test sets. In the real world, this distribution is highly imbalanced, with a few categories covering most of the data (the so-called head of the distribution), and the rest having only a few examples per category (the so-called tail). As a result, methods developed for many standard benchmark datasets end up being poorly suited to the real world.
This problem is especially severe in the action recognition domain, where many actions are extremely rare. Consider the recent AVA dataset for action detection. To collect this dataset, the authors extensively annotated all the human actions that appear in 107.5 hours of movie footage, resulting in a realistic distribution of training and test examples, shown in Figure 1. In particular, the most frequent category, stand, has 164932 training examples, whereas the rarest category, extract, has only 7. State-of-the-art methods on this dataset [4, 5] achieve very low performance on most of the tail categories.
Handling long-tail distributions in the training set has been studied in the image domain. The most common strategy is rebalancing the training set [11, 17, 34]. We demonstrate that naively applying this approach to highly imbalanced video datasets leads to a decrease in performance. We then analyze the reasons for this phenomenon and propose an effective alternative. In  the authors learn to transfer information from head to tail categories via meta-learning. Their approach, however, is based on an assumption that a large set of head categories is available for meta-training, which does not hold for AVA. We also propose to transfer information from head to tail categories, but our method does not have such assumptions, is simpler, and more efficient. Finally, Zhao et al.  utilize the WordNet hierarchy  to aid in transferring information between semantically related categories. While promising, this approach is not directly applicable to the domain of actions. Notice that none of these works address the imbalance in the distribution of test examples.
In this paper we study the problem of action detection in the wild, using the AVA dataset as the motivating example. In contrast to previous work, we start by analyzing the data imbalance problem in the test distribution. We demonstrate that the standard AP metric used for evaluating action detection is not informative for the tail classes: the overwhelming majority of the test examples are negatives for these categories, thus the false positive rate dominates the score. Based on this observation we propose an alternative measure: sampled AP (SAP). It is computed by taking multiple balanced samples from the test set and averaging the score over the samples. With a sufficient number of samples, this measure preserves the standard AP performance for the head categories, while addressing the aforementioned issue for the categories in the tail. Note that AP often does not allow any conclusion to be drawn about the performance of a model on the tail categories, whereas SAP is an actionable metric for learning to recognize rare classes.
Armed with the new measure, we analyze the problem of detecting the actions in the tail. First, we experiment with naive rebalancing of the training examples. This approach, however, results in a decrease in performance for both head and tail categories. Recall that in AVA the most frequent category has 22560 times more examples than the least frequent one, thus the rebalanced dataset mostly consists of duplicated examples from the tail, which prevents the model from learning a high-quality representation within a realistic time budget. To mitigate this issue, we propose to explicitly split the dataset into head and tail parts and only train our model on the data-rich head categories. We then freeze the model, and learn an action classifier using the whole dataset with balancing, see Figure 2. This simple approach improves the performance of the tail categories by a significant margin. We also study several other variants of model finetuning, and demonstrate that they lead to inferior performance.
To summarize, this work has two main contributions:
- We propose a novel metric for action detection: sampled mAP. It preserves the properties of mAP for the categories in the head, while allowing better analysis of the tail categories.
- We propose a simple approach which significantly improves the performance of the categories in the tail.
2 Related work
Long tail distributions have been well studied in the image domain. Oversampling and undersampling are the two most common approaches in the literature. Oversampling rebalances the training set by duplicating the samples of the tail classes, whereas undersampling ignores some of the examples in the head classes. We demonstrate that naively applying such approaches to highly imbalanced video datasets leads to a decrease in performance. Wang et al. proposed to use meta-learning to transfer information from head to tail categories. In particular, they learn a model that takes a classifier trained from a few examples and transforms it into a large-sample classifier. Their method, however, assumes that many large-sample categories are available in the meta-training stage. Our approach is not limited by such assumptions, and is also simpler and more efficient. Recently, Cui et al. proposed a two-stage method for transfer learning on fine-grained classification tasks, where the model is first trained on the whole dataset and then fine-tuned on a balanced subset. In contrast, we demonstrate that explicitly separating the training set into head and tail categories and only training the model on the head, as well as freezing the model weights in the balanced training stage, leads to better results. With the development of deep learning methods, new kinds of loss functions have been proposed to mitigate data imbalance. In  the authors propose to utilize the information about semantic similarity between categories to aid in transferring information between them. While this approach is promising, it cannot be extended to the domain of actions in a straightforward way. In object detection, Lin et al. proposed focal loss to down-weight the gradients of well-classified examples. It is, however, designed for relatively well balanced datasets, such as COCO. Our preliminary results on training human pose categories in AVA with focal loss show no performance increase. Differently from all the methods above, our study focuses on significantly more imbalanced datasets, such as the AVA dataset for action detection, and addresses data imbalance not only in the training set, but also in the test set.
Action recognition is concerned with classifying the actions in videos. Hand-crafted features were used in the early works, where features generated by tracking pixels are aggregated along the temporal axis. These pre-deep-learning methods were later outperformed by deep learning based models, such as two-stream networks, which separately process image and optical flow inputs with two CNNs whose outputs are later merged. However, the capacity of these models is limited by their reliance on 2D information only. This limitation was addressed by  who extended 2D CNN filters with an additional dimension, which enabled learning spatio-temporal features. Carreira and Zisserman further extended this work and introduced the Inflated 3D ConvNet (I3D) by integrating 3D filters into a state-of-the-art 2D architecture and bootstrapping the 3D filters from 2D filters pretrained on ImageNet. Wang et al. further improved the performance by proposing non-local blocks that integrate information from distant locations in space and time. In this paper, we use the non-local I3D model to learn a spatio-temporal feature representation. Notice that the high representational power of these models comes at a price: they can easily overfit to the few training examples of the tail categories. We address this issue by proposing a new training scheme for transferring representations from head to tail classes.
Action detection, as studied in this paper, is the task of spatially localizing the actors with bounding boxes and recognizing their actions. Early action detection methods [14, 21] generate hand-crafted features from videos and train an SVM classifier. Early deep-learning based action detection models [6, 20, 22, 24, 31] are developed on top of 2D object detection frameworks, where 2D appearance features are used for action classification. As a further step, Kalogeiton et al. propose to take multiple frames as input and to predict and classify short tubelets instead of single bounding boxes. Recently, many works propose to take videos as input and learn spatio-temporal features with 3D convolutional neural networks. TCNN uses C3D to extract features, allowing it to achieve a large performance increase. Gu et al. propose to use an I3D for feature representation, which can take even longer video sequences as input. Recently, several works have increased the performance of this baseline approach. For instance, Feichtenhofer et al. propose a SlowFast network with two pathways, where one processes a video at a high frame rate and the other at a low frame rate. Their model is able to capture long temporal dependencies as well as informative context, improving the action detection performance. Girdhar et al. further propose to use a transformer-style architecture to integrate useful information temporally and spatially. Very recently, Zhang et al. proposed to augment I3Ds with a structured module based on graph convolutions, which also achieves strong improvements on AVA.
We begin by introducing the dataset and models used in our study.
3.1 Dataset and metric
We perform our study using the AVA dataset. It consists of hours of raw videos collected from movies and exhaustively annotated with 80 human action categories including human pose, human-object manipulation and human-human interaction. The dataset is split into 211k training and 57k validation clips, each of which typically contains several action instances. The exhaustive labelling of all actions of all persons in all key frames at 1 Hz results in a Zipf’s law type of imbalance across action categories, as shown in Figure 1. In particular, a common action, like standing, has 160k training and 43k test examples, whereas a rare action, like point to (an object), has only 96 training and 32 test instances. Such a distribution of the action categories is desirable, as it represents realistic scenarios. Yet, it makes both training and evaluating the model challenging.
We use frame-based mean average precision (mAP) with an intersection-over-union (IOU) threshold of 0.5 for evaluation. Following the protocol in , we only evaluate on 60 categories which have more than 25 examples in the validation set. In addition to the standard AP metric, we also show results on our proposed sampled AP which, as discussed in Section 4, is more informative for the categories in the tail.
3.2 Action detection model
We work on the action detection task, where actors are localized with bounding boxes and actions are recognized based on a three-second video context. The action detection model that we implement takes three-second videos as input and passes them through the non-local Inflated 3D ConvNet (I3D) to generate a video feature representation. At the same time, a state-of-the-art actor detector localizes the actors on the middle frame of each three-second video. We use ROI pooling to extract features for each bounding box and recognize actions with one fully connected layer.
3.3 Implementation details
Our method is implemented in the Caffe2 framework. We use a 2D ResNet-50 architecture pretrained on the ImageNet dataset, which is then inflated into a 3D ConvNet and fine-tuned on the Kinetics dataset. The input of our action detection model is 36 frames from a 3-second video at 12 fps. The images are first scaled to 272 × 272, and then randomly cropped to 256 × 256.
We train our model on an 8-GPU machine. For training the I3D backbone we use mini-batches of 3 video clips, with a total batch size of 24. For training the linear classifier transformation module, we use a total batch size of 1000. We freeze the parameters of the batch normalization layers during training and apply a dropout layer with a dropout rate of 0.3 before the final layer. In the first stage, we train for 90K iterations with a learning rate of 0.00125 and then for a further 10K iterations with a learning rate of 0.000125. In the second stage, we fine-tune the transformation module for 1 epoch with a learning rate that decays linearly from 0.001 to 0.0001.
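The two learning-rate schedules described above can be written out explicitly. The following is a minimal sketch, not the original Caffe2 configuration: the function names are hypothetical, and because the number of iterations in the second-stage epoch is not stated, the second schedule is assumed to take the completed fraction of the epoch as its argument.

```python
def stage1_learning_rate(iteration):
    """Step schedule: 0.00125 for the first 90K iterations, then 0.000125 for 10K more."""
    if not 0 <= iteration < 100_000:
        raise ValueError("stage 1 is assumed to run for 100K iterations in total")
    return 0.00125 if iteration < 90_000 else 0.000125


def stage2_learning_rate(progress):
    """Linear decay from 0.001 to 0.0001 over one epoch.

    `progress` is the fraction of the epoch completed, in [0, 1] (an assumption,
    since the epoch length in iterations is not given in the text).
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    start, end = 0.001, 0.0001
    return start + (end - start) * progress
```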
We visualize the mSAP score (top) and the variance (bottom) with sampling number from 5 to 40 for all test categories. As observed from the figures, our SAP metric is very stable.
4 Measuring the imbalanced world
We now study the problem of performance evaluation of action detection methods in imbalanced datasets. We begin by discussing mAP - the standard measure for action detection, and demonstrating that it is not well suited for measuring the performance on the tail categories. We then propose an alternative metric - mean sampled average precision (mSAP), which addresses the limitations of mAP in a principled way. Finally, we present an empirical analysis of the two metrics, demonstrating that mSAP preserves the mAP’s scores for the categories in the head, while providing more informative scores for the categories in the tail.
4.1 The AP metric in imbalanced sets
Consider a category in the AVA dataset, such as stand. For any given model, the average precision for this category is computed as the mean of the precision values over different recall levels, where precision is defined in Eq. (1), and recall in Eq. (2).

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$

In these equations, $TP$ stands for true positives, the number of ground truth instances of stand correctly predicted by the model; $FP$ for false positives, the number of instances of other categories classified as stand; and $FN$ is the number of false negatives, the misclassified examples of stand. This measure was originally introduced for information retrieval  to capture how accurate a model is in finding the instances of a certain class in a given collection. It was later adapted by the computer vision community for object , and subsequently, action detection .
The test sets in these datasets were, however, artificially balanced, thus an information retrieval metric was suitable for comparing the performance of detection models on different categories. Indeed, if the proportions of two categories in a set are comparable, then the complexity of retrieving examples of these categories from the set is mainly determined by the ability of the model to recognize them. This, however, is not true for a highly imbalanced dataset, such as AVA. For instance, among the 93994 examples in the validation set, 44449 belong to the category watch (a person) and only 32 are examples of point to (a person). Suppose we have a model that gives truly random predictions at each recall level. According to the precision function in Equation (1), the proportion of true positives among all predicted positives is then equal to the ratio of positive samples in the test set. Therefore the AP scores of this random model for watch and point to are 0.473 and 0.0003 respectively. Clearly, the model’s capacity on the two classes is the same, yet we are not able to draw this conclusion from the values of the AP score. We argue that the main reason is that the precision in Equation (1) is strongly influenced by the size of the pool of positive and negative examples for the category, and not only by the recognition performance of the model.
This observation demonstrates that information retrieval metrics are not suitable for measuring recognition performance in imbalanced datasets. We now propose an improved metric that explicitly separates the complexity of action recognition from that of retrieving extremely rare instances from large data collections.
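The chance-level scores quoted above can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a non-interpolated AP (the mean of the precision at the rank of every positive), which may differ in detail from the official AVA evaluation code, and the function names are hypothetical. It compares the fraction of positives in the pool with the AP obtained from uniformly random scores.

```python
import numpy as np


def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision measured at the rank of each positive."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranked = np.asarray(labels).astype(bool)[order]
    hits = np.cumsum(ranked)                  # true positives among the top-k
    ranks = np.arange(1, len(ranked) + 1)     # k
    return float((hits[ranked] / ranks[ranked]).mean())


def chance_level_ap(num_positives, num_examples):
    """Expected precision of a random ranking: the fraction of positives in the pool."""
    return num_positives / num_examples


def simulate_random_model(num_positives, num_examples, seed=0):
    """AP of a model that assigns uniformly random scores to every example."""
    rng = np.random.default_rng(seed)
    labels = np.zeros(num_examples, dtype=bool)
    labels[:num_positives] = True
    scores = rng.random(num_examples)
    return average_precision(scores, labels)


# Validation-set counts taken from the text above.
for name, num_pos in [("watch (a person)", 44449), ("point to", 32)]:
    print(name, chance_level_ap(num_pos, 93994), simulate_random_model(num_pos, 93994))
```

For the frequent class the simulated AP should land close to the positive fraction; for the rare class it is noisier, because a single highly ranked positive changes the score noticeably, but it stays orders of magnitude below the frequent class even though the model is equally uninformed about both.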
|watch (a person)||59.0||59.3||61.6||61.9|
|work on computer||4.2||24.5||63.5||75.4|
|hit (an object)||0.2||0.6||42.6||58.8|
|point to (object)||0.1||0.2||23.2||41.8|
4.2 A better metric: sampled AP
In order to minimize the influence of class imbalance in testing, while still relying on a retrieval-based metric, we propose to construct an independent and balanced retrieval problem for each class. To this end, we randomly sample a subset of negative boxes for a category to obtain a balanced pool of positive and negative examples, and then compute the standard AP on this set. We repeat this process multiple times and average the results to obtain the final sampled AP (SAP) score.
More formally, assume that $D$ is the test set and $P_c \subset D$ is the set of test samples that belong to class $c$. For each trial $t$, we randomly sample a set of negatives $N_c^t$ (examples from categories other than $c$), with $|N_c^t| = |P_c|$, from the set of all negative samples $D \setminus P_c$. We then use these two sets to compute the average precision $AP(P_c, N_c^t)$. This process is repeated $T$ times and the final sampled AP of a class is computed as follows:

$$SAP_c = \frac{1}{T} \sum_{t=1}^{T} AP(P_c, N_c^t) \tag{3}$$

Finally, we average the SAP scores for all the classes to obtain the mean sampled average precision (mSAP):

$$mSAP = \frac{1}{C} \sum_{c=1}^{C} SAP_c \tag{4}$$

where $C$ is the number of classes in the dataset.
This metric has several desired properties. First of all, it directly addresses the issue with the AP discussed in Section 4.1: balancing the number of positive and negative examples in each sample enables the AP score to reflect the recognition capabilities of the model, and not the complexity of retrieving a few instances from a large set of distractors. Secondly, as we will show in the next section, it preserves the original AP scores for the categories in the head of the distribution. Finally, despite the fact that only a subset of the test set is used in each sample, the model is still being evaluated on the whole test set, given random sampling and enough trials.
We now experimentally analyze our proposed metric by comparing the performance of two models on the tail categories of AVA. One model is the baseline from , whereas the other is our model that transfers the representation from head to tail categories (discussed in detail in the next section). As can be seen from Table 1, the standard AP captures the difference between the two variants for the categories in the head, but the performance difference on the tail categories is uniformly close to 0. This experimentally confirms our claim that AP is not a suitable metric for action detection in highly imbalanced datasets. In contrast, our SAP metric is not biased by the distribution of the test examples and provides informative results for all the categories in the tail. At the same time, it preserves the original AP scores for the head categories. Using our proposed metric we can discover that our method improves the recognition performance for the rare point to category with respect to the baseline, whereas the standard AP score did not allow any conclusions to be drawn. In the next section we capitalize on this new source of information to further improve the performance of the tail categories.
We also study the robustness of our proposed metric to the number of samples used in Eq. 3. Figure 4 shows the mean and variance of the scores of the baseline model from  while varying the number of samples. We observe that the metric is relatively stable using as few as 15 samples, which allows it to be computed efficiently even on datasets with a large number of categories. We also visualize the per-class SAP score and the standard deviation over 15 samples for the baseline model in Figure 3, where the classes are sorted by the number of samples in decreasing order. Overall, the standard deviation is small for both head and tail classes, which indicates that our proposed SAP metric is informative as well as practical.
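The sampling procedure described above can be sketched in a few lines of Python. This is a simplified illustration rather than the evaluation code used for the reported numbers: it treats every test box as an already matched example with a per-class score and a binary label (so the IOU matching step is assumed to have happened earlier), it uses a non-interpolated AP, and the function names, the default of 15 trials, and the seeding scheme are assumptions made for the example.

```python
import numpy as np


def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision measured at the rank of each positive."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranked = np.asarray(labels).astype(bool)[order]
    hits = np.cumsum(ranked)
    ranks = np.arange(1, len(ranked) + 1)
    return float((hits[ranked] / ranks[ranked]).mean())


def sampled_ap(scores, labels, num_trials=15, seed=0):
    """Average the AP over several balanced pools: all positives plus sampled negatives."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    positives = np.flatnonzero(labels)
    negatives = np.flatnonzero(~labels)
    if len(positives) == 0 or len(negatives) == 0:
        raise ValueError("the class needs at least one positive and one negative")
    # Draw as many negatives as there are positives (or all of them if fewer exist).
    num_sampled = min(len(positives), len(negatives))
    rng = np.random.default_rng(seed)
    trial_scores = []
    for _ in range(num_trials):
        sampled = rng.choice(negatives, size=num_sampled, replace=False)
        pool = np.concatenate([positives, sampled])
        trial_scores.append(average_precision(scores[pool], labels[pool]))
    return float(np.mean(trial_scores))


def mean_sampled_ap(score_matrix, label_matrix, num_trials=15, seed=0):
    """mSAP: mean of the per-class sampled AP; inputs have shape (examples, classes)."""
    num_classes = score_matrix.shape[1]
    per_class = [
        sampled_ap(score_matrix[:, c], label_matrix[:, c], num_trials, seed + c)
        for c in range(num_classes)
    ]
    return float(np.mean(per_class))
```

With random scores and a rare class, `average_precision` on the full pool stays near the positive fraction, while `sampled_ap` moves to roughly one half, which is the behaviour the balanced pools are meant to produce.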
|Methods||Sampled AP, all classes||Sampled AP, tail classes||Sampled AP, head classes||Standard AP, all classes||Standard AP, tail classes||Standard AP, head classes|
|1st stage all||47.4||51.8||46.1||16.4||7.9||18.9|
|1st stage head||49.1||53.9||47.7||17.6||8.5||20.3|
|Models||Sampled AP, all classes||Sampled AP, tail classes||Sampled AP, head classes||Standard AP, all classes||Standard AP, tail classes||Standard AP, head classes|
|2nd stage all||47.3||50.9||46.2||16.3||8.2||18.7|
|2nd stage classifier||49.1||53.9||47.7||17.6||8.5||20.3|
|Models||Sampled AP, all classes||Sampled AP, tail classes||Standard AP, all classes||Standard AP, tail classes|
|Ours + original distribution||41.6||35.1||14.4||0.4|
|Ours + balanced distribution||49.1||53.9||17.6||8.5|
5 Learning in the tail
Having developed an informative metric for evaluating a model’s performance in the tail, we now focus on studying different strategies for knowledge transfer from the data-rich head to the rare tail categories. We begin with the standard approach of oversampling during training, see Section 5.1. We observe that in the action detection domain this leads to poor performance. To address this issue, we propose an alternative approach, which learns a feature representation on the head classes, see Section 5.2. Finally, in Section 5.3 we examine different variants of our approach.
5.1 A simple baseline with standard balancing
Balancing the training set is a simple strategy that is widely used to handle the long-tail distribution problem. Following the design principle of , we balance the training set by duplicating samples of the tail classes so that each class has roughly the same number of samples. We directly train the action detection model on the balanced set. As shown in Table 2, we obtain an mAP score of 11.3, compared to 16.7 for the original baseline. Specifically, we observe an overall decrease both on classes with hundreds of thousands of samples and on classes with tens of instances. We argue that the decrease in performance results from the model failing to learn a useful representation from this highly redundant set: recall that in AVA the categories in the tail are thousands of times less frequent than the head categories. Therefore, in the balanced training set the examples of rare categories are heavily oversampled, and most of the batches are not informative. Instead we propose to explicitly split the training set into head and tail parts and use the former to learn a generic representation for action detection. We then transfer this representation to learn to detect rare categories, experimenting with several transfer strategies.
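To make the redundancy argument concrete, the duplication-based balancing described above can be sketched as follows. This is an illustration only: it assumes one label per example (AVA boxes can carry several labels), the helper names are hypothetical, and it is not the balancing code used in the experiments.

```python
import numpy as np


def oversample_to_balance(labels, seed=0):
    """Return example indices in which every class appears as often as the largest class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    balanced = []
    for cls, count in zip(classes, counts):
        members = np.flatnonzero(labels == cls)
        # Keep every original example once, then pad with random duplicates.
        extra = rng.choice(members, size=target - count, replace=True)
        balanced.append(np.concatenate([members, extra]))
    return np.concatenate(balanced)


def duplicate_fraction(labels, balanced_indices):
    """Fraction of the balanced set that consists of repeated examples."""
    return 1.0 - len(labels) / len(balanced_indices)
```

On a toy set with 1000 examples of one class and 5 of another, the balanced set has 2000 entries, 995 of which are copies of the same 5 tail examples. With dozens of tail classes and a head class that is thousands of times larger, duplicates make up most of the balanced set.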
5.2 Our training approach for action detection
The problem of identifying a subset of categories in a dataset that are useful for learning a transferable representation is not trivial. We first observe that the selected classes should be both representative of the whole set of categories and have enough examples to learn a rich feature representation. Therefore, we want to select all the categories for which a model is able to learn a strong representation. In other words, the tail classes should be those classes which have too few samples for the model to learn from. We use the difference between the AP scores of a baseline model on the training and test sets as our main criterion for selecting the classes in the head. The intuition is that for the classes which have too few training samples the training AP scores will be even lower than the test AP scores, indicating that the model is not able to adapt to them. The split is shown in Figure 5, with head classes marked in orange and tail classes in blue.
Given the split into head and tail classes, we first train our model only on the information-rich head categories. Next, in the second stage, we freeze the model, and learn a linear classifier using the whole dataset with balancing. The whole process is shown in Figure 2. The overall training schema is elaborated in Algorithm 1. We show our mean sampled AP and the standard mAP of the action detection baseline on all action classes, tail classes and head classes respectively in Table 2. As we can see in the table, our proposed method achieves a significant increase of 6.2% mSAP on the tail classes. We compare the SAP performance of our proposed method and the baseline approach on the tail classes in Figure 6. We observe a significant increase on classes like “work on a computer” and “cut”, which have a discriminative appearance but were hard for the baseline to learn due to the imbalance in the training set. In contrast, our proposed transfer learning schema makes it possible to better recognize these classes even from a few examples.
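The selection criterion can be illustrated with a short sketch. The text does not state the exact decision rule, so the threshold used here (a class is in the head when its training AP exceeds its test AP by more than a `margin`, zero by default) is an assumption, as are the function name and the toy AP values in the usage example, which are invented for illustration and are not results from the paper.

```python
def split_head_tail(train_ap, test_ap, margin=0.0):
    """Split classes by the gap between training and test AP of a baseline model.

    train_ap, test_ap: dicts mapping class name -> AP of the baseline model.
    margin: hypothetical threshold on (train AP - test AP); not given in the text.
    """
    head, tail = [], []
    for cls in train_ap:
        gap = train_ap[cls] - test_ap[cls]
        # A class the model could not even fit on the training set goes to the tail.
        (head if gap > margin else tail).append(cls)
    return head, tail


# Invented values, for illustration only.
toy_train_ap = {"stand": 0.90, "point to": 0.001}
toy_test_ap = {"stand": 0.80, "point to": 0.002}
head_classes, tail_classes = split_head_tail(toy_train_ap, toy_test_ap)
```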
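The second stage can be sketched on precomputed features. The code below is a simplified stand-in for that stage under several assumptions: the features are taken as already extracted by a backbone trained on the head classes and kept fixed, a single-label softmax classifier replaces the actual classification layer, plain NumPy gradient descent replaces the Caffe2 training loop, and all names and hyperparameters are illustrative. It does not reproduce Algorithm 1.

```python
import numpy as np


def train_balanced_linear_classifier(features, labels, num_classes,
                                     steps=500, batch_size=64, lr=0.1, seed=0):
    """Fit a softmax linear classifier on frozen features using class-balanced batches."""
    rng = np.random.default_rng(seed)
    members = [np.flatnonzero(labels == c) for c in range(num_classes)]
    weights = np.zeros((features.shape[1], num_classes))
    bias = np.zeros(num_classes)
    for _ in range(steps):
        # Balanced sampling: pick a class uniformly, then one of its examples.
        classes = rng.integers(num_classes, size=batch_size)
        idx = np.array([rng.choice(members[c]) for c in classes])
        x, y = features[idx], labels[idx]
        logits = x @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        probs[np.arange(batch_size), y] -= 1.0        # gradient of cross-entropy w.r.t. logits
        # Only the classifier is updated; the features are never modified.
        weights -= lr * (x.T @ probs) / batch_size
        bias -= lr * probs.mean(axis=0)
    return weights, bias


def predict(features, weights, bias):
    """Return the highest-scoring class for each feature vector."""
    return np.argmax(features @ weights + bias, axis=1)
```

Because each batch draws classes uniformly, a class with five examples contributes as many gradient terms as a class with thousands, while the fixed features prevent the duplicated tail examples from distorting the representation.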
5.3 Ablation of variants of our approach
We now analyze the effectiveness of our approach of excluding the tail classes from the first stage of training in Table 3. To this end we explore the alternative strategy of training the model on all the categories without balancing and then re-learning a linear classifier with balancing. This variant still shows an improvement on the tail categories over the baselines due to balancing in the second stage, but fails to preserve the performance of head categories. This is due to the fact that the tail classes have very few training samples, on which it is hard for the model to learn generalizable features. Our proposed training schema, on the other hand, shows an even larger performance boost on the tail classes and at the same time, allows the model to have similar performance on head classes which shows that we can maximally exploit the model’s capacity by training I3D feature representation only with head classes.
Next, we study different ways of transferring the representation from head to tail classes in Table 4. In particular, we analyze two approaches: finetuning the whole model, and only learning a linear classifier on top of a fixed representation. According to our experimental results, both approaches result in an increase in performance on the tail classes, though the improvement of our approach is larger. The performance increase is relatively small on standard mAP; however, our sampled AP makes it possible to better compare the two variants, showing an improvement of 3%. Note that another benefit of fixing the feature representation during the transfer stage is that it also helps preserve the performance on head classes. We hypothesize that this is due to the fact that updating the I3D on a highly redundant, balanced training set results in it losing some of its generalization abilities.
Finally, we evaluate how balancing the training data during the second stage influences the performance. Table 5 compares the results on the balanced training set and on the original training set under our proposed two-stage training schema. As we can see from the table, without balancing the model is biased towards head classes with far more samples, resulting in a significant decrease of 18% mSAP on the tail classes.
In this paper we studied action detection in the real world, where the goal is to learn a model that is able to recognize both the head classes with hundreds of thousands of samples, as well as the tail classes with only a few samples. Different from the other works, we first analyzed the data imbalance problem in the test set. We demonstrated that the standard mAP metric is not suitable for measuring the performance of the tail classes. We then proposed a new, more informative measure - mean sampled average precision (mSAP), which takes balanced samples from the test set and averages the sample scores.
We then used the new measure to study the problem of class imbalance in the training set. We proposed a simple training schema where features are learnt for the head classes and are then transferred to the tail. We also show that training a simple linear classifier with balancing in the second stage on top of fixed representation is both an efficient and effective transfer strategy. Our proposed method can be used to boost the performance of existing action detection models in a simple way.
-  J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
-  Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie. Large scale fine-grained categorization and domain-specific transfer learning. CVPR, pages 4109–4118, 2018.
-  M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
-  C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. CoRR, 2018.
-  R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman. Video action transformer network. CVPR, 2019.
-  G. Gkioxari and J. Malik. Finding action tubes. In CVPR, 2015.
-  C. Gu, C. Sun, S. Vijayanarasimhan, C. Pantofaru, D. A. Ross, G. Toderici, Y. Li, S. Ricco, R. Sukthankar, C. Schmid, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. CVPR, 2018.
-  D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In CVPR, pages 5927–5935, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pages 1026–1034, 2015.
-  R. Hou, C. Chen, and M. Shah. Tube convolutional neural network (T-CNN) for action detection in videos. In ICCV, 2017.
-  N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. INTELL DATA ANAL, 6(5):429–449, Oct. 2002.
-  S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. TPAMI, 35(1):221–231, 2013.
-  V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. Action tubelet detector for spatio-temporal action localization. In ICCV, 2017.
-  A. Kläser, M. Marszałek, C. Schmid, and A. Zisserman. Human focused action localization in video. In ECCV, 2010.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
-  T.-Y. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. ICCV, pages 2999–3007, 2017.
-  C. X. Ling and C. Li. Data mining for direct marketing: Problems and solutions. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, KDD’98, pages 73–79. AAAI Press, 1998.
-  C. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. NLE, 16(1):100–103, 2010.
-  G. A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
-  X. Peng and C. Schmid. Multi-region two-stream r-cnn for action detection. In ECCV, 2016.
-  A. Prest, V. Ferrari, and C. Schmid. Explicit modeling of human-object interactions in realistic videos. TPAMI, 35(4):835–848, 2013.
-  S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. BMVC, 2016.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
-  G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin. Online real-time multiple spatiotemporal action localisation and prediction. In ICCV, 2017.
-  J. C. Stroud, D. A. Ross, C. Sun, J. Deng, and R. Sukthankar. D3D: Distilled 3d networks for video action recognition. arXiv preprint arXiv:1812.08249, 2018.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
-  H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
-  S. Wang, W. Liu, J. Wu, L. Cao, Q. Meng, and P. J. Kennedy. Training deep neural networks on imbalanced data sets. IJCNN, pages 4368–4374, 2016.
-  X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. CVPR, 2017.
-  Y.-X. Wang, D. Ramanan, and M. Hebert. Learning to model the tail. In NIPS, pages 7029–7039, 2017.
-  P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning to track for spatio-temporal action localization. In ICCV, 2015.
-  Y. Zhang, P. Tokmakov, M. Hebert, and C. Schmid. A structured model for action detection. CVPR, 2019.
-  H. Zhao, X. Puig, B. Zhou, S. Fidler, and A. Torralba. Open vocabulary scene parsing. In ICCV, pages 2002–2010, 2017.
-  Q. Zhong, C. Li, Y. Zhang, H. Sun, S. Yang, D. Xie, and S. Pu. Towards good practices for recognition & detection. In CVPR workshops, 2016.