Localizing the Common Action Among a Few Videos

08/13/2020 ∙ by Pengwan Yang, et al.

This paper strives to localize the temporal extent of an action in a long untrimmed video. Where existing work leverages many examples with their start, their ending, and/or the class of the action during training time, we propose few-shot common action localization. The start and end of an action in a long untrimmed video are determined based on just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module to simultaneously complement the representation of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module to weigh the importance of different support videos. Evaluation of few-shot common action localization in untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal.




1 Introduction

The goal of this paper is to localize the temporal extent of an action in a long untrimmed video. This challenging problem [8, 32] has witnessed considerable progress thanks to deep learning solutions, e.g., [37, 12, 26], fueled by the availability of large-scale video datasets containing the start, the end, and the class of the action [17, 3, 6]. Recently, weakly-supervised alternatives have appeared, e.g., [43, 34, 31, 47, 1, 25, 24, 18]. They avoid the need for hard-to-obtain start and end time annotations, but still require hundreds of videos labeled with their action class. In this paper, we also aim for a weakly-supervised setup, but we avoid the need for any action class labels. We propose few-shot common action localization, which determines the start and end of an action in a long untrimmed video based on just a handful of trimmed videos containing the same action, without knowing their common class label.

We are inspired by recent works on few-shot object detection [7, 35, 16, 36]. Dong et al. [7] start from a few labeled boxes per object and a large pool of unlabeled images. Pseudo-labels for the unlabeled images are utilized to iteratively refine the object detection result. Both Shaban et al. [36] and Hu et al. [16] further relax the labeling constraint by only requiring a few examples to contain a common object, without the strict need to know their class name. Hu et al. [16] introduce two modules to reweigh the influence of each example and to leverage spatial similarity between support and query images. We also require that our few examples contain a common class and we adopt a reweighting module. Different from Hu et al., we have no module to focus on masking objects spatially in images. Instead, we introduce three alternative modules optimized for localizing actions temporally in long untrimmed videos, as illustrated in Figure 1.

Figure 1: Common action localization in an untrimmed query video from three trimmed support videos during inference. The action is localized in the query video based on the common action in the support videos.

We make three contributions in this work. First, we consider common action localization from the few-shot perspective. All we require is that the few trimmed video examples share a common action, which may be obtained from social tags, hashtags, or off-the-shelf action classifiers. Second, we propose a network architecture for few-shot common action localization, along with three modules able to align representations from the support videos with the relevant query video segments. The mutual enhancement module strengthens the query and support representations simultaneously by building upon non-local blocks [44]. The progressive alignment module iteratively integrates the support branch into the query branch. Lastly, the pairwise matching module learns to weigh the importance of different support videos. As a third contribution, we reorganize the videos in ActivityNet1.3 [3] and Thumos14 [17] to allow for experimental evaluation of few-shot common action localization in long untrimmed videos containing a single or multiple action instances.

2 Related work

Action localization from many examples. Standard action localization is concerned with finding the start and end times of actions in videos from many training videos with labeled temporal boundaries [2, 37, 9]. A common approach is to employ sliding windows to generate segments and subsequently classify them with action classifiers [37, 11, 42, 5, 46]. Due to the computational cost of sliding windows, several approaches model the temporal evolution of actions and predict an action label at each time step [9, 27, 38, 48]. The R-C3D action localization pipeline [45] encodes the frames with fully-convolutional 3D filters, generates action proposals, then classifies and refines them. In this paper, we adopt the proposal subnet of R-C3D to obtain class-agnostic action proposals. In weakly-supervised localization, the models are learned from training videos without temporal annotations. They only rely on the global action class labels [30, 43, 31]. Different from both standard and weakly-supervised action localization, our common action localization focuses on finding the common action in a long untrimmed query video given a few (or just one) trimmed support videos without knowing the common action class label, making our task class-agnostic. Furthermore, the videos used to train our approach contain actions that are not seen during testing.

Action localization from few examples. Yang et al. [46] pioneered few-shot labeled action localization, where a few (or at least one) positive labeled and several negative labeled videos steer the localization via an end-to-end meta-learning strategy. It relies on sliding windows that sweep over the untrimmed query video to generate fixed boundary proposals. Rather than relying on a few positive and many negative action class labels, our approach does not require any predefined positive or negative action labels; all we require is that the few support videos have the same action in common. Moreover, we propose a network architecture with three modules that predicts proposals of arbitrary length from commonality only.

Action localization from one example. Closest to our work is video re-localization by Feng et al. [10], which introduces localization in an untrimmed query video from a single unlabeled support video. They propose a bilinear matching module with gating functions for the localization. Compared to video re-localization, we consider a more general and realistic setting, where more than one support video can be used. Furthermore, we consider untrimmed videos of longer temporal extent and we consider action localization from a single frame. To enable action localization under these challenging settings, we introduce modules that learn to enhance and align the query video with one or more support videos, while furthermore learning to weigh individual support videos. We find that our proposed common action localization formulation obtains better results, both in existing and in our new settings.

Action localization from no examples. Localization has also been investigated from a zero-shot perspective by linking actions to relevant objects [19, 20, 29]. Soomro et al. [39] tackle action localization in an unsupervised setting, where no annotations are provided at all. While zero-shot and unsupervised action localization show promise, current approaches are not competitive with (weakly-)supervised alternatives, hence we focus on the few-shot setting.

3 Method

3.1 Problem description

For the task of few-shot common action localization, we are given a small set of trimmed support videos and an untrimmed query video. Both the support and query videos contain the same activity class, although its label is not provided. The goal is to learn a function that outputs the temporal segments of that activity class in the query video. The function is parametrized by a deep network consisting of a support and query branch. During training, we have access to a set of support-query tuples. During both validation and testing, we are only given a few trimmed support videos with a corresponding long untrimmed query video. The data is divided such that the action classes used for training, validation, and testing do not overlap.

3.2 Architecture

We propose an end-to-end network to solve the few-shot common action localization problem. A single query video and a few support videos are fed into the backbone, a C3D network [40], to obtain video representations. The weights of the backbone network are shared between the support and query videos. For the query video, a proposal subnet predicts temporal segments of variable length containing potential activities [45]. Let denote the feature representation of the query video for temporal proposal segments, each of dimensionality . Let denote the representations of the support videos, where we split each support video into fixed temporal parts. The main goal of the network is to align the support representations with the relevant query segment representation:


In Equation 1, denotes the temporal segment representations after alignment with the support representations through . In our common localization network, representations are fed to fully-connected layers that perform a binary classification to obtain the likelihood that each proposal segment matches with the support actions, which is followed by a temporal regression to refine the activity start- and end-times for all segments.

In our network, we consider the following: i) the representations of the support videos need to be aligned with the representations of the activity in the query video, ii) not all support videos are equally informative, and iii) common action localization is a support-conditioned localization task, where the activityness of different query segments should be guided by the support videos. To deal with these considerations, we propose three modules: the mutual enhancement module, the progressive alignment module, and the pairwise matching module.

Figure 2: Modules for aligning representations from the support videos with the relevant query video segments. The mutual enhancement module augments the support and query representations simultaneously through message passing. Then, the progressive alignment module fuses the support into the query branch through recursive use of the basic block. Finally, the pairwise matching module reweighs the fused features according to the similarity between the enhanced query segments and the enhanced support videos.

3.2.1 Mutual enhancement module.

Building on the recent success of the transformer structure [41] and the non-local block [44], which are forms of self-attention, we propose a module which can simultaneously enhance the representations of the support and query videos from each other. The basic block for this module is given as:


where are fully-connected layers, soft denotes the softmax activation, and denotes matrix multiplication. and denote the two inputs. A detailed overview and illustration of the basic block is provided in the supplementary materials. Based on the basic block, we design a mutual enhancement module to learn mutually-enforced representations, both for query proposals and support videos, as shown in Figure 2. The mutual enhancement module has two streams , that are responsible for enhancing query proposals and support videos respectively. The inputs to the mutual enhancement module, and , will be enhanced by each other:
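As a rough illustration of the kind of block described above, the following sketch lets one set of features gather messages from another through fully-connected projections, a softmax, and matrix multiplication. The exact form of the block is not recoverable from the text here, so the query/key/value projections, the residual connection, and all names (`basic_block`, `mutual_enhancement`, `w_q`, `w_k`, `w_v`, `w_o`) are assumptions in the spirit of non-local blocks rather than the authors' definition.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def softmax_rows(m):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def basic_block(x, y, w_q, w_k, w_v, w_o):
    """Let every row of x gather a message from the rows of y (assumed form)."""
    q = matmul(x, w_q)  # fully-connected projection of the first input
    k = matmul(y, w_k)  # fully-connected projections of the second input
    v = matmul(y, w_v)
    attention = softmax_rows(matmul(q, transpose(k)))
    message = matmul(matmul(attention, v), w_o)
    # Residual connection as in non-local blocks (an assumption here).
    return [[a + b for a, b in zip(xr, mr)] for xr, mr in zip(x, message)]


def mutual_enhancement(query, support, query_weights, support_weights):
    """Two streams: the query attends to the support, and vice versa."""
    enhanced_query = basic_block(query, support, *query_weights)
    enhanced_support = basic_block(support, query, *support_weights)
    return enhanced_query, enhanced_support
```

Here `query` would hold one row per proposal segment and `support` one row per temporal part of the support videos; both streams keep the shape of their own input, so later modules can consume them unchanged.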


3.2.2 Progressive alignment module.

We also propose a progressive alignment module to achieve a better fusion of representations from query proposals and support videos. The idea behind this module is to reuse the basic block from the mutual enhancement module to integrate the support branch into the query branch. Inspired by the successful application of residual learning [14, 15], we employ a residual block to make the progressive alignment effective:


where , are fully-connected layers, relu denotes the ReLU activation. A detailed overview and illustration of the residual block is provided in the supplementary materials. We first take query proposal representations from the first module as 0-depth outcome . On top, we adopt our basic block to integrate this outcome with which has been recalibrated by our residual block . We perform this operation multiple times in a recursive manner, i.e.:


Where we set in practice. The advantage of a progressive design is that it strengthens the integration of the support branch into the query branch as we increase the number of basic block iterations. By using the same efficient basic blocks as our first module, the computational overhead is small. An illustration of the progressive alignment module is shown in Figure 2.
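The recursion described above can be sketched as follows. This is a minimal illustration under stated assumptions: the residual block is taken to be `x + relu(x w1) w2`, the support features are recalibrated once with shared weights, and the fusion step is passed in as a callable so that any basic block can be plugged in. The names `residual_block`, `progressive_alignment`, `mean_fuse`, and `depth` are hypothetical, and the depth used in the paper is not stated in this text.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in m]


def residual_block(x, w1, w2):
    """x + relu(x w1) w2, one common residual form (assumed)."""
    update = matmul(relu(matmul(x, w1)), w2)
    return [[a + b for a, b in zip(xr, ur)] for xr, ur in zip(x, update)]


def progressive_alignment(query, support, fuse, w1, w2, depth):
    """Fuse the recalibrated support into the query `depth` times."""
    recalibrated = residual_block(support, w1, w2)
    outcome = query  # the 0-depth outcome is the enhanced query itself
    for _ in range(depth):
        outcome = fuse(outcome, recalibrated)
    return outcome


def mean_fuse(query, support):
    """Stand-in fusion: add the mean support feature to every query row."""
    mean = [sum(col) / len(support) for col in zip(*support)]
    return [[q + m for q, m in zip(row, mean)] for row in query]
```

In a full model, `fuse` would be the attention-style basic block rather than `mean_fuse`; the stand-in only keeps the example small and easy to check by hand.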

3.2.3 Pairwise matching module.

In common action localization, a small number of support videos is used. Intuitively, not every support video is equally informative for the query segments. In addition, different query segments should not be treated equally either. To address these intuitions, we add a means to weigh the influence between each support video and each query segment, by introducing a pairwise matching module.

The inputs to the matching module are all segments of the query video and all support videos. The pairwise matching is a mapping . To align the two components, we first perform an expansion operation on the query segments, denoted as . Then a pooling is applied over the support videos along the temporal dimension, denoted as . Afterwards, we perform an auto broadcasting operation on , which can broadcast the dimension of from to to align with the dimension of . For query segments and for support videos , their match is given by the cosine similarity ( ) and Euclidean distance along the segment axis:


We combine both distance measures:



denotes the sigmoid operation. Tensor can be interpreted as a weight tensor to achieve attention over the and dimensions. is a scalar depicting the similarity between the -th query segment representation and the -th support representation. For the -th query segment representation, corresponds to the weight for different support videos, while for the -th support representation, resembles the weight for different query segments. In the end, we enforce the pairwise matching weight :


where AP denotes an average pooling operation along the support dimension, in other words, AP .
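To make the flow of the pairwise matching module concrete, here is a small sketch: support videos are pooled over time, every query segment is compared with every pooled support video by cosine similarity and Euclidean distance, the two measures are squashed with a sigmoid and combined, and the resulting weights are average-pooled over the support dimension before rescaling the fused features. How the two measures are combined is not recoverable from this text; the product `sigmoid(cos) * sigmoid(-dist)` below is purely an assumption, as are all function names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def temporal_pool(video):
    """Average the temporal parts of one support video into one vector."""
    return [sum(col) / len(video) for col in zip(*video)]


def pairwise_weights(query_segments, support_videos):
    """Weight matrix with one row per query segment, one column per support video."""
    pooled = [temporal_pool(video) for video in support_videos]
    return [
        # Assumed combination: high similarity and small distance give a large weight.
        [sigmoid(cosine(q, s)) * sigmoid(-euclidean(q, s)) for s in pooled]
        for q in query_segments
    ]


def apply_matching(fused_segments, weights):
    """Rescale each fused segment by its weight averaged over support videos."""
    out = []
    for features, row in zip(fused_segments, weights):
        weight = sum(row) / len(row)  # average pooling along the support dimension
        out.append([weight * value for value in features])
    return out
```

With this form, a query segment that resembles most support videos keeps more of its fused representation than one that resembles none of them.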

3.3 Optimization

To optimize our network on the training set, we employ both a classification loss and a temporal regression loss. Unlike, e.g., R-C3D [45], our classification task depends specifically on the few support videos. Accordingly, the loss function is given as:


where and stand for batch size and the number of proposal segments, while denotes the proposal segment index in a batch, is the predicted probability of the proposal segment, is the ground truth label, and represents predicted relative offset to proposals. In the context of this work, the ground truth label is class-agnostic and hence binary (foreground/background), indicating the presence of an action or not. Lastly, represents the coordinate transformation of ground truth segments to proposals.
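A loss of this shape can be sketched as below. The concrete choices are assumptions made for illustration: binary cross-entropy for the foreground/background term, a smooth L1 penalty on the offsets of foreground proposals only (as is common in R-C3D-style detectors), and a trade-off weight `lam`. None of these names or choices is confirmed by the text.

```python
import math


def smooth_l1(diff):
    """Smooth L1 penalty: quadratic near zero, linear elsewhere."""
    a = abs(diff)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def localization_loss(probs, labels, pred_offsets, gt_offsets, lam=1.0):
    """Classification plus regression loss over a list of proposals.

    probs:        predicted foreground probability per proposal
    labels:       binary ground truth label per proposal (1 = foreground)
    pred_offsets: predicted relative offsets per proposal
    gt_offsets:   ground truth offsets per proposal
    """
    eps = 1e-7
    # Binary cross-entropy averaged over all proposals (assumed normalization).
    cls = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
        for p, y in zip(probs, labels)
    ) / len(probs)
    # Regression only contributes for foreground proposals (assumption).
    foreground = [i for i, y in enumerate(labels) if y == 1]
    reg = 0.0
    if foreground:
        reg = sum(
            smooth_l1(p - g)
            for i in foreground
            for p, g in zip(pred_offsets[i], gt_offsets[i])
        ) / len(foreground)
    return cls + lam * reg
```

The same function could be evaluated twice, once on the support-agnostic proposal predictions and once on the support-conditioned predictions, with the two values summed for joint optimization.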

The above loss function is applied on two parts: the support-agnostic part and the support-conditioned part. All losses for the two parts are optimized jointly. In the support-agnostic part, the foreground/background classification loss predicts whether the proposal contains an activity, or not, and the regression loss optimizes the relative displacement between proposals and ground truths. For the support-conditioned part, the loss predicts whether the proposal has the same common action as the one among the few support videos. The regression loss optimizes the relative displacement between activities and ground truths. We note explicitly that this is done for the training set only.

During inference, the proposal subnet generates proposals for the query video. The proposals are refined by Non-Maximum Suppression (NMS) with a threshold of 0.7. Then the selected proposals are fused with the support videos through the mutual enhancement, progressive alignment, and pairwise matching modules. The obtained representation is fed to the classification subnet to again perform binary classification and the boundaries of the predicted proposals are further refined by the regression layer. Finally, we conduct NMS based on the confidence scores of the refined proposals to remove redundant ones, and the threshold in NMS is set a little bit smaller than the overlap threshold in evaluation ( in this paper).
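Both NMS steps in the inference procedure operate on one-dimensional temporal segments. The following is a minimal greedy sketch of temporal NMS; the function names and the `(start, end)` tuple representation are illustrative assumptions, not the authors' code.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def temporal_nms(segments, scores, threshold):
    """Greedy NMS: keep the best segment, drop others overlapping it too much.

    Returns the indices of the kept segments, highest score first.
    """
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[k]) <= threshold for k in keep):
            keep.append(i)
    return keep
```

For example, with segments `(0, 100)` and `(10, 110)` the overlap is 90/110, roughly 0.82, so the lower-scoring one is removed at a threshold of 0.7 but kept at a threshold of 0.9.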

3.3.1 Optimizing for long videos.

The longer the untrimmed query video, the larger the need for common localization, as manual searching for the activity becomes problematic. In our setup, the length of the input video is set to 768 frames to fit the GPU memory. When the query video is longer than 768 frames, we employ multi-scale segment generation [37]. We apply temporal sliding windows of 256, 512, and 768 frames with 75% overlap. Consequently, we generate a set of candidates as input for the proposal subnet, where H is the total number of sliding windows, and and are the starting time and ending time of the -th segment . All refined proposals of all candidate segments together go through the NMS to remove redundant proposals.
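The multi-scale window generation can be sketched as follows. A 75% overlap means each window starts a quarter of its length after the previous one, i.e. strides of 64, 128, and 192 frames for the three scales. How the end of the video is handled is not specified in the text; this sketch simply keeps windows that fit entirely inside the video, which is an assumption, and `sliding_windows` is a hypothetical name.

```python
def sliding_windows(num_frames, lengths=(256, 512, 768), overlap=0.75):
    """Return (start, end) frame windows for every scale.

    Windows that would run past the end of the video are skipped (assumption).
    """
    windows = []
    for length in lengths:
        stride = int(length * (1.0 - overlap))  # 75% overlap -> stride = length / 4
        for start in range(0, num_frames - length + 1, stride):
            windows.append((start, start + length))
    return windows
```

For a 1024-frame query video this produces 13 windows of 256 frames, 5 of 512 frames, and 2 of 768 frames. Each window would be passed to the proposal subnet, and the refined proposals of all windows would then be pooled for a final NMS.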

                              Common instance           Common multi-instance
                              ActivityNet   Thumos      ActivityNet   Thumos
Video statistics
  number of instances         1             1           1.6           14.3
  number of frames            266.9         284.6       444.5         5764.2
  length (sec)                89.0          11.4        148.2         230.6
  number of train videos      10035         3580        6747          1665
  number of val+test videos   2483          775         1545          323
Class statistics
  number of train actions     160           16          160           16
  number of val+test actions  40            4           40            4
Table 1: Overview of the common (multi-)instance datasets. The common instance datasets contain a single target action per video, while the common multi-instance datasets contain more frames and more actions per video, adding to the challenge of few-shot common action localization.

4 Experimental setup

4.1 Datasets

Existing video datasets are usually created for classification [22, 17], temporal localization [3], captioning [4], or summarization [13]. To evaluate few-shot common action localization, we have revised two existing datasets, namely ActivityNet1.3 [3] and Thumos14 [17]. Both datasets come with temporal annotations suitable for our evaluation. We consider both common instance and common multi-instance, where the latter deals with query videos containing multiple instances of the same action.

Common instance. For the revision of ActivityNet1.3, we follow the organization of Feng et al. [10]. We divide videos that contain multiple actions into independent videos, with every newly generated video consisting of just one action and background. Next we discard videos longer than 768 frames. We split the remaining videos into three subsets, divided by action classes. We randomly select 80% of the classes for training, 10% of the classes for validation, and the remaining 10% of the classes for testing. Besides ActivityNet, we also revise the Thumos dataset using the same protocol.
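The class-based split can be sketched in a few lines. The seed, the class names, and the function name `split_classes` are hypothetical; the point is only that the split is made over action classes rather than over videos, so that validation and test actions are never seen during training.

```python
import random


def split_classes(classes, seed=0):
    """Randomly split action classes into 80% train, 10% val, 10% test."""
    shuffled = sorted(classes)
    random.Random(seed).shuffle(shuffled)  # the seed is an illustrative choice
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

With 200 classes this yields 160 training classes and 40 validation plus test classes, and with 20 classes it yields 16 and 4, matching the class counts listed in Table 1.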

ActivityNet Thumos
MEM PAM PMM one-shot five-shot one-shot five-shot
42.4 42.5 37.5 38.4
49.7 52.0 42.3 44.5
51.3 53.6 44.8 46.0
52.5 55.3 47.6 49.6
53.1 56.5 48.7 51.9
Table 2: Module evaluation on ActivityNet and Thumos in the common instance setting. All three modules have a positive mAP effect on the localization performance with only a slight increase in parameters.
(a) Without our modules.
(b) With our modules.
Figure 3: Module evaluation by t-SNE visualization of support and query representations. Colors of query proposals indicate their overlap with the ground truth action, the darker the better. Without our modules (left), both relevant and irrelevant query proposals are near the support videos. Afterwards (right), only relevant proposals remain close to the support videos, highlighting the effectiveness of our modules for localizing common actions among a few videos.

Common multi-instance. Query videos in real applications are usually unconstrained and contain multiple action segments. Therefore, we also split the original videos of ActivityNet1.3 and Thumos14 into three subsets according to their action classes without any other video preprocessing. As a result, we obtain long query videos with multiple action instances. The support videos are still trimmed action videos.

During training, the support videos and query video are randomly paired, while the pairs are fixed for validation and testing. The differences between the common instance and common multi-instance video datasets are highlighted in Table 1.

4.2 Experimental details

We use PyTorch [33] for implementation. Our network is trained with Adam [23] with a learning rate of 1e-5 on one Nvidia GTX 1080 Ti. We use 40k training iterations and the learning rate is decayed to 1e-6 after 25k iterations. To be consistent with the training process of our baselines [10, 49], we use the same C3D backbone [40]. The backbone is pre-trained on Sports-1M [21] and is fine-tuned with a class-agnostic proposal loss on the training videos for each dataset. The batch size is set to 1. The proposal score threshold is set to 0.7. The number of proposals after NMS is 128 in training and 300 in validation and testing.
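The learning rate schedule above amounts to a single step decay, which can be written as a small helper. The function and constant names are hypothetical, and treating iteration indices as zero-based is an assumption.

```python
TOTAL_ITERATIONS = 40_000  # total number of training iterations
DECAY_ITERATION = 25_000   # the learning rate drops after this many iterations


def learning_rate(iteration):
    """Step schedule: 1e-5 for the first 25k iterations, 1e-6 afterwards."""
    if not 0 <= iteration < TOTAL_ITERATIONS:
        raise ValueError("iteration outside the training schedule")
    return 1e-5 if iteration < DECAY_ITERATION else 1e-6
```

In a training loop, the returned value would be assigned to the optimizer's learning rate before each update step.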

4.3 Evaluation

Following [37, 10], we measure the localization performance using (mean) Average Precision. A prediction is correct when it has the correct foreground/background prediction and has a ground truth overlap larger than the overlap threshold. The overlap is set to 0.5 unless specified otherwise.
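As an illustration of this protocol, the sketch below computes average precision for one query video at a given overlap threshold. It assumes that the predictions passed in are those classified as foreground, that each ground truth segment can be matched at most once, and that non-interpolated AP is used; the text does not specify these details, and the function names are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def average_precision(predictions, ground_truths, threshold=0.5):
    """Non-interpolated AP.

    predictions:   list of (score, (start, end))
    ground_truths: list of (start, end)
    """
    if not ground_truths:
        return 0.0
    matched = set()
    hits = 0
    precision_sum = 0.0
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    for rank, (_, segment) in enumerate(ranked, start=1):
        best, best_iou = None, threshold
        for index, gt in enumerate(ground_truths):
            iou = temporal_iou(segment, gt)
            # Correct only if the overlap is strictly larger than the threshold.
            if index not in matched and iou > best_iou:
                best, best_iou = index, iou
        if best is not None:
            matched.add(best)
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(ground_truths)
```

Averaging this value over all query videos (or over classes, depending on the protocol) gives the mean Average Precision at that threshold.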

5 Experimental results

5.1 Ablation study

Module evaluation. We evaluate the effect of the mutual enhancement module (MEM), the progressive alignment module (PAM), and the pairwise matching module (PMM) for our task on the common instance datasets. We report results using one and five support videos in Table 2. To validate the effectiveness of our modules, we compare to our baseline system without any modules. Here the support representations are averaged and added to the query representations. We observe that the progressive alignment module increases over the baseline considerably, showing its efficacy. Adding the pairwise matching on top of the progressive alignment or using the mutual enhancement before the progressive alignment further benefits few-shot common action localization. Combining all three modules works best.

To get insight into the workings of our modules for common action localization, we have analysed the feature distribution before and after the use of our modules. In Figure 3, we show the t-SNE embedding [28] before and after we align the five support videos with the 300 proposals in one query video. We observe that after the use of our modules, the proposals with high overlap are closer to the support videos, indicating our ability to properly distill the correct action locations using only a few support videos. Irrelevant proposals are pushed away from the support videos, which results in a more relevant selection of action locations.

No noise 56.5
1 noisy support video 53.5
2 noisy support videos of different class 51.9
2 noisy support videos of same class 50.6
Table 3: Influence of noisy support videos on common-instance ActivityNet for the five-shot setting. The result shows that our approach is robust to the inclusion of noisy support videos, whether they come from the same or different classes.

Few-shot evaluation. Our common action localization is optimized to work with multiple examples as support. To show this capability, we have measured the effect of gradually increasing the number of support videos on common-instance ActivityNet. The mAP increases steadily as we enlarge the number of support videos from one to six: 53.1 (one shot), 53.8 (two shots), 54.9 (three shots), 55.4 (four shots), 56.5 (five shots), and 56.8 (six shots). The results show that our approach obtains high accuracy with only a few support videos, and that it benefits from using more than one support video. Results stagnate when using more than six examples.

(a) Effect of support video length.
(b) Effect of action ratio in query video.
Figure 4: Ablation studies on the length of the support videos and the action proportion in the query video. Both studies are on common-instance ActivityNet. Left: The longer the support videos, the better we perform, as we can distill more knowledge from the limited provided supervision. Right: High scores can be obtained when the common action is dominant; localization of short actions in long videos remains challenging.

Effect of support video length. We ablate the effect of the length of the support videos on the localization performance in Figure 4(a). We sample 16, 32, 48, and 64 frames for each support video, respectively. We find that the result gradually increases with longer support videos, which indicates that temporal information in the support videos is beneficial to our modules for common action localization.

Influence of action proportion in query video. Figure 4(b) shows that for query videos with a dominant action, we can obtain high scores. An open challenge remains localizing very short actions in very long videos.

Figure 5: Qualitative results of predictions by our approach under 1-shot, 3-shot, and 5-shot settings. Correct predictions with an overlap larger than 0.5 are marked in green, and incorrect predictions are marked in red. The length and start-end boundaries of each segment are indicated in frame numbers.

Influence of noisy support videos. To test the robustness of our approach, we have investigated the effect of including noisy support videos in the five-shot setting. The results are shown in Table 3. When one out of five support videos contains the wrong action, the performance drops by only 3 points, from 56.5 to 53.5. The performance drop remains marginal when replacing two of the five support videos with noisy videos. When two noisy support videos are from the same class, the drop is larger, which is to be expected, as this creates a stronger bias towards a distractor class. Overall, we find that our approach is robust to noise for common action localization.

Qualitative results. To visualize the result of our method, we show three cases in Figure 5. For the first example, we can find the common action location from one support video. Adding more support videos provides further context, resulting in a better fit. For the second one, our method can recover the correct prediction only when five support videos are used. As shown in the third case, our method can also handle the multi-instance scenario. We show a query video with three instances. With only one support video, we miss one instance and have low overlap with another. When more support videos are added, we can recover both misses.

5.2 Comparisons with others

To evaluate the effectiveness of our proposed approach for common action localization, we perform three comparative evaluations.

                       Overlap threshold
                       0.5     0.6     0.7     0.8     0.9     0.5:0.9
Common instance
  Hu et al. [16] *     41.0    33.0    27.1    15.9    6.8     24.8
  Feng et al. [10]     43.5    35.1    27.3    16.2    6.5     25.7
  This paper           53.1    40.9    29.8    18.2    8.4     29.5
Common multi-instance
  Hu et al. [16] *     29.6    23.2    12.7    7.4     3.1     15.2
  Feng et al. [10] *   31.4    25.5    16.1    8.9     3.2     17.0
  This paper           42.1    36.0    18.5    11.1    7.0     22.9
Table 4: One-shot comparison on common instance ActivityNet. Results marked with * were obtained with author-provided code. In both settings, our approach is preferred across all overlaps, highlighting its effectiveness.
Figure 6: Five-shot comparison. We evaluate our method as well as modified versions of Hu et al. [16] and Buch et al. [2] on all common instance and multi-instance datasets, and obtain favourable results. Detailed numerical results are provided in the supplementary file to facilitate comparison in follow-up works. Best viewed in color.

One-shot comparison. For the one-shot evaluation, we compare to the one-shot video re-localization of Feng et al. [10] and to Hu et al. [16], which focuses on few-shot common object detection. We evaluate on the same setting as Feng et al. [10], namely the revised ActivityNet dataset using the one-shot setting (common instance). Note that we both use the C3D base network. To evaluate the image-based approach of Hu et al. [16], we use their proposed similarity module on the temporal video proposals, rather than spatial proposals, based on author-provided code [16]. The results in Table 4 show that across all overlap thresholds, our approach is preferred. At an overlap threshold of 0.5, we obtain an mAP of 53.1 compared to 41.0 for [16] and 43.5 for [10]. It is of interest to note that without our three modules, we obtain only 42.4 (Table 2). This demonstrates that a different training setup or a different model architecture by itself does not benefit common action localization. We attribute our improvement to the better alignment between the support and query representations as a result of our three modules. Next to a comparison on the common instance dataset, we also perform the same experiment on the longer multi-instance ActivityNet variant. In this more challenging setting, our approach again outperforms the baselines. We note that we are not restricted to the one-shot setting, whereas the baseline by Feng et al. [10] is.

                      ActivityNet              Thumos
                      one-shot   five-shot     one-shot   five-shot
Zhang et al. [49]     45.2       48.5          36.9       38.9
This paper            49.2       52.8          43.0       45.6
Table 5: Localization from images on the common instance datasets. Our method generalizes beyond videos as support input and outperforms Zhang et al. [49].

Five-shot comparison. Second, we evaluate the performance of our approach on all datasets in the five-shot setting. We compare to a modified version of SST by Buch et al. [2]. We add a fusion layer on top of the original GRU networks in SST to incorporate the support feature, and then choose the proposal with the largest confidence score. SST is used as baseline, because the approach of Feng et al. [10] cannot handle more than one support video. We also include another comparison to Hu et al. [16], this time also using their feature reweighting module. The results are shown in Figure 6. We observe that our method performs favorably compared to the two baselines on all datasets, reaffirming the effectiveness of our method. Also note that even when our support videos are noisy (Table 3), we are still better than the noise-free baselines based on Buch et al. [2] and Hu et al. [16] (39.7 and 45.4 for a threshold of 0.5 on common instance ActivityNet). The large number of distractor actions in the long videos of common multi-instance Thumos results in lower overall scores, indicating that common action localization is far from a solved problem.

Localization from images. Next to using videos, we can also perform common action localization using images as support. This provides a challenging setting, since all temporal information is lost. We perform localization from support images by inflating the images to create static support videos. We perform common action localization on common instance ActivityNet and Thumos. We compare to the recent approach of Zhang et al. [49], which focuses on video retrieval from images. Results in Table 5 show we obtain favourable results on both datasets, even though our approach is not designed for this setting.
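Inflating an image into a static support video simply repeats the same frame along the temporal axis. A minimal sketch follows, with the image represented as a nested list of pixel rows; the function name and the `num_frames` parameter are hypothetical, and the clip length used in the paper is not stated in this text.

```python
def inflate_image(image, num_frames):
    """Repeat one image to form a static clip (a list of identical frames)."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    # Copy the rows so that later in-place edits to one frame do not leak into others.
    return [[list(row) for row in image] for _ in range(num_frames)]
```

The resulting clip can be fed through the support branch like any trimmed support video, although it carries no motion information.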

6 Conclusion

In this paper we consider action localization in a query video given a few trimmed support videos that contain a common action, without specifying the label of the action. To tackle this challenging problem, we introduce a new network architecture along with three modules optimized for temporal alignment. The first module enhances the representations of the query and support videos simultaneously. The second module progressively integrates the representations of the support branch into the query branch, to distill the common action in the query video. The third module weighs the different support videos to deal with non-informative support examples. Experiments on reorganizations of the ActivityNet and Thumos datasets, with settings containing both single and multiple action instances per video, show that our approach can robustly localize the action that is common among the support videos in both standard and long untrimmed query videos.


  • [1] P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic (2014) Weakly supervised action labeling in videos under ordering constraints. In ECCV, Cited by: §1.
  • [2] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. Carlos Niebles (2017) Sst: single-stream temporal action proposals. In CVPR, Cited by: §2, Figure 6, §5.2.
  • [3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles (2015) Activitynet: a large-scale video benchmark for human activity understanding. In CVPR, Cited by: §1, §1, §4.1.
  • [4] D. L. Chen and W. B. Dolan (2011) Collecting highly parallel data for paraphrase evaluation. In ACL, Cited by: §4.1.
  • [5] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen (2017) Temporal context network for activity localization in videos. In ICCV, Cited by: §2.
  • [6] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray (2018) Scaling egocentric vision: the epic-kitchens dataset. In ECCV, Cited by: §1.
  • [7] X. Dong, L. Zheng, F. Ma, Y. Yang, and D. Meng (2018) Few-example object detection with model communication. PAMI. Cited by: §1.
  • [8] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce (2009) Automatic annotation of human actions in video. In ICCV, Cited by: §1.
  • [9] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem (2016) Daps: deep action proposals for action understanding. In ECCV, Cited by: §2.
  • [10] Y. Feng, L. Ma, W. Liu, T. Zhang, and J. Luo (2018) Video re-localization. In ECCV, Cited by: §2, §4.1, §4.2, §4.3, §5.2, §5.2, Table 4.
  • [11] J. Gao, K. Chen, and R. Nevatia (2018) Ctap: complementary temporal action proposal generation. In ECCV, Cited by: §2.
  • [12] J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia (2017) TURN tap: temporal unit regression network for temporal action proposals. In ICCV, Cited by: §1.
  • [13] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool (2014) Creating summaries from user videos. In ECCV, Cited by: §4.1.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.2.2.
  • [15] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, Cited by: §3.2.2.
  • [16] T. Hu, P. Mettes, J. Huang, and C. G. Snoek (2019) SILCO: show a few images, localize the common object. In ICCV, Cited by: §1, Figure 6, §5.2, §5.2, Table 4.
  • [17] H. Idrees, A. R. Zamir, Y. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah (2017) The thumos challenge on action recognition for videos “in the wild”. CVIU. Cited by: §1, §1, §4.1.
  • [18] M. Jain, A. Ghodrati, and C. G. M. Snoek (2020) ActionBytes: learning from trimmed videos to localize actions. In CVPR, Cited by: §1.
  • [19] M. Jain, J. C. van Gemert, T. Mensink, and C. G. M. Snoek (2015) Objects2action: classifying and localizing actions without any video example. In ICCV, Cited by: §2.
  • [20] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid (2017) Joint learning of object and action detectors. In ICCV, Cited by: §2.
  • [21] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In CVPR, Cited by: §4.2.
  • [22] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman (2017) The kinetics human action video dataset. arXiv. Cited by: §4.1.
  • [23] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv. Cited by: §4.2.
  • [24] H. Kuehne, A. Richard, and J. Gall (2019) A hybrid rnn-hmm approach for weakly supervised temporal action segmentation. arXiv. Cited by: §1.
  • [25] K. Kumar Singh and Y. Jae Lee (2017) Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization. In ICCV, Cited by: §1.
  • [26] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang (2018) BSN: boundary sensitive network for temporal action proposal generation. In ECCV, Cited by: §1.
  • [27] S. Ma, L. Sigal, and S. Sclaroff (2016) Learning activity progression in lstms for activity detection and early detection. In CVPR, Cited by: §2.
  • [28] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. JMLR. Cited by: §5.1.
  • [29] P. Mettes and C. G. Snoek (2017) Spatial-aware object embeddings for zero-shot localization and classification of actions. In ICCV, Cited by: §2.
  • [30] P. Nguyen, T. Liu, G. Prasad, and B. Han (2018) Weakly supervised action localization by sparse temporal pooling network. In CVPR, Cited by: §2.
  • [31] P. X. Nguyen, D. Ramanan, and C. C. Fowlkes (2019) Weakly-supervised action localization with background modeling. In ICCV, Cited by: §1, §2.
  • [32] D. Oneata, J. Verbeek, and C. Schmid (2013) Action and event recognition with fisher vectors on a compact feature set. In ICCV, Cited by: §1.
  • [33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NeurIPS, Cited by: §4.2.
  • [34] S. Paul, S. Roy, and A. K. Roy-Chowdhury (2018) W-talc: weakly-supervised temporal activity localization and classification. In ECCV, Cited by: §1.
  • [35] J. Sawatzky, M. Garbade, and J. Gall (2018) Ex paucis plura: learning affordance segmentation from very few examples. In GCPR, Cited by: §1.
  • [36] A. Shaban, A. Rahimi, S. Gould, B. Boots, and R. Hartley (2019) Learning to find common objects across image collections. In ICCV, Cited by: §1.
  • [37] Z. Shou, D. Wang, and S. Chang (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR, Cited by: §1, §2, §3.3.1, §4.3.
  • [38] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao (2016) A multi-stream bi-directional recurrent neural network for fine-grained action detection. In CVPR, Cited by: §2.
  • [39] K. Soomro and M. Shah (2017) Unsupervised action discovery and localization in videos. In ICCV, Cited by: §2.
  • [40] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In ICCV, Cited by: §3.2, §4.2.
  • [41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §3.2.1.
  • [42] L. Wang, Y. Qiao, and X. Tang (2014) Action recognition and detection by combining motion and appearance features. THUMOS14 Action Recognition Challenge. Cited by: §2.
  • [43] L. Wang, Y. Xiong, D. Lin, and L. V. Gool (2017) Untrimmednets for weakly supervised action recognition and detection. In CVPR, Cited by: §1, §2.
  • [44] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, Cited by: §1, §3.2.1.
  • [45] H. Xu, A. Das, and K. Saenko (2017) R-c3d: region convolutional 3d network for temporal activity detection. In ICCV, Cited by: §2, §3.2, §3.3.
  • [46] H. Yang, X. He, and F. Porikli (2018) One-shot action localization by learning sequence matching network. In CVPR, Cited by: §2, §2.
  • [47] J. Yang and J. Yuan (2017) Common action discovery and localization in unconstrained videos. In ICCV, Cited by: §1.
  • [48] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei (2016) End-to-end learning of action detection from frame glimpses in videos. In CVPR, Cited by: §2.
  • [49] Z. Zhang, Z. Zhao, Z. Lin, J. Song, and D. Cai (2019) Localizing unseen activities in video via image query. In IJCAI, Cited by: §4.2, §5.2, Table 5.