Most modern action recognition models operate by applying a deep CNN over clips of fixed temporal length [39, 5, 43, 49, 9]. Video-level classification is obtained by aggregating the clip-level predictions over the entire video, either in the form of simple averaging or by means of more sophisticated schemes modeling temporal structure [29, 44, 15]. Scoring a clip classifier densely over the entire sequence is a reasonable approach for short videos. However, it becomes computationally impractical for real-world videos that may be up to an hour long, such as some of the sequences in the Sports1M dataset. In addition to the issue of computational cost, long videos often include extended segments that carry no information relevant to recognizing the action class. Pooling information from all clips without consideration of their relevance may cause poor video-level classification, as informative clip predictions are outnumbered by uninformative predictions over long unimportant segments.
In this work we propose a simple scheme to address these problems. It consists in training an extremely lightweight network to determine the saliency of a candidate clip. Because the computational cost of this network is more than one order of magnitude lower than the cost of existing 3D CNNs for action recognition [5, 43], it can be evaluated efficiently over all clips of even long videos. We demonstrate that restricting the costly action classifier to run only on the clips identified as the most salient by our model yields not only significant savings in runtime (up to 15× for Sports1M) but also large gains in video classification accuracy (up to 10.93%) for 6 different strong clip classifiers, as uninformative or ambiguous clips no longer pollute the video-level decision. We refer to our network as SCSampler (Salient Clip Sampler), as it samples a reduced set of salient clips from the video for analysis by the action classifier.
Efficiency is a critical requirement in the design of SCSampler. We present two main variants of our sampler. The first operates directly on compressed video features [53, 51], thus eliminating the need for costly decoding of the video. The second looks only at the audio channel, which is low-dimensional and can therefore be processed very efficiently. As in recent multimedia work [3, 1, 31, 13], our audio-based sampler exploits the inherent semantic correlation between the audio and the visual elements of a video. We also show that combining our video-based sampler with the audio-based sampler leads to further gains in recognition accuracy at a negligible overhead.
We propose and evaluate three distinct learning objectives for salient clip sampling. Two of them train the sampler to operate optimally with the given clip classifier, while one formulation is classifier-independent. We show that, in some settings, the former lead to improved accuracy, while the benefit of the latter is that it can be used with any clip classifier without retraining, making this model a general and powerful off-the-shelf tool to improve both the runtime and the accuracy of clip-based action classification. Finally, we show that although our sampler is trained over specific action classes in the training set, its benefits extend even to the recognition of novel action classes, which suggests that the sampler learns a fairly general notion of clip saliency.
2 Related work
The problem of selecting relevant frames, clips or segments within a video has been investigated for various applications. For example, video summarization [34, 17, 54, 55, 16, 26, 36] and the automatic production of sport highlights [28, 27] entail creating a much shorter version of the original video by concatenating a small set of snippets corresponding to the most informative or exciting moments. The aim of these systems is to generate a video composite that is pleasing and compelling for the user. Instead, the objective of our model is to select a set of segments of fixed duration (i.e., clips) so as to make video-level classification as accurate and as unambiguous as possible.
More closely related to our task is the problem of action localization [20, 38, 37, 52, 58], where the objective is to localize the temporal start and end of each action within a given untrimmed video and to recognize the action class. Action localization is often approached through a two-step mechanism [19, 7, 4, 12, 13, 25], where first an action proposal method identifies candidate action segments, and then a more sophisticated approach validates the class of each candidate and refines its temporal boundaries. Our framework is reminiscent of this two-step solution, as our sampler can be viewed as selecting candidate clips for accurate evaluation by the action classifier. However, several key differences exist between our objective and that of action localization. Our system is aimed at video classification, where the assumption is that each video contains a single action class. Action proposal methods solve the harder problem of finding segments of different lengths and potentially belonging to different classes within the input video. While in action localization the validation model is typically trained using the candidate segments produced by the proposal method, the opposite is true in our scenario: the sampler is learned for a given pretrained clip classifier, which is left unmodified by our approach. Finally, the most fundamental difference is that high efficiency is a critical requirement in the design of our clip sampler. Our sampler must be orders of magnitude faster than the clip classifier to make our approach worthwhile. Conversely, most action proposal or localization methods are based on optical flow [24, 25] or deep action-classifier features [52, 4, 13] that are typically at least as expensive to compute as the output of a clip classifier. For example, the TURN TAP system is one of the fastest existing action proposal methods, and yet its computational cost exceeds that of our scheme by more than one order of magnitude. For 60 seconds of untrimmed video, TURN TAP has a cost of 4128 GFLOPs; running our clip classifier (MC3-18) densely over the 60 seconds would actually cost less, at 1097 GFLOPs; our sampling scheme lowers the cost dramatically, to only 168 GFLOPs per 60 seconds.
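The cost comparison can be verified with a few lines of arithmetic. The sketch below uses only the three per-minute figures quoted in the text; the variable and function names are illustrative.

```python
# Cost in GFLOPs to process 60 seconds of untrimmed video (figures quoted in the text).
turn_tap = 4128.0    # TURN TAP action proposals
dense_mc3 = 1097.0   # MC3-18 clip classifier applied densely
sampled = 168.0      # the sampling scheme described in this paper


def cost_ratio(baseline, ours):
    """Return how many times more expensive `baseline` is than `ours`."""
    return baseline / ours


print(f"TURN TAP vs. sampling:     {cost_ratio(turn_tap, sampled):.1f}x")
print(f"Dense MC3-18 vs. sampling: {cost_ratio(dense_mc3, sampled):.1f}x")
print(f"TURN TAP vs. dense MC3-18: {cost_ratio(turn_tap, dense_mc3):.1f}x")
```

The first ratio is above 10, which is what "more than one order of magnitude" refers to.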
Our approach belongs to the genre of work that performs video classification by aggregating temporal information from long videos [11, 33, 29, 48, 46, 47, 44, 45, 50, 59]. Our aggregation scheme is very simple, as it merely averages the scores of action classifiers over the selected clips. Yet, we note that most recent state-of-the-art action classifiers operate precisely under this simple scheme. Examples include Two-Stream Networks, I3D, R(2+1)D, Non-Local Networks, and SlowFast. While in this prior work clips are sampled densely or at random, our experiments suggest that our sampling strategy yields significant gains in accuracy over both dense and random sampling, and that it is as fast as random sampling.
3 Technical approach
Our approach consists in extracting a small set of relevant clips from a video by densely scoring each clip with a lightweight saliency model. We refer to this model as the “sampler”, since it is used to sample clips from the video. We formally define the task in subsection 3.1, present several variants of learning objectives for the sampler in subsection 3.2, and finally discuss sampler architecture choices and features in subsection 3.3.
3.1 Problem Formulation
Video classification from clip-level predictions. We assume we are given a pretrained action classifier $f: \mathbb{R}^{F \times 3 \times H \times W} \to [0,1]^{C}$ operating on short, fixed-length clips of $F$ RGB frames with spatial resolution $H \times W$ and producing output classification probabilities over a set of action classes $\{1, \dots, C\}$. We note that most modern action recognition systems [41, 10, 43, 5] fall under this model and, typically, they constrain the number of frames $F$ to span just a handful of seconds in order to keep memory consumption manageable during training and testing. Given a test video $v \in \mathbb{R}^{T \times 3 \times H \times W}$ of arbitrary length $T$, video-level classification through the clip classifier $f$ is achieved by first splitting the video $v$ into a set of clips $\{v^{(i)}\}_{i=1}^{L}$, with each clip $v^{(i)} \in \mathbb{R}^{F \times 3 \times H \times W}$ consisting of $F$ adjacent frames and where $L$ denotes the total number of clips in the video. The splitting is usually done by taking clips every $F$ frames in order to have a set of non-overlapping clips that spans the entirety of the video. A final video-level prediction is then computed by aggregating the individual clip-level predictions. In other words, if we denote with aggr the aggregation operator, the video-level classifier $\bar{f}$ is obtained as $\bar{f}(v) = \mathrm{aggr}(\{f(v^{(i)})\}_{i=1}^{L})$.
Most often, the aggregator is a simple pooling operator which averages the individual clip scores (i.e., $\bar{f}(v) = \frac{1}{L}\sum_{i=1}^{L} f(v^{(i)})$) [39, 5, 43, 49, 9], but more sophisticated schemes based on RNNs have also been employed.
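As a minimal sketch of this dense pipeline, the following code splits a frame sequence into non-overlapping clips and averages the clip-level probabilities. The helper names and the toy classifier are hypothetical stand-ins for illustration, and dropping a trailing partial clip is an assumption, not something the text specifies.

```python
def split_into_clips(num_frames, clip_len):
    """Return (start, end) frame ranges of non-overlapping clips of clip_len frames.

    Assumption: a trailing partial clip, if any, is dropped.
    """
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, clip_len)]


def average_pool(clip_probs):
    """Average per-class probabilities over clips (the aggregation operator as a mean)."""
    n = len(clip_probs)
    return [sum(col) / n for col in zip(*clip_probs)]


def classify_video(frames, clip_classifier, clip_len):
    """Dense video-level prediction: run the clip classifier on every clip, then average."""
    clips = [frames[s:e] for s, e in split_into_clips(len(frames), clip_len)]
    return average_pool([clip_classifier(c) for c in clips])


# Toy stand-in for a real clip classifier (hypothetical, two classes).
def toy_classifier(clip):
    return [1.0, 0.0] if clip[0] < 4 else [0.0, 1.0]


print(split_into_clips(10, 4))
print(classify_video(list(range(10)), toy_classifier, 4))
```

Note that the cost of this procedure grows linearly with the number of clips, since the expensive classifier is called once per clip.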
Video classification from selected clips. In this paper we are interested in scenarios where the videos are untrimmed and may be quite long. In such cases, applying the clip classifier to every clip will result in a very large inference cost. Furthermore, aggregating predictions from the entire video may produce poor action recognition accuracy, since in long videos the target action is unlikely to be exhibited in every clip. Thus, our objective is to design a method that can efficiently identify a subset $S(v; K)$ of $K$ salient clips in the video (i.e., $S(v; K) \subseteq \{1, \dots, L\}$ with $|S(v; K)| = K$) and to reduce video-level prediction to be computed from this set of $K$ clip-level predictions as $\bar{f}(v) = \mathrm{aggr}(\{f(v^{(i)})\}_{i \in S(v; K)})$ ($K$ is a hyper-parameter studied in our experiments). By constraining the application of the costly classifier $f$ to only $K$ clips, inference will be efficient even on long videos. Furthermore, by making sure that $S(v; K)$ includes a sample of the most salient clips in $v$, recognition accuracy may improve, as irrelevant or ambiguous clips will be discarded from consideration and will be prevented from polluting the video-level prediction. We note that in this work we address the problem of clip selection for a given pretrained clip classifier $f$, which is left unmodified by our method. This renders our approach useful as a post-training procedure to further improve the performance of existing classifiers in terms of both inference speed and recognition accuracy.
Our clip sampler. In order to achieve our goal we propose a simple solution that consists in learning a highly efficient clip-level saliency model $s(\cdot)$ that provides for each clip in the video a “saliency score” in $[0, 1]$. Specifically, our saliency model $s(\cdot)$ takes as input clip features $\phi(v^{(i)})$ that are fast to compute from the raw clip $v^{(i)}$ and that have low dimensionality, so that each clip can be analyzed very efficiently. The saliency model is designed to be orders of magnitude faster than $f$, which makes it possible to evaluate $s(\cdot)$ on every single clip of the video to find the $K$ most salient clips without adding any significant overhead. The set $S(v; K)$ is then obtained as $S(v; K) = \mathrm{topK}(\{s(\phi(v^{(i)}))\}_{i=1}^{L})$, where $\mathrm{topK}$ returns the indices of the top-$K$ values in the set. We show that evaluating $f$ on this selected set, i.e., computing $\mathrm{aggr}(\{f(v^{(i)})\}_{i \in S(v; K)})$, results in significantly higher accuracy compared to aggregating clip-level predictions over all clips.
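The selection step can be sketched as follows: a cheap saliency model scores every clip, and only the top-scoring clips are passed to the expensive classifier. All function names and the toy models here are hypothetical illustrations; the tie-breaking rule is an assumption.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def classify_with_sampler(clip_features, clips, saliency_model, clip_classifier, k):
    """Score all clips cheaply, then run the costly classifier on the top-k only."""
    scores = [saliency_model(x) for x in clip_features]     # cheap: every clip
    selected = top_k_indices(scores, k)
    probs = [clip_classifier(clips[i]) for i in selected]   # costly: k clips only
    video_pred = [sum(col) / len(probs) for col in zip(*probs)]
    return video_pred, selected


# Toy example (hypothetical): features are already saliency scores,
# and the "classifier" is a lookup table over clip names.
features = [0.1, 0.9, 0.4, 0.8]
clips = ["a", "b", "c", "d"]
lookup = {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [0.0, 1.0], "d": [0.5, 0.5]}

pred, chosen = classify_with_sampler(features, clips, lambda x: x, lookup.get, k=2)
print(chosen, pred)
```

The expensive classifier is invoked exactly `k` times regardless of video length; only the lightweight saliency model scales with the number of clips.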
In order to learn the sampler $s(\cdot)$, we use a training set $\mathcal{D}$ of untrimmed video examples, each annotated with a label indicating the action performed in the video: $\mathcal{D} = \{(v_1, y_1), \dots, (v_N, y_N)\}$, with $v_n$ denoting the $n$-th video and $y_n \in \{1, \dots, C\}$ indicating its action label. In our experiments, we use as training set $\mathcal{D}$ the same set of examples that was used to train the clip classifier $f$. We do so in order to demonstrate that the gains in recognition accuracy are not due to leveraging additional data but rather are truly the result of learning to detect the most salient clips for $f$ within each video.
Oracle sampler. In this work we compare our sampler against an “oracle” $O$ that makes use of the action label $y$ to select the best $K$ clips in the video for classification with $f$. The oracle set is formally defined as $O(v, y; K) = \mathrm{topK}(\{f_y(v^{(i)})\}_{i=1}^{L})$, where $f_y$ denotes the score of $f$ for class $y$. Note that $O$ is obtained by looking for the clips that yield the highest action classification scores for the ground-truth label $y$ under the costly action classifier $f$. In real scenarios the oracle cannot be constructed, as it requires knowing the true label and it involves dense application of $f$ over the entire video, which defeats the purpose of the sampler. Nevertheless, in this work we use the oracle to obtain an upper bound on the accuracy of the sampler. Furthermore, we apply the oracle to the training set to form pseudo ground-truth data to train our sampler, as discussed in the next subsection.
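A small sketch of the oracle on toy data illustrates why it serves as an upper bound: it keeps only the clips where the classifier is most confident about the true label. The probabilities below are invented for illustration, and the function names are hypothetical.

```python
def mean_probs(rows):
    """Average per-class probabilities over a list of clip predictions."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def oracle_indices(clip_probs, true_label, k):
    """Indices of the k clips with the highest classifier score for the true label."""
    order = sorted(range(len(clip_probs)), key=lambda i: (-clip_probs[i][true_label], i))
    return order[:k]


# Toy clip-level predictions for a 4-clip video with true label 0 (hypothetical values).
clip_probs = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0], [0.25, 0.75]]
true_label = 0

dense = mean_probs(clip_probs)
picked = oracle_indices(clip_probs, true_label, k=1)
oracle = mean_probs([clip_probs[i] for i in picked])

print("dense :", dense)    # uninformative clips wash out the one confident clip
print("oracle:", oracle)   # only the most confident clip for the true label is kept
```

The oracle needs both the ground-truth label and a dense pass of the classifier, so it is usable only for analysis and for generating training targets.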
3.2 Learning Objectives for the Sampler
We consider three choices of learning objectives for the sampler and experimentally compare them in subsection 4.2.1.
3.2.1 Training the sampler as an action classifier
A naïve way to approach the learning of the sampler is to first train a lightweight action classifier $h(\phi(v_n^{(i)})) \in [0,1]^{C}$ on the training set $\mathcal{D}$ by forming clip examples $(\phi(v_n^{(i)}), y_n)$ using the low-dimensional clip features $\phi(v_n^{(i)})$. Note that this is equivalent to assuming that every clip in the training video contains a manifestation of the target action. Then, given a new untrimmed test video $v$, we can compute the saliency score of a clip $v^{(i)}$ in the video as the maximum classification score over the $C$ classes, i.e., $s(\phi(v^{(i)})) = \max_{c \in \{1, \dots, C\}} h_c(\phi(v^{(i)}))$. The rationale behind this choice is that a salient clip is expected to elicit a strong response from the classifier, while irrelevant or ambiguous clips are likely to cause weak predictions for all classes. We refer to this variant of our loss as AC (Action Classification).
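The AC saliency score reduces to a maximum over class probabilities. The sketch below assumes the lightweight classifier outputs raw logits that are turned into probabilities with a softmax; that detail and the function names are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ac_saliency(logits):
    """Saliency of a clip: the maximum class probability of the lightweight classifier."""
    return max(softmax(logits))


# A confident prediction yields high saliency; a flat one yields roughly 1 / num_classes.
print(ac_saliency([5.0, 0.0, 0.0]))
print(ac_saliency([0.0, 0.0, 0.0]))
```

Because the score ignores which class wins, this objective does not depend on the expensive clip classifier at all.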
3.2.2 Training the sampler as a saliency classifier
One drawback of AC is that the sampler is trained as an action classifier independently of the model $f$ and by assuming that all clips are equally relevant. Instead, ideally we would like the sampler to select the clips that are most useful to our given $f$. To achieve this goal we propose to train the sampler as a binary classifier, by using the oracle $O$ to assign pseudo ground-truth binary saliency labels to the training clips. Specifically, given a training video $v_n$, we assign saliency label 1 to all the clips $v_n^{(i)}$ with $i \in O(v_n, y_n; K')$ and label 0 to the others. In other words, we define as salient those clips for which $f$ returns a large score (in the top-$K'$) for the correct class, and train $s(\cdot)$ to recognize such clips using a binary cross-entropy loss. The value of $K'$ was chosen empirically (see Appendix for a study of this parameter). We refer to this variant of our loss as SAL-CL (Saliency Classification).
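The pseudo-labeling and the loss for this objective can be sketched as below. The inputs are the expensive classifier's scores for the correct class on each training clip; the function names, the clamping constant `eps`, and the toy values are assumptions for illustration.

```python
import math


def saliency_labels(true_class_scores, k_prime):
    """Label 1 for the k_prime clips with the highest true-class score, 0 for the rest."""
    order = sorted(range(len(true_class_scores)),
                   key=lambda i: (-true_class_scores[i], i))
    top = set(order[:k_prime])
    return [1 if i in top else 0 for i in range(len(true_class_scores))]


def binary_cross_entropy(pred, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted saliencies and 0/1 labels."""
    total = 0.0
    for p, y in zip(pred, labels):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Classifier scores for the correct class on four clips of one training video (toy values).
scores = [0.1, 0.7, 0.3, 0.9]
labels = saliency_labels(scores, k_prime=2)
print(labels)
print(binary_cross_entropy([0.1, 0.9, 0.2, 0.8], labels))   # sampler agrees with labels
print(binary_cross_entropy([0.9, 0.1, 0.8, 0.2], labels))   # sampler disagrees
```

The labels are computed once, offline, from the pretrained classifier, so the classifier itself is never updated.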
3.2.3 Training the sampler as a saliency ranker
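We can also train the sampler to recognize the relative importance of the clips within a video with respect to the classification output of $f$ for the correct action label. To achieve this goal, we define pseudo ground-truth binary labels $z^{(n)}_{i,j}$ for pairs of clips $(i, j)$ from the same video $v_n$:
$$z^{(n)}_{i,j} = \begin{cases} 1 & \text{if } f_{y_n}(v_n^{(i)}) > f_{y_n}(v_n^{(j)}) \\ -1 & \text{otherwise.} \end{cases}$$
We train $s(\cdot)$ by minimizing a ranking loss over these pairs:
$$\ell\left(\phi(v_n^{(i)}), \phi(v_n^{(j)})\right) = \max\left(0,\; \eta - z^{(n)}_{i,j}\left[s(\phi(v_n^{(i)})) - s(\phi(v_n^{(j)}))\right]\right)$$
where $\eta$ is a margin hyper-parameter. We refer to this variant of our sampler loss as SAL-RANK (Saliency Ranking).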
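A pairwise ranking objective of this kind can be sketched as follows. The code uses the standard hinge-style margin ranking loss with labels in {+1, -1}; the exact functional form used by the authors is an assumption here, and the function names are hypothetical.

```python
def pair_label(score_i, score_j):
    """+1 if clip i has the higher true-class classifier score, otherwise -1."""
    return 1 if score_i > score_j else -1


def margin_ranking_loss(s_i, s_j, z, margin):
    """Hinge ranking loss: zero once the preferred clip leads by at least `margin`."""
    return max(0.0, margin - z * (s_i - s_j))


# Classifier scores for the correct class on two clips (toy values).
z = pair_label(0.9, 0.2)                        # clip i should be ranked above clip j

print(margin_ranking_loss(0.75, 0.25, z, 0.25))  # sampler agrees with a wide gap
print(margin_ranking_loss(0.25, 0.75, z, 0.25))  # sampler ranks the pair the wrong way
```

Only the ordering of sampler scores within a video matters under this loss, which fits the way the sampler is used at test time (a top-K selection).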
3.3 Sampler Architecture
Due to the tight efficiency requirements, we restrict our sampler to operate on two types of features that can be computed quickly from video and that yield a very compact representation to process. The first type of features is obtained directly from the compressed video without the need for decoding. This representation is more than two orders of magnitude smaller than the decompressed video, and it has already been shown to be a suitable input modality for action recognition. We describe these features in detail in subsection 3.3.1. The second type of features is audio features. Recent work [3, 1, 31, 23] has shown that the audio channel provides strong cues about the content of the video. This semantic correlation can be leveraged for various applications, including training visual classifiers with audio signal, learning discriminative sound features for acoustic classification from unlabeled video, localizing objects that sound in images [1, 2, 57], separating object sounds, and pretraining action recognition models from audio-video synchronization [31, 23]. In subsection 3.3.2 we discuss how we can leverage the low-dimensional audio modality to efficiently find salient clips in a video.
3.3.1 Visual sampler
Wu et al. recently introduced an accurate action recognition model directly trained on compressed video. Modern codecs such as MPEG-4 and H.264 represent video in highly compressed form by storing the information in a set of sparse I-frames, each followed by a sequence of P-frames. An I-frame (IF) represents the RGB frame in a video just as an image. Each I-frame is followed by 11 P-frames, which encode the 11 subsequent frames in terms of motion displacement (MD) and RGB-residual (RGB-R). MDs capture the frame-to-frame 2D motion, while RGB-Rs store the remaining difference in RGB values between adjacent frames after having applied the MD field to rewarp the frame. That work showed that each of these three modalities (IFs, MDs, RGB-Rs) provides rich information which can be exploited for efficient and accurate action recognition in video. Inspired by this prior work, here we train two separate ResNet-18 networks as samplers using the learning objectives outlined in the previous subsection. The first ResNet-18 takes as input individual MD frames, which have size $2 \times H/16 \times W/16$: the 2 channels encode the horizontal and vertical motion displacements at a resolution that is 16 times smaller than the original video. The other ResNet-18 is fed individual RGB-Rs of size $3 \times H \times W$. At test time we evaluate both networks on all P-frames within the clip and average their scores to obtain a final global saliency score for the clip.
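The frame layout and the test-time averaging can be sketched as below. The group size of 12 follows from the description (one I-frame followed by 11 P-frames); the assumption that a clip starts on an I-frame, and all function names, are illustrative only.

```python
GOP_SIZE = 12  # one I-frame followed by 11 P-frames, as described in the text


def frame_types(num_frames):
    """Frame type per position, assuming (hypothetically) the clip starts on an I-frame."""
    return ["I" if i % GOP_SIZE == 0 else "P" for i in range(num_frames)]


def clip_saliency(md_scores, rgbr_scores):
    """Average the MD-network and RGB-R-network scores over all P-frames of a clip."""
    per_frame = [(m + r) / 2.0 for m, r in zip(md_scores, rgbr_scores)]
    return sum(per_frame) / len(per_frame)


types = frame_types(16)
print(types.count("I"), "I-frames and", types.count("P"), "P-frames in a 16-frame clip")

# Toy per-P-frame saliency scores from the two networks (hypothetical values).
print(clip_saliency([0.5, 1.0], [0.5, 0.0]))
```

Each P-frame contributes one MD score and one RGB-R score, so the clip score is a plain mean over both networks and all P-frames.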
We omit the large ResNet-152 model trained on IFs in that prior work, as it adds a cost of 3 GFLOPs per clip, which far exceeds the computational budget of our application. Instead, we experimented with a more lightweight ShuffleNet architecture of 26 layers. We trained this model separately on IFs, MDs, and RGB-Rs. At test time we average the predictions of these 3 models over all the I-frames and P-frames (MDs and RGB-Rs) within the clip. We compare all these models in subsection 4.2.2.
3.3.2 Audio sampler
For the audio sampler, we first extract MEL-spectrograms from audio segments twice as long as the video clips, but with stride equal to the video-clip length. This stride is chosen to obtain an audio-based saliency score for every video clip used by the action recognizer f. The observation window is twice as long as the video clip because we found this to yield better results. A series of 200 time samples is taken within each audio segment and processed using MEL filters, yielding a compact two-dimensional descriptor that can be analyzed efficiently by the sampler. We treat this descriptor as an image and process it using a VGG network of 18 layers. The details of the architecture are given in the Appendix.
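The windowing scheme (window twice the clip length, stride equal to the clip length) can be sketched as follows. The text does not say how each audio window is aligned with its video clip; centering the window on the clip is an assumption made here, as are the function name and the choice to leave boundary handling (padding or clamping) to the caller.

```python
def audio_windows(video_duration, clip_duration):
    """One (start, end) audio window in seconds per video clip.

    Window length is 2x the clip duration and the stride is 1x the clip duration.
    Assumption: each window is centered on its clip, so the first and last
    windows extend past the video boundaries.
    """
    num_clips = int(video_duration // clip_duration)
    windows = []
    for i in range(num_clips):
        center = i * clip_duration + clip_duration / 2.0
        windows.append((center - clip_duration, center + clip_duration))
    return windows


# Example: 16-frame clips at 16 fps last 1 second each.
for start, end in audio_windows(video_duration=4.0, clip_duration=1.0):
    print(f"audio window: {start:+.1f}s to {end:+.1f}s")
```

Because the stride equals the clip length, the number of audio windows matches the number of video clips, so every clip receives exactly one audio-based saliency score.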
3.3.3 Combining video and audio saliency
Since audio and video provide correlated but distinct cues, we investigated several schemes for combining the saliency predictions from these two modalities. With AV-convex-score we denote a model that simply combines the audio-based score $s^{A}$ and the video-based score $s^{V}$ by means of a convex combination $s^{AV} = \alpha\, s^{V} + (1 - \alpha)\, s^{A}$, where $\alpha \in [0, 1]$ is a scalar hyperparameter. The scheme AV-convex-list instead first produces two separate ranked lists by sorting the clips within each video according to the audio sampler and the visual sampler independently. Then the method computes for each clip the weighted average of its ranked position in the two lists according to a convex combination of the two positions. The top-$K$ clips according to this measure are finally retrieved. The method AV-intersect-list computes an intersection between the top-$m$ clips of the audio sampler and the top-$m$ clips of the video sampler. For each video, $m$ is progressively increased until the intersection yields exactly $K$ clips. In AV-union-list we form a set of $K$ clips by selecting the top $K_V$ clips according to the visual sampler (with hyperparameter $K_V$ s.t. $K_V \le K$) and by adding to it a set of $K - K_V$ different clips from the ranked list of the audio sampler. Finally, we also present results for AV-joint-training, where we simply average the audio-based score and the video-based score and then finetune the two networks with respect to this average.
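The four list- and score-based combination schemes can be sketched in a few functions. Several details are assumptions made for illustration: the weight `alpha` is applied to the video modality, ties are broken by clip index, and when the intersection grows past `k` in a single step the video ranking decides which clips are kept. Function and parameter names are hypothetical.

```python
def ranked(scores):
    """Clip indices sorted from most to least salient (ties: lower index first)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))


def av_convex_score(audio, video, k, alpha):
    """Top-k clips by a convex combination of the two saliency scores."""
    combined = [alpha * v + (1.0 - alpha) * a for a, v in zip(audio, video)]
    return ranked(combined)[:k]


def av_convex_list(audio, video, k, alpha):
    """Top-k clips by a convex combination of rank positions in the two lists."""
    pos_a = {clip: r for r, clip in enumerate(ranked(audio))}
    pos_v = {clip: r for r, clip in enumerate(ranked(video))}
    avg = [alpha * pos_v[i] + (1.0 - alpha) * pos_a[i] for i in range(len(audio))]
    return sorted(range(len(audio)), key=lambda i: (avg[i], i))[:k]


def av_intersect_list(audio, video, k):
    """Grow m until the top-m lists of both samplers share at least k clips."""
    ra, rv = ranked(audio), ranked(video)
    for m in range(k, len(audio) + 1):
        common = set(ra[:m]) & set(rv[:m])
        if len(common) >= k:
            # Assumption: if one step overshoots k, keep the best k by video rank.
            return [c for c in rv[:m] if c in common][:k]
    return rv[:k]


def av_union_list(audio, video, k, k_video):
    """Take k_video clips from the video ranking, then fill up to k from the audio ranking."""
    chosen = ranked(video)[:k_video]
    for clip in ranked(audio):
        if len(chosen) == k:
            break
        if clip not in chosen:
            chosen.append(clip)
    return chosen


# Toy saliency scores for a 5-clip video (hypothetical values).
audio = [0.9, 0.1, 0.8, 0.3, 0.2]
video = [0.2, 0.7, 0.9, 0.1, 0.6]
print(av_convex_score(audio, video, k=2, alpha=0.5))
print(av_convex_list(audio, video, k=2, alpha=0.5))
print(av_intersect_list(audio, video, k=2))
print(av_union_list(audio, video, k=3, k_video=2))
```

AV-joint-training is omitted from the sketch because it involves finetuning both networks rather than combining fixed scores.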
4 Experiments
In this section we evaluate the proposed sampling procedure on the large-scale Sports1M and Kinetics datasets. We also present a systematic study to determine the effect of the different design choices in our sampler (e.g., loss, architecture, scheme of audio-video saliency combination) on the performance of the action classifier.
4.1 Large-scale action recognition with SCSampler
4.1.1 Experimental Setup
Action Recognition Networks. Our sampler can be used with any clip-based action classifier f. We demonstrate the general applicability of our approach by evaluating it with 6 popular 3D CNNs for action recognition. For four of these models, an official public implementation with pretrained networks is readily available. These four architectures are 18-layer instantiations of the Mixed Convolutional Network (MC3), R(2+1)D, and ResNet3D (R3D), with this last network also in a 34-layer configuration. The other two models are our own implementation of I3D-RGB and a ResNet3D of 152 layers leveraging group convolutions (R3DGC-152). These networks are among the state-of-the-art on Kinetics and Sports1M. They are trained on clips of 16 frames, sampled at 16 fps for Sports1M and at 30 fps for Kinetics. Video prediction is obtained by averaging the clip-level predictions on the clips selected by our sampler.
Sampler configuration. In this subsection we present results achieved with the best configuration of our sampler architecture, based on the experimental study that we present in section 4.2. The best configuration is a model that combines the saliency scores of an audio sampler and of a video sampler, using the strategy of AV-union-list. The video sampler is based on two ResNet-18 models trained on MD and RGB-R features, respectively, using the action classification loss (AC). The audio sampler is trained with the saliency ranking loss (SAL-RANK).
Our clip sampler is optimized with respect to the given clip classifier f. Thus, we train a separate clip sampler for each of the 6 architectures in this evaluation. All results are based on sampling $K$ clips from each video, with $K$ set to the best hyper-parameter value according to our experiments. We compare the action recognition accuracy achieved with our sampler against the following strategies: choosing $K$ clips from the video either randomly or uniformly spaced out, and “dense” evaluation, which consists in averaging the clip-level predictions over all non-overlapping clips in the video.
4.1.2 Evaluation on Sports1M
Our approach is designed to operate on long, real-world videos where it is neither feasible nor beneficial to evaluate every single clip. For these reasons, we choose the Sports1M dataset as a suitable benchmark, since its average video length is 5 minutes and 36 seconds, and some of its videos exceed 1 hour. We use the official training/test split. We do not trim the test videos and instead seek the top $K$ clips according to our sampler in each video. We stress that our sampling strategy is applied to test videos only. The training videos in Sports1M are also untrimmed. As training on all training clips would be infeasible, we use a training procedure that consists in selecting from each training video 10 random 2-second segments, from which training clips are formed. We leave to future work the investigation of whether our sampling can be extended to sample training clips from the full videos.
We present the results in Table 3, which includes for each method the video-level classification accuracy as well as the cumulative runtime (in days) to run the inference on the complete test set using 32 NVIDIA P100 GPUs (this includes the time needed for sampling as well as clip-level action classification). The most direct baselines for our evaluation are “Random” and “Uniform”, which use the same number of clips ($K$) in each video as SCSampler. “Random” chooses $K$ non-overlapping clips randomly from the video, while “Uniform” selects them uniformly spaced out, with the first and the last clip taken from the start and the end of the video. It can be seen that, compared to these baselines, SCSampler delivers a substantial accuracy gain for all action models, with improvements ranging from 6.47% for R(2+1)D-34 to 10.69% for R(2+1)D-18 with respect to Uniform, which does only marginally better than Random.
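The two baselines can be sketched as index selectors over the list of non-overlapping clips. The function names are hypothetical, and the handling of `k == 1` in the uniform case (taking the middle clip) is an assumption, since the text only describes the case where both endpoints are included.

```python
import random


def uniform_indices(num_clips, k):
    """k clip indices evenly spaced, including the first and the last clip."""
    if k == 1:
        return [num_clips // 2]   # assumption: a single clip is taken from the middle
    step = (num_clips - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


def random_indices(num_clips, k, seed=None):
    """k distinct clip indices drawn at random, returned in temporal order."""
    return sorted(random.Random(seed).sample(range(num_clips), k))


print(uniform_indices(101, 5))
print(random_indices(101, 5, seed=0))
```

Both baselines cost the same as SCSampler in classifier evaluations, which is why they are the most direct points of comparison.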
Our approach does even better than “Dense” prediction, which averages the action classification predictions over all non-overlapping clips. To the best of our knowledge the accuracy of 76.97% achieved by R3DGC-152 using Dense evaluation is already better than any published result on this benchmark. SCSampler provides an additional gain of 7.01% over this state-of-the-art model, pushing the accuracy to 83.98%. We note that when using R3DGC-152, Dense requires 14 days whereas SCSampler achieves better accuracy and requires only 0.65 days to run inference on the Sports1M test set. Finally, we also report the performance of the “Oracle”, which selects the clips that yield the highest classification score for the true class of the test video. This is not a usable model in practice, but it gives us an informative upper bound on the accuracy achievable with an ideal sampler.
4.1.3 Evaluation on Kinetics
We further evaluate SCSampler on the Kinetics dataset. Kinetics is a large-scale benchmark for action recognition containing 400 classes and 280K videos (240K for training and 40K for testing), each about 10 seconds long. The results are reported in Table 3. Even though videos in Kinetics are short and thus in principle the recognition model should not benefit from a clip-sampling scheme such as ours, we see that for all architectures SCSampler provides accuracy gains over Random/Uniform selection and Dense evaluation, although the improvements are understandably less substantial than in the case of Sports1M. To the best of our knowledge, the accuracy of 80.23% achieved by R3DGC-152 with our SCSampler is the best reported result so far on this benchmark.
Note that an accuracy of 78.5% has been reported using Uniform (instead of the 77.53% we list in Table 3, row 6), but this accuracy is achieved by applying the clip classifier spatially in a fully-convolutional fashion on frames of size 256x256, whereas here we use a single center spatial crop of size 224x224 for all our experiments. Sliding the clip classifier spatially in the same fully-convolutional fashion raises the accuracy of SCSampler to 80.88%.
4.1.4 Sampling Clips for Unseen Action Classifiers and Novel Classes
While our SCSampler has low computational cost, it adds the procedural overhead of having to train a specialized clip selector for each classifier and each dataset. In this subsection we evaluate the possibility of reusing a clip sampler that was optimized for a classifier $f$ on a dataset $\mathcal{D}$, for a new classifier $f'$ on a dataset $\mathcal{D}'$ that contains action classes different from those seen in $\mathcal{D}$. In Table 3, we report cross-dataset performance of an SCSampler trained on Kinetics but then used to select clips on Sports1M (and vice versa). We also include cross-classifier performance obtained by optimizing SCSampler with pseudo ground-truth labels (see subsections 3.2.2 and 3.2.3) generated by R(2+1)D-18 but then used for video-level prediction with action classifier MC3-18. On the Kinetics test set, using an SCSampler that was trained using the same action classifier (MC3) but a different dataset (Sports1M) causes a drop of about 2% (65.01% vs 67.00%), while training using a different action classifier (R(2+1)D) to generate pseudo ground-truth labels on the same dataset (Kinetics) causes a degradation of 1.08% (65.92% vs 67.00%). The evaluation on Sports1M shows a similar trend, where cross-dataset accuracy (69.15%) is lower than cross-classifier accuracy (72.05%).
Even in the extreme setting of cross-dataset and cross-classifier, the accuracies achieved with SCSampler are still better than those obtained with Random or Uniform selection. Finally, we note that samplers trained using the AC loss (subsection 3.2.1) do not require pseudo-labels and thus are independent of the action classifier by design.
4.2 Assessing Design Choices for SCSampler
In this subsection we evaluate the different choices in the design of SCSampler. Given the many configurations to assess, we make this study more computationally feasible by restricting the evaluation to a subset of Sports1M, which we name miniSports. The dataset is formed by randomly choosing for each class 280 videos from the training set and 69 videos from the test set. This gives us a class-balanced set of 136,360 training videos and 33,603 test videos. All videos are shortened to the same length of 2.75 minutes. For our assessment, we restrict our choice of action classifier to MC3-18, which we retrain on our training set of miniSports. We assess the SCSampler design choices in terms of how they affect the video-level accuracy of MC3-18 on the test set of miniSports, since our aim is to find the best configuration for video classification.
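The dataset sizes quoted for miniSports can be cross-checked against the per-class sampling. The sketch below derives the implied number of classes from the totals given in the text; the function name is illustrative.

```python
def implied_classes(total_videos, videos_per_class):
    """Number of classes implied by a class-balanced total; fails if not divisible."""
    assert total_videos % videos_per_class == 0, "totals are not class-balanced"
    return total_videos // videos_per_class


train_classes = implied_classes(136_360, 280)   # training split
test_classes = implied_classes(33_603, 69)      # test split

print(train_classes, test_classes)
```

Both splits imply the same number of classes, which confirms that the reported totals are consistent with choosing 280 training and 69 test videos per class.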
4.2.1 Varying the loss function
We begin by studying the effect of the loss function used for training SCSampler, by considering the three loss variants described in subsection 3.2. For this evaluation, we assess separately the visual sampler and the audio sampler. The video sampler is based on two ResNet-18 networks with MD and RGB-R features, respectively. These 2 networks are pretrained on ImageNet and then finetuned on the training set of miniSports for each of the three different SCSampler loss functions. The audio sampler is our VGG network pretrained for classification on AudioSet and then finetuned on the training set of miniSports. The results are summarized in Table 4. For the visual sampler, the best performance is achieved when using the Action Classification (AC) loss. Saliency Ranking (SAL-RANK) is the top performing loss for the audio sampler.
4.2.2 Varying the sampler architecture and features
In this subsection we assess different architectures and features for the sampler. For the visual sampler, we use the AC loss and consider two different lightweight architectures: ResNet-18 and ShuffleNet-26. Each architecture is trained on each of the 3 types of video-compression features described in subsection 3.3.1: IF, MD and RGB-R. We also assess the performance of combinations of these three features, obtained by averaging the scores of classifiers based on individual features. The results are reported in Table 5. We can observe that, given the same type of input features, ResNet-18 provides much higher accuracy than ShuffleNet-26 at a runtime that is only marginally higher. MD and RGB-R features also appear to be quite complementary: for ResNet-18, MD+RGB-R yields an accuracy of 73.05%, whereas these individual features alone achieve an accuracy of only 67.99% and 63.48%. Adding IF features to MD+RGB-R provides only a modest gain in accuracy (74.94% vs 73.05%) but noticeably increases the runtime. Considering these tradeoffs, we adopt ResNet-18 trained on MD+RGB-R as our visual sampler in all subsequent experiments.
We perform a similar ablation study on architecture and features for the audio sampler. Given our VGG audio network pretrained for classification on AudioSet, we train it on miniSports using the following two options: finetuning the entire VGG model, or training a single FC layer on the VGG activation features (conv4_2, pool4, fc1). All audio samplers are trained with the SAL-RANK loss. The results are reported in Table 6. We can see that finetuning the audio sampler gives the best classification accuracy.
| Features | Architecture | accuracy (%) | runtime (min) |
| MD + RGB-R | ResNet-18 | 73.05 | 20.9 |
| MD + RGB-R | ShuffleNet-26 | 67.89 | 19.1 |
| Audio SCSampler | accuracy (%) | runtime (min) |
| FC trained on VGG-conv4_2 | 67.03 | 21.6 |
| FC trained on VGG-pool4 | 67.01 | 21.4 |
| FC trained on VGG-fc1 | 59.84 | 21.4 |
4.2.3 Combining audio and visual saliency
In this subsection we assess the impact of our different schemes for combining audio-based and video-based saliency scores (see 3.3.3). For this we use the best configurations of our visual and audio sampler. Table 7 shows the video-level action recognition accuracy achieved for the different combination strategies.
Perhaps surprisingly, the best results are achieved with AV-union-list, which is the rather naïve solution of taking clips based on the video sampler and different clips based on the audio sampler ( is the best value when ). The more sophisticated approach of joint training (AV-joint-training) performs nearly on par with it. Overall, it is clear that the visual sampler is a better clip selector than the audio sampler. However, considering the small cost of audio-based sampling, the accuracy gain provided by AV-union-list over the visual sampler alone (75.98% vs 73.05%) warrants the use of this combination.
|method||accuracy (%)||runtime (min)|
|Visual SCSampler only||73.05||20.9|
|Audio SCSampler only||67.82||22.0|
|Dense||61.6||2293.5 (38.5 hrs)|
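The AV-union-list strategy discussed in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: each sampler provides one scalar saliency score per clip, and the parameter names `k` (total number of selected clips) and `k_visual` (number taken from the visual ranking) are hypothetical names introduced here.

```python
from typing import List, Sequence


def av_union_list(
    visual_scores: Sequence[float],
    audio_scores: Sequence[float],
    k: int,
    k_visual: int,
) -> List[int]:
    """Return up to k clip indices: the top k_visual clips by visual score,
    completed with the best audio-ranked clips not already selected."""
    if len(visual_scores) != len(audio_scores):
        raise ValueError("both samplers must score the same clips")
    if not 0 <= k_visual <= k:
        raise ValueError("k_visual must lie between 0 and k")
    num_clips = len(visual_scores)
    by_visual = sorted(range(num_clips), key=lambda i: visual_scores[i], reverse=True)
    by_audio = sorted(range(num_clips), key=lambda i: audio_scores[i], reverse=True)

    selected = by_visual[:k_visual]
    chosen = set(selected)
    # Walk down the audio ranking and add clips that are not yet in the union.
    for i in by_audio:
        if len(selected) >= k:
            break
        if i not in chosen:
            selected.append(i)
            chosen.add(i)
    return selected
```

For example, with `k=3` and `k_visual=2`, the two clips ranked highest by the visual sampler are kept and the third slot goes to the highest audio-ranked clip that differs from them.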
4.2.4 Varying the number of sampled clips ()
Figure 2 shows how video-level classification accuracy changes as we vary the number of sampled clips (). The sampler here is AV-union-list. provides the best accuracy for our sampler. For the Oracle, gives the top result as this method can conveniently select the clip that elicits the highest score for the correct label.
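To make the role of the number of sampled clips concrete, the following minimal sketch runs a clip classifier only on the highest-ranked clips and averages their predictions into a video-level prediction. The function `classify_video`, the callable `classify_clip` and the parameter `k` are hypothetical names; simple averaging is assumed as the aggregation scheme.

```python
from typing import Callable, List, Sequence, Tuple


def classify_video(
    saliency: Sequence[float],
    classify_clip: Callable[[int], Sequence[float]],
    k: int,
) -> Tuple[int, List[float]]:
    """Classify a video from its k most salient clips.

    saliency holds one sampler score per clip. classify_clip(i) returns the
    class scores of clip i and is called only for the selected clips.
    """
    if k < 1 or len(saliency) == 0:
        raise ValueError("need k >= 1 and at least one clip")
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    top = ranked[:k]
    predictions = [classify_clip(i) for i in top]
    num_classes = len(predictions[0])
    # Video-level scores: mean of the clip-level scores over the selected clips.
    averaged = [
        sum(p[c] for p in predictions) / len(predictions)
        for c in range(num_classes)
    ]
    label = max(range(num_classes), key=lambda c: averaged[c])
    return label, averaged
```

Because the costly classifier is invoked once per selected clip, its cost in this sketch grows with `k` rather than with the length of the video.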
We presented a very simple scheme to boost both the accuracy and the speed of clip-based action classifiers. It leverages a lightweight clip-sampling model to select a small subset of clips for analysis. Experiments show that, despite its simplicity, our clip-sampler yields large accuracy gains and big speedups for 6 different strong action recognizers, and it retains strong performance even when used on novel classes. Future work will investigate strategies for optimal sample-set selection, by taking into account clip redundancies. It would be interesting to extend our sampling scheme to models that employ more sophisticated aggregations than simple averaging, e.g., those that require a set of contiguous clips to capture long-range temporal structure.
References
-  R. Arandjelovic and A. Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 609–617, 2017.
-  R. Arandjelovic and A. Zisserman. Objects that sound. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pages 451–466, 2018.
-  Y. Aytar, C. Vondrick, and A. Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 892–900, 2016.
-  S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles. SST: single-stream temporal action proposals. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6373–6382, 2017.
-  J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4724–4733, 2017.
-  J. S. Chung and A. Zisserman. Out of time: Automated lip sync in the wild. In Computer Vision - ACCV 2016 Workshops - ACCV 2016 International Workshops, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II, pages 251–263, 2016.
-  V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem. Daps: Deep action proposals for action understanding. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, pages 768–784, 2016.
-  Facebook. Video model zoo. https://github.com/facebookresearch/VMZ, 2018.
-  C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. CoRR, abs/1812.03982, 2018.
-  C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In Advances in neural information processing systems, pages 3468–3476, 2016.
-  A. Gaidon, Z. Harchaoui, and C. Schmid. Temporal localization of actions with actoms. IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2782–2795, 2013.
-  J. Gao, Z. Yang, C. Sun, K. Chen, and R. Nevatia. TURN TAP: temporal unit regression network for temporal action proposals. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 3648–3656, 2017.
-  R. Gao, R. S. Feris, and K. Grauman. Learning to separate object sounds by watching unlabeled video. In 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 2496–2499, 2018.
-  J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
-  R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell. Actionvlad: Learning spatio-temporal aggregation for action classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3165–3174, 2017.
-  B. Gong, W. Chao, K. Grauman, and F. Sha. Diverse sequential subset selection for supervised video summarization. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2069–2077, 2014.
-  M. Gygli, H. Grabner, and L. J. V. Gool. Video summarization by learning submodular mixtures of objectives. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3090–3098, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
-  F. C. Heilbron, J. C. Niebles, and B. Ghanem. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 1914–1923, 2016.
-  M. Jain, J. C. van Gemert, H. Jégou, P. Bouthemy, and C. G. M. Snoek. Action localization with tubelets from motion. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 740–747, 2014.
-  A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 1725–1732, 2014.
-  W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
-  B. Korbar, D. Tran, and L. Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 7774–7785, 2018.
-  T. Lin, X. Zhao, and Z. Shou. Temporal convolution based action proposal: Submission to activitynet 2017. CoRR, abs/1707.06750, 2017.
-  T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. BSN: boundary sensitive network for temporal action proposal generation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IV, pages 3–21, 2018.
-  B. Mahasseni, M. Lam, and S. Todorovic. Unsupervised video summarization with adversarial lstm networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  M. Merler, D. Joshi, K.-N. C. Mac, Q.-B. Nguyen, S. Hammer, J. Kent, J. Xiong, M. N. Do, J. R. Smith, and R. S. Feris. The excitement of sports: Automatic highlights using audio/visual cues. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
-  M. Merler, K. C. Mac, D. Joshi, Q. Nguyen, S. Hammer, J. Kent, J. Xiong, M. N. Do, J. R. Smith, and R. Feris. Automatic curation of sports highlights using multimodal excitement features. IEEE Transactions on Multimedia, pages 1–1, 2018.
-  J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 4694–4702, 2015.
-  A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VI, pages 639–658, 2018.
-  A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pages 801–816, 2016.
-  H. Pirsiavash and D. Ramanan. Parsing videos of actions with segmental grammars. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 612–619, 2014.
-  D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid. Category-specific video summarization. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI, pages 540–555, 2014.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  S. Shekhar, D. Singal, H. Singh, M. Kedia, and A. Shetty. Show and recall: Learning what makes videos memorable. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.
-  Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S. Chang. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1417–1426, 2017.
-  Z. Shou, D. Wang, and S. Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 1049–1058, 2016.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 568–576, 2014.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 4489–4497, 2015.
-  D. Tran, H. Wang, L. Torresani, and M. Feiszli. Classification with channel-separated convolutional networks. arXiv preprint, arXiv:1904.02811, 2019.
-  D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6450–6459, 2018.
-  G. Varol, I. Laptev, and C. Schmid. Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell., 40(6):1510–1517, 2018.
-  J. Wang and A. Cherian. Learning discriminative video representations using adversarial perturbations. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IV, pages 716–733, 2018.
-  L. Wang, Y. Qiao, and X. Tang. Mofap: A multi-level representation for action recognition. International Journal of Computer Vision, 119(3):254–271, 2016.
-  L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pages 20–36, 2016.
-  X. Wang, A. Farhadi, and A. Gupta. Actions ~ transformations. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2658–2667, 2016.
-  X. Wang, R. B. Girshick, A. Gupta, and K. He. Non-local neural networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 7794–7803, 2018.
-  C. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krähenbühl, and R. B. Girshick. Long-term feature banks for detailed video understanding. CoRR, abs/1812.05038, 2018.
-  C. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Compressed video action recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6026–6035, 2018.
-  H. Xu, A. Das, and K. Saenko. R-C3D: region convolutional 3d network for temporal activity detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5794–5803, 2017.
-  B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector cnns. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2718–2726, 2016.
-  K. Zhang, W. Chao, F. Sha, and K. Grauman. Summary transfer: Exemplar-based subset selection for video summarization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 1059–1067, 2016.
-  K. Zhang, W. Chao, F. Sha, and K. Grauman. Video summarization with long short-term memory. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VII, pages 766–782, 2016.
-  X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6848–6856, 2018.
-  H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. H. McDermott, and A. Torralba. The sound of pixels. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pages 587–604, 2018.
-  Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2933–2942, 2017.
-  B. Zhou, A. Andonian, A. Oliva, and A. Torralba. Temporal relational reasoning in videos. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pages 831–846, 2018.
Appendix A Video classification networks
In the main paper, we provide an overview of the gains in accuracy and speedup enabled by SCSampler for several video-classification models. In this section, we provide the details of the action classifier architectures used in our experiments and discuss the training procedure used to train these models.
A.1 Architecture details
3D-ResNets (R3D) are residual networks where every convolution is 3D. Mixed-convolution models (MC) are 3D CNNs built from residual blocks, where the first convolutional groups use 3D convolutions and the subsequent ones use 2D convolutions. In our experiments we use an MC3 model. R(2+1)D models decompose each 3D convolution into a 2D (spatial) convolution followed by a 1D (temporal) convolution. For further details, please refer to the paper that introduced and compared these models  or the repository  where trained models can be found.
A.2 Training procedure
Sports1M. For the Sports1M dataset, we use the training procedure described in  for all models except R3DGC-152. Frames are first re-scaled to have resolution , and then each clip is generated by randomly cropping a window of size at the same location from 16 adjacent frames. We use batch normalization after all convolutional layers, with a batch size of 8 clips per GPU. The models are trained for 100 epochs, with the first 15 epochs used for warm-up during distributed training. The learning rate is set to 0.005 and divided by 10 every 20 epochs.
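The step schedule stated above (learning rate 0.005, divided by 10 every 20 epochs) can be written as a small function. This is a sketch under two assumptions that the text does not settle: the decay steps are counted from epoch 0, and the warm-up phase is not modeled because its exact form is not specified here. The function name `step_lr` is hypothetical.

```python
def step_lr(epoch: int, base_lr: float = 0.005, step: int = 20, factor: float = 10.0) -> float:
    """Learning rate at a given epoch for a step-decay schedule.

    The rate starts at base_lr and is divided by `factor` once every `step`
    epochs. Warm-up is intentionally left out of this sketch.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr / (factor ** (epoch // step))
```

Under these assumptions the rate is 0.005 for epochs 0 to 19, 0.0005 for epochs 20 to 39, and so on until the end of the 100 training epochs.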
The R3DGC-152 model is trained according to the training procedure described in .
Kinetics. On Kinetics, the clip classifiers are trained with mini-batches formed by sampling five 16-frame-long clips with temporal jittering. Frames are first resized to resolution , and then each clip is generated by randomly cropping a window of size at the same location from 16 adjacent frames. The models are trained for 45 epochs, with 10 warm-up epochs. The learning rate is set to 0.01 and divided by 10 every 10 epochs as in . R3DGC-152  and R(2+1)D  are finetuned from Sports1M for 14 epochs with the procedure described in .
Appendix B Implementation details for SCSampler
In this section, we give the implementation details of the architectures and describe the training/finetuning procedures of our sampler networks.
B.1 Visual-based sampler
Following Wu et al. , all of our visual samplers are pre-trained on the ILSVRC dataset . The learning rate is set to 0.001 for both Sports1M and Kinetics. As in , the learning rate is reduced when accuracy plateaus, and pre-trained layers use smaller learning rates. The ShuffleNet0.5  (26 layers) model is pre-trained on ImageNet. We use three groups of group convolutions, as this choice is shown to give the best accuracy in . The initial learning rate and the learning rate schedule are the same as those used for ResNet-18.
B.2 Audio-based sampler
We use a VGG model  pretrained on AudioSet  as our backbone network, with MEL spectrograms of size as input. When finetuning the network with SAL-RANK, we use an initial learning rate of 0.01 for Sports1M and 0.03 for Kinetics for the first 5 epochs and then divide the learning rate by every 5 epochs. The learning rate of the pre-trained layers is multiplied by a factor of . When finetuning with the SAL-CL loss, we set the learning rate to 0.001 for 10 epochs, and divide it by 10 for 6 additional epochs. When finetuning with the AC loss, we start with a learning rate of 0.001 and divide it by 10 every 5 epochs.
Appendix C Additional studies of design choices for SCSampler
Here we present additional analyses of the design choices and hyperparameter values of SCSampler.
C.1 Label assignment when training SCSampler as a saliency classifier
As discussed in section 3.2.2 of our paper, when training SCSampler with the SAL-CL loss, we assign saliency label 1 to all the training clips and label 0 to the others. Here we study the effect of the hyperparameter . Note that at test time, we evaluate the clip classifier on the top-clips according to SCSampler. Figure 3 shows the action recognition accuracy obtained with SCSamplers trained for different ratios of positive-vs-negative clips ( is the total number of clips in each training video). Our experiments show that choosing values of between and yields good results. For the video sampler, the best accuracy is achieved at . For the audio-based sampler is the best choice. Training and testing are performed on our miniSports ablation dataset, with MC3-18 as the action recognition model.
C.2 Selecting hyperparameter for AV-union-list
The AV-union-list method (described in section 3.3.3 of our paper) combines the audio-based and the video-based samplers by selecting -top clips according to the visual sampler (with hyper-parameter s.t. ) and adding a set of different clips from the ranked list of the audio sampler to form a union of size ( is used in this experiment). In Figure 4 we analyze the impact of on action classification. The fact that the best value is achieved at suggests that the signals from the two samplers are somewhat complementary, but that the visual sampler provides a more accurate measure of clip saliency.
Appendix D Applying SCSampler every clips
We use SCSampler to find the most salient clips amongst all possible clips in the video by applying our sampler densely over the entire video. While our sampler is quite efficient, further reductions in computational cost can be obtained by running SCSampler every clips in the video. This implies that the final top- clips used by the action classifier will be selected from a subset of clips obtained by applying SCSampler with a stride of clips. As usual, we fix the value of to 10 for SCSampler. Figure 5 shows the results obtained with the best configuration of our SCSampler (see details in 4.2.3) and the R3DGC-152  action classifier on the full Sports1M dataset. We see that we can apply SCSampler with clip-strides of up to before the action recognition accuracy degrades to the level of costly dense predictions. This results in a further reduction of computational complexity and runtime, as we only need to apply the sampler to clips.
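Applying the sampler with a stride can be illustrated with a minimal sketch: only every `stride`-th clip is scored, and the top clips are chosen among those candidates. The names `strided_top_k`, `score_clip`, `stride` and `k` are hypothetical and introduced only for this illustration.

```python
from typing import Callable, List


def strided_top_k(
    score_clip: Callable[[int], float],
    num_clips: int,
    stride: int,
    k: int,
) -> List[int]:
    """Score every `stride`-th clip and return the indices of the k best ones.

    score_clip(i) is the (cheap) sampler applied to clip i. Clips that fall
    between strides are never scored and therefore never selected.
    """
    if stride < 1 or k < 1:
        raise ValueError("stride and k must be positive")
    candidates = range(0, num_clips, stride)
    scored = [(score_clip(i), i) for i in candidates]
    # Sort by score only, highest first; ties keep their temporal order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]
```

With a stride of 1 this reduces to dense application of the sampler over all clips.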
The complete speed-accuracy tradeoff is illustrated in Figure 6, where we plot average GFLOPs per video vs action recognition accuracy for varying values of , while keeping for SCSampler. The GFLOP count includes the cost of running SCSampler (every clips) as well as the action classifier on the top- clips ranked by our sampler. We also include the baselines Random and Uniform. In the case of Random (i.e., random selection of clips), different speed-accuracy tradeoffs can be obtained by varying , i.e., the number of clips randomly sampled and used by the action classifier. Uniform operates similarly to Random but chooses clips evenly spaced out in the video. We can observe that for nearly all GFLOP settings, SCSampler provides substantial gains in classification accuracy over both Random and Uniform for the same computational cost.
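The GFLOP accounting described above (sampler cost on the strided clips plus classifier cost on the selected clips) can be sketched as simple arithmetic. All names and the per-clip costs used in the usage note are hypothetical placeholders, not measurements from the paper.

```python
import math


def sampled_gflops(
    num_clips: int,
    stride: int,
    k: int,
    sampler_gflops: float,
    classifier_gflops: float,
) -> float:
    """Per-video cost when the sampler runs every `stride` clips and the
    action classifier runs on the k clips it ranks highest."""
    evaluated = math.ceil(num_clips / stride)  # clips seen by the sampler
    classified = min(k, evaluated)             # clips seen by the classifier
    return evaluated * sampler_gflops + classified * classifier_gflops


def dense_gflops(num_clips: int, classifier_gflops: float) -> float:
    """Per-video cost of running the action classifier on every clip."""
    return num_clips * classifier_gflops
```

As an illustration with made-up costs, a video of 100 clips, a sampler costing 1 GFLOP per clip, a classifier costing 50 GFLOPs per clip, a stride of 5 and 10 selected clips give 20 × 1 + 10 × 50 = 520 GFLOPs, against 5000 GFLOPs for dense classification.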