Audiovisual SlowFast Networks for Video Recognition

01/23/2020 ∙ by Fanyi Xiao, et al. ∙ 23

We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast extends SlowFast Networks with a Faster Audio pathway that is deeply integrated with its visual counterparts. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we employ DropPathway that randomly drops the Audio pathway during training as a simple and effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization and show that it leads to better audiovisual features. We report state-of-the-art results on four video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to self-supervised tasks, where it improves over prior work. Code will be made available at:




1 Introduction

Joint audiovisual learning is core to human perception. However, most contemporary models for video analysis exploit only the visual signal and ignore the audio signal. For many video understanding tasks, it is obvious that audio could be very helpful. Consider the action “playing saxophone”. One would expect that the unique sound signature would significantly facilitate recognizing the class. Furthermore, visually subtle classes such as “whistling”, where the action itself can be difficult to see in video frames, can be much easier to recognize with the aid of audio signals.

Figure 1: Audiovisual SlowFast Networks extend SlowFast (top two paths) with an Audio pathway (bottom). The network performs integrated audiovisual perception with hierarchical fusion.

This line of thinking is supported by perceptual and neuroscience studies suggesting interesting ways in which visual and audio signals are combined in the brain. A classic example is the McGurk effect [48]: when one listens to an audio clip (e.g., sounding “ba-ba”) while watching a video of fabricated lip movements (indicating “va-va”), the sound one perceives changes (in this case from “ba-ba” to “va-va”).

This effect demonstrates that there is tight entanglement between audio and visual signals (known as the multisensory integration process) [47, 38, 66, 65]. Importantly, research has suggested this fusion between audio and visual signals happens at a fairly early stage [58, 51].

Given its high potential in facilitating video understanding, researchers have attempted to utilize audio in videos [37, 20, 2, 5, 52, 53, 3, 59, 19]. However, there are a few challenges in making effective use of audio. First, audio does not always correspond to visual frames (e.g., in a “dunking basketball” video, there can be class-unrelated background music playing). Conversely, audio does not always contain information that can help understand the video (e.g., “shaking hands” does not have a particular sound signature). There are also challenges from a technical perspective. Specifically, we identify the incompatibility of “learning dynamics” between the visual and audio pathways – audio pathways generally train much faster than visual ones, which can lead to generalization issues during joint audiovisual training. Due in part to these various difficulties, a principled approach for audiovisual modeling is currently lacking. Many previous methods adopt an ad-hoc scheme that consists of a separate audio network that is integrated with the visual pathway via “late-fusion” [20, 2, 52].

The objective of this paper is to build an architecture for integrated audiovisual perception. We aim to go beyond previous work that performs “late-fusion” of independent audio and visual pathways, to instead learn hierarchies of integrated audiovisual features, enabling unified audiovisual perception. We propose a new architecture, Audiovisual SlowFast Networks (AVSlowFast), to perform fusion at multiple levels (Fig. 1). AVSlowFast Networks build on previous work of SlowFast Networks [14], a class of architectures that has two pathways, of which one (Slow) is designed to capture more static but semantic-rich information whereas the other (Fast) is tasked to capture fast motion. Our architecture extends the Slow and Fast pathways of [14] with a Faster audio pathway, as audio has a much higher sampling rate, that is hierarchically integrated in AVSlowFast. Our key contributions are:

(i) We propose to fuse audio and visual information at multiple levels in the network hierarchy (i.e., hierarchical fusion) so that audio can contribute to the formation of visual concepts at different levels of abstraction. In contrast to late-fusion, this enables the audio signal to participate in the process of forming visual features.

(ii) To overcome the incompatibility of learning dynamics between the visual and audio pathways, we propose DropPathway that randomly drops the Audio pathway during training as a simple and effective regularization technique to tune the pace of the learning process. This enables us to train our joint audiovisual model with hierarchical fusion connections across modalities.

(iii) Inspired by prior work in neuroscience [38], which suggests that there exist audiovisual mirror neurons in monkey brains that respond to “any evidence of the action, be it auditory or visual”, we propose to perform audiovisual synchronization (AVS) [52, 41, 2, 5] at multiple network layers to learn features that generalize across modalities.

(iv) Unlike many previous approaches which directly take existing image-designed CNNs (e.g., VGG [63], ResNet [28]) to process audio [20, 29, 8, 78], we employ a ResNet-based audio pathway design that is more efficient.

We conduct extensive experiments on multiple action recognition datasets: Kinetics [36], EPIC-Kitchens [12] and Charades [61] for action classification, as well as AVA [24] for action detection. We report state-of-the-art results on all datasets and demonstrate the benefits of joint audiovisual recognition. In addition, we show the generalization of AVSlowFast to self-supervised tasks, where it improves over prior work without bells and whistles. Finally, we provide detailed ablation studies to dissect the contribution of various components in AVSlowFast networks.

2 Related Work

Video recognition.

Significant progress has been made in video recognition in recent years. Some notable directions are two-stream networks in which one stream processes RGB frames and the other processes optical flow [62, 15, 77], 3D ConvNets as an extension of 2D networks to the spatiotemporal domain [73, 55, 84], and the recent SlowFast Networks that have two pathways to process videos at different temporal frequencies [14]. Despite all these efforts on harnessing temporal information in videos, research is relatively lacking when it comes to another important information source: audio in video.

Audiovisual activity recognition.

Joint modeling of audio and visual signals has been largely conducted in a “late-fusion” manner in video recognition literature [37, 20]. For example, all the entries that utilize audio in the 2018 ActivityNet challenge report [20] have adopted this paradigm – meaning that there are networks processing visual and audio inputs separately, and then they either concatenate the output features or average the final class scores across modalities. Recently, an interesting audiovisual fusion approach has been proposed [37] using flexible binding windows when fusing audio and visual features. With three identical network streams, this approach fuses audio features with the features from a single RGB frame and optical flow after global average pooling, at the end of the network streams. In contrast, we fuse hierarchically in an integrated architecture, which we show to be beneficial. In addition, unlike methods that directly take existing visual CNNs (e.g., Inception [72], ResNet [28], Inception-ResNet [71], etc.) to process audio, we propose a dedicated audio pathway which we will show to be more effective in experiments.

Other audiovisual tasks.

Audio has also been extensively utilized outside of video recognition, e.g. for learning audiovisual representations in a self-supervised manner [2, 5, 52, 53, 41] by exploiting audio-visual correspondence. While related, the goal is different to ours (learning representations vs. video recognition). Further, as methods discussed above, these approaches typically apply late-fusion on audio and visual features. Other audiovisual tasks that have been studied include audio-visual speech recognition [54, 50], sound-source localization [3, 59], audio-visual source separation [52, 19], and audiovisual question answering [1].

Multi-modal learning.

Researchers have long been interested in developing models that can learn from multiple modalities (e.g., audio, vision, language, etc.). Beyond audio and visual modalities, extensive research has been conducted in various other instantiations of multi-modal learning, including vision and motion [62, 16, 77, 9], vision and language [4, 18], and learning from physiological data [44]. Recently, Wang et al. discussed the difficulty of audiovisual joint training in a learning context [78]. Unlike [78], which requires carefully balancing audio and visual loss terms, we propose DropPathway and a hierarchical AV synchronization strategy to jointly train AVSlowFast from scratch.

3 Audiovisual SlowFast Networks

Inspired by research in neuroscience [7], which suggests that audio and visual signals fuse at multiple cognitive levels, we propose to fuse audio and visual features at multiple stages, from intermediate-level features to high-level semantic concepts. This way, audio can participate in the formation of visual concepts at different levels. AVSlowFast Networks are conceptually simple: SlowFast has Slow and Fast pathways to process visual input (§3.1), and AVSlowFast extends this with an Audio pathway (§3.2).

3.1 SlowFast pathways

We begin by briefly reviewing the SlowFast architecture. The Slow pathway (Fig. 1, top row) is a convolutional network that processes videos with a large temporal stride $\tau$ (i.e., it samples one frame out of $\tau$ frames). The primary goal of the Slow pathway is to produce features that capture the semantic content of the video, which has a low refresh rate (semantics do not change all of a sudden). The Fast pathway (Fig. 1, middle row) is another convolutional model with three key properties. First, it has an $\alpha_F$ times higher frame rate (i.e., with temporal stride $\tau/\alpha_F$, $\alpha_F > 1$) so that it can capture fast motion information. Second, it preserves fine temporal resolution by avoiding any temporal downsampling. Third, it has a lower channel capacity ($\beta_F$ times the Slow pathway channels, where $\beta_F < 1$), as this has been demonstrated to be a desirable trade-off [14]. We refer readers to [14] for more details.
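As a rough illustration of the two sampling rates, the sketch below lists which frames of a raw clip each visual pathway would see. It assumes the values of the example instantiation (a 64-frame clip, Slow stride 16, Fast pathway at 8 times the Slow frame rate); the helper name `sample_indices` is hypothetical and not taken from the authors' code.

```python
import numpy as np

def sample_indices(clip_len: int, stride: int) -> np.ndarray:
    """Indices of the frames kept when taking one frame every `stride` frames."""
    return np.arange(0, clip_len, stride)

CLIP_LEN = 64   # raw clip length in frames (example instantiation)
TAU = 16        # temporal stride of the Slow pathway
ALPHA_F = 8     # frame-rate ratio of the Fast pathway

slow_idx = sample_indices(CLIP_LEN, TAU)             # 4 frames
fast_idx = sample_indices(CLIP_LEN, TAU // ALPHA_F)  # 32 frames, stride 2
```

With these values every Slow frame is also a Fast frame, so the two pathways stay temporally aligned.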

3.2 Audio pathway

A key property of the Audio pathway is that it has an even finer temporal structure than the Slow and Fast pathways (with waveform sampling rate on the order of kHz). As standard processing, we take a log-mel-spectrogram (a 2-D representation of audio in time and frequency) as input and set the temporal stride to $\tau/\alpha_A$ frames, where $\alpha_A$ can be much larger than $\alpha_F$ (e.g., 32 vs. 8). In a sense, it serves as a “Faster” pathway with respect to the Slow and Fast pathways. Another notable property of the Audio pathway is its low computation cost. Due to the 1-D nature of audio signals, they are cheap to process. To control this, we set the channels of the Audio pathway to $\beta_A \times$ the Slow pathway channels. By default, we set $\beta_A$ to 1/2. Depending on the specific instantiation, the Audio pathway typically only requires 10% to 20% of the overall computation of AVSlowFast Networks.

3.3 Lateral connections

In addition to the lateral connections between Slow and Fast pathways in [14], we add lateral connections between the Audio, Slow and Fast pathways to fuse audio and visual features. Following [14], lateral connections are added after ResNet “stages” (e.g., pool1, res2, res3, res4 and pool5). However, unlike [14], which has lateral connections after each stage, we found that it is most beneficial to have lateral connections between audio and visual features starting from intermediate levels (we ablate this in Sec. 4.2). Next, we discuss several concrete instantiations of AVSlowFast.

| stage | Slow pathway | Fast pathway | Audio pathway |
|---|---|---|---|
| raw clip | 3×64×224² | 3×64×224² | 80×128 (freq.×time) |
| data layer | stride 16, 1² | stride 2, 1² | - |
| conv1 | 1×7², 64; stride 1, 2² | 5×7², 8; stride 1, 2² | [9×1, 1×9], 32; stride 1, 1 |
| pool1 | 1×3² max; stride 1, 2² | 1×3² max; stride 1, 2² | - |
| res2 | ×3 | ×3 | ×3 |
| res3 | ×4 | ×4 | ×4 |
| res4 | ×6 | ×6 | ×6 |
| res5 | ×3 | ×3 | ×3 |
| | global average pool, concat, fc | | |

Table 1: An example instantiation of the AVSlowFast network. For Slow & Fast pathways, the dimensions of kernels are denoted by $\{T\times S^2, C\}$ for temporal, spatial, and channel sizes. For the Audio pathway, kernels are denoted with $\{F\times T, C\}$, where $F$ and $T$ are frequency and time. Strides are denoted with {temporal stride, spatial stride²} and {frequency stride, time stride} for SlowFast and Audio pathways, respectively. In this example, the speed ratios are $\alpha_F = 8$, $\alpha_A = 32$ and the channel ratios are $\beta_F = 1/8$, $\beta_A = 1/2$, with $\tau = 16$. The backbone is ResNet-50.

3.4 Instantiations

AVSlowFast Networks define a generic class of models that follow the same design principles. In this section, we exemplify a specific instantiation in Table 1. We denote spatiotemporal size by $T\times S^2$ for Slow/Fast pathways and $F\times T$ for the Audio pathway, where $T$ is the temporal length, $S$ is the height and width of a square spatial crop and $F$ is the number of frequency bins for audio.

SlowFast pathways.

For Slow and Fast pathways, we follow the basic instantiation of the SlowFast 4×16, R50 model defined in [14]. It has a Slow pathway that samples $T=4$ frames out of a 64-frame raw clip with a temporal stride $\tau=16$. There is no temporal downsampling in the Slow pathway, since the input stride is large. Also, it only applies non-degenerate temporal convolutions (temporal kernel size $>1$) in res4 and res5 (see Table 1), as this is more effective.

The Fast pathway has a higher frame rate ($\alpha_F = 8$) and a lower channel capacity ($\beta_F = 1/8$), such that it can better capture motion while trading off model capacity. To preserve fine temporal resolution, the Fast pathway has non-degenerate temporal convolutions in every residual block.

Spatial downsampling is performed with stride 2 convolution in the center (“bottleneck”) filter of the first residual block in each stage of both the Slow and Fast pathway.

Audio pathway.

The Audio pathway takes as input the log-mel-spectrogram representation, which is a 2-D representation with one axis being time and the other one denoting frequency bins. In the instantiation shown in Table 1, we use 128 spectrogram frames (corresponding to 2 seconds of audio) with 80 frequency bins.

Similar to the Slow and Fast pathways, the Audio pathway is also based on a ResNet, but with a specific design to better fit audio inputs. First, it does not perform pooling after the initial convolutional filter (i.e., there is no downsampling layer at stage pool1) to preserve information along both the temporal and frequency axes. Downsampling in time-frequency space is performed by stride 2 convolution in the center (“bottleneck”) filter of the first residual block in each stage from res2 to res5.

Second, we decompose the 3×3 convolution filters in res2 and res3 into 1×3 filters for frequency and 3×1 filters for time. This not only reduces computation, but it also allows the network to treat time and frequency differently in early stages (as opposed to 3×3 filters, which imply that both axes are equivalent). While for spatial filters it is reasonable to perform filtering in the $x$ and $y$ dimensions symmetrically, this might not be optimal for early filtering in the time and frequency dimensions, as the statistics of spectrograms are different from those of natural images, which instead are approximately isotropic and shift-invariant [57, 31].
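The computational saving from the decomposition can be checked with a quick weight count. The sketch below assumes, purely for illustration, that the two 1-D filters are applied one after the other with the same channel width and no bias terms; the source does not spell out these details, and `conv_params` is a hypothetical helper.

```python
def conv_params(kernel_h: int, kernel_w: int, c_in: int, c_out: int) -> int:
    """Number of weights in a 2-D convolution without bias."""
    return kernel_h * kernel_w * c_in * c_out

C = 32  # illustrative channel width, not a value prescribed by the paper

full = conv_params(3, 3, C, C)                                  # one 3x3 filter bank
decomposed = conv_params(1, 3, C, C) + conv_params(3, 1, C, C)  # two 1-D filter banks
ratio = decomposed / full                                       # two thirds of the weights
```

Under these assumptions the decomposed pair uses 6C² weights instead of 9C², independent of the channel width chosen.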

Lateral connections.

There are many options on how to fuse audio features into the visual pathways. Here, we describe several instantiations and the motivation behind them. Note that this section discusses the lateral connections between Audio pathway and SlowFast pathways. For the fusion connection between the two visual pathways (Slow and Fast), we adopt the temporal strided convolution as it is demonstrated to be most effective in [14].

Figure 2: Fusion connections for AVSlowFast. Left: A→F→S enforces strong temporal alignment between audio and RGB frames, as audio is fused into the Fast pathway with fine temporal resolution. Center: A→FS has higher tolerance on temporal misalignment as audio is fused into the temporally downsampled output of SlowFast fusion. Right: Audiovisual Nonlocal fuses through a Nonlocal block [79], such that audio features are used to select visual features that are deemed important by audio.

(i) A→F→S: In this approach (Fig. 2 left), the Audio pathway (A) is first fused to the Fast pathway (F), and then fused to the Slow pathway (S). Specifically, audio features are subsampled to the temporal length of the Fast pathway and then fused into the Fast pathway with a sum operation. After that, the resulting features are further subsampled (e.g., 4× subsample) and fused with the Slow pathway (as is done in SlowFast). The key property of this fusion method is that it enforces strong temporal alignment between audio and visual features, as audio features are fused into the Fast pathway, which preserves fine temporal resolution.

(ii) A→FS: An alternative way is to fuse the Audio pathway into the output of the SlowFast fusion (Fig. 2 center), which is coarser in temporal resolution. This method imposes a less stringent requirement on temporal alignment between audio and visual features. Note that a similar idea of relaxing the alignment requirement is also explored in [37], but in the context of late fusion of RGB, flow and audio streams.

(iii) Audiovisual Nonlocal: One might also be interested in using audio as a modulating signal for visual features. Specifically, instead of directly summing or concatenating audio features into the visual stream, one might expect audio to play a more subtle role of modulating the visual concepts through attention mechanisms such as Non-Local (NL) blocks [79]. One example would be audio serving as a probing signal that indicates where the interesting event is happening in the video, both spatially and temporally, and then focuses the attention of the visual pathways on those locations. To materialize this, we adapt NL blocks to take both audio and visual features as inputs (Fig. 2 right). Audio features are then matched to different locations within visual features (along the $H$, $W$ and $T$ axes), and the affinity is used to generate a new visual feature that combines information from locations deemed important by audio features.
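To make the second option (audio fused into the temporally coarse output of the SlowFast fusion) more concrete, here is a minimal NumPy sketch of a sum-based fusion. Everything beyond "subsample audio in time, then sum" is an assumption made for illustration: the frequency axis is averaged out, temporal subsampling is done by picking evenly spaced audio frames, and a hypothetical projection matrix `proj` maps audio channels to visual channels. The authors' actual connection may differ in all of these details.

```python
import numpy as np

def fuse_audio_into_visual(visual: np.ndarray, audio: np.ndarray,
                           proj: np.ndarray) -> np.ndarray:
    """Sum audio features into visual features (illustrative sketch).

    visual: (C, T, H, W) visual features after SlowFast fusion
    audio:  (C_a, F, T_a) audio features (channels, frequency, time)
    proj:   (C, C_a) hypothetical channel projection
    """
    t = visual.shape[1]
    audio_t = audio.mean(axis=1)                        # (C_a, T_a): pool over frequency
    idx = np.linspace(0, audio_t.shape[1] - 1, t).round().astype(int)
    audio_t = audio_t[:, idx]                           # (C_a, T): temporal subsampling
    audio_c = proj @ audio_t                            # (C, T): match channel count
    return visual + audio_c[:, :, None, None]           # broadcast over H and W

rng = np.random.default_rng(0)
visual = rng.standard_normal((64, 4, 7, 7))
audio = rng.standard_normal((32, 10, 16))
proj = rng.standard_normal((64, 32))
fused = fuse_audio_into_visual(visual, audio, proj)     # same shape as `visual`
```

Because the audio term is constant over the spatial axes and only loosely indexed in time, this form of fusion tolerates small temporal offsets between sound and image.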

3.5 Joint audiovisual training

Unlike SlowFast, AVSlowFast trains with multiple modalities. As noted in Sec. 1, this leads to challenging training dynamics (i.e., different training speeds of the audio and visual pathways). To tackle this, we propose two training strategies that enable joint training of AVSlowFast.

Figure 3: Training procedure on Kinetics for Audio-only (red) vs. SlowFast (green) networks. We show the top-1 training error (dashed) and validation error (solid). The curves show single-crop errors; the video accuracy is 24.8% vs. 75.6%. The audio network converges after around 3× fewer iterations than the visual network.


We discuss a possible reason for why many previous video classification approaches employ audio in an ad-hoc manner (i.e., late fusion). By analyzing the model training dynamics we observe the following. First, audio and visual pathways are very different in terms of their “learning speed”. Taking the curves in Fig. 3 as an example, the green curve is for training a visual-only SlowFast model, whereas the red curve is for training an Audio-only model. It shows that the Audio-only model requires fewer training iterations before it starts to overfit (at around 70 epochs, which is roughly 1/3 of the visual model's training epochs).

As we will show by experiments, this discrepancy in learning pace leads to strong overfitting if we naively train both modalities jointly. To unlock the potential of joint training, we propose a simple strategy of randomly dropping the Audio pathway during training (referred to as DropPathway). Specifically, at each training iteration, we drop the Audio pathway altogether with probability $P_d$. This way, we slow down the learning of the Audio pathway and make its learning dynamics more compatible with its visual counterpart. When dropping the audio pathway, we simply feed zero tensors into the visual pathways (we also explored feeding a running average of audio features, and found similar results, possibly due to BN). Our ablation studies in the next section will show the effect of DropPathway, demonstrating that this simple strategy provides good generalization and is essential for jointly training AVSlowFast.
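The DropPathway step described above can be sketched in a few lines. This is a simplified illustration and not the authors' implementation: it operates on a single feature array, the function name `drop_pathway` is hypothetical, and the drop decision is made once per call, standing in for "once per training iteration".

```python
import numpy as np

def drop_pathway(audio_feat: np.ndarray, drop_prob: float,
                 rng: np.random.Generator, training: bool = True) -> np.ndarray:
    """With probability `drop_prob`, feed zeros instead of audio features.

    At inference time (training=False) the audio features always pass through.
    """
    if training and rng.random() < drop_prob:
        return np.zeros_like(audio_feat)
    return audio_feat

rng = np.random.default_rng(0)
audio_feat = np.ones((32, 8))
# Fraction of simulated iterations in which the Audio pathway is dropped.
dropped = sum(
    not drop_pathway(audio_feat, 0.8, rng).any() for _ in range(10_000)
) / 10_000
```

A drop rate of 0.8 is used here because it is the best-performing setting in the ablation of Table 5(c); with it, the visual pathways see real audio features in only about one iteration out of five.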

Hierarchical audiovisual synchronization.

As noted in Sec. 2, temporal synchronization (that comes for free) between audio and visual sources has been explored as a self-supervisory signal to learn feature representations [53, 5, 2, 11, 52, 41, 25]. In this work, we use audiovisual synchronization to encourage the network to produce feature representations that are generalizable across modalities (inspired by the audiovisual mirror neurons in primate vision [38]). Specifically, we add an auxiliary task to classify whether a pair of audio and visual frames are in-sync or not [41, 52] and adopt the curriculum schedule used in [41] that starts with easy negatives (audio and visual frames come from different clips), and transitions into a mix of easy and hard negatives (audio and visual frames are from the same clip, but with a temporal shift) after 50% of training epochs.
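The curriculum over negative pairs can be sketched as a small sampling rule. The switch at 50% of the training epochs comes from the text; the mixing ratio after the switch is not stated there, so `hard_fraction` below is a placeholder assumption, and the function name is hypothetical.

```python
import random

def sample_negative_type(epoch: int, total_epochs: int, rng: random.Random,
                         hard_fraction: float = 0.5) -> str:
    """Choose the kind of out-of-sync pair to build for the AVS auxiliary task.

    "easy": audio and visual frames come from different clips.
    "hard": same clip, but the audio is temporally shifted.
    `hard_fraction` (share of hard negatives after the switch) is an assumption.
    """
    if epoch < 0.5 * total_epochs:
        return "easy"
    return "hard" if rng.random() < hard_fraction else "easy"

rng = random.Random(0)
first_half = {sample_negative_type(e, 100, rng) for e in range(50)}
second_half = [sample_negative_type(e, 100, rng) for e in range(50, 100)]
```

In the first half of training only easy negatives are produced; afterwards both kinds appear.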

One notable difference of our approach to previous work is that unlike previous work which has a single synchronization loss at the network output (since these works adopt a “late-fusion” strategy), we add multiple losses to each of the fusion junctions, since AVSlowFast has fusion connections at multiple levels. As we will show in our experiments, this leads to better audiovisual features learned by AVSlowFast.

4 Experiments: Action Classification

We evaluate our approach on four video recognition datasets using standard evaluation protocols. For the action classification experiments in this section we use the Kinetics-400 [36], EPIC-Kitchens [12] and Charades [61]. For action detection experiments, we use the challenging AVA dataset [24], which will be covered in Sec. 5.

Datasets. Kinetics-400 [36] is a large-scale video dataset of 240k training videos and 20k validation videos in 400 action categories. Results on Kinetics are reported as top-1 and top-5 classification accuracy (%).

The EPIC-Kitchens dataset [12] consists of egocentric videos of daily activities recorded in various kitchen environments. It has 39k segments in 432 videos. For each segment, the task is to predict a verb (e.g., “turn-on”), a noun (e.g., “switch”), and an action by combining the two (“turn on switch”). Performance is measured as top-1 and top-5 accuracy. We use the train/val split following [6]. Test set results are obtained by submitting to the evaluation server.

Charades [61] is a dataset of 9.8k training videos and 1.8k validation videos in 157 classes. Each video has multiple labels of activities spanning 30 seconds. Performance is measured in mean Average Precision (mAP).

Audio pathway. Following previous work [2, 3, 41, 42], we extract log-mel-spectrograms from the raw audio waveform to serve as the input to the Audio pathway. Specifically, we sample audio data with a 16 kHz sampling rate, then compute a spectrogram with a window size of 32 ms and a step size of 16 ms. The length of the audio input is exactly matched to the duration spanned by the RGB frames. For example, under 30 FPS, for AVSlowFast with 8×8 frames (2 secs) input, we sample 128 frames (2 secs) in the log-mel-spectrogram.
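A minimal NumPy sketch of this preprocessing is given below, using the stated settings (16 kHz audio, 32 ms window, 16 ms step) and the 80 frequency bins of the example instantiation. The Hann window, the HTK-style mel formula, the zero-padding of the last frames and the log offset are illustrative assumptions; the source does not specify them, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(wave, sr=16000, win_ms=32, hop_ms=16, n_mels=80):
    """Return a log-mel-spectrogram of shape (n_mels, n_frames)."""
    win = sr * win_ms // 1000                  # 512 samples
    hop = sr * hop_ms // 1000                  # 256 samples
    n_frames = len(wave) // hop
    padded = np.pad(wave, (0, win))            # zero-pad so every frame is full length
    frames = np.stack([padded[i * hop:i * hop + win] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, win, sr).T
    return np.log(mel + 1e-6).T

# About 2 seconds of audio (128 steps of 16 ms = 2.048 s) gives an 80 x 128 input.
wave = np.random.default_rng(0).standard_normal(128 * 256)
spec = log_mel_spectrogram(wave)
```

The resulting array has frequency along the first axis and time along the second, matching the 80×128 (freq.×time) layout of Table 1.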

Training. We train our AVSlowFast models on Kinetics from scratch without any pre-training. We use a synchronous SGD optimizer and follow the training recipe (e.g., learning rate, weight decay, warm-up, etc.) used in [14]. Given a training video, we randomly sample $T$ frames with stride $\tau$ and extract the corresponding log-mel-spectrogram. For video frames, we resize the video so that its shorter side is randomly sampled in [256, 320] pixels, randomly crop 224×224 pixels, and apply a random horizontal flip [63, 79].

Inference. Following previous work [79, 14], we uniformly sample 10 clips from a video along its temporal axis. For each clip, we resize the shorter spatial side to 256 pixels and take 3 crops of 256×256 along the longer side to cover the spatial dimensions. Video-level predictions are computed by averaging softmax scores. We report the actual inference-time computation as in [14], by listing the FLOPs per spacetime “view” of spatial size 256² (temporal clip with spatial crop) at inference and the number of views (i.e., 30 for 10 temporal clips each with 3 spatial crops).
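The multi-view aggregation amounts to a softmax per view followed by a mean over views. The sketch below uses random logits in place of real network outputs; the function names are hypothetical.

```python
import numpy as np

def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    z = logits - logits.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def video_prediction(view_logits: np.ndarray) -> np.ndarray:
    """Average per-view softmax scores: (num_views, num_classes) -> (num_classes,)."""
    return softmax(view_logits, axis=-1).mean(axis=0)

NUM_VIEWS = 10 * 3   # 10 temporal clips x 3 spatial crops
NUM_CLASSES = 400    # Kinetics-400

rng = np.random.default_rng(0)
view_logits = rng.standard_normal((NUM_VIEWS, NUM_CLASSES))  # stand-in for model outputs
probs = video_prediction(view_logits)
top1 = int(probs.argmax())
```

Averaging probabilities rather than logits keeps each view's contribution bounded, so a single over-confident crop cannot dominate the video-level score.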

Full training and inference details for Kinetics, EPIC and Charades are in appendices A.4, A.5, and A.6, respectively.

| model | inputs | pretrain | top-1 | top-5 | KS | GFLOPs×views |
|---|---|---|---|---|---|---|
| I3D [9] | V | ✓ | 72.1 | 90.3 | N/A | 108 × N/A |
| Nonlocal [79], R101 | V | ✓ | 77.7 | 93.3 | | 359 × 30 |
| R(2+1)D [75] | V | - | 72.0 | 90.0 | | 152 × 115 |
| R(2+1)D [75] | V+F | - | 73.9 | 90.9 | | 304 × 115 |
| I3D [9] | V+F | - | 71.6 | 90.0 | | 216 × N/A |
| ECO [86] | V | - | 70.0 | 89.4 | | N/A × N/A |
| S3D [83] | V | - | 69.4 | 89.1 | | 66.4 × N/A |
| ARTNet [76] | V | - | 69.2 | 88.3 | | 23.5 × 250 |
| STC [13] | V | - | 68.7 | 88.5 | | N/A × N/A |
| ip-CSN-152 [74] | V | - | 77.8 | 92.8 | | 109 × 30 |
| 3-stream late fusion [8] | A+V+F | ✓ | 74.9 | 91.6 | | N/A × N/A |
| 3-stream LSTM [8] | A+V+F | ✓ | 77.1 | 93.2 | | N/A × N/A |
| 3-stream SATT [8] | A+V+F | ✓ | 77.7 | 93.2 | | N/A × N/A |
| GBlend [78] | V | - | 76.4 | 92.1 | | N/A × N/A |
| GBlend [78] | A+V | - | 77.7 | 93.0 | | N/A × N/A |
| SlowFast, R50 [14] | V | - | 75.6 | 92.0 | 80.5 | 36 × 30 |
| AVSlowFast, R50 | A+V | - | 77.0 | 92.7 | 83.7 | 40 × 30 |
| SlowFast, R101 [14] | V | - | 77.9 | 93.2 | 82.7 | 106 × 30 |
| AVSlowFast, R101 | A+V | - | 78.8 | 93.6 | 85.0 | 129 × 30 |

Table 2: AVSlowFast results on Kinetics. AVSlowFast and SlowFast instantiations are with 4×16 and 8×8 inputs for R50/R101, and without NL blocks. “N/A” indicates the numbers are not available to us. “KS” refers to top-1 accuracy on the Kinetics-Sounds dataset [2], which is a subset of 34 Kinetics classes. “pretrain” refers to ImageNet pretraining.

4.1 Main Results


Kinetics. We present action recognition results of AVSlowFast on Kinetics in Table 2. First, we compare AVSlowFast with SlowFast and see a margin of 1.4% top-1 for R50 and 0.9% top-1 for R101, given the same network backbone and input size. This demonstrates the effectiveness of the audio stream despite its modest cost of only 10%–20% of the overall computation. In comparison, going deeper from R50 to R101 increases computation by 194% for a slightly higher gain in accuracy.
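The cost figures quoted above can be reproduced from the per-view GFLOPs in Table 2. In this sketch the cost of the audio stream is approximated as the difference between AVSlowFast and SlowFast with the same backbone, which is an assumption: that difference also includes the fusion connections.

```python
def pct_increase(old: float, new: float) -> float:
    """Relative increase from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old

def share_of_total(part: float, total: float) -> float:
    """`part` as a percentage of `total`."""
    return 100.0 * part / total

# Per-view GFLOPs from Table 2.
GFLOPS = {"SlowFast R50": 36, "AVSlowFast R50": 40,
          "SlowFast R101": 106, "AVSlowFast R101": 129}

# Going deeper: SlowFast R50 -> SlowFast R101.
deeper = pct_increase(GFLOPS["SlowFast R50"], GFLOPS["SlowFast R101"])

# Share of AVSlowFast computation attributable to adding audio.
audio_r50 = share_of_total(GFLOPS["AVSlowFast R50"] - GFLOPS["SlowFast R50"],
                           GFLOPS["AVSlowFast R50"])
audio_r101 = share_of_total(GFLOPS["AVSlowFast R101"] - GFLOPS["SlowFast R101"],
                            GFLOPS["AVSlowFast R101"])
```

The deeper backbone costs roughly 194% more computation, while the audio addition accounts for about 10% (R50) and 18% (R101) of the total, consistent with the 10%–20% range stated in the text.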

AVSlowFast compares favorably to existing methods that utilize various modalities, i.e., audio (A), visual frames (V) and optical flow (F). Adding optical flow streams can bring similar gains for doubling computation (R(2+1)D in Table 2). When comparing to other methods that also utilize audio [78, 8], despite building upon a stronger baseline (our SlowFast baseline is as good as GBlend’s [78] final AV model: 77.9% vs. 77.7%), AVSlowFast is still able to bring complementary benefits from audio. We refer readers to appendix A.3 for more comparisons to GBlend.

Furthermore, as Kinetics is a visual-heavy dataset (for many classes e.g. “writing” audio is not useful), to better study audiovisual learning, [2] proposes “Kinetics-Sounds” as a subset of 34 Kinetics classes that are potentially manifested both visually and aurally (example classes include “blowing nose” and “playing drums”). We test both SlowFast and AVSlowFast on Kinetics-Sounds in “KS” column of Table 2. As expected, the gain from SlowFast to AVSlowFast is stronger on Kinetics-Sounds – for R50/R101, gains doubled to +3.2%/+2.3%, showing the potential of audio on relevant data.

A comparison of our Audio pathway for Audio-only classification on Kinetics is provided in appendix A.2, and a class-level analysis for Kinetics in appendix A.3.

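| split | model | verbs top-1 | verbs top-5 | nouns top-1 | nouns top-5 | actions top-1 | actions top-5 |
|---|---|---|---|---|---|---|---|
| validation | 3D CNN [81] | 49.8 | 80.6 | 26.1 | 51.3 | 19.0 | 37.8 |
| validation | LFB [81] | 52.6 | 81.2 | 31.8 | 56.8 | 22.8 | 41.1 |
| validation | SlowFast [14] | 55.8 | 83.1 | 27.4 | 52.1 | 21.9 | 39.7 |
| validation | AVSlowFast | 58.7 | 83.6 | 31.7 | 58.4 | 24.2 | 43.6 |
| test s1 (seen) | LFB [81] | 60.0 | 88.4 | 45.0 | 71.8 | 32.7 | 55.3 |
| test s1 (seen) | FBK-HUPBA [67] | 63.3 | 89.0 | 44.8 | 69.9 | 35.5 | 57.2 |
| test s1 (seen) | HF-TSN [68] | 57.6 | 87.8 | 39.9 | 65.4 | 28.1 | 48.6 |
| test s1 (seen) | EPIC-Fusion [37] | 64.8 | 90.7 | 46.0 | 71.3 | 34.8 | 56.7 |
| test s1 (seen) | AVSlowFast | 65.7 | 89.5 | 46.4 | 71.7 | 35.9 | 57.8 |
| test s2 (unseen) | LFB [81] | 50.9 | 77.6 | 31.5 | 57.8 | 21.2 | 39.4 |
| test s2 (unseen) | FBK-HUPBA [67] | 49.4 | 77.5 | 27.1 | 52.0 | 20.3 | 37.6 |
| test s2 (unseen) | HF-TSN [68] | 42.4 | 75.8 | 25.2 | 49.0 | 16.9 | 33.3 |
| test s2 (unseen) | EPIC-Fusion [37] | 52.7 | 79.9 | 27.9 | 53.8 | 19.1 | 36.5 |
| test s2 (unseen) | AVSlowFast | 55.8 | 81.7 | 32.7 | 58.9 | 24.0 | 43.2 |

Table 3: EPIC-Kitchens validation and test set results. Backbone: 8×8, R101 (without NL blocks).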
| model | pretrain | mAP | GFLOPs×views |
|---|---|---|---|
| CoViAR, R-50 [82] | ImageNet | 21.9 | N/A |
| Asyn-TF, VGG16 [60] | ImageNet | 22.4 | N/A |
| MultiScale TRN [85] | ImageNet | 25.2 | N/A |
| Nonlocal, R101 [79] | ImageNet+Kinetics | 37.5 | 544 × 30 |
| STRG, R101+NL [80] | ImageNet+Kinetics | 39.7 | 630 × 30 |
| Timeception [32] | Kinetics-400 | 41.1 | N/A × N/A |
| LFB, +NL [81] | Kinetics-400 | 42.5 | 529 × 30 |
| SlowFast | Kinetics | 42.5 | 234 × 30 |
| SlowFast+Audio | Kinetics | 42.8 | - |
| AVSlowFast | Kinetics | 43.7 | 278 × 30 |

Table 4: Comparison with the state-of-the-art on Charades. SlowFast and AVSlowFast are with R101+NL backbone and 16×8 sampling. “SlowFast+Audio” refers to applying late-fusion.
| connection | top-1 | top-5 | GFLOPs |
|---|---|---|---|
| A→F→S | 75.3 | 91.8 | 51.4 |
| A→FS | 77.0 | 92.7 | 39.8 |
| AV Nonlocal | 77.2 | 92.9 | 39.9 |

(a) Audiovisual fusion connection.

| $\beta_A$ | top-1 | top-5 | GFLOPs |
|---|---|---|---|
| 1/8 | 76.0 | 92.5 | 36.0 |
| 1/4 | 76.6 | 92.7 | 36.8 |
| 1/2 | 77.0 | 92.7 | 39.8 |
| 1 | 75.9 | 92.4 | 51.9 |

(b) Audio channels $\beta_A$.

| $P_d$ | top-1 | top-5 |
|---|---|---|
| - | 75.2 | 91.8 |
| 0.2 | 76.0 | 92.5 |
| 0.5 | 76.7 | 92.7 |
| 0.8 | 77.0 | 92.7 |

(c) DropPathway rate $P_d$.

| AVS | top-1 | top-5 |
|---|---|---|
| - | 76.4 | 92.5 |
| res5 | 76.7 | 92.8 |
| res4,5 | 76.9 | 92.9 |
| res3,4,5 | 77.0 | 92.7 |

(d) Hierarchical AV sync.

Table 5: Ablations on AVSlowFast design on Kinetics-400. We show top-1 and top-5 classification accuracy (%), as well as computational complexity measured in GFLOPs for a single clip input of spatial size 256². Backbone: 4×16, R-50.

EPIC-Kitchens. Next, we compare to state-of-the-art methods on EPIC-Kitchens in Table 3. First, AVSlowFast improves over SlowFast by strong margins of +2.9%/+4.3%/+2.3% top-1 for verb/noun/action on the validation set, which clearly demonstrates the benefits of audio in egocentric video recognition. Second, as a system-level comparison, AVSlowFast achieves higher top-1 accuracy than the related, previous best EPIC-Fusion [37] in all three categories (verb/noun/action) and on both test sets (seen/unseen). We observe larger performance gains on the unseen split (i.e., novel kitchen scenes) of the test set (+3.1%/+4.8%/+4.9% for verb/noun/action), which demonstrates good generalization of our method. Compared to LFB [81], which uses an object detector to localize objects, AVSlowFast achieves similar performance for nouns (objects) on both the seen and unseen test sets, whereas SlowFast without audio lags far behind LFB (-4.4% compared to LFB on val noun), which is intuitive as sound can be beneficial for recognizing objects. Overall, we echo the findings in [37] that audio is a very useful signal for egocentric video recognition, and our AVSlowFast Networks make good use of it.


Charades. We test the effectiveness of AVSlowFast on videos of longer-range activities on Charades in Table 4. We observe that audio can facilitate recognition (+1.2% over a strong SlowFast baseline) and we achieve state-of-the-art performance under Kinetics-400 pre-training. We further report the performance for late fusion: it only improves marginally over SlowFast (+0.3%), while AVSlowFast better exploits audio as an integrated audiovisual architecture.

4.2 Ablation Studies

We show ablation results on Kinetics-400 to study the effectiveness and tradeoffs of various design choices.

fusion stage top-1 top-5 GFLOPs
SlowFast 75.6 92.0 36.1
SlowFast+Audio 76.1 92.0 -
pool5 75.4 92.0 38.4
res4 + pool5 76.5 92.6 39.1
res3,4 + pool5 77.0 92.7 39.8
res2,3,4 + pool5 75.8 92.4 40.2
Table 6: Effects of hierarchical fusion. All models are based on R50 with 4×16 input. “pool5” refers to fusing audio and visual features at the output of the last ResNet stage (i.e., late fusion).

Hierarchical fusion.

We first study the effectiveness of fusion in Table 6. The first interesting phenomenon is that direct ensembling of audio and visual models produces a modest gain (76.1% vs. 75.6%), whereas joint training with late fusion only (fusing at the pool5 output) hurts (75.6% → 75.4%).

Next, we find that it is beneficial to fuse audio and visual features at multiple levels. Specifically, recognition accuracy steadily increases from 75.4% to 77.0% as we increase the number of fusion connections from one (i.e., only concatenating pool5 outputs) to three (res3,4 + pool5), where it peaks. If we further add a lateral connection at res2, the performance starts to drop. This suggests that it is beneficial to start fusing audio and visual features from intermediate levels (res3) all the way to the top of the network. We hypothesize that this is because audio facilitates the formation of visual concepts, but only when features mature to intermediate concepts that are generalizable across modalities (e.g., local edges typically do not have a general sound pattern).

Lateral connections.

We ablate the effectiveness of different instantiations of lateral connections between audio and visual pathways in Table 5a. First, A→F→S, which enforces strong temporal alignment between audio and visual streams, produces much worse classification accuracy than A→FS, which relaxes the requirement on alignment. This coincides with [37], which argues that it is beneficial to tolerate some misalignment between the modalities, since class-level audio signals might occur out of sync with the visual frames (e.g., when shooting 3-pointers in basketball, the net-touching sound only comes after the action finishes). Finally, the straightforward A→FS connection performs similarly to the more complex Audiovisual Nonlocal [79] fusion (77.0% vs. 77.2%). We use A→FS as our default lateral connection for its good performance and simplicity.

Audio pathway capacity.

We study the impact of the number of channels of the Audio pathway in Table 5b. As expected, when we increase the number of channels (e.g., from 1/8 to 1/2, expressed as the ratio between the Audio and Slow pathways' channels), accuracy improves at the cost of increased computation. However, performance starts to degrade when we further increase the ratio to 1, likely due to overfitting. We use a ratio of 1/2 across all our experiments.


DropPathway.

As discussed before, we apply pathway dropping to address the mismatch in learning speed across modalities. Here we conduct ablative experiments to study the effects of different drop rates. The results are shown in Table 5c. As shown in the table, a high drop rate (0.5 or 0.8) is required to slow down the Audio pathway when training audio and visual pathways jointly. In contrast, when we train AVSlowFast without DropPathway (“-”), the accuracy degrades dramatically and is even worse than that of visual-only models (75.2% vs. 75.6%). This is because the Audio pathway learns too fast and starts to overfit and dominate the visual feature learning. This demonstrates the importance of DropPathway for joint audiovisual training.
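The pathway dropping described above can be illustrated with a minimal sketch. It assumes that the drop decision is made once per training iteration and that fusion is an element-wise sum; both are simplifications for illustration, and the function name `fuse_with_drop_pathway` is hypothetical rather than taken from any released code.

```python
import numpy as np


def fuse_with_drop_pathway(visual, audio, drop_rate, training, rng):
    """Fuse audio into visual features, skipping audio with probability drop_rate.

    Fusion by element-wise sum is an assumption made for illustration only.
    """
    if training and rng.random() < drop_rate:
        # Audio pathway dropped for this iteration: visual features pass through.
        return visual
    return visual + audio


rng = np.random.default_rng(0)
visual = np.ones((2, 4))
audio = np.full((2, 4), 0.5)

# Count how often the audio pathway survives over many training iterations.
kept = sum(
    not np.array_equal(fuse_with_drop_pathway(visual, audio, 0.8, True, rng), visual)
    for _ in range(10_000)
)
keep_fraction = kept / 10_000
print(f"audio used in {keep_fraction:.1%} of training iterations")
```

With a drop rate of 0.8, the audio features contribute to only about one in five training iterations, which is one way to read the statement that a high drop rate slows down the Audio pathway. At test time (`training=False`) the audio features are always fused.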

Hierarchical audiovisual synchronization.

We study the effectiveness of hierarchical audiovisual synchronization in Table 5d. We use AVSlowFast with and without AVS, and vary the layers at which the synchronization losses are applied. We observe that adding AVS as an auxiliary task is beneficial (+0.6% gain). Furthermore, having synchronization losses at multiple levels slightly increases the performance (without cost). This shows that it is beneficial to have a feature representation that generalizes across audio and visual modalities, and that hierarchical AVS can help produce such a representation.

5 Experiments: AVA Action Detection

In addition to the image-level action recognition task, we also apply AVSlowFast models to action detection, which requires both localizing and recognizing actions. Although audio does not provide spatial localization information, we hope it can help recognition and therefore benefit detection.


The AVA dataset [24] focuses on spatiotemporal localization of human actions. Spatiotemporal labels are provided for one frame per second, with people annotated with a bounding box and (possibly multiple) actions. There are 211k training and 57k validation video segments. We follow the standard protocol [24] of evaluating on 60 classes. The metric is mean Average Precision (mAP) over 60 classes, using a frame-level IoU threshold of 0.5.
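As an illustration of the frame-level IoU criterion used by the metric, the sketch below computes the intersection-over-union of two boxes and applies the 0.5 threshold. The `(x1, y1, x2, y2)` box format and the use of `>=` at the threshold are assumptions; the official AVA evaluation code may differ in such details.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so that disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """A predicted box counts as a localization match if its IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= threshold
```

For example, two 2×2 boxes that overlap by half of their width share an intersection of 2 and a union of 6, giving an IoU of 1/3, which falls below the 0.5 threshold.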

Detection architecture.

We follow the detection architecture introduced in [14], which is adapted from Faster R-CNN [56] for video. During training, the input to our audiovisual detector consists of RGB frames, sampled with a temporal stride and a spatial size of 224×224, for the SlowFast pathways, and the corresponding log-mel-spectrogram covering this time window for the Audio pathway. During testing, the backbone feature extractor is computed fully convolutionally with an RGB frame shorter side of 256 pixels [14], as is standard in Faster R-CNN [56]. For details on architecture, training and inference, please refer to appendix A.7.

Main Results.

We compare AVSlowFast to SlowFast as well as several other existing methods in Table 7. AVSlowFast, with both R50 and R101 backbones, outperforms SlowFast by a consistent margin (about 1.2%) while increasing FLOPs only slightly, e.g., by only 2% for R50, whereas going from SlowFast R50 to R101 (without audio) increases computation by 180%. (We report FLOPs for fully-convolutional inference of a clip with 256×320 spatial size for SlowFast and AVSlowFast models; the full test-time computational cost of these models is directly proportional to this.) This demonstrates that information from audio can be cheap and beneficial also for action detection, where spatiotemporal localization is required. Interestingly, the ActivityNet Challenge 2018 [20] hosted a separate track for multiple modalities, but no team could achieve gains using audio information on AVA. For system-level comparison to other approaches, Table 7 shows that AVSlowFast achieves state-of-the-art performance on AVA under Kinetics-400 (K400) pretraining.
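The cost figures quoted above can be rechecked from the GFLOPs and mAP values listed in Table 7. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
def relative_increase(base, new):
    """Fractional increase of `new` over `base`."""
    return (new - base) / base


# GFLOPs and val mAP copied from Table 7 (AVA v2.1 rows).
slowfast_r50 = {"gflops": 65.7, "map": 24.3}
avslowfast_r50 = {"gflops": 67.1, "map": 25.4}
slowfast_r101 = {"gflops": 184.0, "map": 26.3}

audio_overhead = relative_increase(slowfast_r50["gflops"], avslowfast_r50["gflops"])
deeper_backbone = relative_increase(slowfast_r50["gflops"], slowfast_r101["gflops"])
audio_gain = avslowfast_r50["map"] - slowfast_r50["map"]

print(f"audio pathway: +{audio_overhead:.1%} FLOPs, +{audio_gain:.1f} mAP")
print(f"R50 -> R101:   +{deeper_backbone:.1%} FLOPs")
```

The comparison makes the trade-off concrete: adding the Audio pathway to R50 costs a small fraction of the extra computation that a deeper visual backbone requires.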

For future comparisons, we also show results on v2.2 of AVA, which provides more consistent annotations. The results are consistent with those for v2.1. As for per-class results, we found that classes like [“swim” +30.2%], [“dance” +10.0%], [“shoot” +8.6%], and [“hit (an object)” +7.6%] have the largest gains from audio; see appendix A.3 and Fig. A.1 for more details.

model inputs AVA pretrain val mAP GFLOPs
I3D [24] V+F v2.1 K400 15.6 N/A
ACRN, S3D [70] V+F 17.4 N/A
ATR, R50+NL [34] V+F 21.7 N/A
9-model ensemble [34] V+F 25.6 N/A
I3D+Transformer [22] V 25.0 N/A
LFB, + NL R50 [81] V 25.8 N/A
LFB, + NL R101 [81] V 26.8 N/A
SlowFast 4×16, R50 V 24.3 65.7
AVSlowFast 4×16, R50 A+V 25.4 67.1
SlowFast 8×8, R101 V 26.3 184
AVSlowFast 8×8, R101 A+V 27.8 210
SlowFast 4×16, R50 V v2.2 K400 24.7 65.7
AVSlowFast 4×16, R50 A+V 25.9 67.1
SlowFast 8×8, R101 V 27.4 184
AVSlowFast 8×8, R101 A+V 28.6 210
Table 7: Comparison with the state-of-the-art on AVA. Both AVSlowFast and SlowFast use 8×8 frame inputs. For R101, both AVSlowFast and SlowFast also use NL [79].
method inputs #param FLOPs pretrain UCF HMDB
Shuffle&Learn [49, 69] V 58.3M N/A K600 26.5 12.6
3D-RotNet [35, 69] V 33.6M N/A K600 47.7 24.8
CBT [69] A+V N/A N/A K600 54.0 29.5
AVSlowFast A+V 38.5M 63.4G K400 77.4 42.2
Table 8: Comparison under the linear classification protocol. We only train the last fc layer after self-supervised pretraining on Kinetics-400 (abbreviated as K400). Top-1 accuracy averaged over three splits is reported for comparison to previous work. AVSlowFast uses an R50 backbone with 8×8 sampling. For CBT, A+V refers to using transcripts obtained from ASR on audio.

6 Experiments: Self-supervised Learning

To further demonstrate the generalization of AVSlowFast models, we apply them to self-supervised audiovisual representation learning. To evaluate whether AVSlowFast is readily applicable to existing audio/visual tasks, we directly adopt the audiovisual synchronization [2, 42, 52] and image rotation prediction [21] (0°, 90°, 180°, 270°; as a four-way softmax classification) losses proposed in the existing literature. With the learned representation, we then re-train the last fc layer of AVSlowFast on UCF101 [64] and HMDB51 [43] following standard practice. Table 8 lists the results. Using off-the-shelf pretext tasks, our smallest AVSlowFast, R50 model compares favorably to state-of-the-art self-supervised classification accuracy on both datasets, highlighting the strength of our architecture. For more details and results, please refer to appendix A.1.
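To make the rotation pretext task concrete, the following sketch builds the four rotated copies of a clip together with their class labels and evaluates a four-way softmax cross-entropy. It is a NumPy illustration under stated assumptions (square frames, all frames of a clip rotated by the same angle, hypothetical function names), not the training code used for the reported results.

```python
import numpy as np

ROTATIONS = (0, 90, 180, 270)  # degrees; the class label is the index into this tuple


def make_rotation_batch(clip):
    """Return the four rotated copies of a clip and their class labels.

    clip: array of shape (T, H, W, C) with H == W, so all rotations share one shape.
    """
    rotated = [np.rot90(clip, k=k, axes=(1, 2)) for k in range(len(ROTATIONS))]
    labels = np.arange(len(ROTATIONS))
    return np.stack(rotated), labels


def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def rotation_loss(logits, labels):
    """Mean four-way softmax cross-entropy over the rotated copies."""
    probs = softmax(logits)
    return float(-np.log(probs[np.arange(len(labels)), labels]).mean())


clip = np.arange(2 * 4 * 4 * 3, dtype=np.float32).reshape(2, 4, 4, 3)
batch, labels = make_rotation_batch(clip)
print(batch.shape, labels)
```

A network that outputs uniform logits incurs a loss of log 4 on this task, which is the chance-level reference point for the four-way classification.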

7 Conclusion

This work has presented AVSlowFast Networks, an architecture for integrated audiovisual perception. We demonstrate the effectiveness of AVSlowFast with state-of-the-art performance on multiple datasets for video action classification and detection. We hope that AVSlowFast Networks will foster further research in video understanding.

Appendix A Appendix

A.1 Results: Self-supervised Learning

method inputs #param FLOPs pretrain UCF HMDB
Shuffle&Learn [49, 69] V 58.3M N/A K600 26.5 12.6
3D-RotNet [35, 69] V 33.6M N/A K600 47.7 24.8
CBT [69] A+V N/A N/A K600 54.0 29.5
AVSlowFast 4×16 A+V 38.5M 36.2G K400 76.8 41.0
AVSlowFast 8×8 A+V 38.5M 63.4G K400 77.4 42.2
AVSlowFast 16×4 A+V 38.5M 117.9G K400 77.4 44.1
ablation (split1)
SlowFast 4×16 (ROT) V 33.0M 34.2G K400 71.9 42.0
AVSlowFast 4×16 (AVS) A+V 38.5M 36.2G K400 73.2 39.5
AVSlowFast 4×16 A+V 38.5M 36.2G K400 77.0 40.2
Table A.1: Comparison using the linear classification protocol. We only train the last fc layer after self-supervised pretraining on Kinetics-400 (abbreviated as K400). Top-1 accuracy averaged over three splits is reported when comparing to previous work (top); results on split1 are used for ablation (bottom). All SlowFast models use R50 backbones with the listed sampling. For CBT, A+V refers to using transcripts obtained from ASR on audio.
method inputs #param pretrain UCF101 HMDB51
Shuffle & Learn [49] V 58.3M UCF/HMDB 50.2 18.1
OPN [45] V 8.6M UCF/HMDB 59.8 23.8
O3N [17] V N/A Kinetics-400 60.3 32.5
3D-RotNet [35] V 33.6M Kinetics-400 62.9 33.7
3D-ST-Puzzle [39] V 33.6M Kinetics-400 65.8 33.7
DPC [25] V 32.6M Kinetics-400 75.7 35.7
CBT [69] A+V N/A Kinetics-600 79.5 44.6
Multisensory [52] A+V N/A Kinetics-400 82.1 N/A
AVTS [41] A+V N/A Kinetics-400 85.8 56.9
VGG-M motion [62, 16] V 90.7M - 83.7 54.6
AVSlowFast A+V 38.5M Kinetics-400 87.0 54.6
Table A.2: Comparison when training all layers. Results use the popular protocol of fine-tuning all layers after self-supervised pretraining. Top-1 accuracy averaged over three splits is reported. We use AVSlowFast 16×4, R50 for this experiment. For CBT, A+V indicates that it uses transcripts obtained from ASR on the audio input. While this protocol has been used in the past, we think it is suboptimal for the evaluation of self-supervised representations, as the training of all layers can significantly impact performance; e.g., an AlexNet-like VGG-M motion stream [62, 16] can perform among state-of-the-art self-supervised approaches without any pretraining.

In this section, we provide more results and detailed analysis on self-supervised learning using AVSlowFast. Training schedule and details are provided in §


First, we pretrain AVSlowFast with the self-supervised objectives of audiovisual synchronization [2, 42, 52] (AVS) and image rotation prediction [21] (ROT) on Kinetics-400. Then, following the standard linear classification protocol used for image recognition tasks [26], we use the pretrained network as a fixed, frozen feature extractor and train a linear classifier on top of the features learned with self-supervision. In Table A.1 (top), we compare to previous work that follows the same protocol. We note that this is the same experiment as in Table 8, but with additional ablations on our models. The results indicate that the features learned by AVSlowFast are significantly better than the baselines, including the recently introduced CBT method [69] (+23.4% for UCF101 and +14.6% for HMDB51), which also uses ROT as well as a contrastive bidirectional transformer (CBT) loss and is pretrained on the larger Kinetics-600 with transcripts obtained from audio.

In addition, we ablate the contributions of the individual AVS and ROT tasks in Table A.1 (bottom). On UCF101, SlowFast/AVSlowFast trained under either the ROT or the AVS objective shows strong individual performance, while their combination performs best. On the smaller HMDB51, by contrast, all three variants of our method perform similarly well and audio seems less important.

Another aspect is that, although many previous approaches to self-supervised feature learning focus on reporting the number of parameters, FLOPs are another important factor to consider: as shown in Table A.1 (top), performance keeps increasing when we use clips of higher temporal resolution (i.e., larger FLOPs), even though the model parameters remain identical.

Although we think the linear classification protocol is a better way to evaluate self-supervised feature learning (as features are frozen and therefore less sensitive to hyper-parameter settings such as the learning schedule and regularization, especially when these datasets are relatively small), we also evaluate by fine-tuning all layers of AVSlowFast on the target datasets to compare to a larger corpus of previous work on self-supervised feature learning. Table A.2 shows that AVSlowFast achieves competitive performance compared to prior work under this setting. When using this protocol, we believe it is reasonable to also consider methods that train multiple layers on UCF/HMDB from scratch, such as optical-flow based motion streams [62, 16]. It is interesting that this stream, despite being an AlexNet-like model [10], is comparable to or better than many newer models pretrained on (the large) Kinetics-400 using self-supervised learning techniques.

Figure A.1: AVA per-class average precision. AVSlowFast (27.8 mAP) vs. its SlowFast counterpart (26.3 mAP). The highlighted categories are the 5 highest absolute increases (bold) and top 5 relative increases over SlowFast (orange). Best viewed in color with zoom.

A.2 Results: Audio-only Classification

To understand the effectiveness of our Audio pathway, we evaluate it in terms of Audio-only classification accuracy on Kinetics (in addition to Kinetics-400, we also train and evaluate on Kinetics-600 to be comparable to methods that use this data in challenges [20]). In Table A.3, we compare our Audio-only network to several other audio models. We observe that our Audio-only model performs better than existing methods by solid margins (+3.3% top-1 accuracy on Kinetics-600 and +3.2% on Kinetics-400, compared to best-performing methods), which demonstrates the effectiveness of our Audio pathway design. Note also that unlike some other methods in Table A.3, we train our audio network from scratch on Kinetics, without any pretraining.

model dataset pretrain top-1 top-5 GFLOPs
VGG* [29] Kinetics-600 Kinetics-400 23.0 N/A N/A
SE-ResNext [20] Kinetics-600 ImageNet 21.3 38.7 N/A
Inception-ResNet [20] Kinetics-600 ImageNet 23.2 N/A N/A
Audio-only (ours) Kinetics-600 - 26.5 44.7 14.2
VGG [8] Kinetics-400 - 21.6 39.4 N/A
GBlend [78] Kinetics-400 - 19.7 33.6 N/A
Audio-only (ours) Kinetics-400 - 24.8 43.3 14.2
Table A.3: Results of Audio-only models. VGG* model results are taken from “iTXN” submission from Baidu Research to ActivityNet challenge, as documented in this report [20].

A.3 Results: Classification & Detection Analysis

Comparison to GBlend

We also explored using Gradient Blending (GBlend) [78] to train our base AVSlowFast 4×16, R50 model. Specifically, we add a prediction head that classifies Kinetics classes from audio features only, and set the loss weight for audio to 1/3, which is approximately the ratio between the accuracies of the audio and visual models (24.8% vs. 75.6%). We also turn off DropPathway so that we can compare it to the effect of GBlend in facilitating multi-modal training. We keep all other hyper-parameters the same. Interestingly, we found that GBlend does not help in our setting, as it yields 75.1% top-1 accuracy (vs. 75.6% for our visual SlowFast model and 77.0% for our AVSlowFast model trained with DropPathway). We hypothesize that this is because GBlend targets late fusion of audio and visual streams.

Per-class analysis on Kinetics

Comparing AVSlowFast to SlowFast (77.0% vs. 75.6% for the 4×16, R50 backbone), the classes that benefit most from audio include [“dancing macarena” +24.5%], [“whistling” +24.0%], [“beatboxing” +20.4%], [“salsa dancing” +19.1%] and [“singing” +16.0%]. Clearly, all these classes have distinct sound signatures. On the other hand, classes like [“skiing (not slalom or crosscountry)” -12.3%], [“triple jump” -12.2%], [“dodgeball” -10.2%] and [“massaging legs” -10.2%] have the largest performance loss, as the sound of these classes tends to be much less correlated with the action.

Per-class analysis on AVA

We compare per-class results of AVSlowFast to its SlowFast counterpart in Fig. A.1. As mentioned in the main paper, the classes with the largest absolute gain (marked with bold black font) are “swim”, “dance”, “shoot”, “hit (an object)” and “cut”. Further, the classes “push (an object)” (3.2×) and “throw” (2.0×) benefit greatly from audio in relative terms (marked with orange font in Fig. A.1). As expected, all these classes have strong sound signatures that are easy to recognize from audio. On the other hand, the largest performance loss arises for classes such as “watch (e.g., TV)”, “read”, “eat” and “work on a computer”, which either do not have a distinct sound signature (“read”, “work on a computer”) or have strong background noise (“watch (e.g., TV)”). We believe that explicitly modeling foreground and background sound might be a fruitful future direction to alleviate these challenges.

A.4 Details: Kinetics Action Classification

We train our models on Kinetics from scratch without any pretraining. Our training and testing closely follow [14]. We use a synchronous SGD optimizer and train with 128 GPUs using the recipe in [23]. The mini-batch size is 8 clips per GPU (so the total mini-batch size is 1024). The initial base learning rate is $\eta = 1.6$ and we decrease it according to a half-period cosine schedule [46]: the learning rate at the $n$-th iteration is $\eta \cdot 0.5\,[\cos(\frac{n}{n_{\max}}\pi) + 1]$, where $n_{\max}$ is the maximum number of training iterations. We adopt a linear warm-up schedule [23] for the first 8k iterations. We use a scale jittering range of [256, 340] pixels for the R101 model to improve generalization [14]. To aid convergence, we initialize all models that use Non-Local blocks (NL) from their counterparts that are trained without NL. We only use NL on res4 (instead of the res3+res4 used in [79]).
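The half-period cosine schedule with linear warm-up described above can be sketched as follows. The base learning rate of 1.6 and the 8k warm-up iterations come from the text; the warm-up starting value `warmup_start_lr` and the choice to interpolate linearly up to the cosine value at the end of warm-up are assumptions made for illustration.

```python
import math


def learning_rate(iteration, max_iter, base_lr=1.6, warmup_iters=8_000,
                  warmup_start_lr=0.01):
    """Half-period cosine schedule with a linear warm-up.

    warmup_start_lr is a hypothetical value chosen for this sketch.
    """
    if iteration < warmup_iters:
        # Interpolate linearly from warmup_start_lr to the cosine value
        # that the schedule reaches at the end of the warm-up phase.
        warmup_end_lr = base_lr * 0.5 * (
            math.cos(math.pi * warmup_iters / max_iter) + 1.0
        )
        alpha = iteration / warmup_iters
        return warmup_start_lr + alpha * (warmup_end_lr - warmup_start_lr)
    # Cosine decay from base_lr at iteration 0 to zero at max_iter.
    return base_lr * 0.5 * (math.cos(math.pi * iteration / max_iter) + 1.0)


for it in (0, 8_000, 30_000, 60_000):
    print(it, round(learning_rate(it, max_iter=60_000), 4))
```

Halfway through training the cosine term vanishes, so the learning rate equals half of the base value, and it decays to zero at the final iteration.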

We train with Batch Normalization (BN) [33], and the BN statistics are computed within each 8 clips. Dropout [30] with rate 0.5 is used before the final classifier layer. In total, we train for 256 epochs (60k iterations with batch size 1024, for 240k Kinetics videos) when the Slow pathway has at most 4 frames, and for 196 epochs when it has more than 4 frames: it is sufficient to train shorter when a clip has more frames. We use a momentum of 0.9 and a weight decay of 10^-4.
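The relation between iterations, batch size and epochs quoted above can be verified with a few lines of arithmetic. The sketch treats the dataset size as exactly 240k videos, which is a rounding assumption.

```python
def epochs_from_iterations(iterations, batch_size, num_videos):
    """Number of passes over the dataset made by `iterations` mini-batches."""
    return iterations * batch_size / num_videos


def iterations_from_epochs(epochs, batch_size, num_videos):
    """Number of mini-batches needed for `epochs` passes over the dataset."""
    return epochs * num_videos / batch_size


total_batch = 8 * 128  # 8 clips per GPU on 128 GPUs
long_schedule_epochs = epochs_from_iterations(60_000, total_batch, 240_000)
short_schedule_iters = iterations_from_epochs(196, total_batch, 240_000)
print(total_batch, long_schedule_epochs, short_schedule_iters)
```

Under the same assumptions, the shorter 196-epoch schedule corresponds to roughly 46k iterations.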

A.5 Details: EPIC-Kitchens Classification

We fine-tune from the Kinetics-pretrained AVSlowFast 8×8, R101 (w/o NL) for this experiment. For fine-tuning, we freeze all BNs by converting them into affine layers. We train on a single machine with 8 GPUs. The initial base learning rate is set to 0.01 and 0.0006 for verb and noun, respectively. We train with batch size 32 for 24k and 30k iterations for verb and noun, respectively. We use a step-wise decay of the learning rate by a factor of 10 at 2/3 and 5/6 of full training. For simplicity, we only use a single center crop for testing.
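A minimal sketch of the step-wise decay at 2/3 and 5/6 of training is given below. Milestones are stored as integer fractions so that the boundary iterations are computed exactly; whether the decay takes effect at or just after the milestone iteration is an assumption.

```python
def step_lr(iteration, total_iters, base_lr, milestones=((2, 3), (5, 6)), gamma=0.1):
    """Learning rate decayed by `gamma` at each (numerator, denominator) milestone."""
    # Integer comparison avoids floating-point error at the milestone boundaries.
    drops = sum(iteration * den >= num * total_iters for num, den in milestones)
    return base_lr * gamma ** drops


# Settings from the text: verb uses lr 0.01 for 24k iterations,
# noun uses lr 0.0006 for 30k iterations.
for it in (0, 16_000, 20_000):
    print("verb", it, step_lr(it, 24_000, 0.01))
for it in (0, 20_000, 25_000):
    print("noun", it, step_lr(it, 30_000, 0.0006))
```

For the verb schedule the two decays fall at iterations 16k and 20k, and for the noun schedule at 20k and 25k.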

A.6 Details: Charades Action Classification

We fine-tune from the Kinetics-pretrained AVSlowFast 16×8, R101 + NL model to account for the longer activity range of this dataset, and use a per-class sigmoid output to account for the multi-class nature of the data. We train on a single machine (8 GPUs) for 40k iterations using a batch size of 8 and a base learning rate of 0.07 with one 10× decay after 32k iterations. We use a Dropout rate of 0.7. For inference, we temporally max-pool scores [79, 14]. All other settings are the same as those of Kinetics.
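The per-class sigmoid output and temporal max-pooling of scores can be illustrated as follows. The sketch assumes that each sampled clip yields one vector of class logits and that pooling is applied across clips of the same video; the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def video_scores(clip_logits):
    """Temporally max-pool per-class sigmoid scores.

    clip_logits: array of shape (num_clips, num_classes).
    Returns an array of shape (num_classes,) with independent per-class scores.
    """
    return sigmoid(clip_logits).max(axis=0)


# Two clips, two classes: each class keeps its highest score over time.
logits = np.array([[0.0, -2.0],
                   [2.0, -1.0]])
scores = video_scores(logits)
print(scores)
```

Because each class has its own sigmoid, the scores are not normalized across classes, so several actions can be predicted for the same video.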

A.7 Details: AVA Action Detection

We follow the detection architecture introduced in [14], which is adapted from Faster R-CNN [56] for video. Specifically, we change the spatial stride of res5 from 2 to 1, thus increasing the spatial resolution of res5 by 2×. RoI features are then computed by applying RoIAlign [27] spatially and global average pooling temporally. These features are then fed to a per-class, sigmoid-based classifier for multi-label prediction. Again, we initialize from Kinetics-pretrained models and train for 52k iterations with an initial learning rate of 0.4 and batch size 16 (we train across 16 machines, so the effective batch size is 16×16=256). We pre-compute proposals using an off-the-shelf Faster R-CNN person detector with a ResNeXt-101-FPN backbone. It is pretrained on ImageNet and the COCO human keypoint data; more details can be found in [14, 81].

A.8 Details: Self-supervised Evaluation

For self-supervised pretraining, we train on Kinetics-400 for 120k iterations with a per-machine batch size of 64 across 16 machines and an initial learning rate of 1.6, similar to §A.4, but with a step-wise schedule. The learning rate is decayed by 10× three times, at 80k, 100k and 110k iterations. We use linear warm-up (starting from learning rate 0.001) for the first 10k iterations. As noted in Sec. 6, we adopt the curriculum learning idea for audiovisual synchronization [41]: we first train with easy negatives for the first 60k iterations and then switch to a mix of easy and hard negatives (1/4 hard, 3/4 easy) for the remaining 60k iterations. The easy negatives come from different videos, while hard negatives have a temporal displacement of at least 0.5 seconds.
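The curriculum over negatives can be sketched as below. The sketch assumes that the negative type is drawn independently per sample, and the upper bound `max_shift` on the temporal displacement is a hypothetical parameter; the text only specifies the minimum displacement of 0.5 seconds and the 1/4 to 3/4 mixing ratio.

```python
import random


def sample_negative_type(iteration, rng, switch_iter=60_000, hard_fraction=0.25):
    """Easy negatives only before switch_iter, then a mix of hard and easy."""
    if iteration < switch_iter:
        return "easy"
    return "hard" if rng.random() < hard_fraction else "easy"


def hard_negative_offset(rng, min_shift=0.5, max_shift=2.0):
    """Signed temporal displacement in seconds for a hard negative.

    max_shift is a hypothetical bound chosen for this sketch.
    """
    magnitude = rng.uniform(min_shift, max_shift)
    return magnitude if rng.random() < 0.5 else -magnitude


rng = random.Random(0)
early = [sample_negative_type(1_000, rng) for _ in range(1_000)]
late = [sample_negative_type(90_000, rng) for _ in range(20_000)]
hard_share = late.count("hard") / len(late)
offsets = [hard_negative_offset(rng) for _ in range(1_000)]
print(f"hard negatives after the switch: {hard_share:.1%}")
```

An easy negative would pair audio and video from different videos, while a hard negative would pair a clip with audio from the same video shifted by the sampled offset.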

For the “linear classification protocol” experiments on UCF and HMDB, we train for 320k iterations (echoing [40], we found it beneficial to train for many iterations in this setting) with an initial learning rate of 0.01, a half-period cosine decay schedule and a batch size of 64 on a single machine with 8 GPUs. For the “train all layers” setting, we train for 80k / 30k iterations with batch size 16 (also on a single machine), an initial learning rate of 0.005 / 0.01 and a half-period cosine decay schedule, for UCF and HMDB, respectively.

Appendix B Details: Kinetics-Sound dataset

The original 34 classes selected in [2] are based on an earlier version of the Kinetics dataset. Some classes have been removed since then. Therefore, we use the following 32 classes that are kept in the current version of the Kinetics-400 dataset: “blowing nose”, “blowing out candles”, “bowling”, “chopping wood”, “dribbling basketball”, “laughing”, “mowing lawn”, “playing accordion”, “playing bagpipes”, “playing bass guitar”, “playing clarinet”, “playing drums”, “playing guitar”, “playing harmonica”, “playing keyboard”, “playing organ”, “playing piano”, “playing saxophone”, “playing trombone”, “playing trumpet”, “playing violin”, “playing xylophone”, “ripping paper”, “shoveling snow”, “shuffling cards”, “singing”, “stomping grapes”, “strumming guitar”, “tap dancing”, “tapping guitar”, “tapping pen”, “tickling”.