Composable Augmentation Encoding for Video Representation Learning

04/01/2021 ∙ by Chen Sun, et al. ∙ Google, MIT

We focus on contrastive methods for self-supervised video representation learning. A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives. These methods implicitly assume a set of representational invariances to the view selection mechanism (e.g., sampling frames with temporal shifts), which may lead to poor performance on downstream tasks that violate these invariances (e.g., fine-grained video action recognition that would benefit from temporal information). To overcome this limitation, we propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations (such as the values of the time shifts used to create data views) as composable augmentation encodings (CATE) to our model when projecting the video representations for contrastive learning. We show that representations learned by our method encode valuable information about specified spatial or temporal augmentation, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.




1 Introduction

Figure 1: Standard self-supervised contrastive learning methods are augmentation-invariant: they ingest two augmented views of the same instance and encourage their latent representations to be similar. For videos, if the two views were sampled with a temporal shift, this approach would learn representations that are invariant to time changes, losing valuable temporal dynamics information (e.g. the action of slicing oranges) for downstream tasks. We instead project latent representations by additionally encoding the relative transformation of the two views (e.g. the time shift between them) with a composable augmentation encoding (CATE) to make the representation augmentation-aware.

We focus on contrastive learning [16] of self-supervised video representations. The contrastive objective is simple: it pulls the latent representations of positive pairs close to each other, while pushing negative pairs apart. It is a natural fit for self-supervised learning, where positive and negative pairs can be constructed from the data itself, without the need for additional annotations.

Amongst different positive pair generation techniques, a particularly successful one has been augmentation-invariant contrastive learning [8, 20, 60], which has shown impressive results for image representation learning. In this instance discrimination framework, positive pairs are constructed by applying artificial aggressive photo-geometric data augmentations to create different versions of the same instances – learned representations are thus encouraged to be invariant to these data augmentations.

A number of self-supervised works also extend this idea to videos, where instead of artificial augmentations, they treat temporal shift as a natural data augmentation [44, 55]. While this objective is useful for capturing high-level semantic information, e.g. knowledge of object categories, it can remove fine-grained information that may be useful depending on the downstream task, as previously studied in images [8, 62]. In the case of the oranges in Figure 1, if the task is object classification, invariance to small time shifts in a video that can result in viewpoint changes or shape deformations may be useful and even desirable. However, for a downstream task involving reasoning about temporal relationships, such as recognising the state transition between a whole orange and orange slices caused by the act of cutting fruit, a representation invariant to temporal shifts may have lost valuable information.

We argue that augmentation-aware information can be retained if the relative augmentations of the two views are known to the contrastive learning framework. For the oranges in Figure 1, an encoder could keep the shape information of the sliced orange view if it is aware that the other view is behind in time (and likely to be a whole orange).

We therefore propose a generalised framework for self-supervised video representation learning as follows. For notational convenience, we use the term data augmentation to cover all parameterisable data transformations, including shifts in space or in time. We then apply these generalised data augmentations to create different views of the same data, as is done in previous augmentation-invariant contrastive learning. However, instead of directly applying a contrastive loss on these views, we apply an additional projection head that optionally also encodes the augmentations that were used to create the views in the first place. For example, given two views of an image obtained via cropping, this can be an encoding of the bounding box co-ordinates of the cropping spatial transformation. For videos, this encoding can also include information about the temporal relationship between the views (e.g. a 5-second shift), in addition to spatial transformations. In the case of missing or occluded sequential data, we can also specify the particular locations to be predicted. When no such encodings are provided, our framework becomes standard augmentation-invariant contrastive learning.

We formulate this framework as a prediction task, given sequential data as input. The input sequence in this case contains encoded visual representations to be learned, and optionally a set of encoded data transformations for prediction. By using transformers, we can easily compose multiple encoded data transformations in the input sequence. We refer to this projection as a Composable AugmenTation Encoding (CATE) model. When data transformations are explicitly encoded in this way, our training objective can motivate a model to utilise such information if it helps with learning (e.g. learning the temporal dynamics that oranges can be cut into slices). We choose to always sample negative pairs across different instances, so the model has the freedom to ignore the transformation encodings if they do not help in reducing the contrastive loss. However, empirical results show that they are almost always utilised. We conduct thorough evaluations to test the efficacy of our framework, and in doing so make the following contributions: (1) we propose Composable AugmenTation Encoding (CATE) to learn augmentation-aware representations, and validate that CATE learns representations that preserve useful information (e.g. location, arrow of time) more effectively than a view-invariant baseline without augmentation encoding; (2) we perform a number of ablations on augmentation type and parameterisation, and observe that different downstream tasks favour the awareness of different augmentations: temporal awareness is particularly helpful for fine-grained action recognition. We also find that, when encoding temporal information, encoding both the arrow of time and the absolute value of the temporal shift outperforms using just the arrow of time; (3) we set a new state-of-the-art for self-supervised learning on Something-Something [15], a dataset designed for fine-grained action recognition; and finally (4) we also achieve state-of-the-art performance on standard benchmarks such as HMDB51 [30] and UCF101 [50].


2 Related Works

Contrastive Learning. Recently the most competitive self-supervised representation learning methods have used contrastive learning [40, 60, 23, 21, 53, 20, 8, 9]. This idea dates back to [16], where contrastive learning was formulated as binary classification with margins. Modern contrastive approaches rely on a large number of negatives [60, 53, 20] and therefore a common technique is to employ the k-pair InfoNCE loss [49, 40]. By choosing different positive pairs or 'views' of the data, contrastive learning can encourage different representational invariances, e.g., luminance and chrominance [53], rotation [37], image augmentations [20, 8], temporal shifts [40, 47, 17, 68, 45], text and its contexts [36, 32, 29], and multi-modal invariance [39, 10, 42, 46].

A number of works have highlighted the issues with these inbuilt representational invariances. InfoMin [54] demonstrates that different invariances favour different downstream tasks, and proposes that optimal views for a given task should only be invariant to irrelevant factors of that task. In [44], occlusion-invariance is shown to benefit the downstream task of object detection. However, designing task-dependent invariances requires knowing the downstream task beforehand and may make the learned representations less general. To overcome this, [62] learns multiple embedding spaces, each invariant to all but one augmentation. This means that the model (projection head) complexity grows linearly with the cardinality of the set of invariances. Instead, our transformer based projection head takes in a sequence of composable encodings, and can modulate the invariances that we want with a fixed model complexity.

Figure 2: Overview of our contrastive learning framework CATE: Positive pairs are constructed from the same instance. For each view, a random set of data augmentations (e.g. temporal shifting, spatial cropping) is sampled and applied. The views are then encoded by a shared visual encoder (3D ConvNets for videos). Encoded visual features, along with parameterised and embedded data augmentations, are then passed to a transformer head (this contains multiple layers; only the input layer is shown for simplicity) which summarises the input sequence and generates projected features for contrastive learning. In this example, the bottom transformer head is tasked to predict the features of the other view, knowing the temporal augmentation (predict features a given number of seconds ahead in time) and the spatial augmentation (shift of box coordinates) relative to that view. The visual encoder is transferred to the downstream tasks.

Self-supervised Learning for Video. Videos offer additional opportunities for learning representations beyond those of images, by exploiting spatio-temporal and multimodal [3, 41] information. Some works extend spatial tasks to the space-time dimensions of videos by using rotation [25] or jigsaw solving [28], while others use temporal information by ordering frames or clips [31, 14, 38, 59, 64], future prediction [56, 17, 18], speed prediction [5, 58], or motion [1, 12]. TaCo [4] consolidates these works by combining different temporal augmentations: shuffling frames, ordering them in reverse, and changing speed. Similar to us, they try to build augmentation awareness, but do so by adding a different pretext task head besides the projection head for each temporally transformed video, so the model once again grows linearly with the cardinality of the set of augmentations.

3 Method

This section describes our unified CATE framework for contrastive learning. We begin by providing an overview of contrastive learning. We then describe two paradigms in contrastive learning - ‘view-invariant’ and ‘predictive coding’, and show how our framework consolidates both using a transformer projection head. We also discuss how existing contrastive learning approaches can be viewed as special cases of our framework. An illustrative overview of our proposed framework is provided in Figure 2.

3.1 Contrastive Learning

Contrastive learning methods learn representations by maximising agreement between differently augmented views of the same data example (a positive pair) while pushing apart different data samples (negative pairs). The construction of views for a positive pair can be very general, e.g., via a stochastic data augmentation module or by sampling co-occurring modalities from multi-sensory data. Formally, given two random variables $v_1$ and $v_2$, contrastive learning seeks to learn a function $h$ that discriminates samples from the joint distribution $p(v_1, v_2)$ and samples from the product of marginals $p(v_1)\,p(v_2)$. Therefore, there is a natural connection with mutual information maximisation, and as shown in [40] contrastive learning can be viewed as maximising a lower bound on the mutual information between representations of $v_1$ and $v_2$. Specifically, given an anchor point $v_{1,i}$, its paired $v_{2,i}$, and a set of negatives $v_{2,j}$ ($j \neq i$), the InfoNCE loss is defined as follows:

$$\mathcal{L}_{\mathrm{NCE}} = -\,\mathbb{E}\left[\log \frac{\exp\big(h(v_{1,i}, v_{2,i})\big)}{\sum_{j=1}^{K} \exp\big(h(v_{1,i}, v_{2,j})\big)}\right],$$

where the sum runs over the positive and all negatives, $K$ candidates in total.

The critic function $h$ typically consists of one or more backbone networks [8, 20, 53], projection heads [60, 8], and a cosine similarity function. This function is optimised to assign a high score to positive pairs $(v_{1,i}, v_{2,i})$ and a low score to negative pairs $(v_{1,i}, v_{2,j})$, $j \neq i$. Minimising this InfoNCE loss is equivalent to maximising a lower bound on the mutual information between $v_1$ and $v_2$, denoted as $I(v_1; v_2)$:

$$I(v_1; v_2) \geq \log K - \mathcal{L}_{\mathrm{NCE}}.$$
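To make the loss concrete, the following is a minimal sketch of a batched InfoNCE computation with a cosine-similarity critic. The temperature value and the function names (`info_nce`, `mutual_information_bound`) are assumptions made for illustration and are not taken from the paper; the backbone and projection head are omitted, so the inputs stand for already-projected embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and candidates[i] form the positive pair; every
    candidates[j] with j != i acts as a negative for anchors[i].
    The temperature is an assumed hyperparameter.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [cosine_similarity(anchor, c) / temperature for c in candidates]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_denominator - scores[i])
    return sum(losses) / len(losses)


def mutual_information_bound(loss, num_candidates):
    """Lower bound log(K) - loss on the mutual information."""
    return math.log(num_candidates) - loss


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
uninformative = info_nce([[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
```

When all candidates score equally, the loss equals $\log K$ and the bound is zero; well-separated positives drive the loss towards zero and the bound towards $\log K$.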


3.2 Invariant and Predictive Coding

For visual representation learning, positive pairs can be constructed from unlabeled data in a self-supervised fashion. Following the notation of [8], a popular approach is instance discrimination, where two views $v_i$, $v_j$ of the same instance $x$ are generated by applying independently sampled random data augmentations:

$$v_i = t(x; \theta_i), \quad v_j = t(x; \theta_j), \qquad z_i = g(f(v_i)), \quad z_j = g(f(v_j)),$$

where $t$ is the data augmentation operation parameterised by $\theta$, $f$ is the visual encoder, usually implemented using ConvNets, whose outputs are used for transfer learning, and $g$ is a projection head implemented by a Multilayer Perceptron (MLP). Note here that both $f$ and $g$ are shared across views. We refer to this approach henceforth as invariant coding.

Predictive coding, on the other hand, provides an alternative approach, where certain regions within the same instance are masked out, such as different words in a sentence, different objects in an image or different frames in a video. This creates two different views, the observed and 'missing' regions of the instance, $v_i$ and $v_j$:

$$v_i = m(x; \theta_i), \quad v_j = m(x; \theta_j),$$

where $m$ is the masking function, parameterised by $\theta$. Without loss of generality, we assume $\theta$ to be additive. This is a reasonable assumption for visual data, which can be considered as sequential, so that the indices to be masked (timestamp for video, pixel location for image) are additive. The representations to be contrasted are then computed as:

$$z_i = g(f(v_i), \theta_j - \theta_i), \quad z_j = g(f(v_j), 0).$$

Unlike with view-invariant coding, the projection head $g$ now also takes $\theta$ as input: it is tasked to predict the representation $z_j$ conditioned on $\theta_j - \theta_i$ (which could represent the time information used to generate the views, e.g. 'five seconds ahead'), and $0$ indicates the identity operation. In practice, $g$ can be easily implemented with a recurrent neural network as in [40] or a transformer with positional embeddings as in [11].

We note that the predictive projection head does not necessarily need to take $\theta$ into account, assuming $f$ already encodes information invariant of $\theta$. However, arguably a much simpler task is to have $f$ encode the information useful for prediction and $g$ serve as the predictor (e.g. generating new words for text, or learning the temporal dynamics for videos).

3.3 A Unified CATE Framework

Both the augmentation operation $t$ and the masking function $m$ can be seen as atomic data operations that 'transform' the data based on their parameters $\theta$, and multiple data operations can be chained together. We can therefore unify invariant and predictive coding using the following notation:

$$z_i = g\big(f(T(x; \Theta_i)),\, e(\Theta_j - \Theta_i)\big), \quad z_j = g\big(f(T(x; \Theta_j)),\, e(0)\big),$$

where $\Theta$ is a sequence of atomic data transformations, $T$ applies the sequence of data augmentations on the input data, and $e$ is an encoder that takes $\Theta$ as input.

Given a sequence of atomic data operations, we can control which are view-invariant and which are predictive with $e$: when we instruct $e$ to ignore certain types of operations, our method is 'view-invariant' to such operations; otherwise we say that such operations are predictive.

We implement the projection head $g$ using a transformer. Our transformer model takes a set of inputs, among them the encoded visual features produced by the visual encoder $f$. We project the features to have the same size as the hidden size of the transformer with a linear layer. Additionally, each selected data operation is encoded into an embedding by an encoder dedicated to the operation type. For example, for cropping, the inputs would be the differences between the coordinates of cropped boxes (for implementation details see Sec. 4.2). Finally, we have a special [CLS] token which 'summarises' the information from visual features and augmentation embeddings, and outputs a single embedding for each instance. The output embedding is then projected with a linear layer to the desired output size. A big advantage of our framework is that the transformer projection head can elegantly deal with variable length inputs, and hence multiple augmentation encodings can be composed with a fixed model capacity. We demonstrate in the experiments (Sec. 4.3) that composing both crop and time encodings improves performance.
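As an illustration of why a sequence model suits this design, here is a minimal sketch of how the head's input sequence could be assembled. It covers only the token layout ([CLS], projected visual feature, then any number of augmentation embeddings) and does not implement the transformer itself. The dimensions are shrunk for readability, and all names (`build_head_input`, `make_linear`, `apply_linear`) are hypothetical rather than the paper's.

```python
import random

HIDDEN = 8        # stands in for the 768-dim hidden size
FEATURE_DIM = 16  # stands in for the 2048-dim backbone output


def make_linear(in_dim, out_dim, rng):
    """Return a randomly initialised linear layer as (weights, bias)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def apply_linear(layer, x):
    """Apply a (weights, bias) linear layer to a vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def build_head_input(visual_feature, aug_embeddings, visual_proj, cls_token):
    """Assemble [CLS] + projected visual feature + augmentation embeddings."""
    tokens = [cls_token, apply_linear(visual_proj, visual_feature)]
    tokens.extend(aug_embeddings)  # zero, one or several encodings
    return tokens


rng = random.Random(0)
visual_proj = make_linear(FEATURE_DIM, HIDDEN, rng)
cls_token = [0.0] * HIDDEN  # a learned vector in practice; zeros here
feature = [rng.random() for _ in range(FEATURE_DIM)]
crop_embedding = [rng.random() for _ in range(HIDDEN)]
time_embedding = [rng.random() for _ in range(HIDDEN)]

no_encoding = build_head_input(feature, [], visual_proj, cls_token)
crop_and_time = build_head_input(
    feature, [crop_embedding, time_embedding], visual_proj, cls_token
)
print(len(no_encoding), len(crop_and_time))  # 2 4
```

The sequence grows with the number of encoded augmentations while the parameters (`visual_proj`, `cls_token`, and the transformer weights that would consume these tokens) stay fixed.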

Note that other common contrastive learning techniques can be expressed as special cases of our generalised framework: if the augmentation encoder $e$ is set to always return $0$ (the identity), our model is akin to SimCLR [8] using a transformer-based projection head. When the inputs to $e$ correspond to the masked indices from a sequence, our formulation coincides with that of BERT [11] or GPT [6].

4 Experiments

We first describe the datasets and their experimental setup (Sec. 4.1), and then delve into the implementation details (Sec. 4.2) of our framework. We then investigate a number of model ablations to better understand the design choices of CATE, described in Sec. 4.3. To further analyse our framework, we also design and evaluate on two proxy tasks that require knowledge of time in video: predicting time shifts and early action classification (Sec. 4.4). Finally, we compare performance to the state of the art on regular action classification benchmarks.

4.1 Datasets

Something-Something [15] v1: This is a video dataset focused on human-object interactions. The dataset has 108,499 videos with 174 categories of fine-grained human-object interactions. The action categories are designed to be focused on temporal information (time is needed to distinguish between picking up something and putting down something), and hence it has been observed [63] that for this dataset, capturing subtle temporal changes is important for good performance. We use Something-Something v1 (SSv1) for our main ablation experiments.

Something-Something [15] v2: SSv2 is built on top of SSv1 by expanding the dataset size to 220,847 videos. The dataset is then augmented by [33] with object bounding box annotations. We use SSv2 to compare with previously published results. For both SSv1 and SSv2, we adopt the linear evaluation protocol used by self-supervised learning on ImageNet, where both pretraining and evaluation are done on the same dataset. At pre-training, videos in the training split are used to learn the representations. At the evaluation stage, we first train a supervised linear classifier on top of the frozen representations in the training split, then report classification accuracy on the validation set. To compare with others, we report results on Something-Else, which defines a split of SSv2 for few-shot classification.


Kinetics-400: This dataset consists of 240K 10-second clips from YouTube videos with action labels covering 400 classes. We follow the standard practice of training on the trimmed clips in the training split but ignoring their action labels. To evaluate representations learned on Kinetics-400, we follow standard practice and report results on the following two datasets:

HMDB51: HMDB51 [30] contains 6,766 video clips from 51 action classes. Evaluation is performed using average classification accuracy over three train/test splits from [24], each with 3,570 train and 1,530 test videos.
UCF101: UCF101 [50] contains 13K videos downloaded from YouTube spanning 101 human action classes. Similar to HMDB51, and as is standard, evaluation is performed using average classification accuracy over three train/test splits. We pretrain on Kinetics and then evaluate on HMDB51 and UCF101 in the following two ways:
(i) Standard action classification: We report performance with both (a) linear evaluation on frozen features and (b) finetuning. This is to compare with the state-of-the-art.
(ii) Early action classification: This is to further understand the value of CATE, and results are provided in Sec. 4.4. Here we predict high level actions that may be performed in the future, given noisy visual evidence at the current time stamp, similar to what has been explored previously for action detection [52]. For this task, we train on just the first frame of the video in UCF and HMDB, and use only this single frame at test time as well.

Aug. Time Enc. Time Projection Top-1 Acc. Top-5 Acc.
MLP 17.1 40.9
Linear 20.9 45.9
MLP 26.4 55.2
Transformer 26.5 55.9
Tr. + MLP 24.6 53.2
Transformer 31.2 61.4

Encoded Dropout Top-1 Acc. Top-5 Acc.
No - - 26.5 55.9
Crop 27.2 56.7
Crop 28.1 58.0
Time 28.1 57.9
Time 31.3 62.4
Time 31.2 61.4

Table 1: Left: ablation showing the value of encoding time with different projection heads. Right: effect of regularisation on encodings. All results are on SSv1 with linear evaluation on frozen features. Left, top row: vanilla SimCLR. Row four: SimCLR++, i.e. SimCLR with time augmentation and a transformer projection head. Last row: when we encode time, we get a large boost in performance. On the right we show the impact of adding dropout to two augmentation encodings, crop and time shifts. For crop encoding, dropout makes the pre-training task harder (higher contrastive loss at convergence), and also improves downstream accuracy. For time encoding, it is important to encode not only the arrow of time, but also the relative distance. However, dropout regularisation does not help in this setting.

4.2 Implementation Details

Base Model: Our implementation is based on the SimCLR [8] code. Unless otherwise mentioned, we use a standard 3D ResNet-50 following the architecture of the 'slow' branch in SlowFast Networks [13]. Global batch normalization is used during contrastive pre-training, and local batch normalization is used during transfer learning. We use the standard data augmentations used by SimCLR [8]: random cropping, color jittering and Gaussian blur. For spatial cropping, we found it beneficial to limit the range of the cropped area relative to the original image area. For videos, we use time shifting as an additional atomic data operation. All the above spatial augmentations are applied consistently over time to avoid corrupting temporal continuity. We use a lightweight transformer encoder as the projection head, with a hidden size of 768 units, an intermediate size of 3072 and 12 attention heads. We use 4 transformer layers in total. We add a linear projection layer after the transformer with an output size of 256. As investigated in Section 4.3, when used standalone with no encoded data augmentations, the transformer head performs comparably with the nonlinear projection head used by SimCLR. This allows us to focus on ablating the impact of data augmentation encoding.
Augmentation Encoding: In this work we encode two augmentations, spatial cropping and temporal shift, as we found these to be most effective empirically for the downstream task of action classification. As noted by SimCLR, invariance to augmentations like color jittering and Gaussian blur is beneficial for classification, and hence we do not encode these. We note here that a method that could automatically select which augmentations to encode would be interesting, and we leave this for future work. Augmentations are encoded by the augmentation encoder $e$ as follows: for cropping we record the 4 scalar values representing the bounding box of the crop, then compute the relative distance of the cropped boxes between the two views and project it to 768-dim with a linear layer. For temporal shifting we encode a binary indicator for the arrow of time, and a single scalar representing the number of frames shifted by. Each of the two is mapped to 768-dim with its own embedding lookup table, and the two embeddings are then summed together.
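The two encoders described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the box layout (x_min, y_min, x_max, y_max) in normalised coordinates, the cap `MAX_SHIFT` on the shift table, the small `HIDDEN` size and the random initialisation are all hypothetical choices made so that the example is self-contained.

```python
import random

HIDDEN = 8      # 768 in the paper's head; small here for readability
MAX_SHIFT = 64  # hypothetical cap on the absolute shift, in frames

rng = random.Random(0)


def random_vector(dim):
    """Small random vector standing in for learned parameters."""
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


# Linear layer for the 4-d relative box (weights only, bias omitted).
crop_weights = [random_vector(4) for _ in range(HIDDEN)]
# Lookup tables: one row per arrow-of-time value, one per absolute shift.
arrow_table = [random_vector(HIDDEN) for _ in range(2)]
shift_table = [random_vector(HIDDEN) for _ in range(MAX_SHIFT + 1)]


def encode_crop(box_a, box_b):
    """Embed the coordinate-wise difference between two crop boxes."""
    relative = [b - a for a, b in zip(box_a, box_b)]
    return [sum(w * r for w, r in zip(row, relative)) for row in crop_weights]


def encode_time(start_a, start_b):
    """Embed the arrow of time and the absolute frame shift, then sum them."""
    shift = start_b - start_a
    arrow = 1 if shift >= 0 else 0          # binary arrow-of-time indicator
    magnitude = min(abs(shift), MAX_SHIFT)  # number of frames shifted by
    return [x + y for x, y in zip(arrow_table[arrow], shift_table[magnitude])]


box_a = (0.1, 0.2, 0.7, 0.9)
box_b = (0.3, 0.1, 0.9, 0.8)
crop_embedding = encode_crop(box_a, box_b)
forward = encode_time(0, 8)    # second view is 8 frames ahead
backward = encode_time(8, 0)   # second view is 8 frames behind
```

Because the arrow and the magnitude use separate tables, a forward and a backward shift of the same size share the magnitude embedding and differ only in the arrow embedding.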

We feed 16 input frames to the ResNet-50-3D backbone at pre-training, with a frame sampling stride of 4 for Kinetics, and 2 for SSv1 and SSv2. All frames are cropped and resized to a fixed spatial resolution. The transformer projection head is jointly trained with the ResNet-3D backbone during pre-training. We use the LARS optimiser with weight decay and an initial learning rate of 4.8. Unless otherwise mentioned, we pre-train for 500 epochs, with a batch size of 1024.

Evaluation: For linear evaluation, we freeze the pre-trained visual encoder $f$, extract the 2048-dim output features and train a linear classifier on top. The transformer projection head is not used at this stage. We sample 16 frames with a stride of 1 during training, and up to 8 sliding windows of 32 frames for evaluation, which cover the entire video span. This 'multi-crop in time' evaluation protocol is standard practice in prior work [13]. For linear evaluation, we use a momentum optimiser with a learning rate of 0.16 and a batch size of 256. For finetuning, we lower the learning rate to 0.02 and reduce the batch size to 128. All models are trained for 50 epochs.
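One plausible way to place the evaluation windows is sketched below. The paper states only the window length (32 frames) and the cap (8 windows); the even spacing, the rounding, and the handling of videos shorter than one window are assumptions, and `sliding_window_starts` is a hypothetical helper name.

```python
def sliding_window_starts(num_frames, window=32, max_windows=8):
    """Start indices of evenly spaced windows spanning the whole video."""
    if num_frames <= window:
        return [0]  # one (possibly padded) window covers a short video
    last_start = num_frames - window
    # Fewest windows that tile the video without gaps, capped at max_windows.
    needed = -(-num_frames // window)  # ceiling division
    count = min(max_windows, needed)
    step = last_start / (count - 1)
    return [round(i * step) for i in range(count)]


short_video = sliding_window_starts(20)
two_windows = sliding_window_starts(64)
four_windows = sliding_window_starts(100)
long_video = sliding_window_starts(1000)
```

Under this scheme the first window always starts at frame 0 and the last one ends at the final frame; for videos longer than 8 × 32 frames the windows no longer tile the video and instead sample it at regular intervals. Per-window predictions would then typically be averaged, which is also an assumption here.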

4.3 Model Ablations

In this section we perform 5 ablations, all on the SSv1 dataset. We pretrain on SSv1 train set without labels and then train a single linear layer. We evaluate on the SSv1 validation set. We first ablate the two key design choices of CATE: (1) the use of a transformer projection head compared to an MLP or linear layer (Table 1, left), and (2) the way we parameterise and regularise augmentation encodings (Table 1, right). In particular, we assess the impact of encoding both time shifts and their direction (arrow of time) (Table 1, right). We then show (3) that it is possible to compose multiple augmentations in our framework, and finally we ablate some low level details such as (4) the number of layers in the transformer head and (5) the number of epochs used for training.

1. Different Projection Head Types. In this section, we start with the vanilla SimCLR model, and then vary the following: (i) adding in temporal augmentations while training our model, (ii) encoding time or not under our CATE framework and (iii) whether we use a linear, MLP or transformer projection head. Results are shown in Table 1 (left).
SimCLR: Vanilla SimCLR [8] is applied to video frames, with no temporal data augmentation. Spatial augmentations are applied on the same frames to create views, and an MLP projection head is used. Results on SSv1 can be seen in the first row of Table 1 (left).
SimCLR++: In addition to spatial augmentation, we sample frames at different times in the same video to create views (row 3). We then also replace the MLP projection head with a transformer projection head which takes only the encoded visual representation. This baseline is shown in the fourth row of Table 1 (left), highlighted in blue, and is henceforth referred to as SimCLR++. It is a strong baseline for us to compare to. From Table 1 (left), it is clear that temporal data augmentation is essential for video representation learning, leading to a gain of over 9 points in top-1 accuracy (17.1% to 26.4%) with an MLP projection head (rows 1 and 3). When no temporal augmentation is encoded, we observe that a nonlinear MLP projection head gives similar performance to the transformer projection head (rows 3 and 4), which validates that any performance improvements in CATE are not due solely to directly replacing the MLP head with a transformer. We also observe that combining the transformer head with the MLP head leads to slightly worse performance (row 5).

2. Encoding Augmentations. We first observe in Table 1 (left) that adding temporal encoding improves the top-1 accuracy over SimCLR++ by nearly 5 points, from 26.5% to 31.2% (last row vs fourth row in blue).

To further explore the efficacy of our augmentation encoding, in Table 1 (right), we ablate on the augmentation type using two augmentations: cropping and temporal shifts (time). We also explore the method of parameterisation for temporal augmentation (using just the arrow of time, or the distance in time, which includes both the absolute value of the temporal shift and its direction), and also the regularisation on the encodings (dropout or no dropout). We observe that both crop encoding and time encoding on their own improve the classification accuracy over the baseline, with time encoding providing a slightly bigger boost. By comparing the fourth row and the fifth row, we can see that the parameterisation of the temporal augmentation also matters, and it is beneficial to pass the distance in time along with the arrow of time to the augmentation encoder. Finally, we observe that dropout regularisation helps crop encoding but not time encoding. We hypothesise that crop encoding might make the contrastive task too easy, so that stronger regularisation is needed; time encoding does not suffer from this issue as there is more variation to be learnt from different frames in a video, and it actually benefits from a more informative encoding (from the arrow of time alone to the signed distance in time).

3. Composing multiple augmentations. In Table 2 we show results for composing both crop and time encodings before feeding them to the transformer head. We can see that for SSv1 (similar results hold for SSv2 and can be found in Table A4), using crop and time encodings individually leads to improved performance over the no-encoding baseline (SimCLR++, first row), and composing them together leads to further improvement.

Enc. Crop | Enc. Time | Top-1 Acc. | Top-5 Acc.
- | - | 26.5 | 55.9
✓ | - | 28.1 | 58.0
- | ✓ | 31.2 | 61.4
✓ | ✓ | 32.2 | 62.4
Table 2: Composing spatial (crop) and temporal encodings for Something-Something v1. Each individual encoding outperforms the no encoding baseline (SimCLR++). Composing them together yields the best performance.

4. Number of transformer layers. We experiment with a number of Transformer layers (1,2,4,8) for our projection head (with time encoding), and observe that the performance begins to saturate at four layers (Table A2). We use four layers in all other experiments.

5. Number of training epochs. We study the impact of the number of epochs used for pre-training on SSv1 and Kinetics-400. For evaluation we use SSv1, HMDB and UCF. Similar to SimCLR [8], we observe improved performance by increasing the number of epochs initially, however we find saturation at around 500 epochs (Table A3).

4.4 Further Analysis

In this section we further analyse the effect of encoding augmentations on learned representations. We first inspect the per-class performance breakdown on SSv1 to study the impact of augmentation encodings. We then design a proxy task of predicting time shifts between clips and compare the impact of different augmentations on the type of downstream task.

Per-class breakdown on SSv1: We conjecture that augmentation encoding is helpful for downstream tasks that need to be aware of the corresponding spatial and temporal augmentations. To verify this conjecture, in Table 3 and Table 4 we list the SSv1 classes that benefit the most and the least from crop and time encoding, respectively. We sort the classes by the change in their per-class Average Precision. We can clearly see that the top classes for crop encoding are those that require some level of spatial reasoning (lift up, drop down, pull from right to left, and move down), while the bottom classes typically do not require spatial reasoning. Similarly, for time encoding, the classes that benefit the most are typically sensitive to temporal ordering by definition, such as lift up then drop down, and move closer, where changing the arrow of time would lead to the opposite action (move farther away).

| Class | AP difference |
| --- | --- |
| Lifting something up completely, then letting it drop down | 13.5 |
| Pulling something from right to left | 13.2 |
| Moving something and something away from each other | 13.2 |
| Dropping something in front of something | 12.6 |
| Moving something down | 12.2 |
| Pretending to sprinkle air onto something | -7.0 |
| Folding something | -8.6 |
| Pretending or failing to wipe something off of something | -10.0 |
| Moving away from something with your camera | -11.6 |

Table 3: Classes that benefit the most and the least from crop encoding on SSv1. We sort the classes by their difference in Average Precision.

| Class | AP difference |
| --- | --- |
| Lifting something up completely, then letting it drop down | 21.0 |
| Pulling two ends of something so that it gets stretched | 19.8 |
| Moving something and something closer to each other | 18.5 |
| Taking one of many similar things on the table | 17.2 |
| Pushing something so that it almost falls off but doesn’t | 16.7 |
| Poking something so lightly that it doesn’t move | -4.6 |
| Pretending to pour something out of something | -5.4 |
| Poking a stack of something without the stack collapsing | -5.5 |
| Pretending to spread air onto something | -7.8 |

Table 4: Classes that benefit the most and the least from time encoding on SSv1. We sort the classes by their difference in Average Precision.
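
The ranking behind Tables 3 and 4 can be sketched as follows. This is a minimal illustration, not the evaluation code used for the paper: the function names are hypothetical, and we assume the common definition of Average Precision as the mean of the precision values at the rank of each positive example.

```python
def average_precision(scores, is_positive):
    """AP for one class: mean precision at the rank of each positive example.

    Assumption: this is the usual ranking-based AP; the exact variant used
    in the paper is not specified in the text.
    """
    ranked = sorted(zip(scores, is_positive), key=lambda p: p[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def rank_classes_by_ap_gain(ap_with_encoding, ap_baseline):
    """Sort class names by AP(with encoding) - AP(baseline), largest gain first."""
    gains = {name: ap_with_encoding[name] - ap_baseline[name]
             for name in ap_baseline}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)
```

The head of the returned list corresponds to the classes that benefit the most from an encoding, and the tail to those that benefit the least.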

Predicting Time Shifts: The previous experiments confirm empirically that encoding augmentations during pretraining leads to better downstream performance. As a sanity check, we also design a proxy task to verify that the representations indeed retain the encoded information rather than discarding it. To analyse the time encodings, we design a time shift classification experiment based on the SSv1 dataset. For each video, we sample two 16-frame clips and use their relative distance in time as the classification label. The label space is quantised every 6 frames (0.5 seconds). During training and evaluation, we take the frozen representations of the two clips, concatenate them channel-wise and pass them to a linear classifier on top. Table 5 shows the results. By providing the encoded time augmentation during pre-training, CATE learns representations that maintain temporal shift information, solving the task with near perfect accuracy. Providing only the arrow of time retains some information, while the no encoding baseline performs poorly on this probing task.

Encode Time Time Offset Acc.
- 5.7
Table 5: Time Shift Classification on SSv1. Encoding time significantly helps on this proxy task, validating the intuition that our model retains useful time information.
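
The construction of the time shift probing task can be sketched as below. This is only an illustration under stated assumptions: the helper names are hypothetical, the shift is treated as a signed frame offset between clip start frames, and the maximum shift `max_shift` is a free parameter that the text does not specify. Only the clip length (16 frames) and the quantisation step (6 frames) come from the description above.

```python
import random

CLIP_LEN = 16    # frames per clip, as stated in the text
BIN_FRAMES = 6   # label quantisation step (0.5 seconds), as stated in the text


def sample_clip_starts(num_frames, rng):
    """Sample start frames for two 16-frame clips from one video."""
    max_start = num_frames - CLIP_LEN
    return rng.randint(0, max_start), rng.randint(0, max_start)


def time_shift_label(start_a, start_b, max_shift):
    """Quantise the signed offset (assumed in [-max_shift, max_shift]) to a class index."""
    shift = start_b - start_a
    return (shift + max_shift) // BIN_FRAMES


def num_shift_classes(max_shift):
    """Number of classes produced by time_shift_label for a given max_shift."""
    return (2 * max_shift) // BIN_FRAMES + 1


def pair_feature(feat_a, feat_b):
    """Concatenate the frozen features of the two clips along the channel axis."""
    return list(feat_a) + list(feat_b)
```

The concatenated vector returned by `pair_feature` would then be the input to the linear classifier, with `time_shift_label` providing its target.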

Type of Action Classification: In addition to the results on SS, we also report results on the standard action classification benchmarks UCF101 and HMDB51 in Table 6, under two settings: using all frames, and using only the first frame. We only show results with time encoding, because we find that, unlike on SSv1 and SSv2, using the crop encoding hurts performance on these benchmarks. We conjecture that the benefit of augmentation encoding depends on the downstream task at hand: for fine-grained tasks that require some level of spatial reasoning (for example, object localisation is needed to tell picking up from putting down in SSv1), awareness of spatial augmentations is helpful; for scene-level classification (UCF101 and HMDB51), however, it might be beneficial to be invariant to those augmentations.

Table 6 shows a trend for time encoding similar to that on SSv1: it improves over the baseline. The relative improvement is larger for first-frame classification than for classification using all frames, but in both cases it is smaller than on SSv1. This is in line with previous observations [63] that temporal information is more important for the Something-Something dataset.

| Input | Encode time | UCF | HMDB |
| --- | --- | --- | --- |
| All frames | No | 83.01 | 52.77 |
| All frames | Yes | 84.32 | 53.57 |
| First frame | No | 73.67 | 38.69 |
| First frame | Yes | 75.50 | 40.13 |

Table 6: Effect of time encoding on UCF101 [50] and HMDB51 [30]. We show results for both early action classification (first frame) and regular action classification (all frames). We use frozen features: i.e. pretrained representations trained on Kinetics-400 are fixed and classified with a linear layer. Encoding time helps in both settings, albeit slightly more for early action classification.
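
The relative improvements discussed for Table 6 can be checked with a few lines of arithmetic. The accuracies are copied from the table. Treating the higher number in each pair as the run with time encoding is our assumption, based on the statement that time encoding improves over the baseline.

```python
# Accuracies from Table 6 as (baseline, with time encoding).
# Assumption: in each pair of rows, the higher number is the time-encoded run.
TABLE_6 = {
    ("UCF", "all frames"): (83.01, 84.32),
    ("UCF", "first frame"): (73.67, 75.50),
    ("HMDB", "all frames"): (52.77, 53.57),
    ("HMDB", "first frame"): (38.69, 40.13),
}


def relative_gain(baseline, encoded):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (encoded - baseline) / baseline


gains = {key: relative_gain(*pair) for key, pair in TABLE_6.items()}
for (dataset, setting), gain in gains.items():
    print(f"{dataset:5s} {setting:12s} +{gain:.2f}%")
```

Under this assumption the relative gain is roughly 1.6% (all frames) versus 2.5% (first frame) on UCF, and roughly 1.5% versus 3.7% on HMDB.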

4.5 Comparison with State-of-the-Art

Finally, we present comparisons with previous state-of-the-art methods on SS, UCF101 and HMDB51.

For SS, we compare our self-supervised representations with other weakly- and fully-supervised representations. For evaluation, all representations are frozen, and a linear classifier is trained on the labeled training examples from the target dataset. In Table 7, we compare CATE with competitive weakly-supervised methods. CATE is pretrained on the train split of SSv1, and all weakly-supervised representations are pretrained by the authors of [65] on 19M public videos with hashtag supervision. The target dataset is SSv1. Despite training with only 0.1M videos without using their labels, our method is able to outperform these weakly-supervised approaches by large margins.

| Method | Supervision | Top-1 Acc. |
| --- | --- | --- |
| Distillation [22] | Weak | 15.6 |
| Prototype [48] | Weak | 20.3 |
| ClusterFit [65] | Weak | 20.6 |
| SimCLR++ [8] | Self | 26.4 |
| CATE | Self | 32.2 |

Table 7: Comparison to SoTA on the Something-something v1 val set. We use linear evaluation on frozen features. We compare to weakly-supervised baselines reported by [65]. *: our re-implementation with temporal augmentation.

In Table 8, we compare with fully-supervised Spatial-Temporal Interaction Networks (STIN) [33] for few-shot action classification. Both CATE and STIN are pretrained on the ‘Base’ split of Something-Else, which contains half of the videos. STIN uses its labels as supervision while CATE does not. The target few-shot dataset contains 5 or 10 examples per class, across 86 classes. This is a more challenging setup than the 5-way classification setup used by [7]. We find that CATE achieves performance on par with or better than the supervised STIN.

| Method | Pretrain | 5-shot Acc. | 10-shot Acc. |
| --- | --- | --- | --- |
| STIN+OIE+NL [33] | Supervised | 17.7 | 20.7 |
| SimCLR++ [8] | Self-sup. | 14.4 | 19.8 |
| CATE | Self-sup. | 18.0 | 22.9 |

Table 8: Comparison to SoTA on Something-Else, a split of Something Something-v2 for few shot classification. *: our re-implementation with temporal augmentation.

| Method | Modalities | Dataset | Frozen | UCF | HMDB |
| --- | --- | --- | --- | --- | --- |
| Shuff&Lrn* [38] | V | UCF | Yes | 26.5 | 12.6 |
| 3DRotNet [26] | V | K400 | Yes | 47.7 | 24.8 |
| CBT [51] | V | K600 | Yes | 54.0 | 29.5 |
| MemDPC [18] | V | K400 | Yes | 54.1 | 30.5 |
| TaCo [4] | V | K400 | Yes | 59.6 | 26.7 |
| CATE | V | K400 | Yes | 84.3 | 53.6 |
| MemDPC [18] | V+F | K400 | Yes | 58.5 | 33.6 |
| CoCLR [19] | V+F | K400 | Yes | 74.5 | 46.1 |
| AVSlowFast [61] | V+A | K400 | Yes | 77.4 | 44.1 |
| MIL-NCE [35] | V+T | HTM | Yes | 83.4 | 54.8 |
| XDC [2] | V+A | IG65M | Yes | 85.3 | 56.0 |
| ELO [43] | V+A | YT8M | Yes | - | 64.5 |
| Shuff&Lrn* [38] | V | UCF | No | 50.2 | 18.1 |
| CMC [53] | V | UCF | No | 59.1 | 26.7 |
| OPN [31] | V | UCF | No | 59.6 | 23.8 |
| ClipOrder [64] | V | UCF | No | 72.4 | 30.9 |
| [57] | V | K400 | No | 61.2 | 33.4 |
| 3DRotNet [26] | V | K400 | No | 66.0 | 37.1 |
| DPC [17] | V | K400 | No | 75.7 | 35.7 |
| CBT [51] | V | K600 | No | 77.0 | 47.2 |
| MemDPC [18] | V | K400 | No | 78.1 | 41.2 |
| SpeedNet [5] | V | K400 | No | 81.1 | 48.8 |
| VTHCL [66] | V | K400 | No | 82.1 | 49.2 |
| TaCo [4] | V | K400 | No | 85.1 | 51.6 |
| CATE | V | K400 | No | 88.4 | 61.9 |
| MemDPC [18] | V+F | K400 | No | 86.1 | 54.5 |
| CoCLR [19] | V+F | K400 | No | 87.9 | 54.6 |
| MIL-NCE [35] | V+T | HTM | No | 91.3 | 61.0 |
| ELO [43] | V+A | YT8M | No | 93.8 | 67.4 |
| XDC [2] | V+A | IG65M | No | 94.2 | 67.4 |

Table 9: Comparison to the state of the art on UCF101 [50] and HMDB51 [30]. *Reimplemented by [51]. Frozen "Yes" means the pretrained representation is fixed and classified with a linear layer, while "No" means all layers are finetuned end-to-end. Rows whose modality is not V alone use modalities beyond RGB frames as sources of supervision. Modalities are V: RGB frames only, T: text from ASR, F: pre-extracted optical flow, A: audio.

We also compare to the state-of-the-art on both HMDB51 and UCF101 in Table 9. Using frozen features, our model outperforms all other works that pretrain using RGB frames only – on UCF we even outperform a large number of works that use end-to-end finetuning. Additionally, our model outperforms AVSlowFast [61] which uses additional supervision from audio, and both MemDPC [18] and the recently proposed CoCLR [19], which use additional information from pre-extracted optical flow. Our model also compares favourably with MIL-NCE, XDC and ELO that are trained on orders of magnitude more training data – IG65M consists of 21 years of data (XDC), HTM, 15 years (MIL-NCE), and YouTube 8M, 13 years (ELO). In contrast, Kinetics400 contains only 28 days of video data.

On finetuning, the gaps are smaller, but we still outperform all previously published works that use RGB frames only. Methods that use additional information from other modalities and train on orders of magnitude more data (MIL-NCE, XDC and ELO) almost saturate performance on the UCF dataset.

5 Conclusion

We propose a general framework for contrastive learning that allows us to build augmentation awareness into video representations. Our method consists of an elegant transformer head that encodes augmentation information in a composable manner, and achieves state-of-the-art results for video representation learning. Future work will include evaluating on structured video understanding tasks and measuring the extent of equivariance learned by the representations.


6 Appendix

| Table No. | Pretrain Data (Unlabeled) | Target Data (Train) | Target Data (Eval) |
| --- | --- | --- | --- |
| 1,2,3,5 | SSv1 train split | SSv1 train split | SSv1 val split |
| 4,7 | Kinetics-400 train split | UCF/HMDB train splits | UCF/HMDB val splits |
| 6 | SElse ‘Base’ train split | SElse ‘Novel’ train split | SElse ‘Novel’ val split |

Table A1: Pretraining datasets for self-supervised representation learning with CATE, and target datasets for linear evaluation for results reported in the main paper.

6.1 Evaluation Protocol

Our work is about self-supervised pretraining of video representations. For evaluation, we mostly perform transfer learning experiments following the standard linear evaluation protocol commonly used by recent self-supervised image representation learning approaches [40, 8]. Our video representation is first pretrained on unlabeled videos from a large pretraining dataset.

We then transfer the self-supervised representations to the target dataset, by training a linear classifier on top of the frozen representations. This linear classifier is trained on labeled examples from the training split of the target dataset. Accuracy on the test split of the target dataset is used to measure the representation quality. We list the pretrain and target datasets used to generate the results in our main submission in Table A1.
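
A linear evaluation of this kind can be sketched as a softmax classifier trained on fixed feature vectors. The sketch below is a toy, standard-library illustration rather than the training setup used for the paper: the function names, the plain SGD optimiser and the hyperparameters (`epochs`, `lr`) are all assumptions, and the features stand in for the output of a frozen pretrained encoder.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    """Linear scores w_c . x + b_c for every class c."""
    return [sum(w * v for w, v in zip(weights[c], x)) + biases[c]
            for c in range(len(weights))]


def train_linear_probe(features, labels, num_classes, epochs=200, lr=0.5, seed=0):
    """Fit a softmax linear classifier on frozen features with plain SGD."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    order = list(range(len(features)))
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            probs = softmax(class_scores(weights, biases, x))
            for c in range(num_classes):
                # Gradient of the cross-entropy loss w.r.t. the logit of class c.
                grad = probs[c] - (1.0 if c == y else 0.0)
                biases[c] -= lr * grad
                for d in range(dim):
                    weights[c][d] -= lr * grad * x[d]
    return weights, biases


def top1_accuracy(weights, biases, features, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        scores = class_scores(weights, biases, x)
        correct += int(scores.index(max(scores)) == y)
    return correct / len(labels)
```

Only the classifier parameters are updated; the features themselves are never modified, which is what makes the resulting accuracy a measure of representation quality.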

6.2 More Model Ablations

1. Number of transformer layers. We vary the number of transformer layers (1,2,4,8) for our projection head, which receives encoded time augmentations as additional input. Performance on SSv1 linear evaluation can be found in Table A2. We observe that the performance begins to slightly saturate at four layers.

| No. Layers | Top-1 Acc. | Top-5 Acc. |
| --- | --- | --- |
| 1 | 26.9 | 56.2 |
| 2 | 30.0 | 60.3 |
| 4 | 31.2 | 61.4 |
| 8 | 31.3 | 62.4 |

Table A2: Impact of number of layers in the transformer projection head on Something-Something v1. Time shift encoding is used for all runs. The performance begins to gradually saturate at four layers. The transformer projection head is only applied during pre-training, and is not used in downstream tasks.

2. Number of Pretraining Epochs. We ablate the number of pretraining epochs when evaluated on SSv1, UCF101 and HMDB51. We observe in Table A3 that pretraining for more epochs helps improve representation quality, as also observed by [8], and that it saturates at around 500 epochs.

| Epochs | SSv1 | UCF101 | HMDB51 |
| --- | --- | --- | --- |
| 200 | 29.8 | 71.4 | 43.6 |
| 500 | 32.2 | 84.3 | 53.6 |
| 800 | 33.1 | 83.6 | 53.0 |

Table A3: Impact of number of training epochs on SSv1, UCF101 and HMDB51, using linear eval on frozen features.

3. Results on SSv2. We follow the same setup as Table 2 and study the impact of crop and time encodings when both the pretraining and target datasets are SSv2. Results are shown in Table A4. We observe a similar trend as in SSv1: encoding time outperforms the no encoding baseline, and composing time and crop encodings further improves performance.

Enc. Crop Enc. Time Top-1 Acc. Top-5 Acc.
40.0 72.4
40.1 72.4
42.3 74.5
43.5 75.3
Table A4: Results of crop and time encodings on SSv2 under a linear eval protocol. The trend is consistent with SSv1.

6.3 Results on CLEVR and DSprites

We further study the impact of crop encoding using two image benchmarks that explicitly require spatial reasoning. The first dataset is CLEVR [27], with 70,000 training and 15,000 validation images. It is a diagnostic dataset which contains multiple objects of diverse shape and location configurations. We follow the setup used by [67] and evaluate on two tasks: Count, which requires counting the total number of objects, and Dist, which requires predicting the depth of the closest object to the camera, where the depth is bucketed into 6 bins. Both tasks are formulated as classification. The second dataset is DSprites [34], which contains a single object floating around in an image, with varying shape, scale, orientation and location. We use the Location task, which requires predicting the center location of the object. The x and y coordinates are bucketed into 16 bins each. We report the geometric mean of classification accuracy on the bucketed x and y coordinates.
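
The Location metric can be sketched as below. This is a minimal illustration with hypothetical helper names. We assume that coordinates are normalised to [0, 1] and that the 16 bins are uniform, neither of which is stated in the text.

```python
import math

NUM_BINS = 16  # bins per coordinate, as stated in the text


def bucket(coord, num_bins=NUM_BINS):
    """Map a coordinate in [0, 1] (assumed normalised) to a bin index."""
    return min(int(coord * num_bins), num_bins - 1)


def accuracy(predicted, target):
    """Fraction of positions where the predicted bin equals the target bin."""
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


def location_score(pred_x, pred_y, true_x, true_y):
    """Geometric mean of the classification accuracies on the x and y bins."""
    return math.sqrt(accuracy(pred_x, true_x) * accuracy(pred_y, true_y))
```

Because the geometric mean is used, a model must classify both coordinates well to score highly: an accuracy of zero on either axis gives a score of zero.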

For both benchmarks, we train CATE using the same setups as we did with videos, except that the visual encoder is now a 2D ResNet-50, and the learning rate is reduced by 5x. We pretrain and evaluate on the datasets themselves.

| Crop Enc. | CLEVR-Count [27] | CLEVR-Dist [27] | DSprites [34] |
| --- | --- | --- | --- |
| No | 65.3 | 64.3 | 28.1 |
| Yes | 68.8 | 66.9 | 38.8 |

Table A5: Ablation of crop encoding on downstream tasks that require spatial reasoning, such as counting the number of objects, or localising objects in bucketed x, y coordinates.

The linear evaluation performance is shown in Table A5. We observe that encoding crop improves the transfer learning performance on all three tasks that require spatial reasoning, which further validates our conjecture.