The recent development of deep learning has promoted a series of applications in videos wang2016temporal; tran2018closer; feichtenhofer2019slowfast. Meanwhile, practitioners have developed various large-scale benchmarks abu2016youtube; carreira2017quo; goyal2017something that are accessible to traditional fully-supervised learning methods and have greatly facilitated video-related research. Nevertheless, the high cost of manual annotation required by fully-supervised methods rules out the use of millions of uncurated videos on the Internet. Therefore, learning video representations in an unsupervised manner is of great significance and has emerged as a general trend.
Recently, unsupervised learning in images oord2018representation; wu2018unsupervised; tian2019contrastive has achieved performance competitive with its supervised counterparts, especially with the contrastive self-supervised learning formulation he2020momentum; chen2020simple. Inspired by these successes, various attempts have also been made in self-supervised video representation learning qian2021spatiotemporal; han2020self. However, videos are distinct from images due to the extra dimension of temporal dynamics.
Unfortunately, the existing video benchmarks present severe static bias he2016human; li2018resound; choi2019sdn, which induces the model to focus on static and straightforward cues such as background context instead of the motion patterns that are intrinsically more informative. Therefore, when generalizing to novel benchmarks, the scene-biased model struggles to recognize unseen action classes that occur against a previously seen background. For instance, the action of “yoga” in the pretraining dataset often happens in the gym, so the pretrained model tends to bind the yoga action to the gym scene blindly and fails to understand the essence of the “yoga” action. When the unseen action class of “push-up” also occurs in the gym in the downstream dataset, the pretrained model has difficulty adapting to a new action that takes place in the same scene. In short, previous models tend to focus on background context but neglect the moving foreground. This phenomenon is more severe in self-supervised learning since the model is trained without semantic supervision, i.e., action labels.
Driven by the motivation of mitigating the background bias in the self-supervised learning framework, BE wang2021removing proposes a simple method of adding a specific static frame to all other frames in the video. By doing so, their method can reduce the learning emphasis on the background to some extent and make the model generalize better. However, this simple operation is relatively coarse and does not take good care of the foreground region. Although adding one static frame perturbs the background pixels as expected, it damages the appearance and the motion patterns of the foreground objects at the same time, leading to impaired temporal modeling and motion understanding.
To address this limitation and help the model better grasp the foreground action, we present a new augmentation technique named Foreground-background Merging (FAME). In particular, we separate each video’s foreground and background regions, and then merge the foreground areas with random backgrounds. In the separation step, we first outline the edge region of the moving object as the seed region via frame difference. Then, we use color statistics to extrapolate the entire foreground area from the seed region. This efficient foreground discovery method extracts most of the dynamic regions, on which we expect the model to focus. Next, in the merging step, we fuse the extracted foreground regions of each video with random backgrounds from other videos to form action samples. This merging step aims to reduce the influence of the original background by introducing diverse backgrounds. After that, we force the model to learn a consistent representation between the original clips and the distracting clips. The training pipeline is shown in Figure 1. In this way, the model is asked to prioritize the motion patterns, which alleviates the background bias in the self-supervised learning framework.
We evaluate FAME on two action recognition benchmarks. The superior experimental performance verifies that FAME enables self-supervised video representation learning to generalize better and distill the motion-aware representations. In short, we summarize our contributions as follows:
We propose a simple yet effective augmentation method for self-supervised video representation learning. Our approach helps the model mitigate background bias in video benchmarks and learn the motion-aware representations.
Our method enhances the MoCo framework remarkably and achieves state-of-the-art performance on two downstream tasks, action recognition and video retrieval, on two mainstream datasets, UCF101 and HMDB51.
2 Related Work
Contrastive Visual Representation Learning. Recently, contrastive learning has greatly facilitated self-supervised visual representation learning wu2018unsupervised; oord2018representation; tian2019contrastive; chen2020simple; he2020momentum. It performs instance discrimination in a fully self-supervised manner to pull the representations of the same instance close and push those of different instances far away. Following this idea, wu2018unsupervised proposes to formulate instance discrimination as a non-parametric classification problem. oord2018representation mathematically proves that mutual information can be estimated with the InfoNCE loss gutmann2010noise, which can be easily used for optimization. Later, he2020momentum proposes MoCo, which uses key representations calculated in previous iterations as negative samples to facilitate contrastive learning. SimCLR chen2020simple instead employs a large batch size rather than a memory bank to expand the negative pool for more robust visual representation. Considering that SimCLR requires tremendous computational resources, we adopt the MoCo framework as a strong baseline for self-supervised pretraining in our work.
Self-supervised Video Representation Learning. In video representation learning, there has been a line of works that employ diverse pretext tasks for self-supervised representation learning misra2016shuffle; lee2017unsupervised; xu2019self. The most prevalent approaches include temporal order prediction misra2016shuffle; vondrick2018tracking, spatio-temporal puzzling kim2019self, and speed prediction benaim2020speednet. These methods generally employ manually designed tasks to seek the spatio-temporal cues in video data, but their performance is limited. For further improvement, some works apply the contrastive learning formulation to video representation learning gordon2020watching; qian2021spatiotemporal. Han et al. use the InfoNCE loss to guide dense predictive coding in videos han2019video; han2020memory. asano2019self; han2020self propose to leverage the consistency between different modalities to enhance video representation. However, the video representations learned by these methods are mostly dominated by the background instead of the dynamic motions wang2021removing, which introduces strong background bias and impairs generalization ability in downstream applications. Therefore, in this work we propose FAME to construct positive samples with the same motions but different backgrounds for self-supervised pretraining.
Video Background Bias Mitigation. How to mitigate the background bias has been a long-standing topic for action recognition. In the supervised scenario, choi2019sdn use an off-the-shelf human detector to mask out the human regions and train the model in an adversarial manner. Later, to make the self-supervised video representations more robust to the background bias, a line of works employ other natural supervision huang2021self; xiao2021modist to guide the model to capture motion information explicitly. However, these methods require more than one backbone to pretrain multi-modality data, resulting in an undesired computational cost. To better leverage the implicit motion information in videos, DSM wang2021enhancing aims to decouple the motion and context by deliberately constructing the positive/negative samples through spatial and temporal disturbance. BE wang2021removing proposes to add a static frame as background noise for static bias mitigation. But these two methods would erode the foreground moving objects, while our FAME meticulously extracts foreground regions and preserves high-quality motion patterns.
In this section, we introduce our Foreground-background Merging (FAME) method. In Section 3.1, we first revisit the vanilla contrastive learning framework and then illustrate how our method is applied in this framework. In Section 3.2, we elaborate on how to separate foreground regions using our method. To clarify the notation, we denote a video clip as $X \in \mathbb{R}^{C \times T \times H \times W}$, where $C$, $T$, $H$, and $W$ are respectively the dimensions of the channel, timespan, height, and width.
3.1 Vanilla Contrastive Learning
The vanilla contrastive learning approach employs instance discrimination to learn the feature representation in a fully self-supervised manner (chen2020simple; he2020momentum; grill2020bootstrap). Generally, it aims to maximize the similarity between the query sample $q$ and its positive keys $k^+$, and minimize the similarity between $q$ and negative keys $k^-$. We use the InfoNCE loss gutmann2010noise for optimization:
$$\mathcal{L} = -\log \frac{\sum_{k \in \{k^+\}} \exp(\mathrm{sim}(q, k)/\tau)}{\sum_{k \in \{k^+, k^-\}} \exp(\mathrm{sim}(q, k)/\tau)},$$
where $\tau$ is the temperature hyper-parameter controlling the concentration level of the distribution, and $\mathrm{sim}(\cdot, \cdot)$ measures the cosine similarity between the latent embeddings, i.e., $\mathrm{sim}(q, k) = q^\top k / (\lVert q \rVert \lVert k \rVert)$. In most existing works (feichtenhofer2021large), $\{k^+\}$ is the set of clip embeddings extracted from the same video as $q$, and $\{k^-\}$ is the set of clip embeddings extracted from other videos.
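The loss described above can be sketched in a few lines of NumPy. This is only an illustration for a single query, assuming cosine similarity on L2-normalized embeddings; the function name `info_nce` and the default temperature are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def info_nce(q, pos_keys, neg_keys, tau=0.1):
    """InfoNCE loss for one query q (D,), positives (P, D), negatives (N, D).

    tau=0.1 is a placeholder temperature, not the paper's setting.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = normalize(q)
    pos = normalize(pos_keys) @ q / tau   # cosine similarities / tau, shape (P,)
    neg = normalize(neg_keys) @ q / tau   # shape (N,)
    logits = np.concatenate([pos, neg])

    # Log-sum-exp with max subtraction for numerical stability.
    m = logits.max()
    log_num = m + np.log(np.exp(pos - m).sum())
    log_den = m + np.log(np.exp(logits - m).sum())
    return float(log_den - log_num)       # equals -log(numerator / denominator)
```

The loss is small when the query is close to its positive keys and far from the negative keys, and it grows when the roles are swapped.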
However, this vanilla contrastive learning formulation in the video domain cannot fully utilize the dynamic motion information and tends to discriminate different instances according to the background cues (wang2021removing). If the model depends excessively on the background but ignores the foreground object, such a misleading focus can risk the model’s generalization ability. Thus, we carefully design FAME as an augmentation technique to circumvent the negative impact of the background. We show the contrastive learning framework with FAME in Figure 1. In detail, we randomly sample two clips from different timestamps. Before applying the basic augmentation, we use our proposed FAME method to compound the foreground of one clip with the background from other videos in the same mini-batch. After that, the two clips only resemble each other in the moving foreground objects but differ in the background context. Then, we feed these two clips into the 3D encoder and treat them as the positive keys while the rest of the clips serve as negative keys. Finally, we minimize the InfoNCE loss to pretrain the 3D encoder. By constructing the positive pair with the same foreground but diverse backgrounds, we guide the model to focus on temporal dynamics and suppress the impact of the background.
3.2 Foreground-background Merging
Motivated by mitigating background bias in self-supervised video representation learning, we intend to retain the foreground regions in original videos and shuffle the background areas among various videos. To achieve this goal, we propose the Foreground-background Merging method to augment the clips with minimal computation overhead. Concretely, FAME consists of two stages, one for separation and the other for merging.
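In the separation stage, whose pipeline is shown in Figure 2, we first take the difference between adjacent frames and then sum the magnitude of the difference along the channel and timespan dimensions to generate the seed region $S \in \mathbb{R}^{H \times W}$. We formulate $S$ as
$$S = \sum_{c=1}^{C} \sum_{t=1}^{T-1} \left| X_{c,t+1} - X_{c,t} \right|,$$
where $X_{c,t} \in \mathbb{R}^{H \times W}$ is channel $c$ of frame $t$ of the clip $X$.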
The intuition is that moving foreground objects tend to produce large frame differences, while static backgrounds produce small ones. In practice, we find that the large values of the seed region $S$ usually correspond to the edge regions of moving objects. To expand the edges of the foreground objects into the whole foreground, we take inspiration from unsupervised foreground discovery stretcu2015multiple for seed propagation. Specifically, we leverage the color distributions to estimate the entire foreground. Denoting $N_F$ as the total number of pixels in the foreground region and $N_F^{(x)}$ as the number of pixels with a given color $x$ in the foreground region, the probability of color $x$ appearing in the foreground region can be estimated as $P(x \mid F) = N_F^{(x)} / N_F$. Similarly, the probability of $x$ belonging to the background region is $P(x \mid B) = N_B^{(x)} / N_B$. In practice, we sample the foreground color distribution from the pixels with the highest values of the seed region $S$ and the background color distribution from the pixels with the lowest values of $S$; the two sampled fractions determine $N_F$ and $N_B$. Given these two color distributions and the assumption that all pixels with the same color have the same probability of being foreground or background, we approximate the foreground likelihood for a given color $x$ as
$$P(F \mid x) = \frac{P(x \mid F)}{P(x \mid F) + P(x \mid B)}.$$
Therefore, the soft segmentation mask $M \in \mathbb{R}^{H \times W}$ can be calculated from the color of each pixel. We formulate it as $M(i,j) = P(F \mid x_{(i,j)})$, where $x_{(i,j)}$ is the color at pixel $(i,j)$. To better filter out the background region, we binarize the mask:
$$\tilde{M}(i,j) = \begin{cases} 1, & \text{if } M(i,j) \text{ is among the top } \beta HW \text{ values of } M, \\ 0, & \text{otherwise,} \end{cases}$$
where $\beta$ is a hyper-parameter describing the portion of the foreground. Note that the mask we generate is constant with respect to the timespan for the sake of computational efficiency. To do so, we view a video clip as an “image” when counting the color statistics; in other words, we reduce over the timespan dimension.
We have tried three variants to obtain the foreground mask $\tilde{M}$. Experimental results in Table 6 verify that FAME works best among them. Moreover, all variants of FAME consistently and substantially enhance the representation ability. This conforms to our intuition that retaining the original foregrounds and shuffling the backgrounds stimulates motion understanding.
Having the foreground mask $\tilde{M}$, we then conduct the merging stage. Denoting $X_{fg}$ and $X_{bg}$ as the foreground and background source clips, the synthetic clip is $X_{merge} = X_{fg} \otimes \tilde{M} + X_{bg} \otimes (1 - \tilde{M})$, where $\otimes$ is element-wise multiplication. As a result, we merge the moving foreground objects with random backgrounds.
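The separation and merging stages can be sketched end to end as follows. This is a minimal NumPy illustration under several assumptions that are not confirmed by the text: clips are float arrays of shape (C, T, H, W) with values in [0, 1], colors are quantized into `bins` levels per channel, the per-pixel color is taken from the time-averaged frame, and the sampling fractions `fg_frac` and `bg_frac` as well as the default `beta` are placeholder values. All function names are hypothetical.

```python
import numpy as np

def seed_region(clip):
    """Sum of absolute adjacent-frame differences over channel and time -> (H, W)."""
    diff = np.abs(clip[:, 1:] - clip[:, :-1])
    return diff.sum(axis=(0, 1))

def foreground_mask(clip, beta=0.5, fg_frac=0.5, bg_frac=0.1, bins=8):
    """Binary (H, W) foreground mask from frame differences and color statistics."""
    C, T, H, W = clip.shape
    S = seed_region(clip)

    # Assumption: treat the clip as one "image" by averaging over time,
    # then quantize each pixel's color into a single integer index.
    image = clip.mean(axis=1)                                   # (C, H, W)
    q = np.clip((image * bins).astype(int), 0, bins - 1)
    color = np.ravel_multi_index(tuple(q), (bins,) * C)         # (H, W)

    # Foreground colors come from the highest seed values,
    # background colors from the lowest seed values.
    order = np.argsort(S, axis=None)                            # ascending
    n_fg = max(1, int(fg_frac * H * W))
    n_bg = max(1, int(bg_frac * H * W))
    flat = color.ravel()
    n_colors = bins ** C
    p_fg = np.bincount(flat[order[-n_fg:]], minlength=n_colors) / n_fg
    p_bg = np.bincount(flat[order[:n_bg]], minlength=n_colors) / n_bg

    # Foreground likelihood per color, then per pixel (soft mask).
    likelihood = p_fg / (p_fg + p_bg + 1e-8)
    soft = likelihood[color]

    # Keep roughly the top beta * H * W pixels (ties may keep slightly more).
    k = max(1, int(beta * H * W))
    thresh = np.sort(soft, axis=None)[-k]
    return (soft >= thresh).astype(clip.dtype)

def fame_merge(fg_clip, bg_clip, beta=0.5):
    """Paste the foreground of fg_clip onto the background of bg_clip."""
    mask = foreground_mask(fg_clip, beta=beta)   # (H, W), broadcast over C and T
    return fg_clip * mask + bg_clip * (1.0 - mask)
```

In a training loop, `bg_clip` would typically be another clip from the same mini-batch, for example obtained by shuffling the batch dimension. Setting `beta=1.0` marks every pixel as foreground, so the merged clip equals the original one.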
In this section, we first introduce the datasets and the implementation details of our experiments. Then, we report our evaluation results on two downstream tasks: action recognition and video retrieval. Next, we conduct a set of ablation studies to analyze and validate our FAME method quantitatively. Finally, we qualitatively investigate what the model learns with FAME.
Kinetics-400 carreira2017quo is a large-scale and high-quality dataset for action recognition, which consists of around 240K video clips with 400 human action classes. Each clip lasts about 10 seconds and is annotated with a single action class. We use the training set of Kinetics-400 to pretrain our model in a self-supervised manner.
UCF101 soomro2012ucf101 is a human action dataset. It contains over 13k clips covering 101 action classes. In our experiment, split 1 of UCF101 is used for pretrain and downstream tasks.
HMDB51 kuehne2011hmdb is a human action dataset with 51 action categories and around 7,000 manually annotated clips. We also use split 1 of HMDB51 in our experiments.
4.2 Implementation Details
In the stage of self-supervised training, we apply our FAME method on MoCo framework he2020momentum; chen2020improved. We select two common backbone choices, R(2+1)D-18 tran2018closer and I3D-22 carreira2017quo, as the 3D encoder.
First, we randomly sample two different temporal clips from the same video as a positive pair. Each clip consists of 16 frames with a temporal stride of 2. We spatially crop a random portion of each clip and resize it to one of two fixed sizes. We then use FAME to distract one clip of the positive pair. Note that the background videos come from the clips in the same mini-batch. Next, following the prior work feichtenhofer2021large, we perform the basic augmentation, which consists of random grayscale, color jittering, random horizontal flip, and random Gaussian blur. All these augmentations are temporally consistent. We pretrain the model for 200 epochs with a batch size of 64 on 8 Tesla V100 GPUs. The SGD optimizer is adopted together with weight decay, starting from a preset initial learning rate. For the implementation of MoCo, the size of the negative queue is set to 65536 for Kinetics-400 and 2048 for UCF101. We also swap the key/queue samples so that every sample can generate gradients for optimization. More implementation details are in the Appendix.
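The clip sampling and the notion of a temporally consistent augmentation can be illustrated with a short sketch. The clip length of 16 frames and the stride of 2 follow the text; everything else (function names, clamping indices for short videos, using a horizontal flip as the example augmentation) is an assumption made for illustration.

```python
import numpy as np

def sample_clip_indices(num_frames, clip_len=16, stride=2, rng=None):
    """Frame indices of one clip with a random start position."""
    if rng is None:
        rng = np.random.default_rng()
    span = (clip_len - 1) * stride + 1
    max_start = max(0, num_frames - span)
    start = int(rng.integers(0, max_start + 1))
    idx = start + stride * np.arange(clip_len)
    # Assumption: clamp to the last frame when the video is too short.
    return np.minimum(idx, num_frames - 1)

def sample_positive_pair(num_frames, rng=None):
    """Two independently sampled clips from the same video."""
    if rng is None:
        rng = np.random.default_rng()
    return (sample_clip_indices(num_frames, rng=rng),
            sample_clip_indices(num_frames, rng=rng))

def consistent_horizontal_flip(clip, rng=None, p=0.5):
    """Flip all frames of a (C, T, H, W) clip together, or none of them."""
    if rng is None:
        rng = np.random.default_rng()
    return clip[..., ::-1] if rng.random() < p else clip
```

The key point of temporal consistency is that the random decision is drawn once per clip rather than once per frame, so the augmentation does not introduce artificial motion.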
|Method||Backbone||Pretrain dataset||Frames||Resolution||UCF101||HMDB51|
|Pace Prediction wang2020self||R(2+1)D||Kinetics-400||16||112||77.1||36.6|
After pretraining, we initialize the backbone with the pretrained parameters except for the last fully connected layer. We validate the self-supervised representations under two action recognition protocols. The first is linear probe, where the encoder is frozen and we only train the last fully connected layer. The second is finetune, where we train the whole network in a supervised fashion. During inference, we follow the standard evaluation protocol xu2019self; wang2020self; pan2021videomoco. We uniformly sample ten video clips with a temporal stride of 2 from each testing video, then center-crop and resize them. We average the predictions over the clips of each testing video and report Top-1 accuracy to measure the action recognition performance.
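The clip-averaged evaluation can be sketched as below. It assumes that "prediction" means the softmax probability of each clip and that test clips have 16 frames like the pretraining clips; both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def uniform_clip_starts(num_frames, num_clips=10, clip_len=16, stride=2):
    """Evenly spaced start indices for the test clips of one video."""
    span = (clip_len - 1) * stride + 1
    last_start = max(0, num_frames - span)
    return np.linspace(0, last_start, num_clips).round().astype(int)

def video_top1_accuracy(clip_logits, labels):
    """clip_logits: (num_videos, num_clips, num_classes); labels: (num_videos,)."""
    z = clip_logits - clip_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    video_probs = probs.mean(axis=1)          # average over the clips of a video
    return float((video_probs.argmax(axis=-1) == labels).mean())
```

Averaging over several clips makes the video-level prediction less sensitive to any single clip that happens to miss the action.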
Without further training, we directly use the representation from the pretrained encoder for evaluation. Following xu2019self; luo2020video, we take video clips in the test set to query the $k$ nearest neighbors in the training set. In detail, we average the representations of ten uniformly sampled clips to obtain the global representation. If the category of the testing clip appears among the $k$ nearest neighbors, it counts as a hit. We report Top-$k$ recall R@$k$ for evaluation.
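The retrieval metric can be sketched as follows. The text does not state the distance used for the nearest-neighbor search, so cosine similarity is assumed here; the function name is hypothetical, and the inputs are the per-video global representations described above.

```python
import numpy as np

def recall_at_k(test_feats, test_labels, train_feats, train_labels, ks=(1, 5)):
    """Fraction of test videos whose class appears among the k nearest training videos."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    sim = normalize(test_feats) @ normalize(train_feats).T   # (n_test, n_train)
    order = np.argsort(-sim, axis=1)                         # most similar first
    ranked_labels = train_labels[order]                      # (n_test, n_train)
    hits = ranked_labels == test_labels[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```

A query counts as a hit at rank k as soon as any one of its k nearest neighbors shares its class, so R@k is non-decreasing in k.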
4.3 Evaluation on Downstream Tasks
We compare our method with the existing self-supervised video representation learning approaches on action recognition. In Table 1, we report Top-1 accuracy on UCF101 and HMDB51. We do not consider the existing methods with a deeper backbone or non-single modality, e.g., optical flow, audio, and text.
Our method obtains the best result on UCF101 and a comparable result on HMDB51 in the linear probe setting. Even though MoCo chen2020improved serves as a strong baseline and outperforms most previous methods, FAME still improves the MoCo baseline by 4.8% and 2.4% on UCF101 and HMDB51, respectively. MoCo+FAME also beats MLRep qian2021enhancing, which carefully designs multi-level feature optimization and temporal modeling, by a large margin, i.e., about 9.0% gain on both UCF101 and HMDB51. This performance indicates that our method can capture the moving foreground patterns and represent the temporal information.
In the finetune protocol, FAME with the R(2+1)D backbone also achieves the best result on UCF101 and HMDB51. This indicates that FAME helps MoCo learn scene-debiased and motion-aware representations on the Kinetics-400 dataset, which generalize well to the downstream datasets. Remarkably, FAME brings 1.9% and 2.8% performance gains on UCF101 and HMDB51 over the MoCo baseline. In comparison to other state-of-the-art methods, although SRTC zhang2021incomplete introduces two additional sub-loss terms to regularize the self-supervised pretraining, our simple formulation outperforms SRTC with the same R(2+1)D backbone. Notably, we share a similar motivation with BE wang2021removing: both aim to alleviate the background bias problem by disturbing the background. BE directly adds a static frame to all other frames and regards the distracted video as the positive pair of the original video. This coarse disturbance damages the foreground region, and our experiments confirm that it hurts temporal modeling and motion understanding. Using the same I3D backbone, FAME outperforms BE by 1.8% and 5.7% on UCF101 and HMDB51, respectively. This suggests that our model can better mitigate the static background bias by further separating foreground and background.
We report the performance comparison for video retrieval in Table 2. Our method achieves a significant performance gain on R@1. Remarkably, though ASCNet huang2021ascnet devises two particular tasks to learn appearance and speed consistency, we still reach higher Top-1 retrieval accuracy. We attribute this to our motion-aware representations, which can precisely retrieve actions with the same semantics. Our method is slightly lower than ASCNet from R@5 to R@50, since FAME almost abandons background cues, while ASCNet may exploit background shortcuts to retrieve samples of the same category when $k$ is large.
4.4 Ablation Study
In this section, we conduct thorough ablation studies to analyze how FAME improves self-supervised video representation learning. We choose split 1 of UCF101 as the pretrain dataset and I3D as the backbone for computational efficiency. All the Top-1 accuracy in the ablation study is measured under the protocol of finetune.
Importance of the area ratio of the foreground region.
In this section, we inspect how the area of the foreground region contributes to the representation quality. We ablate $\beta$, the portion of the foreground. Note that treating the whole frame as foreground ($\beta = 1$) reverts to the baseline method without applying FAME, since no background is replaced. We report the performance in Table 3. The smaller values of $\beta$ vastly outperform the baseline on both UCF101 and HMDB51. The improvement at the largest value of $\beta$ is also considerable, though slightly inferior to that of the smaller values due to insufficient background replacement. This shows that our method is fairly insensitive to the hyper-parameter $\beta$, and that the harder contrastive task formulated by FAME is more conducive to the representation quality.
Impact of background source.
Besides the foreground ratio, we also wonder how the source of background affects the representation ability to capture the motion. Specifically, we aim to explore whether the performance would change dramatically using the background in the same video instead of other videos. We perform an additional experiment where we merge the foreground of one video with the background sampled at different timestamps of the video itself. As shown in Table 4, we find that using the background from intra-video boosts the baseline with 1.6% and 2.1% improvement on UCF101 and HMDB51 and the introduction of other videos’ backgrounds brings further improvement, i.e., 5.4% and 7.1% gain on UCF101 and HMDB51. In general, the intra-video background is almost the same as the original one, while the inter-video background is quite distinct. Thus, it demonstrates that the modification from the intra-video is not adequate to mitigate background bias while replacing the background with diverse scenes better strengthens motion pattern learning.
Stronger background debiasing.
To explore whether FAME is sufficiently strong to reduce the background bias in the contrastive learning, we design a stronger contrastive objective. That is, we apply FAME on both branches of MoCo and neither of the two processed video clips contains initial background information. We report the results in Table 5. The slight difference in performance between those two settings proves that our default setting is strong enough for the model to learn the scene-debiased representations.
Variants of Foreground-background Separation.
In order to verify that emphasizing the moving foreground advances motion understanding, we devise three variants of the foreground mask: (i) Gauss: we adopt a 2D Gaussian kernel matrix as the foreground mask. It derives from the assumption that videos are shot in an object-centric form. (ii) Seed: we take only the seed region $S$ to characterize the foreground. (iii) Grid: we separate foreground and background via a grid. In practice, the video is split spatially into a grid of cells. We sum the seed region $S$ within each cell and take the eight cells with the greatest sums as the foreground area. A brief illustration is displayed in Figure 3.
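The Grid variant can be sketched as follows. The number of grid cells per side is not recoverable from the text, so `grid=4` is a hypothetical default; the input `S` is the (H, W) seed region obtained from frame differences, and the function name is illustrative.

```python
import numpy as np

def grid_mask(S, grid=4, top_cells=8):
    """Binary (H, W) mask that keeps the grid cells with the largest seed sums.

    grid=4 is a placeholder; H and W must be divisible by grid.
    """
    H, W = S.shape
    if H % grid or W % grid:
        raise ValueError("H and W must be divisible by grid")
    h, w = H // grid, W // grid

    # Sum the seed values inside each cell -> (grid, grid).
    cell_sums = S.reshape(grid, h, grid, w).sum(axis=(1, 3))

    # Mark the top cells, then expand each cell back to pixel resolution.
    top = np.argsort(cell_sums, axis=None)[-top_cells:]
    cell_mask = np.zeros(grid * grid)
    cell_mask[top] = 1.0
    return np.kron(cell_mask.reshape(grid, grid), np.ones((h, w)))
```

Compared with a per-pixel color-based mask, this mask is coarse: a selected cell keeps all of its background pixels, and a moving object that crosses cell borders can be cut off.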
We compare FAME with these three variants in Table 6. First, we note that all variants improve the baseline by a large margin, showing the effect of FAME. Furthermore, refining the foreground mask from Gauss, Seed, Grid to FAME continually increases the action recognition performance. Interestingly, we notice that Grid outperforms FAME slightly on UCF101. We conjecture that it is because both the pretrain dataset and downstream dataset are UCF101. Thus, background bias can be leveraged as a shortcut for action recognition. To delve into this phenomenon, we carry out an extra experiment on another pretrain dataset Kinetics-400. Top-1 accuracy of Grid variant is over 2% lower than FAME on both UCF101 and HMDB51. It indicates that a meticulous segmentation mask instead of a rough grid box is more effective in facilitating generalization ability.
4.5 Visualization Analysis
To better demonstrate the effectiveness of FAME, we provide the CAM zhou2016learning visualization in Figure 4. We train a linear classifier on top of the pretrained model, similar to the linear probe, and omit the last global average pooling layer to generate activation maps. With that, we can spot the contribution of each area and find the crucial regions for discriminating the action class. We find that when integrated with FAME, MoCo can focus on the moving foreground area rather than the background context. For example, in the first row of Figure 4, MoCo+FAME precisely captures the moving upper and lower body when the man is practicing TaiChi, while the MoCo baseline displays a dispersed highlight map and fails to attend to the motion area. In addition, we illustrate that the CAM activation map almost overlaps with the foreground mask generated by FAME. This supports that our method enables the model to perceive the motion patterns and suppress the background bias.
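The activation-map computation can be sketched as follows. It assumes a bias-free view of the linear classifier applied to the encoder's feature map before global average pooling; the shapes and function names are hypothetical.

```python
import numpy as np

def class_activation_map(features, fc_weight, class_idx):
    """Raw CAM of shape (T, H, W).

    features:  (D, T, H, W) encoder output before global average pooling.
    fc_weight: (num_classes, D) weights of the linear classifier.
    """
    return np.tensordot(fc_weight[class_idx], features, axes=([0], [0]))

def normalize_cam(cam):
    """Rescale a CAM to [0, 1] for visualization."""
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)
```

Because the classifier is linear, averaging the raw map over all positions gives the same value as applying the classifier weights to the globally pooled feature, which is why each location's value can be read as its contribution to the class score.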
In this work, we propose a new Foreground-background Merging (FAME) method to alleviate the background bias in self-supervised video representation learning. Via Foreground-background Merging, we augment the original video by separating the foreground and background regions and fusing the original foreground with other videos’ backgrounds. Then, the backbone model is trained to learn semantically consistent representations between the original video and the fused video. In this way, the model can learn scene-debiased and motion-aware representations of videos. Experimental results on several downstream tasks demonstrate the effectiveness of our method.