Dense Semantic Contrast for Self-Supervised Visual Representation Learning

09/16/2021, by Xiaoni Li, et al.

Self-supervised representation learning for visual pre-training has achieved remarkable success with sample (instance or pixel) discrimination and semantics discovery of instances, yet a non-negligible gap remains between the pre-trained model and downstream dense prediction tasks. Concretely, these downstream tasks require more accurate representations; in other words, pixels from the same object must belong to a shared semantic category, which previous methods fail to guarantee. In this work, we present Dense Semantic Contrast (DSC), which models semantic category decision boundaries at a dense level to meet the requirement of these tasks. Furthermore, we propose a dense cross-image semantic contrastive learning framework for multi-granularity representation learning. Specifically, we explicitly explore the semantic structure of the dataset by mining relations among pixels from different perspectives. For intra-image relation modeling, we discover pixel neighbors from multiple views. For inter-image relations, we enforce pixel representations from the same semantic class to be more similar than representations from different classes within a mini-batch. Experimental results show that our DSC model outperforms state-of-the-art methods when transferred to downstream dense prediction tasks, including object detection, semantic segmentation, and instance segmentation. Code will be made available.




1. Introduction

Figure 1. Illustration of pixels with the same semantics within and across images. Pixels from the same object share a common semantic category and should be drawn closer.

Although self-supervised pre-training (Luo et al., 2020b; Yao et al., 2020; Luo et al., 2020a; Li et al., 2021a) has achieved breakthrough performance with large-scale datasets (e.g., ImageNet (Deng et al., 2009)), the gap between the pre-trained model and downstream dense prediction tasks (such as object detection (Everingham et al., 2010; Lin et al., 2014; Yang et al., 2020, 2021; Jiang et al., 2019; Qin et al., 2020, 2019; Qiao et al., 2020b, a; Chen et al., 2019, 2020c) and segmentation (Cordts et al., 2016)) remains non-negligible. The object detection task aims to predict categories and bounding boxes for all objects of interest in an image, while the objective of the segmentation task is to assign a category to each pixel. All of these tasks need denser, more semantic representations for precise prediction. However, previous instance-level self-supervised learning (SSL) methods (Chen et al., 2020a; He et al., 2020a; Chen et al., 2020b; Wu et al., 2018; Grill et al., 2020; Chen and He, 2020; Caron et al., 2018; Asano et al., 2020; Niu et al., 2020; Huang et al., 2020b; Zhan et al., 2020; Ji et al., 2019; Caron et al., 2020; Huang and Gong, 2021; Zhang et al., 2021) for pre-training obtain only global feature representations, which are better suited to global classification. Additionally, recently developed pixel-level methods (Pinheiro et al., 2020; Xie et al., 2020; Wang et al., 2020) remain limited to low- and mid-level understanding due to the lack of semantic category decision boundary modeling.

Representation learning methods based on instance discrimination have recently achieved state-of-the-art performance by attracting positive samples while repelling negative samples. IR (Wu et al., 2018) proves that non-parametric instance-level classification can capture visual similarity. Subsequently, view-invariant approaches such as MoCo v1 & v2 (He et al., 2020a; Chen et al., 2020b) and SimCLR (Chen et al., 2020a) propose that good representations can be learned by treating augmented versions of a sample as positives. More recently, BYOL (Grill et al., 2020) and SimSiam (Chen and He, 2020) show that positive pairs alone are sufficient for learning good feature representations, without negative pairs. However, these instance discrimination methods neglect the relations among different instances. To supply the actual semantic category information in the dataset, some works (Caron et al., 2018; Huang et al., 2020a) are devoted to modeling semantic structures to approach supervised learning's performance. DeepCluster (Caron et al., 2018) and AND (Huang et al., 2020a) are two typical semantics-exploring works, implemented by clustering and by discovering nearest neighbors respectively. Subsequent works (Zhan et al., 2020; Caron et al., 2020; Huang et al., 2020a; Zhuang et al., 2019) further explore semantic information to learn more discriminative feature representations. Nevertheless, both instance discrimination and semantics-discovery methods focus on the global feature representation of images and are only suitable for object-centric datasets such as ImageNet (Deng et al., 2009), where each image contains one main object. As shown in Figure 1, instance discrimination methods treat each image as an individual class, ignoring that the two images contain the same object (i.e., a dog). Meanwhile, semantics-discovery methods cannot treat these two images as a positive pair because the objects they contain are not identical (a dog vs. a dog and a cat), which reduces their similarity. Therefore, these approaches are not universal for downstream dense prediction tasks or for pre-training on complex real-world scene images.

To explore pre-training approaches better suited to dense prediction tasks, (Pinheiro et al., 2020; Xie et al., 2020; Wang et al., 2020) conduct contrastive learning from a denser perspective with the notion of pixel discrimination. They treat each pixel as a single class and learn discriminative representations for pixels. Although this specific design evidently narrows the gap between the pre-trained model and downstream dense prediction tasks, these methods lack pixel-level semantic category discriminative capability, since the non-linear intra-class variations of pixels are not modeled. Hence, they are limited to low- and mid-level visual understanding at the individual pixel level. Taking Figure 1 as an example again, the pixels in the regions of the two dogs belong to the same semantic category (the red circles) and should be drawn closer to each other. Conversely, the pixels in the two dogs' regions differ semantically from those in the cat's region (the blue circles) and should be pushed away. However, previous pixel discrimination methods push all of these pixels apart, lacking high-level visual understanding.

In this work, we present the concept of Dense Semantic Contrast (DSC) for explicitly modeling semantic category decision boundaries at the pixel level, which establishes both instance-to-instance and pixel-to-pixel semantic connections. Besides, a dense cross-image semantic contrastive learning framework for multi-granularity representation learning is constructed to make up for the semantic deficiencies of previous SSL pre-training methods. Specifically, we first explore a neighbors-discovery method to enhance the correlation of pixels within an image, mining neighbors from multiple views (Figure 3). Moreover, we design a dense semantic module for cross-image semantic relation modeling by adopting certain clustering methods, shown in the right part of Figure 2. Here we focus on k-means (KM) and prototype mapping (PM) (Caron et al., 2020) for simplicity, though other clustering methods such as Power Iteration Clustering (Lin and Cohen, 2010) and Invariant Information Clustering (Ji et al., 2019) can also be adopted. For the other granularities, we conduct instance and pixel discrimination with standard contrastive learning. DSC is trained in an end-to-end manner, as it considers multi-granularity contrastive representation learning simultaneously. To summarize, the major contributions of our work are three-fold:

1) For the first time, we reveal that the pixel discrimination task lacks semantic category decision boundary reasoning capability. This insufficiency means that the transferred model cannot accurately assign the same category label to pixels from one object, resulting in the gap between pre-trained models and downstream dense prediction tasks. Consequently, we model the semantic decision boundary explicitly to narrow this gap.

2) We propose a dense cross-image semantic contrastive learning framework for multi-granularity representation learning. Unlike the previous SSL pre-training methods, the framework considers the semantic relations of both intra- and inter-image pixels. We learn the discriminative information in the instance, pixel, and pixel category granularity simultaneously to ensure the diversity of the intra-class features and the discrimination of the inter-class features at the pixel level.

3) We transfer the models pre-trained on ImageNet (Deng et al., 2009) and MS COCO (Lin et al., 2014) to abundant downstream dense prediction tasks. All the experimental results show that DSC achieves superior or comparable performance to previous works (Chen et al., 2020b; Wang et al., 2020), which once again proves the importance of semantic relations in self-supervised visual representation learning.

Figure 2. Architecture of the DSC framework for multi-granularity representation learning.

2. Related Works

2.1. Instance Discrimination

The concept of instance discrimination can be traced back to IR (Wu et al., 2018), which conducts contrastive learning by treating each sample as a separate class, pulling positive samples closer while pushing negative ones away to learn instance-specific discriminative representations. MoCo (He et al., 2020a) adopts an online encoder and a momentum encoder that receive two views of a sample as the positive pair. Additionally, a momentum-updated queue is built to store negative samples. SimCLR (Chen et al., 2020a) adopts a batch size as large as 4096 in its experiments to push SSL pre-training to an effect comparable with supervised methods. Meanwhile, it also carefully sorts out the tricks that are very useful for improving SSL, such as longer training time, adding MLP projectors, and stronger data augmentation. Inspired by SimCLR (Chen et al., 2020a), He et al. added the projector to MoCo (He et al., 2020a) and proposed MoCo-v2 (Chen et al., 2020b), which refreshes SSL performance once again. Without adopting any negative samples, BYOL (Grill et al., 2020) adds a predictor to learn the mapping from the online encoder to the momentum encoder instead of explicitly contrasting positive samples; meanwhile, through the stop-gradient mechanism, negative samples are skillfully discarded. SimSiam (Chen and He, 2020) lets the target encoder and the online encoder be the same and points out that the predictor and the stop-gradient mechanism are sufficient conditions for training a strong SSL encoder for pre-training.

2.2. Semantics Discovery of Instance

Neighbors Discovery AND (Huang et al., 2019) discovers sample-anchored neighborhoods to reason about the underlying class decision boundaries, but it is restricted by the small size of local neighborhoods. PAD (Huang et al., 2020a) is based on self-discovering semantically consistent groups of unlabelled training samples with the same class concepts through a progressive affinity diffusion process. (Zhuang et al., 2019) trains an embedding function to maximize a metric of local aggregation, causing similar data instances to move together in the embedding space while allowing dissimilar instances to separate.

Deep Clustering As a common technique for mining unlabeled data to learn higher-level visual understanding, deep clustering (Caron et al., 2018; Asano et al., 2020; Niu et al., 2020; Huang et al., 2020b; Zhan et al., 2020; Ji et al., 2019; Caron et al., 2020; Zhang et al., 2020) has been extended to deep neural networks. DeepCluster (Caron et al., 2018) is a representative method of alternate learning, which iteratively groups features with k-means and uses the subsequent assignments to update the deep network. Another recent work, SeLa (Asano et al., 2020), makes cluster assignments by solving an optimal transport problem and alternately performs representation learning and self-labeling. Another mode of deep clustering is pretext supervision, whose objective is to design a pretext task with specific objectives that learn label assignment and feature updates simultaneously, indirectly imposing the requirements for learning a good clustering. GATCluster (Niu et al., 2020) designs four self-learning tasks with the constraints of transformation invariance, separability maximization, entropy analysis, and attention mapping to directly output semantic cluster labels without further post-processing. PICA (Huang et al., 2020b) learns the most semantically plausible data separation by maximizing the "global" partition confidence of the clustering solution. Much work has turned to online clustering (Zhan et al., 2020; Ji et al., 2019; Caron et al., 2020) to reduce the error accumulation and the irrelevance of the pretext task to downstream tasks that occur in offline clustering. ODC (Zhan et al., 2020) designs and maintains two dynamic memory modules to perform clustering and network updating simultaneously. IIC (Ji et al., 2019) maximizes the mutual information between the class assignments of each pair to output semantic labels. SwAV (Caron et al., 2020) predicts the cluster assignment of one view from the representation of another view, simultaneously clustering the data while enforcing consistency between cluster assignments produced for different views of the same image, instead of comparing features directly as in traditional contrastive learning.

2.3. Pixel Discrimination

Global (instance-level) representations are efficient to compute but provide low-resolution features invariant to pixel-level variations. This might be sufficient for tasks like image classification but is not enough for dense prediction tasks (Pinheiro et al., 2020). For better transfer to downstream dense prediction tasks, (Pinheiro et al., 2020; Xie et al., 2020; Wang et al., 2020) focus on contrastive learning based on pixel discrimination. VADeR (Pinheiro et al., 2020) forces pixel representations to be viewpoint-agnostic through the correspondence of spatial coordinates; that is, the positive sample pairs come from the intersection of the two views, while pixels from different images are negative samples. PixPro (Xie et al., 2020) adds a Pixel-to-Propagation Module emphasizing pixel-to-propagation consistency, which encourages spatially close pixels to be similar and can aid prediction in areas that belong to the same label. However, neither of these two methods considers the situation where the multiple views do not overlap at all. DenseCL (Wang et al., 2020) selects positive samples by ranking the similarities between all pixels instead of utilizing spatial correspondence, and is therefore no longer restricted by the requirement that the views intersect.

3. Methods

3.1. Sample Discrimination

Instance Discrimination For self-supervised representation learning, the breakthrough approaches are (Wu et al., 2018; Chen et al., 2020a; He et al., 2020a; Chen et al., 2020b; Grill et al., 2020; Chen and He, 2020), which employ instance discrimination based on contrastive learning to learn good representations from unlabeled data. As our baseline is MoCo-v2 (Chen et al., 2020b), we briefly introduce instance-level contrastive learning based on it. Two transformations t and t' are randomly applied to a given sample x, producing two views v = t(x) and v' = t'(x), where t and t' are drawn from a set of transformations T (the details are introduced in the experiments). After feeding them into an online encoder f_q and a momentum encoder f_k respectively, we get the feature vectors f_q(v) and f_k(v'), as shown in Figure 2. In order to project the feature vectors into an embedding space with a specific dimension, the two-layer global MLP projector of MoCo-v2 (Chen et al., 2020b) is applied, namely g_q and g_k for the two views, and the embeddings can be expressed as q = g_q(f_q(v)) and k = g_k(f_k(v')). The positive sample pair consists of the two views from different transformations (q and k_+), and the negative keys k_i come from the momentum queue. The contrastive loss function InfoNCE (van den Oord et al., 2018) is employed to pull the positive pair closer while pushing it away from the other negative keys:

L_ins = -log( exp(sim(q, k_+)/τ_ins) / Σ_i exp(sim(q, k_i)/τ_ins) ),    (1)

where τ_ins is a temperature controlling the instance-level concentration of the distribution (Wu et al., 2018). The pair-wise similarity is the cosine distance:

sim(u, v) = (u · v) / (‖u‖ ‖v‖).    (2)
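As an illustration, the instance-level InfoNCE objective of Eqs. 1 and 2 can be sketched in a few lines of NumPy. The shapes and the bank of negative keys below are simplified stand-ins for MoCo-v2's momentum queue, not the paper's actual implementation:

```python
import numpy as np

def cosine_sim(u, v):
    """Pairwise cosine similarity between the rows of u and the rows of v."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return u @ v.T

def info_nce(q, k_pos, k_neg, tau=0.2):
    """Instance-level InfoNCE: one positive key per query embedding,
    plus a shared set of negative keys (the momentum queue)."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    kn = k_pos / np.linalg.norm(k_pos, axis=1, keepdims=True)
    pos = np.sum(qn * kn, axis=1) / tau            # positive logits, shape (N,)
    neg = cosine_sim(q, k_neg) / tau               # negative logits, shape (N, K)
    logits = np.concatenate([pos[:, None], neg], axis=1)
    # negative log-softmax of the positive logit, averaged over the batch
    log_prob = pos - np.log(np.exp(logits).sum(axis=1))
    return -log_prob.mean()
```

A query contrasted against its own view (cosine similarity 1) always yields a lower loss than a query contrasted against an unrelated positive, which is the behavior the loss is designed to exploit.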
Pixel Discrimination Instance-granularity discrimination regards each image as a separate individual, which helps obtain a relatively global discriminative feature representation. Based on this, we conduct pixel discrimination within each image to get dense, distinguishable feature representations for each pixel. Similarly, we feed the two views of a sample into the two encoders. Different from the instance discrimination task, we replace the global projector of the instance level with a dense projector, namely h_q and h_k. The corresponding dense feature representations are denoted as r = h_q(f_q(v)) and r' = h_k(f_k(v')), and contrastive learning is carried out in the dense embedding space:

L_pix = -(1/N) Σ_i log( exp(sim(r_i, r'_{j(i)})/τ_pix) / Σ_k exp(sim(r_i, t_k)/τ_pix) ),    (3)

where τ_pix is a temperature controlling the pixel-level concentration of the distribution, and t_k ranges over the positive key r'_{j(i)} and the negative pixel keys. Following DenseCL (Wang et al., 2020), we obtain the correspondence between the two dense views by calculating the distance of each pixel pair and taking the closest one as the positive pair:

j(i) = argmax_j sim(r_i, r'_j),    (4)

where j(i) is the index of the pixel most similar to the pixel vector r_i. By ranking similarities instead of adopting spatial information to obtain the correspondence between two views (Pinheiro et al., 2020; Xie et al., 2020), the situation where the two views do not overlap at all is also handled.
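
A minimal sketch of the similarity-ranking correspondence of Eq. 4 (the DenseCL-style matching rule; the toy features here are made up for illustration):

```python
import numpy as np

def dense_correspondence(r1, r2):
    """For each pixel vector in view 1, return the index of the most
    similar pixel vector in view 2 under cosine similarity ranking."""
    a = r1 / np.linalg.norm(r1, axis=1, keepdims=True)
    b = r2 / np.linalg.norm(r2, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)  # j(i) for every pixel i
```

When view 2 is simply a permutation of view 1's pixel vectors, the rule recovers the permutation exactly, with no spatial information involved.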

3.2. Semantics Mining

After sample discrimination, two problems remain. First, all the pixels within each instance (i.e., image) are pushed apart, ignoring its inherent pixel-level semantic structure. Taking the image containing a dog and a cat in Figure 1 as an example, all the pixels of the dog should be close to each other, those of the cat should also be close to each other, while the pixels of these two objects should be pushed apart. Second, after pixel-level discrimination, there is still no connection between the pixels of different images. As shown in Figure 1, the sample discrimination process does not consider that the pixels of the same kind of object (i.e., dogs) in the two images should be drawn closer. On the contrary, it blindly pushes them farther apart merely because they belong to different images. For the first problem, we propose to search for neighbors from multiple views, and for the second problem, we adopt clustering methods to reassign the label of each pixel among images (i.e., across images).

Figure 3. Illustration of neighbors discovery. The solid yellow circles are pixels from the object “dog” in different views (a and b).

Neighbors Discovery Based on sample discrimination, we try to discover relative samples for each pixel to solve the first problem, the lack of pixel-level semantic structure within images. We define these relative samples as the neighbors of a pixel, which should be treated as positive samples to draw closer. Let K_n denote the number of neighbors for pixel i and N(i) the set of its neighbor indices. The adjusted contrastive loss can be expressed as:

L_neigh = -(1/N) Σ_i (1/(1+K_n)) [ log p(r_i, r'_{j(i)}) + Σ_{n∈N(i)} log p(r_i, r_n) ],    (5)

where p(r_i, r_p) = exp(sim(r_i, r_p)/τ_pix) / Σ_k exp(sim(r_i, t_k)/τ_pix) is the contrastive probability of Eq. 3. Through this additional pulling operation, the inherent pixel-level semantic structure within images is explored explicitly to a certain extent. Pixels are not only close to their augmented versions from different views but also adjacent to their neighbors from the same view, as shown in Figure 3. For each pixel, the neighbors are discovered by ranking the similarity of each pixel pair within one image, and we select the top-K_n pixels as its neighbors. In fact, we can not only draw close the neighbors in the same view but also pull close the neighbors from different views; in theory, all these samples should belong to the same semantic category. In this paper, only the neighbors in the same view are discussed, and the performance is improved evidently, which benefits from the semantic information brought by the neighbors.

Furthermore, we construct a triplet relation of pixels (i.e., a pixel r_i, its augmented version r'_{j(i)} from another view, and its neighbors r_{n_j} from the same view). Our objective is to force the distance between r_i and r'_{j(i)} to be shorter than the distance between r_i and its j-th neighbor r_{n_j}. The formal expression is

L_tri = Σ_i Σ_j max(0, sim(r_i, r_{n_j}) - sim(r_i, r'_{j(i)}) + m),    (6)

where m is the margin parameter, set to 0.3 in the experiments. The pair-wise similarity is the cosine distance defined in Eq. 2.

Deep Clustering For the second problem, the lack of global relations across images, we explore clustering methods to model the semantic category decision boundaries for high-level visual understanding. A natural practice is to perform k-means on the pixel-level embedded features of each image to get a certain number of clusters and then carry out contrastive learning for each cluster: the pixels within a specific cluster are drawn closer to each other, and the samples in different clusters are pushed farther apart. However, the clustering results differ from image to image, so contrastive learning would have to be carried out in each cluster of each image, which is very time-consuming and computation-intensive. Therefore, it is necessary to explore a way to save time and effort simultaneously. Instead of contrasting the obtained features offline, we design two modules (i.e., KM and PM) that implement online clustering through cluster alignment, as shown in Figure 4.

Figure 4. Illustration of cluster alignment; the clustering process can be k-means or prototype mapping.

DSC-KM K-means is a general way to cluster samples but is time-consuming and offline only, which cannot be used in large-scale application scenarios. We adjust the k-means algorithm to achieve online clustering. Specifically, all the pixels in a mini-batch are clustered together, considering not only the pixels with similar semantics in one image but also the pixels with similar semantics across different images. We compute the pixel centroid embeddings using only the pixel embeddings within a batch, and then contrast the centroid embeddings with Eq. 7:

L_km = -Σ_{c=1}^{K} log( exp(sim(z_c, z'_c)/τ_c) / Σ_{c'=1}^{K} exp(sim(z_c, z'_{c'})/τ_c) ),    (7)

where τ_c is the temperature coefficient and z_c, z'_c are the c-th centroid embeddings from the two views. We do not cluster the pixels within a single image, as that limits semantic category relation modeling to one image and ignores the "global" context of the training data, i.e., the rich semantic relations between pixels across different images. DSC-KM overcomes the shortcoming of the previous pixel-level approaches that ignore semantic information, and solves the problem of blindly pushing away all instances without considering the semantic relations among them. Moreover, it realizes online clustering and thus avoids the heavy time cost of offline clustering, addressing all three issues at once.
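
The batch-level clustering step can be sketched as plain k-means over every pixel embedding in a mini-batch, with cosine-similarity assignment; this is a toy version under our own simplifications, not the paper's exact procedure:

```python
import numpy as np

def batch_kmeans(pixels, k, iters=10, seed=0):
    """Plain k-means over all pixel embeddings of a mini-batch, so that
    clusters may span different images (the cross-image setting of DSC-KM)."""
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)].astype(float)
    assign = np.zeros(len(pixels), dtype=int)
    for _ in range(iters):
        # assign every pixel to its most cosine-similar centroid
        pn = pixels / np.linalg.norm(pixels, axis=1, keepdims=True)
        cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        assign = np.argmax(pn @ cn.T, axis=1)
        for c in range(k):
            members = pixels[assign == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids, assign
```

The resulting centroid embeddings from the two views are then the z_c and z'_c contrasted in Eq. 7.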

DSC-PM We also explore a prototype-mapping approach to realize pixel clustering efficiently. Specifically, we cluster the cross-image pixels and simultaneously enforce consistency between the cluster assignments from different views of the same pixel. Given two pixel embeddings z_i and z'_i from two views of pixel i, we compute their codes (or cluster assignments) q_i and q'_i by matching these embeddings to a set of prototypes C = {c_1, ..., c_K}. We then set up a "swapped" prediction problem (Caron et al., 2020):

L_pm = ℓ(z_i, q'_i) + ℓ(z'_i, q_i),    (8)

The function ℓ(z, q) measures the fit between the feature z and the code q, and can be represented by a cross-entropy loss between the code and the probability p obtained by taking a softmax over the dot products of z and all prototypes in C:

ℓ(z, q) = -Σ_k q^(k) log p^(k),   p^(k) = exp(z · c_k / τ_p) / Σ_{k'} exp(z · c_{k'} / τ_p).    (9)
Instead of contrasting the specific clustering results one by one, we force the alignment of cluster assignments from different views, making the network learn more semantically discriminative representations at the pixel level.
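
The swapped-prediction step can be sketched as follows. The hard argmax codes here stand in for the Sinkhorn-based soft assignments used in SwAV, so this is only an illustrative approximation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def swapped_prediction_loss(z1, z2, prototypes, tau=0.2):
    """SwAV-style swapped prediction for pixel embeddings: the code of one
    view supervises the prototype scores of the other view."""
    def scores(z):
        zn = z / np.linalg.norm(z, axis=1, keepdims=True)
        cn = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
        return zn @ cn.T / tau
    p1, p2 = softmax(scores(z1)), softmax(scores(z2))
    # hard one-hot codes approximate SwAV's Sinkhorn assignments
    q1 = np.eye(len(prototypes))[scores(z1).argmax(axis=1)]
    q2 = np.eye(len(prototypes))[scores(z2).argmax(axis=1)]
    ce = lambda q, p: -(q * np.log(p + 1e-12)).sum(axis=1).mean()
    return ce(q2, p1) + ce(q1, p2)
```

Two views whose pixels map to the same prototypes produce a much lower loss than views with mismatched assignments, which is exactly the consistency being enforced.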

Discussion DSC-KM and DSC-PM impose different constraints on the SSL model to explore the semantic decision boundaries among pixels, and both prove clearly effective. DSC-KM requires the centroids of the pixel clustering to be consistent across the two views, while DSC-PM forces each pixel's class assignment to be consistent. Comparatively, DSC-PM imposes a stricter constraint on the model than DSC-KM.

3.3. Multi-granularity Framework

Our proposed dense cross-image semantic contrastive learning framework considers representation learning at multiple granularities to obtain not only low- and middle- but also high-level visual understanding, as shown in Figure 2. The overall objective function comprises three components corresponding to the different granularities: the instance discrimination loss L_ins, the pixel discrimination loss L_pix, and the semantic discrimination loss L_sem:

L = λ_ins L_ins + λ_pix L_pix + λ_sem L_sem,    (10)

Note that L_sem can be L_neigh, L_tri, L_km, or L_pm, only one of which is used in a given DSC model. We have explored the influence of different orders of magnitude of the weights on the model, and the experimental settings with the best performance are reported in the paper.
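The overall objective is then just a weighted combination in which exactly one semantic term is active. The weights and the dictionary keys below are hypothetical placeholders (the paper only states that the orders of magnitude of the weights were tuned):

```python
def dsc_objective(l_ins, l_pix, semantic_losses, active="km",
                  weights=(1.0, 1.0, 1.0)):
    """Multi-granularity DSC loss: instance + pixel + one chosen semantic
    term (e.g. "km" for DSC-KM or "pm" for DSC-PM); weights are placeholders."""
    w_ins, w_pix, w_sem = weights
    return w_ins * l_ins + w_pix * l_pix + w_sem * semantic_losses[active]
```

Selecting a different `active` key swaps the semantic strategy without touching the instance and pixel terms, mirroring how the paper evaluates one semantic loss at a time.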
4. Experiments

Following recent self-supervised methods (Chen et al., 2020b; Wang et al., 2020), we adopt ResNet-50 (He et al., 2016) as our backbone. The model is pre-trained on ImageNet and MS COCO respectively. Our objective is to verify how well the feature representations learned by the pre-trained model transfer to downstream dense prediction tasks. Therefore, we evaluate object detection and segmentation on various datasets: object detection on PASCAL VOC, semantic segmentation on PASCAL VOC and Cityscapes, and object detection and instance segmentation on MS COCO. Besides, we conduct our ablation study with object detection and semantic segmentation on PASCAL VOC, adopting the model pre-trained on MS COCO.

In the pre-training stage, we follow the data augmentation of DenseCL (Wang et al., 2020): two crops are randomly sampled from the image and resized to 224 × 224 with a random horizontal flip, followed by random grayscale conversion; color jittering and Gaussian blur are also applied at random.

Method AP AP50 AP75
MoCo-v2 CC* 52.1 79.0 56.7
DenseCL CC* 56.4 81.8 62.7
DSC-KM CC (Ours) 57.0 82.1 63.0
DSC-PM CC (Ours) 57.2 82.3 63.4
SimCLR IN (Chen et al., 2020a) 51.5 79.4 55.6
BYOL IN (Grill et al., 2020) 51.9 81.0 56.5
MoCo IN (He et al., 2020a) 55.9 81.5 62.6
MoCo-v2 IN* 57.1 82.0 63.9
DenseCL IN* 58.4 82.7 65.7
DSC-KM IN (Ours) 58.7 82.7 65.6
DSC-PM IN (Ours) 58.6 82.8 65.6
Table 1. The performance of PASCAL VOC object detection (AP, AP50, AP75). CC and IN indicate pre-training on MS COCO and ImageNet respectively; * represents our re-implementation.

4.1. Experimental Settings

Pre-training Following MoCo-v2 (Chen et al., 2020b), we use the same training settings; for example, the temperature is 0.2 (in our work, τ_ins, τ_pix, and τ_c are all set to 0.2). The initial learning rate is 0.3. An SGD optimizer with a Nesterov momentum of 0.9 and a weight decay of 1e-4 is adopted. All the models are optimized on 8 V100 GPUs with a cosine learning rate decay schedule and a mini-batch size of 256. We train 800 epochs for MS COCO and 200 epochs for ImageNet. Note that we re-implement DenseCL and MoCo-v2 and achieve comparable results. For a fair comparison, all the experiments, including the other re-implemented methods and ours, share the same training settings, and we report the re-implemented results as they are comparable with those in the original papers.

Downstream task training In order to further evaluate the quality of the feature representations learned by the pre-trained models, we apply them to various downstream dense prediction tasks with fine-tuning. For PASCAL VOC object detection, we fine-tune a Faster R-CNN (Ren et al., 2017) detector with a C4 backbone adopting a standard 2x schedule with detectron2 (Wu et al., 2019); the training set is trainval07+12 and the test set is test2007. For PASCAL VOC and Cityscapes semantic segmentation, we fine-tune an FCN (Shelhamer et al., 2017) model for 20k and 40k iterations respectively. Different from DenseCL (Wang et al., 2020), we adopt train2012 for training and val2012 for evaluation on PASCAL VOC, while train_aug2012 was used for training in the former. For Cityscapes, train_fine is used for training and val for evaluation. For object detection and instance segmentation on MS COCO, we fine-tune a Mask R-CNN (He et al., 2020b) detector with an FPN backbone, adopting a standard 1x schedule; the model is trained on train2017 and evaluated on val2017. We use AP, AP50 and AP75 to evaluate object detection, AP^mk, AP^mk_50 and AP^mk_75 for instance segmentation, and mean Intersection over Union (mIoU) for semantic segmentation.

4.2. Main Results

PASCAL VOC object detection We compare our methods with state-of-the-art approaches on PASCAL VOC object detection. As shown in Table 1, for models pre-trained on both MS COCO and ImageNet, DSC-KM and DSC-PM outperform the two previous methods. In particular, DSC-PM achieves 5.1% higher AP than MoCo-v2 and 0.8% higher than the baseline DenseCL with MS COCO pre-training, while DSC-KM is 1.6% higher than MoCo-v2 and 0.3% higher than DenseCL with ImageNet pre-training. The improvements in Table 1 indicate that the gap between pre-trained models and downstream dense prediction tasks narrows substantially with the help of reasoning about semantic category decision boundaries.

(a) PASCAL VOC (mIoU)
MoCo-v2 CC* 48.6
DenseCL CC* 56.7
DSC-KM CC (Ours) 57.7
DSC-PM CC (Ours) 57.9
MoCo-v2 IN* 56.3
DenseCL IN* 58.9
DSC-KM IN (Ours) 59.3
DSC-PM IN (Ours) 59.6

(b) Cityscapes (mIoU)
MoCo-v2 CC* 72.4
DenseCL CC* 75.3
DSC-KM CC (Ours) 75.7
DSC-PM CC (Ours) 75.5
MoCo-v2 IN* 72.6
DenseCL IN* 75.5
DSC-KM IN (Ours) 76.0
DSC-PM IN (Ours) 75.6

Table 2. The performance of semantic segmentation on (a) PASCAL VOC and (b) Cityscapes. CC and IN indicate pre-training on MS COCO and ImageNet respectively; * represents our re-implementation.

PASCAL VOC and Cityscapes semantic segmentation Table 2 (a) reports PASCAL VOC semantic segmentation with models pre-trained on MS COCO and ImageNet respectively. Both DSC-KM and DSC-PM improve the performance by a large margin. In particular, DSC-PM outperforms MoCo-v2 by 9.3% and DenseCL by 1.2% when pre-trained on MS COCO, and outperforms MoCo-v2 by 3.3% and DenseCL by 0.7% when pre-trained on ImageNet, which again verifies the effectiveness of semantic category decision boundary modeling. In Table 2 (b), we report the semantic segmentation results on Cityscapes. For the models pre-trained on MS COCO, DSC-KM is 3.3% higher than MoCo-v2 and 0.4% higher than DenseCL; for the models pre-trained on ImageNet, DSC-KM is 3.4% higher than MoCo-v2 and 0.5% higher than DenseCL. The significant improvements on semantic segmentation demonstrate that the semantic category label assignment of our methods is more accurate than that of the compared SSL pre-training approaches.

Method AP^b AP^b_50 AP^b_75 AP^mk AP^mk_50 AP^mk_75
MoCo-v2 CC* 37.0 55.9 40.2 33.5 53.1 35.9
DenseCL CC* 38.8 58.4 42.6 35.1 55.4 37.7
DSC-KM CC (Ours) 39.2 58.8 42.8 35.5 55.9 38.0
DSC-PM CC (Ours) 39.0 58.6 42.5 35.1 55.5 37.7
MoCo-v2 IN* 38.9 58.5 42.5 35.2 55.6 37.8
DenseCL IN* 39.2 58.7 42.9 35.5 56.0 37.7
DSC-KM IN (Ours) 39.4 58.8 43.0 35.6 56.1 38.1
DSC-PM IN (Ours) 39.4 58.9 43.2 35.7 56.1 38.3
Table 3. The performance of MS COCO object detection and instance segmentation. CC and IN indicate pre-training on MS COCO and ImageNet respectively; * represents our re-implementation.

MS COCO object detection and instance segmentation The results of MS COCO object detection and instance segmentation are shown in Table 3. With MS COCO pre-training, DSC-KM is 2.2% AP^b and 2.0% AP^mk higher than MoCo-v2, and 0.4% AP^b and 0.4% AP^mk higher than DenseCL. With ImageNet pre-training, DSC-PM achieves 0.5% AP^b and 0.5% AP^mk higher performance than MoCo-v2, and 0.2% AP^b and 0.2% AP^mk higher than DenseCL. The improvements are limited for both MS COCO and ImageNet pre-trained models, as MS COCO contains a lot of authentic scene images; coping with the difficulties of visual tasks in complex scenes remains challenging for SSL pre-training.

Strategy    AP     AP50   AP75   mIoU
-           56.4   81.8   62.7   56.7
Neighbor    56.6   81.6   63.0   57.5
Triplet     55.5   80.9   61.4   53.5
CE          56.8   81.9   63.0   58.1
KM          56.8   81.9   62.8   57.7
PM          57.1   82.2   63.3   57.9
Table 4. Comparison of different semantic strategies on PASCAL VOC object detection (AP, AP50, AP75) and semantic segmentation (mIoU). "-" represents our baseline without any semantic strategy.

4.3. Ablation Study

We conduct extensive ablation studies to explore the importance of each component in our framework. Specifically, we first compare various strategies for supplying semantic category information. We then discuss the influence of the number of clusters on performance. Finally, we study the significance of discriminative representation learning at different granularities.

Figure 5. The performance and efficiency with different numbers of clusters K for DSC-KM on PASCAL VOC object detection.

Semantic strategies We explore five strategies to model the semantic category decision boundaries explicitly in Table 4. "Neighbor" means the neighbor-discovery method mentioned before, with the number of neighbors taken as 1. "Triplet" constructs a triple from one pixel, its augmented version, and its neighbor in the same view, and applies a triplet loss on top of pixel discrimination. "CE" adopts a cross-entropy loss on the cluster codes obtained by prototype mapping. "PM" and "KM" denote the semantics-mining methods mentioned before. The experimental results show that all the strategies improve the performance of the downstream dense prediction tasks to some extent, indicating that supplying semantics for pixels helps recover better semantic structure in the dataset. Moreover, "CE", "PM", and "KM" perform better than "Neighbor" and "Triplet", which demonstrates that exploring global relations among pixels across images is more effective than mining local relations among pixels within a single image.
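To make the "Triplet" strategy concrete, the following is a minimal sketch of a margin-based triplet loss over pixel embeddings. This is illustrative only: the function name, distance choice, and margin value are our assumptions, not details from the paper.

```python
import numpy as np

def pixel_triplet_loss(anchor, positive, neighbor, margin=0.3):
    """Margin-based triplet loss on pixel embeddings.

    anchor:   (D,) embedding of a pixel
    positive: (D,) the same pixel under another augmented view
    neighbor: (D,) a nearby pixel in the same view, used as the negative
    """
    def dist(a, b):
        return np.linalg.norm(a - b)
    # Pull the augmented view closer than the in-view neighbor by `margin`.
    return max(0.0, dist(anchor, positive) - dist(anchor, neighbor) + margin)

# Toy usage: anchor and positive coincide, and the neighbor is far enough
# away that the hinge is inactive, so the loss is zero.
a = np.array([1.0, 0.0])
p = np.array([1.0, 0.0])
n = np.array([0.0, 1.0])
loss = pixel_triplet_loss(a, p, n)
```

In practice such a loss would be averaged over many sampled pixel triples per mini-batch; the sketch shows only the per-triple term.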

Number of clusters K We explore the effect of the number of clusters K used to group the pixel embeddings from the backbone. Figure 5 shows that in DSC-KM, downstream task performance improves as K increases, indicating that moderate over-clustering is beneficial for semantic representation learning. However, the time cost also grows along with K. Balancing the performance improvement against the time cost, we fix K accordingly for KM and PM in our experiments.
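The clustering step behind DSC-KM can be sketched as standard k-means over flattened pixel embeddings, producing per-pixel pseudo-labels. This is a minimal numpy sketch under our own assumptions (the function name and the deterministic farthest-point initialization are illustrative; any standard k-means initialization would do):

```python
import numpy as np

def kmeans_pseudo_labels(pixels, k, iters=20):
    """Cluster (N, D) pixel embeddings into k prototypes; return pseudo-labels."""
    # Farthest-point initialization: deterministic and well spread out.
    centroids = [pixels[0]]
    for _ in range(1, k):
        d = np.min([((pixels - c) ** 2).sum(-1) for c in centroids], axis=0)
        centroids.append(pixels[d.argmax()])
    centroids = np.stack(centroids)
    for _ in range(iters):
        # Assign each pixel to its nearest centroid (squared L2 distance).
        d = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Update each centroid to the mean of its assigned pixels.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = pixels[labels == j].mean(axis=0)
    return labels, centroids

# Toy usage: two well-separated blobs of embeddings recover two clusters.
rng = np.random.default_rng(1)
blob_a = rng.normal(0.0, 0.1, size=(50, 8))
blob_b = rng.normal(5.0, 0.1, size=(50, 8))
labels, _ = kmeans_pseudo_labels(np.vstack([blob_a, blob_b]), k=2)
```

The resulting pseudo-labels can then supervise a cross-entropy or contrastive term over pixels, which is where the choice of K trades off granularity against clustering time.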

Granularities of representation learning The influence of each component corresponding to a different visual granularity is investigated in Table 5. As more granularities are considered in the framework, the performance on downstream tasks shows a progressive upward trend. By jointly learning representations at multiple granularities, the DSC model obtains not only low- and middle-level visual understanding at the instance and pixel levels, but also high-level visual understanding at the semantic category level. This multi-granularity design benefits accurate category assignment when carrying out downstream dense prediction tasks. We also experimented with only pixel or only semantic granularity, but neither model converged; DenseCL likewise reports that contrastive learning at the pixel level alone does not converge.
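The multi-granularity objective described above amounts to summing contrastive terms computed at the instance, pixel, and semantic levels. A minimal InfoNCE-style sketch in numpy follows; the temperature, the unit weights, and the function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def info_nce(query, positive, negatives, tau=0.2):
    """InfoNCE loss for one query with one positive and a list of negatives."""
    def sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([sim(query, positive)] +
                      [sim(query, n) for n in negatives]) / tau
    # Softmax cross-entropy with the positive at index 0.
    logits -= logits.max()
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

def multi_granularity_loss(l_instance, l_pixel, l_semantic, w=(1.0, 1.0, 1.0)):
    """Total loss: weighted sum of instance-, pixel-, and semantic-level terms."""
    return w[0] * l_instance + w[1] * l_pixel + w[2] * l_semantic

# Toy usage: one query against one positive and two negatives; the same
# value stands in for all three granularities here.
q = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
negs = [np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
l_inst = info_nce(q, pos, negs)
total = multi_granularity_loss(l_inst, l_inst, l_inst)
```

In a real pipeline each term would be averaged over the mini-batch, with pixel and semantic queries drawn from dense feature maps rather than a single vector.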

Figure 6. Comparison of the heat map visualization under MoCo-v2, DenseCL, our DSC-KM and DSC-PM by Grad-CAM (Selvaraju et al., 2017) on MS COCO dataset.
instance  pixel  semantics   AP     AP50   AP75   mIoU
✓                            54.7   81.0   60.6   48.6
✓         ✓                  56.4   81.8   62.7   56.7
✓         ✓      ✓           57.1   82.2   63.3   57.9
Table 5. The influence of different granularities on PASCAL VOC object detection and semantic segmentation.

4.4. Visualization

Feature representation visualization Downstream dense prediction tasks require more explicit object boundary information. To further understand the feature representations learned by different pre-trained models, we utilize Grad-CAM (Selvaraju et al., 2017) to visualize the heat maps of these models pre-trained on MS COCO. As illustrated in Figure 6, MoCo-v2 attends to discriminative features conducive to classification, while pixel-level methods (DenseCL and ours) focus on regional features suited to downstream dense prediction tasks. In particular, our DSC model is more sensitive to object boundary information than DenseCL, as we focus on mining the inherent semantic relations among pixels.
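Grad-CAM itself reduces to weighting the last conv layer's feature maps by their spatially pooled gradients and applying a ReLU. A minimal numpy sketch with illustrative array shapes (a real implementation would obtain the gradients via autograd hooks):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM heat map from conv features and their gradients.

    feature_maps: (C, H, W) activations of the last conv layer.
    gradients:    (C, H, W) gradients of the class score w.r.t. those maps.
    """
    # Channel importance alpha_c = global-average-pooled gradient.
    alphas = gradients.mean(axis=(1, 2))                                 # (C,)
    # ReLU of the alpha-weighted sum over channels.
    cam = np.maximum((alphas[:, None, None] * feature_maps).sum(0), 0)   # (H, W)
    # Normalize to [0, 1] for visualization.
    return cam / cam.max() if cam.max() > 0 else cam

# Toy usage: a single active feature location dominates the heat map.
fmap = np.zeros((2, 4, 4))
fmap[0, 1, 1] = 1.0            # one active spatial location in channel 0
grads = np.ones((2, 4, 4))     # uniform positive gradient
heat = grad_cam(fmap, grads)
```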

Downstream task visualization We visualize the results of different models pre-trained on MS COCO and fine-tuned on the PASCAL VOC semantic segmentation task. From the third row of Figure 7, we can see that MoCo-v2 is prone to misjudging pixel categories, as it focuses only on instance-level discriminative learning and lacks denser observation of feature representations. Although DenseCL is devoted to the pixel discrimination task, the fourth row of Figure 7 shows that it still makes some mistakes in pixel category assignment, largely because DenseCL does not explore the semantic relations among pixels. With our DSC model, segmentation quality improves, once again proving the effectiveness of our semantics-mining methods for downstream dense prediction tasks.

Figure 7. Comparison of semantic segmentation under MoCo-v2, DenseCL, our DSC-KM and DSC-PM on PASCAL VOC dataset.

5. Conclusion

In this work, we have proposed Dense Semantic Contrast (DSC) for self-supervised visual representation learning, which, for the first time, models semantic category decision boundaries at the pixel level to meet the demand for semantic representation in downstream dense prediction tasks. Furthermore, we have constructed a dense cross-image semantic contrastive framework for multi-granularity pre-training that considers low-, middle-, and high-level visual understanding simultaneously. The experimental results indicate the effectiveness of compensating for the absence of semantic category relations in previous methods. Our carefully designed approach aligns semantic category assignments, implicitly forcing the network to learn more semantically discriminative feature representations. In the future, we will explore more effective means of mining semantic relationships within datasets, and we will consider a hierarchical idea (Li et al., 2021b) for clustering. We expect that this first exploration of modeling semantic category decision boundaries will inspire more related work on pixel semantic mining.

Acknowledgments
Supported by the Open Research Project of the State Key Laboratory of Media Convergence and Communication, Communication University of China, China (No. SKLMCC2020KF004), the Beijing Municipal Science & Technology Commission (Z191100007119002), the Key Research Program of Frontier Sciences, CAS (Grant No. ZDBS-LY-7024), the National Natural Science Foundation of China (No. 62006221), and the CAAI-Huawei MindSpore Open Fund.

References
  • Y. M. Asano, C. Rupprecht, and A. Vedaldi (2020) Self-labelling via simultaneous clustering and representation learning. In ICLR, Cited by: §1, §2.2.
  • M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In ECCV, pp. 139–156. Cited by: §1, §1, §2.2.
  • M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, Cited by: §1, §1, §1, §2.2, §3.2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton (2020a) A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607. Cited by: §1, §1, §2.1, §3.1, Table 1.
  • X. Chen, H. Fan, R. B. Girshick, and K. He (2020b) Improved baselines with momentum contrastive learning. CoRR abs/2003.04297. Cited by: §1, §1, §1, §2.1, §3.1, §4.1, §4.
  • X. Chen and K. He (2020) Exploring simple siamese representation learning. CoRR abs/2011.10566. Cited by: §1, §1, §2.1, §3.1.
  • Y. Chen, W. Wang, Y. Zhou, F. Yang, D. Yang, and W. Wang (2020c) Self-training for domain adaptive scene text detection. In ICPR, pp. 850–857. Cited by: §1.
  • Y. Chen, Y. Zhou, D. Yang, and W. Wang (2019) Constrained relation network for character detection in scene images. In PRICAI, Vol. 11672, pp. 137–149. Cited by: §1.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li (2009) ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §1, §1, §1.
  • M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman (2010) The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88 (2), pp. 303–338. Cited by: §1.
  • J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. Á. Pires, Z. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko (2020) Bootstrap your own latent - A new approach to self-supervised learning. In NeurIPS, Cited by: §1, §1, §2.1, §3.1, Table 1.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. B. Girshick (2020a) Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9726–9735. Cited by: §1, §1, §2.1, §3.1, Table 1.
  • K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2020b) Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 42 (2), pp. 386–397. Cited by: §4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §4.
  • J. Huang, Q. Dong, S. Gong, and X. Zhu (2019) Unsupervised deep learning by neighbourhood discovery. In ICML, pp. 2849–2858. Cited by: §2.2.
  • J. Huang, Q. Dong, S. Gong, and X. Zhu (2020a) Unsupervised deep learning via affinity diffusion. In AAAI, pp. 11029–11036. Cited by: §1, §2.2.
  • J. Huang, S. Gong, and X. Zhu (2020b) Deep semantic clustering by partition confidence maximisation. In CVPR, pp. 8846–8855. Cited by: §1, §2.2.
  • J. Huang and S. Gong (2021) Deep clustering by semantic contrastive learning. CoRR abs/2103.02662. Cited by: §1.
  • X. Ji, A. Vedaldi, and J. F. Henriques (2019) Invariant information clustering for unsupervised image classification and segmentation. In ICCV, pp. 9864–9873. Cited by: §1, §1, §2.2.
  • N. Jiang, Y. Zhang, D. Luo, C. Liu, Y. Zhou, and Z. Han (2019) Feature hourglass network for skeleton detection. In CVPR Workshops, pp. 1172–1176. Cited by: §1.
  • W. Li, D. Luo, B. Fang, Y. Zhou, and W. Wang (2021a) Video 3d sampling for self-supervised representation learning. CoRR abs/2107.03578. Cited by: §1.
  • X. Li, Y. Zhou, Y. Zhou, and W. Wang (2021b) MMF: multi-task multi-structure fusion for hierarchical image classification. CoRR abs/2107.00808. Cited by: §5.
  • F. Lin and W. W. Cohen (2010) Power iteration clustering. In ICML, pp. 655–662. Cited by: §1.
  • T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755. Cited by: §1, §1.
  • D. Luo, B. Fang, Y. Zhou, Y. Zhou, D. Wu, and W. Wang (2020a) Exploring relations in untrimmed videos for self-supervised learning. CoRR abs/2008.02711. Cited by: §1.
  • D. Luo, C. Liu, Y. Zhou, D. Yang, C. Ma, Q. Ye, and W. Wang (2020b) Video cloze procedure for self-supervised spatio-temporal learning. In AAAI, pp. 11701–11708. Cited by: §1.
  • C. Niu, J. Zhang, G. Wang, and J. Liang (2020) GATCluster: self-supervised gaussian-attention network for image clustering. In ECCV, pp. 735–751. Cited by: §1, §2.2.
  • P. O. Pinheiro, A. Almahairi, R. Y. Benmalek, F. Golemo, and A. C. Courville (2020) Unsupervised learning of dense visual representations. In NeurIPS, Cited by: §1, §1, §2.3, §3.1.
  • Z. Qiao, X. Qin, Y. Zhou, F. Yang, and W. Wang (2020a) Gaussian constrained attention network for scene text recognition. In ICPR, pp. 3328–3335. Cited by: §1.
  • Z. Qiao, Y. Zhou, D. Yang, Y. Zhou, and W. Wang (2020b) SEED: semantics enhanced encoder-decoder framework for scene text recognition. In CVPR, pp. 13525–13534. Cited by: §1.
  • X. Qin, Y. Zhou, D. Wu, Y. Yue, and W. Wang (2020) FC2RN: A fully convolutional corner refinement network for accurate multi-oriented scene text detection. CoRR abs/2007.05113. Cited by: §1.
  • X. Qin, Y. Zhou, D. Yang, and W. Wang (2019) Curved text detection in natural scene images with semi- and weakly-supervised learning. In ICDAR, pp. 559–564. Cited by: §1.
  • S. Ren, K. He, R. B. Girshick, and J. Sun (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), pp. 1137–1149. Cited by: §4.1.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In ICCV, pp. 618–626. Cited by: Figure 6, §4.4.
  • E. Shelhamer, J. Long, and T. Darrell (2017) Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (4), pp. 640–651. Cited by: §4.1.
  • A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. CoRR abs/1807.03748. Cited by: §3.1.
  • X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li (2020) Dense contrastive learning for self-supervised visual pre-training. CoRR abs/2011.09157. Cited by: §1, §1, §1, §2.3, §3.1, §4.1, §4, §4.
  • Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Cited by: §4.1.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pp. 3733–3742. Cited by: §1, §1, §2.1, §3.1.
  • Z. Xie, Y. Lin, Z. Zhang, Y. Cao, S. Lin, and H. Hu (2020) Propagate yourself: exploring pixel-level consistency for unsupervised visual representation learning. CoRR abs/2011.10043. Cited by: §1, §1, §2.3, §3.1.
  • D. Yang, Y. Zhou, and W. Wang (2021) Multi-view correlation distillation for incremental object detection. CoRR abs/2107.01787. Cited by: §1.
  • D. Yang, Y. Zhou, D. Wu, C. Ma, F. Yang, and W. Wang (2020) Two-level residual distillation based triple network for incremental object detection. CoRR abs/2007.13428. Cited by: §1.
  • Y. Yao, C. Liu, D. Luo, Y. Zhou, and Q. Ye (2020) Video playback rate perception for self-supervised spatio-temporal representation learning. In CVPR, pp. 6547–6556. Cited by: §1.
  • X. Zhan, J. Xie, Z. Liu, Y. Ong, and C. C. Loy (2020) Online deep clustering for unsupervised representation learning. In CVPR, pp. 6687–6696. Cited by: §1, §1, §2.2.
  • Y. Zhang, C. Liu, Y. Zhou, W. Wang, W. Wang, and Q. Ye (2020) Progressive cluster purification for unsupervised feature learning. In ICPR, pp. 8476–8483. Cited by: §2.2.
  • Y. Zhang, Y. Zhou, and W. Wang (2021) Exploring instance relations for unsupervised feature embedding. CoRR abs/2105.03341. Cited by: §1.
  • C. Zhuang, A. L. Zhai, and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In ICCV, pp. 6001–6011. Cited by: §1, §2.2.