Mix-and-Match Tuning for Self-Supervised Semantic Segmentation

12/02/2017 · Xiaohang Zhan et al. · The Chinese University of Hong Kong

Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g., ImageNet and MS COCO, for network pre-training. To reduce annotation efforts, self-supervised semantic segmentation has recently been proposed to pre-train a network without any human-provided labels. The key to this new form of learning is to design a proxy task (e.g., image colorization), from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce discriminative representations for the target image segmentation task. Thus the performance of self-supervision still falls far short of supervised pre-training. In this study, we overcome this limitation by incorporating a "mix-and-match" (M&M) tuning stage in the self-supervision pipeline. The proposed approach is readily pluggable into many self-supervision methods and does not use more annotated samples than the original process. Yet, it is capable of boosting the performance on the target image segmentation task to surpass its fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the "mix" stage, which sparsely samples and mixes patches from the target set to reflect the rich and diverse local patch statistics of target images. A "match" stage then forms a class-wise connected graph, from which a strong triplet-based discriminative loss can be derived for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies, and no extra data or labels are required. With the proposed M&M approach, for the first time, a self-supervision method achieves comparable or even better performance than its ImageNet pre-trained counterpart on both the PASCAL VOC2012 and CityScapes datasets.








Semantic image segmentation is a classic computer vision task that aims at assigning each pixel in an image a class label such as "chair", "person", and "dog". It enjoys a wide spectrum of applications, such as scene understanding [Li, Socher, and Fei-Fei2009, Lin et al.2014, Li et al.2017b] and autonomous driving [Geiger et al.2013, Cordts et al.2016, Li et al.2017a]. Deep convolutional neural networks (CNNs) are now the state-of-the-art technique for semantic image segmentation [Long, Shelhamer, and Darrell2015, Liu et al.2015, Zhao et al.2017, Liu et al.2017]. The excellent performance, however, comes at the price of expensive and laborious label annotation. In most existing pipelines, a network is first pre-trained on millions of class-labeled images, e.g., ImageNet [Russakovsky et al.2015] and MS COCO [Lin et al.2014], and subsequently fine-tuned with thousands of pixel-wise annotated images.

Self-supervised learning (project page: http://mmlab.ie.cuhk.edu.hk/projects/M&M/) is a new paradigm for learning deep representations without extensive annotations. This technique has been applied to the task of image segmentation [Zhang, Isola, and Efros2016a, Larsson, Maire, and Shakhnarovich2016, Larsson, Maire, and Shakhnarovich2017]. In general, self-supervised image segmentation can be divided into two stages: the proxy stage and the fine-tuning stage. The proxy stage does not need any labeled data but requires one to design a proxy or pretext task with self-derived supervisory signals on unlabeled data. For instance, learning by colorization [Larsson, Maire, and Shakhnarovich2017] utilizes the fact that a natural image is composed of a luminance channel and chrominance channels. The proxy task is formulated with a cross-entropy loss to predict an image's chrominance from the luminance of the same image. In the fine-tuning stage, the learned representations are used to initialize the target semantic segmentation network. The network is then fine-tuned with pixel-wise annotations. It has been shown that, without large-scale class-labeled pre-training, semantic image segmentation can still achieve encouraging performance compared to random initialization or from-scratch training.

Though promising, the performance of self-supervised learning is still far from that achieved by supervised pre-training. For instance, a VGG-16 network trained with the self-supervised method of [Larsson, Maire, and Shakhnarovich2017] achieves 56.0% mean Intersection over Union (mIoU) on the PASCAL VOC 2012 segmentation benchmark [Everingham et al.2010], higher than a randomly initialized network that only yields 35.0% mIoU. However, an identical network pre-trained on ImageNet achieves 64.2% mIoU. There remains a considerable gap between self-supervised and purely supervised pre-training.

Figure 1: (a) shows samples of patches from the categories 'bus' and 'car'; these two categories have similar color distributions but different patch statistics. (b) depicts the deep feature distributions of 'bus' and 'car', before and after mix-and-match, visualized with t-SNE [Maaten and Hinton2008]. Best viewed in color.

We believe that the performance discrepancy is mainly caused by the semantic gap between the proxy task and the target task. Take learning by colorization as an example: the goal of the proxy task is to colorize gray-scale images. The representations learned from colorization may be well-suited for modeling color distributions, but are likely weak at discriminating high-level semantics. For instance, as shown in Fig. 1(a), a red car can appear more similar to a red bus than to a blue car. The features of the car and bus classes are heavily overlapped, as depicted by the feature embedding in the left plot of Fig. 1(b).

Improving the performance of self-supervised image segmentation requires one to improve the discriminative power of representations tailored to the target task. This goal is non-trivial: the target's pixel-wise annotations are discriminative, but they are usually available only in small quantities, typically thousands of labeled images. Existing approaches typically use a pixel-wise softmax loss to exploit pixel-wise annotations for fine-tuning a network. This strategy may be sufficient for a network that is well-initialized by supervised pre-training, but could prove inadequate for a self-supervised network whose features are weak. We argue that the pixel-wise softmax loss is not the sole way of harnessing the information provided by pixel-wise annotations.

In this study, we present a new learning strategy called 'mix-and-match' (M&M), which helps harness the scarce labeled information of a target set to improve the performance of networks pre-trained by self-supervised learning. M&M learning is conducted after the proxy stage and before the usual target fine-tuning stage, serving as an intermediate step that bridges the gap between the proxy and target tasks. It is noteworthy that M&M only uses the target images and their labels, so no additional annotation is required.

The essence of M&M is inspired by metric learning. In the 'mix' step, we randomly sample a large number of local patches from the target set and mix them together. The patch set is formed across images, thus decoupling any intra-image dependency to faithfully reflect the diverse and rich target distribution. Extracting patches also allows us to generate a massive number of triplets from the small target image set to produce stable gradients for training our network. In the 'match' step, we form a graph whose nodes are patches represented by their deep features. An edge between nodes is defined as attractive if the nodes share the same class label; otherwise, it is a rejective edge. We enforce a class-wise connected graph, that is, all nodes from the same class compose a connected subgraph, as shown in Fig. 3(c). This ensures global consistency in triplet selection coherent with the class labels. With the graph, we can derive a robust triplet loss that encourages the network to map each patch to a point in feature space such that patches belonging to the same class lie close together while patches of different classes are separated by a wide margin. The way we sample triplets from a class-wise connected graph differs significantly from the existing approach [Schroff, Kalenichenko, and Philbin2015], which forms multiple disconnected subgraphs for each class.

We summarize our contributions as follows. 1) We formulate a novel 'mix-and-match' tuning method which, for the first time, allows networks pre-trained with self-supervised learning to outperform their supervised counterparts. Specifically, with VGG-16 as the backbone network and image colorization as the proxy task, our M&M method achieves 64.5% mIoU on the PASCAL VOC2012 dataset, outperforming the ImageNet pre-trained network that achieves 64.2% mIoU. Our method also obtains 66.4% mIoU on the CityScapes dataset, comparable to the 67.9% mIoU achieved by an ImageNet pre-trained network. This improvement is significant considering that our approach is based on unsupervised pre-training. 2) Apart from the learning-by-colorization method, M&M also improves the learning-by-context method [Noroozi and Favaro2016] by a large margin. 3) Starting from random initialization, our method achieves significant improvements with both AlexNet and VGG-16, on both PASCAL VOC2012 and CityScapes, making it possible to train semantic segmentation from scratch. 4) In addition to the new notion of mix-and-match, we also present a triplet selection mechanism based on a class-wise connected graph, which is more robust than the conventional selection scheme for our task.

Related Work

Self-supervision. It is a standard and established practice to pre-train a deep network with large-scale class-labeled images (e.g., ImageNet) before fine-tuning the model for other visual tasks. Recent research efforts are gearing towards reducing the degree of, or eliminating, supervised pre-training altogether. Among various alternatives, self-supervised learning is gaining substantial interest. To enable self-supervised learning, proxy tasks are designed so that meaningful representations can be induced from the problem-solving process. Popular proxy tasks include sample reconstruction [Pathak et al.2016b], temporal correlation [Wang and Gupta2015, Pathak et al.2016a], learning by context [Doersch, Gupta, and Efros2015, Noroozi and Favaro2016], cross-transform correlation [Dosovitskiy et al.2015] and learning by colorization [Zhang, Isola, and Efros2016a, Zhang, Isola, and Efros2016b, Larsson, Maire, and Shakhnarovich2016, Larsson, Maire, and Shakhnarovich2017]. In this study, we do not design a new proxy task, but present an approach that uplifts the discriminative power of a self-supervised network tailored to the image segmentation task. We demonstrate the effectiveness of M&M on learning by colorization and learning by context.

Weakly-supervised segmentation. There exists a rich body of literature investigating approaches for reducing annotations when learning deep models for image segmentation. Alternative annotations such as points [Bearman et al.2016], bounding boxes [Dai, He, and Sun2015, Papandreou et al.2015], scribbles [Lin et al.2016] and video [Hong et al.2017] have been explored as "cheap" supervision to replace the pixel-wise counterpart. Note that these methods still require ImageNet classification as a pre-training task. Self-supervised learning is more challenging in that no image-level supervision is provided in the pre-training stage. The proposed M&M approach is dedicated to improving the weak representations learned by self-supervised pre-training.

Graph-based segmentation. Graph-based image segmentation [Felzenszwalb and Huttenlocher2004] has been explored since the early years of computer vision. The main idea is to exploit dependencies between pixels. Different from the conventional graph over pixels or superpixels in a single image, the proposed method defines its graph on image patches sampled from multiple images. We do not partition images by performing cuts on a graph; instead, we use the graph to select triplets for the proposed discriminative loss.

Mix-and-Match Tuning

Figure 2 illustrates the proposed approach, where (a) and (c) depict the conventional stages for self-supervised semantic image segmentation, while (b) shows the proposed 'mix-and-match' (M&M) tuning. Specifically, in (a), a proxy task, e.g., learning by colorization, is designed to pre-train the CNN using unlabeled images. In (c), the pre-trained CNN is fine-tuned on images and the associated per-pixel label maps of a target task. This work inserts M&M tuning between the proxy task and the target task, as shown in (b). It is noteworthy that M&M uses the same target images and label maps as in (c), hence no additional data is required. As the name implies, M&M tuning consists of two steps, namely 'mix' and 'match'. We explain these steps as follows.

The Mix Step – Patch Sampling

Recall that our goal is to better harness the information in the pixel-wise annotations of the target set. Image patches have long been considered a strong visual primitive [Singh, Gupta, and Efros2012] that incorporates both appearance and structure information, and they have been successfully applied to various tasks in visual understanding [Li, Wu, and Tu2013]. Inspired by these pioneering works, the first step of M&M tuning is designed to be a 'mix' step that samples patches across images. The relations between these patches can then be exploited for optimization in the subsequent 'match' operation.

More precisely, a large number of image patches with various spatial sizes are randomly sampled from a batch of images. Heavily overlapped patches are discarded. These patches are represented using features extracted from the CNN pre-trained in the stage of Fig. 2(a), and are assigned unique class labels based on the corresponding label maps. The patches across all images are mixed to decouple any intra-image dependency so as to reflect the diverse and rich target distribution. The mixed patches are subsequently used as the input for the 'match' operation.
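The 'mix' step described above can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the function names, patch-size range, overlap threshold, and the use of IoU to decide "heavily overlapped" are all assumptions made for the sketch; labels are taken from the patch's center pixel in the label map.

```python
import random

def sample_mixed_patches(label_maps, patches_per_image=10, max_overlap=0.5,
                         min_size=16, max_size=32, seed=0):
    """Sample square patches from a batch of label maps and mix them together.

    Each patch is labeled by the class of its center pixel; patches that
    heavily overlap an already-accepted patch in the same image are discarded.
    Returns a shuffled list of (image_index, (y, x, size), class_label).
    """
    rng = random.Random(seed)
    mixed = []
    for idx, lmap in enumerate(label_maps):
        h, w = len(lmap), len(lmap[0])
        kept, attempts = [], 0
        while len(kept) < patches_per_image and attempts < 100:
            attempts += 1
            s = rng.randint(min_size, min(max_size, h, w))
            y = rng.randint(0, h - s)
            x = rng.randint(0, w - s)
            box = (y, x, s)
            if any(_iou(box, b) > max_overlap for (_, b, _) in kept):
                continue  # too much overlap with an accepted patch
            label = lmap[y + s // 2][x + s // 2]  # center-pixel class
            kept.append((idx, box, label))
        mixed.extend(kept)
    rng.shuffle(mixed)  # 'mix': break intra-image ordering across the batch
    return mixed

def _iou(b1, b2):
    """Intersection-over-union of two axis-aligned square boxes (y, x, size)."""
    (y1, x1, s1), (y2, x2, s2) = b1, b2
    iy = max(0, min(y1 + s1, y2 + s2) - max(y1, y2))
    ix = max(0, min(x1 + s1, x2 + s2) - max(x1, x2))
    inter = iy * ix
    return inter / (s1 * s1 + s2 * s2 - inter)
```

In the actual pipeline each sampled patch would additionally be resized and passed through the pre-trained CNN to obtain its feature; here only the sampling and mixing logic is shown.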

Figure 2: An overview of the mix-and-match approach. Our approach starts with a self-supervised proxy task (a), and uses the learned CNN parameters to initialize the CNN in mix-and-match tuning (b). Given an image batch with label maps of the target task, we select and mix image patches and then match them according to their classes via a class-wise connected graph. The matching gives rise to a triplet loss, which can be optimized to tune the parameters of the network via back propagation. Finally, the modified CNN parameters are further fine-tuned to the target task (c).

The Match Step – Perceptual Patch Graph

Our next goal is to exploit the patches to generate stable gradients for tuning the network. This is possible since the patches belong to different classes, and this relation can be employed to form a massive number of triplets. A triplet is denoted as (x_a, x_p, x_n), where x_a is an anchor patch, x_p is a positive patch that shares the same label as x_a, and x_n is a negative patch with a different class label. With the triplets, one can formulate a discriminative triplet loss for fine-tuning the network.

A conventional way of sampling triplets follows the notion of [Schroff, Kalenichenko, and Philbin2015]. For convenience, we call this strategy 'random triplets'. In this strategy, triplets are randomly picked from the input batch. For instance, as shown in Fig. 3(a), two positive nodes each form a triplet with an arbitrary negative patch, yet there is no positive connection between the two positive nodes, despite their sharing a common class label. While the distance within each triplet is locally optimized, the boundary of the positive class can be loose, since the global constraint (i.e., all nodes of the same class must lie close together) is not enforced. We term this phenomenon global inconsistency. Empirically, we found that this approach tends to perform worse than the proposed method, which we introduce next.

The proposed 'match' step draws triplets differently from the conventional approach [Schroff, Kalenichenko, and Philbin2015]. In particular, the 'match' step begins with graph construction based on the mixed patches. For each CNN learning iteration, we construct a graph on-the-fly given a batch of input images. The nodes of the graph are patches. Two types of edges are defined between nodes: a) "attractive" if two nodes have an identical class label, and b) "rejective" if two nodes have different class labels. Different from [Schroff, Kalenichenko, and Philbin2015], we enforce the graph to be connected and, importantly, class-wise connected. That is, all nodes from the same class compose a connected subgraph via "attractive" edges. We adopt an iterative strategy to create such a graph. At first, the graph is initialized to be empty. Then, as shown in Fig. 3(b), patches are absorbed individually into the graph as nodes, and each new node creates one "attractive" and one "rejective" edge with existing nodes in the graph.

An example of an established graph is shown in Fig. 3(c). Considering the same positive nodes again, unlike with 'random triplets', they form a connected subgraph. Different classes, represented by green and pink nodes, also form coherent clusters based on their respective classes, imposing tighter constraints than random triplets. To fully realize such class-wise constraints, each node in the graph takes a turn serving as an anchor for loss optimization. An added benefit of permitting all nodes as possible anchor candidates is the improved utilization of patch relations compared with random triplets.
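The iterative construction above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' code: here each newly absorbed node links to the most recently seen node of the same class (its attractive edge) and of a different class (its rejective edge); chaining same-class nodes in this way is what guarantees each class forms a connected subgraph.

```python
def build_classwise_graph(labels):
    """Iteratively absorb nodes into the patch graph.

    Node i gains one attractive edge to an earlier same-class node (if any)
    and one rejective edge to an earlier different-class node (if any), so
    the nodes of each class end up forming a connected subgraph.
    Returns (attractive_edges, rejective_edges) as lists of (j, i) pairs.
    """
    attractive, rejective = [], []
    for i, lab in enumerate(labels):
        same = [j for j in range(i) if labels[j] == lab]
        diff = [j for j in range(i) if labels[j] != lab]
        if same:
            attractive.append((same[-1], i))  # chain same-class nodes together
        if diff:
            rejective.append((diff[-1], i))
    return attractive, rejective
```

For example, with labels [0, 1, 0, 1, 0], the class-0 nodes 0, 2, 4 are chained by the attractive edges (0, 2) and (2, 4), so they form one connected component, unlike under the 'random triplets' strategy.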

Figure 3: This figure shows different strategies for drawing triplets. The colors of the nodes represent their labels. Blue and red edges denote attractive and rejective edges, respectively. (a) depicts the random triplet strategy [Schroff, Kalenichenko, and Philbin2015], where nodes from the same class do not necessarily form a connected subgraph. (b-i) and (b-ii) show the proposed triplet selection strategy. A class-wise connected graph is constructed to sample triplets, which enforces tighter constraints on the positive class boundary. Details are explained in the main text of the methodology section. Best viewed in color.

The Tuning Loss

Loss function. To optimize the semantic consistency within the graph, for any two nodes connected by an attractive edge we seek to minimize their distance in the feature space, and for any two nodes connected by a rejective edge the distance should be maximized. Consider a node that connects to two other nodes via an attractive and a rejective edge: we denote it as an "anchor", while the two connected nodes are denoted as "positive" and "negative", respectively. These three nodes are grouped into a "triplet".

When constructing the graph, we ensure that each node can serve as an "anchor", except for those nodes whose labels are unique among all the nodes. Thus, the number of triplets equals the number of nodes. Assume that in each iteration we discover N triplets in the graph. By converting the graph optimization problem into "triplet ranking" [Schroff, Kalenichenko, and Philbin2015], we formulate our loss function as follows:

L = \sum_{i=1}^{N} \max\left(0,\; D(x_a^i, x_p^i) - D(x_a^i, x_n^i) + \alpha\right),

where x_a, x_p, x_n denote the "anchor", "positive", and "negative" nodes in a triplet, \alpha is a regularization factor controlling the distance margin, and D is a distance metric measuring patch relationships.
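A hinge-style triplet ranking loss of this form can be sketched with numpy as below. The normalization and squared-Euclidean distance follow the perceptual-distance definition used in this paper; batching all triplets into arrays is an implementation choice of the sketch.

```python
import numpy as np

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    """Triplet ranking loss over L2-normalized features.

    anchor, positive, negative: (N, d) arrays, one row per triplet.
    Returns sum_i max(0, D(a_i, p_i) - D(a_i, n_i) + margin), where D is
    the squared Euclidean distance between L2-normalized feature vectors.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    a, p, n = map(normalize, (anchor, positive, negative))
    d_ap = np.sum((a - p) ** 2, axis=1)  # anchor-positive distances
    d_an = np.sum((a - n) ** 2, axis=1)  # anchor-negative distances
    return np.maximum(0.0, d_ap - d_an + margin).sum()
```

When the negative already lies far from the anchor (beyond the margin), the hinge clips the triplet's contribution to zero, so only violating triplets produce gradients.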

In this work, we leverage perceptual distance [Gatys, Ecker, and Bethge2015] to characterize the relationship between patches. This differs from previous works [Singh, Gupta, and Efros2012, Doersch et al.2012] that define patch distance using low-level cues (e.g., colors and edges). Specifically, the perceptual representation can be formulated as r = F(x), where F denotes a convolutional neural network (CNN) and r denotes the extracted representation. D(x_i, x_j) is the perceptual distance between two patches, which can be formulated as:

D(x_i, x_j) = \left\| \frac{r_i}{\|r_i\|_2} - \frac{r_j}{\|r_j\|_2} \right\|_2^2,

where r_i and r_j are the CNN representations extracted from patches x_i and x_j. L2 normalization is used here for calculating Euclidean distances.

By optimizing the “triplet ranking” loss, our perceptual patch graph converges to both intra-class and inter-class semantic consistency.

M&M implementation details. We use both AlexNet [Krizhevsky, Sutskever, and Hinton2012] and VGG-16 [Simonyan and Zisserman2015] as our backbone CNN architectures, as illustrated in Fig. 2. For initialization, we try random initialization and two proxy tasks: Jigsaw Puzzles [Noroozi and Favaro2016] and Colorization [Larsson, Maire, and Shakhnarovich2017]. From a batch of images in each CNN iteration, we sample patches per image with various sizes and resize them to a fixed size. We then extract the "pool5" features of these patches from the CNN for later use. Each patch's unique label is assigned as the label of its central pixel in the corresponding label map. We then perform the iterative strategy to construct the graph, as discussed in the methodology section. We make use of each node in the graph as an "anchor", which is made possible by our graph construction strategy. If a node's label is unique among all the nodes, we duplicate it as its "positive" counterpart. In this way, we obtain a batch of meaningful triplets whose number equals the number of nodes, and feed them into a triplet loss layer with a fixed margin. M&M tuning is conducted for 8000 iterations on the PASCAL VOC2012 or CityScapes training set. The learning rate is fixed until iteration 6000 and then dropped. We apply batch normalization to speed up convergence.
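The triplet assembly described above, where every node serves as an anchor and a node with a unique label is duplicated as its own positive, can be sketched as follows. The deterministic choice of "first same-class node" and "first different-class node" is a simplification for illustration; in practice these partners come from the graph's attractive and rejective edges.

```python
def make_triplets(labels):
    """Build one triplet (anchor, positive, negative) per node.

    Every node serves as an anchor; its positive is another node with the
    same label, or the node itself (duplicated) if its label is unique in
    the batch; its negative is any node with a different label. Nodes with
    no available negative are skipped.
    """
    triplets = []
    for i, lab in enumerate(labels):
        positives = [j for j in range(len(labels)) if j != i and labels[j] == lab]
        negatives = [j for j in range(len(labels)) if labels[j] != lab]
        if not negatives:
            continue  # no contrastive signal available for this node
        pos = positives[0] if positives else i  # duplicate node as its own positive
        triplets.append((i, pos, negatives[0]))
    return triplets
```

As long as every node has at least one negative, the number of triplets equals the number of nodes, matching the property stated in the loss-function section.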

Segmentation fine-tuning details. Finally, we fine-tune the CNN for the semantic segmentation task. For AlexNet, we follow the same setting as presented in [Noroozi and Favaro2016], and for VGG-16, we follow [Larsson, Maire, and Shakhnarovich2017], whose architecture is equipped with hyper-columns [Hariharan et al.2015]. The fine-tuning process runs for 40k iterations, with an initial learning rate of 0.01, dropped by a factor of 10 at iterations 24k and 36k. We keep tuning the batch normalization layers before "pool5". All experiments follow the same setting.


Experiments

Settings. Different proxy tasks are combined with our M&M tuning to demonstrate its merits. In our experiments, we use the released models of two proxy tasks as initialization: learning by context (Jigsaw Puzzles) [Noroozi and Favaro2016] and learning by colorization [Larsson, Maire, and Shakhnarovich2017]. Both methods use the 1.3 million unlabeled images of the ImageNet dataset [Deng et al.2009] for training. In addition, we also perform experiments on randomly initialized networks. For M&M tuning, we use the PASCAL VOC2012 dataset [Everingham et al.2010], which consists of 10,582 training samples with pixel-wise annotations. The same dataset is used in [Noroozi and Favaro2016, Larsson, Maire, and Shakhnarovich2017] for fine-tuning, so no additional data is used in M&M. For fair comparisons, all self-supervision methods are benchmarked on the PASCAL VOC2012 validation set, which comes with 1,449 images. We show the benefits of M&M tuning on different backbone networks, including AlexNet and VGG-16. To demonstrate the generalization ability of our learned model, we also report the performance of our VGG-16 full model on the PASCAL VOC2012 test set. We further apply our method to the CityScapes dataset [Cordts et al.2016], with 2,974 training samples, and report results on the 500 validation samples. All results are reported in mean Intersection over Union (mIoU), the standard evaluation criterion for semantic segmentation.


Overall. Existing self-supervision works report segmentation results on the PASCAL VOC2012 dataset. The highest performance among existing self-supervision methods is attained by learning by colorization [Larsson, Maire, and Shakhnarovich2017], which achieves 38.4% mIoU and 56.0% mIoU with AlexNet and VGG-16 as the backbone network, respectively. We therefore adopt learning by colorization as our proxy task here. With M&M tuning, we boost the performance to 42.8% mIoU and 64.5% mIoU with AlexNet and VGG-16, respectively. As shown in Table 1, our method achieves state-of-the-art performance on semantic segmentation, outperforming [Larsson, Maire, and Shakhnarovich2016] by 14.3% and [Larsson, Maire, and Shakhnarovich2017] by 8.5% when using VGG-16 as the backbone network. Notably, our M&M self-supervision paradigm shows comparable results (a 0.3-point advantage) to its ImageNet pre-trained counterpart. Furthermore, on the PASCAL VOC2012 test set, our approach achieves 64.3% mIoU, a record-breaking performance for self-supervision methods. Qualitative results of this model are shown in Fig. 6.

We additionally perform an ablation study in the AlexNet setting. As shown in Table 1, with the colorization task as pre-training, our class-wise connected graph outperforms 'random triplets' by 1.9%, suggesting the importance of the class-wise connected graph. With jigsaw-puzzle solving as the pre-training task, our model performs even better than with colorization pre-training.

Method Arch. mIoU (%)
ImageNet VGG-16 64.2
Random VGG-16 35.0
Larsson et al. [Larsson, Maire, and Shakhnarovich2016] VGG-16 50.2
Larsson et al. [Larsson, Maire, and Shakhnarovich2017] VGG-16 56.0
Ours (M&M + Graph, colorization pre-trained) VGG-16 64.5
ImageNet AlexNet 48.0
Random AlexNet 23.5
k-means [Krähenbühl et al.2015] AlexNet 32.6
Pathak et al. [Pathak et al.2016b] AlexNet 29.7
Donahue et al. [Donahue, Krähenbühl, and Darrell2016] AlexNet 35.2
Zhang et al. [Zhang, Isola, and Efros2016a] AlexNet 35.6
Zhang et al. [Zhang, Isola, and Efros2016b] AlexNet 36.0
Noroozi et al. [Noroozi and Favaro2016] AlexNet 37.6
Larsson et al. [Larsson, Maire, and Shakhnarovich2017] AlexNet 38.4
Ours (M&M + Random Triplets, colorization pre-trained) AlexNet 40.9
Ours (M&M + Graph, colorization pre-trained) AlexNet 42.8
Ours (M&M + Graph, jigsaw-puzzles pre-trained) AlexNet 44.5
Table 1: We test our model on PASCAL VOC2012 validation set, which is the generally accepted benchmark for semantic segmentation with self-supervised pre-training. Our method achieves the state-of-the-art with both VGG-16 and AlexNet architectures.
Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mIoU
ImageNet 81.7 37.4 73.3 55.8 59.6 82.4 74.7 82.4 30.8 60.3 46.1 71.4 65.3 72.6 76.7 49.7 70.6 34.2 72.7 60.2 64.2
Colorization 73.6 28.5 67.5 55.5 50.2 78.3 66.1 78.3 26.8 60.8 50.6 70.6 64.9 62.2 73.5 38.2 66.8 38.8 68.1 55.1 60.2
M&M 83.1 37.0 69.6 56.1 62.9 84.4 76.4 82.8 33.4 61.5 44.7 67.3 68.5 68.0 78.5 42.2 72.7 37.2 75.7 58.6 64.5
M&M + ImageNet 84.5 39.4 76.3 60.3 64.6 85.4 77.7 84.1 35.6 63.6 50.4 70.6 72.0 73.6 80.1 50.2 73.7 37.6 77.8 66.6 67.4
Table 2: Per-class segmentation results on PASCAL VOC2012 val. The last row shows the additional results of our model combined with the ImageNet pre-trained model by averaging their prediction probabilities. The results suggest the complementary nature of our self-supervised method and the ImageNet pre-trained model.

Figure 4: Feature distributions with and without the proposed mix-and-match (M&M) tuning. We use 17,684 patches obtained from the PASCAL VOC2012 validation set to extract features, and map the high-dimensional features to a 2-D space with t-SNE, along with their categories. For clarity, we split the 20 classes into four groups in order. The first row shows the feature distribution of a naively fine-tuned model without M&M, and the second row depicts the feature distribution of a model additionally tuned with M&M. Note that the features are extracted from CNNs that have been fine-tuned for the segmentation task; in this case, both CNNs have seen the identical amount of data and labels. Best viewed in color.

Per-class results. We analyze the per-class results of M&M tuning on the PASCAL VOC2012 validation set, summarized in Table 2. Comparing our method with the baseline model that uses colorization as pre-training (we obtain higher performance than originally reported when fine-tuning from the released pre-training model of [Larsson, Maire, and Shakhnarovich2017]), our approach demonstrates significant improvements in classes including aeroplane, bike, bottle, bus, car, chair, motorbike, sheep, and train. A further attempt at combining our self-supervised model and the fully-supervised model (through averaging their predictions) leads to an even higher mIoU of 67.4%. The results suggest that self-supervision serves as a strong candidate complementary to the current fully-supervised paradigm.

Applicability to different proxy tasks. Besides colorization [Larsson, Maire, and Shakhnarovich2017], we also explore the possibility of using Jigsaw Puzzles [Noroozi and Favaro2016] as our proxy task. Similarly, our M&M tuning boosts the segmentation performance from 36.5% mIoU (we use the released pre-training model of Jigsaw Puzzles [Noroozi and Favaro2016] for fine-tuning and obtain a slightly lower baseline than the 37.6% mIoU reported in that paper) to 44.5% mIoU. The result suggests that the proposed approach is widely applicable to other self-supervision methods. Our method can also be applied to randomly initialized networks. On PASCAL VOC 2012, M&M tuning boosts the performance from 19.8% mIoU to 43.6% mIoU with AlexNet, and from 35.0% mIoU to 56.7% mIoU with VGG-16. The improvements of our method over different baselines on PASCAL VOC 2012 are shown in Table 3.

benchmark |              PASCAL VOC2012                  | CityScapes
pre-train | Random   Jigsaw   Colorize Random  Colorize  | Random  Colorize
backbone  | AlexNet  AlexNet  AlexNet  VGG-16  VGG-16    | VGG-16  VGG-16
baseline  | 19.8     36.5     38.4     35.0    60.2      | 42.5    57.5
M&M       | 43.6     44.5     42.8     56.7    64.5 (64.3) | 49.1  66.4 (65.6)
ImageNet  | 48.0     48.0     48.0     64.2    64.2      | 67.9    67.9
Table 3: Improvements of our method under different pre-training tasks: Random (Xavier initialization) with AlexNet and VGG-16, Jigsaw Puzzles [Noroozi and Favaro2016] with AlexNet, and Colorization [Larsson, Maire, and Shakhnarovich2017] with AlexNet and VGG-16. Baselines are produced with naive fine-tuning. ImageNet pre-trained results (per backbone) are regarded as the upper bound. Evaluations are conducted on the PASCAL VOC2012 and CityScapes validation sets; results on the test sets are shown in brackets.

Generalizability to CityScapes. We apply our method to the CityScapes dataset. With colorization as pre-training, naive fine-tuning yields 57.5% mIoU, and M&M tuning improves it to 66.4% mIoU. The result is comparable to the ImageNet pre-trained counterpart, which yields 67.9% mIoU. With a randomly initialized network, M&M brings a large improvement, from 42.5% mIoU to 49.1% mIoU. The comparison can be found in Table 3.

Further Analysis

Learned representations. To illustrate the representations learned through M&M tuning, we visualize the changes in sample distribution in the t-SNE embedding space. As shown in Fig. 4, after M&M tuning, samples from the same category tend to stay close while those from different categories are pulled apart. Notably, this effect is more pronounced for the categories aeroplane, bike, bottle, bus, car, chair, motorbike, sheep, train, and tv, which aligns with the per-class performance improvements listed in Table 2.

The effect of graph size. Here we investigate how self-supervision performance is influenced by the graph size (the number of nodes in the graph), which determines the number of triplets that can be discovered. Specifically, we vary the image batch size, and hence the number of nodes, as shown in Fig. 5. The comparative study is performed on AlexNet with learning by colorization [Larsson, Maire, and Shakhnarovich2017] as initialization. We observe the following. On the one hand, a larger graph leads to higher performance, since it brings more diverse samples for more accurate metric learning. On the other hand, a larger graph takes longer to process, since a larger batch of images is fed in each iteration.

Efficiency. The previous study suggests that a performance-speed trade-off can be achieved by adjusting the graph size. Nevertheless, our graph training process is very efficient: it completes within hours on a single TITAN-X for both AlexNet and VGG-16, much faster than conventional ImageNet pre-training or any other self-supervised pre-training task.

Failure cases. We also include some failure cases of our method, as shown in Fig. 7. The failed examples can be explained as follows. First, patches sampled from thin objects may fail to capture the key characteristics of the object due to background clutter, so the boat in the figure ends up as a false negative. Second, our M&M tuning method inherits some biases from its base model (i.e., the colorization model), which accounts for the case in the figure where the dog is falsely classified as a cat.

Figure 5: The figure shows that a larger graph brings better performance, but costs a longer time in each iteration. We train the model with the same hyper-parameters for different settings and test on PASCAL VOC2012 validation set.
Figure 6: Visual comparison on PASCAL VOC2012 validation set (top 4 rows) and CityScapes validation set (bottom 3 rows). (a) Image. (b) Ground Truth. (c) Results with ImageNet supervised pre-training. (d) Results with colorization pre-training. (e) Our results.
Figure 7: Our failure cases. (a) Image. (b) Ground Truth. (c) Results with ImageNet supervised pre-training. (d) Results with colorization pre-training. (e) Our results.


We have presented a novel ‘mix-and-match’ (M&M) tuning method for improving self-supervised learning on the semantic image segmentation task. Our approach effectively exploits mixed image patches to form a class-wise connected graph, from which triplets can be sampled to compute a discriminative loss for M&M tuning. The approach not only improves self-supervised semantic segmentation across different proxy tasks, backbone CNNs, and benchmarks, achieving state-of-the-art results, but also outperforms its ImageNet pre-trained counterpart for the first time in the literature, shedding light on the enormous potential of self-supervised learning. M&M tuning can potentially be applied to various other tasks and is worth further exploration. Future work will focus on the essence and advantages of multi-step optimization schemes such as M&M tuning.

Acknowledgement. This work is supported by SenseTime Group Limited and the General Research Fund sponsored by the Research Grants Council of the Hong Kong SAR (CUHK 14241716, 14224316, 14209217).


  • [Bearman et al.2016] Bearman, A.; Russakovsky, O.; Ferrari, V.; and Fei-Fei, L. 2016. What’s the point: Semantic segmentation with point supervision. In ECCV.
  • [Cordts et al.2016] Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The Cityscapes dataset for semantic urban scene understanding. In CVPR.
  • [Dai, He, and Sun2015] Dai, J.; He, K.; and Sun, J. 2015. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV.
  • [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.
  • [Doersch et al.2012] Doersch, C.; Singh, S.; Gupta, A.; Sivic, J.; and Efros, A. 2012. What makes Paris look like Paris? ACM Transactions on Graphics 31(4).
  • [Doersch, Gupta, and Efros2015] Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsupervised visual representation learning by context prediction. In ICCV.
  • [Donahue, Krähenbühl, and Darrell2016] Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv:1605.09782.
  • [Dosovitskiy et al.2015] Dosovitskiy, A.; Fischer, P.; Springenberg, J.; Riedmiller, M.; and Brox, T. 2015. Discriminative unsupervised feature learning with exemplar convolutional neural networks. arXiv:1506.02753.
  • [Everingham et al.2010] Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The PASCAL visual object classes (VOC) challenge. IJCV 88(2):303–338.
  • [Felzenszwalb and Huttenlocher2004] Felzenszwalb, P. F., and Huttenlocher, D. P. 2004. Efficient graph-based image segmentation. IJCV 59(2):167–181.
  • [Gatys, Ecker, and Bethge2015] Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2015. A neural algorithm of artistic style. arXiv:1508.06576.
  • [Geiger et al.2013] Geiger, A.; Lenz, P.; Stiller, C.; and Urtasun, R. 2013. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32(11):1231–1237.
  • [Hariharan et al.2015] Hariharan, B.; Arbeláez, P.; Girshick, R.; and Malik, J. 2015. Hypercolumns for object segmentation and fine-grained localization. In CVPR.
  • [Hong et al.2017] Hong, S.; Yeo, D.; Kwak, S.; Lee, H.; and Han, B. 2017. Weakly supervised semantic segmentation using web-crawled videos. arXiv:1701.00352.
  • [Krähenbühl et al.2015] Krähenbühl, P.; Doersch, C.; Donahue, J.; and Darrell, T. 2015. Data-dependent initializations of convolutional neural networks. arXiv:1511.06856.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS.
  • [Larsson, Maire, and Shakhnarovich2016] Larsson, G.; Maire, M.; and Shakhnarovich, G. 2016. Learning representations for automatic colorization. In ECCV.
  • [Larsson, Maire, and Shakhnarovich2017] Larsson, G.; Maire, M.; and Shakhnarovich, G. 2017. Colorization as a proxy task for visual understanding. arXiv:1703.04044.
  • [Li et al.2017a] Li, X.; Liu, Z.; Luo, P.; Loy, C. C.; and Tang, X. 2017a. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In CVPR.
  • [Li et al.2017b] Li, X.; Qi, Y.; Wang, Z.; Chen, K.; Liu, Z.; Shi, J.; Luo, P.; Tang, X.; and Loy, C. C. 2017b. Video object segmentation with re-identification. In CVPRW.
  • [Li, Socher, and Fei-Fei2009] Li, L.-J.; Socher, R.; and Fei-Fei, L. 2009. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In CVPR.
  • [Li, Wu, and Tu2013] Li, Q.; Wu, J.; and Tu, Z. 2013. Harvesting mid-level visual concepts from large-scale internet images. In CVPR.
  • [Lin et al.2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV.
  • [Lin et al.2016] Lin, D.; Dai, J.; Jia, J.; He, K.; and Sun, J. 2016. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR.
  • [Liu et al.2015] Liu, Z.; Li, X.; Luo, P.; Loy, C.-C.; and Tang, X. 2015. Semantic image segmentation via deep parsing network. In ICCV.
  • [Liu et al.2017] Liu, Z.; Li, X.; Luo, P.; Loy, C. C.; and Tang, X. 2017. Deep learning markov random field for semantic segmentation. TPAMI.
  • [Long, Shelhamer, and Darrell2015] Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR.
  • [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-sne. JMLR 9:2579–2605.
  • [Noroozi and Favaro2016] Noroozi, M., and Favaro, P. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV.
  • [Papandreou et al.2015] Papandreou, G.; Chen, L.-C.; Murphy, K. P.; and Yuille, A. L. 2015. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV.
  • [Pathak et al.2016a] Pathak, D.; Girshick, R.; Dollár, P.; Darrell, T.; and Hariharan, B. 2016a. Learning features by watching objects move. arXiv:1612.06370.
  • [Pathak et al.2016b] Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; and Efros, A. A. 2016b. Context encoders: Feature learning by inpainting. In CVPR.
  • [Russakovsky et al.2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. IJCV 115(3):211–252.
  • [Schroff, Kalenichenko, and Philbin2015] Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. Facenet: A unified embedding for face recognition and clustering. In CVPR.
  • [Simonyan and Zisserman2015] Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
  • [Singh, Gupta, and Efros2012] Singh, S.; Gupta, A.; and Efros, A. 2012. Unsupervised discovery of mid-level discriminative patches. In ECCV.
  • [Wang and Gupta2015] Wang, X., and Gupta, A. 2015. Unsupervised learning of visual representations using videos. In ICCV.
  • [Zhang, Isola, and Efros2016a] Zhang, R.; Isola, P.; and Efros, A. A. 2016a. Colorful image colorization. In ECCV.
  • [Zhang, Isola, and Efros2016b] Zhang, R.; Isola, P.; and Efros, A. A. 2016b. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR.
  • [Zhao et al.2017] Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2017. Pyramid scene parsing network. In CVPR.