Deep learning methods have been successfully applied to a wide range of computer vision tasks, e.g., image classification [15, 12], object detection, and semantic segmentation. These outstanding results are usually achieved by deep neural networks with large capacity. However, such large and deep networks [12, 25] are difficult to deploy on edge devices with severe constraints on memory, computation, and power, especially for dense prediction tasks such as semantic segmentation or scene parsing. Knowledge distillation is a popular and effective technique that improves the performance of a smaller student network by forcing it to mimic the behaviors of a complicated teacher network, and is therefore promising for producing efficient, deployable models.
Existing knowledge distillation methods mostly target image classification, where a student network learns the knowledge represented by the teacher network, such as soft predictions, logits, or features in intermediate layers [23, 16, 5]. However, few works consider knowledge distillation for dense pixel-level prediction tasks, e.g., semantic segmentation. Directly extending current distillation methods to this complicated task brings only marginal improvements, since most of them provide image-level global knowledge, while pixel-wise knowledge is needed for semantic segmentation. Naively, intermediate convolutional neural network (CNN) features can be regarded as pixel-wise knowledge represented by fixed feature vectors. However, it is quite challenging for the student to learn exactly the same feature values at all pixel locations, due to the inherent capacity gap between the student and the teacher; as our experimental results show, only marginal improvements are obtained this way. Therefore, a new pixel-wise knowledge representation is needed to ease the distillation process for semantic segmentation.
In this paper, we propose to utilize pixel-wise feature similarities (PFS) between different pixel locations of the high-level CNN features as the medium for knowledge transfer. As shown in Figure 1, for each spatial location of the CNN feature maps, we compute a PFS map. After visiting all feature locations, we obtain the final knowledge representation, consisting of all $H \times W$ PFS maps. Compared to the naive CNN features, the proposed PFS has two advantages. First, PFS models the internal relationships between different image contents, which are invariant to the network architecture; in contrast, the high-level CNN feature values of the teacher network vary with the architecture. Second, PFS encodes more image context than the naive CNN feature vectors, which has been shown to be beneficial for semantic segmentation. To further improve distillation performance, we propose a new knowledge-gap guided distillation approach for soft prediction mimicking, which encourages more knowledge transfer at the pixels where large gaps exist between the two networks.
To summarize, our contributions include:
1. We propose a novel way of learning pixel-wise feature similarities for the knowledge distillation of semantic segmentation.
2. A novel knowledge-gap based distillation approach on the soft predictions is proposed to achieve weighted knowledge transfer across pixels, based on the student's necessity of being taught by the teacher network at each pixel.
3. Extensive analysis and evaluations of the proposed approach are provided on the challenging ADE20K, Pascal VOC 2012 and Pascal Context benchmarks, validating the superiority of the approach over conventional knowledge distillation methods and the current state of the art. Significant performance improvements are achieved by our newly proposed method.
2 Related Work
2.1 Semantic segmentation
Semantic segmentation, a fundamental problem in computer vision, has been studied extensively in the past several years. Currently, deep learning based methods occupy the best entries of several semantic segmentation benchmarks, e.g., Pascal VOC, ADE20K, and Pascal Context. The pioneering work addresses semantic segmentation with deep convolutional networks, presenting a conv-deconv architecture to achieve end-to-end segmentation with fully convolutional networks. Many follow-up works further enhance performance. For example, one line of work augments the features at each pixel location with global context features. [33, 6] introduce dilation into the convolution operation to enlarge the filters' receptive field, which proves beneficial for the segmentation task. Spatial pyramid pooling aggregates information from different global scales to help segmentation. DeepLab v3, a state-of-the-art segmentation model, introduces parallel dilated spatial pyramid pooling over image-level features to capture multi-scale information. Despite their superior performance, most of these models require large backbone networks or computationally expensive operations, making them too heavy to deploy on edge devices. Therefore, many recent works accelerate networks for practical applications; these are reviewed in Section 2.2.
2.2 Network acceleration
Neural network acceleration approaches can be roughly categorized into weight pruning [10, 26], weight decomposition [4, 20], weight quantization [9, 21] and knowledge distillation [13, 23, 14]. Weight pruning methods [10, 26] iteratively prune neurons or connections of low importance according to designed criteria. Weight decomposition [4, 20] converts dense weight matrices into a compact format, greatly reducing the number of parameters. Weight quantization [9, 21] reduces the precision of weights or features so that a low-precision model can be deployed at inference time. However, most weight quantization methods require specific hardware or implementations, and thus cannot fully capitalize on modern GPUs and existing deep learning frameworks.
Early work trains a compressed model to mimic the output of an ensemble of strong classifiers. A later approach introduces a temperature-scaled cross-entropy loss for soft prediction imitation during training. Another method regresses the logits (pre-softmax activations), where more information is maintained. FitNets extends soft prediction imitation to intermediate hidden layers and shows that mimicking these layers improves the training of deeper and thinner student networks. A teacher-student system with a function-preserving transform can be used to initialize the parameters of the student network. Other works compress a teacher network into an equally capable student network via reinforcement learning, define a novel information metric for knowledge transfer between the teacher and student networks, use attention maps summarized from CNN features to achieve knowledge transfer, or transfer a new kind of knowledge, neuron selectivity, from the teacher model to the student model with improved performance.
Beyond the above works that utilize knowledge distillation to improve student networks for image classification, some works extend knowledge distillation to more complex vision tasks, e.g., object detection [16, 5], face verification and alignment [28, 19], and pedestrian detection.
To our knowledge, only one existing work deals with the semantic segmentation task: its authors take the probability output and the segmentation boundaries as the transferred knowledge. In contrast, our proposed approach performs pixel-wise distillation, in which each pixel obtains knowledge of its similarities with respect to all other pixels.
3 Proposed method
In this section, we present the details of our proposed method. The feature similarity based distillation achieves pixel-wise mimicking of the image's spatial structure, as described in Section 3.1. The knowledge-gap based distillation realizes pixel-wise mimicking that accounts for each pixel's different need to be taught, as described in Section 3.2. Before starting the knowledge distillation process, we first train a teacher network on the given training set to reach its best performance. Formally, for the task of semantic segmentation, let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the dataset with $N$ samples. A pixel-wise multi-class cross-entropy loss with respect to the ground truth label map is employed to update the network. As shown in Figure 2, after the teacher network is well trained, it is kept fixed to provide knowledge targets to the student. The student network is then driven by the regular segmentation loss as well as the losses generated by the proposed distillation modules.
3.1 Feature similarity based knowledge distillation
Current knowledge distillation methods are designed mostly for image-level or proposal-level classification (e.g., object detection). In these works, the teacher's knowledge is summarized over the whole image or the region proposals and then transferred to the student. One naive way of adapting existing methods to semantic segmentation is to treat the pixel-wise high-dimensional CNN features encoded by the teacher as the knowledge transfer medium. However, only marginal improvements are achieved this way, as shown in our experiments in Section 4. To facilitate the distillation process for semantic segmentation, we instead propose to extract a new knowledge representation encoded by PFS.
Take the final convolutional output feature map of a ResNet as an example. We present two methods to extract the pixel-wise feature similarities, as shown in Figure 3: a simplified version of PFS (S-PFS), and a complicated version (C-PFS) that is partially similar to the non-local operation. Formally, we denote the convolutional features used for the computation of PFS as $X$, with size $[B, C, H, W]$. Here, $B$, $C$, $H$ and $W$ represent the batch size, channel number, height and width of the input feature map. For the calculation of S-PFS, the raw input $X$ is directly reshaped into tensors $X_1$ and $X_2$ with sizes $[B, HW, C]$ and $[B, C, HW]$, respectively. The two reshaped tensors are then multiplied, followed by a softmax operation, to generate the final overall S-PFS map $S$ with size $[B, HW, HW]$. The whole calculation process can be expressed as equation (1):

$$S = \mathrm{softmax}(A), \qquad A = X_1 X_2, \tag{1}$$

in which $A$ denotes the intermediate S-PFS logits before the softmax operation, which is applied along the last dimension.

As shown in Figure 3, the final output $S$ represents the similarities between any two feature locations. Specifically, we extract the knowledge representation for the $i$-th feature location as its PFS map $S_i$. For the calculation of C-PFS, we first apply two extra convolutional transformations $\theta$ and $\phi$ to the raw input to produce two new features $\theta(X)$ and $\phi(X)$; for $\theta$ and $\phi$, we use filters with kernel size 1 and a reduced output channel number. Then $\theta(X)$ and $\phi(X)$ are reshaped as above, and all subsequent computation is the same as for S-PFS. The loss for PFS based knowledge distillation can then be written as

$$\mathcal{L}_{PFS} = \frac{1}{M}\sum_{i=1}^{M}\big\| S_i^{T} - S_i^{S} \big\|_2^2, \tag{2}$$
where $M = HW$ is the total number of pixel locations of the input feature map, and $S^{T}$ and $S^{S}$ indicate the feature similarities extracted by the teacher and the student network, respectively. After obtaining the feature similarity map $S$, we calculate the final output feature $Y$ with equation (3):

$$Y = \gamma\,(S X_1) + X, \tag{3}$$

in which $S X_1$ is reshaped back to $[B, C, H, W]$ and $\gamma$ is a learnable parameter to match the scales of the two features.
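As a concrete illustration, the S-PFS map, the PFS distillation loss, and the residual feature aggregation can be sketched in NumPy as below. The function names and the loss normalization by the number of locations are our own assumptions, and `gamma` is shown as a fixed constant although it is learnable in the actual network.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def s_pfs(x):
    """x: feature map of shape [B, C, H, W] -> S-PFS map of shape [B, HW, HW]."""
    b, c, h, w = x.shape
    x1 = x.reshape(b, c, h * w).transpose(0, 2, 1)   # [B, HW, C]
    x2 = x.reshape(b, c, h * w)                      # [B, C, HW]
    logits = x1 @ x2                                 # pairwise location similarities
    return softmax(logits, axis=-1)

def pfs_distill_loss(feat_t, feat_s):
    """L2 distance between teacher and student PFS maps, averaged over locations."""
    s_t, s_s = s_pfs(feat_t), s_pfs(feat_s)
    m = s_t.shape[1]                                 # number of pixel locations
    return np.sum((s_t - s_s) ** 2) / m

def pfs_output(x, gamma=0.1):
    """Aggregate features with the PFS map and add a residual connection."""
    b, c, h, w = x.shape
    s = s_pfs(x)                                     # [B, HW, HW]
    x1 = x.reshape(b, c, h * w).transpose(0, 2, 1)   # [B, HW, C]
    y = (s @ x1).transpose(0, 2, 1).reshape(b, c, h, w)
    return gamma * y + x
```

Each row of the S-PFS map is a probability distribution over all spatial locations, so the student is asked to match distributions rather than raw feature values.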
3.2 Knowledge-gap based knowledge distillation
Beyond the PFS based knowledge distillation, we also propose a novel knowledge-gap based distillation approach for soft prediction mimicking. We first review conventional knowledge distillation on soft predictions. Let $T$ and $S$ be the teacher and student networks, with pre-softmax scores $a^{t}$ and $a^{s}$, respectively. With a temperature parameter $\tau$ used to soften the scores, the post-softmax predictions of the teacher and student are $P^{t} = \mathrm{softmax}(a^{t}/\tau)$ and $P^{s} = \mathrm{softmax}(a^{s}/\tau)$, respectively. The student is trained to minimize the sum of the cross-entropy with respect to the ground truth label and the cross-entropy with respect to the teacher's soft predictions, denoted $\mathcal{L}_{ce}$ and $\mathcal{L}_{kd}$ in equation (4):

$$\mathcal{L} = \mathcal{L}_{ce} + \lambda\,\mathcal{L}_{kd} = -\sum_{c=1}^{C} y^{c}\log P^{s,c} - \lambda \sum_{c=1}^{C} P^{t,c}\log P^{s,c}, \tag{4}$$

in which the superscript $c$ in $P^{t,c}$ and $P^{s,c}$ denotes the prediction on the $c$-th class among all $C$ classes, $y$ indicates the one-hot ground truth label over the $C$ categories, and $\lambda$ is a hyper-parameter balancing the two cross-entropies.

Directly extending the image-level soft prediction mimicking of equation (4) to the pixel level is a straightforward adaptation for dense semantic segmentation. However, it ignores each pixel's varying need to be taught by the teacher, as the knowledge gaps between the two networks can differ greatly across pixels: the predictions of the two networks may be similar on easy pixels, while differing greatly on difficult ones. To let each pixel in the student network be selectively taught by the teacher according to the knowledge gap between the two networks, we add a weight reflecting the pixel-level knowledge gap to the $\mathcal{L}_{kd}$ term:

$$\mathcal{L}_{gkd} = \sum_{i=1}^{M}\Big(-\sum_{c=1}^{C} y_i^{c}\log P_i^{s,c} - \lambda\, w_i \sum_{c=1}^{C} P_i^{t,c}\log P_i^{s,c}\Big), \qquad w_i = \max\big(P_i^{t,g_i} - P_i^{s,g_i},\, 0\big), \tag{5}$$

where the subscript $i$ denotes the $i$-th pixel among all $M$ pixels in the image, and the weight $w_i$ is the difference between the teacher's post-softmax prediction $P_i^{t,g_i}$ on the ground truth class $g_i$ of the $i$-th pixel and that of the student network, $P_i^{s,g_i}$. When the teacher produces a worse post-softmax prediction (i.e., a lower value) on the ground truth class than the student at a certain pixel, $w_i$ is set to 0 to avoid knowledge transfer from teacher to student at that pixel.

During training, the total loss to be minimized is the sum of $\mathcal{L}_{gkd}$ and $\mathcal{L}_{PFS}$, shown in equation (6):

$$\mathcal{L}_{total} = \mathcal{L}_{gkd} + \alpha\,\mathcal{L}_{PFS}, \tag{6}$$

in which $\alpha$ controls the contribution of the PFS based distillation loss to the optimization.
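A minimal NumPy sketch of the knowledge-gap weighted soft-prediction mimicking described in this section follows; the temperature `tau`, the balance weight `lam`, and the mean reduction over pixels are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def gap_weighted_kd_loss(logits_t, logits_s, labels, tau=1.0, lam=1.0):
    """logits_*: [M, C] pre-softmax scores for M pixels; labels: [M] class ids."""
    p_t = softmax(logits_t / tau)                   # softened teacher predictions
    p_s = softmax(logits_s / tau)
    idx = np.arange(len(labels))
    # w_i: teacher-student gap on the ground-truth class, clipped at 0,
    # so pixels where the teacher is worse transfer no knowledge
    w = np.maximum(p_t[idx, labels] - p_s[idx, labels], 0.0)
    ce = -np.log(p_s[idx, labels] + 1e-12)          # hard-label cross-entropy
    kd = -(p_t * np.log(p_s + 1e-12)).sum(axis=1)   # soft-label cross-entropy
    return (ce + lam * w * kd).mean()
```

When the teacher is less confident than the student on the ground-truth class at every pixel, the weights vanish and the loss reduces to the plain cross-entropy term.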
Table 1: Distillation results (mIoU, %) on Pascal VOC 2012 with ResNet18 as the student.

| Student-teacher pair | Vanilla student | Post-softmax mimic | Hint learning | Attention transfer | PFS distillation | Knowl.-gap post-softmax mimic | Final result | Vanilla teacher |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet18-ResNet101 S-PFS | 67.24 | 67.76 (+0.52) | 67.90 (+0.66) | 67.92 (+0.68) | 69.59 (+2.35) | 68.65 (+1.41) | 70.01 (+2.77) | 73.41 |
| ResNet18-ResNet101 C-PFS | 68.96 | 69.58 (+0.62) | 69.78 (+0.82) | 69.66 (+0.70) | 72.01 (+3.05) | 70.29 (+1.33) | 72.94 (+3.98) | 75.82 |
Table 2: Distillation results (mIoU, %) on Pascal VOC 2012 with ResNet34 as the student.

| Student-teacher pair | Vanilla student | PFS distillation | Knowl.-gap post-softmax mimic | Final result | Vanilla teacher |
| --- | --- | --- | --- | --- | --- |
| ResNet34-ResNet101 S-PFS | 69.80 | 71.81 (+2.01) | 70.83 (+1.03) | 72.04 (+2.24) | 73.41 |
| ResNet34-ResNet101 C-PFS | 71.72 | 74.27 (+2.55) | 72.87 (+1.15) | 74.89 (+3.17) | 75.82 |
4 Experiments

4.1 Datasets

The datasets used in the experiments are the challenging ADE20K, Pascal VOC 2012 and Pascal Context benchmarks. ADE20K is a scene parsing benchmark with 150 classes, containing 20,210 training images and 2,000 validation images for evaluation. The Pascal VOC 2012 segmentation benchmark contains 10,582 training images (including both the original and the extra labeled ones) and 1,449 validation images, which are used for testing in this paper. The Pascal Context dataset provides dense semantic labels for the whole scene, with 4,998 images for training and 5,105 for testing. Following prior work, we choose only the 59 most frequent classes in the dataset for evaluation.
4.2 Implementation details
In this work, both the teacher and student networks are dilated ResNet-FCN models adapted from the conventional ResNet models for semantic segmentation. Specifically, the dilated ResNet has two main modifications compared with the ResNet for image classification. First, the strides of the convolution filters with kernel size 3 in the first residual block of the 4th and 5th groups are changed from 2 to 1, so the output size is reduced to 1/8 of the original input image. Second, all convolution layers in the 4th and 5th groups of the ResNet are cast into dilated convolution layers with dilation rates of 2 and 4, respectively. The developed PFS extraction module is inserted into the 5th group of the backbone network and participates in segmentation as introduced in Figure 3.
Following previous works [35, 6], we employ the poly learning rate policy, in which the initial learning rate is multiplied by $(1 - \frac{iter}{iter_{max}})^{0.9}$ after each iteration. The base learning rate is set to 0.01 for all datasets. The balance weights $\lambda$ and $\alpha$ are set to 1 and a value that brings the corresponding losses to the same order of magnitude, respectively. A Stochastic Gradient Descent (SGD) optimizer updates the network for 60 epochs, with momentum and weight decay set to 0.9 and 0.0001, respectively. Standard data augmentation is used during training, such as random flipping and random scaling (scales between 0.5 and 1.5) of the input images. Mean IoU (mIoU) is used as the metric to evaluate our algorithm as well as the other methods.
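For concreteness, the poly learning-rate schedule and the mean-IoU metric can be sketched as follows; the decay power of 0.9 is common practice for this policy [35, 6] and is an assumption here.

```python
import numpy as np

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly policy: base_lr * (1 - iter/max_iter)^power, applied each iteration."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```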
4.3 Baseline methods
To compare with traditional knowledge distillation methods, we choose several strong distillation methods as baselines and quantitatively compare them with our proposed method.
Hint learning uses the output of an intermediate hidden layer of the teacher as target knowledge to improve the training and final performance of the student network. An L2 distance loss is typically used to close the gap between the outputs of the student's intermediate layers and the corresponding teacher layers. In our experiments, we choose the final residual block of the backbone network as the hint layer. Since the output channel dimensions differ (e.g., 512 for ResNet18/ResNet34 and 2048 for ResNet101), we add an extra convolutional layer to the student network to match the sizes of the two outputs.
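As a sketch, the hint-learning baseline with its channel-matching layer can be written as below, modeling the 1×1 convolution as a per-pixel linear map; the shapes and function names are illustrative assumptions.

```python
import numpy as np

def adapt_1x1(feat_s, w):
    """feat_s: [B, Cs, H, W]; w: [Ct, Cs] 1x1-conv weights -> [B, Ct, H, W]."""
    # A 1x1 convolution is a linear map over channels at each spatial position.
    return np.einsum('tc,bchw->bthw', w, feat_s)

def hint_loss(feat_t, feat_s, w):
    """Mean squared error between teacher features and adapted student features."""
    return np.mean((feat_t - adapt_1x1(feat_s, w)) ** 2)
```

In practice the adaptation weights `w` are trained jointly with the student so that the L2 loss can be minimized despite the channel mismatch.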
Zagoruyko et al. proposed to use attention as the knowledge to help train the student network. Specifically, they define attention by summing or maximizing the intermediate CNN features along the channel dimension. In contrast, we represent each pixel's knowledge by its similarities with all the other pixels. Here we use the averaged features generated by the final residual block as the learned knowledge for the distillation process.
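A sketch of the attention-transfer baseline follows; we show the sum-of-squares variant with L2-normalized attention maps, which is one common formulation and an assumption with respect to the exact variant used in our experiments.

```python
import numpy as np

def attention_map(feat):
    """feat: [B, C, H, W] -> normalized spatial attention of shape [B, H*W]."""
    a = (feat ** 2).sum(axis=1).reshape(feat.shape[0], -1)  # collapse channels
    return a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)

def at_loss(feat_t, feat_s):
    """L2 distance between teacher and student attention maps."""
    return np.mean((attention_map(feat_t) - attention_map(feat_s)) ** 2)
```

Because the channel dimension is collapsed, teacher and student features with different channel counts can be compared directly, unlike in hint learning.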
4.4 Quantitative results
4.4.1 Results on Pascal VOC 2012
Together with the above three strong baseline methods, we present the quantitative segmentation results on Pascal VOC 2012 in this section.
We first conduct experiments with the ResNet18-ResNet101 student-teacher pair under the S-PFS and C-PFS distillation modules, respectively. The results are shown in Table 1. Overall, our proposed method obtains much better results than all baseline methods, confirming the importance of modeling pixel-wise knowledge for distillation in semantic segmentation. Specifically, traditional knowledge distillation methods improve the mIoU by only around 0.7%, while our method improves the results of ResNet18 from 67.24%/68.96% to 69.59% (+2.35%)/72.01% (+3.05%) with the S-PFS/C-PFS distillation modules alone. The knowledge-gap based post-softmax mimic also doubles the improvements obtained by the traditional methods, confirming the necessity of paying more attention to the difficult pixels within the image. Combining the two proposed distillation modules yields significant improvements of 2.77% and 3.98% for the ResNet18 student with S-PFS and C-PFS, respectively.
With a different student
We further replace the student network with ResNet34 to check the effectiveness of our proposed method for different student networks. As shown in Table 2, the performance of the ResNet34 network improves by 2.01% and 2.55% with the S-PFS and C-PFS modules alone, and final significant improvements of 2.24% and 3.17% are achieved by our complete method, respectively.
With a stronger teacher
Since the C-PFS based ResNet101 achieves better segmentation performance, we examine in this part whether a stronger teacher could further help the student. Here we use the S-PFS based ResNet18 and ResNet34 as students and the C-PFS based ResNet101 as the teacher. As shown in Table 3, after distillation the results of ResNet18/ResNet34 improve from 67.24%/69.80% to 70.46% (+3.22%)/72.87% (+3.07%), respectively. With a stronger teacher, the performance of the student can be further enhanced.
Table 3: Distillation results (mIoU, %) on Pascal VOC 2012 with the stronger C-PFS based ResNet101 teacher.

| Student | Vanilla stu. | PFS distillation | Vanilla tea. |
| --- | --- | --- | --- |
| S-PFS ResNet18 | 67.24 | 70.46 (+3.22) | 75.82 |
| S-PFS ResNet34 | 69.80 | 72.87 (+3.07) | 75.82 |
Table 4: Distillation results (mIoU, %) on ADE20K with the C-PFS module.

| Student-teacher pair | Vanilla student | PFS distillation | Knowl.-gap post-softmax mimic | Final result | Vanilla teacher |
| --- | --- | --- | --- | --- | --- |
| ResNet18-ResNet101 | 33.43 | 36.58 (+3.13) | 34.74 (+1.31) | 37.42 (+3.99) | 40.63 |
| ResNet34-ResNet101 | 36.44 | 39.37 (+2.93) | 37.53 (+1.09) | 39.89 (+3.45) | 40.63 |
4.4.2 Results on ADE20K
We then list the distillation results on the ADE20K dataset with the C-PFS module in Table 4. Consistent improvements are achieved for both student-teacher pairs. Specifically, the mIoU improves from 33.43% to 37.42% (+3.99%) for ResNet18 and from 36.44% to 39.89% (+3.45%) for ResNet34. Notably, the final result of ResNet34 even approaches that of its teacher: the original teacher-student performance gap of 4.19% is reduced to only 0.74%, which clearly demonstrates the effectiveness of our proposed method.
|Datasets||Model||Vanilla student||Vanilla teacher||Tea.-stu. gap||Distilled stu.||Improvements||Speed(FPS)|
4.5 Qualitative results
Since our key idea is to represent the distillation knowledge for each pixel by modeling its similarities with all the other pixels, we qualitatively visualize in Figure 4 the PFS learned by the vanilla student (i.e., S-PFS based ResNet18), the distilled student, and the vanilla teacher network (i.e., C-PFS based ResNet101). Different pixels within the image are selected and marked with a red bold cross to inspect their PFS with respect to the whole image contents. The student network, given its limited model capacity and representation ability, struggles to capture the overall structural information of the image. After the proposed pixel-wise distillation, the PFS of the student network behaves similarly to that of its teacher, capturing broader, long-range image context for semantic segmentation. Figure 5 shows several example images from the Pascal VOC 2012 and ADE20K datasets, with the segmentation results generated by the vanilla student network and by the one enhanced with our newly proposed knowledge distillation method. The distilled student produces better segmentation outputs after learning the rich feature similarity information from the teacher, which further confirms the efficacy of our method.
4.6 Comparison with the state of the art
Apart from the knowledge distillation methods introduced in Section 4.3, we found only one prior work that tackles the specific distillation problem of semantic segmentation. That work uses the MobileNet1.0-DeepLabV2 and ResNet101-DeepLabV2 student-teacher pair to evaluate segmentation performance. For a fair comparison, we implement our method with the same network architectures and conduct experiments on both the Pascal VOC 2012 and Pascal Context datasets. In our experiments, we use an S-PFS based student and a C-PFS based teacher for distillation, and report the comparison in Table 5. Our method achieves consistently better results than the current state of the art. On the Pascal Context dataset, the S-PFS based MobileNet1.0 obtains a 3.1% improvement with our proposed method, whereas under a similar teacher-student performance gap the prior work achieves an improvement of only 1.9%, which clearly demonstrates the efficacy of our approach. On Pascal VOC 2012, the student's performance reaches 72.7% with a relative improvement of 3.6%, both surpassing the prior results of 69.6% and 2.3%. Our proposed S-PFS module adds only minor computation: inference speed drops slightly from 46.5 frames per second (FPS) to 43.8 FPS, while we achieve much higher mIoU improvements and still keep real-time segmentation speed.
5 Conclusion

In this paper, we propose a novel pixel-wise feature similarity based knowledge distillation method tailored to the task of semantic segmentation. Our newly designed PFS module endows each pixel with a specific knowledge representation modeled by its similarities with all the other pixels, which naturally reflects the inherent structural information of the input image and eases the distillation process. Furthermore, we propose a novel knowledge-gap based distillation approach on the soft predictions to achieve weighted knowledge transfer across pixels, based on the necessity of the student network to be taught by the teacher network at each pixel. Extensive evaluations on the challenging ADE20K, Pascal VOC 2012 and Pascal Context benchmarks strongly validate the efficacy of our approach to knowledge distillation for semantic segmentation.
- (2017) N2N learning: network to network compression via policy gradient reinforcement learning. arXiv preprint arXiv:1709.06030.
- (2014) Do deep nets really need to be deep? In NIPS.
- (2006) Model compression. In SIGKDD.
- (2016) An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678.
- (2017) Learning efficient object detection models with knowledge distillation. In NIPS.
- (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), pp. 834–848.
- (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
- (2015) Net2Net: accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641.
- (2014) Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115.
- (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
- (2011) Semantic contours from inverse detectors. In CVPR.
- (2016) Deep residual learning for image recognition. In CVPR.
- (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- (2017) Like what you like: knowledge distill via neuron selectivity transfer. arXiv preprint arXiv:1707.01219.
- (2012) ImageNet classification with deep convolutional neural networks. In NIPS.
- (2017) Mimicking very efficient network for object detection. In CVPR.
- (2015) ParseNet: looking wider to see better. arXiv preprint arXiv:1506.04579.
- (2015) Fully convolutional networks for semantic segmentation. In CVPR.
- (2016) Face model compression by distilling knowledge from neurons. In AAAI.
- (2015) Tensorizing neural networks. In NIPS.
- (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS.
- (2014) FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550.
- (2016) In teacher we trust: learning compressed models for pedestrian detection. arXiv preprint arXiv:1612.00478.
- Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI.
- (2016) Rethinking the Inception architecture for computer vision. In CVPR.
- (2016) Do deep convolutional nets really need to be deep and convolutional? arXiv preprint arXiv:1603.05691.
- (2017) Model distillation with knowledge transfer in face classification, alignment and verification. arXiv preprint arXiv:1709.02929.
- (2017) Non-local neural networks. arXiv preprint arXiv:1711.07971.
- (2018) Improving fast segmentation with teacher-student learning. In BMVC.
- Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks.
- A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In CVPR.
- (2017) Dilated residual networks. In CVPR.
- (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928.
- (2017) Pyramid scene parsing network. In CVPR.