1 Introduction
Semantic segmentation, which assigns a semantic label to every image pixel, is one of the central topics in computer vision. Recently, CNNs based on the encoder-decoder architecture [16, 4, 28, 2] have achieved striking performance on several segmentation benchmarks [19, 7, 5, 29, 1]. Generally, an encoder-decoder architecture consists of an encoder module, which extracts context information and gradually reduces the spatial resolution of the features to save computational cost, and a decoder module, which aggregates the information from the encoder and gradually recovers the spatial resolution of the dense segmentation map. Among these works, Long et al. first propose the Fully Convolutional Network (FCN) [16], which predicts the dense segmentation map with a skip architecture: features of different granularities in the encoder are upsampled and integrated in the decoder. FCN nevertheless still struggles to produce accurate object boundaries in the segmentation map. A similar idea appears in U-Net [23], which adds dense skip connections between the downsampling and upsampling modules of the same feature dimensions; as a result, boundary localization is improved but not fully resolved. Besides skip connections, Chen et al. propose the DeepLab models [2, 3, 4], which integrate the atrous spatial pyramid pooling (ASPP) module and the dense conditional random field (dense CRF) [13]. The former uses dilated convolutional layers with filters at multiple sampling rates to gather contextual information at various spatial resolutions, while the latter boosts the edge response at object boundaries. Moreover, Kirillov et al. [12] propose the PointRend module, which outputs an initial segmentation map and then adaptively refines it with extra point-wise predictions on the object boundaries.
Clearly, improving the edge response at object boundaries remains a challenge for existing segmentation networks. In the frequency domain, the edge response corresponds to the accuracy in the high-frequency region of the segmentation maps.
However, to save computational cost, most existing methods predict the dense segmentation map on a low-resolution image grid, which is then upsampled to the original image resolution [4, 16]. Such a low-resolution grid inevitably limits the edge response of the segmentation networks.
It is unclear whether these low-resolution grids are sufficient to capture the information in the segmentation maps.
On the other hand, it is also important to understand how these networks learn semantics from the ground-truth annotation in the frequency domain.
In fact, these networks can learn sufficient semantic content from weak annotations, which keep only the locations or the coarse contours of objects [20, 11], and achieve accuracy comparable to that of networks trained on the ground-truth annotation.
These results indicate that:
(1) Most of the semantic content in segmentation maps can be learned in the low-frequency region. This further justifies the low-resolution grid used for the predicted segmentation maps in existing segmentation networks.
(2) Despite the inaccurate object boundaries in the weak annotation, the trained networks tolerate the annotation noise it induces and still learn the semantics.
In other words, the networks are insensitive to the noise associated with semantic edges. Such noise robustness helps the networks adapt to real-world datasets such as satellite and aerial images, for which accurate annotation is almost impossible at large scale.
For aerial images, annotation noise can come from various sources and thus appears at different scales [18].
Many works try to increase noise robustness by proposing noise-robust training objectives [27, 25, 26].
Yet they discuss neither the scale dependency of the annotation noise nor the response in the frequency domain.
Lastly, the general training framework for segmentation networks trains a network with a training objective, such as cross-entropy (CE), and evaluates it with the intersection-over-union (IoU) score. The correlation among the sampling frequency, CE, the IoU score, and the edge response remains unclear. Further investigation of the sampling efficiency and the frequency response of semantic segmentation networks is therefore needed for a clear evaluation of network performance.
For semantic segmentation, the response in the frequency domain is critical yet unclear, as summarized above. For simpler tasks such as 1D regression, it has been found that a network tends to learn the low-frequency component of the target signal in the early training stage [21, 22, 17]; this tendency is known as spectral bias [21]. Spectral bias indicates that low-frequency targets are easier for a network to learn, which is consistent with the trend in semantic segmentation. However, these works only study the regression of data with various frequencies, without considering the prior distribution of segmentation maps. For semantic segmentation, the frequency distribution of the signal in the segmentation maps should be considered. This motivates us to study the frequency distribution of the signal as well as its contribution to the training objective function and the evaluation metric.
In this work, we propose a spectral analysis that provides a theoretical point of view on the issues mentioned above. We analyze the common objective function and evaluation metric for semantic segmentation (i.e. CE and the IoU score, respectively), as well as the gradient propagation within CNNs, in the frequency domain. Our analysis yields the following novel discoveries:
The cross-entropy (CE) can be explicitly decomposed into a summation of frequency components. We can thus evaluate their contributions to CE and further estimate the sampling efficiency of the low-resolution grid for segmentation maps, including the segmentation output and the ground-truth annotation. We find that CE is mainly contributed by the low-frequency components of the segmentation maps.

The decomposition of CE inspires us to further deduce, in the frequency domain, the correlation between the segmentation logits and the features within CNNs. We discover that the segmentation logits at a specific frequency are mainly affected by the features at the same frequency.

Based on the two findings above, the high-frequency components of smooth features (e.g. those in the decoder of a CNN) are found to be negligible: our experiments show that truncating these high-frequency components does not interfere with the performance of semantic segmentation networks.

Frequency analysis of the IoU score reveals its close correlation with CE. This justifies the use of the IoU metric for evaluating segmentation networks optimized by the CE objective.
Our findings lead to the following two applications for semantic segmentation networks:
(1) Feature truncation for segmentation networks. The features in the decoder, which are generally assumed to be smooth compared with those in the encoder, can be truncated to reduce the computational cost. This truncation method can also be easily integrated with commonly used pruning approaches [15, 10] for further cost reduction, since those approaches instead remove redundant filters or weights, which are independent of the feature size. Moreover, one can determine efficient sizes for these features by validating the sampling efficiency of the corresponding band limits.
(2) Block-wise annotation. As a novel weak annotation for semantic segmentation, it is cheaper to collect than the full pixel-wise ground truth and keeps the low-resolution information of the ground-truth annotation. The intuition behind the proposed block-wise annotation is similar to that of existing weak annotations, which use only the coarse contours of the instances in the segmentation map [20, 11] and thus contain only the low-frequency information. The block-wise annotation can be directly associated with the spectral information, and its efficiency is well explained by our analysis scheme.
2 Proposed Spectral Analysis
As motivated above, we investigate the sampling efficiency of semantic segmentation networks through a spectral analysis of the cross-entropy (CE) objective function and the intersection-over-union (IoU) evaluation metric. In sections 2.1 and 2.2, CE is decomposed into frequency components and the IoU score is analyzed in the frequency domain, respectively. In section 2.3, we further derive the gradient propagation through convolutional layers in order to demonstrate the correlation between the segmentation output and the features in CNNs.
Notation
The notation in this section is defined as follows. In general, an upper-case letter, e.g. , denotes a function in the spatial domain, while the corresponding lower-case letter, e.g. , denotes its spectrum in the frequency domain, which is obtained by the Fourier transform of the spatial function. For example, the spectrum and is the Fourier transform ; where . The remaining notation is defined where it first appears.
2.1 Spectral Decomposition of Cross-Entropy
Let denote the segmentation logits produced by a semantic segmentation network and denote the ground-truth annotation, in which and are the indexes of the object class and the image pixel, respectively. The commonly used objective function for learning semantic segmentation, cross-entropy (CE), can be written as
(1)  
where . For all and , the integral can be transformed to the frequency domain as follows (see lemma 1 of the supplementary).
(2) 
where and are the spectra of the segmentation logits and of the ground-truth annotation, respectively. For simplicity, we hereafter refer to as the segmentation spectrum. The overall CE in Eq. 1 is hence given as
(3)  
where . The discrete form of Eq. 3 is
(4)  
By Eq. 4, we can thus decompose as the summation of over the frequency domain , where
(5) 
We hereafter refer to as the frequency components of CE. Based on this spectral analysis, the contribution of each frequency component to CE can be evaluated. Later, in section 3.1, we demonstrate that CE is mainly contributed by the low-frequency components.
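The identity behind this kind of decomposition can be checked numerically. The following is a minimal 1D sketch, not the full CE of Eq. 4: it assumes that each frequency component is the product of the two spectra (the discrete Plancherel theorem) and covers only a single inner-product term between an annotation and a logit map. The signals, the grid size `N`, and the helper names `dft` and `frequency_components` are illustrative assumptions.

```python
import cmath
import math


def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a 1D sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def frequency_components(annotation, logits):
    """Per-frequency terms whose sum equals sum_t annotation[t] * logits[t]."""
    n = len(annotation)
    b = dft(annotation)
    z = dft(logits)
    # Plancherel for real signals: sum_t B(t) Z(t) = (1/N) sum_k conj(b(k)) z(k).
    # Terms at k and N-k are complex conjugates, so keeping the real part of
    # each term leaves the total unchanged.
    return [(b[k].conjugate() * z[k]).real / n for k in range(n)]


# Hypothetical 1D "annotation" (a box) and smooth "logits" on a 32-pixel grid.
N = 32
annotation = [1.0 if 8 <= t < 24 else 0.0 for t in range(N)]
logits = [math.cos(2 * math.pi * (t - 15.5) / N) for t in range(N)]

components = frequency_components(annotation, logits)
spatial_sum = sum(a * l for a, l in zip(annotation, logits))
spectral_sum = sum(components)
print(spatial_sum, spectral_sum)  # the two sums agree up to rounding error
```

Because the spatial sum and the sum over frequencies agree, the contribution of any single frequency can be read off from `components`.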
2.2 Spectral Analysis of Intersection-over-Union Score
Given the segmentation logits and the ground-truth annotation , the intersection-over-union (IoU) score is typically defined as , where is the segmentation output . In order to analyze the IoU score in the frequency domain, we extend the above definition to the continuous space as follows:
(6)  
where denotes pixel indexes. Please note that Eq. 6 holds for each object class where we skip for simplicity. Besides, this definition is equivalent to the original definition of the IoU score for binarized segmentation maps. The components in Eq. 6 can be written as follows (see lemma 1 and lemma 2 of the supplementary),
(7)  
where . The IoU score can hence be written as
(8) 
and it is composed of two terms: and . The latter term is positively correlated with the component of CE in Eq. 1, since the function is monotonically increasing. As a result, minimizing maximizes as well as the IoU score. This derivation justifies the common learning procedure for semantic segmentation models: the networks are trained with the CE objective function and validated with the IoU score. In addition, although the IoU score in Eq. 8 cannot be explicitly decomposed into frequency components as in Eq. 3 because of its nonlinearity, the positive correlation between IoU and CE still helps to clarify its connection to the edge response (i.e. the high-frequency response), which most segmentation models aim to improve. Therefore, in our experiments and analysis, we adopt the decomposition of CE to study the frequency response (as well as the edge response) and take IoU as a reasonable evaluation metric.
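The claim that a continuous IoU reduces to the usual set-based IoU on binarized maps can be illustrated with a short sketch. It assumes the common product form, with intersection sum(B·S) and union sum(B + S − B·S); whether this matches Eq. 6 term by term is an assumption, and the function names and toy maps are hypothetical.

```python
def soft_iou(annotation, prediction):
    """Continuous IoU: products replace set intersection (assumed form)."""
    intersection = sum(b * s for b, s in zip(annotation, prediction))
    union = sum(b + s - b * s for b, s in zip(annotation, prediction))
    return intersection / union if union > 0 else 1.0


def set_iou(annotation, prediction):
    """Classical IoU on binary maps, counted pixel by pixel."""
    intersection = sum(1 for b, s in zip(annotation, prediction) if b == 1 and s == 1)
    union = sum(1 for b, s in zip(annotation, prediction) if b == 1 or s == 1)
    return intersection / union if union > 0 else 1.0


# Hypothetical flattened binary maps for a single class.
annotation = [0, 1, 1, 1, 0, 0]
prediction = [0, 0, 1, 1, 1, 0]
print(soft_iou(annotation, prediction), set_iou(annotation, prediction))

# The soft form also accepts probabilities instead of hard decisions.
probabilities = [0.1, 0.4, 0.9, 0.8, 0.6, 0.0]
print(soft_iou(annotation, probabilities))
```

On binary inputs the two functions coincide, while the soft form stays defined for non-binary segmentation outputs.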
2.3 Spectral Gradient of Convolutional Layers
In section 2.1, we have demonstrated that CE can be decomposed into the summation of the frequency components . Here we further derive the gradient propagation of within CNNs in order to reveal how updates the network. We hereafter refer to the gradient propagation in the frequency domain as the spectral gradient. For simplicity, we derive the spectral gradient for a convolutional layer consisting of a convolution and an activation function,
(9)  
where is the kernel, is the activation function, is the input feature, and is the output of the convolutional layer. We consider the softplus activation since it is differentiable everywhere, which simplifies the analysis. We now derive the spectral gradient for Eq. 9, that is, the gradient of with respect to , i.e. . Assuming the is small and , the gradient of under frequency with respect to the under frequency is written as follows (see lemma 3 of the supplementary).
(10) 
where . These assumptions rely on the fact that the numeric scales of the feature and the kernel are usually limited to a small range of values for the numeric stability of networks. Under these assumptions, the spectral gradient in Eq. 10 can be approximated by a delta function, which reveals how a variation of affects . In addition, due to the property of the delta function, is affected only by at the same frequency.
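For the purely linear part of the layer the delta-function behaviour is exact, and this can be verified directly. The sketch below is a simplification under stated assumptions: it uses a 1D circular convolution, omits the softplus activation entirely, and picks a hypothetical smoothing kernel. The helper names are illustrative.

```python
import cmath
import math


def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a 1D sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_convolve(kernel, feature):
    """Circular convolution of a short kernel with a periodic 1D feature."""
    n = len(feature)
    return [
        sum(kernel[s] * feature[(t - s) % n] for s in range(len(kernel)))
        for t in range(n)
    ]


def spectral_jacobian(kernel, n):
    """jacobian[nu_in][nu_out] = d z(nu_out) / d x(nu_in) for Z = K * X."""
    jacobian = []
    for nu_in in range(n):
        # Spatial signal whose spectrum is a unit impulse at frequency nu_in.
        basis = [cmath.exp(2j * math.pi * nu_in * t / n) / n for t in range(n)]
        # The map is linear, so the response to a unit impulse is the derivative.
        jacobian.append(dft(circular_convolve(kernel, basis)))
    return jacobian


N = 8
kernel = [0.25, 0.5, 0.25]  # hypothetical smoothing kernel
jacobian = spectral_jacobian(kernel, N)
kernel_spectrum = dft(kernel + [0.0] * (N - len(kernel)))

# Largest entry away from the diagonal, and mismatch on the diagonal.
off_diagonal = max(abs(jacobian[i][o]) for i in range(N) for o in range(N) if i != o)
diagonal_error = max(abs(jacobian[i][i] - kernel_spectrum[i]) for i in range(N))
print(off_diagonal, diagonal_error)  # both vanish up to rounding error
```

The Jacobian is diagonal with the kernel spectrum on its diagonal, so an input frequency influences only the same output frequency. The nonlinear activation is what turns this exact statement into the approximation discussed in the text.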
Following the spectral gradient in Eq. 10, we now derive the spectral gradient of with respect to the features in a convolutional layer. For each semantic class in the segmentation maps, let and denote the spectrum of the kernel and that of the segmentation output of a convolutional layer, respectively. By Eq. 5 and Eq. 10, the spectral gradient of can be written as follows (see lemma 4 of the supplementary).
(11)  
This results in a complicated spectral gradient that consists of and . Recall that the segmentation map is usually predicted on a low-resolution grid of the original image, as mentioned in section 1. As a result, should be smooth, such that is small when is large. Hence,
(12) 
Furthermore, should also be small at high frequencies since it is the product of the spectra of the segmentation maps, i.e. and , as shown in Eq. 5. We therefore neglect the high-frequency region of and focus on the case where is small. In this case, is negligible when , as well as , is large. Namely, modifying the features at a high frequency does not affect at a low frequency . We provide a numerical validation of this observation, for various cases of and , in section 3.1.
2.4 Discussion for Spectral Analysis
In this section, we summarize the discussion of the spectral analysis based on the theoretical results in the sections above. As mentioned in section 1, the segmentation map is usually predicted on a low-resolution grid of the original image, yet it is unclear whether this grid is sufficient to capture most of the information in CE. Following the decomposition of CE in Eq. 4, we define the truncated CE as the frequency components of CE filtered by a low-pass filter with the band limit , as follows,
(13) 
The loss of due to the band limit can thus be estimated from the discrepancy between and . As a result, a grid is considered efficient when the discrepancy between and is negligible.
In addition, these efficient grids can be applied to both the segmentation output and the ground-truth annotation, since is related to the product of and , as shown in Eq. 5.
Furthermore, based on the discussion of Eq. 12, the feature can also be truncated when is large, i.e. , . Based on the above observations, we can apply these efficient grids not only to the ground-truth annotations but also to the truncation of features in the decoder of the CNN.
In the following section, we apply the efficient grids to the features and the ground-truth annotation and validate the positive correlation between the loss of CE and the loss of IoU score caused by the efficient grids. The truncated CE and the efficient grids can therefore serve as an informative reference for removing the high-frequency components of the segmentation output and the ground-truth annotation in our applications.
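The effect of a band limit on a sum of frequency components can be sketched in one dimension. The example assumes an ideal low-pass filter that keeps the components whose absolute frequency does not exceed the band limit, and it measures the discrepancy as the absolute difference normalized by the untruncated sum. Both choices, the toy signals, and the helper names are assumptions for illustration and may differ from the exact forms of Eq. 13 and Eq. 14.

```python
import cmath
import math


def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a 1D sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def frequency_components(annotation, logits):
    """Per-frequency terms whose sum equals sum_t annotation[t] * logits[t]."""
    n = len(annotation)
    b, z = dft(annotation), dft(logits)
    return [(b[k].conjugate() * z[k]).real / n for k in range(n)]


def truncated_sum(components, band_limit):
    """Sum only the components with |frequency| <= band_limit."""
    n = len(components)
    # Index k and index n - k carry the same absolute frequency.
    return sum(v for k, v in enumerate(components) if min(k, n - k) <= band_limit)


def relative_discrepancy(components, band_limit):
    """Assumed form: |full - truncated| / |full|."""
    full = sum(components)
    return abs(full - truncated_sum(components, band_limit)) / abs(full)


# Hypothetical box annotation and very smooth logits on a 64-pixel grid.
N = 64
annotation = [1.0 if 16 <= t < 48 else 0.0 for t in range(N)]
logits = [2.0 + math.cos(2 * math.pi * (t - 31.5) / N) for t in range(N)]

components = frequency_components(annotation, logits)
discrepancies = {
    limit: relative_discrepancy(components, limit) for limit in (0, 1, 2, 4, 8, 16, 32)
}
for limit, value in discrepancies.items():
    print(limit, value)
```

In this toy case the logits contain only a constant and one cosine, so every band limit of at least 1 already captures the full sum, whereas keeping the constant term alone does not.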
3 Validation and Applications
In section 2, we propose the spectral analysis to analyze the deep learning framework from the perspective of the frequency domain. The frequency components of CE and the spectral gradient are given in Eq. 5 and Eq. 11, respectively. In section 3.1, we validate the proposed spectral analysis, including the spectral decomposition of CE and the spectral gradients, on the segmentation datasets and segmentation models described below. Based on these numeric validations, we identify the efficient grids and apply them to the features in CNNs and to the ground-truth annotation. This leads us to propose two novel applications: (1) feature truncation and (2) block-wise annotation, which are detailed in section 3.2 and section 3.3, respectively.
Datasets.
We conduct the experiments on the following three semantic segmentation datasets: the PASCAL semantic segmentation benchmark [7], the DeepGlobe land-cover classification challenge [6], and the Cityscapes pixel-level semantic labeling task [5] (denoted as PASCAL, DeepGlobe, and Cityscapes, respectively). The PASCAL dataset contains 21 categories, 1464 training images, and 1449 validation images; the dataset is further augmented by the extra annotations from [8]. The DeepGlobe dataset contains 7 categories and 803 training images, which are split into 701 and 102 images for training and validation, respectively. The Cityscapes dataset contains 19 categories, 2975 training images, and 500 validation images.
Segmentation models and implementation details.
In our experiments, we use standard segmentation models, including DeepLab v3+ [4] and Deep Aggregation Net (DAN) [14]. We adopt ResNet-101 [9] pretrained on ImageNet-1k [24] as the backbone of these models. The models are trained with the following policies. For all datasets, the images are randomly cropped to 513×513 pixels and randomly flipped in the training stage, and the training batch size is 8. For the PASCAL dataset, the model is trained with an initial learning rate of 0.0007 for 100 epochs; for the DeepGlobe dataset, with an initial learning rate of 0.007 for 600 epochs; and for the Cityscapes dataset, with an initial learning rate of 0.001 for 200 epochs.
3.1 Validation of Spectral Analysis
Spectral Decomposition of CE.
Recall that in Eq. 4, , where , as well as the fact that the segmentation map leads to small in the high-frequency region since it is upsampled from a low-resolution grid. The should therefore also be small in the high-frequency region. To examine the limitation of such a low-resolution grid, we evaluate the truncated CE in Eq. 13. Here, we further define the relative discrepancy between the truncated CE and CE,
(14) 
In our experiments, the band limit is at most 256, which is half the size of the segmentation map, i.e. 513. Hence .
We evaluate , , and for the various segmentation models (DeepLab v3+ and DAN) and datasets (PASCAL, DeepGlobe, and Cityscapes); and are the values of and averaged over all semantic classes, respectively. In addition, the evaluation is performed at both the initial and final stages of model training in order to monitor the training progress. The results are shown in Fig. 1 and Fig. 2. The results in Fig. 1 indicate that is indeed small in the high-frequency region, which leads to a small , as discussed above. These results also support the claim that CE is mainly contributed by the low-frequency components. On the other hand, the results in Fig. 2 reveal that the low-frequency components of decrease as training progresses, suggesting that the model learns to capture the low-resolution grid more effectively.
Here we further investigate the limitation of the low-resolution grid and the efficient resolution of the features based on examined in Fig. 1. For comparison, we list the numeric values of the relative discrepancy in Table 1. As the resolution of the grid increases, the relative discrepancy becomes smaller, since more high-frequency information is captured. More specifically, it decreases dramatically to small values when in all experiments with the trained models. On the other hand, the segmentation maps predicted by these models are evaluated on the 129×129 low-resolution grid, which corresponds to a band limit of 64. This empirical evidence suggests that grids with could still serve for efficient sampling of the segmentation map. These efficient grids are further validated in our proposed applications, i.e. feature truncation and block-wise annotation, in section 3.2 and section 3.3, respectively. Besides, it is apparent from Table 1 that the relative discrepancy on the DeepGlobe dataset is smaller than that on the PASCAL and Cityscapes datasets. A better efficacy of feature truncation and block-wise annotation is therefore expected on DeepGlobe and is shown in the following sections. In summary, our analysis can serve as a quantitative criterion for sampling efficiency across different datasets and models.
Table 1. Relative discrepancy at each band limit.
Dataset \ Band limit      8       16      32      64      256
DeepLab
  PASCAL                0.275   0.040   0.018   0.010   0
  DeepGlobe             0.102   0.028   0.010   0.005   0
  Cityscapes            2.152   0.668   0.146   0.017   0
DAN
  PASCAL                0.248   0.035   0.019   0.009   0
  DeepGlobe             0.081   0.021   0.008   0.004   0
  Cityscapes            1.307   0.255   0.029   0.012   0
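As a hedged illustration of how such a table could be used as a quantitative criterion, the sketch below picks the smallest band limit whose tabulated discrepancy falls under a tolerance. The values are copied from the DeepLab rows of Table 1; the tolerance of 0.05 and the function name are hypothetical choices, not part of the analysis above.

```python
# Relative discrepancy per band limit, copied from the DeepLab rows of Table 1.
TABLE_1_DEEPLAB = {
    "PASCAL": {8: 0.275, 16: 0.040, 32: 0.018, 64: 0.010, 256: 0.0},
    "DeepGlobe": {8: 0.102, 16: 0.028, 32: 0.010, 64: 0.005, 256: 0.0},
    "Cityscapes": {8: 2.152, 16: 0.668, 32: 0.146, 64: 0.017, 256: 0.0},
}


def smallest_efficient_band_limit(discrepancy_by_band_limit, tolerance):
    """Return the lowest band limit whose discrepancy is within the tolerance."""
    for band_limit in sorted(discrepancy_by_band_limit):
        if discrepancy_by_band_limit[band_limit] <= tolerance:
            return band_limit
    return None  # no tabulated band limit satisfies the tolerance


TOLERANCE = 0.05  # hypothetical threshold, chosen only for illustration
efficient = {
    dataset: smallest_efficient_band_limit(values, TOLERANCE)
    for dataset, values in TABLE_1_DEEPLAB.items()
}
print(efficient)
```

With this particular tolerance, Cityscapes requires a much higher band limit than PASCAL or DeepGlobe, which mirrors the ordering of the discrepancies in the table.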
Spectral Gradient.
We now demonstrate that the analytic form of the spectral gradient derived in section 2.3 can be approximated by a delta function. We validate this approximation numerically on the three datasets (PASCAL, DeepGlobe, and Cityscapes) and take the spectra of the ASPP features in DeepLab v3+ as our example . We also provide validations of for various operations within the decoder module of DeepLab v3+, including convolution, ReLU, and bilinear upsampling, in which and denote the input and output spectra of these operations, respectively. These spectral gradients are illustrated in Fig. 4, where each column represents a different or , while are used for the respective rows, in which stands for the size of the input features. Please note that each panel of spectral gradients includes all possible . Clearly, for the convolution, ReLU, and bilinear upsampling operations exhibit delta responses, which is consistent with our discussion of Eq. 10, where is approximated by . As for on the PASCAL, DeepGlobe, and Cityscapes datasets, it still behaves closely to a delta function despite the complex operations along the gradient propagation from the segmentation output back to the ASPP module. This result helps to verify the assumptions used for the derivation in section 2.3 as well as the corresponding approximation, i.e. Eq. 12. Note that the segmentation output is related not only to the ASPP features but also to the low-level features from the encoder, as shown in Fig. 3. As the paths of gradient propagation from the segmentation output to these two features differ only by a bilinear upsampling operator, whose spectral gradient acts as a delta function, we validate only on the former. Besides the validation on the DeepLab v3+ model above, we also provide the validation on DAN in Fig. 8 of the supplementary, where the spectral gradients can likewise be well approximated by a delta function. This approximation enables us to further consider the feature truncation in section 3.2, which removes the redundant high-frequency components of features in CNNs without significant degradation of the segmentation performance.
3.2 Application on Feature Truncation
Recall that in section 3.1, truncating the components of the features in the decoder at frequency affects only the , which is negligible when is large. Here, we perform experiments on DeepLab v3+ and DAN to demonstrate that our feature truncation can reduce the computational cost while maintaining the performance. Notably, the encoder of DeepLab v3+ has 95.6 billion FLOPs (floating-point operations) and 60.1 million parameters. In contrast, its decoder has 43.4 billion FLOPs but only 1.3 million parameters. Similarly, the encoder of DAN has 95.6 billion FLOPs with 60.1 million parameters, while its decoder has 11.1 billion FLOPs with only 0.4 million parameters. Although the decoder has a small number of parameters, its cost is comparable with that of the encoder, since it upsamples the features for the dense segmentation map and thus computes on large features. Truncating the features in the decoder is therefore expected to effectively reduce the computational cost. Moreover, one can combine feature truncation with a typical network pruning method to reduce the computational cost in two different aspects, i.e. the feature size and the redundant parameters.
Here, we adopt the Soft Filter Pruning (SFP) method [10] and combine it with our feature truncation as an example. The experiments are conducted under the following conditions. For SFP, we use pruning rates of 20% to 60% for the encoder and 20% for the decoder, since the encoder has many more parameters than the decoder and is potentially over-parameterized. For the feature truncation applied to the decoder, we downsample the decoder features, including the high-level features of the ASPP module and the low-level features of the backbone network, from the original size of 129×129 to the efficient low-resolution grids suggested by the analysis in section 3.1, i.e. the grids with . For simplicity, we validate only the 65×65 and 33×33 grids, with band limits of 32 and 16, respectively, in our experiments. The results for DeepLab v3+ are summarized in Table 2. Similar results for DAN are summarized in Table 4 of the supplementary. Clearly, the IoU score and the FLOPs decrease as the pruning rate increases or as the feature size decreases. In the baseline model of DeepLab v3+, the feature truncation reduces FLOPs by 23.3% when the features are truncated from 129 to 65 and leads to mIoU drops of 0.5% (from 78.6% to 78.1%), 0.3% (from 53.9% to 53.6%), and 5.1% (from 67.8% to 62.7%) on the PASCAL, DeepGlobe, and Cityscapes datasets, respectively. Recall from Table 1 that the relative discrepancies at band limit 32 for the PASCAL, DeepGlobe, and Cityscapes datasets are 0.018, 0.010, and 0.146, respectively. For these datasets, the mIoU drop is positively correlated with the relative discrepancy, which is the relative information loss of CE. A similar correlation is found when the features are truncated from 129 to 33: the feature truncation leads to mIoU drops of 3.0% (from 78.6% to 75.6%), 0.8% (from 53.9% to 53.1%), and 14.5% (from 67.8% to 53.3%) on the PASCAL, DeepGlobe, and Cityscapes datasets, respectively. This trend is also consistent with the relative discrepancies at band limit 16, which are 0.040, 0.028, and 0.668 for these datasets, respectively. Clearly, the larger the relative discrepancy, the larger the mIoU drop.
Similar trends are also observed for DAN and for the pruned models that integrate SFP. The correlation between the loss of CE and the drop in IoU score is consistent with the discussion in section 2.2. We can therefore use the relative discrepancy to estimate the efficacy of feature truncation. For the PASCAL and DeepGlobe datasets, which have small relative discrepancies, feature truncation effectively reduces FLOPs by around 20% with a small mIoU drop. By further integrating SFP with a 20% pruning rate, we reduce the FLOPs by 45.3% with feature size 65 at a 2.6% mIoU drop (from 78.6% to 76.0%) on the PASCAL dataset. Combining SFP with a 40% pruning rate, we reduce the FLOPs by 66.1% with the features truncated to 33 at only a 1.9% mIoU drop (from 53.9% to 52.0%) on the DeepGlobe dataset. These results demonstrate that feature truncation can be effectively integrated with typical network pruning approaches to reduce the computational cost.
To further analyze the efficiency of these models, we define the FLOPs per IoU score (FPI) as FLOPs/mIoU. The lower the FPI, the better the efficiency of the model. The FPI for DeepLab v3+ and DAN is illustrated in Fig. 4(a) and Fig. 4(b), respectively; the numeric values of FPI are summarized in Table 2 and Table 4, respectively. The FPI decreases as the feature size is reduced from 129 to 65, indicating an efficient cost reduction by the feature truncation. As the feature size is reduced from 129 to 33, a similar decrease in FPI is achieved in the experiments on the PASCAL and DeepGlobe datasets. Yet on the Cityscapes dataset the FPI rises again when the feature size is further reduced from 65 to 33 in most settings (e.g. from 170 to 185 for the baseline DeepLab v3+ in Table 2). Such increments of FPI are consistent with the large relative discrepancy on Cityscapes for both DeepLab v3+ and DAN (cf. Table 1), which indicates a large information loss of CE as well as a large mIoU drop. Such a large mIoU drop leads to an inefficient model reduction.
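The FPI definition is simple enough to reproduce from the reported numbers. The sketch below assumes that FLOPs are given in billions and mIoU in percent, as in Table 2; the function name is illustrative.

```python
def flops_per_iou(flops_billion, miou_percent):
    """FPI = FLOPs / mIoU, with mIoU converted from percent to a fraction."""
    return flops_billion / (miou_percent / 100.0)


# Baseline DeepLab v3+ on PASCAL, feature size 129: 139 billion FLOPs, 78.6% mIoU.
fpi_129 = flops_per_iou(139, 78.6)
# Same model with features truncated to 33: 98 billion FLOPs, 75.6% mIoU.
fpi_33 = flops_per_iou(98, 75.6)

print(round(fpi_129), round(fpi_33))  # 177 and 130, matching the Table 2 entries
```

Because FPI divides cost by accuracy, a truncation lowers FPI only when the relative saving in FLOPs exceeds the relative loss in mIoU.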
In summary, the performance of DeepLab v3+ and DAN under various setups of feature truncation and SFP is evaluated on the three datasets. These results demonstrate that the proposed feature truncation integrates effectively with typical network pruning approaches to reduce the computational cost. Furthermore, the trend of the segmentation performance is consistent with our theoretical analysis. To our knowledge, this is the first work that provides a theoretical framework for analyzing segmentation performance from the perspective of the frequency domain. As mentioned in section 1, existing segmentation models predict the segmentation maps on a low-resolution grid to save computational cost. Our framework serves as an effective analysis tool for estimating the efficient grid size of the segmentation maps, as well as of the features in CNNs, to save computational cost.
Table 2. Results for DeepLab v3+ ("drop" is the mIoU drop; FPI and FLOPs are in units of 10^9).
Feature   PASCAL                  DeepGlobe               Cityscapes              FLOPs   Pruned
size      mIoU    drop    FPI     mIoU    drop    FPI     mIoU    drop    FPI             FLOPs
Baseline
129       78.6%   0.0%    177     53.9%   0.0%    258     67.8%   0.0%    205     139     0.0%
65        78.1%   0.5%    136     53.6%   0.3%    199     62.7%   5.1%    170     107     23.3%
33        75.6%   3.0%    130     53.1%   0.8%    185     53.3%   14.5%   185     98      29.2%
20% Pruning rate for encoder
129       76.6%   2.0%    130     53.7%   0.2%    186     67.2%   0.6%    148     100     28.2%
65        76.0%   2.6%    100     53.4%   0.5%    143     62.4%   5.3%    122     76      45.6%
33        73.2%   5.4%    96      52.7%   1.2%    133     53.4%   14.4%   131     70      49.6%
40% Pruning rate for encoder
129       74.4%   4.2%    103     53.0%   0.9%    145     66.1%   1.7%    116     77      44.7%
65        73.9%   4.7%    72      52.7%   1.2%    101     61.5%   6.3%    86      53      61.8%
33        72.1%   6.5%    65      52.0%   1.9%    91      52.8%   14.9%   89      47      66.1%
60% Pruning rate for encoder
129       65.1%   13.5%   90      50.1%   3.8%    117     58.6%   9.1%    100     59      57.7%
65        64.7%   13.9%   54      49.9%   4.0%    70      54.8%   13.0%   64      35      74.8%
33        63.3%   15.3%   46      49.4%   4.5%    59      47.4%   20.4%   61      29      79.1%
3.3 Application on Block-wise Annotation
In section 3.1, we determine the efficient low-resolution grid for the segmentation maps by analyzing the relative discrepancy, which estimates the discrepancy between CE and the truncated CE. In this section, we apply these low-resolution grids to the ground-truth annotation, resulting in the proposed block-wise annotation, a novel weak annotation. Moreover, we demonstrate that the performance of a semantic segmentation network trained with block-wise annotation can be estimated from the relative discrepancy. We train DeepLab v3+ and DAN with block-wise annotation at various band limits (from 1 to 256) and evaluate them on the original pixel-wise annotation. Examples of the block-wise annotation and the predictions of the two models on the PASCAL, DeepGlobe, and Cityscapes datasets are illustrated in Fig. 7. Please note that the block-wise annotation at band limit 256 is equivalent to the original pixel-wise ground truth. The experimental results are summarized in Table 3. For each band limit, we evaluate the IoU score and the relative IoU score. In particular, the relative IoU score is the ratio of the IoU score to that at band limit 256, i.e. the IoU score of the network trained with the pixel-wise ground-truth annotation. As the band limit decreases, the IoU score and the relative IoU score decrease. For the PASCAL and DeepGlobe datasets, both scores drop noticeably when the band limit decreases from 16 to 8: the relative IoU score drops from 94.9% to 86.8% on the PASCAL dataset and from 99.2% to 95.2% on the DeepGlobe dataset. This observation is consistent with the large increase of the relative discrepancy in Table 1: as the band limit decreases from 16 to 8, the relative discrepancy increases from 0.040 to 0.275 on the PASCAL dataset and from 0.028 to 0.102 on the DeepGlobe dataset. For the Cityscapes dataset, the relative IoU score drops dramatically from 100% to 88.2% when the band limit decreases from 256 to 32. Such a large decrease is consistent with the large relative discrepancy on this dataset.
These large increments of the estimated discrepancy between CE and the truncated CE indicate a large information loss in CE as well as a loss of IoU score. Besides, the relative IoU score on the DeepGlobe dataset is larger than that on the PASCAL dataset. This result reflects the fact that the DeepGlobe dataset contains mainly low-frequency components, which agrees with the observation of the discrepancy estimate in Section 3.1. Similar observations hold for the experiments with DAN. To summarize all experiments, we illustrate the correlation between the relative mIoU and the discrepancy estimate (cf. Table 1) in Fig. 6. Clearly, the larger the discrepancy estimate, the lower the relative IoU score. As a result, one can estimate the performance of a semantic segmentation network trained with the blockwise annotation by simply evaluating the discrepancy estimate, without running the experiments over all band limits.
In summary, the proposed spectral analysis enables the analysis of weak annotation in the frequency domain. Our analysis reveals the correlation between the segmentation performance and the low-resolution grid of the segmentation maps. Based on this analysis, we propose the blockwise annotation, a novel approach to weak annotation that applies the low-resolution grid to the segmentation maps. Notably, these low-resolution grids correspond to the coarse contours of instances in the segmentation maps, which are widely used in existing weak annotations. In this work, we aim to provide the theoretical background of the spectral analysis and the blockwise annotation. Further research should investigate the spectral analysis of existing weak annotations [20, 11].
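To make the idea of applying a low-resolution grid to a ground-truth map concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each grid cell receives the majority label of the pixels it covers; the function name `blockwise_annotation` and the parameters `grid` and `num_classes` are hypothetical, and the mapping from a band limit to a grid size is left to the caller.

```python
import numpy as np

def blockwise_annotation(labels, grid, num_classes):
    """Coarsen an (H, W) integer label map onto a grid x grid lattice.

    Assumption: every block is filled with its majority label.
    Requires grid <= min(H, W) so that no block is empty.
    """
    h, w = labels.shape
    # Integer block boundaries along each axis.
    rows = np.linspace(0, h, grid + 1).astype(int)
    cols = np.linspace(0, w, grid + 1).astype(int)
    out = np.empty_like(labels)
    for i in range(grid):
        for j in range(grid):
            block = labels[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            counts = np.bincount(block.ravel(), minlength=num_classes)
            # Fill the whole block with its most frequent label.
            out[rows[i]:rows[i + 1], cols[j]:cols[j + 1]] = counts.argmax()
    return out

labels = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 1],
                   [2, 2, 1, 1],
                   [2, 2, 2, 1]])
coarse = blockwise_annotation(labels, grid=2, num_classes=3)
print(coarse)
```

With a grid as fine as the image itself the function returns the input unchanged, which mirrors the statement that the finest blockwise annotation coincides with the pixelwise ground truth.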


[Figure: image grid with columns for band limits 256, 32, 16 and 8 and rows for the PASCAL, DeepGlobe and Cityscapes datasets.]
4 Conclusion
Our proposed spectral analysis for semantic segmentation networks correlates CE, IoU score and gradient back-propagation from the spectral point of view. We find that CE is mainly contributed by the low-frequency components of the segmentation maps, which are associated with the features in CNNs at the same frequency. As a result, a quantitative estimate of the sampling efficiency for the segmentation map becomes possible. We test our theory on two applications: feature truncation and blockwise annotation. Our results show that combining feature truncation with network pruning can save significant computational cost with a small loss of accuracy. In addition, blockwise annotation can potentially reduce labeling cost further, since a network trained with the blockwise annotation on an efficient grid performs close to the original network.
References

[1] (2018) COCO-Stuff: thing and stuff classes in context. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
[2] (2014) Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062. Cited by: §1.
[3] (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587. Cited by: §1.
[4] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision (ECCV). Cited by: §1, §1, Figure 3, §3.
[5] (2016) The Cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §3.
[6] (2018) DeepGlobe 2018: a challenge to parse the earth through satellite images. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops). Cited by: §3.
[7] (2015) The PASCAL visual object classes challenge: a retrospective. International Journal of Computer Vision (IJCV). Cited by: §1, §3.
[8] (2011) Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision (ICCV). Cited by: §3.
[9] (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §3.
[10] (2019) Asymptotic soft filter pruning for deep convolutional neural networks. IEEE Transactions on Cybernetics. Cited by: §1, §3.2.
[11] (2017) Simple does it: weakly supervised instance and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §1, §3.3.
[12] (2019) PointRend: image segmentation as rendering. arXiv:1912.08193. Cited by: §1.
[13] (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems (NIPS). Cited by: §1.
[14] (2018) Deep aggregation net for land cover classification. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops). Cited by: §3.
[15] (2019) Rethinking the value of network pruning. In International Conference on Learning Representations (ICLR). Cited by: §1.
[16] (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §1.
[17] (2019) Theory of the frequency principle for general deep neural networks. arXiv:1906.09235. Cited by: §1.
[18] (2013) Machine learning for aerial image labeling. Ph.D. thesis, University of Toronto. Cited by: §1.
[19] (2014) The role of context for object detection and semantic segmentation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
[20] (2015) Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §1, §3.3.
[21] (2018) On the spectral bias of neural networks. arXiv:1806.08734. Cited by: §1.
[22] (2019) The convergence rate of neural networks for learned functions of different frequencies. In Advances in Neural Information Processing Systems (NIPS), pp. 4763–4772. Cited by: §1.
[23] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Cited by: §1.
[24] (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV). Cited by: §3.
[25] (2019) Symmetric cross entropy for robust learning with noisy labels. In IEEE International Conference on Computer Vision (ICCV), pp. 322–330. Cited by: §1.
[26] (2019) L_DMI: a novel information-theoretic loss function for training deep nets robust to label noise. In Advances in Neural Information Processing Systems (NIPS), pp. 6222–6233. Cited by: §1.
[27] (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems (NIPS), pp. 8778–8788. Cited by: §1.
[28] (2017) Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
[29] (2017) Scene parsing through ADE20K dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
5 Supplementary
5.1 Fourier transform of spatial integral
Lemma 1.
Given two functions and in the spatial domain , the overlapping integral can be transformed into the frequency domain as
(15) 
where and .
Proof.
By the convolution lemma, integral can be written as
(16) 
; where denotes the convolution operation as ; is the inverse Fourier transform operator. Eq. 16 can now be written as
(17)  
By the orthogonality of Fourier basis, we have , where is the Dirac delta function:
(18) 
and its integral property is . Hence, Eq. 17 is given as
(19)  
∎
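A discrete analogue of Lemma 1 can be checked numerically. The sketch below assumes the lemma takes the Parseval-type form implied by the convolution argument in the proof, namely sum_n f[n] g[n] = (1/N) sum_k F[k] G[-k] for the unnormalized DFT; the names `f`, `g`, `F` and `G` are illustrative and are not taken from the paper.

```python
import numpy as np

def spatial_overlap(f, g):
    """Overlap sum in the spatial domain: sum_n f[n] * g[n]."""
    return np.sum(f * g)

def spectral_overlap(f, g):
    """Same quantity from the spectra: (1/N) * sum_k F[k] * G[-k]."""
    n = len(f)
    F = np.fft.fft(f)
    G = np.fft.fft(g)
    # After reversing and rolling, index k holds G[(-k) mod N].
    G_neg = np.roll(G[::-1], 1)
    return np.sum(F * G_neg) / n

rng = np.random.default_rng(0)
f = rng.standard_normal(64)
g = rng.standard_normal(64)
print(spatial_overlap(f, g), spectral_overlap(f, g).real)
```

The two printed values should agree up to floating-point error; the imaginary part of the spectral sum vanishes for real inputs.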
Lemma 2.
Given a function in the spatial domain , the integral can be transformed into the frequency domain as
(20) 
where .
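As a quick sanity check, and assuming Lemma 2 takes the usual form in which the spatial integral of a function equals the zero-frequency component of its spectrum, the discrete counterpart can be verified as follows. The helper names are hypothetical.

```python
import numpy as np

def spatial_sum(f):
    """Discrete counterpart of the spatial integral: sum_n f[n]."""
    return np.sum(f)

def zero_frequency(f):
    """Zero-frequency (DC) component of the unnormalized DFT of f."""
    return np.fft.fft(f)[0]

rng = np.random.default_rng(1)
f = rng.standard_normal(32)
print(spatial_sum(f), zero_frequency(f).real)
```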
Proof.
5.2 Gradient propagation for a convolution layer
Consider a convolution layer that consists of the convolutional kernel and the softplus activation function ; is the spatial location. Let denote the input; the output of the convolution layer is written as
(22)  
Lemma 3.
Assuming is small and , the spectral gradient can be approximated as
(23) 
where , and are , and , respectively.
Proof.
The spectral gradient of a convolution layer consists of the spectral gradient of the convolution operator and that of the activation function. We derive the two gradients separately and combine them at the end of the derivation.
For the convolution operator, it can be written as in the frequency domain , where , and are , and , respectively. Without loss of generality, in the discrete frequency domain, the gradient of under a specific frequency with respect to the under frequency is defined as
(24) 
Here the delta function equals 1 if the two frequencies coincide and 0 otherwise.
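The diagonal structure of this gradient follows from the convolution theorem and can be illustrated numerically. The sketch below is written under stated assumptions: a one-dimensional circular convolution, the unnormalized DFT, and a gradient taken with respect to the input spectrum. All function and variable names are illustrative.

```python
import numpy as np

def circular_conv(x, k):
    """Circular convolution z[i] = sum_m x[(i - m) mod N] * k[m]."""
    n = len(x)
    return np.array([sum(x[(i - m) % n] * k[m] for m in range(n))
                     for i in range(n)])

def spectral_jacobian(k):
    """Jacobian dZ[i]/dX[j] of the map X -> Z = FFT(conv(IFFT(X), k))."""
    n = len(k)
    jac = np.zeros((n, n), dtype=complex)
    for j in range(n):
        unit = np.zeros(n, dtype=complex)
        unit[j] = 1.0  # excite a single input frequency
        # The map is linear, so its response to a unit input is column j.
        jac[:, j] = np.fft.fft(circular_conv(np.fft.ifft(unit), k))
    return jac

rng = np.random.default_rng(0)
kernel = rng.standard_normal(8)
jac = spectral_jacobian(kernel)
expected = np.diag(np.fft.fft(kernel))
print(np.allclose(jac, expected))
```

Because convolution is commutative, the same check applies with the roles of the input and the kernel exchanged.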
The softplus function can first be expressed as a Taylor series
(25)  
in which is small since the kernel is small and by the assumption. Hence, becomes negligible. The Fourier transform of is thus given as
(26)  
and its spectral gradient is
(27)  
where is a dummy variable for the convolution and is the spectrum size of features. By Eq. 24 and Eq. 27, the spectral gradient of a convolutional layer in Eq. 9 is then written as
(28)  
where are the frequency indices. Since is small as argued above, the corresponding spectrum should also be small. We can therefore neglect the second term of Eq. 28, i.e. , and approximate Eq. 28 as
(29) 
. ∎
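The small-argument approximation of the softplus function used in the proof can be checked numerically. The sketch below uses the standard Maclaurin expansion log(1 + e^z) = log 2 + z/2 + z^2/8 + O(z^4), which is assumed to match the expansion in Eq. 25; the range |z| <= 0.1 is an arbitrary illustrative choice of "small".

```python
import numpy as np

def softplus(z):
    """Softplus activation log(1 + exp(z))."""
    return np.log1p(np.exp(z))

def softplus_taylor2(z):
    """Second-order Maclaurin expansion of softplus around z = 0."""
    return np.log(2.0) + z / 2.0 + z ** 2 / 8.0

def softplus_taylor1(z):
    """First-order expansion: constant offset plus slope 1/2."""
    return np.log(2.0) + z / 2.0

z = np.linspace(-0.1, 0.1, 201)
err_second_order = np.max(np.abs(softplus(z) - softplus_taylor2(z)))
err_first_order = np.max(np.abs(softplus(z) - softplus_taylor1(z)))
print(err_second_order, err_first_order)
```

Near zero the activation behaves like a linear map with slope 1/2 plus a small quadratic correction, which is the regime in which dropping the higher-order contribution to the spectral gradient is reasonable.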
5.3 Gradient propagation for the frequency component of CE
Lemma 4.
Consider a convolutional layer that satisfies the assumptions of Lemma 3. Let denote the spectrum of the input feature. For each semantic class in the segmentation maps, let and denote the spectrum of the kernel and that of the segmentation output, respectively. The spectral gradient for the frequency component of CE, is
(30)  