1 Introduction
Recent progress in hardware technology has made it possible to run efficient deep learning models on mobile devices, enabling many on-device experiences that rely on deep-learning-based computer vision systems. However, many tasks, including semantic segmentation, still require downsampling the input image, trading off accuracy in finer details for better inference speed [26, 56]. We show that uniform downsampling is suboptimal and propose an alternative content-aware adaptive downsampling technique driven by semantic boundaries. We hypothesize that, for better segmentation quality, more pixels should be picked near semantic boundaries. With this intuition, we formulate a neural network model that learns content-adaptive sampling from ground-truth semantic boundaries, see Fig. 1.
Figure 1: (a) original image; (b) ground truth labels; (c) target semantic boundaries and adaptive sampling locations; (d) interpolation of sparse classifications (target is white).
The advantages of our non-uniform downsampling over the uniform one are twofold. First, common uniform downsampling complicates accurate localization of boundaries in the original image. Indeed, assuming $n$ uniformly sampled points over an image of diameter $D$, the distance between neighboring points gives a bound of order $O(D/\sqrt{n})$ for the segmentation boundary localization error. In contrast, the analysis in Appendix A shows that the error bound decreases significantly faster with respect to the number of sample points, as $O(\kappa L^2 / n^2)$, assuming they are uniformly distributed near a segment boundary of max curvature $\kappa$ and length $L$. Our non-uniform boundary-aware sampling approach selects more pixels around semantic boundaries, reducing quantization errors on the boundaries. Second, our non-uniform sampling implicitly accounts for scale variation by reducing the portion of the downsampled image occupied by larger segments and increasing that of smaller segments. It is well known that the presence of the same object class at different scales complicates automatic image understanding [23, 46, 19, 18, 55, 6, 7, 57, 8]. Thus, the scale-equalizing effect of our adaptive downsampling simplifies learning. As shown in Fig. 1(c,d), our approach samples many pixels inside the cyclist, while uniform downsampling may miss that person altogether.
With the proposed content-adaptive sampling, a semantic segmentation system consists of three parts, see Fig. 2. The first is our non-uniform downsampling block, trained to sample pixels near semantic boundaries of target classes. The second part segments the downsampled image and can be based on practically any existing segmentation model. The last part upsamples the segmentation result, producing a segmentation map at the original (or any given) resolution. Since we need to invert the non-uniform sampling, standard CNN interpolation techniques are not applicable.
Our contributions in this paper are as follows:

- We propose adaptive downsampling aimed at accurate representation of targeted semantic boundaries. We use an efficient CNN to reproduce such downsampling.

- Most segmentation architectures can benefit from non-uniform downsampling by incorporating our content-adaptive sampling and interpolation components.

- We apply our framework to semantic segmentation and show consistent improvements on many architectures and datasets. Our cost-performance analysis accounts for the computational overhead. We also analyze improvements from our adaptive downsampling at semantic boundaries and on objects of different sizes.
2 Prior work
Semantic segmentation requires a class assignment for each pixel in an image. This problem is important for many automated navigation applications. We first review related literature on this topic and then briefly review relevant non-uniform sampling methods.
Many segmentation networks are built upon basic image classification networks [36, 53, 6, 7, 57, 8]. These approaches modify the base model to produce dense, higher-resolution feature maps. For example, Long et al. [36] used a fully convolutional network [33] and trainable deconvolution layers to produce higher-resolution dense feature maps. They also note that the algorithme à trous, a technique well known in signal processing [25], is a way to increase the resolution of the feature maps. This idea was studied in [6], where the authors introduced dilated convolutions that allow removal of max-pooling layers from a trained model, producing higher-resolution feature maps with a larger field of view without the need to retrain the model.
Table 1: Pros and cons of two-stage, our, and single-stage approaches.

                     | two-stage [19, 18, 21] | ours (Sec. 3) | single-stage [36, 44, 6]
accuracy             |          ++            |       +       |           −
speed                |          −             |       +       |           ++
multi-object speed   |          −−            |       +       |           ++
simplicity           |          −             |       +       |           ++
multi-scale          |          ++            |       +       |           −
boundary precision   |          ++            |       +       |           −
Segmentation models built upon classification models inherit one limiting property: the base classification models [32, 48, 22] tend to have many features in the deeper layers. This results in extensive resource consumption when increasing the resolution of later feature maps (using, for example, the algorithme à trous [25]). As a result, the final output is typically chosen to be of lower resolution, with interpolation employed to upscale the final score map.
The alternative direction for segmentation (and, more generally, for pixel-level prediction) is based on “hourglass models” that first produce low-resolution deep features and then gradually upsample the features employing common network operations and skip connections [44, 5, 39, 38]. The need for and advantage of aggregating information from different scales have long been recognized in the computer vision literature [23, 46, 19, 18, 55, 54, 6, 7, 57, 8]. One way to tackle the multi-scale challenge is to first detect the location of objects and then segment the image using either a cropped original image [19, 54] or cropped feature maps [18, 21]. These two-stage approaches separate the problems of scale learning and segmentation, making the latter task easier. As a result, the accuracy of segmentation improves and instance-level segmentation is straightforward. However, such an approach comes with a significant computational cost when many objects are present, since each object needs to be segmented individually. Our method improves upon the single-stage approach at a small computational cost and is thus positioned in between these two approaches. Table 1 outlines pros and cons of our approach compared with two-stage and single-stage methods.
Spatial Transformer Networks [28, 42] learn spatial transformations (warping) of the CNN input. They explore different parameterizations for the spatial transformation, including affine, projective, and spline transforms [28], or specially designed saliency-based layers [42]. Their focus is to undo data distortions or to “zoom in” on salient regions, while our approach focuses on efficient downsampling that retains as much information around semantic boundaries as possible. They do not use their approach in the context of pixel-level predictions (segmentation) and do not consider the inverse transformation (Sec. 3.3 in our case).
Deformable convolutions [11, 29] augment the spatial sampling locations in standard convolutions with additional adaptive offsets. In their experiments, deformable convolutions replace traditional convolutions in the last few layers of the network, making their approach complementary to ours. Their goal is to allow the new convolution to pick features from the best locations in the previous layer, whereas our approach chooses the best locations in the original image and thus has access to more information.
Other complementary approaches include skipping some layers at some pixels [17] and early stopping of network computation for some spatial regions of the image [16, 34]. Similarly, these methods modify computation at deeper network layers and do not concern image downsampling.
Peter et al. [41] showed that an advanced extrapolation method (based on PDEs) applied to a smartly selected small number of pixels reproduces the original image with low error, leading to a state-of-the-art compression scheme. Their method selects pixels around strong edges of the image. In contrast, we do not use image edges when deciding sampling locations; instead, we rely on machine learning with semantic boundaries to predict sampling locations.
3 Boundary-Driven Adaptive Downsampling
Fig. 2 shows the three main stages of our system: content-adaptive downsampling, segmentation, and upsampling. The downsampler, described in Sec. 3.1, determines non-uniform sampling locations and produces a downsampled image. The segmentation model then processes this (non-uniformly) downsampled image; we can use any existing segmentation model for this purpose. The results are treated as sparsely classified locations in the original image. The third part, described in Sec. 3.3, uses interpolation to recover segmentation at the original resolution, see Fig. 1(d).

Let us introduce notation. Consider a high-resolution image $I$ of size $H \times W$ with $C$ channels. Assuming a relative coordinate system, all pixels have spatial coordinates that form a uniform grid covering the unit square $[0,1]^2$. Let $I[x]$ denote the value of the pixel whose spatial coordinates are closest to $x$, for $x \in [0,1]^2$. Consider a tensor $u \in [0,1]^{d \times d \times 2}$. We denote the elements of $u$ by $u_{i,j}^k$ for $i,j \in \{0,\dots,d-1\}$ and $k \in \{1,2\}$, and refer to such tensors as sampling tensors. Let $u_{i,j}$ be the point $(u_{i,j}^1, u_{i,j}^2) \in [0,1]^2$. Fig. 1(c) shows an example of such points.
The sampling operator maps a pair of image $I$ and sampling tensor $u$ to the corresponding sampled image $I[u]$ of size $d \times d$ such that

$$I[u]_{i,j} = I[u_{i,j}]. \qquad (1)$$
Uniform downsampling can be defined by the sampling tensor $\hat{u}$ such that $\hat{u}_{i,j}^1 = \frac{i}{d-1}$ and $\hat{u}_{i,j}^2 = \frac{j}{d-1}$.
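To make the notation concrete, here is a minimal NumPy sketch of the sampling operator (1) and of the uniform sampling tensor; the function names and the nearest-pixel rounding convention are ours, not part of the model:

```python
import numpy as np

def apply_sampling(image, u):
    """Sampling operator (1): look up the pixel nearest to each
    relative coordinate u[i, j] in [0, 1]^2."""
    H, W = image.shape[:2]
    pi = np.clip(np.rint(u[..., 0] * (H - 1)).astype(int), 0, H - 1)
    pj = np.clip(np.rint(u[..., 1] * (W - 1)).astype(int), 0, W - 1)
    return image[pi, pj]                      # (d, d) or (d, d, C)

def uniform_tensor(d):
    """Sampling tensor of standard uniform downsampling:
    u[i, j] = (i / (d - 1), j / (d - 1))."""
    g = np.linspace(0.0, 1.0, d)
    return np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1)  # (d, d, 2)
```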
3.1 Sampling Model
Our non-uniform sampling model should balance two competing objectives. On the one hand, we want the model to produce finer sampling in the vicinity of semantic boundaries. On the other hand, distortions due to non-uniformity should not preclude successful segmentation of the non-uniformly downsampled image.
Assume that for image $I$ (Fig. 1(a)) we have the ground-truth semantic labels (Fig. 1(b)). We compute a boundary map (white in Fig. 1(c)) from the semantic labels. Then for each pixel we find the closest pixel on the boundary: let $b(x)$ be the spatial coordinates of the boundary pixel closest to coordinates $x$ (a distance transform). We define our content-adaptive non-uniform downsampling as the sampling tensor $u$ minimizing the energy

$$E(u) = \sum_{i,j} \big\| u_{i,j} - b(u_{i,j}) \big\|^2 + \lambda \sum_{i,j} \sum_{(k,l) \in \mathcal{N}(i,j)} \big\| u_{i,j} - u_{k,l} \big\|^2 \qquad (2)$$

where $\mathcal{N}(i,j)$ are the 4-connected grid neighbors of $(i,j)$, subject to the covering constraints

$$u_{0,j}^1 = 0, \quad u_{d-1,j}^1 = 1, \quad u_{i,0}^2 = 0, \quad u_{i,d-1}^2 = 1 \quad \text{for all } i,j. \qquad (3)$$
The first term in (2) ensures that sampling locations are close to semantic boundaries, while the second term ensures that the spatial structure of the sampling locations is not distorted excessively. The constraints ensure that the sampling locations cover the entire image. This least-squares problem with convex constraints can be solved globally and efficiently via a set of sparse linear equations. Red dots in Figs. 1(c) and 3 illustrate solutions for different values of $\lambda$.
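For illustration, the following sketch computes such a proposal by alternating nearest-boundary lookups (a distance transform) with a sparse linear solve; the penalty-based handling of constraints (3), the fixed iteration count, and all names are our simplifications, not the exact solver used in the paper:

```python
import numpy as np
import scipy.sparse as sp
from scipy.ndimage import distance_transform_edt
from scipy.sparse.linalg import spsolve

def boundary_targets(boundary_mask, u):
    """For each sampling location u[i, j] (relative coords in [0, 1]^2),
    return the relative coords of the nearest boundary pixel.
    boundary_mask: bool H x W map of semantic-boundary pixels."""
    H, W = boundary_mask.shape
    # EDT of the complement gives, per pixel, the indices of the
    # nearest boundary pixel
    _, (bi, bj) = distance_transform_edt(~boundary_mask, return_indices=True)
    pi = np.clip(np.rint(u[..., 0] * (H - 1)).astype(int), 0, H - 1)
    pj = np.clip(np.rint(u[..., 1] * (W - 1)).astype(int), 0, W - 1)
    return np.stack([bi[pi, pj] / (H - 1), bj[pi, pj] / (W - 1)], axis=-1)

def solve_sampling_tensor(boundary_mask, d=32, lam=1.0, iters=5, w=1e6):
    """Approximately minimize (2) subject to (3): alternate fixing b(u)
    and solving the resulting sparse least-squares problem, with the
    covering constraints enforced by a large quadratic penalty w."""
    g = np.linspace(0.0, 1.0, d)
    u = np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1)  # uniform init

    # Laplacian of the d x d 4-connected grid (second term of (2))
    D = sp.diags([1, -1], [0, 1], shape=(d - 1, d))   # 1-D differences
    L1 = D.T @ D
    L = sp.kron(sp.identity(d), L1) + sp.kron(L1, sp.identity(d))

    for _ in range(iters):
        t = boundary_targets(boundary_mask, u)        # fix b(u)
        for k in (0, 1):                              # each coordinate channel
            c = np.zeros((d, d))                      # penalty weights
            v = np.zeros((d, d))                      # pinned values
            if k == 0:                                # u^1: top/bottom rows
                c[0, :] = c[-1, :] = w
                v[-1, :] = 1.0
            else:                                     # u^2: left/right columns
                c[:, 0] = c[:, -1] = w
                v[:, -1] = 1.0
            A = sp.identity(d * d) + lam * L + sp.diags(c.ravel())
            rhs = t[..., k].ravel() + (c * v).ravel()
            u[..., k] = spsolve(A.tocsc(), rhs).reshape(d, d)
    return u
```

Each inner solve is the normal equation of the quadratic objective with $b(u)$ held fixed, matching the least-squares structure noted above.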
We train a relatively small auxiliary network to predict the sampling tensor without access to ground-truth boundaries. The auxiliary network can be significantly smaller than the base segmentation model, as it solves a simpler problem: it learns cues indicating the presence of semantic boundaries. For example, the vicinity of vanishing points is more likely to contain many small objects (and their boundaries). Also, small mistakes in the sampling locations are not critical, as the final classification decision is left to the segmentation network.
As the auxiliary network, we propose two U-Net [44] subnetworks stacked together (Fig. 6). The motivation for stacking subnetworks is to model the sequential processes of boundary computation and sampling-point selection. We train this network with a squared L2 loss between the network prediction and a tensor “proposal” minimizing (2) subject to (3); during testing, the network prediction is projected onto constraints (3). Alternatively, one can directly use objective (2) as a regularized loss function [52, 50]; our proposal-generation approach can be seen as one step of an ADM procedure for such a loss [37].

Once the sampling tensor is computed, the original image is downsampled via sampling operator (1). Application of a sampling tensor of size $d \times d \times 2$ yields a sampled image of size $d \times d$. If this is not the desired size of the downsampled image, we can still employ $u$ for sampling: we obtain a new sampling tensor of shape $d' \times d' \times 2$ by resizing $u$ using bilinear interpolation, see the example in Fig. 5.
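The resizing step can be sketched with SciPy's order-1 (bilinear) spline interpolation; the helper name is hypothetical:

```python
from scipy.ndimage import zoom

def resize_sampling_tensor(u, d_new):
    """Bilinearly resize a d x d x 2 sampling tensor to d_new x d_new x 2,
    so a tensor predicted at one resolution can drive sampling at another."""
    d = u.shape[0]
    return zoom(u, (d_new / d, d_new / d, 1.0), order=1)
```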
Fig. 4 shows the architecture of our downsampling block.
3.2 Segmentation Model
3.3 Upsampling
In keeping with prior work, we assume that the base segmentation model produces a final score map of the same size as its downsampled input. Thus, we need to upsample the output to match the original input resolution. In the case of standard (uniform) downsampling, this step is simple upscaling, commonly performed via bilinear interpolation. In our case, we need to “invert” the non-uniform transformation. Covering constraints (3) ensure that the convex hull of the sampling locations covers the entire image, so we can use interpolation to recover the score map at the original resolution. We use SciPy [2] to interpolate the unstructured multidimensional data; it employs Delaunay triangulation [12] and barycentric interpolation within triangles [47].
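A sketch of this upsampling step (equivalent to calling scipy.interpolate.griddata with the 'linear' method; the function name and array layout are our assumptions):

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator, NearestNDInterpolator

def upsample_scores(u, scores, H, W):
    """Recover a dense H x W score map from the d x d score map `scores`
    predicted at sampling locations `u` (d x d x 2, relative coords),
    via Delaunay triangulation and barycentric interpolation."""
    pts = u.reshape(-1, 2)
    vals = scores.reshape(-1, scores.shape[-1])        # (d*d, num_classes)
    gi, gj = np.mgrid[0:H, 0:W]
    query = np.stack([gi.ravel() / (H - 1),
                      gj.ravel() / (W - 1)], axis=-1)  # all pixel coords
    dense = LinearNDInterpolator(pts, vals)(query)
    # constraints (3) make the convex hull span the image; fall back to
    # nearest-neighbour for any residual NaNs at the border
    nan = np.isnan(dense).any(axis=-1)
    if nan.any():
        dense[nan] = NearestNDInterpolator(pts, vals)(query[nan])
    return dense.reshape(H, W, -1)
```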
An important aspect of our content-adaptive downsampling method in Sec. 3.1 is that it preserves the grid topology. Thus, an efficient implementation can skip the triangulation step and use the original grid structure: the interpolation problem reduces to the computer-graphics problem of rendering a filled triangle, which can be solved efficiently by Bresenham's algorithm [47].
4 Experiments
In this section we describe several experiments with our adaptive downsampling for semantic segmentation on several high-resolution datasets and state-of-the-art approaches. Fig. 7 shows a few qualitative examples.
Figure 7: (a) image and our sampling locations; (b) ground truth; (c) predictions with uniform downsampling; (d) predictions with our adaptive downsampling.
4.1 Experimental Setup
Dataset and evaluation.
We evaluate and compare the proposed method on several public semantic segmentation datasets. The computational requirements of contemporaneous approaches and the cost of annotation conditioned the low resolution of images or imprecise (rough) annotations in popular semantic segmentation datasets such as Caltech [15], [3], Pascal VOC [14, 13, 20], and COCO [35]. With the rapid development of autonomous driving, a number of new semantic segmentation datasets focusing on road scenes [10, 27], as well as synthetic datasets [45, 43], have been made available. These recent datasets provide high-resolution data and high-quality annotations. In our experiments, we mainly focus on datasets with high-resolution images, namely the ApolloScapes [27], CityScapes [10], Synthia [45], and Supervisely (person segmentation) [49] datasets.
The main evaluation metric is mean Intersection over Union (mIoU). The metric is always evaluated on segmentation results at the original resolution. We compare performance at various downsampling resolutions to emulate different operating requirements. Occasionally we use other metrics to demonstrate different features of our approach.

Implementation details. Our main implementation is in Caffe2 [1]. For both the non-uniform sampler network and the segmentation network, we use the Adam [30] optimization method with (base learning rate, #epochs) of (–, 33), (–, 1000), and (–, 500) for the ApolloScape, Supervisely, and Synthia datasets, respectively. We employ an exponential learning-rate policy. The batch sizes are as follows:

input resolution | 16  | 32  | 64  | 128 | 256 | 512
batch size       | 128 | 128 | 128 | 32  | 24  | 12
Experiments with PSPNet [57] and Deeplabv3+ [8] use public implementations with the default parameters.
(Cost-performance figure panels: U-Net backbone; PSPNet backbone; Deeplabv3+ backbone; U-Net backbone; U-Net backbone.)
In all experiments, we consider segmentation networks fed with uniformly downsampled images as our baseline. We replace the uniform downsampling with the adaptive one described in Sec. 3.1; the interpolation of predictions follows Sec. 3.3 in both cases. The auxiliary network is trained separately with ground truth produced by minimizing (2) subject to (3) with a fixed $\lambda$. The auxiliary network predicts a sampling tensor at a fixed spatial resolution (32×32 in the tables below), which is then resized to the required downsampling resolution. During training of the segmentation network we do not include the upsampling stage (for both baseline and proposed models) but instead downsample the label map. We use the softmax cross-entropy loss.
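Concretely, downsampling the label map can reuse the same nearest-pixel lookup as sampling operator (1); a minimal sketch with a hypothetical helper name:

```python
import numpy as np

def downsample_labels(labels, u):
    """Nearest-pixel lookup of the ground-truth label map at the sampling
    locations, mirroring sampling operator (1); used as the training target
    so that no upsampling stage is needed during segmentation training."""
    H, W = labels.shape
    pi = np.clip(np.rint(u[..., 0] * (H - 1)).astype(int), 0, H - 1)
    pj = np.clip(np.rint(u[..., 1] * (W - 1)).astype(int), 0, W - 1)
    return labels[pi, pj]
```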
During training we randomly crop the largest possible square from an image; e.g., from an $H \times W$ image with $H < W$ we select an $H \times H$ patch. During testing we crop the central largest square. Additionally, during training we augment the data by random left-right flipping, contrast and brightness adjustment, and salt-and-pepper noise.
4.2 Cost-performance Analysis
ApolloScape [27] is an open dataset for autonomous driving. The dataset consists of approximately 105K training and 8K validation images. The annotations contain 22 classes for evaluation. The annotations of some classes (cars, motorbikes, bicycles, persons, riders, trucks, buses, and tricycles) are of high quality; these occupy a small fraction of the pixels in the evaluation set, and we refer to them as target classes. The other classes' annotations are noisy. Since noise in pixel labels greatly magnifies noise at segment boundaries, we chose to define our sampling model based on target-class boundaries. This exploits an important aspect of our method: the ability to focus on boundaries of specific semantic classes of interest. Following [27] we give separate metrics for these classes.
Tab. 2 shows that our adaptive downsampling based on semantic boundaries improves the overall quality of semantic segmentation. Our approach achieves an mIoU gain of 3–5% for target classes and up to 2% overall. This improvement comes at negligible computational cost. Fig. 9 shows that our approach consistently produces better results even under fixed computational budgets.
It is not surprising that target classes benefit more. Indeed, focusing on the boundaries of some classes may lower performance on others. This gives one the flexibility to reflect the importance of certain classes over others, depending on the application.
CityScapes [10] is another commonly used open road-scene dataset, providing 5K annotated images with 19 classes for evaluation. Following the same test protocol, we evaluated our approach using PSPNet [57, 4] (with a ResNet50 [22] backbone) and Deeplabv3+ [8] (with an Xception65 [9] backbone) as the base segmentation model. The mIoU results are shown in Tab. 5 and Fig. 9, where we again see consistent improvements of up to 4%.
Synthia [45] is a synthetic dataset of 13K HD images taken from an array of cameras moving randomly through a city. The results in Tab. 5 show that our approach improves upon the baseline model. The cost-performance analysis in Fig. 11 shows that our method improves the segmentation quality of target classes by 1.5% to 3% at negligible cost.
Person segmentation. The Supervisely Person Dataset [49] is a collection of high-resolution images with 6884 high-quality annotated person instances. The dataset contains pictures of people taken in different conditions, including portraits, landscapes, and cityscapes. We randomly split the dataset into training and testing subsets. The dataset has only two labels: person and background. Segmentation results for this dataset are shown in Tab. 5, with a cost-performance analysis with respect to the baseline shown in Fig. 11. The experiment shows absolute mIoU increases of up to 4%, confirming the advantages of non-uniform downsampling for person segmentation as well.
PSPNet [57] backbone:

method   | downsample resolution | auxiliary net resolution | flops | mIoU
ours     |          64           |            32            |  4.37 | 0.32
baseline |          64           |            –             |  4.20 | 0.29
ours     |         128           |            32            | 11.25 | 0.43
baseline |         128           |            –             | 11.08 | 0.40
ours     |         256           |            32            | 44.22 | 0.54
baseline |         256           |            –             | 44.05 | 0.54

Deeplabv3+ [8] backbone:

method   | downsample resolution | auxiliary net resolution | flops | mIoU
ours     |         160           |            32            | 17.54 | 0.58
baseline |         160           |            –             | 17.23 | 0.54
ours     |         192           |            32            | 25.12 | 0.62
baseline |         192           |            –             | 24.81 | 0.61
ours     |         224           |            32            | 34.08 | 0.65
baseline |         224           |            –             | 33.77 | 0.62
method   | downsample resolution | flops | mIoU (all classes) | mIoU (target classes)
ours     |          32           |  0.38 |        0.67        |         0.61
baseline |          32           |  0.31 |        0.65        |         0.58
ours     |          64           |  1.40 |        0.77        |         0.73
baseline |          64           |  1.23 |        0.76        |         0.71
ours     |         128           |  5.49 |        0.86        |         0.83
baseline |         128           |  4.93 |        0.84        |         0.81
ours     |         256           | 21.85 |        0.92        |         0.91
baseline |         256           | 19.74 |        0.91        |         0.89
method   | downsample resolution | flops | mIoU | background IoU | person IoU
ours     |          16           |  0.15 | 0.73 |      0.84      |    0.62
baseline |          16           |  0.07 | 0.69 |      0.81      |    0.56
ours     |          32           |  0.35 | 0.76 |      0.86      |    0.67
baseline |          32           |  0.30 | 0.76 |      0.85      |    0.66
ours     |          64           |  1.39 | 0.83 |      0.90      |    0.76
baseline |          64           |  1.22 | 0.80 |      0.88      |    0.71
ours     |         128           |  5.42 | 0.87 |      0.93      |    0.82
baseline |         128           |  4.90 | 0.85 |      0.91      |    0.79
ours     |         256           | 20.11 | 0.90 |      0.94      |    0.86
baseline |         256           | 19.59 | 0.89 |      0.93      |    0.84
4.3 Boundary Accuracy
We design an experiment to show that our method improves boundary precision. We adopt the standard trimap approach [31], computing classification accuracy within a band (called a trimap) of varying width around segment boundaries. Fig. 13 shows trimap plots for two input resolutions on the person segmentation dataset described above. Our method improves mostly in the vicinity of semantic boundaries. Interestingly, for the smaller input resolution the maximum accuracy improvement is reached around a trimap width of 4 pixels. This may be attributed to the fact that the downsampling model in Sec. 3.1 does not depend on the downsampling resolution and essentially defines the same sampling tensor for all sizes of the downsampled image. Thus, the distances between neighboring sampling locations at the smaller resolution are approximately 4 times larger than the respective distances at the larger resolution, which reduces the accuracy gain within narrow trimaps.
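For reference, trimap accuracy can be computed as in the following sketch; [31] defines the protocol, while the boundary extraction and helper name below are our own choices:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def trimap_accuracy(pred, gt, width):
    """Pixel accuracy restricted to a band of `width` pixels around the
    ground-truth segment boundaries (the trimap protocol of [31])."""
    boundary = np.zeros(gt.shape, dtype=bool)
    dv = gt[:-1, :] != gt[1:, :]          # label changes between rows
    dh = gt[:, :-1] != gt[:, 1:]          # label changes between columns
    boundary[:-1, :] |= dv
    boundary[1:, :] |= dv
    boundary[:, :-1] |= dh
    boundary[:, 1:] |= dh
    band = distance_transform_edt(~boundary) <= width
    return (pred[band] == gt[band]).mean()
```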
4.4 Effect of Object Size
Since our adaptive downsampling is trained to select more points around semantic boundaries, it implicitly provides larger support for small objects, resulting in better performance of the overall system on these objects. Instance-level annotations allow us to verify this by analyzing quality statistics with respect to individual objects, in contrast to the usual pixel-centric segmentation metrics (mIoU or accuracy). Here, the recall of a segmentation of an object is defined as the ratio of correctly classified pixels (pixels predicted to belong to the true object class) to the total number of pixels in the object. (Recall usually comes together with precision; since the segmentation output does not have instance labels, object-level precision is undefined.) Figs. 12 and 14 show the improvement of recall over the baseline for objects of different sizes and categories. Our method degrades more gracefully than uniform downsampling as the object size decreases.
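A sketch of this object-level recall statistic (array layouts, handling of instance ids, and the helper name are assumptions):

```python
import numpy as np

def object_recalls(pred, classes, instances):
    """Per-object recall: for each annotated instance, the fraction of its
    pixels predicted as the instance's class, returned together with the
    object size (in pixels) for binning as in Figs. 12 and 14."""
    out = []
    for inst_id in np.unique(instances):
        mask = instances == inst_id
        cls = classes[mask][0]            # semantic class of this instance
        out.append(((pred[mask] == cls).mean(), int(mask.sum())))
    return out
```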
5 Conclusions
In this work, we described a novel method for non-uniform, content-aware downsampling as an alternative to uniform downsampling for reducing the computational cost of semantic segmentation systems. The adaptive downsampling parameters are computed by an auxiliary CNN that learns from a geometric model of non-uniform sampling driven by semantic boundaries. Although the auxiliary network requires additional computation, the experimental results show that it improves segmentation performance while keeping the added cost low, providing a better cost-performance balance. Our method significantly improves performance on small objects and produces more precise boundaries. In addition, any off-the-shelf segmentation system can benefit from our approach, as it is implemented as an additional block enclosing the system.
A potential future research direction is to employ more advanced interpolation methods, similar to [41], which could further improve the quality of the final result.
Finally, we note that our adaptive sampling may benefit other applications with pixel-level predictions where boundary accuracy is important and downsampling is used to reduce computational cost. This is left for future work.
Appendix A: Non-uniform Sampling Error
As stated in the submission, the error bound decreases as $O(\kappa L^2 / n^2)$, assuming sampling points are uniformly distributed near the segment boundary, where $\kappa$ and $L$ are the maximal curvature and the length of the boundary, respectively. To show this upper bound on the best approximation, it suffices to exhibit an example that provides an $O(\kappa L^2 / n^2)$ boundary approximation error.

Assuming the commonly used linear interpolation method (which we use in the paper), the boundary of the segments is piecewise linear. Using $n+1$ sampling points we can define any piecewise-linear curve with $n$ segments, see Fig. 15.

Let $\mathcal{C}$ be a curve of length $L$ in $[0,1]^2$ with curvature bounded by $\kappa$, and let $P_n$ be its piecewise-linear approximation with $n$ segments. We define the boundary approximation error as the maximal distance between the curve and its approximation:

$$\mathrm{err}(\mathcal{C}, P_n) = \max_{x \in \mathcal{C}} \ \min_{y \in P_n} \| x - y \|.$$

We place the ends of the segments of $P_n$ exactly on curve $\mathcal{C}$ such that they are uniformly distributed; that is, each segment encloses a piece of the curve of length $L/n$. The error on each segment can be bounded by the approximation error of an arc of radius $r = 1/\kappa$, as shown in Fig. 16:

$$\mathrm{err}(\mathcal{C}, P_n) \ \le\ r \left( 1 - \cos \frac{L}{2 r n} \right) \ \le\ \frac{\kappa L^2}{8 n^2},$$

where we used the facts $1 - \cos x \le x^2 / 2$ and $r = 1/\kappa$. This immediately leads to the bound $O(\kappa L^2 / n^2)$.
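The bound is easy to verify numerically: for a circular arc of curvature $\kappa = 1/r$ approximated by $n$ uniform chords, the exact maximal deviation (the sagitta) stays below $\kappa L^2 / (8 n^2)$. A small script (values arbitrary):

```python
import numpy as np

# Approximate a circular arc of radius r (curvature kappa = 1/r) and
# length L by n uniform chords; the max deviation of the arc from each
# chord is the sagitta r * (1 - cos(L / (2 r n))).
r, L = 1.0, 2.0
kappa = 1.0 / r
for n in (4, 8, 16, 32):
    sagitta = r * (1.0 - np.cos(L / (2.0 * r * n)))
    bound = kappa * L ** 2 / (8.0 * n ** 2)
    print(f"n={n:3d}  error={sagitta:.6f}  bound={bound:.6f}")
```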
References
[1] Caffe2: A New Lightweight, Modular, and Scalable Deep Learning Framework. https://caffe2.ai.
[2] SciPy is open-source software for mathematics, science, and engineering. https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.griddata.html.
[3] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse, part-based representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 26(11):1475–1490, 2004.

[4] O. Andrienko. ICNet and PSPNet-50 in Tensorflow for real-time semantic segmentation. https://github.com/oandrienko/fast-semantic-segmentation, 2018.
[5] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(12):2481–2495, Dec 2017.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(4):834–848, 2018.
[7] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[8] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.

[9] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1251–1258, 2017.
[10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[11] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017.
[12] B. Delaunay et al. Sur la sphère vide. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskii i Estestvennyka Nauk, 7(793–800):1–2, 1934.
[13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[14] M. Everingham, A. Zisserman, C. K. Williams, L. Van Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dorkó, et al. The 2005 PASCAL visual object classes challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 117–176. Springer, 2006.
[15] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59–70, 2007.
 [16] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov. Spatially adaptive computation time for residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 1039–1048, 2017.
[17] M. Figurnov, A. Ibraimova, D. P. Vetrov, and P. Kohli. PerforatedCNNs: Acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pages 947–955, 2016.
[18] R. Girshick. Fast R-CNN. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
 [19] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
 [20] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV). IEEE, 2011.
[21] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988. IEEE, 2017.
 [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016.
[23] X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages II–II. IEEE, 2004.
[24] V. Hernández-Mederos and J. Estrada-Sarlabous. Sampling points on regular parametric curves with control of their distribution. Computer Aided Geometric Design, 20(6):363–382, 2003.
[25] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. Springer, 1990.
[26] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[27] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang. The ApolloScape dataset for autonomous driving. arXiv preprint arXiv:1803.06184, 2018.
 [28] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
 [29] Y. Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1846–1854. IEEE, 2017.
 [30] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [31] P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision (IJCV), 82(3):302–324, 2009.
 [32] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [33] Y. LeCun, Y. Bengio, et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
[34] X. Li, Z. Liu, P. Luo, C. Change Loy, and X. Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3193–3202, 2017.
[35] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.
 [36] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 3431–3440, 2015.
 [37] D. Marin, M. Tang, I. B. Ayed, and Y. Boykov. Beyond gradient descent for regularized segmentation losses. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[38] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (ECCV), pages 483–499. Springer, 2016.
[39] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1520–1528, 2015.
[40] S. M. Obeidat and S. Raman. An intelligent sampling method for inspecting free-form surfaces. The International Journal of Advanced Manufacturing Technology, 40(11-12):1125–1136, 2009.
[41] P. Peter, S. Hoffmann, F. Nedwed, L. Hoeltgen, and J. Weickert. From optimised inpainting with linear PDEs towards competitive image compression codecs. In T. Bräunl, B. McCane, M. Rivera, and X. Yu, editors, Image and Video Technology, pages 63–74, Cham, 2016. Springer International Publishing.
[42] A. Recasens, P. Kellnhofer, S. Stent, W. Matusik, and A. Torralba. Learning to zoom: A saliency-based sampling layer for neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 51–66, 2018.
[43] S. R. Richter, Z. Hayder, and V. Koltun. Playing for benchmarks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22–29, 2017, pages 2232–2241, 2017.
[44] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[45] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes (SYNTHIA-Rand). In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[46] C. Russell, P. Kohli, P. H. Torr, et al. Associative hierarchical CRFs for object class image segmentation. In IEEE 12th International Conference on Computer Vision (ICCV), pages 739–746. IEEE, 2009.
 [47] P. Shirley, M. Ashikhmin, and S. Marschner. Fundamentals of Computer Graphics. A. K. Peters, Ltd., Natick, MA, USA, 2nd edition, 2005.
[48] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[49] Supervise.ly. Releasing “Supervisely Person” dataset for teaching machines to segment humans. https://hackernoon.com/releasing-supervisely-person-dataset-for-teaching-machines-to-segment-humans-1f1fc1f28469, 2018.
[50] M. Tang, F. Perazzi, A. Djelouah, I. Ben Ayed, C. Schroers, and Y. Boykov. On regularized losses for weakly-supervised CNN segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 507–522, 2018.
[51] W. Tiller. Knot-removal algorithms for NURBS curves and surfaces. Computer-Aided Design, 24(8):445–453, 1992.
[52] J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.
[53] Z. Wu, C. Shen, and A. van den Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016.
[54] F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In European Conference on Computer Vision (ECCV), pages 648–663. Springer, 2016.
[55] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[56] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. ICNet for real-time semantic segmentation on high-resolution images. arXiv preprint arXiv:1704.08545, 2017.
[57] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.