The recent record-breaking performance of semantic segmentation using deep networks motivates an ever-growing application demand. However, those segmentation models typically bear a heavy computational cost at inference time, making them extremely challenging to deploy on resource-constrained platforms, ranging from mobile phones to wearable glasses, drones, and autonomous vehicles. In particular, while existing works on improving inference efficiency have traditionally focused on classification, state-of-the-art (SOTA) segmentation models are considerably more costly. For example, at a fixed input size, a ResNet50 (He et al., 2016) costs 4 GFLOPs for inference. In comparison, for a DeepLabv3+ (Chen et al., 2018) with the ResNet50 backbone and the same input (associated with an output stride of 16), the inference cost jumps up to 13.3 GFLOPs; the cost could further soar to 435 GFLOPs if we operate on a higher input resolution. A similar trend can be expected in terms of the required energy costs. These high resource costs prevent segmentation models from being deployed on edge devices, or at least degrade the quality of user experience. Specifically, the high cost of segmentation models arises from two aspects:
High input resolution and its proportional costs: segmentation, as a dense prediction task, typically relies on fully convolutional networks whose inference FLOPs are proportional to the input size. Meanwhile, unlike classification, segmentation is well-known to be more resolution-sensitive due to its much finer prediction granularity (Chen et al., 2019b). Therefore, high-resolution inputs are preferable for improving algorithmic performance, which yet contradicts the resource-saving needs.
Multi-scale aggregation: segmentation is well-known for its strong dependency on multiple scale features (Chen et al., 2018; Yu et al., 2018a; Yu and Koltun, 2016; Zhao et al., 2018, 2017) for contextual reasoning in combination with full-resolution outputs. Such a desired feature is often achieved by fusing a multi-resolution stream or aggregating paralleled filters with different sizes. Both the fusion and aggregation modules can incur heavy resource costs.
The expensiveness of segmentation is further amplified when we come to consider its training (e.g., continuous learning and adaptation) in resource-constrained settings. Many applications, such as autonomous vehicles and robots, require real-time and in-situ learning and continuous adaptation to new data, to be considered truly intelligent. As compared to cloud-based (re)training, local (re)training helps avoid transferring data back and forth between data centers and local platforms, reducing communication loads, and enhancing privacy. Besides, the increasingly prohibitive energy, financial and environmental costs of training ML algorithms have become a growing concern even for training in the cloud (Strubell et al., 2019). However, resource-constrained training was not explored much until a few recent efforts on classification (Jiang et al., 2019; Wang et al., 2019; You et al., 2019).
Our contributions. This work aims to push forward the training and inference efficiency of SOTA segmentation models to a new level, from the current practice of merely focusing on light-weight network design, towards a novel data-network co-optimization perspective. Its core driving motivation can be summarized in two points: (1) not all input samples are born equal (Jiang et al., 2019; et.al, 2017); and (2) eliminating input variances reduces the model’s learning workload (Engelbrecht et al., 1999).
More specifically, we propose DANCE, an efficient training and inference framework that can be applied towards any existing segmentation model. First, DANCE adopts an input adaptive automated data slimming technique. We propose a spatial complexity indicator to adapt the input images’ spatial resolution, training sampling frequency, and weighted coefficients in the loss function. Thus DANCE makes the models focus more on the complicated samples during training, while during testing the input images’ spatial resolution will be similarly reduced (i.e., downsampled).
Meanwhile, adaptively reducing the input resolution has a direct (proportional) impact on the training and inference energy costs (i.e., both computation and memory movement costs). An indirect yet equally important consequence is that the downsampled inputs become more “normalized” in terms of object and feature scales. Current segmentation models strongly rely on built-in multi-scale aggregation modules to balance contextual reasoning and fine-detail preservation (Chen et al., 2019b; Yu and Koltun, 2016). Interestingly, with spatial-complexity-adaptive downsampled inputs, further slimming those cost-dominant multi-scale aggregation building blocks saves both training and inference costs without hampering the algorithmic performance; this is our proposed automated network slimming in DANCE.
Below we outline the contributions of the proposed DANCE framework:
DANCE, the first data-network co-optimization framework, boosts efficiency of both training and inference for segmentation models while mostly improving the accuracy. Further, DANCE is general and thus can be applied to any existing segmentation backbone.
DANCE in this paper simultaneously integrates automated data and network slimming to manipulate input images and their contribution to the model while slimming the network architecture in a co-optimization manner. Interestingly, the former can emulate the effect of multi-scale aggregation, thus enabling more aggressive slimming of their corresponding cost-dominant building blocks.
Extensive experiments and ablation studies demonstrate that DANCE can achieve “all-win” (i.e., reduced training and inference costs, and improved model accuracy) towards efficient segmentation, when benchmarking on four SOTA segmentation models and three popular segmentation benchmark datasets. As shown in Fig. 1, DANCE establishes a new record trade-off between segmentation models’ accuracy and training&inference efficiency.
2. Related works
2.1. Efficient CNN inference and training
Extensive works have been proposed to improve the efficiency of CNN inference, most of which focus on classification tasks. Network compression has been widely studied to speed up CNN inference, e.g., by pruning unimportant network weights (Han et al., 2015a; He et al., 2017), quantizing the network into low bitwidths (Hubara et al., 2017), or distilling lighter-weight networks from teachers (Polino et al., 2018). For example, a representative automated pruning method (Network Slimming (Liu et al., 2017)) imposes $\ell_1$-sparsity on the scaling factors of the batch normalization layers; later, progressive pruning methods (i.e., gradually increasing the pruning ratio) were developed to improve the resulting models’ accuracy (Ye et al., 2018). Another stream of approaches involves designing compact models, such as MobileNet (Sandler et al., 2018) and ShuffleNet (Zhang et al., 2018). Energy cost was leveraged in (Yang et al., 2017) to guide the pruning towards the goal of energy-efficient inference.
Resource-efficient training is different from and more complicated than its inference counterpart. However, many insights gained from the latter can be lent to the former. For example, the recent work (Lym et al., 2019) showed that performing active channel pruning during training can accelerate the empirical convergence. Lately, Wang et al. (Wang et al., 2019) proposed one of the first comprehensive energy-efficient training frameworks, consisting of stochastic data dropping, selective layer updating, and low-precision back-propagation. They demonstrated its success in training several classification models with over 80% energy savings. (Jiang et al., 2019) accelerated training by skipping samples that may lead to low loss values (considered as less informative) at each iteration.
2.2. Semantic segmentation
Multi-scale aggregation in segmentation. Multi-scale aggregation has been proven to be powerful for semantic segmentation (Chen et al., 2018; Yu et al., 2018a; Zhao et al., 2018, 2017), via integrating multi-scale modules and high-/low-level features to capture patterns of different granularities. Pyramid Pooling and Atrous Spatial Pyramid Pooling (ASPP) modules were introduced in (Zhao et al., 2017) and (Chen et al., 2018) to aggregate features learned in different sizes of receptive fields, adapting the models to objects with different semantic sizes. Parallel branches of different downsampling rates were proposed by (Yu et al., 2018a; Zhao et al., 2018) to cover different resolutions. Although multi-scale aggregation contributes to segmentation accuracy improvement, it and its associated header introduce extra overhead during both training and inference (e.g., 52.98% inference FLOPs of Deeplabv3+ with a ResNet50 backbone and output stride of 16). That motivates us to slim such modules in DANCE.
Efficient segmentation models. A handful of efficient semantic segmentation models have been developed: ENet (Paszke et al., 2016) used an asymmetric encoder-decoder structure together with early downsampling; ICNet (Zhao et al., 2018) cascaded feature maps from multi-resolution branches under proper label guidance, together with network compression; and BiSeNet (Yu et al., 2018a) fused a context path with a fast downsampling scheme and a spatial path with smaller filter strides.
Remaining challenges. However, the models above were neither customized for nor evaluated on ultra-high resolution images, and our experiments show that they did not achieve sufficiently satisfactory trade-off in such cases. A knowledge distillation method was also leveraged to boost the performance of a computationally light-weight segmentation model from a teacher network (He et al., 2019). Despite their progress, none of them touches the training efficiency, nor any discussion related to co-optimization with the input data. Besides, the FLOPs number has a correlation to, but is not a faithful indicator of the actual energy cost, as pointed out by many prior works (Yang et al., 2017).
3. The proposed DANCE framework
This section presents our proposed DANCE framework. We will first provide an overview of DANCE in Section 3.1, and then introduce DANCE’s automated data slimming and automated network slimming design in Section 3.2 and Section 3.3, respectively.
3.1. DANCE overview
The driving hypothesis of DANCE is that matching the data and network can potentially boost both the model performance and hardware efficiency by removing redundancy associated with both the data and network. As such, DANCE aims to reduce the computational and energy costs of segmentation tasks during both training and inference, via a joint effort from data-level and network-level. Specifically, as shown in Fig. 2, DANCE integrates both automated data and network slimming, where the former automatically performs complexity-driven data downsampling/dropping before applying the data to a network while the latter automatically and progressively prunes the network to match the slimmed data. A bonus benefit of DANCE is that the resulting data-network pipeline after training (i.e., inference) is also naturally cost-efficient.
3.2. DANCE: automated data slimming
DANCE’s automated data slimming strives to automatically downsample or drop input images and control their corresponding contribution to the training loss, adapting to the images’ spatial complexity, which is estimated using a spatial complexity indicator.
Spatial complexity indicator. Spatial complexity has been commonly used as the basis for estimating image complexity (Gain and Siegelmann, 2019; Mishra et al., 2019; Xu et al., 2019), such as the one proposed in (Yu and Winkler, 2013):

$$\mathrm{SI}_{\mathrm{mean}} = \frac{1}{P} \sum \sqrt{s_h^2 + s_v^2} \qquad (1)$$

where $s_h$ and $s_v$ denote gray-scale images filtered with horizontal and vertical Sobel kernels, respectively, and $P$ denotes the number of pixels. Developed by (Yu and Winkler, 2013) to predict the image complexity for image compression/coding purposes, $\mathrm{SI}_{\mathrm{mean}}$ reflects the pixel-level variances and is extremely efficient to calculate, e.g., it accounts for only a small fraction of the FLOPs and energy (on-device measurement including both computations and data movements) of the DeepLabv3+ model (ResNet50 as the backbone with an output stride of 16) on one RGB image patch of size 224×224.
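To make the indicator concrete, the following is a minimal sketch, assuming the indicator is the mean Sobel gradient magnitude of a gray-scale image in the style of (Yu and Winkler, 2013). The function names `sobel_responses` and `spatial_complexity` are hypothetical, and border pixels are simply skipped rather than padded, which is a simplification of ours.

```python
import numpy as np


def sobel_responses(gray):
    """Return the two 3x3 Sobel responses of a 2-D gray-scale image.

    Only the valid interior is computed, so each output has shape (H-2, W-2).
    """
    g = np.asarray(gray, dtype=np.float64)
    # Difference between the right and left columns of each 3x3 window.
    s_h = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]) - (
        g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2]
    )
    # Difference between the bottom and top rows of each 3x3 window.
    s_v = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]) - (
        g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:]
    )
    return s_h, s_v


def spatial_complexity(gray):
    """Mean gradient magnitude over all interior pixels (assumed indicator)."""
    s_h, s_v = sobel_responses(gray)
    return float(np.mean(np.sqrt(s_h ** 2 + s_v ** 2)))


# A flat image has no edges; a horizontal ramp has a constant gradient.
flat = np.zeros((5, 6))
ramp = np.tile(np.arange(6.0), (5, 1))
si_flat = spatial_complexity(flat)
si_ramp = spatial_complexity(ramp)
```

Because the computation is a handful of element-wise operations per pixel, its cost is small relative to a forward pass of a segmentation network on the same image.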
In DANCE, we first compute the spatial complexity indicator of all training samples and fit a cumulative distribution function (CDF) to the resulting values, which turns out to be well-matched in all considered datasets as shown in Fig. 3. Statistical analysis of the indicator for a specific dataset is an interesting question, which we leave for future work.
Thanks to the fitted CDF, given an input image, we can project its spatial complexity indicator to a variable via the probability integral transform (Dodge and Commenges, 2006). The resulting projected value is then directly used to determine the corresponding input image’s downsampling ratio, stochastic dropping probability, and weighted coefficient in the training loss.
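The projection step can be sketched as follows. This is only an illustration: the distribution family is an assumption of ours (a Gaussian, fitted with Python's standard `statistics.NormalDist`), and the helper names `fit_cdf` and `project` are hypothetical. Evaluating the fitted CDF at an image's indicator value is the probability integral transform, which yields a value in [0, 1].

```python
from statistics import NormalDist


def fit_cdf(train_indicator_values):
    """Fit a Gaussian to the training set's indicator values (assumed family)."""
    return NormalDist.from_samples(train_indicator_values)


def project(dist, indicator_value):
    """Probability integral transform: map an indicator value into [0, 1]."""
    return dist.cdf(indicator_value)


# Toy indicator values standing in for a training set.
train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
dist = fit_cdf(train_values)
p_low = project(dist, 1.0)   # a simple image
p_mid = project(dist, 3.0)   # an image at the training mean
p_high = project(dist, 5.0)  # a complex image
```

An image whose indicator equals the training mean is projected to 0.5, and more complex images are projected closer to 1.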
Complexity-adaptive downsampling. The proposed complexity-adaptive downsampling in DANCE draws inspiration from recent findings which show that not all input samples are born equal (Jiang et al., 2019; et.al, 2017), and is motivated by the fact that downsampling input images most straightforwardly reduces the training/inference energy costs and directly benefits the memory throughput. Meanwhile, a few recent works learn to adjust resolution or receptive fields (Dai et al., 2017; Marin et al., 2019), whose promising results further motivate our complexity-adaptive downsampling.
As prior works show that the minimal acceptable downsampling ratio is 0.5 for most segmentation models (Yu et al., 2018a; Chen et al., 2018), we make use of the spatial complexity indicator to downsample the input images with a ratio determined by the aforementioned projected value of the images’ spatial complexity indicator. In contrast to the learning-based approaches in prior works (Dai et al., 2017; Marin et al., 2019) that incur extra training workloads, we seek a reliable indicator that is mostly “training free” and inexpensive to compute, based on which we can adaptively estimate a proper downsampling rate per image. In particular, the energy overhead of our complexity-adaptive downsampling is marginal when estimated using real-device measurements in all our considered datasets.
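As an illustration of how a projected value could be turned into a per-image resolution, the sketch below clamps the projected value at the minimal acceptable ratio of 0.5. The clamping rule `max(0.5, p)` is an assumption of ours, not a formula confirmed by the text, and all function names are hypothetical. The relative cost estimate uses the fact that the FLOPs of a fully convolutional network scale with the number of input pixels.

```python
def downsample_ratio(p, min_ratio=0.5):
    """Map a projected value p in [0, 1] to a ratio in [min_ratio, 1] (assumed rule)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return max(min_ratio, p)


def downsampled_size(height, width, p, min_ratio=0.5):
    """Spatial size of the image after complexity-adaptive downsampling."""
    r = downsample_ratio(p, min_ratio)
    return max(1, round(height * r)), max(1, round(width * r))


def relative_flops(p, min_ratio=0.5):
    """FLOPs relative to full resolution, assuming cost proportional to pixel count."""
    r = downsample_ratio(p, min_ratio)
    return r * r


size_complex = downsampled_size(1024, 2048, p=0.75)  # moderately complex image
size_simple = downsampled_size(1024, 2048, p=0.20)   # simple image, clamped to 0.5
cost_simple = relative_flops(0.20)
```

Under this assumed rule, the simplest images are processed at half resolution per side, i.e., at roughly a quarter of the full-resolution FLOPs.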
Complexity-adaptive stochastic dropping. Recent pioneering CNN efficient works (Jiang et al., 2019; Wang et al., 2019) proposed that dropping a portion of training samples/mini-batches, either randomly or using some loss-based deterministic rules, can reduce the total training costs without notably sacrificing or even improving the algorithmic accuracy. Inspired by the stochastic dropping idea of (Wang et al., 2019), we incorporate the readily available spatial complexity indicator in Eq. (1) to calibrate the dropping probability. Specifically, (Wang et al., 2019) proposes to randomly skip incoming data (in mini-batch) with a default probability of 50% (i.e., 50% of the data is discarded without being fed into the models). The authors demonstrated this naively simple idea (with zero overhead) to be highly effective for efficient training without hurting or even improving the achieved accuracy. We further hypothesize that the images with larger spatial complexity are more informative and likely to favor the achieved accuracy if being more frequently trained than the ones with smaller spatial complexity.
Therefore, instead of adopting a uniform dropping probability for all images, we propose a simple yet effective heuristic to enable complexity-adaptive stochastic dropping by assigning a smaller dropping probability to input images with larger spatial complexity. In particular, the dropping probability assigned to an image decreases with the aforementioned projected value of the image’s spatial complexity indicator.
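The heuristic above can be sketched as follows. We assume, purely for illustration, that the dropping probability is one minus the projected value `p`; the function names `drop_probability` and `keep_sample` are hypothetical.

```python
import random


def drop_probability(p):
    """Assumed rule: more complex images (larger p) are dropped less often."""
    return 1.0 - p


def keep_sample(p, rng):
    """Return True if the image is kept for the current training iteration."""
    return rng.random() >= drop_probability(p)


rng = random.Random(0)
trials = 2000
kept_complex = sum(keep_sample(0.9, rng) for _ in range(trials))
kept_simple = sum(keep_sample(0.1, rng) for _ in range(trials))
```

With this rule, an image with projected value 0.9 is trained on in roughly 90% of its occurrences, whereas an image with projected value 0.1 is mostly skipped.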
Complexity-adaptive loss. Similarly, the losses produced by images with different complexities have been observed to contribute differently to the training loss (Gain and Siegelmann, 2019) or convergence in training (Jiang et al., 2019). We thus prioritize the updates generated by samples with larger spatial complexity, and adopt an adaptive weighted loss as below:

$$\mathcal{L} = \sum_{i=1}^{N} w_i \, \mathcal{L}_i$$

where $w_i$ is a scalar weighted coefficient, and $\mathcal{L}_i$ is the cross-entropy loss corresponding to the $i$-th image of the current mini-batch with $N$ images. Similar to the dropping probability assignment in complexity-adaptive stochastic dropping, an input image with larger spatial complexity will be assigned a larger weighted coefficient than one with smaller spatial complexity. As such, we adopt weighted coefficients equal to the aforementioned projected value of the images’ spatial complexity indicator.
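A minimal sketch of the weighted loss is given below. It assumes the mini-batch loss is a plain weighted sum of per-image cross-entropy losses with the projected values as weights; whether the sum is additionally normalized by the batch size is not specified here, so the unnormalized form is an assumption. The function name is hypothetical.

```python
import numpy as np


def complexity_weighted_loss(per_image_losses, projected_values):
    """Weighted sum of per-image losses; weights are the projected values."""
    losses = np.asarray(per_image_losses, dtype=np.float64)
    weights = np.asarray(projected_values, dtype=np.float64)
    if losses.shape != weights.shape:
        raise ValueError("expected one weight per image")
    return float(np.sum(weights * losses))


# Two images with per-image cross-entropy losses 1.0 and 2.0.
batch_loss = complexity_weighted_loss([1.0, 2.0], [0.5, 1.0])
```

In this toy mini-batch, the more complex image (weight 1.0) contributes four times as much to the loss as the simpler one (weight 0.5 on half the loss value).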
3.3. DANCE: automated network slimming
Various ways to aggregate multi-scale features (Chen et al., 2018; Yu et al., 2018a; Zhao et al., 2018, 2017) have been proven to improve segmentation accuracy at the cost of extra parameters and computations, leading to a higher training/inference energy burden. Thanks to the complexity-adaptive downsampling in DANCE’s automated data slimming (see Section 3.2), the resulting inputs have been re-scaled according to their spatial complexity. We conjecture that such downsampled inputs naturally have more “normalized” object feature scales, i.e., complexity-adaptive downsampling can emulate the effect of multi-scale aggregation, so that the model can potentially rely less on multi-scale aggregation modules for improving the segmentation accuracy. We thus expect the network to be more redundant when handling the downsampled inputs produced by our automated data slimming, as the cost-dominant building blocks of multi-scale aggregation now become less important.
Progressive pruning during training. Motivated by the above conjecture and targeting reduced costs for both training and inference (whereas post-training pruning merely reduces inference costs), we propose an automated network slimming with a progressive pruning schedule during the training trajectory to prune the header of segmentation networks, which includes the aforementioned multi-scale feature modules and often dominates both the training and inference costs, e.g., it accounts for 52.98% of the FLOPs in DeepLabv3+ (with a ResNet50 backbone and an output stride of 16). Note that DANCE’s effectiveness and insights extend to other network pruning methods; here we consider progressive pruning without loss of generality.
To design the progressive pruning schedule, we develop a straightforward heuristic, following the commonly used schedule in most pruning works (Renda et al., 2020; Luo et al., 2017; Han et al., 2015b). Specifically, we first divide the whole training/adaptation process into several stages w.r.t. the total number of iterations, and then perform channel-wise pruning (based on (Liu et al., 2017)) at the end of each stage.
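The stage-wise schedule can be sketched as below. The linear ramp of the cumulative pruning ratio across equal-length stages is an assumption made for illustration, as are the function names; the channel selection follows the Network Slimming idea of ranking channels by the magnitude of their batch-normalization scaling factors, applied here to a single layer for brevity.

```python
import numpy as np


def pruning_schedule(total_iters, num_stages, final_ratio):
    """Return (iteration, cumulative pruning ratio) pairs, one per stage end.

    Assumption: the ratio grows linearly until it reaches final_ratio.
    """
    schedule = []
    for k in range(1, num_stages + 1):
        end_iter = (total_iters * k) // num_stages
        schedule.append((end_iter, final_ratio * k / num_stages))
    return schedule


def channels_to_keep(bn_scales, prune_ratio):
    """Indices of channels kept after pruning those with the smallest |scale|."""
    scales = np.abs(np.asarray(bn_scales, dtype=np.float64))
    num_prune = int(len(scales) * prune_ratio)
    order = np.argsort(scales)  # ascending: least important channels first
    return np.sort(order[num_prune:])


schedule = pruning_schedule(total_iters=100, num_stages=4, final_ratio=0.8)
kept = channels_to_keep([0.9, 0.01, 0.5, 0.02], prune_ratio=0.5)
```

In a training loop, one would check whether the current iteration matches a stage end in `schedule` and, if so, prune the header down to the corresponding cumulative ratio before continuing training.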
Co-optimization affects pruning patterns.
To validate the aforementioned conjecture, we visualize the percentage of pruned channels in layers corresponding to multi-scale aggregation and other layers under different pruning ratios in Fig. 4, when the models are trained with DANCE or merely DANCE’s automated network slimming.
We can see that training with both automated data and network slimming, i.e., DANCE, always prunes more channels in layers corresponding to multi-scale aggregation (e.g., the ASPP module in DeepLabv3+) and fewer channels in other layers, under all seven considered pruning ratios between 20% and 80%, while merely using automated network slimming does the opposite. Specifically, as compared to training using merely automated network slimming under the same pruning ratio of 50%, the model trained using DANCE prunes 1.47× more channels in layers associated with multi-scale aggregation, and its accuracy is also higher (e.g., a 5.33% higher mIoU on the Cityscapes validation dataset together with a 54.8% lower inference energy with images of 592 × 592).
The experiment in Fig. 4, together with those in the experiment section, verifies our conjecture that (1) matching the data with the network can potentially improve the accuracy (thanks to the match between slimmed data and unpruned channels’ distribution) and remove redundant costs associated with both the data and network, thus achieving “all-win”: reducing both the training and inference costs while improving the achieved model accuracy (mIoU); and (2) DANCE’s automated data slimming can (partially) emulate the effect of multi-scale aggregation in segmentation models, enabling a higher pruning ratio on the corresponding multi-scale aggregation modules. The observations are consistent when other pruning methods and different pruning hyperparameters are used in DANCE’s automated network slimming (more details in Section 4.3.4), again verifying that the above conclusion (i.e., “co-optimization affects the optimal pruning patterns”) holds for DANCE regardless of the adopted pruning designs.
In this section, we evaluate DANCE on four segmentation models and three popular urban scene understanding datasets in terms of mIoU and the total training/inference FLOPs and energy cost, where the energy cost is measured when training or running inference with the corresponding models on a SOTA edge device (Jetson TX2 (NVIDIA Inc.)). We consider both the computational and energy costs because the former is commonly adopted and thus helps to benchmark against prior works, while the latter better captures the real hardware cost.
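For reference, the accuracy metric can be computed as in the sketch below, which follows the standard definition of mean intersection-over-union (per-class IoU = TP / (TP + FP + FN), averaged over the classes that appear). The function name is hypothetical, and handling of "ignore" labels, which real benchmarks typically require, is omitted for brevity.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    pred = np.asarray(pred, dtype=np.int64).ravel()
    target = np.asarray(target, dtype=np.int64).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    valid = union > 0
    return float(np.mean(tp[valid] / union[valid]))


# Four pixels, two classes: class 0 has IoU 1/2 and class 1 has IoU 2/3.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```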
4.1. Experiment setting
Considered models and datasets. Our evaluation of DANCE considers four SOTA segmentation models (two complicated models: DeepLabv3+ (Chen et al., 2018) and PSPNet (Zhao et al., 2017), and two compact models: ICNet (Zhao et al., 2018) and BiSeNet (Yu et al., 2018a)) and three urban scene understanding datasets (Cityscapes (Cordts et al., 2016a), CamVid (Brostow et al., 2008), and BDD (Yu et al., 2018b)) that are commonly used in many efficient segmentation works (Chen et al., 2019a; Zhao et al., 2018; Yu et al., 2018a).
Experimental platforms and training details.
All experiments (except the energy measurements) are performed on a workstation with NVIDIA 2080Ti GPU cards using the PyTorch framework (Paszke et al., 2017) for a fair comparison. We use an SGD optimizer for training all models; ICNet adopts a different learning rate from the other models due to the unavailability of the corresponding ImageNet pre-trained model. The mini-batch size is (1) 8 for the DeepLabv3+ and PSPNet models and (2) 16 for the BiSeNet and ICNet models.
| Train. FLOPs (P) | Infer. FLOPs (G) | Train. Energy (MJ) | Infer. Energy (J) | mIoU (%) |
4.2. Performance on various datasets/models
In this subsection, we apply DANCE to the four segmentation models and three datasets and compare the resulting segmentation accuracies and inference/training costs with those of the base models.
4.2.1. DANCE on the Cityscapes dataset
Table 1 compares the segmentation accuracy and the computational and energy costs of DANCE on the four models, i.e., DeepLabv3+ (Chen et al., 2018), PSPNet (Zhao et al., 2017), ICNet (Zhao et al., 2018), and BiSeNet (Yu et al., 2018a), when evaluated on the Cityscapes dataset. We can see that (1) DANCE saves about 36% - 39% and 35% - 40% of the computational and energy costs in training, respectively (with a similar trend in inference), while boosting the mIoU of DeepLabv3+ (Chen et al., 2018) and PSPNet (Zhao et al., 2017) by 3.44% and 1.93%, respectively; (2) in the case of ICNet, DANCE achieves a 13.34% higher mIoU with up to 45% energy savings as compared to the base model, where the lower mIoU of the base model might be due to the lack of a corresponding ImageNet pre-trained model; and (3) though DANCE does not boost the mIoU on the compact BiSeNet model, it does save training energy cost and wins bigger (saving up to 31% energy) in inference.
| Train. FLOPs (P) | Infer. FLOPs (G) | Train. Energy (MJ) | Infer. Energy (J) | mIoU (%) |
4.2.2. DANCE on the CamVid dataset
With the smaller images in CamVid (vs. those in Cityscapes), we can still observe trends similar to those on Cityscapes (see Table 1). Specifically, DANCE can still save 32% - 49% energy cost, as shown in Table 2, while achieving an improved mIoU (over 1.4%). For the compact model BiSeNet, with a comparable mIoU, DANCE still stably brings 32% and 33% energy savings in training and inference, respectively.
4.2.3. DANCE on the BDD dataset for adaptation
As Section 1 stated, for most on-device learning applications, training from scratch is not necessary and the ability to adapt to new data can be more interesting for some applications, especially for autonomous vehicles and robots.
| Train. FLOPs (P) | Infer. FLOPs (G) | Train. Energy (MJ) | Infer. Energy (J) | mIoU (%) |
Here, we choose the BDD dataset (Yu et al., 2018b) for the adaptation experiments. We use models pre-trained on Cityscapes and adapt them to unseen images in BDD. For a fair comparison, we use the same pre-trained checkpoint for each model across experiments. The adaptation performance is summarized in Table 3, which shows that, similar to the performance on Cityscapes, DANCE saves up to 77% energy cost while achieving a slightly better (+0.12%) mIoU over the baseline, or boosts the mIoU by 5.51% while requiring a 25% lower energy cost than the baseline.
The extensive results in Tables 1 - 3 show that DANCE can achieve “all-win” on all three datasets when applied to both DeepLabv3+ and PSPNet: lower training cost (energy savings: 25% - 77%), more efficient inference (energy savings: 40% - 45%), and improved mIoU (0.12% - 5.51%), demonstrating the consistent superiority of DANCE on complicated models. As for the performance on compact models, DANCE can improve the efficiency of both training (energy savings: 25% - 61%) and inference (energy savings: 31% - 56%) with a slightly lower or even better mIoU (-0.71% - 13.34%) on all three datasets, indicating that DANCE can benefit the energy efficiency of even compact models.
4.3. Ablation studies of DANCE
In this subsection, we perform ablation studies of DANCE for evaluating the effectiveness of its data-network co-optimization, indicator, and automated data slimming.
4.3.1. Ablation study on the effectiveness of DANCE’s data-network co-optimization
| Method | Train. FLOPs (P) | Infer. FLOPs (G) | mIoU (%) |
| Optimize Network After Data | +0.85% | -61.10% | +0.66 |
| Optimize Data After Network | +12.99% | -55.88% | +3.08 |
DANCE vs. only automated network/data slimming. As shown in Fig. 5, combining both automated data and network slimming (i.e., DANCE) achieves (1) better performance (in terms of training cost, inference cost, and mIoU) than the standalone implementation of either of these two techniques integrated into DANCE (i.e., automated data and network slimming); and (2) a much higher mIoU than the baseline (+3.41%) while requiring 39% and 45% less energy in training and inference, respectively. This set of experiments indicates the advantage of jointly matching the data and network for co-optimization.
DANCE vs. optimizing network and data separately. Table 4 compares co-optimization (DANCE) with separate optimization (optimizing network/data and then data/network sequentially), showing that network and data need to be jointly co-optimized to achieve the best mIoU-cost trade-off, while optimizing (i.e., slimming) the network and data sequentially will cause a 0.33% - 2.75% mIoU drop on DeepLabv3+@Cityscapes at an even higher computational cost (e.g., +48.74%) than DANCE.
4.3.2. Ablation study of DANCE on objects with different scales
|Method||w/o DANCE||w/ DANCE|
|IoU of Wall (%)||50.56||48.68||46.40||52.69|
|IoU of Motorcycle (%)||53.96||57.51||55.09||58.23|
|IoU of Traffic Sign (%)||70.17||72.41||72.94||73.58|
Here we compare the inference IoU with and without DANCE applied on top of DeepLabv3+, when testing on representative large-, medium-, and small-scale objects (i.e., wall, motorcycle, and traffic sign) in Cityscapes. As shown in Table 5, we can see that (1) small/large scales favor/degrade the inference IoU achieved when applying DeepLabv3+ to the selected objects of different scales; and (2) DANCE, which inherently applies dynamic scales to its input data, consistently outperforms its baselines even for the manually selected objects whose scales are static by design. This indicates the advantage of DANCE’s automated choice of adaptive data scales and validates its inherent advantage in handling datasets/tasks whose objects have different scales, which is common for semantic segmentation datasets (e.g., Cityscapes (Cordts et al., 2016a), CamVid (Brostow et al., 2008), and BDD (Yu et al., 2018b)).
|Method||Train. FLOPs (P)||mIoU(%)|
|Inverse indicator, i.e., 1-||+0.00%||-11.03|
4.3.3. Ablation study of the indicator’s effectiveness
The spatial complexity indicator presented in Section 3.2 provides a variable for estimating a given image’s complexity, which is directly used to guide the slimming direction (e.g., the image’s downsampling ratio). As shown in Table 6, when we replace the proposed indicator in DANCE with an inverse or a random one, the resulting mIoU drops by 11.03% or 4.08%, respectively, under the same training cost budget, validating the effectiveness of our proposed indicator. Additionally, Fig. 6 visualizes 24 image samples randomly selected from the image groups with the largest 33%, medium 33%, and smallest 33% spatial complexity in the Cityscapes (Cordts et al., 2016a) training dataset. As expected, the image complexity identified by the adopted indicator is consistent with that perceived by human eyes, e.g., images with spatial complexity falling within the smallest 33% of the dataset have a simpler background and include fewer objects.
4.3.4. Ablation study of DANCE’s effectiveness regardless of the adopted pruning methods
| Pruning Ratio | #pruned weights (DANCE’s - AMS’s): ASPP Module Layers | #pruned weights (DANCE’s - AMS’s): Other Layers |
We consistently find that DANCE’s advantage in enabling data-model co-optimization holds regardless of the adopted pruning method. For example, Table 7 summarizes the pruning results when turning on and off DANCE’s automated data slimming during pruning, where we adopt the unstructured pruning in (Han et al., 2015a). Again, observations similar to those made with the channel-wise pruning of (Liu et al., 2017) can be made. Specifically, training with both automated data and model slimming, i.e., DANCE, always prunes more weights in layers corresponding to multi-scale aggregation (e.g., the ASPP module in DeepLabv3+) and fewer weights in other layers, under all seven considered pruning ratios between 20% and 80%, whereas merely using DANCE’s automated model slimming (AMS) does the opposite. This set of experimental results further confirms that (1) DANCE’s automated data slimming can (partially) emulate the effect of multi-scale aggregation in segmentation models, and thus enable a higher pruning ratio on the corresponding multi-scale aggregation modules, and (2) matching the data with the model can potentially improve the model accuracy and remove redundant costs associated with both the data and model, thus achieving “all-win”, which is consistent with the results in Fig. 1.
4.3.5. Ablation study of DANCE’s automated data slimming
As described in Section 3.2, DANCE’s automated data slimming integrates three techniques, including complexity-adaptive downsampling (CAD), complexity-adaptive stochastic dropping (CASD), and complexity-adaptive loss (CAL), which are guided by the adopted spatial complexity indicator. In this subsection, we evaluate the efficacy of these techniques and their different combinations on top of DANCE’s automated network slimming (ANS) (see Section 3.3), in terms of the resulting task accuracy (mIoU), and computational and energy savings of both inference and training, as summarized in Table 8. Note that all the task accuracy and computational and energy savings are normalized to those of the standard DeepLabv3+ (Chen et al., 2018) model and Cityscapes dataset (See row No. 1 of Table 8). We next discuss the observations in terms of the “all-win” goal (i.e., reducing both the training and inference costs while improving the achieved model accuracy (mIoU)):
|No.||ANS (a)||CAD (b)||CAL (c)||CASD (d)||RD (e)||Train. FLOPs||Train. Energy||Infer. FLOPs||Infer. Energy||mIoU|
|1||-||-||-||-||-||198.3 (P)||45.21 (MJ)||743.6 (G)||41.32 (J)||69.71 (%)|
(a) ANS: Automated Network Slimming
(b) CAD: Complexity-Adaptive Downsampling
(c) CAL: Complexity-Adaptive Loss
(d) CASD: Complexity-Adaptive Stochastic Dropping
(e) RD: Randomly Drop 50% (Wang et al., 2019)
1. Complexity-Adaptive Downsampling (CAD): Comparing Rows No. 2 and No. 3 shows that CAD+ANS (Row No. 3, i.e., applying the training-free CAD on top of DANCE’s automated network slimming (ANS)) saves 42.92% and 32.16% of the energy cost in training and inference, respectively, while decreasing the mIoU by 1.02% (i.e., -1.52% vs. -2.54%), as compared to merely performing ANS (Row No. 2). CAD thus offers a new trade-off between the achieved energy efficiency and mIoU.
2. Complexity-Adaptive Loss (CAL): Comparing Rows No. 3 and No. 4 shows that CAL+CAD+ANS (Row No. 4, i.e., applying CAL on top of ANS and CAD) boosts the mIoU by 1.91% as compared to merely combining CAD and ANS (Row No. 3), while still reducing the energy cost by 51.45% in training and 46.96% in inference as compared to the DeepLabv3+ baseline (Row No. 1). Adding CAL on top of CAD and ANS thus further improves the model accuracy while preserving the achieved energy efficiency.
3. Complexity-Adaptive Stochastic Dropping (CASD): First, comparing Rows No. 5 and No. 6 shows that the proposed CASD (Row No. 6) achieves a 1.37% higher mIoU than the random dropping technique in (Wang et al., 2019) (Row No. 5) under the same energy cost in both training and inference, indicating the advantage of complexity-adaptive stochastic dropping over random dropping. Second, comparing Rows No. 4 and No. 7 shows that applying CASD on top of CAL+CAD+ANS (Row No. 7) boosts the mIoU by 4.04% as compared to merely combining ANS, CAD, and CAL (Row No. 4), and by 3.41% as compared to the DeepLabv3+ baseline (Row No. 1), while obtaining 39.69% and 45.02% energy savings in training and inference, respectively, as compared to the DeepLabv3+ baseline (Row No. 1).
This set of comparisons indicates the effectiveness of DANCE’s automated data slimming: integrating all three component techniques yields the most favorable data-network co-optimization benefits, achieving the “all-win” goal shown in Fig. 1.
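Because the savings above are reported relative to the DeepLabv3+ baseline, it can help to translate them back into absolute costs. The short sketch below does this for the energy savings quoted for the full combination, using the baseline energies from Row No. 1 of Table 8. The helper name `absolute_cost` is ours, and the sketch assumes that a "saving of x%" means the remaining cost is (1 - x/100) times the baseline.

```python
# Baseline energy of the standard DeepLabv3+ on Cityscapes (Row No. 1, Table 8).
BASELINE_TRAIN_ENERGY_MJ = 45.21
BASELINE_INFER_ENERGY_J = 41.32


def absolute_cost(baseline: float, saving_percent: float) -> float:
    """Return the cost that remains after a relative saving over the baseline."""
    return baseline * (1.0 - saving_percent / 100.0)


# Energy savings quoted in the text for CASD+CAL+CAD+ANS versus the baseline.
train_energy = absolute_cost(BASELINE_TRAIN_ENERGY_MJ, 39.69)
infer_energy = absolute_cost(BASELINE_INFER_ENERGY_J, 45.02)

print(f"training energy:  {train_energy:.2f} MJ")
print(f"inference energy: {infer_energy:.2f} J")
```

Under this reading, the full combination corresponds to roughly 27.27 MJ of training energy and 22.72 J of inference energy.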
5. Ablation study of DANCE’s hyperparameters
In this subsection, we evaluate DANCE with different hyperparameters by changing the ranges of (1) the weighted coefficients in complexity-adaptive loss (CAL) and (2) the dropping probability in complexity-adaptive stochastic dropping (CASD) (as described in Section 3.2), and summarize the results in Table 9. To better study the effect of each hyperparameter, we fix the others to their default values (as described in Section 3.2) when tuning one of them.
Note that the larger the ratio of endpoints in the dynamic range of the weighted coefficients or dropping probabilities is, the more (less) frequently images with a higher (lower) spatial complexity are used. And the largest ratio is and for the weighted coefficients and dropping probabilities, respectively, which is also the default setting as mentioned in Section 3.2.
The results in Table 9 show that increasing the frequency of training images with a higher spatial complexity (defined in Eq. 1), by increasing the ratio of endpoints in the dynamic range of the weighted coefficient or dropping probability, favors the segmentation accuracy (i.e., a higher mIoU). This observation is consistent with that of (Jiang et al., 2019; Gain and Siegelmann, 2019). Specifically, changing the dropping probability range from 60% - 40% to 100% - 0% boosts the achieved mIoU by 1.99%, and changing the weighted coefficient range from 2.0 - 1.0 to 1.0 - 0.0 improves the mIoU by 0.82%, while the training and inference costs in both cases stay mostly the same.
|Setting||Range||Train. FLOPs||Train. Energy||Infer. FLOPs||Infer. Energy||mIoU|
|Weighted coefficients in CAL||2.0 - 1.0||124.66 (P)||26.43 (MJ)||358.39 (G)||23.57 (J)||72.30%|
|Weighted coefficients in CAL||4.0 - 1.0||-1.76%||-2.65%||+0.78%||+0.52%||+0.64%|
|Weighted coefficients in CAL||1.0 - 0.0||+2.21%||+3.18%||-3.87%||-3.60%||+0.82%|
|Dropping probability in CASD||60% - 40%||128.05 (P)||27.55 (MJ)||345.21 (G)||22.74 (J)||71.13%|
|Dropping probability in CASD||75% - 25%||-0.88%||-0.57%||-1.37%||-1.22%||+0.27%|
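The "ratio of endpoints" discussed above can be illustrated with a small sketch. It assumes, purely for illustration, that the spatial complexity of an image is normalized to [0, 1] and mapped linearly onto the chosen range; the actual mapping is the one defined in Section 3.2 and may differ. The function names `endpoint_ratio`, `loss_weight`, and `drop_probability` are hypothetical.

```python
import math


def endpoint_ratio(high: float, low: float) -> float:
    """Ratio of the two endpoints of a dynamic range (infinite when low is 0)."""
    return math.inf if low == 0 else high / low


def loss_weight(c: float, high: float, low: float) -> float:
    """Loss weight for normalized complexity c in [0, 1] (assumed linear).

    More complex images receive a larger weight, so c = 1 maps to `high`.
    """
    return low + c * (high - low)


def drop_probability(c: float, high: float, low: float) -> float:
    """Dropping probability for normalized complexity c in [0, 1] (assumed linear).

    More complex images are dropped less often, so c = 1 maps to `low`.
    """
    return high - c * (high - low)


# Ranges discussed in this subsection, written as (high endpoint, low endpoint).
cal_ranges = [(2.0, 1.0), (4.0, 1.0), (1.0, 0.0)]
casd_ranges = [(0.60, 0.40), (0.75, 0.25), (1.00, 0.00)]

for high, low in cal_ranges:
    print("CAL ", (high, low), "ratio:", endpoint_ratio(high, low),
          "weight at c=0.9:", round(loss_weight(0.9, high, low), 3))
for high, low in casd_ranges:
    print("CASD", (high, low), "ratio:", endpoint_ratio(high, low),
          "drop prob. at c=0.9:", round(drop_probability(0.9, high, low), 3))
```

Under this assumed mapping, widening a range (a larger endpoint ratio) increases the contrast between low- and high-complexity images: a high-complexity image is dropped less often and weighted relatively more, which matches the trend described for Table 9.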
We proposed DANCE for boosting segmentation efficiency during both training and inference, leveraging the hypothesis that maximum model accuracy and efficiency should be achieved when the data and model are optimally matched. On the “data-level”, DANCE’s automated data slimming not only halves the computational and energy costs, but also boosts the segmentation accuracy. Interestingly, DANCE’s automated data slimming can emulate the effect of multi-scale feature extraction at a much lower cost. This further motivates DANCE’s automated network slimming on the “model-level”, which automatically prunes the model to adapt to the data slimmed by DANCE’s automated data slimming. It leads to more pruning in the cost-dominant building blocks for multi-scale feature extraction, validating our hypothesis and further reducing both training and inference costs. Extensive experiments and ablation studies validate DANCE’s effectiveness and superiority, which reside in its capability to automatically match the data and network via automated co-optimization.
- Segmentation and recognition using structure from motion point clouds. In European conference on computer vision, pp. 44–57. Cited by: §4.1, §4.3.2.
- Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: 2nd item, §1, §2.2, §3.2, §3.3, §4.1, §4.2.1, §4.3.5.
- FasterSeg: searching for faster real-time semantic segmentation. arXiv preprint arXiv:1912.10917. Cited by: §4.1.
- Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8924–8933. Cited by: 1st item, §1.
- The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 6, §4.1, §4.3.2, §4.3.3.
- The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: Figure 1.
- Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: §3.2, §3.2.
- The oxford dictionary of statistical terms. Oxford University Press on Demand. Cited by: §3.2.
- Variance analysis of sensitivity information for pruning multilayer feedforward neural networks. In IJCNN’99. International Joint Conference on Neural Networks. Proceedings (Cat. No. 99CH36339), Vol. 3, pp. 1829–1833. Cited by: §1.
- Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §3.2.
- Relating information complexity and training in deep neural networks. In Micro-and Nanotechnology Sensors, Systems, and Applications XI, Vol. 10982, pp. 109822H. Cited by: §3.2, §3.2, §5.
- Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.1, §4.3.4.
- Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §3.3.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
- Knowledge adaptation for efficient semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 578–587. Cited by: §2.2.
- Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397. Cited by: §2.1.
- Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18 (1), pp. 6869–6898. Cited by: §2.1.
- Accelerating deep learning by focusing on the biggest losers. Cited by: §1, §1, §2.1, §3.2, §3.2, §3.2, §5.
- Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744. Cited by: §2.1, §3.3, §4.3.4.
- Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066. Cited by: §3.3.
- PruneTrain: fast neural network training by dynamic sparse model reconfiguration. Cited by: §2.1.
- Efficient segmentation: learning downsampling near semantic boundaries. arXiv preprint arXiv:1907.07156. Cited by: §3.2, §3.2.
- V. illustrations of the dynamical theory of gases.—part i. on the motions and collisions of perfectly elastic spheres. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 19 (124), pp. 19–32. Cited by: §3.2.
- CC-net: image complexity guided network compression for biomedical image segmentation. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 57–60. Cited by: §3.2.
- NVIDIA Jetson TX2. Note: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-tx2/, accessed 2019-09-01. Cited by: §4.
- Enet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147. Cited by: §2.2.
- Automatic differentiation in pytorch. In NIPS-W, Cited by: §4.1.
- Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668. Cited by: §2.1.
- Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations, Cited by: §3.3.
- Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §2.1.
- Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243. Cited by: §1.
- E2-train: training state-of-the-art cnns with over 80% less energy. In Advances in Neural Information Processing Systems, Cited by: §1, §2.1, §3.2, item e, §4.3.5.
- ApproxNet: content and contention aware video analytics system for the edge. arXiv preprint arXiv:1909.02068. Cited by: §3.2.
- Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5687–5695. Cited by: §2.1, §2.2.
- Progressive weight pruning of deep neural networks using admm. arXiv preprint arXiv:1810.07378. Cited by: §2.1.
- Drawing early-bird tickets: towards more efficient training of deep networks. arXiv preprint arXiv:1909.11957. Cited by: §1.
- Bisenet: bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 325–341. Cited by: 2nd item, §2.2, §2.2, §3.2, §3.3, §4.1, §4.2.1.
- Multi-scale context aggregation by dilated convolutions. ICLR. Cited by: 2nd item, §1.
- Bdd100k: a diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687. Cited by: §4.1, §4.2.3, §4.3.2.
- Image complexity and spatial information. In 2013 Fifth International Workshop on Quality of Multimedia Experience (QoMEX), pp. 12–17. Cited by: §3.2, §3.2.
- Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §2.1.
- Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 405–420. Cited by: 2nd item, §2.2, §2.2, §3.3, §4.1, §4.2.1.
- Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: 2nd item, §2.2, §3.3, §4.1, §4.2.1.