CondenseNet: An Efficient DenseNet using Learned Group Convolutions

by Gao Huang, et al.
Cornell University

Deep neural networks are increasingly used on mobile devices, where computational resources are limited. In this paper we develop CondenseNet, a novel network architecture with unprecedented efficiency. It combines dense connectivity between layers with a mechanism to remove unused connections. The dense connectivity facilitates feature re-use in the network, whereas learned group convolutions remove connections between layers for which this feature re-use is superfluous. At test time, our model can be implemented using standard grouped convolutions, allowing for efficient computation in practice. Our experiments demonstrate that CondenseNets are much more efficient than state-of-the-art compact convolutional networks such as MobileNets and ShuffleNets.



1 Introduction

The high accuracy of convolutional networks (CNNs) in visual recognition tasks, such as image classification [38, 12, 19], has fueled the desire to deploy these networks on platforms with limited computational resources, e.g., in robotics, self-driving cars, and on mobile devices. Unfortunately, the most accurate deep CNNs, such as the winners of the ImageNet [6] and COCO [31] challenges, were designed for scenarios in which computational resources are abundant. As a result, these models cannot be used to perform real-time inference on low-compute devices.

This problem has fueled development of computationally efficient CNNs that, e.g., prune redundant connections [27, 11, 29, 32, 9], use low-precision or quantized weights [21, 36, 4], or use more efficient network architectures [22, 12, 19, 5, 16, 47]. These efforts have led to substantial improvements: to achieve accuracy comparable to VGG [38] on ImageNet, ResNets [12] reduce the amount of computation by a factor of $5\times$, DenseNets [19] by a factor of $10\times$, and MobileNets [16] and ShuffleNets [47] by a factor of $25\times$. A typical set-up for deep learning on mobile devices is one where CNNs are trained on multi-GPU machines but deployed on devices with limited compute. Therefore, a good network architecture allows for fast parallelization during training, but is compact at test time.

Recent work [4, 20] shows that there is a lot of redundancy in CNNs. The layer-by-layer connectivity pattern forces networks to replicate features from earlier layers throughout the network. The DenseNet architecture [19] alleviates the need for feature replication by directly connecting each layer with all layers before it, which induces feature re-use. Although this is more efficient, we hypothesize that dense connectivity introduces redundancies when early features are not needed in later layers. We propose a novel method to prune such redundant connections between layers and then introduce a more efficient architecture. In contrast to prior pruning methods, our approach learns a sparsified network automatically during the training process, and produces a regular connectivity pattern that can be implemented efficiently using group convolutions. Specifically, we split the filters of a layer into multiple groups, and gradually remove the connections to less important features per group during training. Importantly, the groups of incoming features are not predefined, but learned. The resulting model, named CondenseNet, can be trained efficiently on GPUs, and has high inference speed on mobile devices.

Our image-classification experiments show that CondenseNets consistently outperform alternative network architectures. Compared to DenseNets, CondenseNets use only 1/10 of the computation at comparable accuracy levels. On the ImageNet dataset [6], a CondenseNet with 275 million FLOPs achieved a 29% top-1 error, which is comparable to the error of a MobileNet that requires twice as much compute. (Footnote 1: Throughout the paper, FLOPs refers to the number of multiplication-addition operations.)

2 Related Work and Background

We first review related work on model compression and efficient network architectures, which inspire our work. Next, we review the DenseNets and group convolutions that form the basis for CondenseNet.

2.1 Related Work

Weight pruning and quantization. CondenseNets are closely related to approaches that improve the inference efficiency of (convolutional) networks via weight pruning [27, 11, 29, 32, 14] and/or weight quantization [21, 36]. These approaches are effective because deep networks often have a substantial number of redundant weights that can be pruned or quantized without sacrificing (and sometimes even improving) accuracy. For convolutional networks, different pruning techniques may lead to different levels of granularity [34]. Fine-grained pruning, e.g., independent weight pruning [27, 10], generally achieves a high degree of sparsity. However, it requires storing a large number of indices, and relies on special hardware/software accelerators. In contrast, coarse-grained pruning methods such as filter-level pruning [29, 1, 32, 14] achieve a lower degree of sparsity, but the resulting networks are much more regular, which facilitates efficient implementations.

CondenseNets also rely on a pruning technique, but differ from prior approaches in two main ways. First, the weight pruning is initiated in the early stages of training, which is substantially more effective and efficient than using regularization throughout. Second, CondenseNets have a higher degree of sparsity than filter-level pruning, yet yield highly efficient group convolutions, reaching a sweet spot between sparsity and regularity.

Efficient network architectures. A range of recent studies has explored efficient convolutional networks that can be trained end-to-end [19, 46, 16, 47, 49, 22, 48]. Three prominent examples of networks that are sufficiently efficient to be deployed on mobile devices are MobileNet [16], ShuffleNet [47], and Neural Architecture Search (NAS) networks [49]. All these networks use depth-wise separable convolutions, which greatly reduce computational requirements without significantly reducing accuracy. A practical downside of these networks is that depth-wise separable convolutions are not (yet) efficiently implemented in most deep-learning platforms. By contrast, CondenseNet uses the well-supported group convolution operation [25], leading to better computational efficiency in practice.

Architecture-agnostic efficient inference has also been explored by several prior studies. For example, knowledge distillation [3, 15] trains small “student” networks to reproduce the output of large “teacher” networks to reduce test-time costs. Dynamic inference methods [2, 8, 7, 17] adapt the inference to each specific test example, skipping units or even entire layers to reduce computation. We do not explore such approaches here, but believe they can be used in conjunction with CondenseNets.

Figure 1: The transformations within a layer in DenseNets (left), and CondenseNets at training time (middle) and at test time (right). The Index and Permute operations are explained in Section 3.1 and 4.1, respectively. (L-Conv: learned group convolution; G-Conv: group convolution)

Figure 2: Standard convolution (left) and group convolution (right). The latter enforces a sparsity pattern by partitioning the inputs (and outputs) into disjoint groups.

Figure 3: Illustration of learned group convolutions with $G = 3$ groups and a condensation factor of $C = 3$. During training a fraction of $1/C$ of the connections is removed after each of the $C - 1$ condensing stages. Filters from the same group use the same set of features, and at test time the index layer rearranges the features to allow the resulting model to be implemented as standard group convolutions.

2.2 DenseNet

Densely connected networks (DenseNets; [19]) consist of multiple dense blocks, each of which consists of multiple layers. Each layer produces $k$ features, where $k$ is referred to as the growth rate of the network. The distinguishing property of DenseNets is that the input of each layer is a concatenation of all feature maps generated by all preceding layers within the same dense block. Each layer performs a sequence of consecutive transformations, as shown in the left part of Figure 1. The first transformation (BN-ReLU, blue) is a composition of batch normalization and rectified linear units [35]. The first convolutional layer in the sequence reduces the number of channels, to save computational cost, by using $1 \times 1$ filters. The output is followed by another BN-ReLU transformation and is then reduced to the final $k$ output features through a $3 \times 3$ convolution.

2.3 Group Convolution

Group convolution is a special case of a sparsely connected convolution, as illustrated in Figure 2. It was first used in the AlexNet architecture [25], and has more recently been popularized by its successful application in ResNeXt [43]. Standard convolutional layers (left illustration in Figure 2) generate $O$ output features by applying a convolutional filter (one per output) over all $R$ input features, leading to a computational cost of $R \times O$. In comparison, group convolution (right illustration) reduces this computational cost by partitioning the input features into $G$ mutually exclusive groups, each producing its own outputs, which reduces the computational cost by a factor of $G$ to $\frac{R \times O}{G}$.
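
As a minimal sketch of this cost comparison, the following counts multiply-add operations per spatial position for a $1 \times 1$ convolution with and without groups. The function name `conv_cost` and the channel counts (64 inputs, 128 outputs, 4 groups) are illustrative assumptions, not values from the paper.

```python
def conv_cost(num_inputs, num_outputs, groups=1):
    """Multiply-adds per spatial position for a 1x1 (grouped) convolution."""
    if num_inputs % groups or num_outputs % groups:
        raise ValueError("groups must divide both input and output channel counts")
    # Each group maps num_inputs/groups inputs to num_outputs/groups outputs.
    per_group = (num_inputs // groups) * (num_outputs // groups)
    return groups * per_group


# Illustrative sizes only (assumed for this example).
standard = conv_cost(64, 128)            # R * O
grouped = conv_cost(64, 128, groups=4)   # R * O / G
print(standard, grouped)
```

With these assumed sizes the grouped layer needs one quarter of the multiply-adds of the standard layer, matching the factor-of-$G$ reduction described above.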

3 CondenseNets

Group convolution works well with many deep neural network architectures [43, 47, 46] that are connected in a layer-by-layer fashion. For dense architectures group convolution can be used in the $3 \times 3$ convolutional layer (see Figure 1, left). However, preliminary experiments show that a naïve adaptation of group convolutions in the $1 \times 1$ convolutional layer leads to drastic reductions in accuracy. We surmise that this is caused by the fact that the inputs to the $1 \times 1$ convolutional layer are concatenations of feature maps generated by preceding layers. Therefore, they differ in two ways from typical inputs to convolutional layers: (1) they have an intrinsic order; and (2) they are far more diverse. The hard assignment of these features to disjoint groups hinders effective re-use of features in the network. Experiments in which we randomly permute input feature maps in each layer before performing the group convolution show that this reduces the negative impact on accuracy; but even with the random permutation, group convolution in the $1 \times 1$ convolutional layer makes DenseNets less accurate than, for example, smaller DenseNets with equivalent computational cost.

It is shown in [19] that making early features available as inputs to later layers is important for efficient feature re-use. Although not all prior features are needed at every subsequent layer, it is hard to predict which features should be utilized at what point. To address this problem, we develop an approach that learns the input feature groupings automatically during training. Learning the group structure allows each filter group to select its own set of most relevant inputs. Further, we allow multiple groups to share input features and also allow features to be ignored by all groups. Note that in a DenseNet, even if an input feature is ignored by all groups in a specific layer, it can still be utilized by some groups at different layers. To differentiate it from regular group convolutions, we refer to our approach as learned group convolution.


3.1 Learned Group Convolution

We learn group convolutions through a multi-stage process, illustrated in Figures 3 and 4. The first half of the training iterations comprises $C - 1$ condensing stages. Here, we repeatedly train the network with sparsity-inducing regularization for a fixed number of iterations and subsequently prune away unimportant filters with low-magnitude weights. The second half of the training consists of the optimization stage, in which we learn the filters after the groupings are fixed. When performing the pruning, we ensure that filters from the same group share the same sparsity pattern. As a result, the sparsified layer can be implemented using a standard group convolution once training is completed (testing stage). Because group convolutions are efficiently implemented by many deep-learning libraries, this leads to high computational savings both in theory and in practice. We present details on our approach below.

Filter Groups.

We start with a standard convolution whose filter weights form a 4D tensor of size $O \times R \times W \times H$, where $O$, $R$, $W$, and $H$ denote the number of output channels, the number of input channels, and the width and the height of the filter kernels, respectively. As we are focusing on the $1 \times 1$ convolutional layer in DenseNets, the 4D tensor reduces to an $O \times R$ matrix $\mathbf{F}$. We consider this simplified case in the paper, but our procedure can readily be used with larger convolutional kernels. Before training, we first split the filters (or, equivalently, the output features) into $G$ groups of equal size. We denote the filter weights for these groups by $\mathbf{F}^1, \ldots, \mathbf{F}^G$; each $\mathbf{F}^g$ has size $\frac{O}{G} \times R$, and $\mathbf{F}^g_{i,j}$ corresponds to the weight of the $j$th input for the $i$th output within group $g$. Because the output features do not have an implicit ordering, this random grouping does not negatively affect the quality of the layer.

Condensation Criterion.

During the training process we gradually screen out subsets of less important input features for each group. The importance of the $j$th incoming feature map for the filter group $g$ is evaluated by the averaged absolute value of weights between them across all outputs within the group, i.e., by $\sum_{i=1}^{O/G} |\mathbf{F}^g_{i,j}|$. In other words, we remove columns in $\mathbf{F}^g$ (by zeroing them out) if their $L_1$-norm is small compared to the $L_1$-norm of other columns. This results in a convolutional layer that is structurally sparse: filters from the same group always receive the same set of features as input.
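
The criterion can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the helper names (`column_importance`, `select_inputs`, `prune_columns`) and the toy weight matrix are assumptions, and the score is the column-wise sum of absolute weights, which ranks columns identically to the average.

```python
def column_importance(group_weights):
    """Sum of absolute weights per input column for one filter group.

    group_weights holds one row per output filter in the group and
    one entry per input feature in each row.
    """
    num_inputs = len(group_weights[0])
    return [sum(abs(row[j]) for row in group_weights) for j in range(num_inputs)]


def select_inputs(group_weights, num_keep):
    """Indices of the num_keep input features with the largest column L1 norm."""
    scores = column_importance(group_weights)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:num_keep])


def prune_columns(group_weights, keep):
    """Zero out every column that is not in keep (same pattern for all filters)."""
    keep_set = set(keep)
    return [[w if j in keep_set else 0.0 for j, w in enumerate(row)]
            for row in group_weights]


# Toy group with 2 filters and 4 input features (values are made up).
F_g = [[0.9, -0.1, 0.0, 0.4],
       [-0.7, 0.2, 0.1, -0.5]]
kept = select_inputs(F_g, num_keep=2)
pruned = prune_columns(F_g, kept)
print(kept)
print(pruned)
```

Because whole columns are removed, both filters in the toy group end up reading the same two input features, which is the structural sparsity the criterion is meant to produce.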

Group Lasso.

To reduce the negative effects on accuracy introduced by weight pruning, $L_1$ regularization is commonly used to induce sparsity [29, 32]. In CondenseNets, we encourage convolutional filters from the same group to use the same subset of incoming features, i.e., we induce group-level sparsity instead. To this end, we use the following group-lasso regularizer [44] during training:

$$\sum_{g=1}^{G} \sum_{j=1}^{R} \sqrt{\sum_{i=1}^{O/G} \left(\mathbf{F}^g_{i,j}\right)^2}.$$

The group-lasso regularizer simultaneously pushes all the elements of a column of $\mathbf{F}^g$ to zero, because the term in the square root is dominated by the largest elements in that column. This induces the group-level sparsity we aim for.
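
To make the regularizer concrete, the sketch below evaluates a group-lasso penalty as the sum, over filter groups and input columns, of each column's Euclidean norm. The function name `group_lasso` and the toy weights are assumptions for illustration; in training, this value would be scaled by a regularization parameter and added to the loss.

```python
import math


def group_lasso(groups):
    """Sum over groups and input columns of the column's Euclidean norm."""
    total = 0.0
    for weights in groups:              # one weight matrix per filter group
        num_inputs = len(weights[0])
        for j in range(num_inputs):     # one column per input feature
            total += math.sqrt(sum(row[j] ** 2 for row in weights))
    return total


# Two toy groups, each with 2 filters and 2 input features (made-up values).
toy_groups = [
    [[3.0, 0.0], [4.0, 0.0]],   # column norms: 5 and 0
    [[0.0, 1.0], [0.0, 0.0]],   # column norms: 0 and 1
]
penalty = group_lasso(toy_groups)
print(penalty)
```

A column that is zero for every filter in a group contributes nothing to the penalty, so the regularizer rewards dropping an input for the whole group rather than for individual filters.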

Condensation Factor.

In addition to the fact that learned group convolutions are able to automatically discover good connectivity patterns, they are also more flexible than standard group convolutions. In particular, the proportion of feature maps used by a group does not necessarily need to be $1/G$. We define a condensation factor $C$, which may differ from $G$, and allow each group to select $\lfloor R/C \rfloor$ of the inputs.

Condensation Procedure.

In contrast to approaches that prune weights in pre-trained networks, our weight pruning process is integrated into the training procedure. As illustrated in Figure 3 (which uses $C = 3$), at the end of each of the $C - 1$ condensing stages we prune $1/C$ of the filter weights. By the end of training, only $1/C$ of the weights remain in each filter group. In all our experiments we set the number of training epochs of each condensing stage to $\frac{M}{2(C-1)}$, where $M$ denotes the total number of training epochs, so that the first half of the training epochs is used for condensing. In the second half of the training process, the optimization stage, we train the sparsified model.

Footnote 2: In our implementation of the training procedure we do not actually remove the pruned weights, but instead mask the filter by a binary tensor of the same size using an element-wise product. The mask is initialized with only ones, and elements corresponding to pruned weights are set to zero. This implementation via masking is more efficient on GPUs, as it does not require sparse matrix operations. In practice, the pruning hardly increases the wall time needed to perform a forward-backward pass during training.
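
The schedule implied by this procedure can be sketched as follows, assuming $C - 1$ condensing stages of equal length that together fill the first half of training, with $1/C$ of the original weights pruned at the end of each stage. The function name `condensation_schedule` is hypothetical; the values 300 epochs and $C = 4$ mirror the CIFAR setting described later in the paper.

```python
def condensation_schedule(total_epochs, condensation_factor):
    """Return (epoch, fraction_of_weights_remaining) after each condensing stage."""
    C = condensation_factor
    if C < 2:
        return []  # C = 1 means no weights are pruned
    stage_len = total_epochs / (2 * (C - 1))   # epochs per condensing stage
    schedule = []
    for stage in range(1, C):
        epoch = stage * stage_len
        remaining = (C - stage) / C            # 1/C of the original weights go per stage
        schedule.append((epoch, remaining))
    return schedule


schedule = condensation_schedule(300, 4)
print(schedule)
```

Under these assumptions the last pruning step falls at epoch 150 and removes half of the weights that are still present at that point, leaving a quarter of the original weights for the optimization stage.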

Learning rate.

We adopt the cosine shape learning rate schedule of Loshchilov et al. [33], which smoothly anneals the learning rate, and usually leads to improved accuracy [18, 49]. Figure 4 visualizes the learning rate as a function of training epoch (in magenta), and the corresponding training loss (blue curve) of a CondenseNet trained on the CIFAR-10 dataset [24]. The abrupt increase in the loss at epoch 150 is caused by the final condensation operation, which removes half of the remaining weights. However, the plot shows that the model gradually recovers from this pruning step in the optimization stage.
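
A minimal sketch of such a schedule is given below, assuming a single cosine cycle without warm restarts that decays from a base rate to zero over the full run. The function name `cosine_lr` is hypothetical, and the default base rate of 0.1 is taken from the training details reported in Section 4; the exact form used by the authors may differ.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=0.1):
    """Cosine-annealed learning rate: base_lr at epoch 0, zero at total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


# Learning rate at the start, middle, and end of a 300-epoch run.
for epoch in (0, 150, 300):
    print(epoch, cosine_lr(epoch, 300))
```

The rate changes slowly near the start and the end of training and fastest around the midpoint, which is where the final condensation step occurs in the 300-epoch setting.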

Index Layer.

After training we remove the pruned weights and convert the sparsified model into a network with a regular connectivity pattern that can be efficiently deployed on devices with limited computational power. For this reason we introduce an index layer that implements the feature selection and rearrangement operation (see Figure 3, right). The convolutional filters in the output of the index layer are rearranged to be amenable to existing (and highly optimized) implementations of regular group convolution. Figure 1 shows the transformations of the layers during training (middle) and during testing (right). During training the $1 \times 1$ convolution is a learned group convolution (L-Conv), but during testing, with the help of the index layer, it becomes a standard group convolution (G-Conv).
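
The equivalence between the masked training-time layer and an index layer followed by a standard group convolution can be checked on a toy example. The sketch below operates on a single spatial position of a $1 \times 1$ convolution; all function names and the weight values are illustrative assumptions rather than the authors' code.

```python
def masked_conv(x, weights):
    """Dense 1x1 convolution at one spatial position: y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def build_index_and_compact(weights, groups):
    """Derive the index list and compact per-group weights from a pruned matrix."""
    rows_per_group = len(weights) // groups
    index, compact = [], []
    for g in range(groups):
        block = weights[g * rows_per_group:(g + 1) * rows_per_group]
        # Inputs this group still uses: columns with at least one non-zero weight.
        used = [j for j in range(len(block[0]))
                if any(row[j] != 0.0 for row in block)]
        index.extend(used)
        compact.append([[row[j] for j in used] for row in block])
    return index, compact


def index_then_group_conv(x, index, compact):
    """Index layer (gather and reorder) followed by a standard group convolution."""
    gathered = [x[j] for j in index]
    out, start = [], 0
    for block in compact:
        width = len(block[0])
        chunk = gathered[start:start + width]
        out.extend(sum(w * v for w, v in zip(row, chunk)) for row in block)
        start += width
    return out


# 4 outputs, 4 inputs, 2 groups; each group keeps 2 inputs (made-up values).
# Input 2 is shared by both groups and input 1 is ignored by all groups.
W = [[1.0, 0.0, 2.0, 0.0],
     [3.0, 0.0, -1.0, 0.0],
     [0.0, 0.0, 0.5, 1.0],
     [0.0, 0.0, -2.0, 4.0]]
x = [1.0, 5.0, 2.0, 3.0]
index, compact = build_index_and_compact(W, groups=2)
print(masked_conv(x, W))
print(index_then_group_conv(x, index, compact))
```

The index list may repeat an input (here input 2) or omit one entirely (input 1), which is how the rearranged features let a regular group convolution reproduce the learned, overlapping groupings.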

Figure 4: The cosine shape learning rate and a typical training loss curve with a condensation factor of $C = 4$.

Figure 5: The proposed DenseNet variant. It differs from the original DenseNet in two ways: (1) layers with different resolution feature maps are also directly connected; (2) the growth rate doubles whenever the feature map size shrinks (far more features are generated in the third, yellow, dense block than in the first).

3.2 Architecture Design

In addition to the use of learned group convolutions introduced above, we make two changes to the regular DenseNet architecture. These changes are designed to further simplify the architecture and improve its computational efficiency. Figure 5 illustrates the two changes that we made to the DenseNet architecture.

Exponentially increasing growth rate.

The original DenseNet design adds $k$ new feature maps at each layer, where $k$ is a constant referred to as the growth rate. As shown in [19], deeper layers in a DenseNet tend to rely on high-level features more than on low-level features. This motivates us to improve the network by strengthening short-range connections. We found that this can be achieved by gradually increasing the growth rate as the depth grows. This increases the proportion of features coming from later layers relative to those from earlier layers. For simplicity, we set the growth rate to $k = 2^{m-1} k_0$, where $m$ is the index of the dense block, and $k_0$ is a constant. This way of setting the growth rate does not introduce any additional hyper-parameters. The “increasing growth rate” (IGR) strategy places a larger proportion of parameters in the later layers of the model. This increases the computational efficiency substantially but may decrease the parameter efficiency in some cases. Depending on the specific hardware limitations it may be advantageous to trade off one for the other [22].
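
As a small worked example, assuming the growth rate doubles from one dense block to the next, i.e. $k = 2^{m-1} k_0$ for the $m$th block, the per-block growth rates can be computed as below. The function name `growth_rate` and the base rate of 8 are chosen for illustration.

```python
def growth_rate(block_index, base_rate):
    """Growth rate of dense block m (1-based): 2^(m-1) * k0."""
    return (2 ** (block_index - 1)) * base_rate


# Three dense blocks with an assumed base rate k0 = 8.
rates = [growth_rate(m, 8) for m in (1, 2, 3)]
print(rates)
```

Each later block therefore adds twice as many feature maps per layer as the block before it, which shifts the balance of available features toward those produced by recent layers.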

Fully dense connectivity.

To encourage feature re-use even more than the original DenseNet architecture does already, we connect input layers to all subsequent layers in the network, even if these layers are located in different dense blocks (see Figure 5). As dense blocks have different feature resolutions, we downsample feature maps with higher resolutions when we use them as inputs into lower-resolution layers using average pooling.

4 Experiments

We evaluate CondenseNets on the CIFAR-10, CIFAR-100 [24], and the ImageNet (ILSVRC 2012; [6]) image-classification datasets. The models and code reproducing our experiments are publicly available.

The CIFAR-10 and CIFAR-100 datasets consist of RGB images of size $32 \times 32$ pixels, corresponding to 10 and 100 classes, respectively. Both datasets contain 50,000 training images and 10,000 test images. We use a standard data-augmentation scheme [30, 37, 28, 39, 41, 20, 26], in which the images are zero-padded with 4 pixels on each side, randomly cropped to produce $32 \times 32$ images, and horizontally mirrored with probability 0.5.

The ImageNet dataset comprises 1000 visual classes, and contains a total of 1.2 million training images and 50,000 validation images. We adopt the data-augmentation scheme of [12] at training time, and perform a rescaling to $256 \times 256$ followed by a $224 \times 224$ center crop at test time before feeding the input image into the networks.

Figure 6: Ablation study on CIFAR-10 to investigate the efficiency gains obtained by the various components of CondenseNet.

4.1 Results on CIFAR

We first perform a set of experiments on CIFAR-10 and CIFAR-100 to validate the effectiveness of learned group convolutions and the proposed architecture.

Model configurations.

Unless otherwise specified, we use the following network configurations in all experiments on the CIFAR datasets. The standard DenseNet has a constant growth rate of $k = 12$ following [19]; our proposed architecture uses growth rates of 8, 16, and 32 to ensure that the growth rate is divisible by the number of groups. The learned group convolution is only applied to the first convolutional layer (with filter size $1 \times 1$, see Figure 1) of each basic layer, with a condensation factor of $C = 4$, i.e., 75% of filter weights are gradually pruned during training with a step of 25%. The $3 \times 3$ convolutional layers are replaced by standard group convolution (without applying learned group convolution) with four groups. Following [47, 46], we permute the output channels of the first learned group convolutional layer, such that the features generated by each of its groups are evenly used by all the groups of the subsequent group convolutional layer.
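
One common way to realize such a permutation is the channel shuffle used in ShuffleNet-style networks: view the channels as a (groups, channels-per-group) grid, transpose it, and flatten. The sketch below assumes this form; the function name `permute_channels` is hypothetical, and the paper's own Permute operation may differ in detail.

```python
def permute_channels(channels, groups):
    """Interleave channels so each downstream group sees every upstream group."""
    if len(channels) % groups:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    # View as (groups, per_group), transpose to (per_group, groups), flatten.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# 8 channels from 2 upstream groups: 0-3 from the first, 4-7 from the second.
shuffled = permute_channels(list(range(8)), groups=2)
print(shuffled)
```

After the shuffle, the first four channels (the inputs of the first downstream group) contain two features from each upstream group, and likewise for the second downstream group.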

Training details.

We train all models with stochastic gradient descent (SGD) using similar optimization hyper-parameters as in [12, 19]. Specifically, we adopt Nesterov momentum with a momentum weight of 0.9 without dampening, and use a weight decay of $10^{-4}$. All models are trained with mini-batch size 64 for 300 epochs, unless otherwise specified. We use a cosine shape learning rate which starts from 0.1 and gradually reduces to 0. Dropout [40] with a drop rate of 0.1 was applied to train CondenseNets with more than 3 million parameters (shown in Table 1).

Component analysis.

Figure 6 compares the computational efficiency gains obtained by each component of CondenseNet: learned group convolution (LGC), exponentially increasing growth rate (IGR), and full dense connectivity (FDC). Specifically, the figure plots the test error as a function of the number of FLOPs (i.e., multiply-addition operations). The large gap between the two red curves with dot markers shows that learned group convolution significantly improves the efficiency of our models. Compared to DenseNets, CondenseNet^light (the variant that adds only learned group convolutions) requires only half the number of FLOPs to achieve comparable accuracy. Further, we observe that the exponentially increasing growth rate yields even further efficiency. Full dense connectivity does not boost the efficiency significantly on CIFAR-10, but there does appear to be a trend that full connectivity starts to help as models get larger. We opt to include this architecture change in the CondenseNet model, as it does lead to substantial improvements on ImageNet (see later).

Comparison with state-of-the-art efficient CNNs.

In Table 1, we show the results of experiments comparing a 160-layer CondenseNet^light and a 182-layer CondenseNet with alternative state-of-the-art CNN architectures. Following [49], our models were trained for 600 epochs. From the results, we observe that the 160-layer CondenseNet^light requires approximately $8\times$ fewer parameters and FLOPs to achieve a comparable accuracy to DenseNet-190. CondenseNet seems to be less parameter-efficient than CondenseNet^light, but is more compute-efficient. Somewhat surprisingly, our CondenseNet^light model performs on par with NASNet-A, an architecture that was obtained using an automated search procedure over candidate architectures composed of a rich set of components, and is thus carefully tuned on the CIFAR-10 dataset [49]. Moreover, CondenseNet (or CondenseNet^light) does not use depth-wise separable convolutions, and only uses simple convolutional filters with size $1 \times 1$ and $3 \times 3$. It may be possible to include CondenseNet as a meta-architecture in the procedure of [49] to obtain even more efficient networks.

Model                          Params   FLOPs     C-10   C-100
ResNet-1001 [13]               16.1M    2,357M    4.62   22.71
Stochastic-Depth-1202 [20]     19.4M    2,840M    4.91   -
Wide-ResNet-28 [45]            36.5M    5,248M    4.00   19.25
ResNeXt-29 [43]                68.1M    10,704M   3.58   17.31
DenseNet-190 [19]              25.6M    9,388M    3.46   17.18
NASNet-A [49]                  3.3M     -         3.41   -
CondenseNet^light-160*         3.1M     1,084M    3.46   17.55
CondenseNet-182*               4.2M     513M      3.76   18.47
Table 1: Comparison of classification error rate (%) with other convolutional networks on the CIFAR-10 (C-10) and CIFAR-100 (C-100) datasets. * indicates models that are trained with cosine shape learning rate for 600 epochs.

Comparison with existing pruning techniques.

In Table 2, we compare our CondenseNets and CondenseNets^light with models that are obtained by state-of-the-art filter-level weight pruning techniques [29, 32, 14]. The results show that, in general, CondenseNet is about $3\times$ more efficient in terms of FLOPs than ResNets or DenseNets pruned by the method introduced in [32]. The advantage over the other pruning techniques is even more pronounced. We also report the results for CondenseNet^light in the second-to-last row of Table 2. It uses only half the number of parameters to achieve performance comparable to the most competitive baseline, the 40-layer DenseNet described by [32].

Model                        FLOPs   Params   C-10   C-100
VGG-16-pruned [29]           206M    5.40M    6.60   25.28
VGG-19-pruned [32]           195M    2.30M    6.20   -
VGG-19-pruned [32]           250M    5.00M    -      26.52
ResNet-56-pruned [14]        62M     -        8.20   -
ResNet-56-pruned [29]        90M     0.73M    6.94   -
ResNet-110-pruned [29]       213M    1.68M    6.45   -
ResNet-164-B-pruned [32]     124M    1.21M    5.27   23.91
DenseNet-40-pruned [32]      190M    0.66M    5.19   25.28
CondenseNet^light-94         122M    0.33M    5.00   24.08
CondenseNet-86               65M     0.52M    5.00   23.64
Table 2: Comparison of classification error rate (%) on CIFAR-10 (C-10) and CIFAR-100 (C-100) with state-of-the-art filter-level weight pruning methods.
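Table 3: CondenseNet architectures for ImageNet. The surviving rows list, in order: an initial convolution (stride not recoverable), four average-pooling layers with stride 2, a global average pool, and a 1000-dim fully-connected layer with softmax. The feature-map-size column and the dense-block configurations are not recoverable.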

4.2 Results on ImageNet

In a second set of experiments, we test CondenseNet on the ImageNet dataset.

Model configurations.

Detailed network configurations are shown in Table 3. To reduce the number of parameters, we prune 50% of weights from the fully connected (FC) layer at epoch 60 in a way similar to the learned group convolution, but with $G = 1$ (as the FC layer could not be split into multiple groups) and $C = 2$. Similar to prior studies on MobileNets and ShuffleNets, we focus on training relatively small models that require less than 600 million FLOPs to perform inference on a single image.

Training details.

We train all models using stochastic gradient descent (SGD) with a batch size of 256. As before, we adopt Nesterov momentum with a momentum weight of 0.9 without dampening, and a weight decay of $10^{-4}$. All models are trained for 120 epochs, with a cosine shape learning rate which starts from 0.1 and gradually reduces to 0. We use group lasso regularization in all experiments on ImageNet; the regularization parameter is set to $10^{-5}$.

Comparison with state-of-the-art efficient CNNs.

Table 4 shows the results of CondenseNets and several state-of-the-art, efficient models on the ImageNet dataset. We observe that a CondenseNet with 274 million FLOPs obtains a 29.0% Top-1 error, which is comparable to the accuracy achieved by MobileNets and ShuffleNets that require twice as much compute. A CondenseNet with 529 million FLOPs yields a 3% absolute reduction in top-1 error compared to a MobileNet and a ShuffleNet of comparable size. Our CondenseNet even achieves the same accuracy with slightly fewer FLOPs and parameters than the most competitive NASNet-A, despite the fact that we only trained a very small number of models (as opposed to the study that led to the NASNet-A model).

Actual inference time.

Table 5 shows the actual inference time on an ARM processor for different models. The wall-clock time to run inference on an image of size $224 \times 224$ is highly correlated with the number of FLOPs of the model. Compared to the recently proposed MobileNet, our CondenseNet ($G = C = 8$) with 274 million FLOPs processes an image about $2\times$ faster without sacrificing accuracy.

Model                        FLOPs    Params   Top-1   Top-5
Inception V1 [42]            1,448M   6.6M     30.2    10.1
1.0 MobileNet-224 [16]       569M     4.2M     29.4    10.5
ShuffleNet 2x [47]           524M     5.3M     29.1    10.2
NASNet-A (N=4) [49]          564M     5.3M     26.0    8.4
NASNet-B (N=4) [49]          488M     5.3M     27.2    8.7
NASNet-C (N=3) [49]          558M     4.9M     27.5    9.0
CondenseNet (G=C=8)          274M     2.9M     29.0    10.0
CondenseNet (G=C=4)          529M     4.8M     26.2    8.3
Table 4: Comparison of Top-1 and Top-5 classification error rate (%) with other state-of-the-art compact models on ImageNet.
Model                        FLOPs     Top-1   Time (s)
VGG-16                       15,300M   28.5    354
ResNet-18                    1,818M    30.2    8.14
1.0 MobileNet-224 [16]       569M      29.4    1.96
CondenseNet (G=C=4)          529M      26.2    1.89
CondenseNet (G=C=8)          274M      29.0    0.99
Table 5: Actual inference time of different models on an ARM processor. All models are trained on ImageNet, and accept input with resolution $224 \times 224$.

Figure 7: Classification error rate (%) on CIFAR-10. Left: Comparison between our condensing method and the traditional pruning approach, under varying condensation factors. Middle: CondenseNets with different numbers of groups for the learned group convolution. All the models have the same number of parameters. Right: CondenseNets with different condensation factors.

4.3 Ablation Study

We perform an ablation study on CIFAR-10 in which we investigate the effect of (1) the pruning strategy, (2) the number of groups, and (3) the condensation factor. We also investigate the stability of our weight pruning procedure.

Pruning strategy.

The left panel of Figure 7 compares our on-the-fly pruning method with the more common approach of pruning weights of fully converged models. We use a DenseNet with 50 layers as the basis for this experiment. We implement a “traditional” pruning method in which the weights are pruned in the same way as in CondenseNets, but the pruning is only done once after training has completed (for 300 epochs). Following [32], we fine-tune the resulting sparsely connected network for another 300 epochs with the same cosine shape learning rate that we use for training CondenseNets. We compare the traditional pruning approach with the CondenseNet approach, setting the number of groups $G$ to 4. In both settings, we vary the condensation factor $C$ between 2 and 8.

The results in Figure 7 show that pruning weights gradually during training outperforms pruning weights on fully trained models. Moreover, gradual weight pruning reduces the training time: the “traditional pruning” models were trained for 600 epochs, whereas the CondenseNets were trained for 300 epochs. The results also show that removing 50% of the weights (by setting $C = 2$) from the $1 \times 1$ convolutional layers in a DenseNet incurs hardly any loss in accuracy.

Number of groups.

In the middle panel of Figure 7, we compare four CondenseNets with exactly the same network architecture, but a number of groups, $G$, that varies between 1 and 8. We fix the condensation factor, $C$, to 8 for all the models, which implies all models have the same number of parameters after training has completed. In CondenseNets with a single group, we discard entire filters in the same way that is common in filter-pruning techniques [29, 32]. The results presented in the figure demonstrate that the test error tends to decrease as the number of groups increases. This result is in line with our analysis in Section 3; in particular, it suggests that grouping filters gives the training algorithm more flexibility to remove redundant weights.

Figure 8: Norm of weights between layers of a CIFAR-10 CondenseNet per filter group (top) and per filter block (bottom). The three columns correspond to independent training runs.

Effect of the condensation factor.

In the right panel of Figure 7, we compare CondenseNets with varying condensation factors. Specifically, we set the condensation factor to 1, 2, 4, or 8; this corresponds to removing 0%, 50%, 75%, or 87.5% of the weights from each of the convolutional layers, respectively. A condensation factor of 1 corresponds to a baseline model without weight pruning. The number of groups is set to 4 for all the networks. The results show that condensation factors larger than 1 consistently lead to improved efficiency, which underlines the effectiveness of our method. Interestingly, models with condensation factors 2, 4, and 8 perform comparably in terms of classification error as a function of FLOPs. This suggests that whilst pruning more weights yields smaller models, it also leads to a proportional loss in accuracy.
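The correspondence between condensation factor and pruned fraction quoted above follows from keeping one out of every C weights. A minimal check, where the function name `fraction_removed` is introduced only for this example:

```python
def fraction_removed(condensation):
    """Fraction of weights pruned when 1/condensation of them are kept."""
    return 1.0 - 1.0 / condensation


for c in (1, 2, 4, 8):
    print(f"condensation factor {c}: {100 * fraction_removed(c):.1f}% removed")
```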


Stability.

As our method removes redundant weights in early stages of the training process, a natural question is whether this will introduce extra variance into the training. Does early pruning remove some of the weights simply because they were initialized with small values?

To investigate this question, Figure 8 visualizes the learned weights and connections for three independently trained CondenseNets on CIFAR-10 (using different random seeds). The top row shows detailed weight strengths (averaged absolute value of non-pruned weights) between a filter group of a certain layer (corresponding to a column in the figure) and an input feature map (corresponding to a row in the figure). For each layer there are four filter groups (consecutive columns). A white pixel in the top-right corner indicates that a particular input feature was pruned by that layer and group. Following [19], the bottom row of Figure 8 shows the overall connection strength between two layers in the condensed network. The vertical bars correspond to the linear classification layer on top of the CondenseNet. The gray vertical dotted lines correspond to pooling layers that decrease the feature resolution.
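The per-group weight strength shown in the top row can be sketched as below: for each input feature (row) and filter group (column), average the absolute values of the weights that survived pruning. The function name `group_strengths`, the boolean keep-mask representation, and the toy values are assumptions for illustration; a fully pruned connection is reported as `None`, which would be drawn as a white pixel.

```python
def group_strengths(weights, mask, groups):
    """Mean absolute non-pruned weight per (input feature, filter group).

    weights[i][j]: weight from input feature j to output filter i.
    mask[i][j]: True if that weight was kept, False if it was pruned.
    Returns strengths[j][g], or None where every weight was pruned.
    """
    n_out, n_in = len(weights), len(weights[0])
    per_group = n_out // groups
    strengths = []
    for j in range(n_in):
        row = []
        for g in range(groups):
            vals = [abs(weights[i][j])
                    for i in range(g * per_group, (g + 1) * per_group)
                    if mask[i][j]]
            row.append(sum(vals) / len(vals) if vals else None)
        strengths.append(row)
    return strengths


# Toy layer: 4 filters in 2 groups, 4 input features (values are made up).
w = [
    [0.9, 0.1, 0.0, 0.8],
    [0.7, 0.0, 0.1, 0.6],
    [0.0, 0.5, 0.9, 0.1],
    [0.1, 0.6, 0.8, 0.0],
]
# Group 0 (filters 0-1) keeps inputs 0 and 3; group 1 (filters 2-3) keeps 1 and 2.
keep = [
    [True, False, False, True],
    [True, False, False, True],
    [False, True, True, False],
    [False, True, True, False],
]
s = group_strengths(w, keep, groups=2)
```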

The results in the figure suggest that while there are differences in learned connectivity at the filter-group level (top row), the overall information flow between layers (bottom row) is similar for all three models. This suggests that the three training runs learn similar global connectivity patterns, despite starting from different random initializations. Later layers tend to prefer more recently generated features, but they do utilize some features from very early layers.

5 Conclusion

In this paper, we introduced CondenseNet: an efficient convolutional network architecture that encourages feature re-use via dense connectivity and prunes filters associated with superfluous feature re-use via learned group convolutions. To make inference efficient, the pruned network can be converted into a network with regular group convolutions, which are implemented efficiently in most deep-learning libraries. Our pruning method is simple to implement, and adds only limited computational costs to the training process. In our experiments, CondenseNets outperform recently proposed MobileNets and ShuffleNets in terms of computational efficiency at the same accuracy level. CondenseNet even slightly outperforms a network architecture that was discovered by empirically trying tens of thousands of convolutional network architectures, while having a much simpler structure.


The authors are supported in part by grants from the National Science Foundation (III-1525919, IIS-1550179, IIS-1618134, S&AS 1724282, and CCF-1740822), the Office of Naval Research DOD (N00014-17-1-2175), and the Bill and Melinda Gates Foundation. We are thankful for generous support by SAP America Inc. We also thank Xu Zou, Weijia Chen, and Danlu Chen for helpful discussions.