1 Introduction
Convolutional neural networks (CNNs) have demonstrated great capability in solving visual recognition tasks. Since AlexNet Krizhevsky et al. (2012) achieved remarkable success on the ImageNet Challenge Deng et al. (2009), various deeper and more complicated networks Simonyan and Zisserman (2015); Szegedy et al. (2015); He et al. (2016) have been proposed to set new performance records. However, higher accuracy usually comes with an increasing number of parameters and greater computational cost. For example, VGG16 Simonyan and Zisserman (2015) has 128 million parameters and requires 15,300 million floating-point operations (FLOPs) to classify a single image. In many real-world applications, predictions need to be performed on resource-limited platforms such as sensors and mobile phones, thereby requiring compact models with higher speed. Model compression aims at exploring a trade-off between accuracy and efficiency.
Recently, significant progress has been made in the field of model compression Iandola et al. (2016); Rastegari et al. (2016); Wu et al. (2016); Howard et al. (2017); Zhang et al. (2017). The strategies for building compact and efficient CNNs can be divided into two categories: compressing pre-trained networks, or designing new compact architectures that are trained from scratch. Studies in the former category were mostly based on traditional compression techniques such as product quantization Wu et al. (2016), pruning See et al. (2016), hashing Chen et al. (2015), Huffman coding Han et al. (2015), and factorization Lebedev et al. (2014); Jaderberg et al. (2014).
The second category had already been explored before the rise of model compression. Inspired by the Network-In-Network architecture Lin et al. (2013), GoogLeNet Szegedy et al. (2015) included the Inception module to build deeper networks without increasing model size and computational cost. Through factorizing convolutions, the Inception module was further improved in Szegedy et al. (2016). The depthwise separable convolution, proposed in Sifre and Mallat (2014), generalized the factorization idea and decomposed the convolution into a depthwise convolution and a 1×1 convolution. This operation has been shown to achieve competitive results with fewer parameters. In terms of model compression, MobileNets Howard et al. (2017) and ShuffleNets Zhang et al. (2017) designed CNNs for mobile devices by employing depthwise separable convolutions.

In this work, we focus on the second category and build a new family of lightweight CNNs known as ChannelNets. Observing that the fully-connected pattern accounts for most parameters in CNNs, we propose channel-wise convolutions, which replace dense connections among feature maps with sparse ones. Early work like LeNet-5 LeCun et al. (1998) has shown that sparsely-connected networks work well when resources are limited. To apply channel-wise convolutions in model compression, we develop group channel-wise convolutions, depthwise separable channel-wise convolutions, and the convolutional classification layer. They are used to compress different parts of CNNs, leading to our ChannelNets. ChannelNets achieve a better trade-off between efficiency and accuracy than prior compact CNNs, as demonstrated by experimental results on the ImageNet ILSVRC 2012 dataset. It is worth noting that ChannelNets are the first models that attempt to compress the fully-connected classification layer, which accounts for about 25% of total parameters in compact CNNs.
2 Background and Motivations
The trainable layers of CNNs are commonly composed of convolutional layers and fully-connected layers. Most prior studies, such as MobileNets Howard et al. (2017) and ShuffleNets Zhang et al. (2017), focused on compressing convolutional layers, where most parameters and computation lie. To make the discussion concrete, suppose a 2D convolutional operation takes m feature maps with a spatial size of d_f × d_f as inputs, and outputs n feature maps of the same spatial size with appropriate padding. m and n are also known as the number of input and output channels, respectively. The convolutional kernel size is d_k × d_k and the stride is set to 1. Here, without loss of generality, we use square feature maps and convolutional kernels for simplicity. We further assume that there is no bias term in the convolutional operation, as modern CNNs employ batch normalization Ioffe and Szegedy (2015) with a bias after the convolution. In this case, the number of parameters in the convolution is d_k × d_k × m × n and the computational cost in terms of FLOPs is d_k × d_k × m × n × d_f × d_f. Since the convolutional kernel is shared across spatial locations, for any pair of input and output feature maps, the connections are sparse and weighted by shared parameters. However, the connections among channels follow a fully-connected pattern, i.e., all input channels are connected to all output channels, which results in the m × n term. For deep convolutional layers, m and n are usually large, e.g., in the hundreds or thousands, so the m × n term is usually very large.

Based on the above insights, one way to reduce the size and cost of convolutions is to circumvent the multiplication between m and n. MobileNets Howard et al. (2017) applied this approach to explore compact deep models for mobile devices. The core operation employed in MobileNets is the depthwise separable convolution Chollet (2016), which consists of a depthwise convolution and a 1×1 convolution, as illustrated in Figure 1(a). The depthwise convolution applies a single convolutional kernel independently to each input feature map, thus generating the same number of output channels. The following 1×1 convolution is then used to fuse the information of all output channels using a linear combination. In effect, the depthwise separable convolution decomposes the regular convolution into a depthwise convolution step and a channel-wise fusion step. Through this decomposition, the number of parameters becomes

m × d_k × d_k + m × n,   (1)

and the computational cost becomes

m × d_k × d_k × d_f × d_f + m × n × d_f × d_f.   (2)

In both equations, the first term corresponds to the depthwise convolution and the second term corresponds to the 1×1 convolution. By decoupling the d_k × d_k spatial term from the m × n channel term, the amounts of parameters and computation are reduced.
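To make these counts concrete, the arithmetic in Eqs. (1) and (2) can be checked with a few lines of code. This is an illustrative sketch: the helper names and the layer sizes (m = n = 512, d_k = 3, d_f = 14) are our own choices, not taken from the text.

```python
# Hedged sketch of the parameter/FLOP counts discussed above,
# assuming stride 1 and "same" padding as in the text.

def regular_conv_cost(m, n, d_k, d_f):
    """Regular convolution: m input maps, n output maps, d_k x d_k kernel."""
    params = d_k * d_k * m * n
    flops = params * d_f * d_f  # kernel applied at every spatial location
    return params, flops

def depthwise_separable_cost(m, n, d_k, d_f):
    """Depthwise conv (m kernels of size d_k x d_k) followed by a 1x1 conv."""
    params = d_k * d_k * m + m * n                  # Eq. (1)
    flops = (d_k * d_k * m + m * n) * d_f * d_f     # Eq. (2)
    return params, flops

# An assumed deep layer, similar in size to those discussed in the text.
m, n, d_k, d_f = 512, 512, 3, 14
p_reg, f_reg = regular_conv_cost(m, n, d_k, d_f)
p_dws, f_dws = depthwise_separable_cost(m, n, d_k, d_f)
print(p_reg, p_dws)  # the m*n term dominates p_dws
```

Note that even after the decomposition, the m × n term (512 × 512 = 262,144 here) dwarfs the depthwise term (9 × 512 = 4,608), which motivates the analysis that follows.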
While MobileNets successfully employed depthwise separable convolutions to perform model compression and achieve competitive results, the m × n term still dominates the number of parameters in the models. As pointed out in Howard et al. (2017), 1×1 convolutions, which lead to the m × n term, account for 74.59% of total parameters in MobileNets. The analysis of regular convolutions reveals that m × n comes from the fully-connected pattern, which is also the case in 1×1 convolutions. To understand this, first consider the special case where d_f = 1. Now the inputs are m units, as each feature map has only one unit. As the convolutional kernel size is 1 × 1, which does not change the spatial size of feature maps, the outputs are n units. It is clear that the operation between the m input units and the n output units is a fully-connected operation with m × n parameters. When d_f > 1, the fully-connected operation is shared across spatial locations, leading to the 1×1 convolution. Hence, the 1×1 convolution actually outputs linear combinations of input feature maps. More importantly, in terms of connections between input and output channels, both the regular convolution and the depthwise separable convolution follow the fully-connected pattern.
As a result, a better strategy to compress convolutions is to change the dense connection pattern between input and output channels. Based on the depthwise separable convolution, this is equivalent to circumventing the 1×1 convolution. A simple method, previously used in AlexNet Krizhevsky et al. (2012), is the group convolution. Specifically, the m input channels are divided into g mutually exclusive groups. Each group goes through a 1×1 convolution independently and produces n/g output feature maps. It follows that there are still n output channels in total. For simplicity, suppose both m and n are divisible by g. As the 1×1 convolution for each group requires (m/g) × (n/g) parameters and (m/g) × (n/g) × d_f × d_f FLOPs, the total amount after grouping is only 1/g of that of the original 1×1 convolution. Figure 1(b) illustrates a group convolution.
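The block-diagonal connectivity of a 1×1 group convolution, which the next paragraph's limitation stems from, can be sketched directly; the function name and channel counts below are illustrative assumptions.

```python
# Sketch: connectivity of a 1x1 group convolution. Output channel j in
# group k depends only on the input channels of group k.

def connected_inputs(j, m, n, g):
    """Input channels feeding output channel j under a group convolution."""
    k = j // (n // g)          # group index of output channel j
    lo = k * (m // g)          # first input channel of that group
    return set(range(lo, lo + m // g))

m, n, g = 8, 8, 2
print(connected_inputs(0, m, n, g))  # inputs reachable from output 0
print(connected_inputs(7, m, n, g))  # inputs reachable from output 7
```

Here output 0 only ever sees input channels {0, 1, 2, 3} and output 7 only {4, 5, 6, 7}; no output mixes the two groups, which is exactly the information inconsistency problem discussed next.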
However, the grouping operation usually compromises performance because there is no interaction among groups. As a result, information of feature maps in different groups is not combined, as opposed to the original 1×1 convolution that combines information of all input channels. To address this limitation, ShuffleNet Zhang et al. (2017) was proposed, where a shuffling layer was employed after the group convolution. Through random permutation, the shuffling layer partly achieves interactions among groups. But any output group accesses only m/g input feature maps and thus collects partial information. Due to this reason, ShuffleNet had to employ a deeper architecture than MobileNets to achieve competitive results.
3 ChannelWise Convolutions and ChannelNets
In this work, we propose channelwise convolutions in Section 3.1, based on which we build our ChannelNets. In Section 3.2, we apply group channelwise convolutions to address the information inconsistency problem caused by grouping. Afterwards, we generalize our method in Section 3.3, which leads to a direct replacement of depthwise separable convolutions in deeper layers. Through analysis of the generalized method, we propose a convolutional classification layer to replace the fullyconnected output layer in Section 3.4, which further reduces the amounts of parameters and computations. Finally, Section 3.5 introduces the architecture of our ChannelNets.
3.1 ChannelWise Convolutions
We begin with the definition of channel-wise convolutions in general. As discussed above, the 1×1 convolution is equivalent to using a shared fully-connected operation to scan every location of the input feature maps. A channel-wise convolution instead employs a shared 1D convolutional operation in place of the fully-connected operation. Consequently, the connection pattern between input and output channels becomes sparse: each output feature map is connected to only a part of the input feature maps. To be specific, we again start with the special case where d_f = 1. The m input units (feature maps) can be considered as a 1D feature map of size m. Similarly, the output becomes a 1D feature map of size n. Note that both the input and output have only 1 channel. The channel-wise convolution performs a 1D convolution with appropriate padding to map the m units to the n units. In the cases where d_f > 1, the same 1D convolution is computed for every spatial location. As a result, the number of parameters in a channel-wise convolution with a kernel size of d_c is simply d_c, and the computational cost is d_c × n × d_f × d_f. By employing sparse connections, we avoid the m × n term. Therefore, channel-wise convolutions consume a negligible amount of computation and can be performed efficiently.
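The operation just described can be sketched as a minimal 1D convolution over the channel dimension at one spatial location; only d_c weights are shared across all outputs. The zero-padding scheme and the values below are simplifying assumptions (with stride 1 the sketch maps m units to m units, i.e., m = n), not the exact implementation.

```python
# Hedged sketch of a channel-wise convolution at a single spatial location.

def channelwise_conv(x, w, stride=1):
    """1D convolution across the channel dimension.

    x: list of m input channel values; w: shared kernel of size d_c.
    Symmetric zero padding keeps the output aligned with the input channels.
    """
    d_c = len(w)
    pad = d_c // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w[j] * xp[i + j] for j in range(d_c))
            for i in range(0, len(x), stride)]

x = [1.0, 2.0, 3.0, 4.0]   # m = 4 channel values at one location
w = [0.5, 0.5, 0.5]        # d_c = 3 shared weights
print(channelwise_conv(x, w))
```

Each output channel is a weighted sum of a sliding window of d_c neighboring input channels, rather than of all m channels as in a 1×1 convolution.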
3.2 Group ChannelWise Convolutions
We apply channel-wise convolutions to develop a solution to the information inconsistency problem incurred by grouping. After the group convolution, the outputs are g groups, each of which includes n/g feature maps. As illustrated in Figure 1(b), the groups are computed independently from completely separate groups of input feature maps. To enable interactions among groups, an efficient information fusion layer is needed after the group convolution. The fusion layer is expected to retain the grouping for the following group convolutions while allowing each group to collect information from all the groups. Concretely, both the inputs and outputs of this layer should be n feature maps that are divided into g groups. Meanwhile, the output channels in any group should be computed from all the input channels. More importantly, the layer must be compact and efficient; otherwise the advantage of grouping will be compromised.

Based on channel-wise convolutions, we propose the group channel-wise convolution, which serves elegantly as the fusion layer. Given n input feature maps that are divided into g groups, this operation performs g independent channel-wise convolutions. Each channel-wise convolution uses a stride of g and outputs n/g feature maps with appropriate padding. Note that, in order to ensure all input channels are involved in the computation of any output group of channels, the kernel size of the channel-wise convolutions needs to satisfy d_c ≥ g. The desired output of the fusion layer is obtained by concatenating the outputs of these g channel-wise convolutions. Figure 1(c) provides an example of using the group channel-wise convolution after the group convolution, which together replace the original 1×1 convolution.

To see the efficiency of this approach, note that the number of parameters of the group convolution followed by the group channel-wise convolution is m × n / g + d_c × g, and the computational cost is m × n × d_f × d_f / g + d_c × n × d_f × d_f. Since in most cases we have d_c × g ≪ m × n / g, our approach requires approximately 1/g of the training parameters and FLOPs, as compared to the second terms in Eqs. 1 and 2.
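The parameter count just stated can be verified with simple arithmetic; the helper names, channel counts, and the channel-wise kernel size d_c = 8 below are illustrative assumptions.

```python
# Hedged sketch checking the parameter counts for a group convolution
# followed by a group channel-wise fusion layer.

def grouped_fusion_params(m, n, g, d_c):
    group_conv = (m // g) * (n // g) * g   # g independent 1x1 convolutions
    fusion = d_c * g                       # g channel-wise kernels of size d_c
    return group_conv + fusion             # = m*n/g + d_c*g

def plain_1x1_params(m, n):
    return m * n

m, n, d_c = 512, 512, 8   # d_c is an assumed channel-wise kernel size
for g in (2, 4):
    assert d_c >= g       # kernel must cover the group stride (d_c >= g)
    print(g, grouped_fusion_params(m, n, g, d_c), plain_1x1_params(m, n))
```

With g = 2 the combined layer uses 131,088 parameters against 262,144 for the plain 1×1 convolution: almost exactly the 1/g saving, since the d_c × g fusion term is negligible.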
3.3 DepthWise Separable ChannelWise Convolutions
Based on the above descriptions, it is worth noting that there is a special case where the number of groups and the number of input and output channels are equal, i.e., g = m = n. A similar scenario resulted in the development of depthwise convolutions Howard et al. (2017); Chollet (2016). In this case, there is only one feature map in each group, and the group convolution simply scales the convolutional kernels in the depthwise convolution. As the batch normalization Ioffe and Szegedy (2015) in each layer already involves a scaling term, the group convolution becomes redundant and can be removed. Meanwhile, instead of using g independent channel-wise convolutions with a stride of g as the fusion layer, we apply a single channel-wise convolution with a stride of 1. Due to the removal of the group convolution, the channel-wise convolution directly follows the depthwise convolution, resulting in the depthwise separable channel-wise convolution, as illustrated in Figure 1(d).

In essence, the depthwise separable channel-wise convolution replaces the 1×1 convolution in the depthwise separable convolution with a channel-wise convolution. The connections among channels are changed directly from a dense pattern to a sparse one. As a result, the number of parameters is m × d_k × d_k + d_c, and the cost is (m × d_k × d_k + d_c × m) × d_f × d_f, which saves dramatic amounts of parameters and computation. This layer can be used to directly replace the depthwise separable convolution.
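The savings of this replacement are easy to quantify; again m = 512, d_k = 3, and d_c = 8 are assumed sizes for illustration.

```python
# Hedged sketch: parameter counts of a depthwise separable convolution
# vs. the proposed depthwise separable channel-wise convolution (m = n).

def dws_params(m, d_k):
    return d_k * d_k * m + m * m   # depthwise conv + dense 1x1 conv

def dws_channelwise_params(m, d_k, d_c):
    return d_k * d_k * m + d_c     # depthwise conv + channel-wise conv

m, d_k, d_c = 512, 3, 8            # assumed layer and kernel sizes
print(dws_params(m, d_k), dws_channelwise_params(m, d_k, d_c))
```

The m × m channel-fusion term (262,144 parameters) collapses to a single shared kernel of d_c = 8 weights, so the layer shrinks from 266,752 to 4,616 parameters.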
3.4 Convolutional Classification Layer
Most prior model compression methods pay little attention to the very last layer of CNNs, a fully-connected layer used to generate the classification results. Taking MobileNets on the ImageNet dataset as an example, this layer takes a 1024-component feature vector as input and produces 1000 logits corresponding to the 1000 classes. Therefore, the number of parameters is about 1 million (1024 × 1000), which accounts for 24.33% of total parameters as reported in Howard et al. (2017). In this section, we explore a special application of the depthwise separable channel-wise convolution, proposed in Section 3.3, to reduce the large number of parameters in the classification layer.

We note that the second-to-last layer is usually a global average pooling layer, which reduces the spatial size of feature maps to 1 × 1. For example, in MobileNets, the global average pooling layer transforms 7 × 7 × 1024 input feature maps into 1 × 1 × 1024 output feature maps, corresponding to the 1024-component feature vector fed into the classification layer. In general, suppose the spatial size of the input feature maps is d_f × d_f. The global average pooling layer is equivalent to a special depthwise convolution with a kernel size of d_f × d_f, where all the weights in the kernel are fixed to 1/(d_f × d_f). Meanwhile, the following fully-connected layer can be considered as a 1×1 convolution, as the input feature vector can be viewed as 1024 feature maps with a spatial size of 1 × 1. Thus, the global average pooling layer followed by the fully-connected classification layer is a special depthwise convolution followed by a 1×1 convolution, resulting in a special depthwise separable convolution.
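The equivalence between global average pooling and a fixed-weight depthwise convolution can be checked numerically; the 7 × 7 feature map below is a made-up example.

```python
# Hedged sketch: global average pooling over a d_f x d_f map equals a
# depthwise convolution whose d_f x d_f kernel has all weights 1/(d_f*d_f).

d_f = 7
fmap = [[float(i * d_f + j) for j in range(d_f)] for i in range(d_f)]

# Global average pooling.
gap = sum(sum(row) for row in fmap) / (d_f * d_f)

# Depthwise convolution with every kernel weight fixed to 1/(d_f*d_f).
w = 1.0 / (d_f * d_f)
conv = sum(w * fmap[i][j] for i in range(d_f) for j in range(d_f))

assert abs(gap - conv) < 1e-9
print(gap)
```

Both computations produce the same scalar per channel, which is what licenses viewing the pooling-plus-classifier pair as a special depthwise separable convolution.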
As the proposed depthwise separable channel-wise convolution can directly replace the depthwise separable convolution, we attempt to apply the replacement here. Specifically, the same special depthwise convolution is employed, but it is followed by a channel-wise convolution with a kernel size of d_c whose number of output channels is equal to the number of classes. However, we observe that these two operations can be further combined into a single regular 3D convolution Ji et al. (2013).

In particular, the input feature maps can be viewed as a single 3D feature map with a size of d_f × d_f × m. The special depthwise convolution, or equivalently the global average pooling layer, is essentially a 3D convolution with a kernel size of d_f × d_f × 1, where all the weights in the kernel are fixed to 1/(d_f × d_f). Moreover, in this view, the channel-wise convolution is a 3D convolution with a kernel size of 1 × 1 × d_c. These two consecutive 3D convolutions follow a factorized pattern. As proposed in Szegedy et al. (2016), a d_k × d_k convolution can be factorized into two consecutive convolutions with kernel sizes of d_k × 1 and 1 × d_k, respectively. Reversing this factorization, we combine the two 3D convolutions into a single one with a kernel size of d_f × d_f × d_c. Suppose there are n classes. To ensure that the number of output channels equals the number of classes, d_c is set to m − n + 1 with no padding on the input. This 3D convolution is used to replace the global average pooling layer followed by the fully-connected layer, serving as a convolutional classification layer.
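The size bookkeeping above can be sketched with the MobileNet-like numbers from the text (m = 1024 channels, n = 1000 classes, d_f = 7); the variable names are our own.

```python
# Hedged sketch of the convolutional classification layer's dimensions
# and parameter count, compared with the fully-connected classifier.

m, n, d_f = 1024, 1000, 7

# Kernel depth so that a 3D convolution with no channel padding
# slides over m channels and yields exactly n output positions.
d_c = m - n + 1                    # 1024 - 1000 + 1 = 25

fc_params = m * n                  # fully-connected classification layer
conv_params = d_f * d_f * d_c      # one shared 3D kernel of d_f x d_f x d_c

print(d_c, fc_params, conv_params)
```

The single shared kernel has only 7 × 7 × 25 = 1,225 weights, versus 1,024,000 for the fully-connected layer, which is the roughly 1-million-parameter saving claimed for ChannelNet-v3.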
While the convolutional classification layer dramatically reduces the number of parameters, there is a concern that it may cause a significant loss in performance. In the fully-connected classification layer, each prediction is based on the entire feature vector, taking all features into consideration. In contrast, in the convolutional classification layer, the prediction for each class uses only d_c of the features. However, our experiments show that the weight matrix of the fully-connected classification layer is very sparse, indicating that only a small number of features contribute to the prediction of a class. Meanwhile, our ChannelNets with the convolutional classification layer achieve much better results than other models with similar amounts of parameters.
3.5 ChannelNets
With the proposed group channel-wise convolutions, depthwise separable channel-wise convolutions, and the convolutional classification layer, we build our ChannelNets. We follow the basic architecture of MobileNets to allow fair comparisons and design three ChannelNets with different compression levels. Notably, our proposed methods are orthogonal to MobileNetV2 Sandler et al. (2018); as with MobileNets, we can apply our methods to MobileNetV2 to further reduce its parameters and computational cost. The details of the network architectures are shown in Table 4 in the supplementary material.
ChannelNet-v1: To employ group channel-wise convolutions, we design two basic modules: the group module (GM) and the group channel-wise module (GCWM). They are illustrated in Figure 3. GM simply applies group convolutions instead of 1×1 convolutions and adds a residual connection He et al. (2016). As analyzed above, GM saves computation but suffers from the information inconsistency problem. GCWM addresses this limitation by inserting a group channel-wise convolution after the second group convolution to achieve information fusion. Either module can be used to replace two consecutive depthwise separable convolutional layers in MobileNets. In our ChannelNet-v1, we choose to replace depthwise separable convolutions with larger numbers of input and output channels. Specifically, six consecutive depthwise separable convolutional layers with 512 input and output channels are replaced by two GCWMs followed by one GM. In these modules, we set the number of groups to 2. The total number of parameters in ChannelNet-v1 is about 3.7 million.

ChannelNet-v2: We apply depthwise separable channel-wise convolutions to ChannelNet-v1 to further compress the network. The last depthwise separable convolutional layer has 1024 input channels and 1024 output channels. We use the depthwise separable channel-wise convolution to replace this layer, leading to ChannelNet-v2. The number of parameters reduced by this replacement of a single layer is about 1 million, a substantial fraction of the 3.7 million total parameters in ChannelNet-v1.
ChannelNet-v3: We employ the convolutional classification layer on ChannelNet-v2 to obtain ChannelNet-v3. For the ImageNet classification task, the number of classes is 1000, which means the number of parameters in the fully-connected classification layer is about 1 million (1024 × 1000). Since the number of parameters in the convolutional classification layer is only about 1 thousand (7 × 7 × 25), ChannelNet-v3 reduces approximately 1 million parameters.
4 Experimental Studies
In this section, we evaluate the proposed ChannelNets on the ImageNet ILSVRC 2012 image classification dataset Deng et al. (2009), which has served as the benchmark for model compression. We compare different versions of ChannelNets with other compact CNNs. Ablation studies are also conducted to show the effect of group channelwise convolutions. In addition, we perform an experiment to demonstrate the sparsity of weights in the fullyconnected classification layer.
4.1 Dataset
The ImageNet ILSVRC 2012 dataset contains 1.28 million training images and 50 thousand validation images. Each image is labeled with one of 1000 classes. We follow the same data augmentation process as in He et al. (2016). Images are scaled to 256 × 256. Randomly cropped 224 × 224 patches are used for training. During inference, 224 × 224 center crops are fed into the networks. To compare with other compact CNNs Howard et al. (2017); Zhang et al. (2017), we train our models on the training images and report accuracies computed on the validation set, since the labels of the test images are not publicly available.
4.2 Experimental Setup
We train our ChannelNets using the same settings as those for MobileNets, except for a minor change: for depthwise separable convolutions, we remove the batch normalization and activation function between the depthwise convolution and the 1×1 convolution. We observe that this has no influence on performance while accelerating training. For the proposed GCWMs and depthwise separable channel-wise convolutions, the kernel sizes of the channel-wise convolutions are fixed in advance (for GCWMs, satisfying d_c ≥ g as required in Section 3.2). In the convolutional classification layer, the kernel size of the 3D convolution is 7 × 7 × 25. All models are trained using the stochastic gradient descent optimizer with a momentum of 0.9 for 80 epochs, with the learning rate decayed at the 45th, 60th, 65th, 70th, and 75th epochs. Dropout Srivastava et al. (2014) is applied after 1×1 convolutions. We train on 4 TITAN Xp GPUs.

4.3 Comparison of ChannelNet-v1 with Other Models
Table 1: Comparison between ChannelNet-v1 and other CNNs on ImageNet.

Models | Top-1 | Params | FLOPs
GoogLeNet | 0.698 | 6.8m | 1550m
VGG16 | 0.715 | 128m | 15300m
AlexNet | 0.572 | 60m | 720m
SqueezeNet | 0.575 | 1.3m | 833m
1.0 MobileNet | 0.706 | 4.2m | 569m
ShuffleNet 2x | 0.709 | 5.3m | 524m
ChannelNet-v1 | 0.705 | 3.7m | 407m
We compare ChannelNet-v1 with other CNNs, including both regular and compact networks, in terms of top-1 accuracy, the number of parameters, and the computational cost in FLOPs. The results are reported in Table 1. ChannelNet-v1 is the most compact and efficient network, achieving the best trade-off between efficiency and accuracy.
We can see that SqueezeNet Iandola et al. (2016) has the smallest model size. However, it is even slower than AlexNet, and its accuracy is not competitive with the other compact CNNs. By replacing depthwise separable convolutions with GMs and GCWMs, ChannelNet-v1 achieves nearly the same performance as 1.0 MobileNet with about 12% fewer parameters and about 28% fewer FLOPs. Here, the prefix (e.g., 1.0) denotes the width multiplier in MobileNets, which is used to control the width of the networks. MobileNets with different width multipliers are compared with ChannelNets under similar compression levels in Section 4.4. ShuffleNet 2x obtains a slightly better performance, but it employs a much deeper network architecture, resulting in even more parameters than MobileNet. This is because more layers are required when using shuffling layers to address the information inconsistency problem in group convolutions, so the advantage of using group convolutions is compromised. In contrast, our group channel-wise convolutions overcome the problem without additional layers, as shown by the experiments in Section 4.5.
4.4 Comparison of ChannelNets with Models Using Width Multipliers
Table 2: Comparison between ChannelNets and models obtained using width multipliers.

Models | Top-1 | Params
0.75 MobileNet | 0.684 | 2.6m
0.75 ChannelNet-v1 | 0.678 | 2.3m
ChannelNet-v2 | 0.695 | 2.7m
0.5 MobileNet | 0.637 | 1.3m
0.5 ChannelNet-v1 | 0.627 | 1.2m
ChannelNet-v3 | 0.667 | 1.7m
The width multiplier was proposed in Howard et al. (2017) to make the network architecture thinner by reducing the number of input and output channels in each layer, thereby increasing the compression level. This approach simply compresses every layer by the same factor. Note, however, that most parameters lie in the deep layers of the model. Hence, reducing the widths of shallow layers does not lead to significant compression, but it hinders model performance, since it is important to maintain the number of channels in the shallow part of deep models. Our ChannelNets explore a different way to achieve higher compression levels by replacing the deepest layers in CNNs. Remarkably, ChannelNet-v3 is the first compact network that attempts to compress the last layer, i.e., the fully-connected classification layer.
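To see why uniform width scaling compresses mainly the channel-fusion terms, note that a multiplier alpha shrinks both m and n, so the dominant m × n terms scale roughly by alpha squared; the layer sizes below are illustrative assumptions.

```python
# Hedged sketch: effect of a width multiplier alpha on the dominant
# m*n term of a 1x1 convolution.

def scaled_1x1_params(m, n, alpha):
    """Parameters of a 1x1 conv after scaling both channel counts by alpha."""
    return int(alpha * m) * int(alpha * n)

m, n = 512, 1024   # an assumed deep layer
for alpha in (1.0, 0.75, 0.5):
    print(alpha, scaled_1x1_params(m, n, alpha))
```

A multiplier of 0.5 cuts this term by 4x, but since the same alpha is applied to shallow layers with few parameters, the accuracy cost there buys little compression, which is the asymmetry the paragraph above describes.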
We perform experiments to compare ChannelNet-v2 and ChannelNet-v3 with compact CNNs obtained using width multipliers. The results are shown in Table 2. We apply width multipliers to both MobileNet and ChannelNet-v1 to illustrate their impact. To make the comparison fair, compact networks with similar compression levels are compared together. Specifically, we compare ChannelNet-v2 with 0.75 MobileNet and 0.75 ChannelNet-v1, since their numbers of total parameters are all at the 2.x million level. For ChannelNet-v3, 0.5 MobileNet and 0.5 ChannelNet-v1 are used for comparison, as all of them contain 1.x million parameters.
We can observe from the results that ChannelNet-v2 outperforms 0.75 MobileNet with an absolute 1.1% gain in accuracy, which demonstrates the effect of our depthwise separable channel-wise convolutions. In addition, note that replacing depthwise separable convolutions with depthwise separable channel-wise convolutions is more flexible than applying width multipliers: it affects only one layer, as opposed to all layers in the network. ChannelNet-v3 outperforms 0.5 MobileNet by a significant 3.0% in accuracy, showing that our convolutional classification layer can retain accuracy to a large extent while increasing the compression level. The results also show that applying width multipliers to ChannelNet-v1 leads to poor performance.
4.5 Ablation Study on Group ChannelWise Convolutions
Table 3: Ablation study on group channel-wise convolutions.

Models | Top-1 | Params
ChannelNet-v1(-) | 0.697 | 3.7m
ChannelNet-v1 | 0.705 | 3.7m
To demonstrate the effect of our group channel-wise convolutions, we conduct an ablation study on ChannelNet-v1. Based on ChannelNet-v1, we replace the two GCWMs with GMs, thereby removing all group channel-wise convolutions. The resulting model is denoted as ChannelNet-v1(-). It follows exactly the same experimental setup as ChannelNet-v1 to ensure fairness. Table 3 compares ChannelNet-v1(-) with ChannelNet-v1. ChannelNet-v1 outperforms ChannelNet-v1(-) by an absolute 0.8% in accuracy, which is significant given that the group channel-wise convolutions add only a negligible number of parameters. Therefore, group channel-wise convolutions are extremely efficient and effective information fusion layers for solving the problem incurred by group convolutions.
4.6 Sparsity of Weights in FullyConnected Classification Layers
In ChannelNet-v3, we replace the fully-connected classification layer with our convolutional classification layer. Each prediction is then based on only 25 features instead of all 1024 features, which raises a concern of potential performance loss. To investigate this further, we analyze the weight matrix of the fully-connected classification layer of ChannelNet-v1, as shown in Figure 4 in the supplementary material. The analysis shows that the weights are sparsely distributed over the weight matrix, indicating that each prediction makes use of only a small number of features, even with the fully-connected classification layer. Based on this insight, we propose the convolutional classification layer and ChannelNet-v3. As shown in Section 4.4, ChannelNet-v3 is highly compact and efficient with promising performance.
5 Conclusion and Future Work
In this work, we propose channel-wise convolutions to perform model compression by replacing dense connections among channels in deep networks with sparse ones. We build a new family of compact and efficient CNNs, known as ChannelNets, using three instances of channel-wise convolutions: group channel-wise convolutions, depthwise separable channel-wise convolutions, and the convolutional classification layer. Group channel-wise convolutions are used together with group convolutions as information fusion layers. Depthwise separable channel-wise convolutions can directly replace depthwise separable convolutions. The convolutional classification layer is the first attempt in the field of model compression to compress the fully-connected classification layer. Compared to prior methods, ChannelNets achieve a better trade-off between efficiency and accuracy. The current study evaluates the proposed methods on image classification tasks, but they can be applied to other tasks, such as detection and segmentation. We plan to explore these applications in the future.
References

Chen et al. [2015] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.

Chollet [2016] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.

Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.

Han et al. [2015] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations, 2015.

He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Iandola et al. [2016] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.

Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Jaderberg et al. [2014] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.

Ji et al. [2013] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.

Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Lebedev et al. [2014] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.

LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

Lin et al. [2013] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Rastegari et al. [2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.

See et al. [2016] Abigail See, Minh-Thang Luong, and Christopher D Manning. Compression of neural machine translation models via pruning. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 291–301, 2016.

Sifre and Mallat [2014] Laurent Sifre and Stéphane Mallat. Rigid-motion scattering for image classification. PhD thesis, 2014.

Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. Proceedings of the International Conference on Learning Representations, 2015.

Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

Wu et al. [2016] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.

Zhang et al. [2017] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.