I Introduction and Motivation
Convolutional Neural Networks (CNNs) play a significant role in driving a variety of technologies including computer vision [he2016deep, girshick2014rich, krizhevsky2012imagenet], natural language processing (NLP) [young2018recent], and speech recognition [abdel2014convolutional]. In an effort to improve classification performance, CNN sizes have grown dramatically from 1 million (1M) trainable parameters in 1998 [lecun1998mnist], to 60M in 2012 [krizhevsky2012imagenet], to more recent networks with 10 billion trainable parameters [coates2013deep]. Complexity has become an important consideration as CNNs have come into use in practical systems. In particular, storage complexity, computational complexity, and energy consumption all matter in both training and inference modes. As the size of datasets increases and the range of inference tasks broadens, training complexity will continue to be a concern. Similarly, many emerging applications require inference processing of trained networks on edge computational platforms that are severely constrained in terms of storage, computation, and energy consumption [hegde2018ucnn].
The methods proposed to address the issue of large models can be categorized into three general areas. One general approach is pre-defined constrained filter design, wherein the standard CNN filter kernels are constrained in some fashion to reduce the computational and storage complexity. Primary examples in the category are MobileNet and ShuffleNet which rely on separable and grouped filter kernels, respectively. A second group of techniques is pruning during training, wherein parameters are removed from the model as training is performed to produce a low-complexity trained model for inference. A third category for complexity reduction is weight quantization, wherein the trained weights are grouped and/or quantized to save storage, possibly including an iterative training procedure [courbariaux2015binaryconnect, zhou2017incremental, rastegari2016xnor].
An important distinction between the pre-defined constrained filter approach and the pruning during training approach is that only the former can be used to significantly reduce the complexity of training. Furthermore, architectural implementation can be considered in pre-defined constrained filters. For example, structure can be imposed so that storage and retrieval of the weights is simplified in software or hardware implementations. Pruning, in contrast, typically results in patterns of connectivity that are not structured or predictable [han2015learning, guo2016dynamic, mao2017exploring]. Moreover, pruning has been shown to be more effective in the fully-connected layers than in the convolutional layers [wen2016learning] whereas the pre-defined constrained filter design approach directly implements complexity reduction in the convolutional layers. Weight quantization is a more general optimization method that can be combined with the concept of pruning as well as pre-defined constrained filter design. In all cases of parameter reduction, one is typically concerned with understanding the trade-off between complexity reduction and inference performance.
In this paper, we propose a new method of model complexity reduction in the category of pre-defined constrained filter design approaches: pre-defined Sparse Convolutional (pSConv) layers. The approach is to pre-define sparse patterns in the 2D filter kernels. For example, if a 3 × 3 × 20 filter is considered in a standard CNN layer (i.e., 3 × 3 2D kernels combined across 20 input channels), we would set, for example, 5 of the 9 coefficients in each channel plane to be zero. These zero locations are defined before training (the precise pattern used is chosen pseudo-randomly in this work) and are held fixed during the training and inference processes. This work extends our previous investigation into using pre-defined sparsity in fully-connected layers [dey2019pre] and demonstrates similar benefits for CNN layers. While efficacy in fully-connected layers motivated this work, it is not obvious that pre-defined sparsity would be effective in CNNs because CNNs, by definition, are sparse in a spatial sense. In this paper, we demonstrate the effectiveness of pSConv layers in variants of ResNet [he2016deep] and VGG [simonyan2014very] on the CIFAR-10 [krizhevsky2009learning] and Tiny ImageNet [le2015tiny] datasets. Most notably, we have reproduced ShuffleNet [zhang2018shufflenet] for the ResNet experiments, and the proposed pSConv CNNs consistently outperform ShuffleNet in terms of the complexity-performance trade-off. Specifically, pSConv networks with complexity comparable to ShuffleNet typically outperform ShuffleNet by 4-5% in absolute accuracy or 6-7% in relative accuracy. ShuffleNet has been shown to outperform MobileNet [howard2017mobilenets] under similar trade-off metrics, which implies that the pre-defined sparse convolutional approach is an attractive state-of-the-art model complexity reduction method for both training and inference.
Furthermore, we observe complexity reductions of up to 70% with negligible degradation in accuracy relative to standard convolution based CNNs.
The remainder of the paper is organized as follows. In Section II, we describe pre-defined constrained filter design methods proposed in the literature and describe how MobileNet and ShuffleNet are designed using these tools. Section III describes our proposed pSConv approach and includes a discussion of considerations for realizing the complexity reduction advantages in practice as well as expressions characterizing the computational complexity (in floating point operations (FLOPs)) for ShuffleNet and pSConv networks. Experimental results are presented in Section IV while conclusions and suggestions for further research are summarized in Section V.
II Related Work and Background
In subsection A of this section, we describe various constraints on convolutional filters and in subsection B we describe how these approaches are combined to produce various efficient network architectures, including MobileNet and ShuffleNet.
II-A Pre-defined Constrained Filters
Convolutional filters (CONVs) in CNN architectures can be broadly classified into four categories based on their constraints, as shown in Fig. 1: (a) standard full channel (SFCC) [lecun1999object], (b) depth-wise (DWC) [vanhoucke2014learning], (c) group-wise (GWC) [krizhevsky2012imagenet], and (d) point-wise (PWC) [szegedy2015going] convolutions. In all cases illustrated, the input feature map (IFM) dimension is assumed to be , where is the channel depth. In Fig. 1(a), an SFCC filter of dimension produces a single channel of the output feature map (OFM) of size , after being convolved with the IFM. To obtain multiple channels in the output feature map, multiple such filters are used. For the DWC shown in Fig. 1(b), each 2D kernel () is convolved with a single channel of the IFM to produce the corresponding OFM channel; thus 2D kernels will produce an OFM of dimension . This requires times fewer computations than SFCC, but the output features capture no information across channels.
Group-wise convolution, which provides a compromise between SFCC and DWC, is shown in Fig. 1(c). A single channel of the OFM is computed by convolving a group of channels from the IFM with channels, with a CONV filter of size . Thus, with a total number of groups , the same CONV filter of dimension provides an OFM of size . With a channel per group of one, the GWC based approach reduces to DWC, while GWC with a single group is equivalent to SFCC. Typically, the number of groups is within the set and the best choice is network architecture dependent [ioannou2017deep]. In fact, in many cases, the optimal number depends on the location of the filter within the network [ioannou2017deep].
Finally, Fig. 1(d) illustrates PWC in which the 2D kernel dimension is , thus generating a single OFM channel using far less computation. For example, compared to a 2D kernel dimension, the PWC has less computational complexity. However, output feature points generated through this approach do not contain any embedded information within a channel.
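The relative costs of the four filter types can be checked with a short calculation. The sketch below is an illustrative accounting, not the paper's own expressions: it assumes a k × k 2D kernel and an IFM with c channels (both symbols are chosen here for illustration) and counts multiply-accumulate operations for one output point of one OFM channel.

```python
def macs_per_output_point(kind, k, c, channels_per_group=None):
    """Multiply-accumulates for one output point of one OFM channel.

    kind: "sfcc", "dwc", "gwc", or "pwc"
    k: side length of the 2D kernel (illustrative symbol)
    c: number of IFM channels (illustrative symbol)
    channels_per_group: IFM channels seen by one group-wise filter
    """
    if kind == "sfcc":
        return k * k * c            # every channel, full k x k window
    if kind == "dwc":
        return k * k                # a single channel, full k x k window
    if kind == "gwc":
        if channels_per_group is None:
            raise ValueError("gwc needs channels_per_group")
        return k * k * channels_per_group
    if kind == "pwc":
        return c                    # every channel, 1 x 1 window
    raise ValueError(f"unknown filter kind: {kind}")


k, c = 3, 64
sfcc = macs_per_output_point("sfcc", k, c)
print("DWC saving factor:", sfcc // macs_per_output_point("dwc", k, c))
print("PWC saving factor:", sfcc // macs_per_output_point("pwc", k, c))
```

Under these assumptions the depth-wise saving equals the channel depth and the point-wise saving equals the number of entries in the 2D kernel, while GWC interpolates between DWC (one channel per group) and SFCC (a single group).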
II-B Efficient Network Architectures
Many well-known network architectures use a combination of the approaches shown in Fig. 1 that have been empirically evaluated and optimized through numerous experiments. A combination of GWC and PWC was used in [ioannou2017deep] and in the Inception module [szegedy2015going, szegedy2017inception]. ResNeXt [xie2017aggregated] was recently proposed based on modifications of ResNet [he2016deep] where each layer was replaced with a combination of GWC and PWC. MobileNet [howard2017mobilenets, sandler2018mobilenetv2], a popular low-complexity architecture designed for mobile devices, replaces an SFCC layer with a DWC followed by a PWC. For example, consider an input feature map with eight channels that is mapped to a sixteen-channel output feature map with the same spatial dimensions. With a standard CONV layer, this would take sixteen CONV filters, each 3 × 3 × 8. In a MobileNet-like layer, the same input and output feature map dimensions would be obtained by first using eight 3 × 3 DWC convolutions, followed by sixteen 1 × 1 × 8 point-wise convolutions. Note that in this example the standard CONV layer has 1,152 parameters (16 · 3 · 3 · 8) and the MobileNet-like layer has only 200 parameters (8 · 9 + 16 · 8). ShuffleNet [zhang2018shufflenet] uses a combination of GWC, a channel shuffling layer, and then DWC to provide more information flow across channels. Continuing with the above example, consider a ShuffleNet-like layer with CONV filters of four groups. Each group comprises two channels, so that a CONV filter is made up of four 3 × 3 × 2 group filters. Performing the associated GWC yields four output channels. Using four such CONV filters yields 16 channels, which are each convolved with a different 3 × 3 filter (i.e., DWC) to produce the final OFM. Thus it would have only 432 parameters (288 for the GWC plus 144 for the DWC).
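The parameter counts in this example (1,152, 200, and 432) can be reproduced with a few lines of arithmetic. The following is a minimal sketch that assumes 3 × 3 kernels and ignores bias terms; the function names are hypothetical and introduced only for illustration.

```python
def sfcc_params(c_in, c_out, k):
    # c_out filters, each k x k x c_in
    return c_out * c_in * k * k


def mobilenet_like_params(c_in, c_out, k):
    depthwise = c_in * k * k      # one k x k kernel per input channel
    pointwise = c_out * c_in      # c_out filters, each 1 x 1 x c_in
    return depthwise + pointwise


def shufflenet_like_params(c_in, c_out, k, groups):
    # Group-wise stage: every output channel sees only c_in / groups channels.
    groupwise = c_out * (c_in // groups) * k * k
    # Depth-wise stage: one k x k kernel per channel produced by the GWC.
    depthwise = c_out * k * k
    return groupwise + depthwise


print(sfcc_params(8, 16, 3))                # standard CONV layer
print(mobilenet_like_params(8, 16, 3))      # DWC followed by PWC
print(shufflenet_like_params(8, 16, 3, 4))  # GWC (four groups) followed by DWC
```

The channel shuffle itself is a permutation and adds no trainable parameters, which is why it does not appear in the count.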
Some previous research has indicated that the two-stage processing shown in Fig. 2(a)-(b) may add latency to the computation [singh2019hetconv], which could impact the speed-up achieved in practice. This, however, is highly dependent on the implementation. For example, in an implementation on a multi-core processor (e.g., a GPU), the PWC in Fig. 2(b) cannot proceed until the GWC is completed. In a custom hardware implementation, on the other hand, careful optimization may largely alleviate the performance cost of this bottleneck. Our proposed pre-defined sparse approach, illustrated in Fig. 2(c) and detailed in the next section, avoids this drawback altogether.
III Pre-defined Sparse Convolutional Kernels
We now describe pSConv, our proposed approach to reduce the number of parameters in convolutional neural networks via pre-defined sparse kernels. In pSConv, we exploit the structural sparsity of the input receptive-field (RF) [brendel2019approximating] and propose pre-defined sparse 2D kernel based CONVs to form the channels of each convolutional layer.
III-A Pre-Defined Sparse CONV Filters
Assume a convolutional layer with CONV filters where each filter is of dimension (here, = = ). (Footnote: Typically, in modern convolutional neural networks.) We define a pre-defined sparse CONV filter as one in which some of the elements are fixed to be zero, with this pattern set before training and held fixed throughout training and inference. A regular pre-defined sparse CONV filter has the same kernel support size for each 2D kernel that comprises the CONV filter. Here, the kernel support size (KSS), , is defined to be the number of non-zero weight entries in each 2D kernel of the CONV filter. (Footnote: The kernel support is the set of indices where the kernel is not constrained to be zero.) For example, Fig. 3(a) and (b) illustrate instances of kernels with KSS of 4 and 3, respectively. The specific method for designing the patterns of kernel support (i.e., constrained zero element patterns) in the CONV filter may vary, with implications discussed briefly below.
For example, for a 3 × 3 kernel, the KSS can take any value from 1 to 9, with a KSS of 9 denoting the standard 2D kernel without any pre-defined sparsity, and a KSS of 4 denoting that only four non-zero entries are allowed in the kernel while the remaining five entries are fixed at zero. Furthermore, for a given KSS, the pattern of non-zero entries is pre-defined at the start of the training process. All results presented in this paper were generated using regular pre-defined sparse CONV filters wherein the kernel support (i.e., the pattern defining the sparsity) was selected in a constrained, pseudo-random manner. The enforced constraint ensured that, for each location in the kernel, at least one of the 2D kernels of the CONV filter has a non-zero element there. Thus, this procedure provides patterns for sparse CONV layers with significantly fewer parameters, while the receptive field of the CONV filter remains unchanged. The pSConv concept is illustrated in Fig. 4, where regular pre-defined sparse CONV filters using a KSS of 4 (i.e., 5 zero locations in each kernel) map an IFM to an OFM.
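One possible way to generate such constrained, pseudo-random supports is sketched below. This is an assumption-laden illustration rather than the procedure used for the reported experiments: the function names, the seeding, and the two-step construction (deal every location to some kernel first, then fill each kernel up to the KSS at random) are all hypothetical choices.

```python
import random


def make_kernel_supports(num_kernels, k, kss, seed=0):
    """Return one sorted index list (the kernel support) per 2D kernel.

    Locations of a k x k kernel are numbered 0 .. k*k - 1 in row-major order.
    Every location is guaranteed to be non-zero in at least one kernel.
    """
    n_loc = k * k
    if not 1 <= kss <= n_loc:
        raise ValueError("kss must be between 1 and k*k")
    if num_kernels * kss < n_loc:
        raise ValueError("too few non-zeros to cover every kernel location")
    rng = random.Random(seed)

    # Step 1: deal a shuffled list of all locations round-robin to the kernels,
    # so that each location is covered at least once.
    locations = list(range(n_loc))
    rng.shuffle(locations)
    supports = [set() for _ in range(num_kernels)]
    for i, loc in enumerate(locations):
        supports[i % num_kernels].add(loc)

    # Step 2: fill each kernel up to exactly kss non-zero locations at random.
    for support in supports:
        unused = [loc for loc in range(n_loc) if loc not in support]
        support.update(rng.sample(unused, kss - len(support)))
    return [sorted(s) for s in supports]


def support_to_mask(support, k):
    """Convert a support (list of flat indices) into a k x k 0/1 mask."""
    return [[1 if r * k + c in support else 0 for c in range(k)] for r in range(k)]


supports = make_kernel_supports(num_kernels=20, k=3, kss=4, seed=1)
for row in support_to_mask(supports[0], 3):
    print(row)
```

Every kernel ends up with exactly the requested KSS, so the resulting filter is regular in the sense defined above. The construction does not sample uniformly over all valid patterns; it only guarantees the coverage constraint.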
III-B Training for pSConv
Pre-defined sparsity in CONV filters has the potential to reduce storage and computational complexity, both during training and inference. Storage may be reduced since only the potentially non-zero elements need to be stored. For instance, with a KSS of 3 for a 3 × 3 2D kernel, only one third of the weights need to be stored. During forward processing, only the KSS weights need to be multiplied with the input feature maps. Furthermore, during back-propagation, the gradient flow only needs to pass through the kernel support. Realizing these potential complexity reductions in practice, however, requires careful design of the kernel support patterns and the associated memory and computational resources. In software, for example, the storage and computation demands of linked-list storage and the associated array referencing may eliminate a significant fraction of the advantages associated with sparsity. Similar issues arise in custom hardware acceleration. However, these challenges can be alleviated by designing kernel support patterns algorithmically with a small complexity overhead. For example, a very similar problem was solved in [dey2019pre], where it was also shown that these structured connection patterns typically outperformed randomly generated patterns. Another approach is to utilize more generic sparse tensor product accelerators that may be available in hardware or software libraries [han2016eie]. This paper, however, focuses on assessing the potential efficacy of pSConv layers, and therefore we do not address the above issues. Specifically, our implementations targeted rapid software development for experimentation and did not attempt to exploit sparsity to reduce complexity.
Our specific implementation used PyTorch as the development package. We zeroed out the weights in the locations outside of the kernel support at the start of training and at the end of each mini-batch update; i.e., the weights were first updated without a sparsity assumption, and those outside the kernel support were then reset to zero before the next mini-batch.
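The update-then-reset scheme can be illustrated without any deep learning framework. The sketch below is a simplified, framework-free stand-in (plain SGD on a single flattened 3 × 3 kernel with made-up gradient values), not the PyTorch code used for the experiments; `apply_mask` and `sgd_step_with_mask` are hypothetical helper names.

```python
def apply_mask(weights, mask):
    """Zero every weight that lies outside the kernel support."""
    return [w * m for w, m in zip(weights, mask)]


def sgd_step_with_mask(weights, grads, mask, lr):
    """Dense SGD update followed by a reset of the masked-out locations."""
    updated = [w - lr * g for w, g in zip(weights, grads)]  # no sparsity assumed
    return apply_mask(updated, mask)                        # re-impose the pattern


# Flattened 3 x 3 kernel with a kernel support size of 4 (illustrative values).
mask = [1, 0, 0, 1, 0, 1, 0, 0, 1]
weights = apply_mask([0.5, -0.2, 0.1, 0.3, 0.7, -0.4, 0.2, 0.9, -0.6], mask)
grads = [0.1, 0.2, -0.3, 0.4, -0.5, 0.6, 0.7, -0.8, 0.9]

weights = sgd_step_with_mask(weights, grads, mask, lr=0.1)
print(weights)
```

Because the reset happens after every update, locations outside the support are always zero at the start of the next forward pass, regardless of what the dense gradient step did to them. This keeps the experiment code simple but does not itself save any computation.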
III-C FLOPs Count
The FLOP counts for various CONV filter types are shown in Table I. This assumes a layer with IFM size of and OFM size . This is the number of FLOPs required to perform the forward (i.e., inference) processing. These expressions assume an efficient implementation as discussed above and are therefore ideal. Specifically, the overhead associated with generating addresses and permutations (i.e., in ShuffleNet and pSConv) is not included in these expressions. Note that the reduction in FLOPs for pSConv relative to the standard SFCC layer is simply the kernel density, i.e., the KSS divided by the size of the standard 2D kernel.
|Approach||FLOP Count (Forward, Ideal)|
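The kernel-density relationship can be made concrete with a small calculation. The sketch below assumes the common convention of counting one multiply-accumulate per non-zero weight per output position; the exact expressions in Table I may differ by constant factors, and the layer dimensions used here are arbitrary illustrative values.

```python
from fractions import Fraction


def conv_macs(h_out, w_out, c_in, c_out, weights_per_kernel):
    """Ideal forward multiply-accumulate count of one convolutional layer.

    weights_per_kernel is k*k for a standard (SFCC) layer and the kernel
    support size (KSS) for a pSConv layer.
    """
    return h_out * w_out * c_in * c_out * weights_per_kernel


k, kss = 3, 4
standard = conv_macs(32, 32, 64, 64, k * k)
sparse = conv_macs(32, 32, 64, 64, kss)

density = Fraction(sparse, standard)
print("kernel density:", density)                    # KSS / k^2
print("ideal FLOPs saved: %.1f%%" % (100 * (1 - density)))
```

The ratio depends only on the KSS and the kernel size, not on the feature map or channel dimensions, which is why a single density figure characterizes the ideal per-layer saving.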
IV Experimental Evaluations
In this section we first give an overview of the experimental settings, and then discuss the results in detail for each experiment.
IV-A Datasets, Architectures, and Hyperparameters
We use CIFAR-10 and Tiny ImageNet, two popular image classification datasets, for our evaluation. Both of these datasets comprise three-channel color images. CIFAR-10 has 10 distinct classes, and each image has a size of 32 × 32 × 3 (input height × width × number of channels). Tiny ImageNet has 200 classes, and each image has a size of 64 × 64 × 3.
We evaluate the benefits of pSConv on two state of the art neural network architectures – ResNet18 [he2016deep] and VGG16 [simonyan2014very]. We defer the details of the architectures until the associated subsections that follow. For both ResNet18 and VGG16, we modified the level of sparsity in all of the 2D kernels using the constrained, pseudo-random method described in Section III, while the fully-connected and down-sampling layers are not modified.
In each experiment, the number of training epochs is. We use an initial learning rate of with a step decay of after and
epoch. We use the stochastic gradient descent optimizer with a momentum of and weight decay of for all experiments. For CIFAR-10 we use 40,000 training images with 10,000 images each for validation and test, and the training batch size is . For Tiny ImageNet we use a training dataset of 100,000 images and 5,000 images each for validation and test. The batch size for training is set to . The reported results for each experiment are from a single training run. In the following tables and figures, we denote the networks explored by the base network name followed by "pSC" and the kernel support size. For example, ResNet18pSC4 denotes the ResNet18 architecture with 4 non-zeros (5 zeros) in each 2D kernel.
IV-B Experiments with ResNet18
Table II: ResNet18 variants and ShuffleNet on CIFAR-10.
| Model | Test Acc (%) | FLOPs | FLOPs Reduced (%) | Parameters | Parameters Reduced (%) |
|---|---|---|---|---|---|
| ResNet18pSC9 | 91.3 | 0.559 G | — | 11.17 M | — |
| ResNet18pSC4 | 91.5 | 0.227 G | 59.4 | 5.07 M | 54.6 |
| ResNet18pSC2 | 90.8 | 0.117 G | 79.1 | 2.63 M | 76.5 |
| ResNet18HCpSC9 | 89.8 | 0.136 G | 75.7 | 2.8 M | 74.9 |
| ResNet18HCpSC4 | 89.3 | 0.056 G | 89.9 | 1.27 M | 88.6 |
| ResNet18HCpSC2 | 88.3 | 0.030 G | 94.6 | 0.66 M | 94.1 |
| ShuffleNet | 85.0 | 0.098 G | 82.5 | 2.16 M | 80.7 |
Table III: ResNet18 variants and ShuffleNet on Tiny ImageNet.
| Model | Test Acc (%) | FLOPs | FLOPs Reduced (%) | Parameters | Parameters Reduced (%) |
|---|---|---|---|---|---|
| ResNet18pSC9 | 61.3 | 2.22 G | — | 11.58 M | — |
| ResNet18pSC4 | 60.7 | 0.902 G | 59.4 | 5.47 M | 52.8 |
| ResNet18pSC2 | 59.4 | 0.463 G | 79.1 | 3.03 M | 73.8 |
| ResNet18HCpSC9 | 58.5 | 0.561 G | 74.7 | 3.00 M | 74.1 |
| ResNet18HCpSC4 | 57.3 | 0.228 G | 89.7 | 1.47 M | 87.3 |
| ResNet18HCpSC2 | 54.5 | 0.117 G | 94.7 | 0.863 M | 92.6 |
| ShuffleNet | 54.3 | 0.390 G | 82.4 | 3.37 M | 70.9 |
convolution layer with stride of 2 to match the dimension before adding. The output of the final basic block is flattened and fed into a fully connected layer with softmax, which has 10 output neurons for CIFAR-10 and 200 output neurons for Tiny ImageNet.
For the standard ResNet18, the number of output channels in the first computational layer of 2 basic blocks is 64, and it increases to 128, 256, and 512 for the second, third, and fourth computational layers, respectively. In order to more closely match the parameter count of ShuffleNet, we also use a modified version of ResNet18 with a channel width multiplier of 0.5. This modified version therefore uses half the number of channels in each of these computational layers (i.e., 32, 64, 128, 256). We refer to this modified ResNet18 as "half-channel" ResNet18, i.e., ResNet18HC. For our work, we use a variant of ShuffleNet with a group size of 3 and stage channel depths of 384, 768, and 1536, which has a parameter count comparable to that of ResNet18HCpSC9.
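The effect of the width multiplier on parameter count follows from the fact that a convolutional layer's parameters scale with the product of its input and output channel counts. The sketch below is a rough illustration that assumes 3 × 3 kernels and looks only at one layer per stage mapping a stage width to itself; it is not a full ResNet18 parameter count.

```python
def conv_params(c_in, c_out, k=3):
    """Weights in a standard convolutional layer (bias terms ignored)."""
    return c_in * c_out * k * k


full_widths = [64, 128, 256, 512]
half_widths = [int(w * 0.5) for w in full_widths]   # width multiplier of 0.5

for full, half in zip(full_widths, half_widths):
    ratio = conv_params(full, full) / conv_params(half, half)
    print(f"{full:4d} -> {half:4d} channels: {ratio:.0f}x fewer weights per layer")
```

Halving both the input and output widths cuts each such layer by a factor of four, which is consistent with the reported drop from 11.17 M to 2.8 M parameters for ResNet18HCpSC9 on CIFAR-10.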
We first present our results on CIFAR-10. In Fig. 6, we plot the validation accuracy vs. epoch for ShuffleNet and ResNet18HC combined with pSConv. It shows that both ResNet18HC variants perform better at all the checkpoint epochs. Also, it is clear from Table II that although ResNet18HCpSC4 (with only 4 non-zeros in each 2D kernel) and ResNet18HCpSC2 (with only 2 non-zeros in each 2D kernel) have fewer parameters and FLOPs, they have significantly better test accuracy (89.3% and 88.3%, respectively) than ShuffleNet (85.0%).
In Fig. 7, we plot validation accuracy vs. epoch for the standard ResNet18, ResNet18HC, and ResNet18 combined with pSConv. We see a similar trend in validation accuracy improvement (including the jumps at epochs 40 and 55) for all the variants.
Next we discuss our experimental results with Tiny ImageNet. In Fig. 8, we plot the validation accuracy vs. epoch for ShuffleNet and ResNet18HC combined with pSConv. Table III shows the comparison of FLOPs, parameter count, and test accuracy. It is noteworthy that ResNet18HCpSC2 (with only 2 non-zeros in each 2D kernel) provides accuracy similar to that of ShuffleNet with a 3.3× lower FLOP count.
Fig. 9 illustrates the validation accuracy vs. epoch for the standard ResNet18, ResNet18HC, and ResNet18 combined with pSConv. Clearly, ResNet18pSC2 and ResNet18pSC4 follow a validation accuracy trend similar to that of the standard ResNet18 and ResNet18HC. Furthermore, as illustrated in Table III, even for larger datasets pSConv can reduce the parameter and FLOP counts of the standard architectures without a significant drop in test accuracy.
Finally, in Fig. 10, we plot the test accuracy as a function of parameter count for all the experiments related to ResNet18. For both CIFAR-10 and Tiny ImageNet, ResNet18HCpSC2 (half-channel ResNet18 with KSS of 2) has the lowest parameter count. With a 3.27× and 3.9× lower parameter count than ShuffleNet, ResNet18HCpSC2 provides 3.3% higher accuracy on CIFAR-10 and similar accuracy on Tiny ImageNet, respectively. Furthermore, the parameter and FLOP counts of ResNet18HCpSC9 and ResNet18pSC2 are quite similar, whereas the latter has around 1% higher test accuracy on both datasets. This is somewhat in agreement with the trends observed in [dey2019pre] for sparse multi-layer perceptrons (MLPs), where it was observed that using more neurons that are sparsely connected typically yields better performance than fewer neurons with full connectivity. Also, somewhat surprisingly, the half-channel version of ResNet18 is an attractive and simple design alternative to ShuffleNet.
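The ratios quoted in this comparison follow directly from the entries of Tables II and III. The sketch below transcribes the two relevant rows per dataset and recomputes the accuracy gain and the FLOP and parameter ratios relative to ShuffleNet; the dictionary layout and the `compare` helper are hypothetical conveniences.

```python
# (test accuracy in %, FLOPs in G, parameters in M), transcribed from Tables II and III
results = {
    "CIFAR-10": {
        "ResNet18HCpSC2": (88.3, 0.030, 0.66),
        "ShuffleNet": (85.0, 0.098, 2.16),
    },
    "Tiny ImageNet": {
        "ResNet18HCpSC2": (54.5, 0.117, 0.863),
        "ShuffleNet": (54.3, 0.390, 3.37),
    },
}


def compare(dataset, model, baseline="ShuffleNet"):
    """Accuracy gain and complexity ratios of `model` relative to `baseline`."""
    acc, flops, params = results[dataset][model]
    b_acc, b_flops, b_params = results[dataset][baseline]
    return {
        "acc_gain": round(acc - b_acc, 1),
        "flops_ratio": round(b_flops / flops, 2),
        "params_ratio": round(b_params / params, 2),
    }


for dataset in results:
    print(dataset, compare(dataset, "ResNet18HCpSC2"))
```

A ratio above one means the pSConv network is cheaper than ShuffleNet by that factor.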
IV-C Experiments with VGG16
Table IV: VGG16 variants on CIFAR-10.
| Model | Test Acc (%) | FLOPs | FLOPs Reduced (%) | Parameters | Parameters Reduced (%) |
|---|---|---|---|---|---|
| VGG16pSC9 | 90.2 | 0.333 G | — | 33.65 M | — |
| VGG16pSC4 | 90.0 | 0.145 G | 56.5 | 25.47 M | 24.3 |
| VGG16pSC2 | 88.8 | 0.082 G | 75.4 | 22.20 M | 34.0 |
In VGG16, there are 13 convolutional layers, where each convolutional layer is followed by batch normalization and ReLU. There are max pooling layers after ReLU corresponding to the convolutional layers, , , and . The last max pooling layer feeds an average pooling layer. The next two layers are fully connected layers with ReLUs, while the final layer is a fully connected layer with softmax having 10 output neurons for CIFAR-10 and 200 output neurons for Tiny ImageNet.
For VGG16 with CIFAR-10, we summarize the results in Table IV, where we can clearly see that as the KSS decreases, both the parameter and FLOP counts decrease significantly, with a modest decrease in test accuracy. As illustrated by Fig. 11, architectures with pSConv have a convergence curve for validation accuracy similar to that of the standard architecture.
Table V: VGG16 variants on Tiny ImageNet.
| Model | Test Acc (%) | FLOPs | FLOPs Reduced (%) | Parameters | Parameters Reduced (%) |
|---|---|---|---|---|---|
| VGG16pSC9 | 54.2 | 1.28 G | — | 40.72 M | — |
| VGG16pSC4 | 53.5 | 0.528 G | 58.8 | 32.54 M | 20.1 |
| VGG16pSC2 | 51.8 | 0.277 G | 78.4 | 29.28 M | 28.1 |
Similar results are obtained for training on Tiny ImageNet. As shown in Table V, as the KSS is decreased, both the parameter and FLOP counts decrease significantly, with only a modest decrease in test accuracy. Specifically, Table V reports parameter-count reductions of 20.1% and 28.1% and FLOP reductions of 58.8% and 78.4% for a KSS of 4 and 2, respectively. (Footnote: VGG16 has a larger portion of its parameters in the fully-connected classification layers than ResNet18, so the FLOP and parameter-count reductions differ more for VGG16 than for ResNet18.) Furthermore, as illustrated by Fig. 12, architectures with pSConv have a convergence curve for validation accuracy similar to that of the standard architecture.
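The different behavior of the two reductions for VGG16 and ResNet18 can be read off the reported percentages. The sketch below transcribes the KSS of 2 rows from Tables II to V and computes the gap between the FLOP reduction and the parameter reduction; the dictionary layout and the helper name are hypothetical.

```python
# (FLOPs reduced %, parameters reduced %) for KSS of 2, transcribed from Tables II-V
reported = {
    ("ResNet18pSC2", "CIFAR-10"): (79.1, 76.5),
    ("VGG16pSC2", "CIFAR-10"): (75.4, 34.0),
    ("ResNet18pSC2", "Tiny ImageNet"): (79.1, 73.8),
    ("VGG16pSC2", "Tiny ImageNet"): (78.4, 28.1),
}


def reduction_gap(model, dataset):
    """Percentage points by which the FLOP reduction exceeds the parameter reduction."""
    flops_reduced, params_reduced = reported[(model, dataset)]
    return round(flops_reduced - params_reduced, 1)


for (model, dataset) in reported:
    print(f"{model:13s} {dataset:14s} gap = {reduction_gap(model, dataset)} points")
```

Since the fully connected layers are left unmodified, a network that keeps a large share of its parameters in those layers sees its parameter reduction diluted, while the FLOP count, which is dominated by the convolutional layers, still drops almost in proportion to the kernel density.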
V Conclusions and Future Work
The proposed pre-defined sparse kernel based filter design approach (pSConv) can achieve reduced complexity for training and inference while yielding higher accuracy than state-of-the-art resource-constrained alternatives. In our approach, only a subset of kernel weights, referred to as the kernel support, is not fixed at zero. Our evaluations with the CIFAR-10 dataset have shown that a ResNet18 with half the channel width at each layer and a kernel support size of 2 outperforms ShuffleNet by an absolute accuracy margin of around 3.3% with a 3.27× reduction in the number of parameters. Similar trends were observed for the Tiny ImageNet dataset, where a similar ResNet18 architecture with a kernel support size of 4 provides about 3% higher accuracy with a roughly 2.3× reduction in the number of parameters.
While the results of this work demonstrate the potential of pre-defined sparsity in CNNs, there are several interesting areas for further research. First, since pre-defined sparsity and group-wise/point-wise convolutions are orthogonal methods, a more complete network architecture search that optimizes over these pre-defined constrained filter methods could be fruitful. Second, investigating efficient implementations of pSConv layers for software (GPU) and custom hardware implementations is an important area for future work.
This paper is an extension of work done by the first three authors in the Spring 2019 offering of EE599: Deep Learning at the University of Southern California, taught by Dr. Keith Chugg. We would like to thank Dr. Brandon Franzke and Dr. Massoud Pedram for their helpful feedback on this work. Financial support from National Science Foundation grant #1763747 and from the Amazon Educate program, which provided cloud compute credits for the course, is gratefully acknowledged.