Pre-defined-sparseCNN
This repository contains a PyTorch implementation of the paper titled "Pre-defined Sparsity for Low-Complexity Convolutional Neural Networks"
The high energy cost of processing deep convolutional neural networks impedes their ubiquitous deployment in energy-constrained platforms such as embedded systems and IoT devices. This work introduces convolutional layers with pre-defined sparse 2D kernels whose support sets repeat periodically within and across filters. Because our periodic sparse kernels can be stored efficiently, the parameter savings can translate into considerable improvements in energy efficiency due to reduced DRAM accesses, thus promising significant improvements in the trade-off between energy consumption and accuracy for both training and inference. To evaluate this approach, we performed experiments with two widely accepted datasets, CIFAR-10 and Tiny ImageNet, on sparse variants of the ResNet18 and VGG16 architectures. Compared to baseline models, our proposed sparse variants require up to 82% fewer model parameters and 5.6× fewer FLOPs with negligible loss in accuracy for ResNet18 on CIFAR-10. For VGG16 trained on Tiny ImageNet, our approach requires 5.8× fewer FLOPs and up to 83.3% fewer model parameters with only a 1.2% drop in accuracy. We also compare the performance of our proposed architectures with that of ShuffleNet and MobileNetV2. Using similar hyperparameters and FLOPs, our ResNet18 variants yield an average accuracy improvement of 2.8%.
In recent years, deep convolutional neural networks (CNNs) have become critical components in many real-world vision applications, ranging from object recognition [krizhevsky2012imagenet, simonyan2014very, szegedy2015going, he2016deep] and detection [girshick2014rich, sermanet2013overfeat, redmon2017yolo9000] to image segmentation [tao2018image]. With the demand for high classification accuracy, current state-of-the-art CNNs have evolved to have hundreds of layers [krizhevsky2012imagenet, simonyan2014very, coates2013deep, szegedy2015going, szegedy2016rethinking], requiring millions of weights and billions of FLOPs. However, because a wide variety of neural network applications are heavily resource constrained, such as those for embedded and IoT devices, there is increasing interest in CNN architectures that balance implementation efficiency with accuracy, and in associated hardware accelerators that target CNNs [han2016eie, chen2016eyeriss, chen2018eyeriss]. In particular, because energy is often the primary limited resource, researchers have focused on minimizing the number of non-zero model parameters and the accelerator's accesses to off-chip DRAM, which consume around 200× more energy than accesses to on-chip SRAM [chen2017using].
Previous work has focused on accelerating inference and proposed model pruning [yang2017designing, han2015learning, wen2016learning, liu2018rethinking, zhang2018systematic] and quantization [zhou2017incremental, leng2018extremely, ren2019admm, alemdar2017ternary, courbariaux2015binaryconnect, rastegari2016xnor] to reduce the number of non-zero parameters. Recently, a more detailed analysis showed that such unstructured pruning may not reduce energy consumption because of the overhead required to manage sparse matrix representations [wen2016learning].
This motivates structured pruning [wen2016learning] which favors structure in the sparsity patterns that can more efficiently be managed in inference hardware.
Other work focused on the efficiency of both inference and training acceleration by defining notions of pre-defined sparsity [dey2019pre] in which a subset of the weights are fixed at zero before training and remain zero through inference. For example, a recent work [dey2019pre]
showed that neural networks can be trained with pre-defined, hardware-friendly sparse connectivity in their fully-connected (multilayer perceptron) layers, which avoids costly sparse matrix representations and thus can both speed up and reduce the energy consumption of both inference and training. Other researchers have tried to address the computational complexity of convolutional (CONV) layers, which contribute the largest number of FLOPs in deep networks, exemplified by the CONV layers of ResNet18 [he2016deep], which account for the overwhelming majority of the total FLOPs for Tiny ImageNet classification. In particular, many investigations have focused on efficient pre-defined computationally-limited filter designs that reduce the complexity of training and inference at the cost of accuracy, including MobileNet [howard2017mobilenets], MobileNetV2 [sandler2018mobilenetv2], and ShuffleNet [zhang2018shufflenet].

This paper proposes pre-defined sparse convolutions to improve energy and storage efficiency during both training and inference. We refer to this approach as pSConv; we presented initial simulation results, showing negligible performance degradation compared to dense baseline models, in [pSConv2019kundu]. However, as mentioned earlier, unstructured forms of pSConv may not lead to energy reductions due to the overhead of managing their sparse matrix representations.
Motivated by this fact, we extend pSConv by proposing a form of periodicity, repeating a relatively small pattern of pre-defined sparse kernels within a 3D filter such that fixed zero-weights occur repeatedly with a constant interval across the 3D filter. This periodicity can greatly reduce the overhead associated with managing sparsity, allowing the proposed CNN architecture to exhibit significant reductions in energy consumption compared to baseline CNNs with dense filters.
Finally, we present a convolutional channel modification to boost the accuracy of pSConv-based CNNs. In particular, the accuracy loss incurred due to the added periodicity constraint may be non-negligible in some cases. To combat this phenomenon, we introduce fully-connected (FC) 2D kernels at fixed intervals within a 3D filter. In particular, extending the periodic pattern of pre-defined sparse kernels with a fully connected (FC) kernel boosts accuracy while maintaining relatively low storage overhead.
To evaluate the effectiveness of our proposed sparsity based CONVs, we run image classification tasks on variants of VGG [simonyan2014very] and ResNet [he2016deep] with CIFAR-10 [krizhevsky2009learning] and Tiny ImageNet [le2015tiny] datasets. We also show that we achieve higher test accuracy than MobileNetV2 [sandler2018mobilenetv2] with similar network hyperparameter settings on these datasets. Finally, we analytically quantify the benefits of our algorithm compared to traditional approaches in terms of both FLOPs and storage, the latter assuming a variety of well-known sparse matrix representations.
The remainder of this paper is structured as follows. Section 2 provides notable related work in the domain of CNN architectures and efficient sparse matrix representations. Section 3 describes our proposed architecture in detail and is followed by our analytical evaluation of FLOPs and storage requirements in Section 4. We present our simulation results in Section 5 and conclude in Section 6.
CONV layers in neural network architectures transform input images into abstract representations known as feature maps. To generate the output feature maps (OFMs), the filters of a layer are convolved with the input feature maps (IFMs); this convolution comprises element-wise products of filter weights and IFM values and the accumulation of partial sums. In particular, the following equation shows the computation of each OFM element in a standard fully-connected convolution (SFCC) layer.
Variable | Description |
---|---|
N | batch size of a feature-map tensor |
H_i, W_i | height, width of the IFM of a layer |
k, k | height, width of a 2D kernel in a layer |
H_o, W_o | height, width of the OFM of a layer |
C_i | # of IFM channels / # of 3D-filter channels |
C_o | # of OFM channels / # of 3D filters |
C_g | # of channels in a group for GWC |
KSS | # of parameters per kernel not pre-defined to be zero |
Table I: Descriptions of tensor dimensions in a convolutional layer
O[n][m][x][y] = B[m] + Σ_{c=1}^{C_i} Σ_{i=1}^{k} Σ_{j=1}^{k} I[n][c][x+i−1][y+j−1] · W[m][c][i][j]    (1)

Here, O, I, and W are the 4D OFM, IFM, and filter weight tensors, respectively, and B is the 1D bias tensor added to each 3D filter result. Also, O[n][m][x][y] represents the OFM element at spatial position (x, y) in output channel m for input batch index n. Note the extensive data reuse in both the IFM and the weights, for which an optimized dataflow is needed to ensure energy efficiency [chakradhar2010dynamically, gupta2015deep, chen2017using].
The number of FLOPs necessary to generate the OFM of an SFCC layer can be estimated as

FLOPs_SFCC = H_o · W_o · C_o · C_i · k²    (2)

where k represents both the height and width of the (square) 2D kernel and the meanings of the other variables are defined in Table I.
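The FLOP estimate of Eq. (2) can be sketched directly; the following is a minimal illustration assuming a stride of 1 and counting one FLOP per multiply-accumulate, with variable names mirroring Table I (the example layer sizes are hypothetical, not taken from the paper):

```python
def sfcc_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs (multiply-accumulates) of a standard fully-connected
    convolution (SFCC) layer: each of the h_out*w_out*c_out output
    elements requires k*k*c_in multiply-accumulates (Eq. (2))."""
    return h_out * w_out * c_out * c_in * k * k

# e.g. a 3x3 SFCC layer with 64 input / 128 output channels on a 32x32 OFM
flops = sfcc_flops(32, 32, 64, 128, 3)  # 75,497,472 MACs
```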
Also, in this paper we assume a stride of 1 and consider a FLOP and a multiply-accumulate operation to be equivalent. Because the SFCC [lecun1999object], shown in Fig. 1(a), is computationally intensive, several pre-defined computationally-limited filters have been proposed to reduce the complexity of convolution. These filters can be broadly classified into three categories, as shown in Fig. 1.

The first category is depth-wise convolution (DWC) [vanhoucke2014learning], shown in Fig. 1(b). Here, each 2D kernel of size k × k is convolved with a single channel of the IFM to produce the corresponding OFM channel; thus C_i such kernels produce an OFM of dimension H_o × W_o × C_i. This requires C_i times fewer computations than SFCC, but the output features capture no information across channels.

The second category is group-wise convolution (GWC) [krizhevsky2012imagenet], shown in Fig. 1(c), which provides a compromise between SFCC and DWC. Here, a single channel of the OFM is computed by convolving a group of C_g channels from the IFM with a partition of a 3D filter of size k × k × C_g. Thus, with a total of C_i/C_g groups, C_o such partitioned filters provide an OFM of size H_o × W_o × C_o. Interestingly, SFCC can be viewed as GWC with C_g = C_i, and DWC can be viewed as GWC with C_g = 1. Typically, the number of groups is chosen to be a small power of 2, but the choice is highly dependent on the network architecture [ioannou2017deep].
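The relationship between these filter categories can be made concrete with a small FLOP-count sketch; here g denotes the number of groups (a local convention for this illustration, not notation from the paper):

```python
def gwc_flops(h_out, w_out, c_in, c_out, k, g):
    """FLOPs of group-wise convolution (GWC) with g groups: each output
    channel sees only c_in/g input channels, so each output element
    needs k*k*(c_in/g) multiply-accumulates."""
    assert c_in % g == 0 and c_out % g == 0
    return h_out * w_out * c_out * (c_in // g) * k * k

# g = 1 recovers SFCC; g = c_in (with c_out = c_in) recovers DWC,
# which is c_in times cheaper than SFCC.
```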
Finally, Fig. 1(d) illustrates point-wise convolution (PWC), in which the 2D kernels have dimension 1 × 1, generating each OFM channel with low complexity. In particular, compared to a k × k kernel, PWC lowers the computational complexity by a factor of k². However, OFMs generated through this approach embed no spatial information from within a channel.
Many well known network architectures have taken advantage of the benefits of pre-defined computationally-limited filters. For example, a combination of GWC and PWC was used in [ioannou2017deep]
and in the Inception modules
[szegedy2017inception, szegedy2016rethinking]. The ResNeXt architecture [xie2017aggregated] also uses a combination of GWC and PWC to replace each CONV layer of ResNet [he2016deep]. A class of scaled-down, reduced-parameter architectures that replace most of the 3 × 3 filters with PWC filters was dubbed SqueezeNet in [iandola2016squeezenet]. MobileNet and MobileNetV2, two popular low-complexity architectures designed to be implemented on mobile devices, replace each SFCC layer with a DWC followed by a PWC layer to gather information across channels. ShuffleNet [zhang2018shufflenet] uses a combination of GWC, channel shuffling for information sharing across channels, and a DWC layer.

Most hardware platforms that process deep neural networks can benefit from sparse weight matrices only when such weights are represented through sparse matrix storage formats. These formats typically store the non-zero elements of a given matrix in a vector while auxiliary vectors describe the locations of the non-zero elements. This section explains three such commonly employed formats.
The COO format [cheng2014professional] uses three vectors to represent a sparse matrix: a data vector that holds the values of the non-zero elements, a row vector that stores the row indices of the non-zero elements, and a column vector that keeps track of their column indices. In this representation, all three vectors have the same length, equal to the number of non-zero elements in the original sparse matrix.
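A minimal sketch of COO construction; the matrix below is a small hypothetical example (the paper's worked example matrix did not survive extraction):

```python
def to_coo(matrix):
    """COO representation: parallel data/row/column vectors with one
    entry per non-zero element of the matrix."""
    data, rows, cols = [], [], []
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v != 0:
                data.append(v)
                rows.append(i)
                cols.append(j)
    return data, rows, cols

A = [[1, 0, 0, 2],
     [0, 3, 0, 0],
     [4, 0, 5, 0]]
data, rows, cols = to_coo(A)  # three vectors, each of length 5
```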
Similar to the COO format, the CSR format [cheng2014professional] uses three vectors to represent a sparse matrix. The data vector stores the values of the non-zero elements in the order they are encountered when traversing the original matrix from left to right and top to bottom. The column vector keeps track of the column indices of the non-zero elements, and the index vector identifies where the elements of each row of the matrix begin within the data vector. In fact, the column vector is the same as in the COO format, while the index vector stores the COO row vector in a compressed manner, hence the name CSR (compressed sparse row). The index vector always begins with zero and ends with the length of the data vector, so the index and column vectors together determine both the row and the column of each element of the data vector. If a row of the matrix has no non-zero elements, the corresponding entry in the index vector is repeated.
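The CSR mechanics described above can be sketched as follows (again on a hypothetical matrix, not the paper's example):

```python
def to_csr(matrix):
    """CSR representation: data and column vectors as in COO (row-major
    traversal); the index (row-pointer) vector has one entry per row
    plus one, starting at 0 and ending at len(data)."""
    data, cols, index = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                data.append(v)
                cols.append(j)
        index.append(len(data))  # repeated when a row has no non-zeros
    return data, cols, index
```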
The CSC (compressed sparse column) format [cheng2014professional] is very similar to CSR and, in fact, is equivalent to CSR storage of the transpose of the matrix: the data vector stores the non-zero values in column-major order, the row vector stores their row indices, and the index vector marks where each column begins within the data vector. As in the CSR format, the data and row vectors have the same length, equal to the number of non-zero elements. However, the length of the index vector is equal to the number of columns of the sparse matrix plus one.
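A corresponding CSC sketch, traversing the (hypothetical) matrix column by column:

```python
def to_csc(matrix):
    """CSC representation, i.e. CSR of the transpose: data in
    column-major order, a row vector of row indices, and an index
    vector of length n_cols + 1."""
    data, rows, index = [], [], [0]
    for col in zip(*matrix):  # walk the matrix column by column
        for i, v in enumerate(col):
            if v != 0:
                data.append(v)
                rows.append(i)
        index.append(len(data))
    return data, rows, index
```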
Some existing deep neural network (inference) accelerators such as Cambricon-X [zhang2016cambricon], Eyeriss [chen2016eyeriss]^{1}^{1}1Note that Eyeriss [chen2016eyeriss] uses run-length coding (RLC) to represent sparse vectors, in particular sparse activations, whereas formats such as CSR and CSC are better suited to representing sparse matrices., and Eyeriss v2 [chen2019eyeriss] have hardware support for processing values represented using sparse storage formats. The periodic sparsity introduced in this work allows us to further compress sparse representations such as the CSR and CSC formats by reusing the auxiliary vectors. This not only decreases the storage required for keeping model parameters in memory but also reduces the energy associated with transferring them from main memory to the processing elements (PEs). Furthermore, the proposed optimized sparse storage formats can be integrated into existing accelerators such as Eyeriss v2 with minor modifications to the controller logic or PEs. Section 4 details the storage and energy savings achieved through deployment of the proposed formats.
This section first describes pSConv, a form of pre-defined sparse kernel based convolution that we initially proposed in [pSConv2019kundu]. It then describes how we introduce periodicity to this framework to reduce the overhead of managing sparse matrix representations. Finally, the section presents a method to boost accuracy by periodically introducing a fully connected kernel into the 3D filters.
We define the kernel support as the set of entries in a 2D kernel that are not constrained to be zero. The size of this set is defined as kernel support size (KSS). The kernel variant size (KVS) is defined as the number of kernels with unique kernel support in a 3D filter.
We say a 3D filter of size k × k × C_i has pre-defined sparsity if some of its parameters are fixed to zero before training and held fixed throughout training and inference. A regular pre-defined sparse 3D filter has the same KSS for each kernel that comprises the 3D filter.^{2}^{2}2We only consider the convolutional weights when defining sparsity. Bias and other variables associated with batchnorm are not considered because they add negligible complexity. This regularity can help reduce workload imbalance across the different PEs performing multiply-accumulates and thus can help improve the throughput of CNN accelerators [chen2018eyeriss]. Fig. 2 shows an example of kernel variants for 3 × 3 kernels: KSS = 9 denotes the standard kernel without any pre-defined sparsity, while KSS = 2 signifies that seven of the nine kernel entries are fixed at zero. The choice of kernel variants can be viewed as a model-search problem; however, in this paper we adopt a lower-complexity approach of choosing them in a constrained pseudo-random manner that ensures every possible location in the 2D kernel space (9 in this case) has at least one entry in the 3D filter that is not pre-defined to be zero. As an example, Fig. 3 illustrates how an OFM is generated through the convolution of pre-defined sparse kernels with an IFM.
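The constrained pseudo-random selection of kernel variants can be sketched as below. This is an illustrative reconstruction under the stated coverage constraint; the paper's exact selection procedure may differ in its details:

```python
import random

def make_kernel_variants(kss, k=3, seed=0):
    """Pseudo-randomly draw kernel supports of size `kss` from a k*k
    kernel, constrained so that across the generated variants every one
    of the k*k positions is non-zero in at least one variant."""
    rng = random.Random(seed)
    pool = list(range(k * k))
    rng.shuffle(pool)  # uncovered positions, consumed first for coverage
    variants = []
    while pool:
        support = set()
        while len(support) < kss:
            # prefer still-uncovered positions; pad randomly otherwise
            support.add(pool.pop() if pool else rng.randrange(k * k))
        variants.append(sorted(support))
    return variants

variants = make_kernel_variants(kss=4)  # 3 variants covering all 9 cells
```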
The challenge in efficiently implementing this scheme is avoiding the processing of the weight entries that are fixed at zero. Because the kernel variants are chosen randomly from a large set of potential support patterns and the KVS could be as large as C_i for each 3D filter, the non-zero weight index memories can represent considerable overhead. We propose to address this problem by introducing periodicity within a 3D filter, as described below.
In order to reduce the overhead of storing the sparsity patterns, we propose to repeat the sparsity patterns, using only a small number of kernel variants across all filters. This is particularly beneficial in the compressed sparse weight formats because the same index values can be used for multiple filters.
Fig. 4 shows an example of periodically repeating kernel patterns with a periodicity P. Notice that, to retain periodicity across different 3D filters while still providing some diversity, we rotate the sequence of kernel variants, starting each filter (of P consecutive filters) with a different kernel variant. For instance, if the first 3D filter starts with KV1 followed by KV2, KV3, and KV4, and then repeats the order, we start the second 3D filter with KV2 to create a repeating sequence of [KV2, KV3, KV4, KV1]. Thus, we maintain the sequence of repeating kernels modulo rotation.

Our specific choice of sparse KVs in our experiments is obtained by sequentially picking non-zero 2D entries at random, constrained such that no non-zero 2D entry is chosen twice until all entries of the kernel have been chosen at least once. Furthermore, we ensure that every pixel in the input frame has an opportunity to affect the outcome of our sparse-periodic network, which constrains the minimum value of the periodicity P. For example, for a 3 × 3 kernel with a KSS of 1, the minimum value of P necessary to ensure every entry of the kernel is chosen is 9. More specifically, the nine sparse 2D kernels in this example must each have a different single non-zero entry such that together they cover all entries.
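The rotation scheme above can be sketched as a small assignment function (variant names KV1–KV4 are placeholders for the supports of Fig. 4, which are not reproduced here):

```python
def rotated_sequences(variants, num_filters, num_channels):
    """Assign kernel variants to each 3D filter by rotating the base
    sequence: filter f uses variant (f + c) mod P for its c-th kernel,
    so each consecutive filter starts one variant later (cf. Fig. 4)."""
    P = len(variants)
    return [[variants[(f + c) % P] for c in range(num_channels)]
            for f in range(num_filters)]

# With variants [KV1..KV4], filter 0 cycles KV1,KV2,KV3,KV4,... while
# filter 1 cycles KV2,KV3,KV4,KV1,...
```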
Although periodicity in the sparse patterns is beneficial for managing the overhead of sparsity, the choice of KSS and the simplistic way of choosing kernel variants may sometimes cost significant classification performance. Methods to find suitable sparse patterns and KVS values through pattern pruning, inspired by image smoothing filters, were recently considered in [ma2019pconv]. Here, however, we propose a complementary approach in which we periodically introduce fully connected (FC) kernels, i.e., kernels with KSS = k², within each 3D filter. We use F to denote the number of dense (FC) kernels in a period; in principle, it can have any value between 0 and P. In the case of standard (dense) convolution filter based models, F = P; in contrast, F = 0 implies no boosting. Fig. 5 illustrates the idea, where one fully connected kernel (F = 1) is introduced every P kernels, with the other P − 1 kernels sparse, and this pattern repeats throughout the 3D filter.
Note that our selection of the period P is premised on the fact that balancing the computational requirements across 3D filters is preferred for hardware implementations because it enables more efficient scheduling across parallel computational units. It is therefore desirable to have a fixed number of non-zero weights per filter, which implies having an equal number of FC kernels per filter. Given the approach illustrated in Fig. 4, it is therefore preferred to have C_i be divisible by P. In our experiments, detailed in Section 5, the layers where boosting is applied have C_i that is a multiple of 64. Thus our preferred values of P are {2, 4, 8, 16, 32, 64}.
To choose the sparse kernel variants, we follow the same principle as described in Section 3.2 before adding the FC kernels. However, in the presence of an FC kernel, P can, in principle, be lower than the minimum required without boosting because the FC kernel covers all entries. Moreover, when the period exceeds the number of distinct sparse kernel variants, we propose to randomly reuse some of the sparse kernel variants to maintain periodicity.
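One period of a boosted filter, and the per-period non-zero count it implies, can be sketched as follows (this assumes one FC kernel per period and uses Eq. (4) as reconstructed in Section 4):

```python
def boosted_period(sparse_variants, k=3):
    """One period of a boosted filter: a single fully connected (FC)
    kernel (all k*k entries free) followed by the P-1 sparse kernel
    variants, matching Fig. 5 with one FC kernel per period."""
    fc_support = list(range(k * k))
    return [fc_support] + list(sparse_variants)

def nonzeros_per_period(P, kss, k=3):
    """Entries allowed to be non-zero in one period of P kernels with a
    single FC kernel: k^2 + (P - 1) * KSS."""
    return k * k + (P - 1) * kss
```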
Approach | FLOP count (forward, ideal) |
---|---|
MobileNet-like [howard2017mobilenets] (DWC+PWC) | H_o · W_o · C_i · (k² + C_o) |
ShuffleNet-like [zhang2018shufflenet] (GWC+PWC) | H_o · W_o · C_o · (k² · C_g + C_i) |
Table II: FLOP counts for pre-defined computationally-limited filter based CONV layers
The total FLOPs for MobileNet-like and ShuffleNet-like CONV layers can be estimated as shown in Table II. The total FLOPs for sparse (both periodic and aperiodic variants) kernel based CONV layers with a KSS of less than k² can be estimated as

FLOPs_pSConv = H_o · W_o · C_o · C_i · KSS    (3)
To estimate the FLOPs of sparse kernel based CONVs with boosting^{3}^{3}3Here each period is assumed to have only one FC kernel., we start with the number of elements in a period of P kernels that are allowed to be non-zero, which can be computed as (shown in Fig. 5)

E_P = k² + (P − 1) · KSS    (4)

Now, with C_i/P dense kernels and C_i · (P − 1)/P sparse kernels in each 3D filter of a layer, the FLOPs can be computed as

FLOPs_boost = H_o · W_o · C_o · (C_i/P) · (k² + (P − 1) · KSS)    (5)
The ratios of the FLOP counts for MobileNet-like and ShuffleNet-like layers to that of sparse kernel based CONVs with boosting are

β_M = [H_o · W_o · C_i · (k² + C_o)] / FLOPs_boost = P · (k² + C_o) / [C_o · (k² + (P − 1) · KSS)]    (6)

β_S = [H_o · W_o · C_o · (k² · C_g + C_i)] / FLOPs_boost = P · (k² · C_g + C_i) / [C_i · (k² + (P − 1) · KSS)]    (7)

It is clear that we will have computational savings when the values of β_M and β_S are greater than 1. When P is large, (k² + (P − 1) · KSS)/P ≈ KSS, so (6) and (7) can be approximated as

β_M ≈ (k² + C_o) / (C_o · KSS)    (8)
β_S ≈ (k² · C_g + C_i) / (C_i · KSS)    (8a)

which shows that the complexity increment due to the periodic insertion of FC kernels is negligible for relatively wide networks with large periods. Fig. 6 shows a 3D illustration of the per-layer FLOP ratios (β_M and β_S) as a function of the network width and the period P. Note that even though the per-layer ratio can be less than 1, the total parameter count for MobileNet- or ShuffleNet-like networks can be larger due to the presence of more layers.
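These FLOP expressions can be checked numerically with a small sketch, consistent with Eqs. (4)–(6) as reconstructed above (layer sizes are hypothetical):

```python
def boosted_flops(h_out, w_out, c_in, c_out, k, kss, P):
    """FLOPs of a periodic sparse CONV layer with one FC kernel per
    period of P kernels: c_in/P kernels per filter are dense (k*k
    weights), the remaining kernels carry KSS weights each."""
    dense = c_in // P
    sparse = c_in - dense
    return h_out * w_out * c_out * (dense * k * k + sparse * kss)

def mobilenet_like_flops(h_out, w_out, c_in, c_out, k):
    """DWC (k*k per input channel) followed by a dense 1x1 PWC layer."""
    return h_out * w_out * (c_in * k * k + c_in * c_out)

# Per-layer ratio beta_M for a hypothetical layer:
ratio = mobilenet_like_flops(8, 8, 64, 64, 3) / boosted_flops(8, 8, 64, 64, 3, 1, 16)
```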
Sparsity leads to savings in storage only when the overhead of storing the auxiliary vectors to manage sparsity is negligible. This section presents a new sparse representation specifically tailored to periodic sparse kernels and compares it with existing formats. It also analyzes storage requirements of different sparse representations analytically, allowing the study of the effectiveness of such formats at different levels of density. Furthermore, it explains how the proposed representation can be exploited in CNN accelerators.
The periodic pattern of kernels introduced in Section 3.2 allows reusing the column/row vector in the CSR/CSC format. For example, assume a convolutional layer with 3 × 3 kernels, 128 input channels, 128 output channels, and a period of four. The 4D weight tensor corresponding to this convolutional layer can be represented by a flattened weight matrix where each row corresponds to a flattened filter. As a result, the number of rows in the flattened weight matrix is 128 while the number of columns is 128 × 9 = 1152. Because of the periodicity across filters, the structure of the rows of the flattened weight matrix also repeats with a period of four. Therefore, one can simply store the column vector of the CSR format for the first four rows and reuse it for the subsequent rows. We refer to this new sparse storage format as CSR with a periodic column vector and denote it CSR_p, where p denotes the period of repetition of the column vector.

Similarly, because of the periodicity of kernels within a filter, the columns of the flattened matrix also repeat, with a period of 4 × 9 = 36 columns. As a result, one can choose to use the CSC format to represent the flattened sparse matrix and reuse the row vector for groups of 36 columns. We refer to this new format as CSC with a periodic row vector and denote it CSC_p, where p here denotes the period of repetition of the row vector.
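The row periodicity of the flattened weight matrix can be verified with a short sketch (kernel variants here are placeholder supports, not those of Fig. 2):

```python
def row_support(variants, row, num_channels, k=3):
    """Column indices of the non-zeros in row `row` of the flattened
    weight matrix, with filter `row` cycling through the kernel
    variants starting at offset row mod P (the rotation of Section
    3.2). Rows repeat in structure with period P, so CSR_p need store
    the column vector for only the first P rows."""
    P = len(variants)
    cols = []
    for c in range(num_channels):
        for pos in variants[(row + c) % P]:
            cols.append(c * k * k + pos)
    return cols
```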
Table III summarizes the notation used for comparing the storage cost of different storage formats. Using the notation introduced here, Table IV explains storage requirements of different storage formats.
Variable | Description |
---|---|
m, n | height, width of a flattened weight matrix |
d | density (fraction of elements that are non-zero) |
b_d | number of bits for representing data values |
b_r, b_c | number of bits for representing row, column values |
b_i | number of bits for representing index values |
b_p | number of bits for representing the period |
Format | Storage Requirement (bits) |
---|---|
Dense | m · n · b_d |
COO | m · n · d · (b_d + b_r + b_c) |
CSR | m · n · d · (b_d + b_c) + (m + 1) · b_i |
CSC | m · n · d · (b_d + b_r) + (n + 1) · b_i |
CSR_p | m · n · d · b_d + p · n · d · b_c + (m + 1) · b_i + b_p |
CSC_p | m · n · d · b_d + p · m · d · b_r + (n + 1) · b_i + b_p |
Based on Table IV, the COO format is expected to have higher overhead than that of the CSR and CSC formats, which have similar storage overhead. Furthermore, it is evident that the introduction of periodicity to the CSR and CSC formats can significantly decrease the storage overhead.
As noted above, a convolutional layer with periodic sparse kernels induces a flattened weight matrix that also has periodically repeating columns and rows. In a CNN accelerator, the processing of a convolutional layer is often broken down into smaller operations where subsets of the flattened weight matrix are processed across multiple PEs. This processing requires accessing a sub-matrix of the flattened weight matrix. If this sub-matrix is large enough, it will also have row or column vectors that are repeated periodically. For example, Fig. 7 demonstrates a subset of a flattened weight matrix that is used in a single processing element of an architecture like Eyeriss v2 [chen2018eyeriss] (the original flattened weight matrix is built using the first four kernel variants shown in Fig. 2). This sub-matrix corresponds to processing the first (top) row of four kernels of 16 filters. Specifically, the sub-matrix consists of 16 rows corresponding to 16 filters and 12 columns corresponding to the top row of four kernels per filter. Note that in Fig. 7 the four kernels have been rotated as described in Section 3.2. Based on the periodic pattern across filters, the sub-matrix shown in Fig. 7 has repeating rows with a period of four and can be represented using CSR_p with p = 4.
Because each PE in a CNN accelerator processes a small portion of the flattened weight matrix, the quantities m, n, and p have small ranges and can therefore be represented using a small number of bits. Under representative choices of b_d, b_c, and b_i, Fig. 8(a) compares the storage requirements of the various existing storage formats at different levels of filter density. It is observed that the CSR and CSC formats yield lower total storage than dense storage when the original matrix is at most 62% and 65% dense, respectively.

Fig. 8(b) compares the storage requirements of the dense, CSR, and CSR_p formats for the same matrix as in Fig. 8(a), for different values of the period p. It is observed that the CSR_p variants yield lower total storage than dense storage when the original matrix is at most 82% and 73% dense, respectively. Furthermore, at 62% density, the two CSR_p variants yield lower total storage than CSR by 23% and 16%, respectively^{4}^{4}4Interestingly, RLC has similar storage requirements to CSR. In particular, as implemented in Eyeriss [chen2016eyeriss], at 62% density RLC would lead to 0.14% more storage than CSR.. This is equivalent to a 60.04% and 39.86% reduction in the overhead of storing auxiliary vectors for the two CSR_p variants compared to the CSR format.
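The storage accounting of Table IV can be sketched numerically. The bit widths below are assumptions for illustration only, not the (lost) values used in the paper's figure:

```python
def dense_bits(m, n, b_d):
    """Dense storage: every element at b_d bits."""
    return m * n * b_d

def csr_bits(m, n, d, b_d, b_c, b_i):
    """CSR: nnz data values, nnz column entries, m+1 index entries."""
    nnz = round(m * n * d)
    return nnz * b_d + nnz * b_c + (m + 1) * b_i

def csrp_bits(m, n, d, p, b_d, b_c, b_i, b_p):
    """CSR_p: the column vector is stored for only p of the m rows and
    reused (regular sparsity, so every row has nnz/m non-zeros)."""
    nnz = round(m * n * d)
    per_row = nnz // m
    return nnz * b_d + p * per_row * b_c + (m + 1) * b_i + b_p
```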
Because the energy cost of transferring data from DRAM is well modeled as proportional to the number of bits read [greenberg2013lpddr3], the reduced storage requirements of CSR_p/CSC_p lead to a proportional reduction in the energy cost of DRAM accesses. For example, a 50% savings in storage will result in a 50% reduction in energy consumption related to DRAM access. For this reason, in the remainder of this paper, we focus on savings in storage requirements, with the energy savings being implicit.
The low-complexity storage formats introduced in Section 4.2.1, i.e., CSR_p/CSC_p, cannot be integrated into existing accelerators without ensuring those accelerators can support the proposed periodic sparse format. For example, in Eyeriss v2, each weight value (i.e., data) is coupled with its corresponding index, and they are read as a whole from the main memory. In contrast, CSR_p/CSC_p store the column/row vector separately from the data vector and read the auxiliary vectors once for all data values. This not only requires proper adjustment of the bus that transfers data from the DRAM to the chip but may also require a minor modification to either the control logic or the PEs.

One approach to making an accelerator like Eyeriss v2 compatible with periodic sparsity is to store the weights in DRAM using the proposed periodic sparse format and modify the system-level control logic to expand the column/row vector before storing it in the PEs' scratchpad memories. In other words, the sparse column/row vector is read from the DRAM only once but replicated before being written into the scratchpad memory corresponding to the column/row vector, so that it adheres to the CSR/CSC format. In this manner, the scratchpad memory within each PE remains the same and stores bundled (data, index) pairs. Because DRAM accesses consume two orders of magnitude more energy than on-chip communication, we can thus achieve close to the optimal energy savings without requiring any change in the PE array or its associated control structures.
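The expansion step described above amounts to replicating the per-period column lists; a minimal sketch:

```python
def expand_periodic_cols(periodic_cols, num_rows):
    """Expand the CSR_p column vectors (stored once per period of rows)
    into full per-row CSR column vectors before they are written to the
    PEs' scratchpads: DRAM is read once, replication happens on-chip."""
    p = len(periodic_cols)
    return [periodic_cols[r % p] for r in range(num_rows)]
```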
A more comprehensive approach to supporting periodic sparsity involves ensuring the PEs can use the column/row scratchpad memory as a configurable circular buffer which, to support periodicity, is configured to have a length equal to the stored period. This type of support may already exist because, in many cases, the size of the weight matrix processed within each PE is smaller than the size of the corresponding scratchpad memory and therefore only a portion of the scratchpad memory is used. In this approach, the periodic column/row vector is read from the DRAM once, written into the scratchpad memory, and accessed multiple times for different rows of the weight matrix. This reduces the required on-chip communication and thus may save more memory compared to storing the expanded column/row vectors in the scratchpad memory.
While the presented approaches enable compression of the column/row vectors, one may be able to compress the index vector as well, as suggested by the row periodicity illustrated in Fig. 7. However, this may require more complex hardware support to expand the index vector before storing them in the PEs or adding support for the compressed index vectors within the PE.
This section describes our simulation results and analysis. We first detail the datasets, architecture, and important hyperparameters used for our experiments, followed by our experimental results of our proposed pSConv approach, the introduction of periodicity, and our performance boosting technique. Finally, we compare our modified network architectures with MobileNetV2 [sandler2018mobilenetv2]
, a popular low-complexity CNN variant for image classification, in terms of FLOPs, model parameters, and accuracy. We used Pytorch
[paszke2017automatic] to design the models and trained/tested them on AWS EC2 P3.2xlarge instances, each of which has an NVIDIA Tesla V100 GPU. To evaluate our models we used CIFAR-10 [krizhevsky2009learning] and Tiny ImageNet [le2015tiny], two widely popular image classification datasets. The input image dimensions of CIFAR-10 and Tiny ImageNet are (32 × 32) and (64 × 64), respectively, and the numbers of output classes are 10 and 200, respectively. We chose variants of VGG16 [simonyan2014very] and ResNet18 [he2016deep] as the base network models to which we apply our architectural modifications. The VGG16 architecture has thirteen kernel-based convolutional layers; the flattened output of the final CONV layer is fed to three fully connected (FC) layers.^{5} ^{5}In VGG16 for the CIFAR-10 dataset, we used only one FC layer because the input image dimension is smaller than that of Tiny ImageNet and multiple FC layers are not needed to achieve high accuracy. The convolutional portion of the ResNet18 architecture consists of four layers, each containing two basic blocks, where each basic block has two convolutional layers along with a skip connection path. We used pre-defined sparse kernels on all CONV layers
but excluded the first layer, as it is connected to the primary inputs and is thus more sensitive to zero weights. Training was performed for 120 and 100 epochs for CIFAR-10 and Tiny ImageNet, respectively. The initial learning rate was set to 0.1, with a momentum of 0.9 and weight decay of . The image datasets were augmented through random cropping and horizontal flips before being fed into the network in batches of 128 and 100 for CIFAR-10 and Tiny ImageNet, respectively. All reported results are averages over two training runs. Table V lists the names of the network model variants and their corresponding architecture descriptions. We analyzed three different variants of regular sparse kernel based CONVs, with KSS values of 4, 2, and 1, alongside the baseline standard convolution based network. As stated earlier, in our choice of kernel patterns we ensure that each possible kernel entry is covered by at least one sparse kernel variant. Table VI reports the accuracy and parameter count for the KSS variants applied to the VGG16 and ResNet18 architectures.^{6} ^{6}The tables in this section report only the convolution layer parameters, without the indexing overhead. The ResNet18-based results show that even with a KSS of only 4, the test accuracy degradation is within for the CIFAR-10 dataset and within for Tiny ImageNet. The corresponding VGG16 results show a test accuracy degradation within for CIFAR-10 and within for Tiny ImageNet.
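As an illustration of the setup above, the following sketch generates fixed binary 3 × 3 masks with a given KSS; the helper name and the use of `random.sample` are our own (the actual pattern selection follows the criteria described in the text), but it captures the idea that kernel supports are fixed before training:

```python
import random

def make_sparse_kernel_masks(num_kernels, kss, seed=0):
    """Generate pre-defined binary 3x3 masks, each with exactly `kss`
    nonzero entries; multiplying a kernel elementwise by its mask
    enforces the pre-defined sparsity throughout training."""
    rng = random.Random(seed)
    masks = []
    for _ in range(num_kernels):
        support = set(rng.sample(range(9), kss))  # flattened positions kept
        masks.append([[1 if 3 * r + c in support else 0 for c in range(3)]
                      for r in range(3)])
    return masks
```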
The storage and energy advantages associated with periodically repeating kernels drawn from a specific set of kernel variants, analyzed in Section 4.2, motivated us to evaluate this approach in terms of test accuracy. We leveraged the observation provided by [ma2019pconv] and kept the KVS small for the different KSS based architectures. In particular, because a KSS of 4 covers more kernel entries per variant, we chose a corresponding P = KVS = 4 and covered all possible kernel entries. For similar reasons, we chose larger KVS values of 6 and 9 for KSS of 2 and 1, respectively. We selected kernel variants as described in Section 5.2. Fig. 11 shows the learning curves for the CIFAR-10 and Tiny ImageNet datasets with different variants of the VGG16 and ResNet18 models with a KSS of 1.^{7} ^{7}We saw similar trends with KSS of 2 and 4 in VGG16 and ResNet18, and omit separate plots for brevity. It is clear that the sparse variants learn at similar rates as the corresponding baselines.
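The periodic assignment of a small set of kernel variants can be sketched as follows; the function and the example variants are hypothetical, but the coverage check mirrors the criterion that one period must touch every entry of the 3 × 3 kernel:

```python
def assign_periodic_variants(num_kernels, variants):
    """Tile the kernel variants periodically across the kernels of a 3D
    filter (period P = KVS = len(variants)); returns, for each kernel,
    the index of the sparse variant it uses."""
    # Sanity check: the union of supports over one period covers all 9
    # entries of the 3x3 kernel (flattened positions 0..8).
    covered = set().union(*variants)
    assert covered == set(range(9)), "one period must cover all 9 entries"
    period = len(variants)
    return [k % period for k in range(num_kernels)]
```

For instance, with KSS = 4 and P = KVS = 4, variants such as `[{0,1,2,3}, {3,4,5,6}, {6,7,8,0}, {1,2,4,7}]` jointly cover all nine entries, and 8 kernels are assigned `[0, 1, 2, 3, 0, 1, 2, 3]`.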
Table VII shows the impact of the added periodicity constraint on the test accuracy of our proposed variants. Note that because of the overhead of storing auxiliary vectors, the overall storage reduction is smaller than the values reported in Table VII. For example, for VGG16_PS4_P4, the reduction in the number of parameters is 55.6%, but after including the storage of the auxiliary vectors in the format, the reduction is approximately 44.6%. If the format is used, the reduction in overall storage requirements relative to the baseline is approximately 25%.
The results without and with periodically repeating sparse kernel patterns, discussed in Sections 5.2 and 5.3, respectively, show considerable performance degradation at low KSS values such as 1. This section presents the performance of the network models with the proposed boosting method, in which we periodically incorporate FC kernels () in the 3D filter.^{8} ^{8}In this paper, we focus on results with one FC kernel per period. However, we also evaluated performance with more FC kernels per period. For example, two FC kernels with a period of 16 yield accuracy similar to one FC kernel with a period of 8. Both models have similar parameter counts, but the latter has significantly lower storage costs, suggesting that restricting our model space to one FC kernel per period is reasonable.
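The boosted layout can be sketched as below (hypothetical helper, assuming one FC kernel per period as in the text; each kernel is represented by its support set of flattened 3 × 3 positions):

```python
def boosted_kernel_pattern(num_kernels, period, sparse_variants):
    """Place one dense (FC) kernel at the start of every period and fill
    the remaining slots cyclically with the pre-defined sparse variants."""
    DENSE = frozenset(range(9))  # all 9 entries of a 3x3 kernel
    pattern = []
    for k in range(num_kernels):
        if k % period == 0:
            pattern.append(DENSE)  # the boosting FC kernel
        else:
            pattern.append(frozenset(sparse_variants[k % len(sparse_variants)]))
    return pattern
```

Because the dense kernel covers all nine entries, every period trivially satisfies the coverage criterion.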
To evaluate the value of boosting, we measure its impact when the periodicity is set to 8 and 16, as well as when it is applied to the non-boosting configurations used in Table VII. We tested the same sparse kernel variants as those used in Section 5.3. Thus, when the number of unique variants is less than the number of sparse kernels per period, we randomly chose some of the sparse kernel variants to repeat before placing the FC kernels; for the simulation of the *_PSD1_P8 models, we randomly chose 7 of the 9 unique sparse kernel variants. Note that because each period now contains one FC kernel, the proposed criterion of covering all kernel entries within a period is automatically satisfied.
Tables VIII and IX show the classification accuracy improvement over the sparse periodic counterparts and the parameter count reduction relative to the corresponding baseline models. The results show that boosting yields an improvement of up to 3.2% (3.6%) in classification accuracy for CIFAR-10 (Tiny ImageNet). With a sparse KSS of 4, the average performance improvement compared to the periodic sparse models is . This is intuitive, as the potential improvement is lower when the KSS is high. However, for low KSS the average improvement is . For example, ResNet18 with a KSS of 1 and FC kernels repeating with a period of 8 on CIFAR-10 has an accuracy degradation of only relative to the baseline, compared to the larger degradation observed earlier without the FC kernels inserted. This motivates the use of boosting with pre-defined kernels that are very sparse. We observed similar trends with Tiny ImageNet. The relative cost of the increase in parameters due to boosting is low and, as the periodicity of the fully connected kernel placement increases, it becomes negligible. Fig. 16 shows the accuracy vs. FLOPs relation for the different architecture variants.^{9} ^{9}We consider only the FLOPs associated with the convolution layers because they generally represent the vast majority of FLOPs. Models whose points lie towards the top-left achieve better accuracy with fewer FLOPs. In particular, for the VGG16 and ResNet18 variants on CIFAR-10 and the VGG16 variants on Tiny ImageNet, boosting performs consistently well, whereas, as we can see from Fig. 16(d), boosting is not as beneficial for ResNet18 on Tiny ImageNet. In general, with modest computation overhead, boosting consistently improves accuracy for models with extremely low KSS and maintains high accuracy otherwise.
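The FLOPs accounting used in such comparisons can be approximated by a simple formula; this sketch (our own simplification) counts two FLOPs per multiply-accumulate, considers only the convolution layers, and scales the kernel cost by KSS for pre-defined sparse 3 × 3 kernels:

```python
def conv_flops(h_out, w_out, c_in, c_out, kss):
    """Approximate FLOPs of one conv layer with pre-defined sparse 3x3
    kernels: each output pixel needs c_in * kss multiply-accumulates
    per output channel, at 2 FLOPs per MAC."""
    return 2 * h_out * w_out * c_in * c_out * kss
```

A layer with KSS = 4 therefore costs 4/9 of its dense (KSS = 9) counterpart.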
It is important to emphasize that the overall parameter overhead is a function of both the periodicity and the KSS, as exemplified by the four sparse models described in Table X and analyzed using the storage requirement formulas in Table IV. Comparing models 1 and 2, which have the same sparse KSS, shows the impact of periodicity, as does comparing models 3 and 4. In contrast, comparing models 1 and 3 shows the impact of KSS for a fixed periodicity, as does comparing models 2 and 4. The last two columns of the table give the parameter counts normalized with respect to the baseline model. Averaging across the four examples, the table shows that the periodic format reduces the overall parameter count, including the sparse matrix representation, by compared to the non-periodic format. Perhaps more importantly, the results show that the periodic format can reduce the overall parameter count by as much as compared to the baseline model.
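A rough version of this accounting can be sketched as follows; the bit widths and the function are our own illustrative assumptions (the exact formulas are those of Table IV), but they capture why periodicity shrinks the index overhead: index vectors are stored once per period rather than once per kernel:

```python
def sparse_storage_bits(num_kernels, kss, period, bits_w=32, bits_idx=4):
    """Illustrative storage cost (bits) of pre-defined sparse 3x3 kernels:
    nonzero weights are stored for every kernel, but with periodicity the
    index vectors are stored for only one period of kernels."""
    weight_bits = num_kernels * kss * bits_w
    index_bits = period * kss * bits_idx  # period == num_kernels if non-periodic
    return weight_bits + index_bits
```

For 64 kernels with KSS = 4, a period of 8 stores 8320 bits under these assumptions, versus 9216 bits when every kernel carries its own indices (period = 64).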
To better evaluate the space and choice of KVs, we generated model variants with six different random seeds. We tested VGG16 and ResNet18 models with KSS of 4 and 2 on CIFAR-10 and Tiny ImageNet. We observed differences of less than between the minimum and maximum classification accuracy across the different seeds. In particular, for ResNet18_PSD2_P8 and ResNet18_PSD4_P8, the gaps between the minimum and maximum accuracy are 0.55% and 0.44%, respectively, averaged over the two datasets. For VGG16_PSD2_P8 and VGG16_PSD4_P8, both values are 0.65%.
Lastly, to demonstrate that boosting has general benefits, Table XI shows the results of boosting on Tiny ImageNet when the FC kernels are placed with a fixed period in between sparse kernels that have no pre-defined KVS or kernel variants (as described in Section 5.2).^{10} ^{10}For the CIFAR-10 dataset we obtained similar results, with ResNet18_pSC4_P8 exceeding the baseline performance with an average test accuracy of 92.95%. Note that, as with the networks described in Section 5.2, the lack of structure gives these models higher indexing overhead than the periodic models analyzed above.
Because ShuffleNet [zhang2018shufflenet] and MobileNetV2 [sandler2018mobilenetv2] are two widely accepted low-complexity CNN architectures, we compared them with our proposed pre-defined periodic sparse models that have similar or fewer FLOPs.^{11} ^{11}We kept the hyperparameters for MobileNetV2 training the same as for ResNet18, except the weight decay, which was set to 0 as recommended by the original paper [sandler2018mobilenetv2]. In particular, Fig. 17(a) shows that for CIFAR-10 the ResNet18_PSD1_P16 increases accuracy to 92%, compared to the baseline MobileNetV2 (ShuffleNet) accuracy of 90.3% (). Our accuracies are also superior to those reported in [lawrence2019iotnet] and only around 1% less than the accuracy reported in [she2019scienet], which was trained for 180 additional epochs. The pre-defined sparse CNN model VGG16_PSD1_P8, with 0.073 GFLOPs, has approximately () lower computational complexity yet still outperforms MobileNetV2 (ShuffleNet) in terms of accuracy. For Tiny ImageNet, as shown in Fig. 17(b), our best classifying model provides an accuracy improvement of 3.2% with only 4% (2.6%) higher complexity compared to MobileNetV2 (ShuffleNet).
Moreover, as we can see from Fig. 18(a) and (b), with () fewer parameters our proposed models perform similarly to ShuffleNet on Tiny ImageNet (CIFAR-10). Similarly, the parameter requirements of our proposed models with accuracy similar to MobileNetV2 are and lower for CIFAR-10 and Tiny ImageNet, respectively.^{12} ^{12}These values can be translated to normalized parameter counts with the help of the formulas in Table IV.
Squeezing the network layers, i.e., reducing the number of channels per 3D filter by a factor α (< 1.0), popularly known as the width multiplier, is another simple technique to reduce a network's FLOPs and storage requirements [howard2017mobilenets, iandola2016squeezenet, tan2019efficientnet]. To further establish the idea of pre-defined periodic sparsity, we apply our proposed kernels to a squeezed variant of the ResNet18 architecture with α = 0.5. The important network model parameters of the squeezed variants of the ResNet18 and MobileNetV2 models are described in Table XII. With the same hyperparameter settings as stated in Section 5.1, the baseline accuracies for ResNet18 with α = 0.5 are 91.1% for CIFAR-10 and 59.1% for Tiny ImageNet. We trained several variants of this squeezed model with KSS values of 4, 2, and 1, each with the fully connected kernel repeating after every 8 and 16 kernels. Fig. 19 shows that our proposed variants of squeezed ResNet18 consistently outperform both MobileNetV2_0.75 and MobileNetV2 in classification accuracy while keeping the number of FLOPs similar or lower. In particular, Fig. 19(a) shows that on the CIFAR-10 dataset, the squeezed ResNet18 with a KSS of 2 and a periodicity of 16 requires fewer FLOPs than MobileNetV2 to provide similar accuracy. Also, the ResNet18 variant that requires the fewest FLOPs provides improved accuracy with fewer computations compared to MobileNetV2_0.75. A similar trend is observed for Tiny ImageNet, as shown in Fig. 19(b). Averaged over the two datasets, the proposed squeezed ResNet18 variants provide similar accuracy with and fewer FLOPs compared to MobileNetV2_0.75 and MobileNetV2, respectively. On the same datasets, when we constrain the number of FLOPs to be similar, pre-defined periodic sparsity provides an average accuracy improvement of and compared to MobileNetV2 with α of 0.75 and 1.0, respectively.
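The width-multiplier squeezing itself is straightforward; a minimal sketch (hypothetical helper, with the usual floor-and-clamp rounding) is:

```python
def squeeze_channels(channels, alpha):
    """Apply a width multiplier alpha (< 1.0) to per-layer channel
    counts, keeping at least one channel per layer."""
    return [max(1, int(c * alpha)) for c in channels]
```

Squeezing the ResNet18-style widths `[64, 128, 256, 512]` with alpha = 0.5 gives `[32, 64, 128, 256]`; because conv FLOPs scale with both input and output channel counts, this cuts the convolutional FLOPs by roughly 4x.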
The model parameter reduction factors are proportional to the computation reductions, and because the ResNet18_0.5 model has a parameter count comparable to MobileNetV2, the storage advantage of the sparse ResNet18_0.5 variants is clear; we therefore omit the details for brevity.
This paper showed that, with pre-defined sparsity in the convolutional kernels, network models can achieve significant parameter reductions during both training and inference without significant accuracy drops. However, managing sparsity requires matrix indexing overhead in terms of storage and energy efficiency. To address this shortcoming, we added periodicity to the sparsity, reusing the same sparse kernel patterns periodically across the convolutional layers, which significantly reduces the indexing overhead.
Furthermore, to deal with the performance degradation due to pre-defined sparsity, we introduced a low-cost network architecture modification technique in which FC kernels are periodically inserted in between the sparse kernels. Experimental results showed that, compared to the sparse periodic variants, this boosting technique improves classification accuracy by up to , averaged over periodicities of 8 and 16, for the ResNet18 and VGG16 architectures on CIFAR-10 and Tiny ImageNet. We also demonstrated the merits of the proposed architectures with squeezed variants of ResNet18 (width multiplier < 1.0), showing that they outperform MobileNetV2 by an average accuracy of with similar FLOPs.
Our future work includes exploring additional forms of compressed sparse representations and their hardware support.
Lastly, we note that much of our findings are empirical in nature. Finding a more theoretical basis that can motivate and guide the use of periodic pre-defined sparsity in deep learning is also an important area of future work.