NeurIPS 2019 MicroNet Challenge
Deep neural networks (DNNs) have shown remarkable success in a variety of machine learning applications. The capacity of these models (i.e., the number of parameters) endows them with expressive power and allows them to reach the desired performance. In recent years, there has been increasing interest in deploying DNNs on resource-constrained devices (e.g., mobile devices) with limited energy, memory, and computational budget. To address this problem, we propose Entropy-Constrained Trained Ternarization (EC2T), a general framework to create sparse and ternary neural networks that are efficient in terms of storage (at most two binary masks and two full-precision values are required to save a weight matrix) and computation (MAC operations are reduced to a few accumulations plus two multiplications). The approach consists of two steps. First, a super-network is created by scaling the dimensions of a pre-trained model (i.e., its width and depth). Subsequently, this super-network is simultaneously pruned (using an entropy constraint) and quantized (that is, ternary values are assigned layer-wise) in a training process, resulting in a sparse and ternary network representation. We validate the proposed approach on the CIFAR-10, CIFAR-100, and ImageNet datasets, showing its effectiveness in image classification tasks.
Finding Storage- and Compute-Efficient Convolutional Neural Networks
Convolutional neural networks (CNNs) have excelled in numerous computer vision applications. Their performance is attributed to their design: deeper (i.e., designed with many layers) and high-capacity (i.e., equipped with many parameters) CNNs achieve better performance in a given task, at the cost of computational and memory efficiency. This general trend has been disrupted by the need to deploy neural networks on resource-constrained devices (e.g., autonomous vehicles, robots, smartphones, wearables, and IoT devices) with limited energy, memory, and computational budget, as well as low-latency and/or low-communication-cost requirements. Thus, driven by both industry and the scientific community, the design of efficient CNNs has become an active area of research. Moreover, the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO) joined this endeavor and recently issued a call on neural network compression techniques [1].
Recent studies have shown that most CNNs are over-parameterized for the given task [2]. Such models can be interpreted as super-networks, designed with millions of parameters to reach a target performance (e.g., high classification accuracy) while being inefficient in memory and computation. However, from these models, it is possible to find a small and efficient sub-network with comparable performance. This hypothesis has been validated with simple methods, e.g., by pruning neural network connections based on the weights' magnitude [3], resulting in little accuracy degradation. Moreover, the recently proposed lottery-ticket hypothesis [4] supports the existence of an optimal sub-network inside a super-network, and has been shown to generalize across different datasets and optimizers [5].
Among existing network compression techniques, pruning and quantization are two popular and effective ways to reduce the redundancy of deep neural networks [6]. Pruning entails systematically removing network connections in a structured (i.e., by removing groups of parameters) or unstructured fashion (i.e., by removing individual parameter elements) [7]. In contrast, quantization minimizes the bit-width of the network parameter values (and thus, the number of distinct values) [8, 9]. From another perspective, efficient neural networks can be designed by finding the right balance between their dimensions, i.e., the network's width, depth, and input resolution. In this regard, compound model scaling [10] allows scaling the dimensions of a baseline network according to heuristic rules grounded on computational efficiency.
In this work, we propose Entropy-Constrained Trained Ternarization (EC2T), a method that leverages compound model scaling [10] and ternary quantization techniques [9] to design a sparse and ternary neural network. The motivation behind such a network representation is efficiency. Specifically, in terms of storage, at most two binary masks and two full-precision values are required to represent and save each layer's weight matrix. Regarding mathematical operations, multiply-accumulate operations (MACs) are reduced to a few accumulations plus two multiplications. The EC2T approach is illustrated in Figure 1 and consists of two stages. In the first stage, a super-network is created by scaling the dimensions of a baseline network (its width and depth). Subsequently, during a training stage, a sparse and ternary sub-network is found by simultaneously pruning (enforced by introducing an entropy constraint in the assignment cost function) and quantizing (ternary values are assigned layer-wise) the super-network. Specifically, our contributions are:
We propose an approach to design sparse and ternary neural networks, that relies on compound model scaling [10] and quantization techniques. For the latter, we extend the approach described in [9] by introducing an assignment cost function in terms of distance and entropy constraints. The entropy constraint allows adjusting the trade-off between sparsity and accuracy in the quantized model. Therefore, quantized models with different levels of sparsity can be rendered, according to the compression and application requirements.
Our approach allows simultaneous quantization and sparsification in a single training stage.
In the context of image classification, the proposed approach finds sparse and ternary networks across different datasets (CIFAR-10, CIFAR-100, and ImageNet), whose performance is competitive with efficient state-of-the-art models.
This paper is organized as follows. First, in section 2, a literature review of techniques to design efficient neural networks is provided, emphasizing those that are related to our approach. Subsequently, in section 3, the proposed EC2T approach is detailed. Afterward, in section 4, we present experimental evidence and results, validating the proposed method across different networks and datasets. Finally, in section 5, we discuss the insights of the EC2T approach, its advantages and downsides, and future work.
In recent years, various techniques have been proposed in the literature to design efficient neural networks, e.g., pruning, quantization, distillation, and low-rank factorization [6]. In particular, pruning and quantization provide unique benefits to DNNs in terms of hardware efficiency and acceleration.
Pruning removes non-essential neural network connections according to different criteria, either in groups (structured pruning) or as individual parameters (unstructured pruning). Specifically, the second approach is achieved by maximizing the sparsity of the network parameters, i.e., the percentage of zero-valued parameter elements in the whole neural network. Consequently, the computational complexity of the network is reduced, since arithmetic operations can be skipped for those parameter elements which are zero [11]. Early works on sparsity use second-order derivatives (Hessian) to compute the saliency of parameters, suppressing those with the smallest value [12, 13]. Current state-of-the-art techniques to promote sparsity in DNNs rely either on magnitude-based pruning or Bayesian approaches [14]. Magnitude-based pruning is the simplest and most effective way to induce sparsity in neural networks [7]. In contrast, Bayesian approaches, although computationally expensive, represent an elegant solution to the problem. Moreover, they establish connections with information theory. In this context, variational dropout [15] and $L_0$-regularization [16] are two representative techniques.
Regarding quantization, it reduces the redundancy of deep neural networks by minimizing the bit-width of the full-precision parameters. Therefore, quantized networks require fewer bits to represent each weight and demand fewer mathematical operations than their full-precision counterparts. Binary networks [17, 18] represent an extreme case of quantization where both weights and activations are binarized. Thus, arithmetic operations are reduced to bit-wise operations. By introducing three distinct elements per layer, ternary networks achieve more expressive power and higher performance than binary networks. Moreover, sparsity can be induced in the network by including zero as a quantized value, while the remaining values are modeled with scaling factors per layer. Following this approach, [19] proposed to minimize the Euclidean distance between full-precision and quantized parameters (e.g., $\lVert W - W^{q} \rVert_2$), where the latter are symmetrically constrained (e.g., $W^{q} \in \{-a, 0, +a\}$, with $a > 0$). In contrast, [9] used asymmetric constraints (e.g., $W^{q} \in \{-a, 0, +b\}$, with $a > 0$ and $b > 0$), improving the modeling capabilities of ternary networks. Several variants of ternary network quantization exist, e.g., based on Truncated Gaussian Approximation (TGA) [20], Alternating Direction Method of Multipliers (ADMM) [21], and Multiple-Level-Quantization (MLQ) [22], among others. With regard to hardware efficiency, ternary networks represent a trade-off between binary networks (extremely hardware-friendly, but with limited modeling capabilities) and their full-precision counterparts (with higher modeling capabilities, but expensive in terms of storage and computational resources) [19].

Usually, highly efficient network representations are the result of combining multiple techniques, for instance, pruning followed by quantization [23, 24], in addition to entropy coding [25, 26, 27]. From a different perspective, progress in designing efficient neural networks has been fueled by advances in hand-crafted architectures (e.g., MobileNet [28], MobileNet-V2 [29], and ShuffleNet [30]) as well as neural architecture search techniques (e.g., MnasNet [31], EfficientNet [10], and MobileNet-V3 [32]). Moreover, simpler methods such as model scaling allow increasing the performance of a baseline network by scaling one or more dimensions (i.e., its depth, width, and input resolution) independently [31, 32]. In [10], this approach is improved with the introduction of compound model scaling, where the network dimensions are treated as dependent variables, constrained by a limited number of resources, measured in terms of floating-point operations (FLOPs).
In this research work, we advocate for compound model scaling, ternary quantization, and information theory techniques, as the core building blocks to design a CNN with optimal dimensions (i.e., the right balance between the networks’ width and depth) and efficient parameter representation (i.e., three distinct values per layer and maximal sparsity).
The entropy-constrained trained ternarization (EC2T) approach (see Figure 1), consists of two stages, namely compound model scaling followed by ternary quantization, both described in sections 3.1 and 3.2, respectively.
In this stage, a super-network is created by scaling the dimensions of a pre-trained model, resulting in an over-parameterized network. Specifically, the pre-trained network's depth, width, and input image resolution are modified with the scaling factors $d$, $w$, and $r$, respectively, according to Equation (1). In this equation, $\alpha$, $\beta$, and $\gamma$ are constants determined by grid search, and $\phi$ is a user-specified parameter. For small-scale datasets (CIFAR-10 and CIFAR-100), the input image resolution was fixed in the pre-trained model. Thus, Equation (1) was solved with $\gamma = 1$. On the other hand, for the large-scale dataset (ImageNet), the EfficientNet-B1 network was adopted using the scaling factors suggested in [10].

$$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}, \quad \text{s.t.} \;\; \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \;\; \alpha \geq 1, \; \beta \geq 1, \; \gamma \geq 1 \tag{1}$$
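As an illustration of the scaling step, the following is a minimal sketch that assumes Equation (1) has the compound-scaling form of [10], in which depth, width, and resolution grow as powers of a single coefficient. The function names are hypothetical. The constants 1.2, 1.1, and 1.15 are those reported for EfficientNet in [10], whereas the values used for the fixed-resolution case are purely illustrative and are not taken from this work.

```python
def compound_scale(alpha, beta, gamma, phi):
    """Return the (depth, width, resolution) multipliers for coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi


def flops_growth(alpha, beta, gamma, phi):
    """Approximate FLOPs growth: linear in depth, quadratic in width and resolution."""
    depth, width, resolution = compound_scale(alpha, beta, gamma, phi)
    return depth * width ** 2 * resolution ** 2


# Constants reported for EfficientNet in [10]; phi = 1 is an illustrative choice.
d, w, r = compound_scale(1.2, 1.1, 1.15, phi=1.0)
growth = flops_growth(1.2, 1.1, 1.15, phi=1.0)  # close to 2 by construction

# Fixed input resolution: a resolution constant of 1 keeps r at 1 for any phi.
# alpha = 1.4 and beta = 1.2 are hypothetical values, not taken from the paper.
d_c, w_c, r_c = compound_scale(1.4, 1.2, 1.0, phi=2.0)
```

With the resolution constant fixed to one, only depth and width remain to be searched, which matches the setting described for the small-scale datasets.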
In this stage, a sparse and ternary sub-network is obtained by simultaneously pruning and quantizing a super-network. To this end, we extend the approach described in [9], where a ternary network is obtained through the interplay between quantized and full-precision models. That is, gradients from the quantized model are used to update both its parameters and those of the full-precision model. The first parameter update enables the learning of ternary values (i.e., only two scalar values per layer are learned, while the third quantized value, which is zero, is excluded from the learning process). The second parameter update promotes the learning of ternary assignments (i.e., by adapting the full-precision parameters to the quantization process). Nonetheless, this approach does not allow explicit control of the sparsification process. To overcome this limitation, we introduce the assignment cost function shown in Equation (2), which guides the assignment (with centroid indices) of ternary values (or centroid values) in the quantized network, in terms of distance and entropy constraints.
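$$C^{(l)}_{c,ij} = \operatorname{dist}\left(W^{(l)}_{ij}, w^{(l)}_{c}\right) - \lambda^{(l)} \log_2 P^{(l)}_{c} \tag{2}$$

$$\operatorname{dist}\left(W^{(l)}_{ij}, w^{(l)}_{c}\right) = \left(W^{(l)}_{ij} - w^{(l)}_{c}\right)^{2} \tag{3}$$

In Equation (2), $C^{(l)}$ stands for the assignment cost for the full-precision weights $W^{(l)}$ at layer $l$, given the centroid values $w^{(l)}_{c}$, indexed by $c$. Therefore, if $W^{(l)}$ has dimensions $m \times n$ and there are $K$ centroid values in that layer, then $C^{(l)} \in \mathbb{R}^{K \times m \times n}$. The first term in Equation (2) measures the distance between every full-precision weight element $W^{(l)}_{ij}$ (where $i$ and $j$ are indices along the dimensions of $W^{(l)}$) and the centroid values $w^{(l)}_{c}$, according to Equation (3). The second term in Equation (2), weighted by the scalar $\lambda^{(l)}$, is an entropy constraint which promotes sparsity in the quantized model. This is achieved by measuring the information content of the quantized weights, i.e., $I^{(l)}_{c} = -\log_2 P^{(l)}_{c}$, where the probability $P^{(l)}_{c}$ defines how likely a weight element is going to be assigned to the centroid value $w^{(l)}_{c}$. This probability is calculated for each layer as $P^{(l)}_{c} = N^{(l)}_{c} / N^{(l)}$, with $N^{(l)}_{c}$ being the number of full-precision weight elements assigned to the centroid value $w^{(l)}_{c}$, and $N^{(l)}$ the total number of parameters in $W^{(l)}$.

After computing Equation (2) (for all layers and centroid values), the quantized model is updated at layer $l$ by assigning the current centroid values ($w^{(l)}_{c}$), using the new centroid indices ($A^{(l)}$) obtained from Equation (4). In this equation, the assignment matrix $A^{(l)}$ has the dimensions of the full-precision weights $W^{(l)}$. For ternary networks, we define the centroid values as $w^{(l)}_{c} \in \{w^{(l)}_{n}, w^{(l)}_{0}, w^{(l)}_{p}\}$, with $w^{(l)}_{0} = 0$, and their assignments with the indices $c \in \{n, 0, p\}$. In this notation, the indices $n$, $0$, and $p$ correspond to negative, zero, and positive values, respectively.

$$A^{(l)}_{ij} = \underset{c}{\arg\min} \; C^{(l)}_{c,ij} \tag{4}$$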
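The assignment step can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the distance is taken to be the squared difference, the probabilities are estimated from a plain nearest-centroid assignment, and the names `ec_assign` and `lam` are hypothetical.

```python
import numpy as np


def ec_assign(weights, centroids, lam):
    """Assign each weight to the centroid minimizing distance - lam * log2(P_c)."""
    w = np.asarray(weights, dtype=np.float64)
    c = np.asarray(centroids, dtype=np.float64)
    c_b = c.reshape(-1, *([1] * w.ndim))          # broadcast over weight dims
    dist = (w[None, ...] - c_b) ** 2              # one cost plane per centroid
    # Assumption: P_c is estimated from the nearest-centroid assignment.
    nearest = dist.argmin(axis=0)
    counts = np.bincount(nearest.ravel(), minlength=c.size)
    probs = np.maximum(counts / w.size, 1e-12)    # guard against log2(0)
    info = -np.log2(probs).reshape(-1, *([1] * w.ndim))
    cost = dist + lam * info                      # distance plus entropy term
    return cost.argmin(axis=0)                    # centroid index per weight


rng = np.random.default_rng(0)
w_full = rng.normal(0.0, 1.0, size=(64, 64))
centroids = [-1.0, 0.0, 1.0]                      # index 1 is the zero centroid

a_plain = ec_assign(w_full, centroids, lam=0.0)   # plain nearest centroid
a_sparse = ec_assign(w_full, centroids, lam=0.5)  # entropy term enabled
sparsity_plain = float(np.mean(a_plain == 1))
sparsity_sparse = float(np.mean(a_sparse == 1))
```

Because the zero centroid is the most frequent one for weights centered around zero, its information content is the smallest, so a positive weight on the entropy term shifts borderline weights toward zero and raises the sparsity.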
During the ternary quantization process, the strength of the sparsification at layer $l$ is modulated by the scalar $\lambda^{(l)}$ that weights the entropy term in Equation (2). As a concrete example, Figure 2 illustrates the effect of using different values of this scalar during the quantization of the parameters (in the first block) of the EfficientNet-B1 network. In practice, $\lambda^{(l)}$ is computed from three quantities: a global hyper-parameter $\lambda$ that controls the intensity of the sparsification, and two scalars computed layer-wise. The first layer-wise scalar is a scaling factor that renders higher values for layers with many parameters and, analogously, lower values for layers with few parameters. The second is updated during training and prevents the quantization from degenerating into a binary one (see the corresponding histogram in Figure 2).
Model | Top-1 Acc. (%) | Sparsity (%) | Params. | FLOPs (Add.) | FLOPs (Mult.) | FLOPs (Total) |
---|---|---|---|---|---|---|
ImageNet | | | | | | |
ResNet-18 | 69.75 | 0.00 | 11M | 1795M | 1797M | 3592M |
EC2T-1 ($\lambda = 0$) | 67.30 | 26.80 | 852K | 669M | 59M | 728M |
EC2T-2 ($\lambda > 0$) | 67.58 | 59.00 | 734K | 560M | 61M | 622M |
EC2T-3 ($\lambda > 0$) | 67.26 | 72.09 | 686K | 528M | 57M | 585M |
EC2T-4 ($\lambda > 0$) | 67.02 | 75.62 | 673K | 424M | 57M | 481M |
TTQ [9] | 66.60 | 30-50 | - | - | - | - |
ADMM [21] | 67.00 | - | - | - | - | - |
TGA [20] | 66.00 | - | - | - | - | - |
CIFAR-10 | | | | | | |
ResNet-20 | 91.67 | 0.00 | 269K | 40.6M | 40.7M | 81.3M |
EC2T-1 ($\lambda = 0$) | 91.16 | 45.17 | 13.4K | 10.6M | 0.5M | 11.1M |
EC2T-2 ($\lambda > 0$) | 91.01 | 63.90 | 11.8K | 8.0M | 0.5M | 8.5M |
EC2T-3 ($\lambda > 0$) | 90.76 | 73.26 | 11.0K | 6.1M | 0.5M | 6.6M |
TTQ [9] | 91.13 | 30-50 | - | - | - | - |
TGA [20] | 90.39 | - | - | - | - | - |
MLQ [22] | 90.02 | - | - | - | - | - |
ResNet-18 and ResNet-20: baseline models.
EC2T-1: EC2T approach with the entropy constraint disabled ($\lambda = 0$).
EC2T-2, EC2T-3, and EC2T-4: EC2T approach with the entropy constraint enabled ($\lambda > 0$).
Sparsity: measured as the percentage of zero-valued parameters in the whole neural network.
A dash (-) or an empty cell: not reported by the authors.
The experiments were conducted in a variety of networks across different datasets (i.e., CIFAR-10, CIFAR-100, and ImageNet), using multiple GPUs (NVIDIA Titan-V and Tesla-V100).
First, to reveal the advantages of our proposal (EC2T) over Trained Ternary Quantization (TTQ) [9], an image classification network was designed for the CIFAR-10 dataset by introducing the building blocks of PyramidNet [33] in the ResNet-44 architecture [34]. This neural network, termed C10-MicroNet, was derived from models designed for the 2019 MicroNet Challenge competition (https://micronet-challenge.github.io). For a detailed description of the network architecture, see Appendix A. The experimental results contrasting the two approaches are depicted in Figure 3. Notice that, as the sparsity of the quantized networks increases, EC2T shows less accuracy degradation than TTQ.
Subsequently, Table 1 compares the EC2T approach with state-of-the-art ternary quantization techniques, by applying them to the ResNet-20 and ResNet-18 networks on the CIFAR-10 and ImageNet datasets, respectively. From these results, we draw two main conclusions. First, disabling the entropy constraint in Equation (2) (i.e., setting $\lambda = 0$) renders ternary models with low sparsity. Nonetheless, they are more efficient than their full-precision counterparts and show little accuracy degradation. These ternary networks are referred to as EC2T-1 in Table 1. Specifically, on the ImageNet dataset, the EC2T-1 model reduces the parameter count by 92.25% and the FLOPs by 79.73%, while on the CIFAR-10 dataset, the reductions are 95.02% and 86.35% in parameter count and FLOPs, respectively. In contrast, enabling the entropy constraint in Equation (2) (i.e., setting $\lambda > 0$) results in ternary models with increased sparsity, which are thus more efficient in terms of parameter size and mathematical operations. For instance, on the ImageNet dataset, the model with the highest sparsity is EC2T-4, which reduces the number of parameters by 93.88% and the number of FLOPs by 86.61%, while its accuracy is degraded by only 2.73 percentage points. Likewise, on the CIFAR-10 dataset, the model with the highest sparsity is EC2T-3, with an accuracy degradation of 0.91 percentage points, while the parameter count and FLOPs are reduced by 95.91% and 91.88%, respectively. The second conclusion is that the EC2T approach renders accurate ternary models, which are competitive with state-of-the-art techniques. Regarding sparsity, only [9] provides an estimated value for the ternary models after applying TTQ (30%-50%). For the remaining techniques (ADMM [21], TGA [20], and MLQ [22]), only the quantized model accuracy is reported.

Finally, Table 2 contrasts efficient state-of-the-art neural networks with the sparse and ternary networks rendered by our proposal, on three distinct datasets (CIFAR-10, CIFAR-100, and ImageNet). The former models include CondenseNet [35], MobileNet-V2 [29], and MobileNet-V3 [32]. The latter models result from applying the EC2T approach to the pre-trained networks C10-MicroNet, C100-MicroNet, and EfficientNet-B1 [10]. In particular, the C10-MicroNet and C100-MicroNet networks were designed and improved based on our submissions to the 2019 MicroNet Challenge.
Both share the same topology, except in the last layer (i.e., the softmax layer), which is adapted to the number of output classes (see Appendix A). From the results in Table 2, we highlight two points. First, the ternary networks found by our proposed technique (see the models indicated with EC2T) are more efficient in terms of parameter size and FLOPs than their respective baselines (C10-MicroNet, C100-MicroNet, and EfficientNet-B1). Moreover, using the tree adder [36] and efficient matrix representations (including the Compressed-Entropy-Row (CER)/Compressed-Sparse-Row (CSR) formats [11] and the method described in Appendix B) leads to further savings in mathematical operations and storage (see the rows labeled Improvements). Second, these ternary models are competitive with current state-of-the-art efficient neural networks (i.e., CondenseNet, MobileNet-V2, and MobileNet-V3), offering similar advantages in terms of memory and computational resources.

Model | Top-1 Acc. (%) | Sparsity (%) | Params. | FLOPs (Add.) | FLOPs (Mult.) | FLOPs (Total) |
---|---|---|---|---|---|---|
ImageNet | | | | | | |
EfficientNet-B1 | 78.43 | 0.00 | 7.72M | 654M | 670M | 1324M |
EC2T ($\lambda > 0$) | 75.05 | 60.73 | 1.07M | 338M | 50M | 387M |
Improvements | - | - | 972K | 212M | 50M | 261M |
MobileNet-V2 (d=1.4) | 74.70 | - | 6.90M | - | - | 585M |
MobileNet-V3 (Large) | 75.20 | - | 5.40M | - | - | 219M |
CIFAR-100 | | | | | | |
C100-MicroNet | 81.47 | 0.00 | 8.03M | 1243M | 1243M | 2487M |
EC2T ($\lambda > 0$) | 80.13 | 90.49 | 412K | 126M | 3M | 129M |
Improvements | - | - | 226K | 67M | 3M | 71M |
CondenseNet-86 | 76.36 | - | 520K | - | - | 65M |
CondenseNet-182 | 81.50 | - | 4.20M | - | - | 513M |
CIFAR-10 | | | | | | |
C10-MicroNet | 97.02 | 0.00 | 8.02M | 1243M | 1243M | 2487M |
EC2T ($\lambda > 0$) | 95.87 | 95.64 | 295K | 72M | 3M | 75M |
Improvements | - | - | 133K | 39M | 3M | 42M |
CondenseNet-86 | 95.00 | - | 520K | - | - | 65M |
CondenseNet-182 | 96.24 | - | 4.20M | - | - | 513M |
EfficientNet-B1, C100-MicroNet, and C10-MicroNet: baseline models.
EC2T: EC2T approach with the entropy constraint enabled ($\lambda > 0$).
Improvements: improved representation of the neural network parameters by applying the tree adder, the Compressed-Entropy-Row (CER)/Compressed-Sparse-Row (CSR) formats, and the method described in Appendix B.
Sparsity: measured as the percentage of zero-valued parameters in the whole neural network.
Some operation counts are reported as Multiply-Additions (MAdds). The number of FLOPs is approximately twice this value.
A dash (-) or an empty cell: not reported by the authors.
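The reduction percentages quoted for Table 1 follow from a simple relative difference with respect to the baseline. The sketch below recomputes them for the ImageNet EC2T-4 model and the CIFAR-10 EC2T-3 model from the rounded values printed in the table; the helper name `reduction` is hypothetical.

```python
def reduction(baseline, compressed):
    """Relative reduction with respect to the baseline, in percent."""
    return 100.0 * (1.0 - compressed / baseline)


# ImageNet: ResNet-18 (11M params, 3592M FLOPs) vs. EC2T-4 (673K, 481M).
imagenet_params = reduction(11e6, 673e3)
imagenet_flops = reduction(3592e6, 481e6)

# CIFAR-10: ResNet-20 (269K params, 81.3M FLOPs) vs. EC2T-3 (11.0K, 6.6M).
cifar_params = reduction(269e3, 11.0e3)
cifar_flops = reduction(81.3e6, 6.6e6)

# Accuracy drops are differences in percentage points, not relative changes.
imagenet_acc_drop = 69.75 - 67.02
cifar_acc_drop = 91.67 - 90.76
```

Rounded to two decimals, these values agree with the figures stated in the discussion of Table 1.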
In this work, we presented Entropy-Constrained Trained Ternarization, an approach that relies on compound model scaling and ternary quantization to design efficient neural networks. By incorporating an entropy constraint during the network quantization process, a sparse and ternary model is rendered, which is efficient in terms of storage and mathematical operations. The proposed approach has been shown to be effective in image classification tasks on both small- and large-scale datasets. As future work, this method will be investigated in other tasks and scenarios, e.g., federated learning [37]. Moreover, interpretability techniques [38] will help us understand how these models make predictions given their constrained parameter space.
This work was funded by the German Ministry for Education and Research as BIFOLD - Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025A and ref. 01IS18037I).
Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11438–11446, 2019.
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):e0130140, 2015.

The C10-MicroNet and C100-MicroNet networks were designed for the CIFAR-10 and CIFAR-100 datasets, respectively. They share the same architecture described in Table A.1, which consists of three sections of layers. The first section is represented by the input layer or "Stem Convolution". The next section has three stages, each one containing identical building blocks, whose elements are depicted in Figure A.1. This block was designed by introducing the building blocks of PyramidNet [33] in the ResNet-44 architecture [34]. The third section consists of a global average-pooling layer followed by a fully-connected layer. Finally, as an important remark, when applying the Entropy-Constrained Trained Ternarization (EC2T) approach, the first and last layers are not quantized.
Stage | Operation | Resolution | Output Channels | Repetitions |
---|---|---|---|---|
Stem Convolution () | ||||
BN ReLU | 1 | |||
1 | Building Block | |||
2 | Building Block | |||
3 | Building Block | |||
ReLU Global Avg. Pooling | 1 | |||
Fully-Connected | 1 |
In addition to the trainable network parameters, we count those values that are needed to reconstruct the model from sparse matrix formats, i.e., binary masks or indices. Specifically, full-precision parameters (32 bits) count as one, while quantized parameters (with fewer than 32 bits) count as a fraction of a parameter. For instance, a binary mask element counts as 1/32 with respect to a full-precision (32-bit) parameter.
If the Compressed-Entropy-Row (CER)/Compressed-Sparse-Row (CSR) formats are not applied, a ternary convolution layer of size $c_{in} \times k \times k \times c_{out}$ consists of two binary masks, as illustrated in Figure B.1. One mask indicates the location of the centroid values (see Figure B.1b), while the other describes the sign of those values (see Figure B.1c). Thus, the parameter count for these masks is $(c_{in} \cdot k^{2} \cdot c_{out})/32$ and $(1 - s) \cdot (c_{in} \cdot k^{2} \cdot c_{out})/32$, respectively. In this notation, $c_{in}$ is the number of effective input channels, $k$ the kernel size, $c_{out}$ the number of effective output channels, and $s$ the layer's sparsity, with $0 \leq s \leq 1$. The effective number of channels is computed as the original number of channels minus the number of channels pruned by the Entropy-Constrained Trained Ternarization (EC2T) approach. To calculate the layer's sparsity, we exclude the pruned channels. The third matrix in Figure B.1 uses two 16-bit numbers to represent the centroid values. Thus, they count as a single full-precision (32-bit) parameter (Figure B.1d). For the batch normalization layers, we add a 16-bit value (bias) per effective output channel. Therefore, their corresponding parameter count is $c_{out}/2$.
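The two-mask representation and the counting rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, and the sketch assumes that sign bits are stored only for the non-zero elements. It also shows how a matrix-vector product with a ternary matrix reduces to accumulations plus two multiplications per output.

```python
import numpy as np


def encode_ternary(q):
    """Split a ternary matrix {w_n, 0, w_p} into two binary masks and two scalars."""
    q = np.asarray(q, dtype=np.float32)
    nonzero = q != 0                      # location of the centroid values
    positive = q > 0                      # sign of the stored values
    w_p = float(q[positive].max()) if positive.any() else 0.0
    w_n = float(q[q < 0].min()) if (q < 0).any() else 0.0
    return nonzero, positive, w_n, w_p


def decode_ternary(nonzero, positive, w_n, w_p):
    """Rebuild the ternary matrix from the masks and the two centroid values."""
    return np.where(nonzero, np.where(positive, w_p, w_n), 0.0).astype(np.float32)


def ternary_matvec(nonzero, positive, w_n, w_p, x):
    """Compute W @ x with accumulations plus two multiplications per output."""
    pos_sum = np.where(nonzero & positive, x, 0.0).sum(axis=1)   # accumulate only
    neg_sum = np.where(nonzero & ~positive, x, 0.0).sum(axis=1)  # accumulate only
    return w_p * pos_sum + w_n * neg_sum


def mask_param_count(nonzero):
    """Count storage in units of 32-bit parameters (one mask bit = 1/32)."""
    location_mask = nonzero.size / 32
    sign_mask = int(nonzero.sum()) / 32   # assumption: signs only for non-zeros
    centroid_values = 1.0                 # two 16-bit values = one 32-bit parameter
    return location_mask + sign_mask + centroid_values


q = np.array([[0.5, 0.0, -0.3], [0.0, 0.0, 0.5]], dtype=np.float32)
x = np.array([1.0, 2.0, 3.0])
masks = encode_ternary(q)
restored = decode_ternary(*masks)
y = ternary_matvec(*masks, x)
count = mask_param_count(masks[0])
```

For the small example matrix, the count is 6/32 for the location mask, 3/32 for the sign mask, and one parameter for the two centroid values.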