There is a growing trend to bring machine learning, especially intelligence powered by deep neural networks (DNNs), to mobile devices[bib:arxiv2019:han]. Many smartphones and handheld devices are integrated with intelligent user interfaces and applications such as hand-input recognition (e.g., iType[bib:li2017:infocom]), speech-based assistants (e.g., Siri), and face-recognition-enabled phone unlock (e.g., FaceID). New development frameworks targeted at mobile devices (e.g., TensorFlow Lite) have been launched to encourage novel DNN-based mobile applications to offload DNN inference to local mobile embedded devices. In addition to smartphones, DNNs are also expected to execute on-device inference on a wider range of mobile and IoT devices, such as wearables[bib:yang2018:infocom] (e.g., Fitbit wristbands) and smart home infrastructures (e.g., Amazon Echo). The diverse applications and the various mobile platforms raise a challenge for DNN developers and users: How to adaptively generate DNNs for different resource-constrained mobile embedded platforms to enable on-device DNN inference, while satisfying the domain-specific application performance requirements?
Generating DNNs for mobile embedded platforms is non-trivial because many successful DNNs are computationally intensive while mobile embedded devices are usually limited in computation, storage and power. For example, LeNet [model:lenet], a popular DNN for digit classification, involves 60k weights and 341k multiply-accumulate operations (MACs) per image. AlexNet [model:alexnet], one of the most famous DNNs for image classification, requires 61M weights and 724M MACs to process a single image. It can become prohibitive to download applications powered by those DNNs to local devices. These DNN-based applications also drain the battery easily if executed frequently.
In view of those challenges, DNN compression techniques have been widely investigated to enable the DNN deployment on mobile embedded platforms by reducing the precision of weights and the number of operations during or after DNN training with desired accuracy. And consequently, they shrink the computation, storage, latency, and energy overhead on a target platform [bib:sze2017:arxiv], [bib:arxiv2019:zhou]. Various categories of DNN compression techniques have been studied, including weight compression [bib:bhattacharya2016:CD-ROM] [bib:arXiv2015:han] [bib:lane2016:IPSN] [pmlr-v80-wu18h], convolution decomposition [bib:ICLR2017:soravit] [bib:arXiv2017:Howard] [bib:cvpr2015:ICCV], and special layer architectures [bib:iandola2016:arxiv] [bib:lin2013:NIN]. However, there are two major problems in existing DNN compression techniques:
Most DNN compression techniques aim to provide a one-for-all solution without considering the diversity of application performance requirements and platform resource constraints. A single compression technique that reduces either model complexity or processing latency may not suffice to meet complex user demands on the generated DNNs. Both the selection of DNN compression techniques and the configuration of DNN compression hyperparameters should be on-demand, i.e., adapt to the requirements and constraints on accuracy, computation, storage, latency, and energy imposed by developers and platforms.
Most DNN compression techniques are manually selected and configured through experience engineering, while the design criteria remain a black-box to non-expert end developers. An automatic compression framework that allows user-defined criteria will benefit the development of DNN-powered mobile applications for diverse domain tasks.
This paper presents AdaDeep, a framework that automatically selects the compression techniques and the corresponding hyperparameters on a layer basis. It adapts to different user demands on application-specified performance requirements (i.e., accuracy and latency) and platform-imposed resource constraints (i.e., computation, storage, and energy budgets).
To integrate these complex user demands into AdaDeep, we formulate the tuning of DNN compression as a constrained hyperparameter optimization problem.
In particular, we define the DNN compression techniques (e.g., weight compression and convolution decomposition techniques listed in 7.2.1) as a new coarse-grained hyperparameter of DNNs.
And we regard the compression hyperparameters (e.g., the width multiplier and the sparsity coefficient enumerated in 7.3.1) as the fine-grained hyperparameters of DNNs.
However, it is intractable to obtain a closed-form solution, due to 1) the large numbers of the coarse-grained hyperparameter, i.e., combinations of DNN compression techniques, 2) the infinite search space of the fine-grained hyperparameters, i.e., compression hyperparameters, and 3) the varying platform resource constraints.
Alternatively, AdaDeep applies a two-phase deep reinforcement learning (DRL) optimizer.
Specifically, it involves a deep Q-network (DQN) optimizer for compression technique selection, and a deep deterministic policy gradient (DDPG) optimizer for the corresponding compression hyperparameter search.
The two optimization phases are conducted interactively to provide a heuristic solution.
We implement AdaDeep with TensorFlow [url:tensorflow] and evaluate its performance over six different public benchmark datasets for DNNs on twelve different mobile devices. Evaluations show that AdaDeep enables a reduction of - in storage, - in latency, - in energy consumption, and - in computational cost, with a negligible accuracy loss () for various datasets, tasks, and mobile platforms.
The main contributions of this work are as follows.
To the best of our knowledge, this is the first work that integrates the selection of both compression techniques and compression hyperparameters into an automated hyperparameter tuning framework, and balances the varied user demands on performance requirements and platform constraints.
We propose a two-phase DRL optimizer to automatically select the best combination of DNN compression techniques as well as the corresponding compression hyperparameters, in a layer-wise manner. AdaDeep extends the automation of DNN architecture tuning to DNN compression.
Experiments show that the DNNs generated by AdaDeep achieve much improved performance, as compared to existing compression techniques under various user demands (datasets, domain tasks, and target platforms). AdaDeep also uncovers some novel combinations of DNN compression techniques suitable for mobile applications.
A preliminary version of AdaDeep has been published in [bib:mobisys2018:liu]. This work further develops [bib:mobisys2018:liu] with the following three new contributions. First, a new DRL optimizer is proposed for fully automating the solving process of the constrained DNN compression problem (see Eq. (2)). Improving upon the one-agent based DQN for both the conv and fc layers, we develop a two-phase DRL optimizer for solving the constrained DNN compression problem in Eq. (2). In particular, in the first phase AdaDeep leverages separate DQN agents for conv and fc layers to select the optimal combination of compression techniques in a layer-wise manner (see 5), and then employs a DDPG optimizer in the second phase to search suitable compression hyperparameters for the selected compression techniques (refer to 6). Second, all the experiments in [bib:mobisys2018:liu] have been updated using the new DRL optimizer to extensively validate its effectiveness. Third, we have conducted experiments on an additional model and dataset (i.e., ResNet [model:resnet] on CIFAR-100 [data:cifar100]) to evaluate AdaDeep in more diverse settings.
In the rest of this paper, we present AdaDeep’s framework in 2, and formulate user demands on performance and resource cost in 3. We present the overview of the automated two-phase DRL optimizer in 4, and elaborate the design of these two types of optimizer in both 5 and 6. We evaluate AdaDeep’s performance in 7, review the related work in 8, discuss limitations and future directions in 9, and finally conclude this work in 10.
This section presents an overview of AdaDeep. From a system-level viewpoint, AdaDeep automatically generates the most suitable compressed DNNs that meet the performance requirements and resource constraints imposed by end developers and the target deployment platforms.
AdaDeep consists of three functional blocks: DNN initialization, user demand formulation, and on-demand optimization (Figure 1). The DNN initialization block selects an initial DNN model for the on-demand optimization block from a pool of state-of-the-art DNN models ( 7.1). The user demand formulation block quantifies the DNN’s performance and cost ( 3), which are then input into the on-demand optimization block as the optimization goals and constraints. The on-demand optimization block takes the initial DNN model and the optimization goals to automatically select the optimal DNN compression techniques and compression hyperparameters that maximize the system performance while satisfying cost budgets ( 5).
Mathematically, AdaDeep aims to solve the following constrained optimization problem.
where , , and denote the measured accuracy, energy cost, latency and storage of a given DNN running on a specific mobile platform. User demands are expressed as a set of goals and constraints on , , and . Specifically, and are the minimal testing accuracy and maximal energy cost acceptable by the user. The two goals on and are combined by importance coefficients and . is a normalization operation, i.e., . We denote and as the user-specified latency and storage budgets. The metrics and can be directly determined by the DNN architecture, while and are also platform-dependent. However, all of them can be tuned by applying different DNN compression techniques and compression hyperparameters. In summary, AdaDeep aims to select the best compression techniques from the set of all possible combinations and search the optimal compression hyperparameter from the set of selective hyperparameter values , according to the user-demands on performance and resource budgets. For completeness, the set should be the permutations and combinations of discrete layer compression techniques at convolutional (conv) layers and fully-connected (fc) layers, defined as . Here, and are the number of optional compression techniques at conv and fc layers, respectively; and represent the number of conv and fc layers to be compressed, respectively; and the set is a continuous real-value space.
We maximize the accuracy and minimize the energy cost while constraining the latency and storage within the user-specified budgets, because we assume that accuracy is the most important performance metric, and that energy efficiency is in general more important than storage and latency for power-sensitive mobile applications. AdaDeep can also integrate other optimization problem formulations.
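The typeset formula is not reproduced in this text, so the following is only a sketch of one form that is consistent with the description above. All symbol names are notational assumptions introduced here: $A$, $E$, $T$, $S$ for the measured accuracy, energy cost, latency and storage; $A_{\min}$ and $E_{\max}$ for the acceptable accuracy and energy bounds; $\mu_1$, $\mu_2$ for the importance coefficients; $N(\cdot)$ for the normalization operation; $T_{bgt}$, $S_{bgt}$ for the budgets; and $\mathcal{C}$, $\mathcal{H}$ for the sets of technique combinations and hyperparameter values.

```latex
\begin{aligned}
\max_{c \in \mathcal{C},\; h \in \mathcal{H}} \quad
  & \mu_1\, N\!\left(A - A_{\min}\right) + \mu_2\, N\!\left(E_{\max} - E\right) \\
\text{s.t.} \quad
  & T \le T_{bgt}, \qquad S \le S_{bgt}
\end{aligned}
```

In this reading, the objective rewards accuracy above the acceptable minimum and energy below the acceptable maximum, while latency and storage act purely as hard budgets.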
Technically, AdaDeep faces two challenges.
It is non-trivial to derive the runtime performance and , and the platform-dependent overhead and of a DNN. In 3, AdaDeep proposes a systematic way to calculate these variables and associates them with the parameters of a DNN and the given platform. We apply the state-of-the-art estimation models and modify them to suit the software/hardware implementation considered in our work. Evaluations show that the proposed estimation models achieve the same ranking as the one measured on the real-world deployment platforms.
It is intractable to obtain a closed-form solution to the optimization problem in Eq.(2). AdaDeep employs the deep reinforcement learning (DRL) based optimization process to solve it (see 4, 5, and 6). Although DRL is a well-known optimization technique, its application in automated DNN architecture and hyperparameter optimization is emerging [zoph2016neural]. We follow this trend and apply two types of layer-wise DRL optimizer, i.e., deep Q-network (DQN) and deep deterministic policy gradient (DDPG), in the context of user-demand DNN compression.
We summarize some symbols in Table I, which are frequently used in this paper.
|latency and storage budgets|
3 User Demand Formulation
This section describes how we formulate the user demand metrics, including accuracy , energy cost , latency and storage , in terms of DNN parameters and platform resource constraints. Such a systematic formulation enables AdaDeep to predict the most suitable compressed DNNs by user needs, before being deployed to mobile devices.
Accuracy . The inference accuracy is defined as:
denote the classifier decision and the true label, respectively, and stands for the sample set in the corresponding mini-batch.
Storage . We calculate the storage of a DNN using the total number of bits associated with weights and activations [pmlr-v70-sakr17a]:
where and denote the storage requirement for the activations and weights, and are the index sets of all activations and weights in the DNN. and denote the precision of activations and weights, respectively. For example, bits in TensorFlow [url:tensorflow].
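The bit-counting model above can be illustrated with a minimal sketch. The function names and the 32-bit default precision are assumptions made here for illustration (32 bits corresponds to single-precision floats); they are not values taken from the paper.

```python
def storage_bits(num_activations: int, num_weights: int,
                 act_bits: int = 32, weight_bits: int = 32) -> int:
    """Total storage in bits: each activation and each weight contributes
    its precision (assumed 32 bits by default) to the total."""
    return num_activations * act_bits + num_weights * weight_bits


def bits_to_megabytes(bits: int) -> float:
    """Convert a bit count to megabytes (1 MB = 10^6 bytes)."""
    return bits / 8 / 1e6


# Weight storage for a model with 61M weights (the AlexNet figure quoted
# in the introduction), ignoring activations.
alexnet_weight_bits = storage_bits(num_activations=0, num_weights=61_000_000)
alexnet_weight_mb = bits_to_megabytes(alexnet_weight_bits)
```

Under these assumptions the weights alone occupy 244 MB, which shows why lowering the weight count or the precision directly shrinks the storage term.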
Computational Cost . We model the computational cost of a DNN as the total number of multiply-accumulate (MAC) operations in the DNN. For example, for a fixed-point convolution operation, the total number of MACs is a function of the weight and activation precision as well as the size of the involved weight and activation vectors [RDSEC_SIPS].
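As a rough illustration of MAC counting, the sketch below uses the common full-precision counting rule for convolutional and fully-connected layers. It deliberately ignores the precision dependence mentioned above, and the function names are hypothetical.

```python
def conv_macs(out_h: int, out_w: int, k_h: int, k_w: int,
              c_in: int, c_out: int) -> int:
    """MACs of a standard convolution: every output position of every
    output channel needs k_h * k_w * c_in multiply-accumulates."""
    return out_h * out_w * k_h * k_w * c_in * c_out


def fc_macs(n_in: int, n_out: int) -> int:
    """MACs of a fully-connected layer: one per weight."""
    return n_in * n_out


def total_macs(layer_macs: list) -> int:
    """The computational cost of the DNN is the sum over its layers."""
    return sum(layer_macs)


# A small example: a 5x5 conv with 1 input and 6 output channels producing
# a 28x28 feature map, followed by a 400-to-120 fc layer.
example_cost = total_macs([conv_macs(28, 28, 5, 5, 1, 6), fc_macs(400, 120)])
```

The conv term grows with the feature map size while the fc term grows only with the weight count, which is one way to see why conv layers tend to dominate computation.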
Latency . The inference latency of a DNN executed on mobile devices strongly depends on the system architecture and memory hierarchy of the given device. We refer to the latency model in [Latency], which has been verified in hardware implementations. Specifically, the latency is derived from a synchronous dataflow model, and is a function of the batch size, the storage and processing capability of the deployed device, as well as the complexity of the algorithms, i.e., DNNs.
Energy Consumption . The energy consumption of evaluating DNNs includes computation cost and memory access cost . The former can be formulated as the total energy cost of all the MACs, i.e., , where and denote the energy cost per MAC operation and the total number of MACs, respectively. The latter depends on the storage scheme when executing DNNs on the given mobile device. We assume a memory scheme in which all the weights and activations are stored in a Cache and DRAM memory, respectively, as such a scheme has been shown to enable fast inference execution [bib:ISCA2016:chen] [bib:arXiv2017:yang][bib:arxiv2017:xu]. Hence can be modeled as:
where and denote the energy cost per bit when accessing the Cache and DRAM memory, respectively. To obtain the energy consumption, we refer to an energy model from a state-of-the-art hardware implementation of DNNs in [bib:arXiv2017:yang], where the energy cost of accessing the Cache and DRAM memory normalized to that of a MAC operation is claimed to be and , respectively. Accordingly:
where is measured to be pJ for mobile devices.
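The structure of this energy model can be sketched as follows. The numeric constants (energy per MAC and the normalized Cache and DRAM ratios) are not reproduced in this text, so they appear only as parameters; the function and parameter names are hypothetical.

```python
def energy_cost(num_macs: int, weight_bits: int, activation_bits: int,
                e_mac: float, cache_ratio: float, dram_ratio: float) -> float:
    """Energy = computation energy + memory access energy.

    e_mac       : energy per MAC operation (platform-dependent)
    cache_ratio : Cache access energy per bit, normalized to one MAC
    dram_ratio  : DRAM access energy per bit, normalized to one MAC

    Following the memory scheme described above, weights are read from
    the Cache and activations from DRAM.
    """
    computation = e_mac * num_macs
    memory = e_mac * (cache_ratio * weight_bits + dram_ratio * activation_bits)
    return computation + memory


# Illustrative call with made-up values; these are placeholders, not the
# constants reported in the paper.
example_energy = energy_cost(num_macs=100, weight_bits=10, activation_bits=5,
                             e_mac=1.0, cache_ratio=2.0, dram_ratio=10.0)
```

Because both memory terms are expressed relative to one MAC, a single measured per-MAC energy is enough to scale the whole estimate to a given platform.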
Summary. The user demand metrics (, , and ) can be formulated with parameters of DNNs (e.g., the number of , the index sets of all activations and weights ) and platform-dependent parameters (e.g., the energy cost per bit). The parameters of DNNs are tunable via DNN compression techniques and compression hyperparameters. Different mobile platforms vary in platform parameters and resource constraints. Hence it is desirable to automatically select appropriate compression techniques and compression hyperparameters to optimize the performance and resource cost for each application and platform.
Note that it is difficult to precisely model the platform-correlated user demand metrics, e.g., and , since they are tightly coupled with the platform diversity. However, the ranking of the DNN costs derived by the above estimation models is consistent with the ranking of the actual costs of these DNNs measured on the real-world deployment platforms. As will be introduced in the next section, the proposed AdaDeep framework is generic and can easily integrate other advanced estimation models.
4 On-demand Optimization Using DRL
We leverage deep reinforcement learning (DRL) to solve the optimization problem in Eq.(2).
Specifically, two types of DRL optimizers are employed to automatically select compression techniques and the corresponding hyperparameters (e.g., compression ratio, number of inserted neurons, and sparsity multiplier) on a layer basis, with the goal of maximizing performance requirements (i.e., and ) while satisfying users’ demands on cost constraints (i.e., and ).
Figure 2 shows the two-phase DRL optimizer designed for the automated DNN compression problem. The first phase leverages two DQN agents for conv and fc layers to select a suitable combination of compression techniques in a layer-wise manner. A DDPG optimizer agent is used in the second phase to select compression hyperparameter from a continuous real-value space for the selected compression techniques at different layers. The two optimization phases are conducted interactively. During the DQN-based optimization phase, the hyperparameters at different compressed layers are fixed as the values estimated by DDPG agent. In the DDPG-based optimization phase, hyperparameter search is performed based on the compression techniques selected by DQN.
DQN and DDPG are two typical DRL methods to handle complex input, action and rewards to learn the controller agent. In the literature of DRL, a policy refers to a specific mapping from state to action . A reward function returns the gain when transitioning to state after taking action in state . Given a state , an action and a policy , the action-value (a.k.a. the function) of the pair (, ) under is defined by the action-value, which defines the expected reward for taking action in state and then following policy thereafter. The DQN agent iteratively improves its -function by taking actions, observing the reward and next state in the environment, and updating the estimate. Once the DQN agent is learned, the optimal policy for each state can be decided by selecting with the highest -value. As for the DDPG agent, it involves an actor-critic framework to combine the idea of DQN and Policy Gradient. Policy Gradient seeks to optimize the policy space directly, that is, an actor network learns the deterministic policy to select action at state . And a value-based critic network is to evaluate potential value of policy estimated by the actor network. We propose to adopt the DRL, i.e., DQN and DDPG, for automated DNN compression in AdaDeep for the following reasons:
Both DQN and DDPG agents enable automatic decision based on the dynamically detected performance and cost. And they are suited for non-linear and non-differentiable optimization.
The DNN to be compressed and the DQN or DDPG agent can be trained jointly end-to-end [bib:acc2017:liu]. Because the DQN/DDPG agent employs a neural network architecture, it can participate in the feed-forward and back-propagation operations of the DNNs to be compressed. The output of DQN and DDPG is the decision signal that controls the selection of compression techniques and hyperparameters.
The DRL-based optimizer provides both capability and flexibility in DNN compression. Within the framework of DQN, we can easily add or delete selective compression techniques by simply adding or removing branch sub-networks (i.e., actions), and figure out the mapping function between the complex optimization problem’s input and results. DDPG can also expand or narrow the value region (action space) without affecting other components of this framework.
To apply DRL to the DNN compression problem, we need to (i) design the reward function to estimate the immediate reward and future reward after taking an action; (ii) define DRL’s state and action in the context of DNN compression; and (iii) design the DRL architecture and training algorithm with tractable computational complexity. We will elaborate on them in 5 and 6. We note that the proposed two-phase DRL optimizer, i.e., the DQN- and DDPG-based optimizer, is still heuristic. Hence it cannot theoretically guarantee a globally optimal solution. However, as we will show in the evaluations, the proposed optimizer outperforms exhaustive or greedy approaches in terms of the performance of the compressed DNNs.
5 DQN Optimizer for Layer-wise Compression Technique Selection
|DQN Terms||Contextual Meanings for DNN compression|
|State s||Input feature size to DNN layer|
|Action s||Selective compression techniques for DNN layer|
|Reward function||Optimization gain & constraints satisfaction|
|value =||Potential optimization gain & constraints satisfaction|
|Training loss function||Difference between the true value and the estimated value of DQN|
5.1 Design of Reward Function
To define the reward function according to the optimization problem Eq.(2), a common approach is to use the Lagrangian Multiplier [bib:SIAM2008:Ito] to convert the constrained formulation into an unconstrained one:
where , , and are the Lagrangian multipliers. It merges the objective (e.g., and ) and the constraint satisfaction (how well the and usages meet budgets). However, maximizing Eq.(6) rather than Eq.(2) will cause ambiguity. For example, the following two situations lead to the same objective value and are thus indistinguishable: (i) poor accuracy and energy performance, with low latency/storage usage; and (ii) high accuracy and energy performance, with high latency/storage usage. Such ambiguity can easily result in a compressed DNN that exceeds the user-specified latency/storage budgets.
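The ambiguity can be made concrete with a small numeric sketch. The additive form of the merged reward, the unit multipliers, and all numbers below are assumptions chosen for illustration; they are not the paper's Eq.(6) or its values.

```python
def lagrangian_reward(acc_gain: float, energy_gain: float,
                      latency_slack: float, storage_slack: float,
                      mu1: float = 1.0, mu2: float = 1.0,
                      lam1: float = 1.0, lam2: float = 1.0) -> float:
    """A single merged score: objective terms plus weighted budget slack.
    A slack is (budget - usage), so a negative slack means the budget
    is exceeded."""
    return (mu1 * acc_gain + mu2 * energy_gain
            + lam1 * latency_slack + lam2 * storage_slack)


# (i) poor accuracy/energy performance, but well within both budgets
case_i = lagrangian_reward(0.25, 0.25, 0.5, 0.5)

# (ii) high accuracy/energy performance, but both budgets are exceeded
case_ii = lagrangian_reward(1.0, 1.0, -0.25, -0.25)

# Both cases collapse to the same scalar, so an agent that only sees this
# number cannot tell a feasible solution from an infeasible one.
indistinguishable = (case_i == case_ii)
```

Keeping the objective gain and the constraint satisfaction as two separate signals avoids this collapse, which is the motivation for the two-stream design that follows.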
To avoid such ambiguity, we define two loss functions for the objective gain and the constraint satisfaction, respectively. We borrow the idea of dueling DQN [bib:wang2016:arxiv] to separate the state-action value function and the state-action advantage function into two parallel streams (see Figure 3). The two streams share conv layers with parameters , which learn the representations of states. They are then followed by two columns that separately generate the state-action objective gain value , with weight parameter , and the state-action constraint satisfaction value , with weight parameter . The two columns are finally aggregated to output a single state-action value . We define a novel value:
The networks and come with their corresponding reward functions and :
After taking an action, we observe the rewards for and for , and use their interaction and balance to guide the selection of compression techniques.
5.2 The DQN Optimizer for Compression Technique Selection and Combination
The proposed layer-wise DQN optimizer for compression technique selection and combination is outlined in Algorithm 1.
Table II explains the contextual definitions of the DQN terms in our compression technique selection problem.
For each layer , we observe a state .
Two agents are employed for two types of DNN layers (i.e., conv and fc layers), which respectively regard the optional compression techniques at conv and fc layers as their action space and .
For each layer/state , we select a random action with probability and select the action with the largest value with probability ( by default). Repeating the above operation layer by layer, we forward the entire DNN to compute a global reward , and regard it as the reward of each state .
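The layer-by-layer selection rule described above is the standard epsilon-greedy strategy, which can be sketched as follows. The function names are hypothetical, and the per-layer value estimates are passed in as plain lists rather than produced by a trained network.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: with probability epsilon pick a random technique,
    otherwise pick the technique with the largest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def select_layerwise(q_per_layer, epsilon, rng=random):
    """Apply the rule to each layer in turn, yielding one technique index
    per layer. The resulting combination is then evaluated once to obtain
    a global reward shared by all layers."""
    return [select_action(q, epsilon, rng) for q in q_per_layer]


# With epsilon = 0 the choice is purely greedy.
greedy_choice = select_layerwise([[0.1, 0.9, 0.3], [0.7, 0.2]], epsilon=0.0)
```

A larger epsilon favors exploring untried technique combinations, while a smaller one exploits the current value estimates.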
To build a DQN with weight parameters , and , we optimize the following loss function iteratively. At iteration , we update .
with the frozen value learned by the target network [bib:van2016:AAAI]:
We adopt the standard DQN training techniques [bib:wang2016:arxiv] and use the update rule of SARSA [bib:ADPRL2009:van] with the assumption that future rewards are discounted by a factor [bib:mnih2013:arxiv] of the default value . And we leverage experience replay to randomly sample from a memory , to increase the efficiency of DQN training.
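A minimal sketch of the two training ingredients mentioned above, experience replay and a discounted SARSA-style target, is given below. The class and function names are hypothetical, and the discount factor is left as a parameter because its default value is not reproduced in this text.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer of past transitions; the oldest entries are
    discarded once the capacity is reached."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, next_action):
        self.buffer.append((state, action, reward, next_state, next_action))

    def sample(self, batch_size, rng=random):
        """Draw transitions uniformly at random, which breaks the
        correlation between consecutive layers/states."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def sarsa_target(reward, gamma, q_next, terminal=False):
    """SARSA-style target: immediate reward plus the discounted value of
    the action actually taken in the next state (as estimated by the
    frozen target network)."""
    return reward if terminal else reward + gamma * q_next
```

The training loss is then the squared difference between this target and the current network's estimate for the sampled state-action pair.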
6 The DDPG Optimizer for Compression Hyperparameter Search
We employ a DDPG optimizer to automatically search the proper compression hyperparameters for layer compression techniques from a continuous action space [bib:eccv2018:he]. Its contextual definitions of state and reward are the same as that in the DQN optimizer (see Table II).
Action Space for Hyperparameter Search. The compression hyperparameters considered in this work include the compression ratio in a weight pruning [bib:arXiv2015:han], the number of inserted neurons by weight factorization [bib:lane2016:IPSN, bib:bhattacharya2016:CD-ROM], and the sparsity multiplier in a convolution decomposition [bib:arXiv2017:Howard, bib:ICLR2017:soravit]. Note that we search hyperparameters from a continuous action space for its effectiveness. To simplify implementation and reduce the training time, we transfer all of the above compression hyperparameters into a “ratio”, whose value space is mapped into , so that we only need one DDPG agent to select action from the same action space for all compressed layers. We defer the transformation details from compression hyperparameters to the ratio to 7.3.2.
Figure 4 shows the architecture of the proposed DDPG optimizer. It follows an actor-critic framework to concurrently learn the actor network and the value-based critic network . The actor gets advice from the critic that helps the actor decide which actions to reinforce during training. Meanwhile, the DDPG makes use of double actor networks and critic networks to improve the stability and efficiency of training [bib:ICLM2014:silver]. The architecture of and is the same as and with frozen parameters. We adopt the same dueling DQN architecture (see Figure 3) to build the critic network and , which separates the reward into objective gain and constraint satisfaction (refer to 5.1). We establish the actor network, which expresses the deterministic state-action function, through several conv and fc layers with parameters .
Algorithm 2 illustrates the DDPG optimizer for compression hyperparameter search.
For each compressed layer , it observes a state and leverages the DDPG’s predict actor network to estimate the deterministic optimal action with truncated normal distribution noise [bib:eccv2018:he]. Repeating the above operations, it forwards the DNN to compute a global reward and , which is broadcast to each layer/state . Then the predict critic network estimates the state value of the current state and of the action estimated by the actor .
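Exploration with truncated normal noise can be sketched as below. The rejection-sampling approach, the function names, and the fallback after `max_tries` are implementation assumptions; the bounds of the ratio are passed explicitly because the value range is not reproduced in this text.

```python
import random


def truncated_normal(mean, sigma, low, high, rng=random, max_tries=1000):
    """Sample from a normal distribution restricted to [low, high] by
    redrawing until the sample falls inside the interval."""
    for _ in range(max_tries):
        x = rng.gauss(mean, sigma)
        if low <= x <= high:
            return x
    # Fallback (an assumption): clip the mean if no sample was accepted.
    return min(max(mean, low), high)


def noisy_action(actor_output, sigma, low, high, rng=random):
    """Perturb the actor's deterministic output so that nearby
    hyperparameter values are also explored during training."""
    return truncated_normal(actor_output, sigma, low, high, rng)
```

Shrinking `sigma` over the course of training gradually shifts the search from exploration to exploitation of the actor's output.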
To train such DDPG optimizer, we optimize the actor network at iteration via the policy gradient function:
We train the critic network by optimizing the loss function from both the random replay memory and the output of the actor and the critic networks:
where is computed by the sum of immediate reward and and the outputs of the frozen actor and critic .
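The critic update follows the usual DDPG pattern, sketched below with plain Python callables standing in for the networks. This is a simplification under stated assumptions: the reward is a single scalar here, whereas the text above combines two reward signals, and all function names are hypothetical.

```python
def critic_target(reward, gamma, next_state, target_actor, target_critic):
    """Target value: immediate reward plus the discounted value that the
    frozen critic assigns to the frozen actor's action in the next state."""
    next_action = target_actor(next_state)
    return reward + gamma * target_critic(next_state, next_action)


def critic_loss(batch, gamma, critic, target_actor, target_critic):
    """Mean squared error between the targets and the current critic's
    estimates over a batch of (state, action, reward, next_state)."""
    errors = [
        (critic_target(r, gamma, s_next, target_actor, target_critic)
         - critic(s, a)) ** 2
        for (s, a, r, s_next) in batch
    ]
    return sum(errors) / len(errors)


# Toy stand-ins for the networks, used only to exercise the functions.
toy_actor = lambda s: 0.5
toy_target_critic = lambda s, a: s + a
toy_critic = lambda s, a: 0.0
toy_loss = critic_loss([(0.0, 0.0, 1.0, 2.0)], 0.5,
                       toy_critic, toy_actor, toy_target_critic)
```

Using frozen copies for the target keeps the regression target from shifting with every gradient step of the networks being trained.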
This section presents evaluations of AdaDeep across various mobile applications and platforms.
7.1 Experiment Setup
We first present the settings for our evaluation.
Implementation. We implement AdaDeep with TensorFlow [url:tensorflow] in Python. The compressed DNNs generated by AdaDeep are then loaded into the target platforms and evaluated as Android projects executed in Java. Specifically, AdaDeep selects an initial DNN architecture from a pool of state-of-the-art DNN models, including LeNet [model:lenet], AlexNet [model:alexnet], ResNet [model:resnet], and VGG [model:vgg], according to the size of samples in . For example, LeNet is selected when the sample size is smaller than , otherwise AlexNet, VGG, or ResNet is chosen. Standard training techniques, such as stochastic gradient descent (SGD) and Adam [bib:arxiv2014:kingma], are used to obtain weights for the DNNs.
Evaluation applications and DNN configurations. To evaluate AdaDeep, we consider six commonly used mobile tasks. Specifically, AdaDeep is evaluated for hand-written digit recognition (: MNIST [data:mnist1998:LeCun]), image classification (: CIFAR-10 [data:cifar], : CIFAR-100 [data:cifar100] and : ImageNet [data:imagenet]), audio sensing application (: UbiSound [bib:sicong2017:IMWUT]), and human activity recognition (: Har [data:Har]). According to the sample size, LeNet [model:lenet] is selected as the initial DNN structure for , and , ResNet-56 is chosen for , while AlexNet [model:alexnet] and VGG-16 [model:vgg] are chosen for .
Mobile platforms for evaluation. We evaluate AdaDeep on twelve commonly used mobile and embedded platforms, including six smartphones, two wearable devices, two development boards and two smart home devices, which are equipped with varied processors, storage, and battery capacity.
7.2 Layer Compression Technique Benchmark
7.2.1 Benchmark Settings
We apply ten mainstream compression techniques from three categories, i.e., weight compression (, , , ), convolution decomposition (, , ), and special architecture layers (, , ), to a 13-layer AlexNet (input, conv, pool, conv, pool, conv, conv, conv, pool, fc, fc, fc and output) [model:alexnet] and compare their performance evaluated on CIFAR-10 dataset () [data:cifar] on a RedMi 3S smartphone. The details of them are as follows.
: insert a fc layer between fc and fc layers using the singular value decomposition (SVD) based weight matrix factorization [bib:lane2016:IPSN]. The neuron number in the inserted layer is set as , where is the number of neurons in fc.
: insert a fc layer between fc and fc using sparse-coding, another matrix factorization method [bib:bhattacharya2016:CD-ROM]. The -basis dictionary used in is set as , where is the neuron number in fc.
: prune fc and fc using the magnitude based weight pruning strategy proposed in [bib:arXiv2015:han]. It removes unimportant weights whose magnitudes are below a threshold (i.e., ).
: replace the fc layers, fc and fc, with a global average pooling layer [bib:lin2013:NIN]. It generates one feature map for each category in the last conv layer. The feature map is then fed into the softmax layer.
: insert a conv layer between conv and pool using SVD based weight factorization [bib:lane2016:IPSN]. The numbers of neurons in the inserted layer by SVD , where is the neuron number in conv.
: decompose conv using convolution kernel sparse decomposition [bib:cvpr2015:ICCV]. It replaces a conv layer using a two-stage decomposition based on principal component analysis.
: decompose conv with depth-wise separable convolution [bib:arXiv2017:Howard]. The width multiplier .
: decompose conv using the sparse random technique [bib:ICLR2017:soravit] and we set the sparsity coefficient . The technique replaces the dense connections of a small number of channels with sparse connections between a large number of channels for convolutions. Different from , it randomly applies dropout across spatial dimensions at conv layers.
: replace conv by a Fire layer [bib:iandola2016:arxiv]. A Fire layer is composed of a conv layer and a conv layer with a mix of and conv filters. It decreases the sizes of input channels and filters.
: replace conv by a micro multi-layer perceptron embedded with multiple small kernel conv layers (Mlpconv) [bib:lin2013:NIN]. It approximates a nonlinear function to enhance the abstraction of conv layers with small (e.g., ) conv filters.
The parameters ( in , and , the depth multiplier in , the sparse random multiplier in ) are empirically optimized by comparing the performance on the layer where the compression technique is applied.
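Two of the weight compression ideas listed above, inserting a narrow layer by matrix factorization and magnitude-based pruning, can be illustrated with a short sketch. The function names, layer sizes, and threshold are hypothetical, and bias terms are ignored for simplicity.

```python
def dense_params(m: int, n: int) -> int:
    """Weights of a fully-connected layer mapping m inputs to n outputs."""
    return m * n


def factorized_params(m: int, n: int, k: int) -> int:
    """Weights after inserting a k-neuron layer in between: one m x k
    matrix followed by one k x n matrix."""
    return k * (m + n)


def magnitude_prune(weights, threshold):
    """Zero out weights whose absolute value is below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row]
            for row in weights]


def sparsity(weights) -> float:
    """Fraction of weights that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


# Factorization pays off whenever k is well below m * n / (m + n).
before = dense_params(4096, 4096)
after = factorized_params(4096, 4096, 256)

pruned = magnitude_prune([[0.5, -0.01], [0.02, -0.8]], threshold=0.1)
```

In this toy setting the inserted layer cuts the weight count by a factor of eight, and pruning removes half of the weights; the accuracy impact of either step has to be measured on the actual task.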
As shown in Figure 5, compression techniques , , and can be applied to the fc layers (fc, fc and fc), while , , , , and are employed to compress the conv layers (conv, conv, conv and conv). For each layer compression technique, we load the compressed DNN on smartphone to process the test data
times, and obtain the mean and variance of the inference performance and resource cost, considering the varied workload of the device at different test times.
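The repeated-measurement protocol can be sketched as follows. The sketch times a placeholder workload on the host machine; `fake_inference` and the repeat count of 5 are assumptions made for illustration and do not reflect the on-device measurement setup.

```python
import statistics
import time


def time_runs(fn, repeats):
    """Call `fn` `repeats` times and return each run's latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples):
    """Return (mean, sample variance) of a list of measurements."""
    return statistics.mean(samples), statistics.variance(samples)


def fake_inference():
    # Hypothetical stand-in for running the compressed DNN on the test data.
    sum(i * i for i in range(10_000))


latencies = time_runs(fake_inference, repeats=5)
mean_ms, var_ms = summarize(latencies)
print(f"latency: mean={mean_ms:.3f} ms, variance={var_ms:.6f}")
```

Reporting the variance alongside the mean makes the effect of background workload on the device visible in the results.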
7.2.2 Performance of Single Compression Technique
To illustrate the performance of different compression techniques, we compare their compressed DNNs in terms of the evaluation metrics (, , , and ), over both the initial layer that they are applied to (see Figure 6) and the entire initial network, i.e., AlexNet (see Figure 7). First, we can see that overall these mainstream compression techniques are quite effective in trimming down the complexity of the initial network, with a certain accuracy loss () or accuracy gain (). For example, the compression techniques and reduce by about , while , , , , and reduce to be less than . Second, as expected, compressing the fc layers (, , , and ) results in a higher reduction, while compressing the conv layers (, , , , or ) leads to a larger reduction. This is due to the common observation in DNNs that the conv layers dominate the computational cost while the fc layers account for most of the storage cost. Third, most of the considered compression techniques affect the only in the order of , so we only consider for the storage cost in the following experiments.
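The observation that conv layers dominate computation while fc layers dominate storage follows from the standard parameter and MAC counts for the two layer types. The sketch below uses assumed layer shapes (they are not AlexNet's) and ignores biases, stride, and padding. It also counts the MACs of a depth-wise separable replacement for the same conv layer.

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def conv_macs(k, c_in, c_out, h_out, w_out):
    """Every output position reuses all of the layer's weights once."""
    return conv_params(k, c_in, c_out) * h_out * w_out


def fc_params(n_in, n_out):
    """Weights in a fully connected layer; its MAC count is the same number."""
    return n_in * n_out


def depthwise_separable_macs(k, c_in, c_out, h_out, w_out):
    """Depth-wise k x k convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in * h_out * w_out
    pointwise = c_in * c_out * h_out * w_out
    return depthwise + pointwise


# Assumed shapes: 3x3 conv, 64 -> 128 channels, 32x32 output; fc 4096 -> 1024.
print("conv params:", conv_params(3, 64, 128))
print("conv MACs:", conv_macs(3, 64, 128, 32, 32))
print("fc params and MACs:", fc_params(4096, 1024))
print("separable conv MACs:", depthwise_separable_macs(3, 64, 128, 32, 32))
```

With these shapes the conv layer holds far fewer weights than the fc layer but needs many more MACs, because its weights are reused at every output position.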
Summary. The performance of different categories of compression techniques on the same DNN varies. Within the same category of compression techniques, the performance also differs. No single compression technique achieves the best , , and . To achieve optimal overall performance on different mobile platforms and applications, it is necessary to combine different compression techniques and tune the compression hyperparameters according to the specific usage demands.
7.2.3 Performance of Blindly Combined Compression Techniques
|Compression technique||Measured accuracy & cost||Compression technique||Measured accuracy & cost|
In this experiment, we compare the performance when blindly combining two compression techniques, tested on a RedMi 3S smartphone (Device 1) using the AlexNet model and CIFAR-10 dataset (). Specifically, one of the four techniques to compress the fc layers fc and fc (i.e., , , or ) is combined with one of the six techniques to compress the conv layer conv (i.e., , , , , or ), leading to a total of 24 combinations. Among them, the , and combinations have been introduced in the prior works named SparseSep [bib:bhattacharya2016:CD-ROM], SqueezeNet [bib:iandola2016:arxiv] and NIN [bib:lin2013:NIN], respectively.
Table III summarizes the results. We use the AlexNet compressed with the technique as a baseline. In particular, it achieves a detection accuracy of and requires a parameter storage of , an energy cost of , and a detection latency of . First, compared with the compressed model using , some combinations of compression techniques, e.g., + and +, reduce more than of , decrease by , and dramatically cut down by more than , while incurring only accuracy loss. However, some combinations perform worse than a single compression technique, e.g., + and + incur over accuracy loss. Second, the combination of + achieves the best balance between system performance and resource cost.
Summary. Some combinations of two compression techniques reduce the resource consumption of DNNs far more than a single technique does. Others may lead to performance degradation. Furthermore, the search space grows exponentially when combining more than two techniques. These results demonstrate the need for an automatic optimizer to select and combine compression techniques.
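The size of this search space can be checked with a short enumeration. The technique labels below are hypothetical placeholders, and `num_assignments` assumes the simplest possible setting, in which one of `num_techniques` options is chosen independently for each of `num_layers` layers.

```python
from itertools import product

# Hypothetical placeholder labels for the four fc and six conv techniques.
fc_techniques = [f"fc_tech_{i}" for i in range(1, 5)]
conv_techniques = [f"conv_tech_{i}" for i in range(1, 7)]

# Every pairing of one fc technique with one conv technique.
pairs = list(product(fc_techniques, conv_techniques))
print("two-technique combinations:", len(pairs))


def num_assignments(num_techniques, num_layers):
    """Number of ways to pick one technique per layer, independently."""
    return num_techniques ** num_layers


for layers in (2, 5, 8):
    print(layers, "layers:", num_assignments(10, layers), "assignments")
```

Even before hyperparameters are considered, the count is multiplied by the number of techniques with every additional layer, which is why exhaustive testing stops being practical beyond pairs.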
7.3 Performance of DRL Optimizer
This section tests the performance of the DDPG and DQN optimizers in hyperparameter search and compression technique selection, and evaluates the two optimizers working in collaboration.
7.3.1 Hyperparameters Learned by DDPG Optimizer
We first describe the compression hyperparameters needed for our benchmark compression techniques, and present how we transform various hyperparameters to a "ratio" so that they can share a single DDPG agent with the same action space . As in 7.2.1, we apply ten mainstream layer compression techniques at different conv and fc layers. Note that only some of them need extra compression hyperparameters. In particular, we consider the following "ratio" hyperparameters, whose candidate values can be normalized as a percentage within the real-value region :
ratio of the number of neurons inserted between and layer to the number of neurons at by technique.
ratio of the number of neurons inserted between and layer to the number of neurons at layer by technique.
ratio of the number of k-basis dictionary inserted between and layer to the number of neurons at layer by .
ratio of the neuron number at layer used to neuron number in original DNN layer .
width multiplier (a percentage) in .
sparsity coefficient (a percentage) in .
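Expressing every hyperparameter as a ratio lets one continuous action drive techniques with very different native parameters. The sketch below shows one possible mapping from a ratio action back to an integer layer size. The bounds `0.01` and `1.0` and the function names are assumptions made for illustration; the exact action range is not reproduced here.

```python
def clip_ratio(action, low=0.01, high=1.0):
    """Clamp a raw action into an assumed valid ratio range [low, high]."""
    return min(max(action, low), high)


def inserted_neurons(action, layer_neurons):
    """Map a ratio action to the neuron count of an inserted layer (at least 1)."""
    return max(1, round(clip_ratio(action) * layer_neurons))


# The same action yields different absolute sizes on layers of different width.
for width in (64, 256, 4096):
    print(width, "->", inserted_neurons(0.25, width))
```

Because the agent only ever emits a ratio, the same policy output remains meaningful on a narrow conv layer and on a wide fc layer.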
|Layer||Hyperparameters of compression technique|
Table IV presents the performance of the DDPG optimizer on hyperparameter search and provides a referential hyperparameter setup in the compressed AlexNet CIFAR-10 (D2) using different layer compression techniques. The first conv layer and final fc layer are not compressed. , and conduct weight factorization at conv and fc layers using an inserted layer with to neurons. prunes the weights of both conv and fc layers by the compression ratio of to . and decompose conv layers by the sparsity multiplier ranging from to .
Summary. The optimal hyperparameters of a single compression technique differ from layer to layer. The search space is large when searching for the optimal hyperparameters of multiple layers. To balance the compression performance and the searching cost, an automated layer-wise hyperparameter search optimizer is necessary.
7.3.2 Performance Comparison of Optimizer
This experiment evaluates the advantage of both the proposed DQN optimizer and DDPG optimizer when searching for the optimal compression combination as well as hyperparameters. To do so, we compress [LeNet, MNIST] and [AlexNet, CIFAR-10] using the DQN optimizer, the two-phase DRL optimizer and two baseline optimization schemes, and evaluate the resulting DNNs on a RedMi 3S smartphone (Device 1). The accuracy loss () and the cost reduction () are normalized over the compressed DNNs using the technique.
Exhaustive optimizer: This scheme exhaustively tests the performance of all combinations of two compression techniques (similar to 7.2.3), and selects the best trade-off on the validation dataset of MNIST, i.e., the one that yields the largest reward value defined by Eq. (12). The selected one is +, i.e., Fixed, in both the cases of LeNet on MNIST and AlexNet on CIFAR-10. The selected combination does not have tunable hyperparameters.
Greedy optimizer: It loads the DNN layer by layer and selects the compression technique that has the largest reward value defined by Eq. (12), in which both and are set to be 0.5. Also, when or violate the budget or , the optimization terminates. The compression hyperparameters of the layer compression techniques are fixed at their default optimal values (similar to 7.2.1).
DQN optimizer: It compresses the DNN using the DQN optimizer as described in 5. We set the scaling coefficients in Eq. (8) to be and , considering that the battery capacity of the RedMi 3S is relatively large and thus the energy consumption is of lower priority. We set and in Eq. (8) because their corresponding constraints (i.e., and ) are equally important. The same as in the Greedy search within this subsection. The compression hyperparameters of the layer-wise compression techniques are also set to their default optimal values (similar to 7.2.1).
DDPG plus DQN optimizer: It further leverages the DDPG optimizer to tune the compression hyperparameters of the DNN compressed by the above DQN optimizer. The setup of scaling coefficients () is the same as that in the DQN optimizer within this subsection.
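The two baseline schemes can be contrasted with a toy sketch. Everything in it is assumed for illustration: the technique names, the per-technique (accuracy change, cost saving) numbers, and the `reward` function, which is only a stand-in weighted sum with both weights set to 0.5 and not the reward of Eq. (12). Budget-based termination is omitted.

```python
from itertools import product

# Made-up effects per technique: (accuracy_change, cost_saving).
FC_TECHNIQUES = {"fc_a": (-0.02, 0.50), "fc_b": (-0.01, 0.30)}
CONV_TECHNIQUES = {"conv_a": (-0.03, 0.40), "conv_b": (0.00, 0.10)}


def reward(acc_change, cost_saving, w_acc=0.5, w_cost=0.5):
    """Stand-in reward: weighted sum of accuracy change and cost saving."""
    return w_acc * acc_change + w_cost * cost_saving


def exhaustive(fc, conv):
    """Score every (fc, conv) pair as a whole and return the best pair."""
    def score(pair):
        f, c = pair
        return reward(fc[f][0] + conv[c][0], fc[f][1] + conv[c][1])
    return max(product(fc, conv), key=score)


def greedy(layers):
    """Visit layers in order and keep the locally best technique for each."""
    chosen = []
    for candidates in layers:
        best = max(candidates, key=lambda name: reward(*candidates[name]))
        chosen.append(best)
    return chosen


print("exhaustive:", exhaustive(FC_TECHNIQUES, CONV_TECHNIQUES))
print("greedy:", greedy([FC_TECHNIQUES, CONV_TECHNIQUES]))
```

In this toy the effects are additive, so both schemes select the same techniques. On a real DNN the effect of compressing one layer depends on how the other layers were compressed, which is where a layer-local greedy choice can diverge from the best overall combination.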
|Optimizer||Compared to the compressed LeNet on MNIST (case 1)||Compared to the compressed AlexNet on CIFAR-10 (case 2)|
|DDPG plus DQN|
Table V summarizes the best performance achieved by the above four optimizers. We can see that the networks generated by the DQN and DDPG optimizers achieve better overall performance in terms of storage , latency , and energy consumption , while incurring negligible accuracy loss ( or ), compared to those generated by the other two baseline optimizers. First, compared with the DNN compressed by , the best DNN from the Greedy optimizer only reduces by and in [LeNet, MNIST] (case 1) and [AlexNet, CIFAR-10] (case 2), respectively. In contrast, the best DNN from the Exhaustive optimizer, i.e., Fixed, reduces by and , respectively. The DQN optimizer cuts down and of , while the DDPG plus DQN optimizer achieves a maximum reduction of and on in the two cases. Second, the networks from the proposed DQN and DDPG plus DQN optimizers are the most effective in reducing the latency () in both cases, while those from the two baseline optimizers may result in an increased in some cases. For example, the DDPG plus DQN optimizer reaches the maximum reduction of by in case 1, and the DQN optimizer sharply reduces by in case 2. The network from the Greedy optimizer increases by in case 1 and the one from the Exhaustive optimizer introduces an extra in case 2. Third, when comparing the energy cost , Fixed is the least energy-efficient (reducing by only over the DNN compressed by ), while those from the DQN, the DDPG plus DQN, and the Greedy optimizers achieve an reduction of to , respectively. Meanwhile, the accuracy loss from the two baseline optimizers ranges from to , while the network from the DDPG plus DQN optimizer achieves the best accuracy (only a degradation in case 1 and even a gain in case 2). Finally, as for the training time, the DDPG and DQN optimizers require a shorter, equal, or longer time compared with the Exhaustive and Greedy optimizers (refer to 7.4.2).
Summary. The proposed DDPG and DQN optimizers attain the best overall performance in both experiments. Both the DDPG plus DQN and the DQN optimizers outperform the other two schemes for DNN compression in terms of the storage size, latency, and energy consumption while incurring negligible accuracy loss in diverse recognition tasks. This is because the run-time performance metrics (, , and ) and the resource cost ( and ) of the whole DNN are systematically included in the reward value and adaptively fed back to the layer-wise compression technique selection or hyperparameter search process.
7.4 Performance of AdaDeep
In this subsection, we test the end-to-end performance of AdaDeep over six tasks and on twelve mobile platforms. Furthermore, to show the flexibility of AdaDeep in adjusting the optimization objectives based on the user demand, we show some examples of the choices on the scaling coefficients in Eq. (8).
7.4.1 AdaDeep over Different Tasks
In this experiment, AdaDeep is evaluated on all the six tasks/datasets using a RedMi 3S smartphone (Device 1). We set the scaling coefficients in Eq. (8) to be the same as those for the DRL optimizer in 5.3.1, i.e., and , and . In addition, we assume a Cache storage budget of MB and a latency budget of ms.
|Task||Compression techniques and hyperparameters||Compared to the DNN compressed by|
Performance. Table VI compares the performance of the best DNNs generated by AdaDeep on the six tasks in terms of accuracy loss, storage , computation (total number of MACs), latency and energy cost , normalized over the DNNs compressed using . Compared with their initial DNNs, DNNs generated by AdaDeep can achieve a reduction of - in , - in , - in , and - in , with a negligible accuracy loss () or even accuracy gain ().
Summary. For different compressed DNNs, tasks, and datasets, the combination of compression techniques found by AdaDeep also differs. Specifically, the combination that achieves the best performance while satisfying the resource constraints is + for Task 1 (on MNIST initialized using LeNet), + for Task 2 (on CIFAR-10 initialized using AlexNet), + for Task 3 (on CIFAR-100 initialized using ResNet-56), ++ for Task 4 (on ImageNet initialized using AlexNet), ++ for Task 5 (on ImageNet initialized using VGG), + for Task 6 (on UbiSound initialized using LeNet), and + for Task 7 (on Har initialized using LeNet), respectively. We can see that although the combination of compression techniques found by AdaDeep cannot always outperform a single compression technique in all metrics, it achieves a better overall performance in terms of the five metrics according to the specific user demands.
7.4.2 AdaDeep over Different Mobile Devices
This experiment evaluates AdaDeep across twelve different mobile devices using LeNet and UbiSound () as the initial DNN and evaluation dataset, respectively. The performance achieved by the initial DNN is as follows: , MB, , ms, and mJ.
Different devices have different resource constraints, which lead to different performance and budget demands and thus require different coefficients in Eq. (8). Specifically, we empirically optimize for different devices to be: , , , and .
|Device||Compression techniques and hyperparameters||Compared to initial DNN|
|1. Xiaomi Redmi 3S||+||0.9 %|
|2. Xiaomi Mi 5S||+|