What is the optimal depth for deep-unfolding architectures at deployment?

by Nancy Nayak, et al.

Recently, many iterative algorithms proposed for various applications, such as compressed sensing and MIMO detection, have been unfolded and presented as deep networks; these networks are shown to produce better results than the algorithms in their iterative forms. However, deep networks are highly sensitive to the chosen hyperparameters. In particular, for a deep-unfolded network, using more layers may lead to redundancy and, hence, excessive computation during deployment. In this work, we consider the problem of determining the optimal number of layers required for such unfolded architectures. We propose a method that treats the networks as experts and measures the relative importance of the expertise provided by the layers using a variant of the popular Hedge algorithm. Based on the importance of the different layers, we determine the optimal number of layers required for deployment. We study the effectiveness of this method by applying it to two recent and popular deep-unfolding architectures, namely DetNet and TISTA-Net.



I Introduction

In recent times, deep learning has succeeded tremendously in solving complex data-driven problems [hinton2012speech, krizhevsky2012image, devlin2014fast, lecun1998document, raj2018backpropogating]. In particular, deep learning approaches to detection problems have attracted attention [oshea2017deep, farsad2017detection, dorner2018deep, mohammadkarimi2019deep, jin2020parallel]. There has also been significant focus on developing deep neural network architectures by unfolding existing iterative algorithms [hershey2014deep]. In such a network, each layer represents an iteration of the algorithm, whose optimal parameters are learned by the network. For example, popular iterative algorithms such as the Iterative Shrinkage and Thresholding Algorithm (ISTA) and Approximate Message Passing (AMP) have been unfolded into neural network-based architectures [gregor2010learning, borgerding2016onsager].
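As a concrete reference point, the classical ISTA iteration that such unfolded networks are built from can be sketched in a few lines of NumPy. The problem size, step size, and regularization weight below are illustrative only; in an unfolded network, these per-iteration quantities become learnable per-layer parameters.

```python
import numpy as np

def soft_threshold(v, theta):
    # Element-wise shrinkage (proximal operator of the l1 norm).
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def ista(y, H, lam=0.01, n_iters=500):
    # Plain ISTA for min_x 0.5*||y - Hx||^2 + lam*||x||_1.
    # Unfolding turns each of these iterations into one network layer.
    L = np.linalg.norm(H, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(H.shape[1])
    for _ in range(n_iters):
        x = soft_threshold(x + H.T @ (y - H @ x) / L, lam / L)
    return x
```

Each loop iteration performs one gradient step on the data-fidelity term followed by one shrinkage step; an unfolded architecture replaces the fixed quantities `1/L` and `lam/L` with trained values at every layer.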

One of the well-known model-driven deep learning networks for MIMO detection is DetNet [NeevsamueldeepMIMO]. Here, the authors unfold the iterations of a projected gradient descent into a deep neural network. They also show that the results given by DetNet are competitive when compared with existing MIMO detectors. In the case of sparse signal recovery, the TISTA-Net and OAMP-Net architectures unfold the ISTA and OAMP algorithms, respectively [ito2019trainable, he2018model].

The success of deep neural networks in detection problems can be attributed to the feasibility of processing huge matrices. Therefore, we cannot discount the large memory required for storage and the huge computational complexity of inference, especially when we deploy an instance of the trained network on low-power mobile devices or Internet of Things (IoT) devices. For such applications, deploying compact neural networks by reducing the number of layers is increasingly relevant. If we determine the optimal layers during training, it is possible to reduce the memory footprint without compromising the performance of the network. To the best of our knowledge, no prior work in the open literature focuses on optimizing such model-driven deep-unfolding networks.

In this work, we determine the optimal number of layers in deep-unfolding architectures, thereby reducing the memory and computational complexities with a negligible effect on performance during deployment. To achieve this, we first propose to employ a variant of the popular Hedge algorithm [freund1997decision], namely the dHedge algorithm [raj2017aggregating], during training to determine the relative importance of the layers. We then use this to remove the redundant layers and, hence, compress the network for deployment. The number of layers has always been a subjective choice in most applications. With our proposed method, one can train a deep-unfolding network with a large number of layers to begin with and allow dHedge to determine the required number of layers; the user need not choose the number of layers by trial and error. Though we demonstrate the utility of our results for DetNet and TISTA-Net, one can use this method to remove redundant layers in any deep-unfolding architecture. In addition, one can also apply other popular compression techniques, such as pruning and quantization of weights, to achieve a further reduction in memory consumption.

Throughout the work, $\mathbb{E}[\cdot]$ denotes the expectation operator, $\|\cdot\|$ denotes the $\ell_2$ norm, and $(\cdot)^T$ denotes the transpose.

II System Model

For both MIMO detection and sparse signal recovery, we use the following system model:

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n},$$

where $\mathbf{y} \in \mathbb{R}^{M}$ is the received vector, $\mathbf{H} \in \mathbb{R}^{M \times N}$ is the channel matrix, and $\mathbf{n}$ is the additive white Gaussian noise (AWGN) with variance $\sigma^2$. For MIMO detection architectures like DetNet, $\mathbf{x}$ is a vector drawn from a finite constellation, such as BPSK. On the other hand, for signal recovery architectures like TISTA-Net, $\mathbf{x}$ is a sparse vector. Though we assume real-valued vectors, the work can be trivially extended to complex vectors as well.
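The system model above can be simulated in a few lines of NumPy; the dimensions and noise level below are illustrative choices, not values from the paper.

```python
import numpy as np

def awgn_channel(H, x, sigma2, rng):
    # y = H x + n, with n ~ N(0, sigma2 * I): the model shared by the
    # MIMO-detection (DetNet) and sparse-recovery (TISTA-Net) settings.
    n = rng.normal(scale=np.sqrt(sigma2), size=H.shape[0])
    return H @ x + n

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 4)) / np.sqrt(8)   # illustrative 8x4 real channel
x = rng.choice([-1.0, 1.0], size=4)        # BPSK symbols, as in DetNet
y = awgn_channel(H, x, sigma2=0.01, rng=rng)
```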

II-A DetNet

DetNet is composed of $L$ layers, and each layer takes $\mathbf{H}^T\mathbf{y}$ and $\mathbf{H}^T\mathbf{H}$ as inputs. The functionality of each layer, the parameters to be optimized, and the loss function used are provided in [NeevsamueldeepMIMO]; these details are repeated below for ease of reading. The architecture of the $k$th layer, where $k$ varies from $1$ to $L$, is

$$\mathbf{z}_k = \rho\left(\mathbf{W}_{1k}\left[\mathbf{H}^T\mathbf{y};\, \hat{\mathbf{x}}_k;\, \mathbf{H}^T\mathbf{H}\hat{\mathbf{x}}_k;\, \mathbf{v}_k\right] + \mathbf{b}_{1k}\right),$$
$$\hat{\mathbf{x}}_{k+1} = \psi_{t_k}\left(\mathbf{W}_{2k}\mathbf{z}_k + \mathbf{b}_{2k}\right),$$
$$\mathbf{v}_{k+1} = \mathbf{W}_{3k}\mathbf{z}_k + \mathbf{b}_{3k},$$

where $\rho(\cdot)$ is the rectified linear unit and $\psi_t(\cdot)$ is a piece-wise linear soft sign operator. The parameters that are optimized during the learning phase are

$$\theta = \left\{\mathbf{W}_{1k}, \mathbf{b}_{1k}, \mathbf{W}_{2k}, \mathbf{b}_{2k}, \mathbf{W}_{3k}, \mathbf{b}_{3k}, t_k\right\}_{k=1}^{L}.$$

To account for the problems of vanishing gradients, saturation of activation functions, etc., the loss to be minimized is defined as

$$l\left(\mathbf{x}; \hat{\mathbf{x}}\right) = \sum_{k=1}^{L} \log(k)\, \frac{\|\mathbf{x} - \hat{\mathbf{x}}_k\|^2}{\|\mathbf{x} - \tilde{\mathbf{x}}\|^2},$$

where $\tilde{\mathbf{x}} = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}$ is the standard decorrelation decoder. Note that the outputs from all the layers are employed in computing the loss function. The final estimate is defined as $\hat{\mathbf{x}} = \operatorname{sign}(\hat{\mathbf{x}}_L)$.

II-B TISTA-Net

Each layer in TISTA-Net has the following architecture [ito2019trainable]:

$$\mathbf{r}_t = \mathbf{s}_t + \gamma_t \mathbf{W}\left(\mathbf{y} - \mathbf{H}\mathbf{s}_t\right), \qquad \mathbf{s}_{t+1} = \eta\left(\mathbf{r}_t; \tau_t^2\right),$$

where $\mathbf{W}$ is the pseudo inverse of $\mathbf{H}$, $\eta(\cdot)$ is an MMSE-based shrinkage function, and $\mathbf{x}$ is the unknown vector to be recovered. The scalar variables $\{\gamma_t\}$ are the variables optimized in the training phase. TISTA-Net is trained using incremental training: in the $t$th round of the incremental training, the loss function that is minimized is $\mathbb{E}\|\mathbf{s}_t - \mathbf{x}\|^2$. In other words, only the first $t$ layers are trained in the $t$th round. The final estimate of the output is $\mathbf{s}_T$.
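A single TISTA-style layer can be sketched as below. Note that this is a simplification: we substitute a plain soft-threshold shrinkage for TISTA's MMSE-based estimator, so the nonlinearity and the fixed scalars `gamma` and `theta` are illustrative stand-ins, not the trained TISTA quantities.

```python
import numpy as np

def tista_like_layer(s, y, H, W, gamma, theta):
    # One TISTA-style layer: a linear estimation step through W (the
    # pseudo inverse of H in TISTA) scaled by the trainable scalar gamma,
    # followed by a shrinkage nonlinearity. TISTA proper uses an
    # MMSE-based shrinkage; soft thresholding stands in for it here.
    r = s + gamma * (W @ (y - H @ s))
    return np.sign(r) * np.maximum(np.abs(r) - theta, 0.0)

rng = np.random.default_rng(1)
H = rng.normal(size=(50, 20)) / np.sqrt(50)
x_true = np.zeros(20)
x_true[[2, 7, 15]] = [1.0, -2.0, 1.5]
y = H @ x_true                     # noiseless, for illustration
W = np.linalg.pinv(H)
s1 = tista_like_layer(np.zeros(20), y, H, W, gamma=1.0, theta=0.01)
```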

III Architecture with dHedge

To determine the optimal layers, we first have to identify the relative importance of the prediction outputs from each layer at the end of every training epoch using the loss function. We observe that this is similar to the classical problem of deciding which expert offers the best output [freund1997decision]. In a deep-unfolding architecture, we consider each network constructed using the first $k$ layers, for $k = 1, \dots, L$, as an expert in predicting $\mathbf{x}$; therefore, there are $L$ experts in total. Each of these experts incurs a loss $l_{k,t}$ at training epoch $t$. The smaller the loss, the better the prediction. Since different layers train at different paces throughout the training phase, the expertise provided by each of these networks changes over the training epochs. In other words, these experts are non-stationary in nature. In the case of DetNet, we can observe that the authors of [NeevsamueldeepMIMO] weight the loss from the $k$th layer with $\log(k)$. (In the case of TISTA-Net, no such weighing ratios are discussed in [ito2019trainable].) Note that there is no guarantee that these fixed weighing ratios are optimal. (We use the term weighing ratios to differentiate them from the weights of the neural network.) Two questions follow naturally. The first is whether we can dynamically update the weighing ratios at the end of each training epoch based on the loss function. The second is, once these weighing ratios are learned, how do we determine the optimal number of layers? We answer both questions in the subsequent subsections.

III-A Determining weighing ratios by dHedge

To determine the correct weighing ratios, we need a suitable weight-update algorithm, one that initializes the weighing ratios of the experts and updates them at every training epoch. The algorithm should update the ratios based on the feedback obtained on the experts' performance, i.e., penalize an expert for poor performance and reward it otherwise. In our case, this can be measured by means of the loss function $l_{k,t}$ at each layer $k$, for every training epoch $t$. The Hedge algorithm is a well-known algorithm for stationary experts [freund1997decision]. In our specific problem, we define the $k$th expert as the network up to and including the $k$th layer and use the output at the $k$th layer as the $k$th expert's prediction of $\mathbf{x}$. To account for the non-stationary nature of these experts, we use discounted Hedge (dHedge), a modified version of the Hedge algorithm that can handle the evolution of experts over time [raj2017aggregating].

In the dHedge algorithm, we assign equal weighing ratios $w_{k,1} = 1/L$, for $k = 1, \dots, L$, to the experts at the first epoch. After a round of prediction by all experts, if the $k$th expert incurs a loss of $l_{k,t}$ after the $t$th time-step, we update the weighing ratios as

$$w_{k,t+1} = w_{k,t}^{\,\delta}\, \beta^{\, l_{k,t}},$$

where $\beta \in (0,1)$ and $\delta \in (0,1]$ are the hedge parameter and the discount factor, respectively. These are problem-dependent tunable parameters.
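As an illustration, one dHedge step can be written in a few lines of NumPy. The multiplicative form below (discount the past ratio, penalize by the hedge parameter raised to the incurred loss, then normalize) is a sketch of the discounted update described in the text; the example losses and parameter values are invented for demonstration.

```python
import numpy as np

def dhedge_update(w, losses, beta=0.9, delta=0.95):
    # One dHedge step: discount the past ratios via the exponent delta,
    # penalize each expert multiplicatively by beta**loss, normalize.
    w = (w ** delta) * (beta ** np.asarray(losses, dtype=float))
    return w / w.sum()

# Three experts (sub-networks truncated at different depths); the
# second consistently incurs the least loss, so its ratio grows.
w = np.full(3, 1.0 / 3.0)
for _ in range(20):
    w = dhedge_update(w, losses=[0.8, 0.1, 0.5])
```

After a few epochs, the expert with the smallest loss accumulates the largest weighing ratio, which is exactly the signal used later to pick the deployment depth.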

In the case of DetNet, let $w_{k,t}$ be the weighing ratio of the $k$th layer at the $t$th training epoch, with the initialization $w_{k,1} = 1/L$. Our aim is to minimize the following loss function:

$$l\left(\mathbf{x}; \hat{\mathbf{x}}\right) = \sum_{k=1}^{L} w_{k,t}\, \frac{\|\mathbf{x} - \hat{\mathbf{x}}_k\|^2}{\|\mathbf{x} - \tilde{\mathbf{x}}\|^2}.$$

After fixing the weighing ratios for the epoch, we perform an iteration of back-propagation to train the parameters of DetNet. The weighing ratio of each layer at the $t$th training epoch is then updated as

$$w_{k,t+1} = w_{k,t}^{\,\delta}\, \beta^{\, l_{k,t}}.$$

After the update, we normalize the weighing ratios before the next training epoch. Note that the major change we have made to the original DetNet architecture is replacing the fixed weighing ratio $\log(k)$ with $w_{k,t}$, which we update after every training epoch using the dHedge rule. Similarly, for TISTA-Net, we minimize the following loss function:

$$l\left(\mathbf{x}; \mathbf{s}\right) = \sum_{k=1}^{L} w_{k,t}\, \|\mathbf{x} - \mathbf{s}_k\|^2.$$
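The dHedge-weighted training loss can be sketched as follows; for brevity this omits DetNet's per-layer normalization by the decorrelator error, so it is an illustrative simplification of the weighted sum over layer outputs.

```python
import numpy as np

def weighted_layer_loss(x_true, layer_outputs, w):
    # dHedge-weighted training loss: each layer's squared error is scaled
    # by its current weighing ratio (replacing DetNet's fixed log(k)
    # weights). DetNet's decorrelator normalization is omitted here.
    per_layer = np.array([np.sum((x_true - xk) ** 2) for xk in layer_outputs])
    return float(w @ per_layer)

x_true = np.array([1.0, -1.0])
outputs = [np.array([0.0, 0.0]), np.array([0.9, -0.9])]  # layer 2 is closer
w = np.array([0.5, 0.5])
loss = weighted_layer_loss(x_true, outputs, w)
```

As the weighing ratios concentrate on the better experts, the training signal automatically emphasizes the layers that matter for deployment.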


III-B Determining optimal number of layers

Although we use the dHedge algorithm to account for the non-stationary nature of the experts during training, we note that the weights of the neural network converge at the end of training, thereby resulting in stationary experts. Therefore, we assume that the weighing ratios converge to final values $w_{k,T}$ after training. These learned weighing ratios provide an average measure of the relative importance of the network up to the $k$th layer in predicting the output. Since the expert with the least average loss has the maximum weighing ratio at the end of the training, the average loss will be minimum if we predict the output using this network during deployment. Also, we can eliminate any layer beyond it without any loss in performance, i.e., we can eliminate all layers $k > k^*$, where

$$k^* = \arg\max_{k} w_{k,T}.$$

In case $k^* = L$, i.e., the final layer gives the least loss, we cannot eliminate any layer without some loss in performance. However, we can still remove some layers by observing how the weighing ratios differ from one another. For example, if we determine an $L'$ such that the weighing ratios for $k \geq L'$ are nearly equal to $w_{L,T}$ (i.e., within some tolerable error $\epsilon$), we can still afford to eliminate the last $L - L'$ layers and suffer only a negligible loss in performance. In other words, if the weighing ratios beyond the $L'$th layer are all only $\epsilon$ away from $w_{L,T}$, then the final networks are nearly equal experts in predicting $\mathbf{x}$. Hence, one can truncate the network to the first $L'$ layers with negligible loss in performance. However, determining the right trade-off between $\epsilon$ and the loss in performance depends on the evolution of the weighing ratios over the layers. The entire heuristic algorithm is presented in Algorithm 1. In the subsequent section, we show for DetNet and TISTA-Net that we can significantly reduce the number of layers required without suffering any loss in performance.

1:Input: System parameters: $\mathbf{H}$, training batches of $(\mathbf{x}, \mathbf{y})$, network parameters
2:Input: dHedge parameters: $\beta$, $\delta$, tolerable error $\epsilon$
3:Initialize weighing ratios $w_{k,1} = 1/L$ for $k = 1, \dots, L$
4:for $t = 1$ to $T$ do
5:     Forward propagation
6:     Calculate loss $l_{k,t}$ for every layer $k$
7:     Back propagation
8:     Weight update: $w_{k,t+1} = w_{k,t}^{\,\delta}\,\beta^{\,l_{k,t}}$, followed by normalization
9:end for
10:Output: Final weighing ratios $w_{k,T}$
11:if $k^* = \arg\max_k w_{k,T} < L$ then
12:     Remove the final $L - k^*$ layers
13:end if
Algorithm 1 Training DetNet with dHedge
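The depth-selection step at the end of training can be sketched in a few lines of Python. This sketch folds the tolerance-based truncation (used when the final layer is the best expert) into the same function: it keeps layers up to the earliest layer whose final weighing ratio is within the tolerable error of the maximum. The example ratios are invented for illustration.

```python
import numpy as np

def choose_depth(w, eps=0.0):
    # Layer-selection heuristic: k* is the layer with the maximum final
    # weighing ratio; with tolerable error eps we further truncate to
    # the earliest layer whose ratio is within eps of the maximum.
    k_star = int(np.argmax(w))
    close = np.nonzero(w >= w[k_star] - eps)[0]
    return int(close[0]) + 1          # 1-indexed number of layers to keep

# Illustrative converged ratios: the last two layers add almost nothing.
w = np.array([0.01, 0.05, 0.20, 0.24, 0.249, 0.251])
depth = choose_depth(w, eps=0.005)
```

With `eps = 0` this reduces to keeping the first $k^*$ layers; a larger `eps` trades a small performance loss for further savings in depth.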

IV Numerical Results

In this section, we provide numerical results for DetNet and TISTA-Net modified with the dHedge algorithm. Each element of $\mathbf{H}$ is sampled independently from a Gaussian distribution, and we train both networks using the Adam optimizer [kingma2014adam]. For DetNet, we draw $\mathbf{x}$ from a BPSK constellation and generate mini-batches for each SNR value; the learning rate of the optimizer decays exponentially from its initial value. The values of $\beta$ and $\delta$ for the dHedge algorithm are set using the hyperparameter tuning tool Hyperopt [bergstra2013making]. In the case of TISTA-Net, we sample each component of the sparse signal $\mathbf{x}$ from a Bernoulli-Gaussian distribution and again generate mini-batches for training; the learning rate and the dHedge parameters $\beta$ and $\delta$ are tuned in the same manner.

Fig. 1: Weighing ratios vs Layers for DetNet
Fig. 2: BER vs SNR for DetNet

In Fig. 1, we plot the weighing ratios learned by dHedge over the layers, along with the normalized logarithmic weights originally proposed in [NeevsamueldeepMIMO] for comparison. We can observe that, in all three cases, the weighing ratios increase monotonically over the layers. However, the increase is less pronounced in the final layers: the difference in the weighing ratio between the 120th and the 55th layer is very small, which implies that the loss measured at the 55th layer is only marginally greater than the loss measured at the 120th. We can verify this from the BER curves in Fig. 2. For all these networks, we can therefore truncate well before the final layer and eliminate the remaining layers; the loss in performance is minimal when compared with the original DetNet, and we also obtain savings in memory usage.

Fig. 3: Weighing ratios vs Layers for TISTA-Net

We observe that the dHedge algorithm provides a method to determine the optimal number of layers for deep-unfolding architectures without trial and error. To demonstrate this, we plot the weighing ratios learned for a 12-, 18-, and 25-layer TISTA-Net in Fig. 3. For all three networks, we can observe that the maximum weighing ratio occurs at the 10th layer. Hence, we can safely eliminate the layers beyond the 10th and still obtain optimal performance in terms of NMSE and memory usage.

V Conclusion

To reduce the memory and computational complexity of deep-unfolding architectures, we proposed a method that determines the optimal number of layers required. Weighing ratios were assigned to each layer and then updated by the dHedge algorithm after every training epoch based on the loss incurred. Based on the evolution of the weighing ratios over the layers, we developed a heuristic algorithm to determine the optimal number of layers required for deployment. The working of the algorithm was verified by simulation for two deep-unfolding architectures, namely DetNet and TISTA-Net. We believe that the proposed method of choosing the number of layers will be highly useful for any deep-unfolding architecture, since it gives a principled approach to reducing the depth of the network without loss of performance.