1 Introduction
The task of recovering multivariate continuous data from noisy indirect measurements arises in many applications in signal and image processing and beyond. Here, we consider the typical situation where an unknown ground truth signal $u \in \mathbb{R}^n$ shall be reconstructed from observations $f \in \mathbb{R}^m$. In many cases of interest, the ground truth and the observations are connected via a linear model $f = Au + \eta$. Therein, $A \in \mathbb{R}^{m \times n}$ is a known measurement matrix and $\eta \in \mathbb{R}^m$ is a random noise vector. In recent years, variational approaches in the form of
(1.1) $\min_{u \in \mathbb{R}^n} \; F(Au) + G(u)$
have been widely used for such reconstruction tasks. The function $F$ is usually chosen in accordance with the noise distribution and such that discrepancies between $Au$ and $f$ are penalized appropriately. Moreover, the function $G$ incorporates prior knowledge about the ground truth and penalizes solutions which do not conform with these requirements.
In the following, we show how a certain class of algorithms designed to solve (1.1), namely proximal splitting methods, can be unrolled in terms of a neural network and discuss the resemblance of the resulting architecture to residual networks. From this comparison, we derive a new architecture combining the advantages of both approaches. To that end, we briefly review proximal splitting methods in Subsection 1.1, followed by an overview of related work in Subsection 1.2. Further, Section 2 contains the derivation of our architecture and Section 3 describes its application to speech dequantization. In Section 4, we report the results of our numerical experiments, before we finally conclude our work in Section 5.
1.1 Proximal splitting methods
There exists a variety of optimization methods to solve (1.1), and which one to choose depends particularly on the structure of the involved functions $F$ and $G$. If both are differentiable, then one could, for example, use any form of gradient descent. However, this is often not the case, and one has to resort to alternative methods such as proximal splitting methods (see, e.g., Parikh and Boyd (2013) and Bauschke and Combettes (2017)). Among these, the algorithm proposed by Chambolle and Pock (2011) has gained a lot of attention and turned out to perform well on numerous practical tasks. The algorithm is designed for convex optimization problems (1.1) with $F$ and $G$ being proper, convex and lower semicontinuous functions. One iteration of the Chambolle–Pock algorithm consists of one dual and one primal update step, followed by a primal extrapolation step, i.e.,
(1.2)
$y^{k+1} = \operatorname{prox}_{\sigma F^*}(y^k + \sigma A \bar{u}^k)$
$u^{k+1} = \operatorname{prox}_{\tau G}(u^k - \tau A^T y^{k+1})$
$\bar{u}^{k+1} = u^{k+1} + \theta (u^{k+1} - u^k)$
The step sizes $\sigma, \tau > 0$ need to be chosen subject to the operator norm of $A$, i.e., such that $\sigma \tau \|A\|^2 \le 1$, and $\theta \in [0, 1]$ is an extrapolation factor. Furthermore, the mappings in the first two rows of (1.2) are the proximal operators of $\sigma F^*$ and $\tau G$, where $F^*$ is the convex conjugate function of $F$ (cf. Bauschke and Combettes (2017)). Many convex functions of interest have the property that the respective proximal mappings have simple closed-form representations, which is crucial for the applicability of the Chambolle–Pock algorithm.
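The update scheme (1.2) can be illustrated in a few lines of NumPy. The following is a minimal sketch, not an implementation from this paper: it assumes the concrete choices $F(v) = \|v - f\|_1$ and $G(u) = \lambda \|u\|_1$, for which $\operatorname{prox}_{\sigma F^*}$ reduces to clipping a shifted argument and $\operatorname{prox}_{\tau G}$ to soft thresholding; all function names are our own.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1 (componentwise shrinkage)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def chambolle_pock(A, f, lam, n_iter=500, theta=1.0):
    # Chambolle-Pock iterations (1.2) for the illustrative choices
    # F(v) = ||v - f||_1 and G(u) = lam * ||u||_1; here
    # F*(y) = <f, y> + indicator of {||y||_inf <= 1}, so prox_{sigma F*}
    # is clipping after a shift by sigma * f, and prox_{tau G} is
    # soft thresholding.
    m, n = A.shape
    L = np.linalg.norm(A, 2)          # operator norm of A
    sigma = tau = 0.99 / L            # step sizes with sigma * tau * ||A||^2 <= 1
    u = np.zeros(n)
    u_bar = np.zeros(n)
    y = np.zeros(m)
    for _ in range(n_iter):
        # dual update: projection onto the unit inf-norm ball of a shifted point
        y = np.clip(y + sigma * (A @ u_bar) - sigma * f, -1.0, 1.0)
        # primal update: soft thresholding
        u_next = soft_threshold(u - tau * (A.T @ y), tau * lam)
        # primal extrapolation
        u_bar = u_next + theta * (u_next - u)
        u = u_next
    return u
```

For instance, with $A$ the identity and $\lambda < 1$, the unique minimizer of $\|u - f\|_1 + \lambda \|u\|_1$ is $u = f$, which the iteration approaches.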
1.2 Related work
The idea to unroll iterative methods for optimization problems and treat them as deep neural networks appeared previously in the literature. For example, Gregor and LeCun (2010)
interpreted the iterative shrinkage-thresholding algorithm (ISTA) from sparse coding as a recurrent neural network and trained the involved linear operator as well as the shrinkage function to obtain learned ISTA (LISTA). More recently,
Wang et al. (2016) proposed to unroll iterations (1.2) of the Chambolle–Pock algorithm in the context of proximal deep structured networks. Similarly, Riegler et al. (2016a) and Riegler et al. (2016b) introduced deep primal-dual networks for depth super-resolution, where unrolled iterations are stacked on top of a deep convolutional network. In both cases, linear operators as well as step sizes and extrapolation factors are learned. In contrast, the learned primal-dual algorithm proposed by
Adler and Öktem (2018) replaces the proximal operators in (1.2) by convolutional neural networks, and keeps the linear operator fixed.
1.3 Our contribution
In this work, we propose to unroll a fixed number of Chambolle–Pock iterations (1.2) in a deep neural network. Among the above-mentioned approaches, ours is most closely related to (Wang et al., 2016). However, there are some important differences. First, we interpret the unrolled algorithm as a special kind of residual network (He et al., 2016) rather than as a recurrent neural network. Based on this interpretation, we show that a minimal adjustment of the unrolled network architecture is enough to obtain a plain residual network, which motivates the term primal-dual residual network. Second, we consider the common special case of (1.1) where $F$ is a norm and $G$ is the indicator function of a norm ball, i.e., where both functions encode a convex optimization problem with hard constraints. We deduce that in this case, the proximal operators of $F^*$ and $G$ are projections and further, that the step sizes in (1.2) can be learned implicitly through the involved linear operators. Third, we take up the application of speech dequantization. This application was addressed previously by Brauer et al. (2016), who also proposed to perform a limited number of Chambolle–Pock iterations, but in the classical sense with an a priori fixed linear operator.
2 Network architecture
2.1 Primal-dual networks
We first take the point of view that iterations (1.2) can be considered building blocks of feedforward neural networks. This conception makes sense in two respects: First, the proximal operators $\operatorname{prox}_{\sigma F^*}$ and $\operatorname{prox}_{\tau G}$ can be interpreted as activation functions. Second, the linear operator $A$ and its transpose $A^T$ can be considered manufactured weights which could just as well be learned. To that end, we regard $y^{k+1}$ and $u^{k+1}$ as activations, and $W_k$ as well as $\widetilde{W}_k$ as associated weights. This notion leads us to primal-dual blocks in the form of

(2.1)
$y^{k+1} = \operatorname{prox}_{\sigma F^*}(y^k + W_k \bar{u}^k)$
$u^{k+1} = \operatorname{prox}_{\tau G}(u^k + \widetilde{W}_k y^{k+1})$
$\bar{u}^{k+1} = u^{k+1} + \theta (u^{k+1} - u^k)$
Hence, performing $K$ iterations (1.2) corresponds to stacking just as many primal-dual blocks with $W_k = \sigma A$ and $\widetilde{W}_k = -\tau A^T$ for all $k$. As opposed to this, our approach is to learn all weights and allow for differing weights in different blocks. In particular, we do not require that $\widetilde{W}_k = -\frac{\tau}{\sigma} W_k^T$. For simplicity, we restrict ourselves to the case $\theta = 0$ in the following. In other words, we omit the extrapolation in (2.1) and take $\bar{u}^k = u^k$ throughout. Figure 1 illustrates a primal-dual block without extrapolation. Therein, $u^k$ and $y^k$ can be considered as incoming activations from the $k$-th block, while $u^{k+1}$ and $y^{k+1}$ are the activations computed in the $(k+1)$-st block.
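For concreteness, a primal-dual block (2.1) without extrapolation, and the stacking of several such blocks, can be sketched as follows. This is a schematic NumPy sketch with our own function names; the proximal operators are passed in as callables.

```python
import numpy as np

def pd_block(u, y, W, W_tilde, prox_dual, prox_primal):
    # one primal-dual block (2.1) without extrapolation (theta = 0):
    # a dual update followed by a primal update
    y_next = prox_dual(y + W @ u)
    u_next = prox_primal(u + W_tilde @ y_next)
    return u_next, y_next

def pd_network(u0, y0, weights, prox_dual, prox_primal):
    # stacking K blocks; classical Chambolle-Pock corresponds to the shared
    # choice W_k = sigma * A and W_tilde_k = -tau * A.T in every block,
    # whereas here each block may carry its own trainable pair (W_k, W_tilde_k)
    u, y = u0, y0
    for W, W_tilde in weights:
        u, y = pd_block(u, y, W, W_tilde, prox_dual, prox_primal)
    return u, y
```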
2.2 Residual networks
Residual networks were originally proposed by He et al. (2016) in the context of image recognition. A generic residual block has the specific form
(2.2) $\mathcal{H}(x) = \varrho\big(W_2\, \varrho(W_1 x) + x\big)$
where $x$ is the input vector, $\varrho$ denotes the ReLU activation function and $\mathcal{H}$ is an underlying mapping to be fit by a number of stacked layers. Recall that the only difference between (2.2) and a plain two-layer block is the addition of $x$ in the end. This structure is motivated by the degradation problem which typically arises when additional layers are added to a deep network. In theory, the optimal training accuracy of the deeper network cannot be smaller than the training accuracy of the shallower network due to the fact that the learner can in principle choose the additional layers to be identity maps, while all other weights can be copied from the shallower network. However, in practice, training accuracy often degrades rapidly when the depth of the network is increased. Therefore, He et al. (2016) argue that a residual block (2.2) can serve as a preconditioner in the sense that an identity map can be obtained simply by setting $W_1 = 0$ and $W_2 = 0$. Indeed, experiments on various datasets have shown that stacking residual blocks (2.2) instead of plain layers leads to increasing training accuracy in practice.
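The generic residual block (2.2) can be sketched as follows; the ReLU nonlinearity and the two weight matrices follow the text, while the function names are our own.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    # generic residual block (2.2): a two-layer transformation plus the
    # identity shortcut, with the final activation applied after the addition
    return relu(W2 @ relu(W1 @ x) + x)
```

Setting both weight matrices to zero reduces the block to relu(x), i.e., the identity on nonnegative activations, which illustrates the preconditioning argument above.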
2.3 Primal-dual residual networks
It turns out that the network structure illustrated in Figure 1 manifests two different kinds of such residual blocks. Namely, the activations $y^{k+1}$ and $u^{k+1}$ are computed according to

(2.3) $y^{k+1} = \operatorname{prox}_{\sigma F^*}\big(y^k + W_k \operatorname{prox}_{\tau G}(\widetilde{W}_{k-1} y^k + u^{k-1})\big)$

(2.4) $u^{k+1} = \operatorname{prox}_{\tau G}\big(u^k + \widetilde{W}_k \operatorname{prox}_{\sigma F^*}(W_k u^k + y^k)\big)$
respectively. The dual forward map (2.3) has the form of a residual block where a previous primal activation $u^{k-1}$ plays the role of a bias unit in the anterior layer. Vice versa, the primal forward map (2.4) can be seen as a residual block where a previous dual activation $y^k$ acts as bias unit. Thus, in some sense, a primal-dual network encompasses both one residual network with respect to the dual variables and one with respect to the primal variables, where both networks overlap and interchange information through bias units. Moreover, if we replace the dual activation $y^k$ by a proper bias unit $b_k$ in (2.4), then the resulting mapping has the form of a residual block (2.2), except that the ReLU activation functions are replaced by proximal mappings and that there is only one bias unit. However, the proposed replacement breaks the structure of overlapping networks. Formally, we obtain a primal-dual residual block

(2.5) $u^{k+1} = \operatorname{prox}_{\tau G}\big(u^k + \widetilde{W}_k \operatorname{prox}_{\sigma F^*}(W_k u^k + b_k)\big)$
where $z^{k+1} = \operatorname{prox}_{\sigma F^*}(W_k u^k + b_k)$ denotes the activation in the intermediate layer. Figure 2 illustrates a generic primal-dual residual block. Compared to a classical primal-dual block (see Figure 1), the main difference is that the skip connections between each two dual activations $y^k$ and $y^{k+1}$ are cancelled. Instead, the new intermediate activation $z^{k+1}$ receives a trainable bias unit $b_k$ as input.
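A primal-dual residual block (2.5) then propagates only the primal activation. In a schematic NumPy sketch (function names ours, proximal operators passed as callables):

```python
import numpy as np

def pdrn_block(u, W, W_tilde, b, prox_dual, prox_primal):
    # primal-dual residual block (2.5): the incoming dual activation is
    # replaced by a trainable bias b, so only the primal variable u is
    # propagated from block to block
    z = prox_dual(W @ u + b)              # intermediate activation
    return prox_primal(u + W_tilde @ z)   # primal activation with skip path
```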
3 Application to speech dequantization
In the context of speech dequantization, Brauer et al. (2016) consider the problem
(3.1) $\min_{c} \; \|c\|_1 \quad \text{s.t.} \quad \|Dc - f\|_\infty \le \tfrac{\delta}{2}$
where $f$ is the uniformly quantized version of a speech signal $u$ which shall be reconstructed. The scalar $\delta > 0$ is the length of the quantization intervals and the constraint in (3.1) takes into account that $u$ originates in an $\infty$-norm ball with radius $\tfrac{\delta}{2}$ around the quantized signal. Accordingly, $Dc$ approximates $u$ in terms of the columns of a dictionary $D$ which is assumed to have full rank, and the norm penalty $\|c\|_1$ incorporates the assumption that the sought representation is sparse. Finally, the reconstructed signal is obtained as $u = Dc$. Since the dictionary is by assumption invertible, one can perform the change of variables $c = D^{-1} u$ and solve the problem
(3.2) $\min_{u} \; \|D^{-1} u\|_1 \quad \text{s.t.} \quad \|u - f\|_\infty \le \tfrac{\delta}{2}$
instead. By means of $F = \|\cdot\|_1$ and $A = D^{-1}$, the problem (3.2) can be written in the form (1.1) with both functions being proper, convex and lower semicontinuous. Note that $G$ is the indicator function of the feasible set of (3.2), i.e., $G(u) = 0$ if $\|u - f\|_\infty \le \tfrac{\delta}{2}$ and $G(u) = +\infty$ in any other case.
3.1 Related network architectures
On the one hand, the conjugate function of $F = \|\cdot\|_1$ also involves an indicator function, namely $F^* = \iota_{B_\infty(0,1)}$ with $B_\infty(0,1) = \{y : \|y\|_\infty \le 1\}$. On the other hand, the proximal operator of an indicator function of a convex set is simply the projection onto this set. Hence, also taking the constant offset $f$ in the feasible set of (3.2) into account, we obtain the associated primal-dual block
(3.3)
$y^{k+1} = P_{B_\infty(0,1)}(y^k + W_k u^k)$
$u^{k+1} = P_{B_\infty(f,\delta/2)}(u^k + \widetilde{W}_k y^{k+1})$
according to (2.1), where $P_{B_\infty(f,\delta/2)}$ is the projection onto the $\infty$-norm ball with center $f$ and radius $\tfrac{\delta}{2}$. In the same way, the primal-dual residual block
(3.4) $u^{k+1} = P_{B_\infty(f,\delta/2)}\big(u^k + \widetilde{W}_k P_{B_\infty(0,1)}(W_k u^k + b_k)\big)$
according to (2.5) can be formed. Note that the addition of $f$ is due to the term $-f$ which occurs in the constraint of (3.2) and causes a constant linear offset in the respective proximal operator, namely, $\operatorname{prox}_{\tau G}(v) = f + P_{B_\infty(0,\delta/2)}(v - f)$. Moreover, the projection operator is a generalization of the well-known HardTanh activation function (Collobert, 2004), i.e., it holds that

(3.5) $P_{B_\infty(0,1)}(x) = \operatorname{HardTanh}(x)$

For arbitrary center $c$ and radius $r > 0$, the projection onto the ball $B_\infty(c, r)$ can be obtained by componentwise application of a scaled HardTanh function, namely $P_{B_\infty(c,r)}(x) = c + r \operatorname{HardTanh}\big(\tfrac{x - c}{r}\big)$.
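The relation between the projection and the HardTanh function can be made concrete in a small sketch (function names ours):

```python
import numpy as np

def hardtanh(x):
    # HardTanh activation (Collobert, 2004): identity on [-1, 1],
    # saturating at -1 and 1 outside
    return np.clip(x, -1.0, 1.0)

def project_linf_ball(x, center, radius):
    # projection onto the inf-norm ball B(center, radius), written as a
    # componentwise scaled HardTanh as in (3.5)
    return center + radius * hardtanh((x - center) / radius)
```

Here project_linf_ball(x, 0.0, 1.0) coincides with hardtanh(x), and project_linf_ball(v, f, delta / 2) is the primal projection appearing in (3.3) and (3.4).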
4 Numerical Experiments
We investigate the impact of primal-dual (residual) networks using a dataset of 720 sentences from the IEEE corpus provided in Loizou (2013), consisting of male speech sampled at 16 kHz. From these signals, 70% (i.e., 504 signals) are used as training set, 15% (i.e., 108 signals) are reserved as development set, and another 15% serve as test set which we use for a comparison with the results of Brauer et al. (2016). In order to reconstruct ground truth signals $u$ from quantized measurements $f$, our goal is to train primal-dual (residual) networks consisting of $K$ stacked blocks (3.3) and (3.4), respectively. As initial activations, we use $u^0 = f$ and $y^0 = 0$. However, as discussed above, the actual input $f$ of the network appears in the form of a linear offset during the computation of each primal activation. Our estimate for $u$ is finally the output $u^K(f; \Theta)$ of the last block, where $\Theta$ denotes the aggregate of weights in the network.

4.1 Learning setup
To make the networks principally usable for real-time applications, the original and quantized signals are truncated using a rectangular window function. To that end, a window size $w$ as well as a shift length $s$ are a priori fixed. Then, the signals are split into $w$-dimensional subsignals

(4.1) $u_{(j)} = \big(u_{(j-1)s+1}, \ldots, u_{(j-1)s+w}\big), \quad j = 1, \ldots, \big\lfloor \tfrac{N-w}{s} \big\rfloor + 1,$

where $N$ is the dimension of the respective full signals. With these fixed choices, we truncate all 504 signals in the training set. This way, we end up with a new training set of windowed pairs $(f_i, u_i)$.
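The windowing (4.1) can be sketched as follows; the concrete window size and shift are application choices not fixed here, and the function name is ours. Trailing samples that do not fill a complete window are simply dropped in this sketch.

```python
import numpy as np

def split_into_windows(signal, window_size, shift):
    # rectangular windowing as in (4.1): slide a window of fixed size over
    # the signal with a fixed shift; incomplete trailing windows are dropped
    starts = range(0, len(signal) - window_size + 1, shift)
    return np.stack([signal[s:s + window_size] for s in starts])
```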
To train the network, we use a weighted sum of the mean squared error (MSE) and a regularization term as loss function, i.e., we minimize

(4.2) $L(\Theta) = \frac{1}{M} \sum_{i=1}^{M} \big\| u^K(f_i; \Theta) - u_i \big\|_2^2 + \lambda R(\Theta),$

where $M$ denotes the number of windowed training pairs $(f_i, u_i)$.
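The loss (4.2) can be sketched schematically in NumPy; the actual experiments use TensorFlow, and the exact reduction and weighting there may differ.

```python
import numpy as np

def loss(predictions, targets, theta_weights, lam):
    # training objective (4.2): mean squared error over the windowed
    # training pairs plus a weighted l2 penalty on all network weights
    mse = np.mean((predictions - targets) ** 2)
    reg = sum(np.sum(w ** 2) for w in theta_weights)
    return mse + lam * reg
```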
Subsequently, we use $\ell_2$ regularization, i.e., $R(\Theta) = \|\Theta\|_2^2$, and choose the hyperparameter $\lambda$ with respect to the development set. All experiments were conducted on an NVIDIA GeForce GTX 1080 Ti GPU using TensorFlow 1.5.0. To minimize (4.2), we used Adam (Kingma and Ba, 2014) with a fixed learning rate and all other parameters set to standard values.

4.2 Comparison of network architectures
Table 1. Minimum MSE values for each trained network and depths $K \in \{1, 2, 5, 10, 15\}$.

network  data    reg.  K=1   K=2   K=5   K=10  K=15
PDN      train.  w/o   2.08  1.28  0.48  0.28  1.51
PDRN     train.  w/o   2.05  1.28  0.53  0.40  0.57
PDN      train.  w/    3.29  2.96  2.06  1.62  1.69
PDRN     train.  w/    2.09  1.57  1.06  1.01  1.06
PDN      dev.    w/o   2.43  2.36  2.31  2.37  2.48
PDRN     dev.    w/o   2.38  2.32  2.00  1.92  2.03
PDN      dev.    w/    3.46  3.20  2.49  2.07  2.13
PDRN     dev.    w/    2.40  2.24  1.84  1.81  1.88
In a first set of experiments, we compared the impact of primal-dual networks (PDNs) and primal-dual residual networks (PDRNs) applied to the task of speech dequantization. In either case, we tried different depths $K \in \{1, 2, 5, 10, 15\}$ and trained each network over 1000 epochs using a batch size of 128. Our results are illustrated in Figures 3 and 4, and in Table 1 we report the minimum MSE values associated with each trained network. All in all, our results indicate that PDRNs feature superior performance compared to classical PDNs. On the one hand, Table 1 shows that PDRNs yield consistently lower MSE values than PDNs on the development set (17% lower without regularization and 13% lower with regularization, if one considers the respective best values among all depths). On the other hand, it can be seen clearly from Figures 3 and 4 that, especially in case of deeper networks, PDRNs exhibit a more stable behavior in the sense that training and development errors decay more smoothly over time. In this context, the displayed results for the larger depths are particularly notable. In addition, we could observe similar behavior for further depths.
4.3 Comparison with primal-dual algorithm
In further experiments conducted on the test set, we compared PDNs and PDRNs to the Chambolle–Pock algorithm (CP). To that end, we computed the respective MSEs as above and, in addition, the associated signal-to-noise ratios (SNRs). On the side of the neural networks, we selected both one PDN and one PDRN on the basis of the respective errors on the development set (cf. the respective minimal development errors in Table 1). Further, we applied CP to problem (3.1) with $D$ being a discrete cosine transform matrix, exactly as described in Brauer et al. (2016). In the interest of a fair comparison, we stopped CP after 10, 25 and 50 iterations and calculated the respective errors at those points. Additionally, we computed the MSEs as well as the SNRs of the quantized signals before reconstruction (QU). Our results are reported in Table 2.
A comparison of the test results in Table 2 reveals that PDNs and PDRNs offer significantly lower MSE values and higher SNR values than the Chambolle–Pock algorithm, seemingly independent of the performed number of iterations. The considered PDRN improves the MSE of the quantized signal by 58.6% and its SNR by 25.5%, whereas the largest number of Chambolle–Pock iterations yields improvements of only 32.9% and 11.9%, respectively. Further, it can be seen that PDRNs outperform PDNs once more, since the respective relative improvements yielded with the considered PDN amount to 52.3% and 21.3%.
Table 2. Test-set MSE and SNR for the quantized signals (QU), the selected networks, and CP after 10, 25 and 50 iterations.

       QU     PDN    PDRN   CP(10)  CP(25)  CP(50)
MSE    4.47   2.13   1.85   3.05    3.00    3.00
SNR    15.06  18.27  18.90  16.71   16.81   16.85
5 Conclusion
In this paper, we have proposed the architecture of primal-dual residual networks, which combines features of unrolled proximal splitting methods and residual networks. Further, we have drawn a comparison between the proposed network architecture and classical primal-dual networks, which can be considered straightforwardly unrolled proximal splitting methods. Our results have shown that, applied to speech dequantization, primal-dual residual networks can outperform their classical counterparts significantly. Moreover, we have seen that both architectures can beat a truncated proximal splitting scheme on the same task.
However, it should not go unmentioned that some questions have been left open for future research. First, we have so far only compared primal-dual (residual) networks without extrapolation. As extrapolation is an important aspect concerning the convergence of proximal splitting methods, it may also play a role in view of the behavior of the related network architectures. Second, using the example of speech dequantization, we have shown how a particular convex optimization problem with hard constraints can be unrolled in terms of a neural network by performing an appropriate change of variables and considering the constraint as an indicator function of a related norm ball. It seems likely that this approach can be generalized to a larger class of optimization problems. We plan to address these as well as other related questions in the near future.
References
 Adler and Öktem [2018] Jonas Adler and Ozan Öktem. Learned Primal-Dual Reconstruction. IEEE Transactions on Medical Imaging, 2018. doi: 10.1109/TMI.2018.2799231.
 Bauschke and Combettes [2017] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2nd edition, 2017.
 Brauer et al. [2016] Christoph Brauer, Timo Gerkmann, and Dirk Lorenz. Sparse reconstruction of quantized speech signals. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5940–5944. IEEE, 2016.
 Chambolle and Pock [2011] Antonin Chambolle and Thomas Pock. A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.

 Collobert [2004] Ronan Collobert. Large Scale Machine Learning. PhD thesis, Université Paris VI, 2004.
 Gregor and LeCun [2010] Karol Gregor and Yann LeCun. Learning Fast Approximations of Sparse Coding. In Proceedings of the 27th International Conference on Machine Learning, pages 399–406. Omnipress, 2010.

He et al. [2016]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep Residual Learning for Image Recognition.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, pages 770–778, 2016.  Kingma and Ba [2014] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Loizou [2013] Philipos C. Loizou. Speech Enhancement: Theory and Practice. CRC Press, 2013.
 Parikh and Boyd [2013] Neal Parikh and Stephen Boyd. Proximal Algorithms. Foundations and Trends in Optimization. Now Publishers, 2013. ISBN 9781601987167.
 Riegler et al. [2016a] Gernot Riegler, David Ferstl, Matthias Rüther, and Horst Bischof. A Deep Primal-Dual Network for Guided Depth Super-Resolution. In Proceedings of the British Machine Vision Conference (BMVC), pages 7.1–7.14, 2016a.
 Riegler et al. [2016b] Gernot Riegler, Matthias Rüther, and Horst Bischof. ATGV-Net: Accurate Depth Super-Resolution. In Computer Vision – ECCV 2016, pages 268–284. Springer International Publishing, 2016b.
 Wang et al. [2016] Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Proximal Deep Structured Models. In 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.