Primal-dual residual networks

by Christoph Brauer, et al.

In this work, we propose a deep neural network architecture motivated by primal-dual splitting methods from convex optimization. We show theoretically that there exists a close relation between the derived architecture and residual networks, and further investigate this connection in numerical experiments. Moreover, we demonstrate how our approach can be used to unroll optimization algorithms for certain problems with hard constraints. Using the example of speech dequantization, we show that our method can outperform classical splitting methods when both are applied to the same task.




1 Introduction

The task of recovering multivariate continuous data from noisy indirect measurements arises in many applications in signal and image processing and beyond. Here, we consider the typical situation where an unknown ground truth signal $x^\dagger \in \mathbb{R}^n$ shall be reconstructed from observations $b \in \mathbb{R}^m$. In many cases of interest, the ground truth and the observations are connected via a linear model $b = K x^\dagger + \eta$. Therein, $K \in \mathbb{R}^{m \times n}$ is a known measurement matrix and $\eta \in \mathbb{R}^m$ is a random noise vector. In recent years, variational approaches in the form of

$$\min_{x \in \mathbb{R}^n} \; F(Kx) + G(x) \tag{1.1}$$

have been widely used for such reconstruction tasks. The function $F$ is usually chosen in accordance with the noise distribution and such that discrepancies between $Kx$ and $b$ are penalized appropriately. Moreover, the function $G$ incorporates prior knowledge about the ground truth and penalizes solutions which do not conform with these requirements.

In the following, we show how a certain class of algorithms designed to solve (1.1), namely proximal splitting methods, can be unrolled in terms of a neural network and discuss the resemblance of the resulting architecture to residual networks. From this comparison, we derive a new architecture combining the advantages of both approaches. To that end, we briefly review proximal splitting methods in Subsection 1.1, followed by an overview of related work in Subsection 1.2. Further, Section 2 contains the derivation of our architecture and Section 3 describes its application to speech dequantization. In Section 4, we report the results of our numerical experiments, before we finally conclude our work in Section 5.

1.1 Proximal splitting methods

There exists a variety of optimization methods to solve (1.1), and which one to choose depends particularly on the structure of the involved functions $F$ and $G$. If both are differentiable, then one could, for example, use any form of gradient descent. However, this is often not the case and one has to resort to alternative methods such as proximal splitting methods (see, e.g., Parikh and Boyd (2013) and Bauschke and Combettes (2017)). Among these, the algorithm proposed by Chambolle and Pock (2011) has gained a lot of attention and turned out to perform well on numerous practical tasks. The algorithm is designed for convex optimization problems (1.1) with $F$ and $G$ being proper, convex and lower-semicontinuous functions. One iteration of the Chambolle-Pock algorithm consists of one dual and one primal update step, followed by a primal extrapolation step, i.e.,

$$\begin{aligned}
y^{k+1} &= \operatorname{prox}_{\sigma F^*}\big(y^k + \sigma K \bar{x}^k\big), \\
x^{k+1} &= \operatorname{prox}_{\tau G}\big(x^k - \tau K^\top y^{k+1}\big), \\
\bar{x}^{k+1} &= x^{k+1} + \theta\,\big(x^{k+1} - x^k\big).
\end{aligned} \tag{1.2}$$

The step sizes $\sigma, \tau > 0$ need to be chosen subject to the operator norm of $K$ and $\theta \in [0, 1]$ is an extrapolation factor. Furthermore, the mappings in the first two rows of (1.2) are the proximal operators of $\sigma F^*$ and $\tau G$, where $F^*$ is the convex conjugate function of $F$ (cf. Bauschke and Combettes (2017)). Many convex functions of interest have the property that the respective proximal mappings have simple closed-form representations, which is crucial for the applicability of the Chambolle-Pock algorithm.
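As a minimal sketch of the dual update, primal update and extrapolation described above, the following NumPy code runs Chambolle-Pock iterations on a toy problem. The problem instance (a diagonal operator `K`, the $\ell_1$-norm as first function and the indicator of a box as second function), the function names and the step sizes are assumptions chosen for illustration and are not taken from the paper. For this choice, both proximal operators reduce to componentwise clipping.

```python
import numpy as np

def chambolle_pock(K, prox_dual, prox_primal, x0, y0, sigma, tau, theta=1.0, iters=50):
    """Run `iters` Chambolle-Pock iterations and return the final (x, y)."""
    x, x_bar, y = x0.copy(), x0.copy(), y0.copy()
    for _ in range(iters):
        y = prox_dual(y + sigma * (K @ x_bar))    # dual update
        x_new = prox_primal(x - tau * (K.T @ y))  # primal update
        x_bar = x_new + theta * (x_new - x)       # primal extrapolation
        x = x_new
    return x, y

# Toy problem (assumed): minimize ||K x||_1 subject to x in the box [c - r, c + r].
K = np.diag([2.0, 1.0])
c, r = np.array([2.0, 0.0]), 1.0

# The conjugate of the l1-norm is the indicator of the l_inf unit ball,
# so its proximal operator is a projection (clipping to [-1, 1]).
prox_dual = lambda v: np.clip(v, -1.0, 1.0)
# The proximal operator of the box indicator is the projection onto the box.
prox_primal = lambda v: np.clip(v, c - r, c + r)

sigma = tau = 0.4  # sigma * tau * ||K||^2 = 0.64 < 1
x, y = chambolle_pock(K, prox_dual, prox_primal, c.copy(), np.zeros(2), sigma, tau)
print(x)
```

On this toy instance the iterates settle at the point $(1, 0)$, which is the point of the box with the smallest weighted $\ell_1$-norm. Note that neither projection depends on the step size, so the step sizes only enter through the scaled operators `sigma * K` and `tau * K.T`.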

1.2 Related work

The idea to unroll iterative methods for optimization problems and treat them as deep neural networks appeared previously in the literature. For example, Gregor and LeCun (2010) interpreted the iterative shrinkage-thresholding algorithm (ISTA) from sparse coding as a recurrent neural network and trained the involved linear operator as well as the shrinkage function to obtain learned ISTA (LISTA). More recently, Wang et al. (2016) proposed to unroll iterations (1.2) of the Chambolle-Pock algorithm in the context of proximal deep structured networks. Similarly, Riegler et al. (2016a) and Riegler et al. (2016b) introduced deep primal-dual networks for depth super-resolution, where unrolled iterations are stacked on top of a deep convolutional network. In both cases, linear operators as well as step sizes and extrapolation factors are learned. In contrast, the learned primal-dual algorithm proposed by Adler and Öktem (2018) replaces the proximal operators in (1.2) by convolutional neural networks, and keeps the linear operator fixed.

1.3 Our contribution

In this work, we propose to unroll a fixed number of Chambolle-Pock iterations (1.2) in a deep neural network. Among the above-mentioned approaches, ours is most closely related to that of Wang et al. (2016). However, there are some important differences. First, we interpret the unrolled algorithm as a special kind of residual network (He et al., 2016) rather than as a recurrent neural network. Based on this interpretation, we show that a minimal adjustment of the unrolled network architecture is enough to obtain a plain residual network, which motivates the term primal-dual residual network. Second, we consider the common special case of (1.1) where $F$ is a norm and $G$ is the indicator function of a norm ball, i.e., where both functions encode a convex optimization problem with hard constraints. We deduce that in this case, the proximal operators of $F^*$ and $G$ are projections and further, that the step sizes in (1.2) can be learned implicitly through the involved linear operators. Third, we take up the application of speech dequantization. This application was addressed previously by Brauer et al. (2016), who also proposed to perform a limited number of Chambolle-Pock iterations, but in the classical sense with an a priori fixed linear operator.

2 Network architecture

2.1 Primal-dual networks

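We first take the point of view that iterations (1.2) can be considered building blocks of feedforward neural networks. This conception makes sense in two respects: First, the proximal operators $\operatorname{prox}_{\sigma F^*}$ and $\operatorname{prox}_{\tau G}$ can be interpreted as activation functions. Second, the linear operator $K$ and its transpose $K^\top$ can be considered manufactured weights which could just as well be learned. To that end, we regard $x^k$ and $y^k$ as activations, and $\sigma K$ as well as $\tau K^\top$ as associated weights, which we denote by $W^k$ and $V^k$. This notion leads us to primal-dual blocks in the form of

$$\begin{aligned}
y^{k+1} &= \operatorname{prox}_{\sigma F^*}\big(y^k + W^k \bar{x}^k\big), \\
x^{k+1} &= \operatorname{prox}_{\tau G}\big(x^k - V^k y^{k+1}\big), \\
\bar{x}^{k+1} &= x^{k+1} + \theta\,\big(x^{k+1} - x^k\big).
\end{aligned} \tag{2.1}$$

Hence, performing $N$ iterations (1.2) corresponds to stacking just as many primal-dual blocks with $W^k = \sigma K$ and $V^k = \tau K^\top$. As opposed to this, our approach is to learn all weights and allow for differing weights in different blocks. In particular, we do not require that $V^k$ is a multiple of $(W^k)^\top$. For simplicity, we restrict ourselves to the case $\theta = 0$ in the following. In other words, we omit the extrapolation in (2.1) and take $\bar{x}^k = x^k$ throughout. Figure 1 illustrates a primal-dual block without extrapolation. Therein, $x^k$ and $y^k$ can be considered as incoming activations from the $k$-th block, while $x^{k+1}$ and $y^{k+1}$ are the activations computed in the $(k+1)$-st block.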

Figure 1: Primal-dual block without extrapolation.
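The block structure of Figure 1 can be sketched in a few lines of NumPy. The names `primal_dual_block`, `primal_dual_network`, `W` and `V` are hypothetical and chosen for this illustration; the proximal operators are passed in as plain functions, and clipping is used as an assumed example of a proximal mapping. The example fixes the weights of every block to a scaled operator and its scaled transpose, which is the case in which stacking blocks coincides with running the algorithm without extrapolation.

```python
import numpy as np

def primal_dual_block(x, y, W, V, prox_dual, prox_primal):
    """One primal-dual block without extrapolation."""
    y_next = prox_dual(y + W @ x)         # dual activation
    x_next = prox_primal(x - V @ y_next)  # primal activation
    return x_next, y_next

def primal_dual_network(x0, y0, weights, prox_dual, prox_primal):
    """Stack one block per (W, V) pair; the pairs may differ between blocks."""
    x, y = x0, y0
    for W, V in weights:
        x, y = primal_dual_block(x, y, W, V, prox_dual, prox_primal)
    return x, y

# Assumed example: four blocks sharing the weights sigma * K and tau * K^T.
rng = np.random.default_rng(0)
K = rng.standard_normal((3, 3))
sigma = tau = 0.1
clip_unit = lambda v: np.clip(v, -1.0, 1.0)  # projection used as activation

weights = [(sigma * K, tau * K.T)] * 4
x, y = primal_dual_network(np.ones(3), np.zeros(3), weights, clip_unit, clip_unit)
```

In a learned network, each `(W, V)` pair would be a separate trainable parameter rather than a fixed multiple of `K` and its transpose.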

2.2 Residual networks

Residual networks were originally proposed by He et al. (2016) in the context of image recognition. A generic residual block has the specific form

$$\mathcal{H}(x) = \rho\big(x + W_2\,\rho(W_1 x + b_1) + b_2\big), \tag{2.2}$$

where $x$ is the input vector, $\rho$ is the ReLU activation function and $\mathcal{H}$ is an underlying mapping to be fit by a number of stacked layers. Recall that the only difference between (2.2) and a plain two-layer block is the addition of $x$ in the end. This structure is motivated by the degradation problem which typically arises when additional layers are added to a deep network. In theory, the optimal training accuracy of the deeper network cannot be lower than the training accuracy of the shallower network, because the learner can in principle choose the additional layer to be an identity map, while all other weights can be copied from the shallower network. However, in practice, training accuracy often degrades rapidly when the depth of the network is increased. Therefore, He et al. (2016) argue that a residual block (2.2) can serve as a preconditioner in the sense that an identity map can be obtained simply by setting the weights and biases of the stacked layers to zero. Indeed, experiments on various datasets have shown that stacking residual blocks (2.2) instead of plain layers leads to increasing training accuracy in practice.

2.3 Primal-dual residual networks

It turns out that the network structure illustrated in Figure 1 manifests two different kinds of such residual blocks. Namely, the activations are computed according to

$$y^{k+1} = \operatorname{prox}_{\sigma F^*}\Big(y^k + W^k \operatorname{prox}_{\tau G}\big(x^{k-1} - V^{k-1} y^k\big)\Big) \tag{2.3}$$

and

$$x^{k+1} = \operatorname{prox}_{\tau G}\Big(x^k - V^k \operatorname{prox}_{\sigma F^*}\big(y^k + W^k x^k\big)\Big), \tag{2.4}$$

respectively, where $W^k$ and $V^k$ denote the weights of the $k$-th primal-dual block (2.1). The dual forward map (2.3) has the form of a residual block where a previous primal activation plays the role of a bias unit in the anterior layer. Vice versa, the primal forward map (2.4) can be seen as a residual block where a previous dual activation acts as bias unit. Thus, in some sense, a primal-dual network encompasses both one residual network with respect to the dual variables and one with respect to the primal variables, where both networks overlap and interchange information through bias units. Moreover, if we replace the dual activation $y^k$ by a proper bias unit $b^k$ in (2.4), then the resulting mapping has the form of a residual block (2.2), except that the ReLU activation functions are replaced by proximal mappings and that there is only one bias unit. However, the proposed replacement breaks the structure of overlapping networks. Formally, we obtain a primal-dual residual block

$$\begin{aligned}
z^{k} &= \operatorname{prox}_{\sigma F^*}\big(W^k x^k + b^k\big), \\
x^{k+1} &= \operatorname{prox}_{\tau G}\big(x^k - V^k z^{k}\big),
\end{aligned} \tag{2.5}$$

where $z^k$ denotes the activation in the intermediate layer. Figure 2 illustrates a generic primal-dual residual block. Compared to a classical primal-dual block (see Figure 1), the main difference is that the skip connections between each two dual activations $y^k$ and $y^{k+1}$ are cancelled. Instead, the new intermediate activation $z^k$ receives a trainable bias unit $b^k$ as input.

Figure 2: Primal-dual residual block without extrapolation.
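A primal-dual residual block as in Figure 2 can be sketched as follows. The function name, the argument names `W`, `V` and `b`, and the use of clipping as proximal mappings are assumptions made for this illustration. The example also shows the identity behaviour that motivates residual blocks: if the outer weight matrix is zero and the input already lies in the feasible set, the block returns its input unchanged.

```python
import numpy as np

def primal_dual_residual_block(x, W, V, b, prox_dual, prox_primal):
    """One primal-dual residual block without extrapolation."""
    z = prox_dual(W @ x + b)         # intermediate activation with trainable bias b
    return prox_primal(x - V @ z)    # skip connection on the primal variable

# Assumed activations: projections onto l_inf balls of radius 1 and 0.5.
radius = 0.5
prox_dual = lambda v: np.clip(v, -1.0, 1.0)
prox_primal = lambda v: np.clip(v, -radius, radius)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
b = rng.standard_normal(3)
x_in = np.array([0.3, -0.2, 0.1])  # already inside the ball of radius 0.5

# With V = 0 the block reduces to the projection of its input, here the identity.
x_identity = primal_dual_residual_block(x_in, W, np.zeros((3, 3)), b, prox_dual, prox_primal)

# With a nonzero V the output changes but stays inside the ball.
x_out = primal_dual_residual_block(x_in, W, 0.1 * W.T, b, prox_dual, prox_primal)
```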

3 Application to speech dequantization

In the context of speech dequantization, Brauer et al. (2016) consider the problem

$$\min_{x} \; \|x\|_1 \quad \text{s.t.} \quad \|Dx - q\|_\infty \le \tfrac{\Delta}{2}, \tag{3.1}$$

where $q$ is the uniformly quantized version of a speech signal $s$ which shall be reconstructed. The scalar $\Delta$ is the length of the quantization intervals and the constraint in (3.1) takes into account that $s$ originates in an $\ell_\infty$-norm ball with radius $\Delta/2$ around the quantized signal. Accordingly, $Dx$ approximates $s$ in terms of the columns of a dictionary $D$ which is assumed to have full rank, and the $\ell_1$-norm penalty incorporates the assumption that the sought representation is sparse. Finally, the reconstructed signal is obtained as $Dx^*$, where $x^*$ solves (3.1). Since the dictionary is by assumption invertible, one can perform the change of variables $z = Dx - q$ and solve the problem

$$\min_{z} \; \|D^{-1}(z + q)\|_1 \quad \text{s.t.} \quad \|z\|_\infty \le \tfrac{\Delta}{2} \tag{3.2}$$

instead. By means of $K = D^{-1}$, $F(y) = \|y + D^{-1}q\|_1$ and $G(z) = \iota_{\{\|\cdot\|_\infty \le \Delta/2\}}(z)$, the problem (3.2) can be written in the form (1.1) with both functions being proper, convex and lower-semicontinuous. Note that $G$ is the indicator function of the feasible set of (3.2), i.e., $G(z) = 0$ if $\|z\|_\infty \le \Delta/2$ and $G(z) = \infty$ in any other case.

3.1 Related network architectures

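On the one hand, the conjugate function of $F$ also involves an indicator function, namely $F^*(y) = \iota_{\{\|\cdot\|_\infty \le 1\}}(y) - \langle D^{-1}q, y\rangle$. On the other hand, the proximal operator of an indicator function of a convex set is simply the projection onto this set. Hence, also taking the linear part in $F^*$ into account, we obtain the associated primal-dual block

$$\begin{aligned}
y^{k+1} &= P_{1}\big(y^k + W^k (x^k + q)\big), \\
x^{k+1} &= P_{\Delta/2}\big(x^k - V^k y^{k+1}\big)
\end{aligned} \tag{3.3}$$

according to (2.1), where $P_r$ is the projection onto the $\ell_\infty$-norm ball with center $0$ and radius $r$. In the same way, the primal-dual residual block

$$\begin{aligned}
z^{k} &= P_{1}\big(W^k (x^k + q) + b^k\big), \\
x^{k+1} &= P_{\Delta/2}\big(x^k - V^k z^{k}\big)
\end{aligned} \tag{3.4}$$

according to (2.5) can be formed. Note that the addition of $q$ is due to the term $-\langle D^{-1}q, y\rangle$ which occurs in $F^*$ and causes a constant linear offset in the respective proximal operator, namely, $\operatorname{prox}_{\sigma F^*}(v) = P_1(v + \sigma D^{-1}q)$. Moreover, the projection operator $P_r$ is a generalization of the well-known activation function hardtanh (Collobert, 2004), i.e., for $r = 1$ it holds that

$$\big(P_1(v)\big)_i = \operatorname{hardtanh}(v_i) = \max\{-1, \min\{1, v_i\}\}.$$

For $r \neq 1$, the projection onto the ball with radius $r$ can be obtained by componentwise application of a scaled hardtanh function, namely $\big(P_r(v)\big)_i = r \operatorname{hardtanh}(v_i / r)$.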
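The relation between the projection onto a centered $\ell_\infty$-norm ball and a scaled hardtanh can be checked numerically. The sketch below uses hypothetical function names and an arbitrary test vector; it only assumes that hardtanh clips each component to the interval $[-1, 1]$.

```python
import numpy as np

def hardtanh(v):
    """Componentwise clipping to [-1, 1]."""
    return np.clip(v, -1.0, 1.0)

def project_linf_ball(v, radius):
    """Projection onto the l_inf ball with center 0, written as a scaled hardtanh."""
    return radius * hardtanh(v / radius)

v = np.array([-3.0, 0.2, 5.0])
p = project_linf_ball(v, radius=0.5)
print(p)  # [-0.5  0.2  0.5]
```

Components that already satisfy the constraint pass through unchanged, while all others are moved to the nearest face of the ball.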

4 Numerical Experiments

We investigate the impact of primal-dual (residual) networks using a dataset of 720 sentences from the IEEE corpus provided in Loizou (2013) consisting of male speech and sampled at 16 kHz. From these signals, 70% (i.e., 504 signals) are used as training set, 15% (i.e., 108 signals) are reserved as development set, and another 15% serve as test set which we use for a comparison with the results of Brauer et al. (2016). In order to reconstruct ground truth signals $s$ from quantized measurements $q$, our goal is to train primal-dual (residual) networks consisting of $L$ stacked blocks (3.3) and (3.4), respectively. As initial activations, we use $x^0 = 0$ and $y^0 = 0$. However, as discussed above, the actual input $q$ of the network appears in the form of a linear offset during the computation of each dual activation. Our estimate for $s$ is finally $\hat{s}_\Theta = x^L + q$, where $x^L$ is the primal activation of the last block and $\Theta$ denotes the aggregate of weights in the network.

4.1 Learning setup

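To make the networks usable in principle for real-time applications, the original and quantized signals are truncated using a rectangular window function. To that end, a window size $w$ as well as a shift length $h$ are a priori fixed. Then, the signals are split into $w$-dimensional sub-signals

$$s^{(j)} = \big(s_{(j-1)h+1}, \dots, s_{(j-1)h+w}\big), \quad j = 1, \dots, \lfloor (n-w)/h \rfloor + 1, \tag{4.1}$$

where $n$ is the dimension of the respective full signals. We use and truncate all 504 signals in the training set. This way, we end up with training examples and a new training set. To train the network, we use a weighted sum of mean squared error (MSE) and regularization as loss function, i.e., we minimize

$$\frac{1}{N} \sum_{i=1}^{N} \frac{1}{w}\,\big\|\hat{s}_\Theta(q_i) - s_i\big\|_2^2 + \lambda\, R(\Theta) \tag{4.2}$$

over the weights $\Theta$, where $(q_i, s_i)$, $i = 1, \dots, N$, are the training examples, $\hat{s}_\Theta(q_i)$ is the network estimate for $s_i$, $R$ is a regularizer and $\lambda \ge 0$ is a hyperparameter.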
Subsequently, we use -regularization and choose the hyperparameter with respect to the development set. All experiments were conducted on an NVIDIA GeForce® GTX 1080 Ti GPU using TensorFlow™ 1.5.0. To minimize (4.2), we used Adam (Kingma and Ba, 2014) with learning rate and all other parameters set to standard values.
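The truncation with a rectangular window described in this subsection amounts to cutting each signal into overlapping fixed-length pieces. The following sketch uses a hypothetical helper `split_into_windows` and small toy values for the window size and shift length, since the values used in the experiments are not reproduced here. It also assumes that trailing samples which do not fill a complete window are discarded.

```python
import numpy as np

def split_into_windows(signal, window_size, shift):
    """Split a 1-D signal into sub-signals of length `window_size`,
    moving the window start by `shift` samples each time."""
    n = signal.shape[0]
    if n < window_size:
        raise ValueError("signal is shorter than the window")
    starts = range(0, n - window_size + 1, shift)
    return np.stack([signal[s:s + window_size] for s in starts])

# Toy example (assumed values): n = 10, window size 4, shift length 2.
signal = np.arange(10.0)
windows = split_into_windows(signal, window_size=4, shift=2)
print(windows.shape)  # (4, 4)
```

Applying the same function to an original signal and to its quantized version yields aligned pairs of sub-signals that can serve as training examples.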

4.2 Comparison of network architectures

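| network | data | reg. | depth 1 | depth 2 | depth 5 | depth 10 | depth 15 |
|---------|--------|------|---------|---------|---------|----------|----------|
| PDN     | train. | w/o  | 2.08    | 1.28    | 0.48    | 0.28     | 1.51     |
| PDRN    | train. | w/o  | 2.05    | 1.28    | 0.53    | 0.40     | 0.57     |
| PDN     | train. | w/   | 3.29    | 2.96    | 2.06    | 1.62     | 1.69     |
| PDRN    | train. | w/   | 2.09    | 1.57    | 1.06    | 1.01     | 1.06     |
| PDN     | dev.   | w/o  | 2.43    | 2.36    | 2.31    | 2.37     | 2.48     |
| PDRN    | dev.   | w/o  | 2.38    | 2.32    | 2.00    | 1.92     | 2.03     |
| PDN     | dev.   | w/   | 3.46    | 3.20    | 2.49    | 2.07     | 2.13     |
| PDRN    | dev.   | w/   | 2.40    | 2.24    | 1.84    | 1.81     | 1.88     |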
Table 1: Minimum MSE values (in multiples of ) achieved with PDNs and PDRNs

In a first set of experiments, we compared the performance of primal-dual networks (PDNs) and primal-dual residual networks (PDRNs) applied to the task of speech dequantization. In either case, we tried different depths and trained each network over 1000 epochs using a batch size of 128. Our results are illustrated in Figures 3 and 4, and in Table 1 we report the minimum MSE values associated with each trained network.

All in all, our results indicate that PDRNs feature superior performance compared to classical PDNs. On the one hand, Table 1 shows that PDRNs consistently yield lower MSE values than PDNs on the development set (17% lower without regularization and 13% lower with regularization, if one considers the respective best values among all depths). On the other hand, it can be seen clearly from Figures 3 and 4 that, especially in the case of deeper networks, PDRNs exhibit a more stable behavior in the sense that training and development errors decay more smoothly over time. In this context, the displayed results for are particularly notable. In addition, we could observe similar behavior for depths .

4.3 Comparison with primal-dual algorithm

In further experiments conducted on the test set, we compared PDNs and PDRNs to the Chambolle-Pock algorithm (CP). To that end, we computed the respective MSEs as above and, in addition, the associated signal-to-noise ratios (SNRs). On the side of neural networks, we selected one PDN and one PDRN on the basis of the respective errors on the development set (cf. the bold-faced entries in Table 1). Further, we applied CP to problem (3.1) with $D$ being a discrete cosine transform matrix, exactly as described in Brauer et al. (2016). For the sake of a fair comparison, we stopped CP after 10, 25 and 50 iterations and calculated the respective errors at those points. In addition, we computed the MSEs as well as the SNRs of the quantized signals before reconstruction (QU). Our results are reported in Table 2.

A comparison of the test results in Table 2 reveals that PDNs and PDRNs offer significantly lower MSE values and higher SNR values than the Chambolle-Pock algorithm, seemingly independent of the performed number of iterations. The considered PDRN improves the MSE of the quantized signal by 58.6% and its SNR by 25.5%, whereas the largest number of Chambolle-Pock iterations yields improvements of only 32.9% and 11.9%, respectively. Further, it can be seen that PDRNs outperform PDNs once more, since the respective relative improvements achieved with the considered PDN amount to 52.3% and 21.3%.

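|     | QU    | PDN   | PDRN  | CP (10 iterations) | CP (25 iterations) | CP (50 iterations) |
|-----|-------|-------|-------|--------------------|--------------------|--------------------|
| MSE | 4.47  | 2.13  | 1.85  | 3.05               | 3.00               | 3.00               |
| SNR | 15.06 | 18.27 | 18.90 | 16.71              | 16.81              | 16.85              |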
Table 2: Test errors achieved with quantized signals before reconstruction (QU), best PDN and PDRN w.r.t. development set, and Chambolle-Pock algorithm (CP) (MSE values in multiples of )
Figure 3: Training and development errors yielded with PDNs (3.3).
Figure 4: Training and development errors yielded with PDRNs (3.4).
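The relative improvements quoted in Section 4.3 follow directly from the entries of Table 2, taking the quantized signal (QU) as baseline. The short sketch below recomputes them; the helper name `relative_improvement` is hypothetical, and the improvement is assumed to be measured as the change relative to the QU value, counted as a decrease for the MSE and as an increase for the SNR.

```python
def relative_improvement(baseline, value, lower_is_better=True):
    """Improvement over `baseline` in percent."""
    change = (baseline - value) if lower_is_better else (value - baseline)
    return 100.0 * change / baseline

# Values copied from Table 2.
mse = {"QU": 4.47, "PDN": 2.13, "PDRN": 1.85, "CP50": 3.00}
snr = {"QU": 15.06, "PDN": 18.27, "PDRN": 18.90, "CP50": 16.85}

for name in ("PDRN", "PDN", "CP50"):
    mse_gain = relative_improvement(mse["QU"], mse[name])
    snr_gain = relative_improvement(snr["QU"], snr[name], lower_is_better=False)
    print(f"{name}: MSE {mse_gain:.1f}%, SNR {snr_gain:.1f}%")
```

Rounded to one decimal, this gives 58.6% and 25.5% for the PDRN, 52.3% and 21.3% for the PDN, and 32.9% and 11.9% for 50 Chambolle-Pock iterations.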

5 Conclusion

In this paper, we have proposed the architecture of primal-dual residual networks which combines features of unrolled proximal splitting methods and residual networks. Further, we have drawn a comparison between the proposed network architecture and classical primal-dual networks which can be considered straightforwardly unrolled proximal splitting methods. Our results have shown that, applied to speech dequantization, primal-dual residual networks can outperform their classical counterpart significantly. Moreover, we have seen that both architectures can beat a truncated proximal splitting scheme on the same task.

However, some questions have been left open for future research. First, we have so far only compared primal-dual (residual) networks without extrapolation. As extrapolation is an important aspect concerning the convergence of proximal splitting methods, it may also play a role in the behavior of the related network architectures. Second, using the example of speech dequantization, we have shown how a particular convex optimization problem with hard constraints can be unrolled in terms of a neural network by performing an appropriate change of variables and considering the constraint as an indicator function of a related norm ball. It is likely that this approach can be generalized to a larger class of optimization problems. We are going to address these as well as other related questions in the near future.


  • Adler and Öktem [2018] Jonas Adler and Ozan Öktem. Learned Primal-dual Reconstruction. IEEE Transactions on Medical Imaging, 2018. doi: 10.1109/TMI.2018.2799231.
  • Bauschke and Combettes [2017] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2nd edition, 2017.
  • Brauer et al. [2016] Christoph Brauer, Timo Gerkmann, and Dirk Lorenz. Sparse reconstruction of quantized speech signals. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5940–5944. IEEE, 2016.
  • Chambolle and Pock [2011] Antonin Chambolle and Thomas Pock. A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. J Math Imaging Vis, 40:120–145, 2011.
  • Collobert [2004] Ronan Collobert. Large Scale Machine Learning. PhD thesis, Université Paris VI, 2004.
  • Gregor and LeCun [2010] Karol Gregor and Yann LeCun. Learning Fast Approximations of Sparse Coding. In Proceedings of the 27th International Conference on Machine Learning, pages 399–406. Omnipress, 2010.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • Kingma and Ba [2014] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Loizou [2013] Philipos C. Loizou. Speech Enhancement: Theory and Practice. CRC Press, 2013.
  • Parikh and Boyd [2013] N. Parikh and S. Boyd. Proximal Algorithms. Foundations and Trends in Optimization. Now Publishers, 2013. ISBN 9781601987167.
  • Riegler et al. [2016a] Gernot Riegler, David Ferstl, Matthias Rüther, and Horst Bischof. A Deep Primal-Dual Network for Guided Depth Super-Resolution. In Proceedings of the British Machine Vision Conference (BMVC), pages 7.1–7.14, 2016a.
  • Riegler et al. [2016b] Gernot Riegler, Matthias Rüther, and Horst Bischof. ATGV-Net: Accurate Depth Super-Resolution. In Computer Vision – ECCV 2016, pages 268–284. Springer International Publishing, 2016b.
  • Wang et al. [2016] Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Proximal Deep Structured Models. In 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.