
Accelerating Deep Unrolling Networks via Dimensionality Reduction

In this work we propose a new paradigm for designing efficient deep unrolling networks using dimensionality reduction schemes, including minibatch gradient approximation and operator sketching. Deep unrolling networks are currently the state-of-the-art solutions for imaging inverse problems. However, for high-dimensional imaging tasks, especially X-ray CT and MRI, deep unrolling schemes typically become inefficient in both memory and computation, due to the need to compute the high-dimensional forward and adjoint operators multiple times. Recently, researchers have found that such limitations can be partially addressed by unrolling stochastic gradient descent (SGD), inspired by the success of stochastic first-order optimization. In this work, we explore this direction further and first propose a more expressive and practical stochastic primal-dual unrolling, based on the state-of-the-art Learned Primal-Dual (LPD) network, and then a further acceleration of stochastic primal-dual unrolling, using sketching techniques to approximate products in the high-dimensional image space. Operator sketching can be applied jointly with stochastic unrolling for the best acceleration and compression performance. Our numerical experiments on X-ray CT image reconstruction demonstrate the remarkable effectiveness of our accelerated unrolling schemes.





I Introduction

Stochastic first-order optimization methods have become the de facto techniques in modern data science and machine learning, with exceptionally wide applications [kingma2014adam, johnson2013accelerating, allen2017katyusha, chambolle2018stochastic], due to their remarkable scalability to the size of the optimization problems. The underlying optimization tasks in many applications nowadays are large-scale and high-dimensional by nature, as a consequence of big data and overparameterized models (for example, deep neural networks).

While well-designed optimization algorithms can enable efficient machine learning, one can, on the other hand, utilize machine learning to develop problem-adapted optimization algorithms via the so-called learning-to-learn philosophy [andrychowicz2016learning, li2019learning]. Traditionally, optimization algorithms are designed by hand, with human-designed rules for computing gradient estimates, step-sizes, etc., for some general class of problems. Although the traditional field of optimization has already obtained lower-bound-matching (i.e., optimal) algorithms [woodworth2016tight, lan2012optimal, lan2015optimal] for many important general classes of problems, for specific instances there can still be much room for improvement. For example, a classical way of solving imaging inverse problems is to minimize a total-variation regularized least-squares objective [chambolle2016introduction] with a specific measurement operator – a very narrow sub-class of the general class of smooth convex programs for which these optimal algorithms were developed. To obtain an optimal algorithm adapted to a specific instance of a class, hand-crafted mathematical design can be totally inadequate, and very often we do not even know a tight lower bound for the instance.

One of the highly active areas in modern data science is computational imaging (also recognized as low-level computer vision), especially medical imaging, including X-ray computed tomography (CT) [buzug2011computed], magnetic resonance imaging (MRI) [vlaardingerbroek2013magnetic], and positron emission tomography (PET) [ollinger1997positron]. In such applications, clinicians seek to infer images of patients' inner anatomy from noisy measurements collected by the imaging devices. Traditionally, dimensionality reduction methods such as stochastic approximation [robbins1951stochastic] and sketching schemes [2015_Pilanci_Randomized, drineas2011faster, avron2013sketching] have been widely applied in solving large-scale imaging problems due to their scalability [kim2015combining, sun2019online, pmlr-v70-tang17a]. Inspired by their successes, in this work we focus on developing efficient learned minibatch and sketched algorithms tailored for solving imaging inverse problems.

I-A Contributions of this work

In this work, we make four main contributions:

  • Novel deep unrolling networks – leveraging the machinery of modern stochastic optimization and dimensionality reduction

    For the first time, we propose a class of deep unrolling networks based on the principles of modern stochastic first-order optimization and dimensionality reduction for efficiently solving imaging inverse problems. We develop the Learned Stochastic Primal-Dual (LSPD) network and its accelerated variant, Sketched LSPD (SkLSPD), which is further empowered with sketching approximations of matrix-vector products [2015_Pilanci_Randomized, 2016_Pilanci_Iterative, pmlr-v70-tang17a]. Our proposed networks can be viewed as minibatch and sketched extensions of the state-of-the-art unrolling network – the Learned Primal-Dual (LPD) network of [adler2018learned]. Since LPD is a generic framework which includes most existing unrolling schemes as special cases, our acceleration schemes can be extended and applied to many other deep unrolling networks, such as ISTA-Net [zhang2018ista], ADMM-Net [sun2016deep] and FISTA-Net [xiang2021fista].

  • Theoretical analysis of LSPD/SkLSPD network

    We provide a theoretical analysis of a simple instance of our LSPD and SkLSPD networks, from the viewpoint of stochastic non-convex composite optimization. We provide upper and lower bounds on the estimation error under standard assumptions, suggesting that our proposed networks have the potential to achieve estimation accuracy similar to their full-batch counterpart.

  • Less is more – the numerical effectiveness of LSPD and SkLSPD in tomographic medical imaging

    We numerically evaluate the performance of our proposed networks on two typical tomographic medical imaging tasks – low-dose and sparse-view X-ray CT – comparing our LSPD and SkLSPD with the full-batch LPD. We find that our networks achieve reconstruction accuracy competitive with LPD while requiring only a fraction of its computation, in both supervised and unsupervised training settings.

  • Out of training distribution? Get boosted!
    – Efficient instance-adaptation in out-of-distribution reconstruction tasks.

    In computational imaging practice, we often encounter scenarios where the measurement data is observed under a slightly different noise distribution and/or modality compared to the data used for training the unrolling network. We propose an instance-adaptation framework to fine-tune the unrolling network and adjust it to the out-of-distribution data. Due to the efficiency of our scheme, we find numerically that our approach provides superior performance on instance-adaptation tasks for out-of-distribution reconstruction.

II Background

II-A Imaging Inverse Problems

In imaging, the measurement system can generally be expressed as:

$$b = Ax^\dagger + w, \qquad (1)$$

where $x^\dagger \in \mathbb{R}^d$ denotes the ground-truth image (vectorized), $A \in \mathbb{R}^{n \times d}$ denotes the forward measurement operator, $w$ the measurement noise, and $b$ the measurement data. A classical way to obtain a reasonably good estimate of $x^\dagger$ is to solve a composite optimization problem:

$$x^\star \in \arg\min_{x}\; f(Ax) + g(x), \qquad (2)$$

where the data-fidelity term $f$ is typically a convex function (one example being the least-squares $f(Ax) = \frac{1}{2}\|Ax - b\|_2^2$), while $g$ is a regularization term, for example the total-variation (TV) semi-norm, or a learned regularizer [mukherjee2021end, mukherjee2020learned]. A classical way of solving the composite optimization problem (2) is via proximal gradient methods [chambolle2016introduction], which are built on iterations of a gradient descent step on $f$, a proximal step on $g$, and a momentum step using previous iterates for fast convergence [beck2009fast, nesterov2007gradient].
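To make the gradient/proximal/momentum pattern concrete, the sketch below implements a plain proximal gradient method (ISTA) for a least-squares fidelity with an $\ell_1$ regularizer, whose prox (soft-thresholding) serves as a simple stand-in for a TV prox; the toy problem and all names are illustrative, not the paper's code.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (a simple stand-in for a TV prox)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, lam, n_iters=200):
    """ISTA for the toy composite problem min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2               # 1/L with L = ||A||_2^2
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                         # gradient step on the fidelity f
        x = soft_threshold(x - step * grad, step * lam)  # proximal step on the regularizer g
    return x
```

Adding a momentum step on previous iterates (as in FISTA) accelerates this same loop without changing the per-iteration cost.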

Since modern imaging problems are often huge-scale, deterministic methods can be very computationally costly, since they need to apply the full forward and adjoint operators in each iteration. For scalability, stochastic gradient methods [robbins1951stochastic] and ordered-subset methods [erdogan1999ordered, kim2015combining] are widely applied in real-world iterative reconstruction. More recent stochastic variance-reduced gradient methods [xiao2014proximal, defazio2014saga, allen2017katyusha, tang2018rest, driggs2020spring] have also been adopted in suitable imaging scenarios for better efficiency [tang2019limitation, tang2020practicality, karimi2016hybrid].

More recently, deep learning approaches have been adopted for imaging inverse problems, starting from the work of [jin2017deep] on the FBP-ConvNet approach for tomographic reconstruction, and DnCNN [zhang2017beyond] for image denoising. Remarkably, the learned primal-dual (LPD) network [adler2018learned], which mimics the update rule of a primal-dual gradient method and utilizes the forward operator and its adjoint within a deep convolutional network, achieves state-of-the-art results and outperforms primal-only unrolling approaches. Despite the excellent performance, the computation of the learned primal-dual method is significantly heavier than that of direct approaches such as FBP-ConvNet.

II-B Deep unrolling

We now present the motivation of our unrolling network, starting from the basic primal-dual gradient-based optimization algorithm. It is well known that, if the loss function $f$ is convex and lower semi-continuous, we can reformulate the original objective (2) as a saddle-point problem:

$$\min_x \max_h\; \langle Ax, h\rangle - f^*(h) + g(x), \qquad (3)$$

where $f^*$ is the Fenchel conjugate of $f$:

$$f^*(h) = \sup_{z}\; \langle z, h\rangle - f(z). \qquad (4)$$

The saddle-point problem (3) can be efficiently solved by the primal-dual hybrid gradient (PDHG) method [chambolle2011first], also known as the Chambolle-Pock algorithm in the optimization literature. The PDHG method for solving the saddle-point problem obeys the following update rule:

Primal-Dual Hybrid Gradient (PDHG):

$$h_{k+1} = \mathrm{prox}_{\sigma f^*}\!\left(h_k + \sigma A \bar{x}_k\right),$$
$$x_{k+1} = \mathrm{prox}_{\tau g}\!\left(x_k - \tau A^T h_{k+1}\right),$$
$$\bar{x}_{k+1} = x_{k+1} + \theta\,(x_{k+1} - x_k).$$
The PDHG algorithm alternates gradient-based updates on the dual and primal variables. In practice, it is often more desirable to reformulate the primal problem (2) in the primal-dual form (3), especially when the loss function $f$ is non-smooth (or when the Lipschitz constant of its gradient is large).
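A minimal numerical sketch of this alternation, for a least-squares fidelity with an $\ell_1$ regularizer (a toy instance chosen for illustration; not the paper's implementation):

```python
import numpy as np

def pdhg(A, b, lam, n_iters=300):
    """PDHG / Chambolle-Pock for min_x 0.5*||Ax - b||^2 + lam*||x||_1,
    written in saddle-point form.

    For f(z) = 0.5*||z - b||^2 the conjugate is f*(h) = 0.5*||h||^2 + <b, h>,
    with prox_{sigma f*}(v) = (v - sigma * b) / (1 + sigma).
    """
    m, n = A.shape
    L = np.linalg.norm(A, 2)
    sigma = tau = 0.9 / L                  # step sizes with sigma * tau * L^2 < 1
    x, h, x_bar = np.zeros(n), np.zeros(m), np.zeros(n)
    for _ in range(n_iters):
        h = (h + sigma * (A @ x_bar) - sigma * b) / (1 + sigma)    # dual prox step
        v = x - tau * (A.T @ h)                                    # primal gradient step
        x_new = np.sign(v) * np.maximum(np.abs(v) - tau * lam, 0)  # prox of lam*||.||_1
        x_bar = 2 * x_new - x                                      # over-relaxation, theta = 1
        x = x_new
    return x
```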

Currently, the most successful deep networks in imaging are the unrolling schemes [gregor2010learning] inspired by gradient-based optimization algorithms, leveraging knowledge of the physical models. The state-of-the-art unrolling scheme – the learned primal-dual network of [adler2018learned] – is based on unfolding the PDHG iteration, replacing the proximal operators with multilayer convolutional neural networks $\Lambda_{\theta_p}$ and $\Gamma_{\theta_d}$, with parameter sets $\theta_p$ and $\theta_d$, applied in both the primal and dual spaces. The step-sizes at each step are also set to be trainable. The learned primal-dual network with $K$ iterations can be written as follows, where the learnable parameters are $\{\theta_k^p, \theta_k^d\}_{k=0}^{K-1}$:

Learned Primal-Dual (LPD): for $k = 0, 1, \dots, K-1$,

$$h_{k+1} = \Gamma_{\theta_k^d}\!\left(h_k,\, A x_k,\, b\right),$$
$$x_{k+1} = \Lambda_{\theta_k^p}\!\left(x_k,\, A^T h_{k+1}\right).$$
When the primal and the dual CNNs are kept fixed across the layers of LPD, it has the potential to learn both the data-fidelity and the regularizer (albeit one might need additional constraints on the CNNs to ensure that they are valid proximal operators). This makes the LPD parametrization more powerful than a learned proximal-gradient network (with only a primal CNN), which can only learn the regularization functional. The capability of learning the data-fidelity term can be particularly useful when the noise distribution is unknown and one does not have a clear analytical choice for the fidelity term.

Fig. 1: A simple example of practical choices for the building blocks of one layer of our LSPD network. Both the dual and primal subnetworks consist of 3 convolutional layers. The dual subnet has 3 input channels (concatenating the dual iterate, the projected primal iterate, and the measurement data), while the primal subnet has 2 input channels (the primal iterate and the back-projected dual iterate).

We choose LPD as our backbone because it is a very generic framework which covers all existing gradient-based unrolling schemes and plug-and-play algorithms as special cases. For instance, if we choose the dual subnetwork of LPD to be a simple subtraction (the data residual), we recover Learned ISTA/FISTA. If we further choose the primal subnetwork to be a pretrained CNN denoiser, we recover the plug-and-play ISTA/FISTA algorithm. As such, by studying the acceleration of LPD, we cover standard primal unrolling and plug-and-play gradient methods as special cases, and all of our proposed acceleration schemes and theory in this work are applicable to these simplified cases.
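One layer of such a network can be sketched in PyTorch as follows; `fwd` and `adj` stand in for the forward operator and its adjoint, and the channel counts and kernel sizes are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LPDLayer(nn.Module):
    """One unrolled LPD-style iteration (a sketch, not the reference code)."""
    def __init__(self, channels=32):
        super().__init__()
        def cnn(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, channels, 5, padding=2), nn.PReLU(),
                nn.Conv2d(channels, channels, 5, padding=2), nn.PReLU(),
                nn.Conv2d(channels, 1, 5, padding=2))
        self.dual_net = cnn(3)    # inputs: dual iterate h, A x, data b
        self.primal_net = cnn(2)  # inputs: primal iterate x, A^T h

    def forward(self, x, h, b, fwd, adj):
        h = h + self.dual_net(torch.cat([h, fwd(x), b], dim=1))   # dual update
        x = x + self.primal_net(torch.cat([x, adj(h)], dim=1))    # primal update
        return x, h
```

Stacking $K$ such layers (with independent weights) and training end-to-end recovers the unrolled-network pattern described above.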

III Accelerating the Deep Unrolling Schemes via Subset Approximation and Operator Sketching

In this section, we will present our two schemes for accelerating deep unrolling networks.

III-A Acceleration via Subset Approximation

In our new approach, we propose to replace the forward and adjoint operators in the full-batch LPD network of [adler2018learned] with subsets of them. The proposed network can be viewed as an unrolled version of stochastic PDHG [chambolle2018stochastic] (but with ordered subsets, and without variance reduction). We partition the forward and adjoint operators into $M$ subsets, along with the corresponding measurement data, and in each layer we use only one of the subsets, in cyclic order. Let $\{S_i\}_{i=1}^{M}$ be the set of subsampling operators; the saddle-point problem (3) can then be rewritten as:

$$\min_x \max_{h_1,\dots,h_M}\; \sum_{i=1}^{M} \left(\langle S_i A x, h_i\rangle - f_i^*(h_i)\right) + g(x). \qquad (5)$$

Utilizing this finite-sum structure, our learned stochastic primal-dual (LSPD) network can be described as follows.¹

¹ Alternatively, one may also consider an optional learned momentum acceleration by keeping in memory the outputs of a number of previous layers, at the cost of extra computation and memory; in that case the number of input channels of the subnets increases accordingly.
Learned Stochastic Primal-Dual (LSPD): for $k = 0, 1, \dots, K-1$, with subset index $i_k = k \bmod M$,

$$h^{i_k}_{k+1} = \Gamma_{\theta_k^d}\!\left(h^{i_k}_k,\; S_{i_k} A x_k,\; S_{i_k} b\right),$$
$$x_{k+1} = \Lambda_{\theta_k^p}\!\left(x_k,\; A^T S_{i_k}^T h^{i_k}_{k+1}\right).$$
In the scenarios where the forward operator dominates the computation in the unrolling network, for the same number of layers our LSPD network is approximately $M$ times more efficient than the full-batch LPD network in terms of computational complexity. The LSPD presented here describes a framework of deep-learning-based methods, depending on the parameterization of the primal and dual subnetworks and how they are trained. In practice, the LPD and LSPD networks usually achieve the best performance when trained completely end-to-end. While end-to-end training is the most recommended option in practice, it makes non-trivial theoretical guarantees almost impossible to provide. An alternative approach is to restrict the subnetworks to be the same across layers and train them to perform denoising [kamilov2017plug, sun2019online, tang2020fast, ono2017primal], artifact removal [liu2020rare], or approximate projection onto an image manifold [rick2017one], leading to a plug-and-play [venkatakrishnan2013plug, romano2017little, reehorst2018regularization] type of approach with theoretical convergence guarantees.
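The cyclic subset schedule behind this saving can be sketched in a few lines; `fwd_subsets[i]` / `adj_subsets[i]` stand in for the $i$-th angular subset of the operator and its adjoint, and `layers` are callables with the signature `layer(x, h, b, fwd, adj)` (all names here are illustrative, not the paper's implementation).

```python
def lspd_forward(x, h_subsets, b_subsets, fwd_subsets, adj_subsets, layers):
    """Run an unrolled LSPD-style network: layer k touches only subset k mod M,
    so each layer costs roughly 1/M of a full forward/adjoint pair."""
    M = len(fwd_subsets)
    for k, layer in enumerate(layers):
        i = k % M                                   # cyclic (ordered-subset) order
        x, h_subsets[i] = layer(x, h_subsets[i], b_subsets[i],
                                fwd_subsets[i], adj_subsets[i])
    return x
```

With 12 layers and $M = 4$ subsets (the setting used in the experiments), this schedule amounts to 3 full data passes.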

Note that our LSPD network covers the SGD-Net of [liu2021sgd] as a special case, obtained by setting the dual network to be a simple subtraction and limiting the primal network to a single input channel taking the stochastic gradient descent step with a fixed scalar primal step-size $\tau$. We refer to this type of network as Learned SGD (LSGD) in this paper:

$$x_{k+1} = \Lambda_{\theta_k^p}\!\left(x_k - \tau\,(S_{i_k} A)^T (S_{i_k} A x_k - S_{i_k} b)\right),$$

which is a stochastic variant of the ISTA-Net [zhang2018ista].

III-B Double Acceleration via Operator Sketching

Now we are ready to present our sketched LPD and sketched LSPD networks. Our main idea is to cheaply approximate the products $Ax$ and $A^T h$:

$$Ax \approx A_s S_\theta x, \qquad A^T h \approx U_\theta A_s^T h,$$

where $S_\theta$ is the sketching/downsampling operator, which can potentially be trainable w.r.t. parameters $\theta$, while $A_s$ is the sketched forward operator discretized on the reduced low-dimensional image space, and for the dual step $U_\theta$ is the upsampling operator, which can also be trained. In practice, we found that it actually suffices to use the simplest off-the-shelf up/down-sampling operators in PyTorch, for example bilinear interpolation, to deliver excellent performance for the sketched unrolling networks. Our Sketched LPD network is written as:
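These two approximations, with fixed bilinear resampling playing the roles of $S$ and $U$, can be sketched as follows; `fwd_sketch` and `adj_sketch` stand in for the coarse-grid operator $A_s$ and its adjoint (illustrative names, not the paper's code).

```python
import torch
import torch.nn.functional as F

def sketched_forward(x, fwd_sketch, scale=0.5):
    """Approximate A x by A_s S x: bilinearly downsample x to the coarse
    grid (the sketch S), then apply the coarse-grid operator A_s."""
    x_small = F.interpolate(x, scale_factor=scale, mode='bilinear',
                            align_corners=False)
    return fwd_sketch(x_small)

def sketched_adjoint(h, adj_sketch, scale=2.0):
    """Approximate A^T h by U A_s^T h: coarse-grid adjoint A_s^T,
    then bilinear upsampling U back to the fine grid."""
    v_small = adj_sketch(h)
    return F.interpolate(v_small, scale_factor=scale, mode='bilinear',
                         align_corners=False)
```

Since `F.interpolate` is cheap, essentially all of the saving comes from applying the operator on the coarse grid.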

Again, we can use the same approximation for stochastic gradient steps:

$$(S_{i} A)^T (S_{i} A x - S_{i} b) \;\approx\; U_\theta (S_{i} A_s)^T (S_{i} A_s S_\theta x - S_{i} b),$$

and hence we can write our Sketched LSPD (SkLSPD) network, either (Option 1) by replacing $S_{i_k} A x_k$ and $A^T S_{i_k}^T h^{i_k}_{k+1}$ in the LSPD updates with their sketched counterparts $S_{i_k} A_s S_\theta x_k$ and $U_\theta A_s^T S_{i_k}^T h^{i_k}_{k+1}$, or alternatively (Option 2) by having the primal subnetwork take the low-dimensional iterate and gradient as inputs and upsample only at the output.

III-B1 Remark regarding varying "coarse-to-fine" sketch size for SkLPD and SkLSPD

Numerically, we suggest using more aggressive sketching in the early layers for efficiency, and conservative sketching or no sketching in the later layers for accuracy. One choice we found numerically quite successful: for the last few unrolling layers of SkLPD and SkLSPD, we switch to the usual LPD/LSPD (say, if the number of unrolling layers is 20, we can leave the last 4 unrolling layers unsketched), such that the reconstruction accuracy is best preserved.

III-B2 Remark regarding Option 2 for further improving efficiency

The second option of our SkLPD and SkLSPD further accelerates the computation compared with Option 1, by having the primal subnet take the low-dimensional images and gradients as input and then upscale. Since the usual choice for the up- and down-sampler is simply an off-the-shelf interpolation algorithm, such as bilinear or bicubic interpolation, which can be computed very efficiently, in practice we find Option 2 often more favorable computationally when combined with the coarse-to-fine sketch size. Numerically, we find that SkLPD and SkLSPD with Option 2 and the coarse-to-fine sketch size can be trained faster and are more efficient at test time, due to the further reduction in the computation of the primal subnet, without loss of reconstruction accuracy compared with Option 1.

IV Theoretical Analysis of LSPD Network

In this section we provide theoretical recovery analysis of a subclass of the SkLSPD framework presented in the previous section. From this motivational analysis, we aim to demonstrate the reconstruction guarantee of a light-weight, recurrent version of the SkLSPD, and compare it with the recovery guarantee of the LPD derived under the same setting.

We analyze here a light-weight variant of SkLSPD (SkLSPD-LW), where we set the dual sub-networks to be the proximal operator of a weighted least-squares loss $f_W(z) = \frac{1}{2}\|z - b\|_W^2$, where the weight matrix $W$ contains trainable parameters; meanwhile, we set the primal subnetworks to share the same weights across layers, trained separately as an approximate projection operator onto some image manifold $\mathcal{M}$. A typical example of this type of construction of an unrolling network can be found in [rick2017one].

If we choose the simplified case $W = I$, which means we use the usual least-squares loss for the data fit, the following Sketched Learned-SGD (SkLSGD) network can be derived. We can easily observe that this simplified version can be viewed as a learned proximal SGD with dimensionality reduction.

Notably, we choose to study simplified parameterizations of the dual sub-network to make the error analysis feasible. Our ongoing work includes the analysis and understanding of learning the dual subnetwork, which is effectively the learning of an adaptive loss.

IV-A Generic Assumptions

We list here the assumptions we make in our motivational analysis of the simplified LSPD.

A. 1

(Approximate projection) We assume that the primal subnetwork of the simplified LSPD is an $\varepsilon$-approximate projection onto a manifold $\mathcal{M}$:

$$\|\Lambda_\theta(x) - \mathcal{P}_{\mathcal{M}}(x)\|_2 \le \varepsilon, \qquad \forall x,$$

where $\mathcal{P}_{\mathcal{M}}$ denotes the exact Euclidean projection onto $\mathcal{M}$. Here we model the primal subnetwork as an $\varepsilon$-accurate projection onto a manifold. Note that in practice the image manifold typically forms a non-convex subset. We also impose conditions on the image manifold, as follows:

A. 2

(Interpolation) We assume the ground-truth image $x^\dagger \in \mathcal{M}$, where $\mathcal{M}$ is a closed set.

With this condition on the manifold, we further assume a restricted eigenvalue (restricted strong-convexity) condition, which is necessary for robust recovery [negahban2012unified, agarwal2012fast, wainwright2014structured, oymak2017sharp]:

A. 3

(Restricted Eigenvalue Condition) We define the descent cone of $\mathcal{M}$ at a point $x$ as:

$$\mathcal{C}_x = \{v :\; v = a(x' - x),\; a \ge 0,\; x' \in \mathcal{M}\},$$

the restricted strong-convexity constant $\alpha$ to be the largest positive constant satisfying:

$$\|Av\|_2^2 \ge \alpha \|v\|_2^2, \qquad \forall v \in \mathcal{C}_{x^\dagger},$$

and the restricted smoothness constant $\beta$ to be the smallest positive constant satisfying:

$$\|Av\|_2^2 \le \beta \|v\|_2^2, \qquad \forall v \in \mathcal{C}_{x^\dagger}.$$
The restricted eigenvalue condition is standard and crucial for robust estimation guarantees for linear inverse problems; i.e., for a linear inverse problem to be non-degenerate, this type of condition must hold [oymak2017sharp, agarwal2012fast, 2012_Chandrasekaran_Convex]. For example, in the sparse-recovery setting, when the measurement operator is a Gaussian map (compressed-sensing measurements) and $x^\dagger$ is sparse, one can show that $\alpha$ is bounded away from zero with high probability [oymak2017sharp]. In our setting we would expect an even better $\alpha$, since the manifold of certain classes of real-world images should have a much smaller covering number compared to the sparse set.

IV-B Estimation error bounds of SkLSPD

With the assumptions presented in the previous subsection, we now provide the recovery guarantee of a $K$-layer SkLSPD-LW network on the linear inverse problem (1). Denoting by $\delta$ the smallest constant satisfying:


we can have the following result: (Upper bound) Assuming A.1-3, the output of a $K$-layer simplified LSPD network has the following guarantee for the estimation of $x^\dagger$:


where the contraction factor takes one value if $\mathcal{M}$ is convex and another if $\mathcal{M}$ is non-convex (the constants are specified in the appendix),


In Theorem IV-B, we show exponential convergence of the estimation error towards $x^\dagger$, up to a statistical accuracy. Note that this theorem also covers LSPD as the special case without sketching. For the case where we only learn the primal subnetwork, we can simplify the above upper bound, as we recover SkLSGD:

Corollary IV.1

Assuming A.1-3, the output of a $K$-layer SkLSGD network has the following guarantee for the estimation of $x^\dagger$:


where we denote by $\mathbb{B}$ and $\mathbb{S}$ the unit ball and the unit sphere around the origin, respectively,


The proof of this corollary is a simplified version of the proof of Theorem 1, hence we do not repeat it in the appendix.

For the noiseless case ($w = 0$), the noise-dependent term in the bound vanishes.

When the restricted eigenvalue $\alpha$ is large enough, the simplified LSPD has linear convergence in estimation error, up to an error floor depending only on the approximation accuracy of the primal subnet as a projection. For many inverse problems, for example CT/PET tomographic imaging, the restricted eigenvalue condition is well-behaved, and in these tasks the same convergence rate in Theorem IV-B applies to both LSPD/SkLSPD and LPD. This suggests the tremendous potential for computational savings of LSPD over full-batch LPD.

On the other hand, using a similar technique, we can provide a complementing lower bound for the estimation error of SkLSGD. (Lower bound.) Under the same conditions as Theorem IV-B, if we further assume the constraint set $\mathcal{M}$ is convex, the estimation error of the output of SkLSGD satisfies a matching lower bound of the same form. Again, we present the proof of this result in the appendix.

Fig. 2: Examples for Low-dose CT on the test set of Mayo dataset. We can observe that our LSPD networks achieve almost the same reconstruction performance as the full-batch LPD, without visual difference.

V Training of LSPD and SkLSPD

V-A Supervised end-to-end training

The most basic training approach for LSPD/SkLSPD is end-to-end supervised training, where we consider fully paired training samples $(b_i, x_i^\dagger)$ of measurements and ground truth – the latter typically obtained via reconstruction from high-accuracy and abundant measurements. We take the initialization of LSPD/SkLSPD to be a filtered back-projection of the data. Let $\theta$ be the set of learnable parameters; applying the LSPD/SkLSPD network to a measurement $b$ can be written as $\mathcal{F}_\theta(b)$, and the training objective can typically be written as:

$$\theta^\star \in \arg\min_\theta \frac{1}{N}\sum_{i=1}^{N} \left\|\mathcal{F}_\theta(b_i) - x_i^\dagger\right\|_2^2,$$

where we denote by $N$ the number of paired training examples. Since paired high-quality ground-truth data is expensive and often unrealistic in practice, we would prefer more advanced training methods with relaxed requirements on available training data. In the Appendix, we present unsupervised training results for the unrolling networks using only measurement data.
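The supervised objective above corresponds to a standard training loop; in this sketch, `net(x0, b)` applies the unrolling network from the FBP initialization, and `net`, `fbp` and `pairs` are illustrative stand-ins rather than the paper's code.

```python
import torch

def train_supervised(net, pairs, fbp, epochs=50, lr=1e-4):
    """End-to-end supervised training of an unrolling network:
    minimize the mean squared error between reconstructions and ground truth."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for b, x_true in pairs:              # paired (measurement, ground truth)
            x0 = fbp(b)                      # filtered back-projection init
            loss = torch.mean((net(x0, b) - x_true) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net
```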

V-B Fine-tuning step by instance-adaptation for out-of-distribution reconstruction

In medical imaging practice, we may encounter scenarios where the input data comes from a slightly different measurement modality or measurement noise distribution than the training data used for training the unrolling network. In such a setting, instead of retraining the whole network, it is more desirable to fine-tune the network to adapt itself to this input data [yu2021empirical, adler2021task]. This process is known as instance-adaptation [vaksman2020lidia, tachella2020neural], which has been applied to boosting denoising networks. Let $G = \{T_g\}$ be a group of transformations (for example, rotations), and assume that the measurement operation is equivariant with respect to this group, i.e., applying a transformation to the image corresponds to a transformation of its measurements.
Such a condition holds for CT/PET if we consider rotations [celledoni2021equivariant]. Our post-processing method can be described as follows: take a pre-trained LSPD network $\mathcal{F}_{\theta_0}$, apply it to the input data $b$, and run a few steps of the Adam optimizer on the following self-supervised objective (initialized with $\theta_0$):

$$\min_\theta\; \left\|A\,\mathcal{F}_\theta(b) - b\right\|_2^2 + \lambda\,\mathcal{R}_{\mathrm{eq}}(\theta; b),$$

where we adapt the equivariant regularization term $\mathcal{R}_{\mathrm{eq}}$ proposed in [chen2021equivariant] to our setting for data augmentation, and then reconstruct the underlying image as $\mathcal{F}_{\theta^\star}(b)$. Note that this post-processing is not well suited to classic full-batch unrolling networks, since it requires a number of calls to the full forward and adjoint operators, which is computationally inefficient. For our LSPD/SkLSPD networks, this limitation is significantly mitigated, since we only use subsets of these operators. We present numerical results of this optional step on sparse-view CT in the Appendix.
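A minimal sketch of this fine-tuning step is given below; `torch.rot90` plays the role of $T_g$, and the data-fit plus rotation-equivariance penalty follows the equivariant-imaging idea only in spirit — all names (`net`, `fwd`, `fbp`) are illustrative stand-ins, not the paper's implementation.

```python
import torch

def adapt_instance(net, b, fwd, fbp, steps=10, lam=1.0, lr=1e-5):
    """Self-supervised fine-tuning on a single out-of-distribution input b:
    a measurement-consistency term plus an equivariance penalty under rotations."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        x = net(fbp(b), b)
        data_fit = torch.mean((fwd(x) - b) ** 2)      # ||A F(b) - b||^2
        k = int(torch.randint(1, 4, (1,)))            # random rotation T_g
        x_g = torch.rot90(x, k, dims=(-2, -1))        # T_g F(b)
        b_g = fwd(x_g)                                # re-measure the rotated image
        equiv = torch.mean((net(fbp(b_g), b_g) - x_g) ** 2)
        loss = data_fit + lam * equiv
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net(fbp(b), b).detach()
```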

Method                                     | Calls on A and A^T | PSNR    | SSIM   | GPU time (s), 1 pass of test set
FBP                                        | -                  | 14.3242 | 0.0663 | -
LPD (12 layers)                            | 24                 | 35.3177 | 0.9065 | 48.348
LSGD (24 layers)                           | 12                 | 31.5825 | 0.8528 | 33.089
LSPD (12 layers)                           | 6                  | 35.0577 | 0.9014 | 31.196
SkLSPD (12 layers)                         | 4                  | 34.9749 | 0.9028 | 23.996
SkLSPD (12 layers, light-weight dual step) | 4                  | 34.6389 | 0.8939 | 19.843

TABLE I: Low-dose CT testing results for LPD, LSPD and SkLSPD networks on the Mayo dataset, with supervised training

VI Numerical Results

In this section we present numerical results of our proposed networks for low-dose X-ray CT. In real-world clinical practice, low-dosage CT is widely used and highly recommended, since intense exposure to X-rays can significantly increase the risk of inducing cancer [mason2018quantitative]. Low-dose CT takes a large number of low-energy X-ray views, leading to huge volumes of noisy measurements, which makes reconstruction schemes struggle to achieve efficient and accurate estimates. In our X-ray CT experiments we use the standard Mayo-Clinic dataset [mccollough2016tu], which contains 3D CT scans of 10 patients. We use 2111 slices of 2D images (from 9 patients) for training and the 118 slices of the remaining patient for testing. We use the ODL toolbox [adler2018learned] to simulate fan-beam projection data with 800 equally-spaced angles of view (each view includes 800 rays). The fan-beam CT measurement is corrupted with Poisson noise, $b \sim \mathrm{Poisson}(I_0 e^{-A x^\dagger})$, where we make a low-dose choice of the source intensity $I_0$. This formula simulates the noisy projection data according to the Beer-Lambert law; to linearize the measurements, we consider the log data.
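The Beer-Lambert simulation and log-linearization just described can be sketched as follows; the intensity value `I0` below is only illustrative, since the paper's exact low-dose setting is not reproduced here.

```python
import numpy as np

def simulate_low_dose(sino, I0=3500.0, rng=None):
    """Simulate Beer-Lambert low-dose data for line integrals Ax (`sino`):
    counts ~ Poisson(I0 * exp(-Ax)), then linearize with the log transform."""
    if rng is None:
        rng = np.random.default_rng()
    counts = rng.poisson(I0 * np.exp(-sino))
    counts = np.maximum(counts, 1)          # guard against log(0) in empty bins
    return -np.log(counts / I0)             # noisy estimate of the line integrals
```

Lower `I0` means fewer photons per detector bin and hence noisier log data, which is what makes the low-dose regime challenging.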

In our LSPD and SkLSPD networks we interleave-partition (according to the angles) the forward/adjoint operators and the data into 4 subsets. Our networks have 12 layers (each layer of LSPD includes a primal and a dual subnetwork with 3 convolutional layers of 32 channels; same for LPD), hence corresponding to 3 data passes – only 3 calls in total to the full forward and adjoint operators. We compare with the learned primal-dual (LPD) network with 12 layers, corresponding to 12 calls to the forward and adjoint operators. We train all the networks in a supervised manner, with 50 epochs of the Adam optimizer [kingma2014adam] and batch size 1.
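The interleaved angular partition used here can be expressed in one line; subset $i$ takes every $M$-th view starting at angle $i$, so each subset still covers the full angular range.

```python
import numpy as np

def interleave_partition(n_angles, n_subsets):
    """Interleave the projection angles into subsets: subset i takes angles
    i, i + M, i + 2M, ..., so every subset spans the whole angular range."""
    return [np.arange(i, n_angles, n_subsets) for i in range(n_subsets)]
```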

For our SkLSPD we choose Option 2 presented in Section III-B, with the coarse-to-fine sketch size. For all these networks, we choose the subnetworks $\Lambda$ and $\Gamma$ to have 3 convolutional layers (with a skip connection between the first channel of the input and the output) and 32 channels, with kernel size 5, and we do not use the momentum option, for simplicity and memory efficiency. The starting point is set to be the standard filtered back-projection for all the unrolling networks. We set all of them to have 12 algorithmic layers. For the up/down-samplers in our Sketched LSPD, we simply choose the bilinear upsample and downsample functions in PyTorch. When called, the up-sampler enlarges the input image by a factor of 4 in pixel count, while the down-sampler shrinks it by the same factor. While the full forward operator is defined on the fine grid, the sketched operator is defined on the coarse grid and hence requires only half of the computation in this setting. We use the coarse-to-fine strategy for SkLSPD, where we sketch the first 8 layers but leave the last 4 layers unsketched. We also implement and test SkLSPD with a light-weight dual subnetwork (corresponding to a proximal operator of a weighted loss with learnable weights; see SkLSPD-LW in Section IV).

Fig. 3: Example for intermediate layer outputs for Low-dose CT on the test set of Mayo dataset. We can observe that LSPD/SkLSPD achieves competitive reconstruction quality with LPD across intermediate layers.

In addition, we also implement the Learned SGD [liu2021sgd] in our setting, which can be viewed as a simple special case of our LSPD network (see Section III-A). For LSGD we choose the same parameterization of the primal sub-networks as our LSPD (except that, by their original design, the LSGD subnetworks take only 1 input channel). To make a fair comparison, since LSGD does not have dual subnetworks, we allow LSGD to have 24 layers, such that the total number of learnable parameters is similar to our LSPD.

We present the performance of LPD, LSPD, and SkLSPD on the test set in Table I, and some illustrative examples in Figure 2 for a visual comparison. We also present the results of the classical Filtered-Backprojection (FBP) algorithm, which is widely used in clinical practice. We can observe from the FBP baseline that, due to the challenging extreme low-dose setting, the FBP reconstruction fails completely. This can be partially addressed by U-Net postprocessing (FBPConvNet, [jin2017deep]), whose parameter size is one order of magnitude larger than that of our unrolling networks. Next we turn to the learned reconstruction results. From the numerical results, we find that our LSPD and SkLSPD networks both achieve almost the same reconstruction accuracy as the LPD baseline in terms of PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index, [wang2004image]) measures, while requiring only a fraction of the computation of the forward and adjoint operators. In terms of run time on GPU, our acceleration delivers a considerable reduction compared to the full-batch LPD.
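For reference, the PSNR figures reported in our tables follow the standard definition; a minimal NumPy sketch, with `data_range` as the assumed peak intensity:

```python
import numpy as np

def psnr(x, ref, data_range=1.0):
    """Peak signal-to-noise ratio of reconstruction x against ground truth ref."""
    mse = np.mean((np.asarray(x) - np.asarray(ref)) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

# A constant error of 0.1 on a unit-range image gives roughly 20 dB.
```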

In Table II, we present additional results on another widely applied modality in clinical practice – sparse-view CT, where we take a smaller number of normal-dose measurements. Here we use again the ODL toolbox to simulate fan-beam projection data with 200 equally-spaced angles of view (each view includes 1200 rays). The fan-beam CT measurements are corrupted with Poisson noise at a normal-dose intensity level. Unlike low-dose CT, the main challenge of sparse-view CT is the ill-posedness of the inverse problem: the measurement operator is highly under-determined, with a non-trivial null-space. Meanwhile, we also present in the Appendix our numerical results on instance-adaptation for out-of-distribution reconstruction in sparse-view CT, demonstrating again the effectiveness of our approach.
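The Poisson corruption can be sketched with the standard pre-log photon-count model; the 200x1200 geometry matches the text, but the source intensity `I0` and the clean sinogram below are illustrative stand-ins, not the paper's exact simulation settings:

```python
import numpy as np

rng = np.random.default_rng(0)
I0 = 1e5                                            # photons per detector bin (stand-in)
sinogram = rng.uniform(0.0, 3.0, size=(200, 1200))  # stand-in clean line integrals A x

# Pre-log Poisson model: counts ~ Poisson(I0 * exp(-A x)),
# then convert the noisy counts back to line integrals.
counts = rng.poisson(I0 * np.exp(-sinogram))
noisy_sinogram = -np.log(np.maximum(counts, 1) / I0)
```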

Method              | Calls on forward/adjoint | PSNR    | SSIM   | GPU time (s), 1 pass of test set
FBP                 | -                        | 22.0299 | 0.2713 | -
LPD (12 layers)     | 24                       | 36.9198 | 0.9129 | 28.018
SkLSPD (12 layers)  | 4                        | 36.6359 | 0.9178 | 17.340

TABLE II: Sparse-view CT testing results for LPD and SkLSPD networks on the Mayo dataset, with supervised training

VII Conclusion

In this work we proposed a new paradigm for accelerating deep unrolling networks, with LSPD and SkLSPD as key instances. Our generic framework leverages the spirit of stochastic optimization and dimensionality reduction in the design of deep unrolling schemes, for computational and memory efficiency in solving large-scale imaging inverse problems. We provided a theoretical analysis of the proposed framework, with estimation guarantees derived from the viewpoint of stochastic optimization theory. We then provided a numerical study of the proposed schemes in the context of X-ray CT image reconstruction, demonstrating the effectiveness of our acceleration framework for deep unrolling networks.

Moreover, it is worth noting that, not long after our preliminary report [tang2021stochastic] on the LSPD network appeared on arXiv in 2021, Bajic et al. [bajic20223d] successfully extended our LSPD scheme to huge-scale 3D helical CT reconstruction, where they show that the subset approximation in LSPD can lead to a significant reduction in the memory requirement during training, which is crucial in huge-scale applications. This again suggests the great potential and practicality of our acceleration schemes.

-A Instance-adaptation for out-of-distribution data

In this appendix we present our out-of-distribution reconstruction framework for unrolling networks and demonstrate the effectiveness of our LSPD network in this setting. We apply the already trained LPD and LSPD models to a noisier sparse-view CT task, with a weaker measurement strength than the level at which the networks were trained, and test the performance of the models in instance-adaptation. In Figure 4 we present two slices as illustrative examples. We denote by LPD-init and LSPD-init the results of the models trained in the previous section, which were aimed at a smaller noise level. Due to this mismatch, we can clearly observe a significant degradation in both reconstructions. Then we run 30 iterations of Adam [kingma2014adam] on the instance-adaptation objective (23) for the input examples. We plot the convergence curves of PSNR in Figure 5, showing that our proposed LSPD network can also excel in instance-adaptation tasks. Regarding the convergence rate in terms of the number of iterations, the LPD network has a faster initial rate, but slows down in later iterations and is caught up by the curve for LSPD, and LSPD reaches a better final recovery result. Note that each iteration of LPD requires much more computation than one of LSPD; hence, for a clearer demonstration of the benefit of our LSPD network, we also plot the PSNR against the number of passes over the input data (the number of calls to the forward and adjoint operators).
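The instance-adaptation procedure can be sketched as fine-tuning a pre-trained network on a single measurement by minimizing a data-fidelity objective with Adam; here `net`, `A`, and `y` are illustrative stand-ins, not the paper's actual network, CT operator, or objective (23):

```python
import torch

torch.manual_seed(0)

A = torch.randn(64, 128)        # stand-in forward operator
y = A @ torch.randn(128)        # the single out-of-distribution measurement

net = torch.nn.Sequential(      # stand-in for a trained reconstruction network
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 128))

with torch.no_grad():           # data misfit before adaptation
    loss0 = 0.5 * torch.sum((A @ net(y) - y) ** 2)

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(30):             # 30 Adam iterations, as in the text
    opt.zero_grad()
    loss = 0.5 * torch.sum((A @ net(y) - y) ** 2)
    loss.backward()
    opt.step()
```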

Meanwhile, we also evaluate the instance-adaptation performance of both networks in the presence of model mismatch (Figures 6, 7), where the testing data are obtained from a different scanner geometry than the training data, with a doubled X-ray source distance. We again observe a much improved performance, in terms of both convergence rate and recovery accuracy, for our proposed LSPD network over the classical LPD network.

Fig. 4: Results for instance-adaptation for out-of-distribution reconstruction, with noise level mismatch
Fig. 5: Results for instance-adaptation of out-of-distribution reconstruction, with noise level mismatch (Row 1 for example 1)
Fig. 6: Results for instance-adaptation for out-of-distribution reconstruction, with model mismatch
Fig. 7: Results for instance-adaptation of out-of-distribution reconstruction, with model mismatch (Row 1 for example 1)

-B Proof of Theorem 3.1

In this proof we utilize several projection identities from [oymak2017sharp]; we list them here first for completeness. The first is the cone-projection identity: for a closed cone $\mathcal{C}$,

$\|\mathcal{P}_{\mathcal{C}}(a)\|_2 = \sup_{v \in \mathcal{C}\cap\mathbb{B}^n} \langle v, a\rangle,$

where $\mathbb{B}^n$ denotes the unit ball in $\mathbb{R}^n$. The second is the shift projection identity, which follows from the fact that the Euclidean distance is preserved under translation:

$\mathcal{P}_{\mathcal{K}+z}(a+z) = \mathcal{P}_{\mathcal{K}}(a) + z.$
Now we can apply the third projection identity, which is an important result from geometric functional analysis [oymak2017sharp, Lemma 18]:




where the projection is taken onto a potentially non-convex closed set contained in the cone. On the other hand, utilizing a simplified result of [pmlr-v97-qian19b] with partition minibatches, we have:


Then, for the case of noisy measurements, following a similar procedure we have:

For SkLSPD we further assume that the approximation errors of the forward and adjoint operators are bounded:


For each layer of SkLSPD-LW we have the following:

By Moreau's identity we have:


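For reference, Moreau's identity in its general scaled form (for a proper, convex, lower semicontinuous $f$ with convex conjugate $f^*$ and step size $\tau > 0$) reads:

```latex
x = \mathrm{prox}_{\tau f}(x) + \tau\,\mathrm{prox}_{\tau^{-1} f^{*}}\!\left(x/\tau\right).
```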
Now, according to the definition of the proximal operator, we have


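For completeness, recall the standard definition of the proximal operator of a function $f$:

```latex
\mathrm{prox}_{f}(x) = \operatorname*{arg\,min}_{z} \; f(z) + \tfrac{1}{2}\|z - x\|_2^2.
```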
For the simplified LSPD, we choose a weighted least-squares data fit:


where the weighting matrix contains the trainable weights.

Then we denote the following diagonal matrices:




and we have:


Then we denote the following quantities to simplify the notation:


which is bounded for finite values and shrinks to 0 for a large enough choice of the step-size ratio, and:


where we used the second (shift) projection identity. Then, by applying the first and third identities, we can continue: