I. Introduction
Stochastic first-order optimization methods have become the de facto techniques in modern data science and machine learning, with exceptionally wide applications
[kingma2014adam, johnson2013accelerating, allen2017katyusha, chambolle2018stochastic], due to their remarkable scalability to the size of the optimization problems. The underlying optimization tasks in many applications nowadays are large-scale and high-dimensional by nature, as a consequence of big data and overparameterized models (for example, deep neural networks).
While well-designed optimization algorithms enable efficient machine learning, one can, conversely, utilize machine learning to develop problem-adapted optimization algorithms via the so-called learning-to-learn philosophy [andrychowicz2016learning, li2019learning]. Traditionally, optimization algorithms are designed by hand, with human-designed rules for computing gradient estimates, step sizes, etc., for some general class of problems. Although the traditional field of optimization has already obtained lower-bound-matching (i.e., optimal) algorithms
[woodworth2016tight, lan2012optimal, lan2015optimal] for many important general classes of problems, for specific instances there can still be much room for improvement. For example, a classical way of solving imaging inverse problems is to minimize a total-variation-regularized least-squares objective [chambolle2016introduction] with a specific measurement operator – a very narrow subclass of the general class of smooth convex programs for which these optimal algorithms were developed. To obtain an optimal algorithm adapted to a specific instance of a class, hand-crafted mathematical design can be totally inadequate, and very often we do not even know a tight lower bound for it.

One of the highly active areas in modern data science is computational imaging (also recognized as low-level computer vision), especially medical imaging, including X-ray computed tomography (CT) [buzug2011computed], magnetic resonance imaging (MRI) [vlaardingerbroek2013magnetic], and positron emission tomography (PET) [ollinger1997positron]. In such applications, clinicians seek to infer images of a patient's inner body from noisy measurements collected by the imaging devices. Traditionally, dimensionality-reduction methods such as stochastic approximation [robbins1951stochastic] and sketching schemes [2015_Pilanci_Randomized, drineas2011faster, avron2013sketching] have been widely applied to large-scale imaging problems due to their scalability [kim2015combining, sun2019online, pmlrv70tang17a]. Inspired by their successes, in this work we focus on developing efficient learned mini-batch and sketched algorithms tailored to solving imaging inverse problems.

I-A. Contributions of this work
In this work, we make four main contributions:

Novel deep unrolling networks – leveraging the machinery of modern stochastic optimization and dimensionality reduction
For the first time, we propose a class of deep unrolling networks based on the principles of modern stochastic first-order optimization and dimensionality reduction, for efficiently solving imaging inverse problems. We develop the Learned Stochastic Primal-Dual (LSPD) network and its accelerated variant, Sketched LSPD (SkLSPD), which is further empowered by a sketching approximation of matrix-vector products [2015_Pilanci_Randomized, 2016_Pilanci_Iterative, pmlrv70tang17a]. Our proposed networks can be viewed as mini-batch and sketched extensions of the state-of-the-art unrolling network – the Learned Primal-Dual (LPD) network of [adler2018learned]. Since LPD is a very generic framework which includes most existing unrolling schemes as special cases, our acceleration schemes extend and apply to many other deep unrolling networks, such as ISTA-Net [zhang2018ista], ADMM-Net [sun2016deep], and FISTA-Net [xiang2021fista].

Theoretical analysis of the LSPD/SkLSPD networks
We provide a theoretical analysis of a simple instance of our LSPD and SkLSPD networks, from the viewpoint of stochastic nonconvex composite optimization. We derive upper and lower bounds on the estimation error under standard assumptions, suggesting that our proposed networks have the potential to achieve estimation accuracy similar to their full-batch counterpart.

Less is more – the numerical effectiveness of LSPD and SkLSPD in tomographic medical imaging
We numerically evaluate the performance of our proposed networks on two typical tomographic medical imaging tasks – low-dose and sparse-view X-ray CT. We compare our LSPD and SkLSPD with the full-batch LPD. We find that our networks achieve reconstruction accuracy competitive with LPD, while requiring only a fraction of its computation as a free lunch, in both supervised and unsupervised training settings.

Out of training distribution? Get boosted! – Efficient instance-adaptation in out-of-distribution reconstruction tasks
In computational imaging practice, we often encounter scenarios where the measurement data are observed under a slightly different noise distribution and/or modality than the data used for training the unrolling network. We propose an instance-adaptation framework to fine-tune the unrolling network and adjust it to the out-of-distribution data. Owing to the efficiency of our scheme, we find numerically that our approach provides superior performance on instance-adaptation tasks for out-of-distribution reconstruction.
II. Background
II-A. Imaging Inverse Problems
In imaging, the measurement system can generally be expressed as:
(1) $y = Ax^\dagger + w$,

where $x^\dagger$ denotes the ground-truth image (vectorized), $A$ denotes the forward measurement operator, $w$ the measurement noise, and $y$ the measurement data. A classical way to obtain a reasonably good estimate of $x^\dagger$ is to solve a composite optimization problem:

(2) $x^\star \in \arg\min_{x} \; f(Ax) + g(x)$,

where the data-fidelity term $f$ is typically a convex function (one example being the least-squares loss $f(Ax) = \frac{1}{2}\|Ax - y\|_2^2$), while $g$ is a regularization term, for example the total-variation (TV) seminorm or a learned regularizer [mukherjee2021end, mukherjee2020learned]. A classical way of solving the composite optimization problem (2) is via proximal gradient methods [chambolle2016introduction], which are based on iterating a gradient-descent step on $f$, a proximal step on $g$, and a momentum step using previous iterates for fast convergence [beck2009fast, nesterov2007gradient].
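As a concrete illustration of this iteration, the following minimal NumPy sketch runs proximal gradient descent on a toy least-squares problem with an $\ell_1$ regularizer (a simple stand-in for TV or a learned regularizer); the operator, data, and parameter choices are illustrative assumptions, not the paper's CT setup:

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient(A, y, lam, n_iters=200):
    # minimize 0.5 * ||A x - y||^2 + lam * ||x||_1
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1/L with L = ||A||^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)                # gradient step on the data fit
        x = soft_threshold(x - step * grad, step * lam)  # proximal step on g
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))
x_true = np.zeros(20)
x_true[:3] = [1.0, -2.0, 1.5]
y = A @ x_true
x_hat = proximal_gradient(A, y, lam=0.1)
print(np.linalg.norm(x_hat - x_true))
```

Adding a momentum (extrapolation) step on the iterates turns this into the accelerated FISTA variant cited above.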
Since modern imaging problems are often huge-scale, deterministic methods can be computationally costly, as they must apply the full forward and adjoint operators in each iteration. For scalability, stochastic gradient methods [robbins1951stochastic] and ordered-subset methods [erdogan1999ordered, kim2015combining] are widely applied in real-world iterative reconstruction. More recent advanced stochastic variance-reduced gradient methods [xiao2014proximal, defazio2014saga, allen2017katyusha, tang2018rest, driggs2020spring] have also been adopted in suitable scenarios in imaging for better efficiency [tang2019limitation, tang2020practicality, karimi2016hybrid].

More recently, deep learning approaches have been adopted for imaging inverse problems, starting from the work of [jin2017deep] on the FBPConvNet approach for tomographic reconstruction and DnCNN [zhang2017beyond] for image denoising. Remarkably, the learned primal-dual (LPD) network [adler2018learned], which mimics the update rule of a primal-dual gradient method and utilizes the forward operator and its adjoint within a deep convolutional network, achieves state-of-the-art results and outperforms primal-only unrolling approaches. Despite the excellent performance, the computation of the learned primal-dual method is significantly larger than that of direct approaches such as FBPConvNet.

II-B. Deep unrolling
We now present the motivation for our unrolling network, starting from the basic primal-dual gradient-based optimization algorithm. It is well known that, if the loss function $f$ is convex and lower semicontinuous, we can reformulate the original objective (2) as a saddle-point problem:

(3) $\min_{x} \max_{h} \; \langle Ax, h\rangle - f^*(h) + g(x)$,

where $f^*$ is the Fenchel conjugate of $f$:

(4) $f^*(h) := \sup_{z} \; \langle h, z\rangle - f(z)$.

The saddle-point problem (3) can be efficiently solved by the primal-dual hybrid gradient (PDHG) method [chambolle2011first], also known as the Chambolle-Pock algorithm in the optimization literature. The PDHG method for solving the saddle-point problem obeys the following update rule:
Primal-Dual Hybrid Gradient (PDHG)
$h_{k+1} = \mathrm{prox}_{\sigma f^*}(h_k + \sigma A \bar{x}_k)$
$x_{k+1} = \mathrm{prox}_{\tau g}(x_k - \tau A^T h_{k+1})$
$\bar{x}_{k+1} = x_{k+1} + \theta(x_{k+1} - x_k)$
The PDHG algorithm alternately performs gradient-based updates on the primal variable $x$ and the dual variable $h$. In practice, it is often more desirable to reformulate the primal problem (2) in the primal-dual form (3), especially when the loss function is nonsmooth (or when the Lipschitz constant of its gradient is large).
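The alternating updates above can be sketched in a few lines of NumPy for a least-squares fidelity (whose conjugate prox is a simple shrinkage) and an $\ell_1$ regularizer; the toy problem and step sizes are illustrative assumptions:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pdhg(A, y, lam, n_iters=500):
    # minimize 0.5*||A x - y||^2 + lam*||x||_1 via PDHG
    L = np.linalg.norm(A, 2)
    sigma = tau = 0.9 / L                # step sizes with sigma*tau*||A||^2 < 1
    x = np.zeros(A.shape[1])
    x_bar = x.copy()
    h = np.zeros(A.shape[0])
    for _ in range(n_iters):
        # dual step: prox of sigma*f* for f(z) = 0.5*||z - y||^2
        h = (h + sigma * (A @ x_bar - y)) / (1.0 + sigma)
        # primal step: prox of tau*lam*||.||_1 (soft-thresholding)
        x_new = soft_threshold(x - tau * (A.T @ h), tau * lam)
        x_bar = 2.0 * x_new - x          # extrapolation step (theta = 1)
        x = x_new
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 20))
x_true = np.zeros(20)
x_true[:3] = [1.0, -2.0, 1.5]
y = A @ x_true
x_hat = pdhg(A, y, lam=0.1)
print(np.linalg.norm(x_hat - x_true))
```

The unrolling networks discussed next replace the two proximal operators in this loop with trained subnetworks.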
Currently, the most successful deep networks in imaging are arguably the unrolling schemes [gregor2010learning] inspired by gradient-based optimization algorithms that leverage knowledge of the physical models. The state-of-the-art unrolling scheme – the learned primal-dual network of [adler2018learned] – is based on unfolding the iterations of PDHG, replacing the proximal operators with multilayer convolutional neural networks, with their own sets of parameters, applied in both the primal and dual spaces. The step sizes at each step are also set to be trainable. The learned primal-dual network with $K$ iterations can be written as follows, where the learnable parameters are the subnetwork weights across all layers:

Learned Primal-Dual (LPD)
$h_{k+1} = \Gamma_{\theta_k^d}(h_k, A x_k, y)$
$x_{k+1} = \Lambda_{\theta_k^p}(x_k, A^T h_{k+1})$
When the primal and dual CNNs are kept fixed across the layers of LPD, it has the potential to learn both the data-fidelity and the regularizer (albeit one might need additional constraints on the CNNs to ensure that they are valid proximal operators). This makes the LPD parametrization more powerful than a learned proximal-gradient network (with only a primal CNN), which can only learn the regularization functional. The capability of learning the data-fidelity term is particularly useful when the noise distribution is unknown and one does not have a clear analytical choice for the fidelity term.
We choose LPD as our backbone because it is a very generic framework which covers existing gradient-based unrolling schemes and plug-and-play algorithms as special cases. For instance, if we choose the dual subnetworks of LPD to be a simple subtraction, we recover Learned ISTA/FISTA. If we further choose the primal subnetwork to be a pretrained CNN denoiser, we recover the plug-and-play ISTA/FISTA algorithm. As such, by studying the acceleration of LPD, we cover standard primal unrolling and plug-and-play gradient methods as special cases, and all the acceleration schemes and theory proposed in this work apply to these simplified cases.
III. Accelerating Deep Unrolling Schemes via Subset Approximation and Operator Sketching
In this section, we present our two schemes for accelerating deep unrolling networks.
III-A. Acceleration via Subset Approximation
In our new approach, we propose to replace the forward and adjoint operators in the full-batch LPD network of [adler2018learned] with subsets of them. The proposed network can be viewed as an unrolled version of stochastic PDHG [chambolle2018stochastic] (but with ordered subsets and without variance reduction). We partition the forward and adjoint operators into $m$ subsets, along with the corresponding measurement data. In each layer, we use only one of the subsets, in cycling order. Let $\{S_i\}_{i=1}^m$ be the set of subsampling operators; then the saddle-point problem (3) can be rewritten as:
(5) $\min_{x} \max_{h_1, \dots, h_m} \; \sum_{i=1}^m \big[\langle S_i A x, h_i\rangle - f_i^*(h_i)\big] + g(x)$.
Utilizing this finite-sum structure, our learned stochastic primal-dual (LSPD) network can be described as follows. (Alternatively, one may also consider an optional learned momentum acceleration by keeping in memory the outputs of a number of previous layers, at the cost of extra computation and memory; in that case the input channel width of the subnetworks grows accordingly.)
Learned Stochastic Primal-Dual (LSPD)
In scenarios where the forward operator dominates the computation of the unrolling network, for the same number of layers our LSPD network is approximately $m$ times more efficient than the full-batch LPD network in terms of computational complexity. The LSPD presented here describes a framework of deep-learning-based methods, depending on the parameterization of the primal and dual subnetworks and how they are trained. In practice, the LPD and LSPD networks usually achieve their best performance when trained completely end-to-end. While most recommended in practice, end-to-end training makes it almost impossible to provide nontrivial theoretical guarantees. An alternative approach is to restrict the subnetworks to be the same across layers and train the subnetwork to perform denoising [kamilov2017plug, sun2019online, tang2020fast, ono2017primal], artifact removal [liu2020rare], or approximate projection onto an image manifold [rick2017one], leading to a plug-and-play [venkatakrishnan2013plug, romano2017little, reehorst2018regularization] type of approach with theoretical convergence guarantees.
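To make the subset cycling concrete, here is a skeleton of one LSPD forward pass in plain NumPy, with the learned primal/dual subnetworks replaced by hypothetical prox-like placeholder functions; in the actual network these would be trained CNNs, so this is a structural sketch only:

```python
import numpy as np

def lspd_forward(A_subsets, y_subsets, x0, n_layers,
                 dual_net, primal_net, sigma=0.5, tau=0.5):
    """Skeleton of an LSPD forward pass: each layer touches only one
    subset of the forward/adjoint operator, in cycling order."""
    m = len(A_subsets)
    x = x0.copy()
    h = [np.zeros(Ai.shape[0]) for Ai in A_subsets]   # per-subset dual variables
    for k in range(n_layers):
        i = k % m                                     # cycling subset index
        Ai, yi = A_subsets[i], y_subsets[i]
        h[i] = dual_net(h[i] + sigma * (Ai @ x - yi))    # dual "subnetwork"
        x = primal_net(x - tau * Ai.T @ h[i])            # primal "subnetwork"
    return x

# toy demo: prox-like placeholders stand in for the trained CNNs
rng = np.random.default_rng(2)
A = rng.standard_normal((60, 20)) / np.sqrt(60.0)
x_true = rng.standard_normal(20)
y = A @ x_true
A_subsets, y_subsets = np.split(A, 3), np.split(y, 3)
x_hat = lspd_forward(A_subsets, y_subsets, np.zeros(20), 30,
                     dual_net=lambda v: v / 1.5,   # prox of sigma*f* for an LSQ fit
                     primal_net=lambda v: v)       # identity "prox" (no regularizer)
print(np.linalg.norm(x_hat - x_true))
```

With these placeholder choices the pass reduces to an ordered-subset primal-dual iteration; each layer costs only one subset's worth of forward/adjoint computation.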
Note that our LSPD network covers the SGD-Net of [liu2021sgd] as a special case, obtained by setting the dual network to be a simple subtraction and limiting the primal network to a single input channel that takes in the stochastic gradient descent step with a fixed scalar primal step size. We refer to this type of network as Learned SGD (LSGD) in this paper, which is a stochastic variant of ISTA-Net [zhang2018ista].
III-B. Double Acceleration via Operator Sketching
We are now ready to present our sketched LPD and sketched LSPD networks. Our main idea is to cheaply approximate the products $Ax$ and $A^T h$:

(6) $Ax \approx A_s D x$,

where $D$ is the sketching/downsampling operator, which can potentially be trained with respect to its parameters, while $A_s$ is the sketched forward operator discretized on the reduced low-dimensional image space; for the dual step we have an upsampling operator $U$, which can also be trained. In practice, we found that it actually suffices to use the simplest off-the-shelf up/downsampling operators in PyTorch, for example bilinear interpolation, to deliver excellent performance for the sketched unrolling networks. Our Sketched LPD network is written as:
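The sketching idea can be illustrated with a simple average-pooling downsampler and its matching upsampler; composing a toy dense operator with the upsampler is one hypothetical way to build a coarse-grid operator (the paper instead discretizes the physical operator on the coarse grid), and for smooth images the sketched product closely approximates the full one:

```python
import numpy as np

def downsample(x, n):
    # 2x2 average pooling on an n*n image stored as a flat vector
    return x.reshape(n // 2, 2, n // 2, 2).mean(axis=(1, 3)).ravel()

def upsample(x, n):
    # nearest-neighbour upsampling back to the n*n grid
    im = x.reshape(n // 2, n // 2)
    return np.repeat(np.repeat(im, 2, axis=0), 2, axis=1).ravel()

n = 16
rng = np.random.default_rng(3)
A = rng.standard_normal((32, n * n)) / n           # toy dense "forward operator"
# hypothetical coarse-grid operator: compose A with the upsampler
U = np.stack([upsample(e, n) for e in np.eye((n // 2) ** 2)], axis=1)
A_s = A @ U                                        # sketched operator (quarter width)

# smooth test image: a low-frequency Gaussian bump
ii, jj = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
x = np.exp(-((ii - 8.0) ** 2 + (jj - 8.0) ** 2) / 40.0).ravel()

full = A @ x
sketched = A_s @ downsample(x, n)                  # approximate product A x
rel_err = np.linalg.norm(full - sketched) / np.linalg.norm(full)
print(rel_err)
```

The sketched product works on a vector one quarter of the size, which is where the computational saving comes from.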
Again, we can use the same approximation for the stochastic gradient steps:

(7) $S_i A x \approx S_i A_s D x$,
and hence we can write our Sketched LSPD (SkLSPD) network as:
or alternatively:
III-B.1 Remark on the varying "coarse-to-fine" sketch size for SkLPD and SkLSPD
Our numerical experience suggests using a more aggressive sketch at the early layers for efficiency, and a conservative sketch (or no sketch) at later layers for accuracy. One choice we found numerically quite successful is: for the last few unrolling layers of SkLPD and SkLSPD, we switch to the usual LPD/LSPD (say, if the number of unrolling layers is 20, we can leave the last 4 unrolling layers unsketched), so that the reconstruction accuracy is best preserved.
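A coarse-to-fine schedule of this kind is easy to encode; the helper below (name and defaults hypothetical) returns a per-layer downsampling factor, sketching aggressively early and leaving the last few layers unsketched:

```python
def coarse_to_fine_schedule(n_layers, n_unsketched_tail=4, factor=2):
    """Per-layer downsampling factor: aggressive sketching (factor > 1)
    for the early layers, full resolution (factor 1) for the last few."""
    return [factor if k < n_layers - n_unsketched_tail else 1
            for k in range(n_layers)]

print(coarse_to_fine_schedule(12))   # [2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1]
```

For a 12-layer network this matches the configuration used in the experiments: the first 8 layers sketched, the last 4 at full resolution.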
III-B.2 Remark on Option 2 for further improving efficiency
The second option of our SkLPD and SkLSPD further accelerates computation compared to Option 1, by making the primal subnetwork take the low-dimensional images and gradients as input and then upscale. Since the usual choice for the up- and downsamplers is simply an off-the-shelf interpolation algorithm such as bilinear or bicubic interpolation, which can be computed very efficiently, in practice we found Option 2 often computationally preferable when using the coarse-to-fine sketch size. Numerically, we found that SkLPD and SkLSPD with Option 2 and the coarse-to-fine sketch size can be trained faster and are more efficient at test time, due to the further reduction in the computation of the primal subnetwork, without loss of reconstruction accuracy compared to Option 1.
IV. Theoretical Analysis of the LSPD Network
In this section we provide a theoretical recovery analysis of a subclass of the SkLSPD framework presented in the previous section. With this motivational analysis, we aim to demonstrate the reconstruction guarantee of a lightweight, recurrent version of SkLSPD, and compare it with the recovery guarantee of LPD derived under the same setting.
We analyze a lightweight variant of SkLSPD (SkLSPD-LW), where we set the dual subnetworks to be the proximal operator of a weighted least-squares loss:

(8) $f_W(z) = \frac{1}{2}(z - y)^T W (z - y)$,

where $W$ is a diagonal matrix of trainable weights; meanwhile, we set the primal subnetworks to share the same weights, trained separately as an approximate projection operator onto some image manifold $\mathcal{M}$. A typical example of this type of construction of an unrolling network can be found in [rick2017one].
If we choose the simplified case $W = I$, which means we use the usual least-squares loss for the data fit, the following Sketched Learned-SGD (SkLSGD) network can be derived. We can easily observe that this simplified version can be viewed as a learned proximal SGD with dimensionality reduction.
Notably, we choose to study simplified parameterizations of the dual subnetwork to make the error analysis feasible. Our ongoing work includes the analysis and understanding of learning the dual subnetwork, which is effectively the learning of an adaptive loss.
IV-A. Generic Assumptions
We list here the assumptions we make in our motivational analysis of the simplified LSPD.
A. 1
(Approximate projection) We assume that the primal subnetwork of the simplified LSPD is an approximate projection onto a manifold $\mathcal{M}$:
(9) 
where:
(10) 
and,
(11) 
Here we model the primal subnetwork as a projection onto a manifold. Note that in practice the image manifold is typically a nonconvex set. We also impose the following conditions on the image manifold:
A. 2
(Interpolation) We assume the ground-truth image $x^\dagger \in \mathcal{M}$, where $\mathcal{M}$ is a closed set.
With this condition on the manifold, we further assume a restricted eigenvalue (restricted strong-convexity) condition, which is necessary for robust recovery
[negahban2012unified, agarwal2012fast, wainwright2014structured, oymak2017sharp]:
A. 3
(Restricted Eigenvalue Condition) We define the descent cone at a point $x \in \mathcal{M}$ as:
(12) 
and the restricted strong-convexity constant $\mu$ to be the largest positive constant satisfying:
(13) 
and the restricted smoothness constant $L$ to be the smallest positive constant satisfying:
(14) 
The restricted eigenvalue condition is standard and crucial for robust estimation guarantees in linear inverse problems; i.e., for a linear inverse problem to be nondegenerate, this type of condition must hold [oymak2017sharp, agarwal2012fast, 2012_Chandrasekaran_Convex]. For example, in the sparse-recovery setting, when the measurement operator is a Gaussian map (compressed-sensing measurements) and the ground truth is sparse, one can derive a sharp lower bound on the restricted strong-convexity constant [oymak2017sharp]. In our setting we would expect an even better constant, since the manifold of certain classes of real-world images should have a much smaller covering number than the sparse set.
IV-B. Estimation error bounds of SkLSPD
With the assumptions presented in the previous subsection, we provide here the recovery guarantee of a $K$-layer SkLSPD-LW network on the linear inverse problem $y = Ax^\dagger + w$. Denoting by $\delta$ the smallest constant satisfying:
(15) 
we have the following result: (Upper bound) Assuming A.1–3, with appropriate choices of step sizes, the output of a $K$-layer simplified LSPD network has the following guarantee for the estimation of $x^\dagger$:
(16) 
where the contraction factor takes one value when the manifold $\mathcal{M}$ is convex and a slightly larger value when $\mathcal{M}$ is nonconvex, and:
(17) 
In Theorem IV.B we show exponential convergence of the estimation error up to a statistical accuracy. Note that this theorem also covers LSPD as a special case. For the case where we only learn the primal subnetwork while choosing $W = I$, we can simplify the above upper bound, as we recover SkLSGD:
Corollary IV.1
Assuming A.1–3, the output of a $K$-layer SkLSGD network has the following guarantee for the estimation of $x^\dagger$:
(18) 
where we denote by $\mathcal{B}$ and $\mathcal{S}$ the unit ball and unit sphere around the origin, respectively, and:
(19) 
The proof of this corollary is a simplified version of the proof of Theorem 1; hence we do not repeat it in the appendix.
When the restricted eigenvalue $\mu$ is large enough, the simplified LSPD has linear convergence of the estimation error, up to an error floor depending only on the approximation accuracy of the primal subnetwork as a projection. For many inverse problems, for example CT/PET tomographic imaging, the same convergence rate as in Theorem IV.B applies to both LSPD/SkLSPD and LPD. This suggests substantial potential for computational savings of LSPD over the full-batch LPD.
On the other hand, using a similar technique, we can provide a complementary lower bound for the estimation error of SkLSGD: (Lower bound) Under the same conditions as Theorem IV.B, if we further assume the constraint set is convex, the estimation error of the output of SkLSGD satisfies the lower bound:
(20) 
We present the proof of this result in the appendix.
V. Training of LSPD and SkLSPD
V-A. Supervised end-to-end training
The most basic training approach for LSPD/SkLSPD is end-to-end supervised training, where we consider fully paired training samples of measurements and ground truth – the latter typically obtained via reconstruction from high-accuracy, abundant measurements. We take the initialization of LSPD/SkLSPD to be a filtered backprojection. Let $\theta$ be the set of learnable parameters; applying the LSPD/SkLSPD network to a measurement $y$ can be written as $\mathcal{N}_\theta(y)$, and the training objective can typically be written as:

(21) $\min_\theta \; \frac{1}{n}\sum_{i=1}^n \big\|\mathcal{N}_\theta(y_i) - x_i^\dagger\big\|_2^2$,

where $n$ denotes the number of paired training examples. Since paired high-quality ground-truth data are expensive and often unrealistic in practice, we would prefer more advanced training methods with relaxed requirements on available training data. In the appendix, we present unsupervised training results for the unrolling networks using only measurement data.
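In pseudocode terms, the supervised objective is just an empirical mean-squared error over paired samples; the sketch below evaluates it for a generic reconstruction map (here a hypothetical linear pseudo-inverse standing in for the trained unrolling network):

```python
import numpy as np

def supervised_loss(recon, ys, x_daggers):
    # empirical mean-squared error between reconstructions and ground truths
    return np.mean([np.sum((recon(y) - x) ** 2)
                    for y, x in zip(ys, x_daggers)])

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 10))
xs = [rng.standard_normal(10) for _ in range(5)]
ys = [A @ x for x in xs]
pinv = np.linalg.pinv(A)          # stand-in "network": the pseudo-inverse
loss = supervised_loss(lambda y: pinv @ y, ys, xs)
print(loss)                       # near zero for this noiseless toy problem
```

In actual training, `recon` would be the LSPD/SkLSPD network and the loss would be minimized over its parameters with Adam.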
V-B. Fine-tuning by instance-adaptation for out-of-distribution reconstruction
In medical imaging practice, we may encounter scenarios where the input data come from a slightly different measurement modality or measurement-noise distribution than the training data used for the unrolling network. In such a setting, instead of retraining the whole network, it is more desirable to fine-tune the network to adapt itself to the input data [yu2021empirical, adler2021task]. This process is known as instance-adaptation [vaksman2020lidia, tachella2020neural], which has been applied to boosting denoising networks. Let $\mathcal{G}$ be a group of transformations (for example, rotations), and assume that the measurement operator is equivariant with respect to this group:
(22) 
Such a condition holds for CT/PET if we consider rotations [celledoni2021equivariant]. Our post-processing method can be described as follows: take a pretrained LSPD network, apply it to the input data $y$, and run a few steps of the Adam optimizer on the following self-supervised objective (initialized with the pretrained weights):
(23)  
where we adapt the equivariant regularization term proposed in [chen2021equivariant] to our setting for data augmentation, and then reconstruct the underlying image with the fine-tuned network. Note that this post-processing is not recommended for classic full-batch unrolling networks, since it requires a number of calls to the full forward and adjoint operators, which is computationally inefficient. For our LSPD/SkLSPD networks, this limitation is significantly mitigated since we only use subsets of these operators. We present numerical results for this optional step on sparse-view CT in the appendix.
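The structure of such a self-supervised fine-tuning objective – a measurement-consistency term plus an equivariance term – can be sketched as follows; the reconstructor, the transformation (a 90° rotation), and the weight `alpha` are illustrative placeholders rather than the paper's exact formulation:

```python
import numpy as np

def adaptation_loss(recon, A_fun, y, n_img, alpha=1.0):
    """Self-supervised fine-tuning loss: measurement consistency
    plus an equivariance (rotation) regularization term."""
    x_hat = recon(y)
    data_fit = np.sum((A_fun(x_hat) - y) ** 2)        # ||A N(y) - y||^2
    # equivariance term: reconstructing the re-measured rotated estimate
    # should return the rotated estimate itself
    x_rot = np.rot90(x_hat.reshape(n_img, n_img)).ravel()
    equiv = np.sum((recon(A_fun(x_rot)) - x_rot) ** 2)
    return data_fit + alpha * equiv

# sanity check with A = identity and a perfect reconstructor: loss is exactly 0
n = 8
y = np.random.default_rng(5).standard_normal(n * n)
loss = adaptation_loss(lambda v: v, lambda v: v, y, n)
print(loss)   # 0.0
```

In the actual scheme, `recon` is the pretrained LSPD network and this loss is minimized with a few Adam steps over its parameters, with no ground truth needed.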
Table I:

Method | Calls on $A$ and $A^T$ | PSNR | SSIM | GPU time (s), 1 pass of test set
FBP | – | 14.3242 | 0.0663 | –
LPD (12 layers) | 24 | 35.3177 | 0.9065 | 48.348
LSGD (24 layers) | 12 | 31.5825 | 0.8528 | 33.089
LSPD (12 layers) | 6 | 35.0577 | 0.9014 | 31.196
SkLSPD (12 layers) | 4 | 34.9749 | 0.9028 | 23.996
SkLSPD (12 layers, lightweight dual step) | 4 | 34.6389 | 0.8939 | 19.843
VI. Numerical Results
In this section we present numerical results of our proposed networks for low-dose X-ray CT. In real-world clinical practice, low-dosage CT is widely used and highly recommended, since intense exposure to X-rays can significantly increase the risk of inducing cancer [mason2018quantitative]. Low-dose CT takes a large number of low-energy X-ray views, leading to huge volumes of noisy measurements, which makes reconstruction schemes struggle to achieve efficient and accurate estimation. In our X-ray CT experiments we use the standard Mayo Clinic dataset [mccollough2016tu], which contains 3D CT scans of 10 patients. We use 2111 slices (from 9 patients) of 2D images for training and 118 slices of the remaining patient for testing. We use the ODL toolbox [adler2018learned] to simulate fan-beam projection data with 800 equally spaced angles of view (each view including 800 rays). The fan-beam CT measurements are corrupted with Poisson noise under a low-dose choice of the source intensity. This simulates the noisy projection data according to the Beer-Lambert law; to linearize the measurements, we consider the log data.
In our LSPD and SkLSPD networks, we interleave-partition (according to the angles) the forward/adjoint operators and data into 4 subsets. Our networks have 12 layers (each layer of LSPD includes a primal and a dual subnetwork with 3 convolutional layers and 32 channels; same for LPD), hence corresponding to 3 data passes, which means only 3 calls in total to the full forward and adjoint operators. We compare with the learned primal-dual (LPD) network with 12 layers, corresponding to 12 calls to the forward and adjoint operators. We train all the networks with 50 epochs of the Adam optimizer [kingma2014adam] with batch size 1, in the supervised manner. For our SkLSPD we choose Option 2 presented in Section III-B, with the coarse-to-fine sketch size. For all these networks, we choose the subnetworks to have 3 convolutional layers (with a skip connection between the first channel of the input and the output) and 32 channels, with kernel size 5, and we do not use the momentum option, for simplicity and memory efficiency. The starting point is set to be the standard filtered backprojection for all the unrolling networks. We set all of them to have 12 algorithmic layers. For the up/downsamplers in our sketched LSPD, we simply choose the bilinear upsampling and downsampling functions in PyTorch. When called, the upsampler makes the input image 4 times larger, while the downsampler makes it 4 times smaller. While the full forward operator is defined on the fine grid, the sketched operator is defined on the coarse grid and hence requires only about half of the computation in this setting. We use the coarse-to-fine strategy for SkLSPD, where we sketch the first 8 layers but leave the last 4 layers unsketched. We also implement and test SkLSPD with a lightweight dual subnetwork (corresponding to the proximal operator of a weighted least-squares loss with learnable weights; see SkLSPD-LW in Section IV).
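The angle-interleaved partition used here can be written in a few lines; the sketch below splits a set of projection-angle indices into `m` interleaved subsets (NumPy indexing, illustrative of the idea rather than the ODL implementation):

```python
import numpy as np

def interleave_partition(n_angles, m):
    """Split angle indices {0, ..., n_angles-1} into m interleaved subsets:
    subset i takes angles i, i+m, i+2m, ..."""
    return [np.arange(i, n_angles, m) for i in range(m)]

subsets = interleave_partition(800, 4)
print([len(s) for s in subsets])     # [200, 200, 200, 200]
assert sorted(np.concatenate(subsets)) == list(range(800))   # disjoint cover
```

Interleaving (rather than taking contiguous angular blocks) keeps each subset's views spread over the full angular range, so every layer sees a rough global view of the object.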
In addition, we implement Learned SGD [liu2021sgd] in our setting, which can be viewed as a simple special case of our LSPD network (see Section III-A). For LSGD we choose the same parameterization of the primal subnetworks as our LSPD (except that, by their original design, the LSGD subnetworks take only 1 input channel). To make a fair comparison, since LSGD does not have dual subnetworks, we allow LSGD to have 24 layers, so that the total number of learnable parameters is similar to our LSPD.
We present the performance of LPD, LSPD, and SkLSPD on the test set in Table I, and some illustrative examples in Figure 2 for visual comparison. We also present results for the classical filtered-backprojection (FBP) algorithm, which is widely used in clinical practice. We can observe from the FBP baseline that, due to the challenging extreme low-dose setting, the FBP reconstruction fails completely. This can be partially addressed by U-Net post-processing (FBPConvNet, [jin2017deep]), whose parameter count is one order of magnitude larger than our unrolling networks. Turning to the learned reconstruction results, we find that our LSPD and SkLSPD networks both achieve almost the same reconstruction accuracy as the LPD baseline in terms of PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index, [wang2004image]), while requiring only a fraction of the computation of the forward and adjoint operators. In terms of run time on GPU, our acceleration yields a substantial reduction compared to the full-batch LPD.

In Table II, we present additional results on another widely applied modality in clinical practice – sparse-view CT, where we take a smaller number of normal-dose measurements. Here we again use the ODL toolbox to simulate fan-beam projection data, with 200 equally spaced angles of view (each view including 1200 rays). The fan-beam CT measurements are corrupted with Poisson noise under a normal-dose choice of the source intensity. In contrast to low-dose CT, the main challenge of sparse-view CT is the ill-posedness of the inverse problem: the measurement operator is highly underdetermined, with a nontrivial nullspace. We also present in the appendix our numerical results on instance-adaptation for out-of-distribution reconstruction in sparse-view CT, demonstrating again the effectiveness of our approach.
Table II:

Method | Calls on $A$ and $A^T$ | PSNR | SSIM | GPU time (s), 1 pass of test set
FBP | – | 22.0299 | 0.2713 | –
LPD (12 layers) | 24 | 36.9198 | 0.9129 | 28.018
SkLSPD (12 layers) | 4 | 36.6359 | 0.9178 | 17.340
VII. Conclusion
In this work we proposed a new paradigm for accelerating deep unrolling networks, with LSPD and SkLSPD as key instances. Our generic framework leverages the spirit of stochastic optimization and dimensionality reduction in the design of deep unrolling schemes, for computational and memory efficiency in solving large-scale imaging inverse problems. We provided a theoretical analysis of the proposed framework, with estimation guarantees from the viewpoint of stochastic optimization theory. We then provided a numerical study of the proposed schemes in the context of X-ray CT image reconstruction, demonstrating the effectiveness of our acceleration framework for deep unrolling networks.
Moreover, it is worth noting that, not long after our preliminary report [tang2021stochastic] on the LSPD network appeared on arXiv in 2021, Bajic et al. [bajic20223d] successfully extended our LSPD scheme to huge-scale 3D helical CT reconstruction, where they show that the subset approximation in LSPD can lead to a significant reduction in the memory requirement during training, which is crucial in huge-scale applications. This again suggests the great potential and practicality of our acceleration schemes.
A. Instance-adaptation for out-of-distribution data
In this appendix we present our out-of-distribution reconstruction framework for unrolling networks and demonstrate the effectiveness of our LSPD network in this setting. We apply the already-trained LPD and LSPD models to a noisier sparse-view CT task, with a weaker measurement strength than the level at which the networks were trained, and test the performance of the models under instance-adaptation. In Figure 4 we present two slices as illustrative examples. We use LPD-init and LSPD-init to denote the results of the models trained in the previous section, which were aimed at a smaller noise level. Due to this mismatch, we can clearly observe significant degradation in both reconstructions. We then run 30 iterations of Adam [kingma2014adam] on the instance-adaptation objective (23) for the input examples. We plot the convergence curves in PSNR in Figure 5, showing that our proposed LSPD network also excels in instance-adaptation tasks. We observe that, in terms of the number of iterations, the LPD network has a faster initial convergence rate but slows down in later iterations and is caught up by the curve for LSPD, with LSPD reaching a better final recovery. Since each iteration of LPD requires much more computation than LSPD, for a clearer demonstration of the benefit of our LSPD network we also plot PSNR against the number of passes over the input data (the number of calls to $A$ and $A^T$).
Meanwhile, we also evaluate the instance-adaptation performance of both networks in the presence of model mismatch (Figures 6, 7), where the testing data are obtained from a different scanner geometry than the training data, with a doubled X-ray source distance. We again observe much improved performance in terms of both convergence rate and recovery accuracy for our proposed LSPD network over the classical LPD network.
B. Proof of Theorem 3.1
In this proof we utilize several projection identities from [oymak2017sharp]; we list them here first for completeness. The first is the cone-projection identity:
(24) 
where $\mathcal{B}$ denotes the unit ball. The second is the shifted-projection identity, reflecting that the Euclidean distance is preserved under translation:
(25) 
The third projection identity is an important result from geometric functional analysis [oymak2017sharp, Lemma 18]:
(26) 
where:
(27) 
where the set in question is a potentially nonconvex closed set contained in the cone. On the other hand, utilizing a simplified result of [pmlrv97qian19b] with partition mini-batches, we have:
(28) 
Then, for the case of noisy measurements, following a similar procedure we have:
For SkLSPD we further assume that the approximation errors of the forward and adjoint operators are bounded:
(29) 
For the $k$-th layer of SkLSPD-LW we have the following:
By Moreau's identity we have:
(30) 
Now, according to the definition of the proximal operator, we have
(31) 
since for the simplified LSPD we choose the weighted least-squares data fit:
(32) 
where $W$ includes the trainable weights.
Then we denote the following diagonal matrices:
(33) 
and,
(34) 
and we obtain:
(35) 
Then we denote the following quantities to simplify notation:
(36) 
(37) 
which is bounded for finite weights and shrinks to 0 for a large enough choice of the step-size ratio, and:
(38) 
where we used the second cone-projection identity. Then, by applying the first and third identities, we can continue: