1 Introduction
Natural signals such as images, audio, and video often exhibit sparsity in some representation domain, such as wavelets ^{pati:93:omp}, the Fourier transform, etc. Dictionary learning methods can extract features from natural signal data and represent the signal in a sparse format. Such sparse representation methods have been used in many areas, such as image denoising
^{elad:06:idv}^{yang:10:isr}, inpainting ^{julien:08:src}, medical image reconstruction ^{zhang:17:tbd}, etc. There are two typical approaches for finding a sparse representation of a signal: the synthesis model and the analysis model. Consider a signal observed through a measurement operator. In the synthesis model, the signal is encoded as a combination of column vectors (atoms) of a synthesis dictionary, i.e., it is assumed to equal the dictionary applied to a code containing only a few nonzero elements, while data consistency with the measurements is enforced simultaneously. In the analysis model, the signal is assumed to be sparse in an (overcomplete) transform domain defined by an analysis dictionary, and data consistency is enforced in parallel. The sparse coding step is NP-hard, and algorithms that learn synthesis or analysis dictionaries, such as K-SVD ^{aharon:06:ksa}, Analysis K-SVD ^{rubinstein:13:aka}, and NAAOL ^{yaghoobi:13:coa}, are computationally expensive. The sparsifying transform model offers a low computational cost for computing the sparse code of a signal ^{ravishankar:15:lst}. By modeling the transformed signal as a sparse code plus a small error term, the signal is approximately sparse in the transform domain. Such a sparsifying transform model ^{ravishankar:13:lst} can be regarded as a generalized analysis model. For instance, Ravishankar et al. proposed learning doubly sparse transforms for signals ^{ravishankar:13:lds}, where the sparsifying transform is the product of two transforms: a sparse transform and a fast analytical transform. Other works developed efficient blind compressed sensing algorithms ^{sabsam} for joint sparsifying transform learning and image reconstruction in MRI.
Recent years have witnessed the growing development of deep learning methods. A typical approach is to employ deep neural networks (DNNs) and train an end-to-end model mapping the input to the output. For example, Jin et al. proposed FBPConvNet based on the U-Net framework ^{jin:17:dcn}, which constructs a mapping from noisy low-dose CT reconstructions to high-quality images. Chen et al. combined the autoencoder and deconvolution network and proposed the RED-CNN framework for low-dose CT imaging ^{chen:2017:redCNN}. Although DNN-based methods show potential for capturing underlying features of big data, a large amount of (often paired) training data is indispensable for training a viable network. In addition, an explicit understanding of the modeling or representation captured in the network is currently lacking.

In the past few years, researchers have imitated such nested network structures to create more complex sparse signal models. For example, Tang et al. proposed a Deep Micro-Dictionary Learning and Coding Network, where classic convolutional layers are replaced with dictionary learning and feature encoding operations ^{tang:19:dmd}. Singhal et al. ^{Singhal:18:mmt} presented a deep dictionary learning (DDL) network for image classification that minimizes the reconstruction error after multiple DL feature encoding steps. Ravishankar et al. developed a multilayer extension of the sparsifying transform model, which sparsifies an input image layer by layer ^{ravishankar:18:lml}. Other works improved signal models by combining multiple signal modules in a parallel structure ^{wen:14:sos, zheng:18:pua}. Wen et al. extended the single sparsifying transform model to a union of sparsifying transforms model, where different inputs are assigned to different clusters and a transform is learned for each cluster ^{wen:14:sos}.
Several applications of sparsifying transform (ST) learning and its extensions have been demonstrated in medical image reconstruction, such as magnetic resonance image reconstruction and computed tomography (CT) image reconstruction ^{zheng:18:pua, yang:20:mars}. The conventional method for regular-dose CT image reconstruction is analytical filtered backprojection (FBP) ^{feldkamp:84:pcb}. However, at reduced/low X-ray dose, severe artifacts and noise can degrade the quality of the reconstructed image. A class of reconstruction methods takes the physics model of the imaging system into account, along with statistical models of the measurements and noise and often simple object priors. These methods are referred to as model-based image reconstruction (MBIR) methods ^{MBIR_review_21}.
The penalized weighted least squares (PWLS) approach is a common MBIR method, whose cost function includes a weighted quadratic data-fidelity term and a regularizer. The regularizer is conventionally based on hand-crafted models and priors ^{sauer:93:alu, thibault:06:arf, thibault:07:atd, cho:15:rdf}. Previous studies demonstrated promising results by introducing learned sparsifying transforms into the PWLS scheme ^{pfister:14:mbi, pfister:14:trw, pfister:14:ast, chun:17:esv}. The sparsifying transform, learned in an unsupervised manner, is incorporated into a prior term, which brings effective information extracted from the training slices into the reconstruction process. Moreover, using multiple sparsifying transforms increases the number of captured features ^{zheng:18:pua, yang:20:mars} and further boosts the performance of reconstruction algorithms.

In this study, we propose a network-structured sparsifying transform learning module, which we refer to as the Multilayer Clustering-based residual Sparsifying Transform (MCST) model. We extend the original sparsifying transform model in two aspects. On the one hand, the multilayer structure of the MCST model captures additional features by exploiting residual maps layer by layer. On the other hand, the clustering operation in each layer implements an attention-like mechanism by grouping input patches. We propose a pipeline for training the MCST model and show how to update the variables in this pipeline. In addition, we apply the learned MCST model to LDCT reconstruction through a PWLS optimization scheme, and we name the new LDCT reconstruction algorithm PWLS-MCST. Experimental results demonstrate that PWLS-MCST not only provides better image reconstruction quality than conventional methods like FBP and PWLS with a nonadaptive edge-preserving (EP) regularizer, but also outperforms several advanced methods such as PWLS-ULTRA ^{zheng:18:pua} and PWLS-MARS ^{yang:20:mars} in terms of RMSE and SSIM.
2 Methods
2.1 Model description
In this section, we give a mathematical description of the MCST model and introduce its framework. Figure 1 shows the structure of the proposed multilayer clustering-based residual sparsifying transform (MCST) model, and Table 1 in the Appendix details the notation used in our methodology: each layer contains one transform per class, and the residual-map and sparse-code submatrices of a layer are formed from the columns whose indices belong to the set of patches assigned to a given class in that layer. In particular, each column of the input matrix is a vectorized overlapping image patch extracted from the training dataset (of images) with an appropriate patch stride. Figure 2 illustrates the patch extraction and patch clustering processes on an image across multiple layers. For patches with the same index, even though they lie in different layers, the locations of the included pixels in the image are the same. Each patch is vectorized after extraction for further clustering.

The proposed MCST model extends the original sparsifying transform model in two dimensions to strengthen its representation ability. The first extension increases the depth of the model, borrowing the design idea of the recent MARS ^{yang:20:mars} model. The multilayer structure forces the filtered residuals (the difference between the transformed input signals and their sparse representations) to be further sparsified layer by layer, so that each layer keeps partial information or captures effective features from the input. The second extension changes the width of the MCST model. Similar to the popular attention mechanism, we learn separate sparsifying transform modules for different parts of the input signal in each layer. Since our method is patch-based, a clustering operation assigns each patch to a specific class. Moreover, the clustering operation must not disrupt the relative order of the patches, which requires remembering the index of each patch at each layer and recombining the patches in the same order before the next clustering.

2.2 Algorithm for model training and image reconstruction
Figure 3 gives an overview of the algorithms for model training and low-dose CT reconstruction. The whole process is divided into two stages. In the model training stage, (2.1) is solved with a block coordinate descent (BCD) method; we propose an iterative algorithm involving cluster assignment update, transform update, and sparse coding steps. Note that we use unlabelled images to train the proposed MCST signal model. A similar cyclical updating strategy is applied in the image reconstruction stage as well, except that instead of updating the transforms in each iteration, an image update step is included in the BCD algorithm. We design the prior function (regularizer) incorporating the learned MCST model to harness the effective information obtained in the training stage.
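At a high level, the training-stage BCD alternation can be sketched as follows. This is a simplified, single-layer sketch with hypothetical names, not the paper's full multilayer algorithm: it alternates cluster assignment by objective enumeration, closed-form sparse coding by hard-thresholding, and a per-cluster orthogonal Procrustes transform update.

```python
import numpy as np

def hard_threshold(v, gamma):
    """Zero out entries with magnitude below gamma."""
    return v * (np.abs(v) >= gamma)

def train_single_layer(P, n_clusters, gamma, n_iters, seed=0):
    """Simplified single-layer BCD: alternate cluster assignment,
    sparse coding, and Procrustes transform updates on patch matrix P
    (columns of P are vectorized patches)."""
    rng = np.random.default_rng(seed)
    d, n = P.shape
    # initialize each class transform with a random orthonormal matrix
    Ws = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(n_clusters)]
    labels = rng.integers(n_clusters, size=n)
    for _ in range(n_iters):
        # (1) cluster assignment: enumerate the per-class sparse-coding objective
        coefs = np.stack([W @ P for W in Ws])                   # (K, d, n)
        Zs = hard_threshold(coefs, gamma)
        costs = ((coefs - Zs) ** 2).sum(axis=1) \
            + gamma ** 2 * (Zs != 0).sum(axis=1)                # (K, n)
        labels = costs.argmin(axis=0)
        # (2)+(3) per cluster: sparse coding, then orthogonal Procrustes update
        for k in range(n_clusters):
            idx = np.where(labels == k)[0]
            if idx.size == 0:
                continue
            Z = hard_threshold(Ws[k] @ P[:, idx], gamma)
            U, _, Vt = np.linalg.svd(Z @ P[:, idx].T)
            Ws[k] = U @ Vt                                      # stays orthonormal
    return Ws, labels
```

The multilayer version described in Section 2.2.1 additionally propagates residual information across layers; this sketch only shows the basic alternation pattern.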
2.2.1 Algorithm for MCST model training
The formulation for training the MCST model is given in (2.1). Since (2.1) is nonconvex, similar to the recent MARS work ^{yang:20:mars}, we apply a BCD algorithm that iteratively updates the different groups of variables. Algorithm 1 shows the full MCST learning pipeline, in which we update the variables in MCST by looping over layers. The specific procedure consists of the following steps:

Cluster assignment update
In this step, we find the cluster assignment for every patch in a given layer by solving the optimization problem (1). Random initialization is used for the cluster assignments in the first iteration. When updating the cluster assignments in a given layer, the patch assignments in the deeper layers are held fixed.
(1) Interpretation
The objective function of optimization problem (1) consists of two parts: a sparse encoding residual and an ℓ0 penalty that enforces sparsity of the encoded representation. When computing the cluster assignment of a patch in a given layer, the transforms in the later layers still play a role, because the residuals of the patch at later layers depend on the transforms and the residual of the current layer, as reflected in the constraint.
To solve problem (1), we start at the first layer and proceed to later layers. To determine each cluster assignment, one only needs to enumerate the values of the objective function over the candidate classes and choose the index that gives the minimal objective value. We repeat this process across all layers until the cluster assignments no longer change.
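The enumeration step for a single patch can be sketched as follows (names and shapes are hypothetical; for brevity this shows only the single-layer portion of the cost, without the later-layer terms described above).

```python
import numpy as np

def hard_threshold(v, gamma):
    """Zero out entries with magnitude below gamma."""
    return v * (np.abs(v) >= gamma)

def assign_cluster(r, transforms, gamma):
    """Assign patch r to the class whose transform gives the smallest
    objective ||W_k r - z||_2^2 + gamma^2 ||z||_0, where z is the
    hard-thresholded coefficient vector under W_k."""
    costs = []
    for W_k in transforms:
        coef = W_k @ r
        z = hard_threshold(coef, gamma)
        costs.append(np.sum((coef - z) ** 2) + gamma ** 2 * np.count_nonzero(z))
    return int(np.argmin(costs))
```

For example, a patch that is already sparse under one class transform but spreads its energy under another is assigned to the first class, since the ℓ0 term then dominates the cost.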

Sparse coding step
The objective used to update the sparse codes in each iteration is shown in (2). With the cluster assignment of every patch and all transforms and sparse coefficient maps outside the current layer held fixed, the objective function in (2.1) becomes
(2) To solve (3), we fix the base layer index. Recall that the residuals in later layers depend on the residual and the encoding vector of the base layer. Therefore, we need to peel off this dependence to find the solution for the encoding vector in the base layer, and we use the unitary property of the transforms to disentangle it. Consider the encoding discrepancy term in a later layer:
(4) where we define the backpropagation information vector between the later layer and the base layer for each patch as
(5) Consequently, the objective function in the optimization problem (3) can be rewritten as
The optimal solution to the optimization problem above is thus
(6) The form of the solution depends on the specific layer in which the optimization (2) is carried out.

Transform update step
To update the transforms in a given layer, we leave out quadratic terms from shallower layers and all terms unrelated to the target optimization variables. By assembling the relevant residual and sparse-code column vectors into matrices for each cluster, (2.1) can be rewritten as the following optimization problem. With the cluster assignments and sparse codes fixed, we solve (7) to update the transforms.
(7) Define the matrix consisting of the backpropagation information vectors for all patches of the given cluster in the given layer. We must handle the same entanglement between the transforms of earlier and later layers; with the same disentangling argument as in Section 2.2.1, the objective function in problem (7) can be rewritten as
(8) where the symbol denotes equality up to additive terms that do not depend on the variables being optimized.
From the rewritten objective function, the unitary constraint reduces the optimization over each transform to an orthogonal Procrustes problem, which can therefore be solved separately for each cluster. Define the matrix
(9) The solution to each orthogonal Procrustes problem in a given layer (i.e., the solution to (7)) with respect to each transform is
(10) where the two factors are the singular vector matrices in the singular value decomposition of the matrix defined above ^{orthoProcrustes_66}.
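The closed-form transform update can be sketched as follows (a generic orthogonal Procrustes sketch with hypothetical names; here R stacks the relevant residual columns of one cluster and Z the corresponding sparse codes).

```python
import numpy as np

def procrustes_update(R, Z):
    """Solve min_W ||W R - Z||_F^2 subject to W W^T = I.

    The minimizer is W = U V^T, where U and V are the singular vector
    matrices in the SVD of Z R^T (orthogonal Procrustes solution).
    """
    U, _, Vt = np.linalg.svd(Z @ R.T)
    return U @ Vt

rng = np.random.default_rng(1)
R = rng.standard_normal((16, 100))
Z = rng.standard_normal((16, 100))
W = procrustes_update(R, Z)
print(np.allclose(W @ W.T, np.eye(16)))  # the updated transform stays orthonormal
```

Because the solution is a product of two orthogonal matrices, the unitary constraint is satisfied by construction, and each cluster's transform update costs only one small SVD.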
2.2.2 PWLSMCST image reconstruction algorithm
We adopt the penalized weighted least squares (PWLS) approach to reconstruct an image from its noisy sinogram, incorporating the learned MCST model into the cost function as the regularizer. The specific PWLS problem is shown in (P1). It involves the reconstructed image, the noisy sinogram data, and the system matrix of the CT scan; a diagonal weighting matrix whose entries are the inverse variances of the measurements; and a diagonal matrix that majorizes the data-fidelity Hessian. A patch-extraction operator extracts each overlapping patch from the image with a patch stride of one pixel. A regularization parameter controls the trade-off between the data-fidelity term and the regularizer, and tunable thresholds control the sparsity of the MCST model in each layer.

(P1)
where the regularizer is defined as
To solve (P1), we decompose it into several subproblems (image update, cluster assignment update, and sparse coding) and solve these subproblems sequentially in each iteration of the proposed algorithm. The complete algorithm for the image reconstruction stage is shown in Algorithm 2.

Image update step
Here, we fix the cluster assignments and sparse representations. The resulting subproblem can be rewritten as follows.
(11) The solution process for (11) is identical to the image update step in Ref. ^{yang:20:mars}. In each outer iteration, we update the image several times with a parameter that decreases as in (12). The quantities needed for the image update, including the Hessian matrix, are given in (13) and (14), respectively.
(12) (13) (14) 
Cluster assignment update step
During image reconstruction, the cluster assignment of every reconstructed image patch is updated dynamically in the same manner as in the learning stage. We find the optimal class for each patch in each layer by solving the following problem, as in Section 2.2.1.
(15) 
Sparse coding step
3 Experiments
3.1 Experiment setup
In this study, we evaluate the performance of the proposed MCST sparse signal model and the PWLS-MCST algorithm for LDCT reconstruction. We investigate the PWLS-MCST algorithm with different numbers of layers (2 and 3), with each layer containing 5 clusters to group the input patches. We compare the proposed methods against several baseline methods, including FBP ^{feldkamp:84:pcb}, PWLS with EP regularization ^{cho:15:rdf}, and the recent MARS ^{yang:20:mars} and ULTRA ^{zheng:18:pua} regularizations in PWLS. The details of these methods are as follows.
FBP: We used a Hanning window for the filtered backprojection reconstruction.

PWLS-EP: An edge-preserving (EP) regularizer is employed in the PWLS reconstruction scheme. The EP regularizer sums a potential function of the differences between each pixel and its neighbors over all pixels, weighted by analytically determined parameters that encourage uniform spatial resolution.

PWLS-MARS: A multilayer residual sparsifying transform (with no clustering involved) is adopted as the prior in PWLS reconstruction. For a fair comparison, we choose MARS models with 2 and 3 layers (MARS2, MARS3), which have the same depth as the MCST models in our experiments.

PWLS-ULTRA: PWLS with a union of transforms model, which is equivalent to the MCST signal model with a single layer.
We used two metrics, RMSE and SSIM ^{xu:12:ldx, zhang:17:tbd}, to evaluate the quality of the reconstructions. We compute the root mean square error (RMSE) between the reconstructed image and the ground truth image over the region of interest (ROI), a circular region containing all structures and tissues, normalized by the number of pixels inside the ROI.
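The ROI-restricted RMSE computation can be sketched as follows (the mask construction here is illustrative, not the paper's exact ROI definition).

```python
import numpy as np

def circular_mask(shape, radius_frac=1.0):
    """Boolean mask of a centered circle covering the reconstruction field of view."""
    H, W = shape
    yy, xx = np.ogrid[:H, :W]
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    r = radius_frac * min(H, W) / 2.0
    return (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2

def rmse_roi(x_hat, x_true, mask):
    """Root mean square error over the pixels inside the ROI."""
    diff = (x_hat - x_true)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```

Restricting the error to the ROI avoids rewarding methods for trivially reproducing the empty background outside the scanned object.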
Our experiments include two main parts: learning transforms for constructing regularizers, and using them for low-dose CT image reconstruction. We work with two datasets: XCAT phantom simulated data ^{Segars:08:rcs} and Mayo Clinic data ^{Mayo:16:data}. We generate the low-dose measurements with a "Poisson + Gaussian" noise model ^{ding:16:mmp}, in which the detected photon counts follow a Poisson distribution determined by the incident photon intensity of the X-ray beam (with no scatter), plus Gaussian electronic noise of fixed variance. For the XCAT phantom data, we use 5 slices to train the model and simulate the low-dose measurements with a GE 2D LightSpeed fan-beam geometry. For the Mayo Clinic data, 7 regular-dose images collected from three patients were used to train the MCST model, and we synthesized the low-dose measurements using a fan-beam CT geometry with a monoenergetic source.
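The "Poisson + Gaussian" measurement simulation can be sketched as follows; the intensity `I0` and noise level `sigma` below are illustrative placeholders, not the settings used in this paper.

```python
import numpy as np

def simulate_low_dose(line_integrals, I0=1e4, sigma=5.0, rng=None):
    """'Poisson + Gaussian' model: detector counts are Poisson with mean
    I0 * exp(-l) for each true line integral l, plus zero-mean Gaussian
    electronic noise of standard deviation sigma."""
    rng = rng or np.random.default_rng()
    mean_counts = I0 * np.exp(-line_integrals)
    counts = rng.poisson(mean_counts) + rng.normal(0.0, sigma, line_integrals.shape)
    counts = np.clip(counts, 1.0, None)  # guard the log against non-positive counts
    return np.log(I0 / counts)           # post-log sinogram (noisy line integrals)

l_true = np.linspace(0.1, 2.0, 8)
l_noisy = simulate_low_dose(l_true, rng=np.random.default_rng(0))
print(np.max(np.abs(l_noisy - l_true)))  # small perturbations around the true values
```

Lowering `I0` increases the relative Poisson noise, which is how reduced-dose acquisitions are emulated from clean line integrals.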
3.2 Lowdose experiments with XCAT phantom dataset
3.2.1 Behavior of the learned MCST models
First, we display the transforms learned from the XCAT phantom training dataset. Figure 4 shows the image set with all training slices, where the number below each subfigure indicates its location in the volume. We extracted vectorized overlapping patches from these slices with a fixed patch stride.
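The overlapping patch extraction can be sketched as follows (the patch size and stride here are illustrative, not the settings used in this paper).

```python
import numpy as np

def extract_patches(image, patch, stride):
    """Return a matrix whose columns are vectorized overlapping patches."""
    H, W = image.shape
    cols = [image[i:i + patch, j:j + patch].reshape(-1)
            for i in range(0, H - patch + 1, stride)
            for j in range(0, W - patch + 1, stride)]
    return np.stack(cols, axis=1)

img = np.arange(64, dtype=float).reshape(8, 8)
P = extract_patches(img, patch=4, stride=2)
print(P.shape)  # (16, 9): 16-pixel patches at a 3 x 3 grid of positions
```

A smaller stride increases patch overlap and thus the number of training columns, at the cost of more memory and computation.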
We learn the transforms in the MCST method with different model depths: ULTRA (single-layer MCST), MCST2 (2 layers), and MCST3 (3 layers), where each layer contains 5 clusters to enable learning rich features. To verify the effectiveness of the multilayer and multi-cluster extensions, MARS2 (2 layers) and MARS3 (3 layers) are included for comparison. We ran 1000 iterations to train each model and varied the parameters around those used for the MARS model in our previous work ^{yang:20:mars}. We initialize the transforms in the first layer of MCST with 2D DCT matrices, and the remaining transforms with random matrices. Figure 5 shows the pre-learned transforms for ULTRA with 5 clusters (orange box), MARS with two layers (gray box), and MCST2 with 5 clusters in both layers (blue box). Each row of the learned transforms is reorganized as a square matrix for display purposes. The results show that the MCST model integrates the advantages of the ULTRA and MARS models. In particular, the transforms in the first layer of MCST show rich features (like ULTRA), while the transforms in the second layer capture finer features by further sparsifying the representation of the residuals.
3.2.2 LDCT reconstruction results with PWLSMCST algorithm
Two reconstructed slices of the XCAT phantom dataset are illustrated in Figures 6 and 7, respectively. We compare the results of PWLS-MCST with the conventional FBP method, PWLS-EP, PWLS-ULTRA, and PWLS-MARS. We used the FBP reconstruction as the initialization for PWLS-EP. We ran 1000 iterations of the relaxed LALM algorithm for PWLS-EP and 1500 outer iterations for the other ST-based iterative methods to ensure convergence. We use the same training dataset and validation image slice (slice 48 of the XCAT phantom) as in Ref. ^{yang:20:mars}, and the reconstruction parameters are varied around their values in that work.
The reconstruction results show that PWLS-MCST attains better performance than all baseline methods. Compared with PWLS-MARS, the images reconstructed by PWLS-MCST show greater contrast, owing to flexible image modeling with multi-class transforms learned from different groups of patches. The performance difference is exemplified in the zoomed-in portions and the details indicated by red arrows. Using multiple layers also enables PWLS-MCST to preserve more key features than PWLS-ULTRA. For example, in the images reconstructed by PWLS-ULTRA, some subtle bone structures are missing (zoomed-in region at the bottom-left corner of the subfigures), whereas PWLS-MCST mitigates this effect; the zoomed-in regions at the bottom-right corner of the subfigures for PWLS-MCST are closer to the ground truth than those for PWLS-ULTRA. Overall, PWLS-MCST provides the best performance in terms of RMSE and SSIM.
However, compared with the 2-layer MCST model, the improvement from the deeper MCST model (MCST3) is limited, in accordance with the behavior of PWLS-MARS. Since XCAT phantom images have relatively simple structures, the complexity of the MCST3 model may be too high to yield a significant benefit on this dataset. The larger number of tunable parameters in deeper MCST models might also lead to suboptimality in the training process.
3.3 Lowdose experiments with Mayo Clinic dataset
3.3.1 MCST model training
Next, we train the MCST model on the Mayo Clinic dataset. We use 7 regular-dose slices collected from three different patients (L096, L067, L143) for training; Figure 8 displays all training slices used in the experiment. Similar to the XCAT phantom experiment, we extracted vectorized overlapping patches from these training slices.
We trained the ST-based signal models with parameters similar to those of the XCAT phantom experiment, using 1000 training iterations. The transforms in the MCST model are initialized with DCT matrices for the first layer and random matrices for the successive layers.
As Figure 9 shows, the transforms learned by the ULTRA algorithm contain some compound features (e.g., the third transform in the first row). Similar to the XCAT phantom experiment, MCST2 blends the advantages of the MARS and ULTRA models and captures richer features from the training patch set. Furthermore, considering the complexity of the Mayo Clinic dataset, MCST2 is able to learn effective features in every transform of the first layer.
3.3.2 Numerical and visual results on Mayo Clinic data
After obtaining a series of transforms, we incorporate those pre-learned transforms into the PWLS scheme and compare the various reconstruction methods. For PWLS-EP, we ran 1000 iterations of the relaxed LALM algorithm with the FBP reconstruction as the initialization. For PWLS-ULTRA, PWLS-MARS2, and PWLS-MARS3, we set the maximum number of outer iterations to 1500. Similar to the MARS work ^{yang:20:mars}, we tuned the parameters on one validation image (slice 100 of patient L506) and tested the performance of our proposed algorithm on other slices (slice 140 of patient L333, slice 90 of patient L109, and slice 120 of patient L067). We varied the reconstruction parameters around the fine-tuned values in the PWLS-MARS work. The reconstructed images of slice 140 of patient L333, slice 90 of patient L109, and slice 120 of patient L067 show that the proposed PWLS-MCST algorithm gives better image reconstruction quality than other recent unsupervised learning-based methods (PWLS-ULTRA and PWLS-MARS), especially for recovering subtle details. Although the result of PWLS-ULTRA is clearer than the initial PWLS-EP reconstruction, it is further improved by incorporating the rich multilayer structure of MCST. Numerical results indicate that PWLS-MCST performs the best in terms of the RMSE and SSIM criteria.
4 Conclusion
In this work, we proposed a multilayer clustering-based residual sparsifying transform (MCST) network comprising multiple sparsifying transform modules to extract rich information from training images. The MCST model differs from the previous single sparsifying transform model in two aspects. First, learning multiple transforms in each layer by clustering the data allows capturing rich features from patches. Second, the multilayer structure aims to better capture the sparsity of the signal representation across several layers. We applied the learned MCST model to LDCT reconstruction and presented a novel image reconstruction algorithm dubbed PWLS-MCST. Experimental results indicate that PWLS-MCST provides better image reconstruction quality than conventional methods such as FBP and PWLS-EP. Furthermore, PWLS-MCST shows advantages over several recent sophisticated algorithms for LDCT reconstruction, including PWLS-MARS and PWLS-ULTRA, especially in producing clearer edges and preserving subtle details. In the future, we plan to incorporate more sophisticated architectural elements, such as skip connections between layers and multiscale transforms via downsampling and upsampling, into the MCST model and formulations. Additional imaging applications can also be explored to validate the general applicability of the proposed model.
Appendix: Notations
We use lowercase bold font to denote vectors and uppercase bold font to denote matrices. We use the 2-norm for vectors and the Frobenius norm for matrices, and the ℓ0 "norm" counts the number of nonzero entries. The hard-thresholding operator zeros out values whose magnitudes are less than the threshold. We list the notation used in the description of our methodology in Table 1.
Number of classes in each layer
Set containing the indices of patches in a given class and layer
Columns of the residual-map and sparse-code matrices
Submatrices of the residual-map and sparse-code matrices with column indices in a given index set
Transform of a given class in a given layer
Cluster assignment of a given patch in a given layer
Backpropagation information vector for a given patch between two layers

