1 Introduction
Positron emission tomography (PET) is an imaging modality for quantitatively measuring biochemical and physiological processes in vivo using a radiotracer. Image reconstruction for PET is challenging due to the ill-conditioned tomographic problem and the low counting statistics (high noise) of PET data, for example, in dynamic PET imaging where short time frames are used to monitor rapid changes in tracer distribution.
Among the different methods of PET image reconstruction, the kernel methods (e.g., [1, 2, 3, 4, 5]) address the noise challenge by uniquely integrating image prior information into the forward model of PET image reconstruction through a kernel representation framework [1]. Image prior information may come from composite time frames of a dynamic PET scan [1] or from anatomical images (e.g., an MRI image [2, 3] in integrated PET/MRI). The kernel methods can be easily implemented with the existing expectation-maximization (EM) algorithm and have demonstrated substantial image-quality improvement as compared to other methods [1, 2, 3].

In the existing kernel methods, a kernel representation is commonly built using an empirical process for defining feature vectors and manually selecting method-associated parameters [1].
However, such an experience-based parameter tuning and optimization approach often leads to suboptimal performance. In this paper, we first describe the equivalence between the kernel representation and a trainable neural network model. Based on this connection, we then propose a deep kernel method that learns the trainable components of the neural network model from available image data, enabling data-driven, automated learning of an optimized kernel model. The learned kernel model is then applied to tomographic image reconstruction and is expected to outperform existing kernel models that are empirically defined.

There are many ongoing efforts in the field to explore deep learning with neural networks for PET image reconstruction. Deep neural networks have been proposed for direct mapping from the projection domain to the image domain [6], but such models are so far mainly practical for 2D data training. By unrolling an iterative tomographic reconstruction algorithm, model-based deep-learning reconstruction [7] represents a promising direction. One limitation of this approach is that it requires pre-training on a large number of data sets and involves projection data in the iterative training process, which is computationally intensive. Alternatively, neural networks can be used for image representation in iterative reconstruction, e.g., via the deep image prior model [8]. The resulting reconstruction problem, however, is nonlinear and is often complex and challenging to optimize.

Different from these methods that use pure neural networks, the proposed deep kernel method combines deep neural networks with the kernel framework to form a novel tomographic image representation. The method has the unique advantage that once the model is trained with neural networks, the unknown kernel coefficient image remains linear in the model and is therefore easy to reconstruct from PET data. It does not necessarily require a large data set for training and is directly applicable to single-subject learning and reconstruction.
2 Background
2.1 PET Image Reconstruction
PET projection data $\mathbf{y} = \{y_i\}$ can be well modeled as independent Poisson random variables using the log-likelihood function [9],

$$L(\mathbf{y}|\mathbf{x}) = \sum_{i=1}^{n_d} \big( y_i \log \bar{y}_i - \bar{y}_i - \log y_i! \big), \qquad (1)$$

where $i$ denotes the detector index and $n_d$ is the total number of detector pairs. The expectation of the projection data, $\bar{\mathbf{y}} = \{\bar{y}_i\}$, is related to the unknown image $\mathbf{x}$ through
$$\bar{\mathbf{y}} = \mathbf{P}\mathbf{x} + \mathbf{r}, \qquad (2)$$
where $\mathbf{P}$ is the detection probability matrix for PET, which includes normalization factors for scanner sensitivity, scan duration, dead-time correction and attenuation correction, and $\mathbf{r}$ is the expectation of random and scattered events [9].

2.2 Kernel Methods for PET Reconstruction
The kernel methods describe the image intensity $x_j$ at pixel $j$ as a linear representation of kernels [1],

$$x_j = \sum_{k \in N_j} \alpha_k\, \kappa(\mathbf{f}_j, \mathbf{f}_k), \qquad j = 1, \dots, n_p, \qquad (4)$$

where $N_j$ defines the neighborhood of pixel $j$ and $n_p$ is the total number of image pixels. $\mathbf{f}_j$ and $\mathbf{f}_k$ are the feature vectors extracted from image priors for pixel $j$ and pixel $k$, respectively. $\alpha_k$ is the kernel coefficient at pixel $k$. $\kappa(\cdot, \cdot)$ is the kernel function that defines a weight between pixel $j$ and pixel $k$. A popular choice of $\kappa$ is the radial Gaussian kernel,
$$\kappa(\mathbf{f}_j, \mathbf{f}_k) = \exp\left( - \frac{\| \mathbf{f}_j - \mathbf{f}_k \|^2}{2\sigma^2} \right), \qquad (5)$$
with $\sigma$ being the kernel parameter. The equivalent matrix-vector form of Eq. (4) is
$$\mathbf{x} = \mathbf{K} \boldsymbol{\alpha}, \qquad (6)$$
with the $(j,k)$th element of the square kernel matrix $\mathbf{K}$ being $\kappa(\mathbf{f}_j, \mathbf{f}_k)$.
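The construction of the kernel matrix from per-pixel feature vectors, Eqs. (4)-(6), can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation; a dense matrix is built here for clarity, whereas in practice the matrix is made sparse (see the kNN discussion below).

```python
import numpy as np

def gaussian_kernel_matrix(F, sigma=1.0):
    """Dense kernel matrix K of Eq. (6) from per-pixel features.

    F     : (n_p, n_f) array, one feature vector per image pixel.
    sigma : Gaussian kernel width of Eq. (5).
    Returns an (n_p, n_p) matrix with K[j, k] = kappa(f_j, f_k).
    """
    # Pairwise squared Euclidean distances between feature vectors.
    d2 = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Tiny example: 4 pixels with 2-dimensional features.
F = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
K = gaussian_kernel_matrix(F, sigma=1.0)
# Similar pixel pairs (0,1) and (2,3) get weights near 1;
# dissimilar pairs get weights near 0.
```

The Gaussian kernel makes $\mathbf{K}$ symmetric with unit diagonal, so each row acts as a similarity-weighted averaging filter over the image.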
The kernel coefficient image $\boldsymbol{\alpha}$ is then estimated from the projection data $\mathbf{y}$ by maximizing the log-likelihood $L(\mathbf{y}|\mathbf{K}\boldsymbol{\alpha})$,
$$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha} \ge 0} L(\mathbf{y}|\mathbf{K}\boldsymbol{\alpha}), \qquad (7)$$
which can be solved using the kernelized EM algorithm [1],
$$\boldsymbol{\alpha}^{n+1} = \frac{\boldsymbol{\alpha}^{n}}{\mathbf{K}^T \mathbf{P}^T \mathbf{1}_{n_d}} \cdot \left[ \mathbf{K}^T \mathbf{P}^T \left( \frac{\mathbf{y}}{\mathbf{P}\mathbf{K}\boldsymbol{\alpha}^{n} + \mathbf{r}} \right) \right], \qquad (8)$$
where $n$ denotes the iteration number, the superscript "$T$" denotes matrix transpose, and the vector multiplications and divisions are element-wise. $\mathbf{1}_{n_d}$ is a vector with all elements being 1. Once $\hat{\boldsymbol{\alpha}}$ is estimated, the final PET activity image is given by $\hat{\mathbf{x}} = \mathbf{K}\hat{\boldsymbol{\alpha}}$.
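The kernelized EM update of Eq. (8) can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' code; with an identity kernel it reduces to standard MLEM, as the small noise-free example demonstrates.

```python
import numpy as np

def kernel_em(y, P, K, r, n_iters=100):
    """Kernelized EM iteration of Eq. (8) (illustrative sketch).

    y : (n_d,) measured counts;  P : (n_d, n_p) system matrix
    K : (n_p, n_p) kernel matrix;  r : (n_d,) mean randoms/scatters
    Returns the kernel coefficients alpha and the activity x = K @ alpha.
    """
    alpha = np.ones(K.shape[1])
    sens = K.T @ (P.T @ np.ones(len(y)))     # denominator K^T P^T 1
    for _ in range(n_iters):
        ybar = P @ (K @ alpha) + r           # forward model, Eq. (2)
        ratio = y / np.maximum(ybar, 1e-12)  # measured / expected counts
        alpha *= (K.T @ (P.T @ ratio)) / np.maximum(sens, 1e-12)
    return alpha, K @ alpha

# Tiny noise-free example: with K = I this is standard MLEM and the
# iteration recovers the true 2-pixel image from exact projections.
P = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_true = np.array([2.0, 3.0])
y = P @ x_true
alpha, x_hat = kernel_em(y, P, np.eye(2), np.zeros(3), n_iters=200)
```

Note that each iteration touches the projection data only through one forward and one back projection, so the kernel adds little cost on top of MLEM.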
Note that in practice, a normalized kernel matrix

$$\bar{\mathbf{K}} = \mathbf{D}^{-1} \mathbf{K}, \qquad (9)$$

where $\mathbf{D}$ is a diagonal matrix containing the row sums of $\mathbf{K}$, is commonly used for better performance [1]. The $(j,k)$th element of $\bar{\mathbf{K}}$ is equivalent to

$$\bar{\kappa}(\mathbf{f}_j, \mathbf{f}_k) = \frac{\kappa(\mathbf{f}_j, \mathbf{f}_k)}{\sum_{k' \in N_j} \kappa(\mathbf{f}_j, \mathbf{f}_{k'})}. \qquad (10)$$
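The row normalization of Eqs. (9)-(10), combined with a sparse neighborhood $N_j$, can be sketched as below. This is a toy numpy version with brute-force neighbor search; a k-d tree [14] would be used in practice, and the choice of $k=2$ here is purely illustrative.

```python
import numpy as np

def knn_normalized_kernel(F, k=2, sigma=1.0):
    """Sparse row-normalized kernel matrix of Eqs. (9)-(10) (sketch).

    For each pixel j, keep only the k nearest neighbors in feature
    space (brute-force search here), then normalize each row so the
    retained weights sum to 1.
    """
    n_p = F.shape[0]
    d2 = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)
    Kbar = np.zeros((n_p, n_p))
    for j in range(n_p):
        nbrs = np.argsort(d2[j])[:k]                  # N_j via kNN
        w = np.exp(-d2[j, nbrs] / (2.0 * sigma**2))   # Eq. (5)
        Kbar[j, nbrs] = w / w.sum()                   # row-normalize, Eq. (10)
    return Kbar

F = np.array([[0.0], [0.1], [5.0], [5.2]])
Kbar = knn_normalized_kernel(F, k=2)
# Each row has exactly k nonzeros and sums to 1.
```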
The neighborhood $N_j$ of pixel $j$ can be defined by its $k$ nearest neighbors (kNN) [14] to make $\mathbf{K}$ sparse. The feature vector $\mathbf{f}_j$ is usually set to the intensity values of the image prior at pixel $j$, and the kernel parameter $\sigma$ is chosen empirically [1].

3 Proposed Deep Kernel Method
3.1 Kernel Representation as Neural Networks
We first describe the kernel representation using a neural-network description, illustrated in Fig. 1. The construction of the kernel representation is decomposed into two main modules: (1) feature extraction and (2) pairwise attention.
Denote the image prior data by $\mathbf{Z}$, which consists of $n_z$ prior images. The feature extraction module extracts a feature vector $\mathbf{f}_j$ of length $n_f$ for each image pixel $j$ from $\mathbf{Z}$,

$$\mathbf{F} = \mathcal{H}(\mathbf{Z}), \qquad (11)$$

where $\mathcal{H}$ denotes the feature extraction operator, for example, a convolutional neural network. The extraction of conventional intensity-based features is equivalent to a $1 \times 1 \times 1$ convolution operation on $\mathbf{Z}$ (if the images are 3D). This step provides feature data $\mathbf{F}$ of size $n_p \times n_f$, where $n_f$ can be equal to $n_z$.

The pairwise attention module first calculates the similarity $s_{j,k}$ between pixel $j$ and its neighboring pixels $k$ specified by a predetermined neighborhood $N_j$,
$$s_{j,k} = -\frac{\| \mathbf{f}_j - \mathbf{f}_k \|^2}{2\sigma^2}, \qquad k \in N_j. \qquad (12)$$

Note that here only $k$ neighbors are selected for each pixel (due to kNN). This leads to similarity data $\mathbf{S}$ of size $n_p \times k$. Pairwise weights $w_{j,k}$ are then calculated from $\mathbf{S}$ using

$$w_{j,k} = \frac{\exp(s_{j,k})}{\sum_{k' \in N_j} \exp(s_{j,k'})}, \qquad (13)$$

generating weight data $\mathbf{W}$ of size $n_p \times k$. Here $\sum_{k \in N_j} w_{j,k} = 1$. This type of weight calculation is also called a softmax in neural networks and can be directly interpreted as a pairwise attention mechanism [11, 12]: $w_{j,k}$ is the attention weight of the "key" pixels $k$ with respect to the "query" pixel $j$.
The final step reshapes $\mathbf{W}$ using the neighborhood indices defined by $\{N_j\}$ to generate a sparse $n_p \times n_p$ matrix, which is equal to the normalized kernel matrix $\bar{\mathbf{K}}$ defined in Eq. (9). Each row of the kernel matrix is of size $n_p$ and can be displayed as an image, which can also be understood as an attention map for the corresponding pixel in $\mathbf{x}$.
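The pairwise attention module of Eqs. (12)-(13) can be sketched as a softmax over negative scaled squared distances, which reproduces the normalized Gaussian kernel weights of Eq. (10). This is an illustrative numpy sketch with a hypothetical 3-pixel example, not the paper's implementation.

```python
import numpy as np

def attention_weights(F, neighbors, sigma=1.0):
    """Pairwise attention of Eqs. (12)-(13) (illustrative sketch).

    F         : (n_p, n_f) feature data from the extraction module
    neighbors : (n_p, k) integer array, the kNN indices N_j per pixel
    Returns W of size (n_p, k): a softmax over the similarities
    s_{j,k} = -||f_j - f_k||^2 / (2 sigma^2), which equals the
    row-normalized Gaussian kernel of Eq. (10).
    """
    diffs = F[:, None, :] - F[neighbors]                 # (n_p, k, n_f)
    s = -np.sum(diffs**2, axis=-1) / (2.0 * sigma**2)    # Eq. (12)
    e = np.exp(s - s.max(axis=1, keepdims=True))         # stable softmax
    return e / e.sum(axis=1, keepdims=True)              # Eq. (13)

F = np.array([[0.0], [0.2], [4.0]])
neighbors = np.array([[0, 1], [1, 0], [2, 1]])  # each pixel's 2 neighbors
W = attention_weights(F, neighbors)
# The "query" pixel attends mostly to itself and to similar "key" pixels.
```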
3.2 Deep Kernel Model
Integrating all the neural network components in Fig. 1 together, we have the following deep kernel model to represent a PET image $\mathbf{x}$,

$$\mathbf{x} = \mathbf{K}(\mathbf{Z}; \boldsymbol{\theta})\, \boldsymbol{\alpha}, \qquad (14)$$

where $\mathbf{K}(\mathbf{Z}; \boldsymbol{\theta})$ denotes the equivalent neural network model of $\bar{\mathbf{K}}$, with the image prior data $\mathbf{Z}$ as the input and $\boldsymbol{\theta}$ collecting any model parameters that are trainable.
The deep kernel model is nonlinear with respect to $\mathbf{Z}$ and $\boldsymbol{\theta}$ but remains linear with respect to the kernel coefficient image $\boldsymbol{\alpha}$. While this model shares the spirit of using attention with the nonlocal neural network [12], the linearity in $\boldsymbol{\alpha}$ makes it unique and more suitable for tomographic image reconstruction problems. Once $\boldsymbol{\theta}$ is determined, $\boldsymbol{\alpha}$ can be easily reconstructed from the projection data $\mathbf{y}$ using the kernelized EM algorithm in Eq. (8).
In conventional kernel methods, the kernel is determined empirically, which often leads to suboptimal performance. For example, intensity-based features are commonly used for $\mathbf{f}_j$, yet convolutional neural-network-derived features can be more informative [13]. In this paper, we exploit the capability of deep learning to train an optimized form of $\mathbf{K}(\mathbf{Z}; \boldsymbol{\theta})$ from available image prior data based on the proposed deep kernel model.
3.3 Deep Kernel Learning
The deep kernel learning problem is formulated using the observation that in the kernelized image model of Eq. (6), $\mathbf{K}\mathbf{x}$ is usually a cleaner version of $\mathbf{x}$ if $\mathbf{x}$ is noisy. This inspires the following use of the denoising autoencoder framework [15] to train the model parameters $\boldsymbol{\theta}$ of $\mathbf{K}(\mathbf{Z}; \boldsymbol{\theta})$,

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \sum_{l=1}^{L} \big\| \mathbf{x}_l - \mathbf{K}(\mathbf{Z}; \boldsymbol{\theta})\, \tilde{\mathbf{x}}_l \big\|^2, \qquad (15)$$

where $\mathbf{x}_l$ denotes the $l$th high-quality image in the training dataset and $\tilde{\mathbf{x}}_l$ is a corrupted version of $\mathbf{x}_l$. $L$ is the total number of training image pairs. In PET, $\mathbf{x}_l$ and $\tilde{\mathbf{x}}_l$ can be obtained from high-count data and low-count data, respectively.
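The training objective of Eq. (15) can be sketched as follows. This numpy snippet only evaluates the loss for a given kernel matrix; the actual training would minimize it over $\boldsymbol{\theta}$ with a gradient-based optimizer, which is not shown. The toy images and the uniform averaging kernel are illustrative assumptions.

```python
import numpy as np

def dae_loss(K_theta, clean_imgs, noisy_imgs):
    """Denoising-autoencoder objective of Eq. (15), evaluation only.

    K_theta    : (n_p, n_p) kernel matrix produced by the network for
                 the current parameters theta
    clean_imgs : iterable of high-quality target images x_l (flattened)
    noisy_imgs : iterable of corrupted inputs (flattened)
    Returns sum_l || x_l - K_theta @ x_l~ ||^2.
    """
    return sum(float(np.sum((x - K_theta @ xt) ** 2))
               for x, xt in zip(clean_imgs, noisy_imgs))

# A smoothing kernel maps the noisy input closer to the clean target,
# so it attains a lower loss than the identity (no denoising) kernel.
x_clean = np.array([1.0, 1.0, 1.0, 1.0])
x_noisy = np.array([1.4, 0.6, 1.2, 0.8])
K_avg = np.full((4, 4), 0.25)   # uniform averaging kernel (rows sum to 1)
K_id = np.eye(4)                # identity kernel
```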
The deep kernel model can be pre-trained using a large number of patient scans (large $L$) if such a training dataset is available. It can also be trained online for a single subject (small $L$) without pre-training, as described below.
3.4 Single-Subject Deep Kernel Method for Dynamic PET
In dynamic PET, the image prior data $\mathbf{Z}$ may consist of several composite images $\{\mathbf{z}_m\}_{m=1}^{M}$, where $M$ is the number of composite frames (e.g., $M = 3$). These images are reconstructed from the rebinned long-scan projection data [1] and may have good image quality due to the relatively high count level of a composite frame. The composite image prior has been used in the standard kernel methods for constructing the kernel matrix empirically. Here we use it to train an optimized kernel model adaptive to a single subject.
The single-subject deep kernel learning problem is formulated as

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \sum_{m=1}^{M} \big\| \mathbf{z}_m - \mathbf{K}(\mathbf{Z}; \boldsymbol{\theta})\, \tilde{\mathbf{z}}_m \big\|^2, \qquad (16)$$

where the corrupted image $\tilde{\mathbf{z}}_m$ can be obtained from the reconstruction of low-count projection data that are down-sampled from the data of $\mathbf{z}_m$ using a count reduction factor (e.g., 10). Once $\boldsymbol{\theta}$ is trained, the learned kernel model is then used to reconstruct all the dynamic frames of the scan frame-by-frame using the kernel EM algorithm in Eq. (8).
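One standard way to generate the low-count data for Eq. (16) is binomial thinning of the measured counts; the paper does not specify its exact down-sampling procedure, so the sketch below is an assumed, conventional approach.

```python
import numpy as np

def downsample_counts(sinogram, factor=10, seed=0):
    """Low-count data by binomial thinning (assumed approach).

    Each recorded event is kept independently with probability
    1/factor, which preserves the Poisson character of the data at
    1/factor of the original count level.
    """
    rng = np.random.default_rng(seed)
    return rng.binomial(sinogram.astype(np.int64), 1.0 / factor)

full = np.full(1000, 100)                 # toy sinogram, 100 counts/bin
low = downsample_counts(full, factor=10)  # ~10 counts/bin on average
```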
In theory, both the feature extraction and pairwise attention modules in the neural network model (Fig. 1) are trainable. As a proof of concept, we only train the feature extraction operator $\mathcal{H}$ in this preliminary work. A residual U-net architecture [8] is used for the feature extraction module.
4 Computer Simulation Validation
4.1 Simulation Setup
Dynamic $^{18}$F-fluorodeoxyglucose (FDG) PET scans were simulated for a GE 690 PET scanner using the Zubal head phantom shown in Fig. 2a. The phantom is composed of gray matter, white matter, blood pools and a tumor (15 mm in diameter). The scanning schedule consisted of 24 time frames over one hour: 4 × 20 s, 4 × 40 s, 4 × 60 s, 4 × 180 s, and 8 × 300 s. The time activity curve of each region is shown in Fig. 2b. An attenuation map was simulated with a constant linear attenuation coefficient assigned to the whole brain. Dynamic images were first forward-projected to generate noise-free sinograms. A 20% uniform background was included to account for mean random and scattered events. Poisson noise was then introduced with the expected total number of events over 60 min set to 8 million.
4.2 Reconstruction Methods
The simulated dynamic data were reconstructed using four different methods: (1) standard maximum-likelihood EM (MLEM) reconstruction; (2) the existing kernel method without learning [1]; (3) the proposed deep kernel method with online training of the feature extraction; and (4) the deep image prior (DIP) reconstruction method [8] as a recent representative of nonlinear neural-network-based reconstruction methods.
The image prior for the kernel methods and DIP was the set of composite images obtained from three composite frames, each with a 20-min scan duration, using the approach described in [1]. For the conventional kernel method, the settings were the same as those empirically optimized in [1]. The DIP method was implemented in a way similar to [8] but was adapted to use the composite image prior data as the input of the U-net.
In the deep kernel method, the low-count images were obtained by using one-tenth of the counts in each composite frame. Five hundred iterations were used for the training with an empirically chosen learning rate. The $k$ in kNN for defining the neighborhood $N_j$ was set to 200 for nearly optimal performance.
4.3 Evaluation Metrics
Different image reconstruction methods were compared using the image signal-to-noise ratio (SNR), defined by

$$\mathrm{SNR}_m = 10 \log_{10} \left( \frac{\| \mathbf{x}_m^{\mathrm{true}} \|^2}{\| \hat{\mathbf{x}}_m - \mathbf{x}_m^{\mathrm{true}} \|^2} \right) \ \mathrm{dB}, \qquad (17)$$

where $\hat{\mathbf{x}}_m$ is the image estimate of frame $m$ obtained with one of the reconstruction methods and $\mathbf{x}_m^{\mathrm{true}}$ denotes the ground-truth image.
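The SNR metric can be computed directly; the short sketch below assumes the common definition consistent with Eq. (17), with a hypothetical two-pixel example.

```python
import numpy as np

def snr_db(x_est, x_true):
    """Image SNR in dB, Eq. (17):
    10 * log10( ||x_true||^2 / ||x_est - x_true||^2 )."""
    err = np.sum((x_est - x_true) ** 2)
    return 10.0 * np.log10(np.sum(x_true ** 2) / err)

x_true = np.array([3.0, 4.0])            # ||x_true||^2 = 25
x_est = x_true + np.array([0.3, 0.4])    # squared error = 0.25
# ratio 25 / 0.25 = 100, i.e. 20 dB
```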
4.4 Demonstration of Attention Map for the Kernel Methods
To understand how the deep kernel method may improve image reconstruction, Fig. 3 shows the attention weight maps for two different pixels, one from the tumor and the other from the gray matter region. These attention maps were generated by reshaping the corresponding row of the kernel matrix for a pixel $j$. The traditional MLEM reconstruction can be considered a special pixel-kernel method for which the kernel matrix is the identity matrix. As illustrated in Fig. 3(a), the attention of MLEM focuses only on the current pixel $j$; no spatial correlation is exploited by this pixel kernel.

The kernel methods are able to exploit spatial correlation from pixels that are considered neighbors of pixel $j$ by kNN. The attention is now not only on the current pixel but also spreads to similar pixels in the image space, which in turn can help image reconstruction. However, neighboring pixels may be falsely identified if the $k$ in kNN is large [1]. Without deep learning, the existing kernel model is unable to exclude the effect of those false neighbors, resulting in worse reconstruction as $k$ increases [1]. As shown in Fig. 3(b), high attention weights also appear in the gray matter region for a tumor pixel, and vice versa.
In comparison, the deep kernel model with training can learn a way to assign a more appropriate weight to irrelevant pixels even if those pixels are initially included in the nearest neighbors. Fig. 3(c) shows that with deep learning, attention is predominantly extracted in the tumor region for a tumor pixel and in the gray matter region for a gray matter pixel.
4.5 Image Quality Comparisons
Fig. 4 shows the reconstructed images for an early frame (#2) and a late frame (#24) by the different reconstruction algorithms with 60 iterations. The image SNR results are included. The proposed deep kernel method achieved better image quality with higher SNR than the other three methods, especially in the blood region in frame 2 and in the tumor region in frame 24. The DIP method generally led to over-smoothed images.
Fig. 5 shows the SNR plots of frame 2 and frame 24 as the iteration number is varied in each reconstruction algorithm. Fig. 6 further shows the SNR results of all time frames at iteration 60. The DIP method was only comparable to the conventional kernel method for dynamic PET image reconstruction. In contrast, the deep kernel method demonstrated a significant improvement over the other methods.
5 Application to Patient Data
5.1 Data Acquisition
A cardiac patient scan was performed on a GE Discovery ST PET/CT scanner in 2D mode at the UC Davis Medical Center. The patient received approximately 20 mCi of $^{18}$F-FDG with a bolus injection, followed immediately by the data acquisition. A low-dose transmission CT scan was then performed at the end of the PET scan to provide the CT image for PET attenuation correction. The raw data of the first 90 seconds were binned into a total of 45 dynamic frames of 2-second duration each. The data correction sinograms of each frame, including normalization, attenuation correction, scatter correction and randoms correction, were extracted using the vendor software used in the reconstruction process.
The dynamic data were reconstructed using the different methods. Five composite frames ($M = 5$) were used to build the kernel matrix $\mathbf{K}$.
5.2 Results
Fig. 7 shows the images from the different algorithms for reconstructing two high-temporal-resolution (HTR) frames, one early and one late in the scan. The MLEM reconstructions were extremely noisy, and the conventional kernel method substantially reduced the noise. The DIP method led to over-smoothed images (results not shown). In comparison, the images from the proposed deep kernel method demonstrated an improvement, with clearer structures and lower noise in the left ventricle, right ventricle and myocardium, though no ground truth is available for this real dataset.
Fig. 8 further shows the HTR time activity curves for two pixels, one in the descending aorta (AOR) and one in the myocardium (MYO). The MLEM reconstruction suffered from heavy noise. While the conventional kernel method achieved a significant noise reduction at high-activity time points, the low-activity time points still suffered from noise. In comparison, the deep kernel method demonstrated effective noise reduction at both early and late time points.
6 Conclusion
In this paper, we have developed a new deep kernel method for PET image reconstruction. The proposed deep kernel model allows the kernel representation to be trained from data rather than defined by an empirical process. Computer simulation and patient results have demonstrated the improvement of the deep kernel method over existing methods in dynamic PET imaging.
References
[1] G. Wang and J. Qi, “PET image reconstruction using kernel method,” IEEE Trans. Med. Imag., vol. 34, no. 1, pp. 61–71, Jan. 2015.
[2] W. Hutchcroft, G. B. Wang, K. Chen, C. Catana, and J. Qi, “Anatomically-aided PET reconstruction using the kernel method,” Phys. Med. Biol., vol. 61, no. 18, pp. 6668–6683, 2016.
[3] P. Novosad and A. J. Reader, “MR-guided dynamic PET reconstruction with the kernel method and spectral temporal basis functions,” Phys. Med. Biol., vol. 61, no. 12, pp. 4624–4645, 2016.
[4] G. Wang, “High temporal-resolution dynamic PET image reconstruction using a new spatiotemporal kernel method,” IEEE Trans. Med. Imag., vol. 38, no. 3, pp. 664–674, Mar. 2019.
[5] J. Bland et al., “MR-guided kernel EM reconstruction for reduced dose PET imaging,” IEEE Trans. Radiat. Plasma Med. Sci., vol. 2, no. 3, pp. 235–243, May 2018.
[6] I. Häggström, C. R. Schmidtlein, G. Campanella, and T. J. Fuchs, “DeepPET: A deep encoder-decoder network for directly solving the PET image reconstruction inverse problem,” Med. Imag. Anal., vol. 54, pp. 253–262, May 2019.
[7] A. J. Reader, G. Corda, A. Mehranian, et al., “Deep learning for PET image reconstruction,” IEEE Trans. Radiat. Plasma Med. Sci., vol. 5, no. 1, pp. 1–24, Jan. 2021.
[8] K. Gong, C. Catana, J. Y. Qi, and Q. Z. Li, “PET image reconstruction using deep image prior,” IEEE Trans. Med. Imag., vol. 38, no. 7, pp. 1655–1665, Jul. 2019.
[9] J. Qi and R. M. Leahy, “Iterative reconstruction techniques in emission computed tomography,” Phys. Med. Biol., vol. 51, no. 15, pp. R541–R578, 2006.
[10] L. A. Shepp and Y. Vardi, “Maximum likelihood reconstruction for emission tomography,” IEEE Trans. Med. Imag., vol. MI-1, no. 2, pp. 113–122, Oct. 1982.
[11] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[12] X. Wang, R. Girshick, A. Gupta, et al., “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[13] S. Li and G. Wang, “Modified kernel MLAA using autoencoder for PET-enabled dual-energy CT,” Philosophical Transactions of the Royal Society A, vol. 379, no. 2204, 2021.
[14] J. H. Friedman, J. L. Bentley, and R. A. Finkel, “An algorithm for finding best matches in logarithmic expected time,” ACM Trans. Math. Software, vol. 3, no. 3, pp. 209–226, 1977.

[15] M. A. Kramer, “Nonlinear principal component analysis using autoassociative neural networks,” AIChE Journal, vol. 37, no. 2, pp. 233–243, 1991.
[16] R. L. Harrison et al., “Preliminary experience with the photon history generator module of a public-domain simulation system for emission tomography,” in Conf. Rec. IEEE Nucl. Sci. Symp. Med. Imag. Conf., 1993, pp. 1154–1158.