Modern methods for diagnosing medical conditions have developed rapidly in recent years, and one tool of utmost importance is the Computerized Tomography (CT) scan. It is often used to help diagnose complex bone fractures, tumors, heart disease, emphysema, and more. It works in a manner similar to an X-ray scan: a rotating source shoots narrow X-ray beams through a section of the body, while a highly sensitive detector placed opposite the source picks up the transmitted rays, and a sophisticated mathematical algorithm reconstructs a 2D slice of the body part from one full rotation. This process is repeated until the required number of slices is created. As helpful as this procedure is for diagnosis, it is a cause for concern because the patient is exposed to ionizing radiation for varying durations. CT scans are mainly responsible for the increase in radiation that humans receive from medical procedures, which have become the second-largest source of radiation exposure after background radiation. Reducing the X-ray dose in CT scans is possible but leads to problems such as increased noise, reduced contrast at edges, corners, and sharp features, and over-smoothed images. We propose a method that preserves detail and reduces the noise in low-dose scans so that they may become a viable alternative to high-dose scans.
Medical image denoising has garnered a considerable amount of attention from the computer vision research community, with extensive research [Liang_2020, pmid29870364, Chen_2017, pmid31515756, https://doi.org/10.1002/mp.13415, 10.1371/journal.pmed.1002699] in this domain in the recent past. Although these methods have shown excellent results, they implicitly associate denoising with operations at a global scale rather than leveraging local visual information. We argue that we can benefit from the patch-embedding operations that form the basis of a vision transformer [dosovitskiy2021image]. Recently, Vision Transformers (ViT) have shown great success in many computer vision tasks, including image restoration [wang2021uformer], but they have not yet been exploited on medical image datasets.
To the best of our knowledge, this is the first work that utilizes transformers for medical image denoising. The major contributions of this paper are as follows:
We introduce a novel architecture, Eformer, for edge-enhancement-based medical image denoising using transformers. We incorporate learnable Sobel filters for edge enhancement, which improves the performance of our overall architecture. We outperform existing state-of-the-art methods and show how transformers can be useful for medical image denoising.
We conduct extensive experimentation on training our network following the residual learning paradigm. To prove the effectiveness of residual learning in image denoising, we also show results using a deterministic approach in which our model directly predicts denoised images. In medical image denoising, residual learning clearly outperforms the traditional approach, where directly predicting the denoised image amounts to learning a near-identity mapping.
This paper is structured as follows: in Section 2 we discuss previous work on image denoising and the use of transformers in related tasks; in Section 3 we explain our approach in detail; in Section 4 we compare our results with existing methods; and Section 5 presents conclusions and future directions.
2 Related Work
Low-dose CT (LDCT) image denoising is an active research area within medical image denoising due to its valuable clinical usability. Given the limited amount of data and the consequently low accuracy of conventional approaches [Kaur2018ARO], data-efficient deep learning approaches have huge potential in this domain. The pioneering work of Chen [Chen:17] showed that a simple Convolutional Neural Network (CNN) can suppress the noise of LDCT images. The models proposed in [Enc-decoder, redcnn, cpce] show that an encoder-decoder network is efficient for medical image denoising: REDCNN [redcnn] adds shortcut connections to a residual encoder-decoder network, and CPCE [cpce] uses conveying-path connections. Among fully convolutional networks, [pmid31515756] uses dilated convolutions with different dilation rates, whereas [Jifara2019] uses simple convolution layers with residual learning. GAN-based models such as [pmid29870364, https://doi.org/10.1002/mp.13415] use WGAN [arjovsky2017wasserstein] with the Wasserstein distance and a perceptual loss for image denoising.
Recently, transformer-based architectures have also achieved huge success in the computer vision domain, pioneered by ViT (Vision Transformer) [dosovitskiy2021image], which successfully utilized transformers for image classification. Since then, many transformer-based models have shown successful results on low-level vision tasks including image super-resolution [yang2020learning], denoising [wang2021uformer], deraining [chen2021pretrained], and colorization [kumar2021colorization]. Our work is also inspired by one such denoising transformer, Uformer [wang2021uformer], which employs non-overlapping window-based self-attention and depth-wise convolution in the feed-forward network to efficiently capture local context. We integrate the edge enhancement module of [Liang_2020] and a Uformer-like architecture in an efficient, novel manner that helps us achieve state-of-the-art results.
3 Our Approach
In this section, we provide a detailed description of the components involved in our implementation.
3.1 Sobel-Feldman Operator
Inspired by [Liang_2020], we use the Sobel–Feldman operator [article], also called the Sobel filter, in our edge enhancement block. The Sobel filter is widely used in edge detection algorithms as it emphasizes edges. Originally the operator had two variants, vertical and horizontal, but we also include diagonal versions similar to [Liang_2020] (see Supplementary Material). Sample results on an edge-enhanced CT image are shown in Figure 2. The set of image feature maps containing edge information is efficiently concatenated with the input projection and other parts of the network (refer to Figure 1).
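As an illustrative sketch (not the authors' exact code), the four 3x3 Sobel-Feldman kernels described above, horizontal, vertical, and the two diagonal variants, can be written and applied as follows:

```python
# Four 3x3 Sobel-Feldman kernels: horizontal, vertical, and two
# diagonal variants. A small valid-mode cross-correlation is included
# to show how an edge map would be produced from a grayscale slice.
import numpy as np

SOBEL_H = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=np.float32)
SOBEL_V = SOBEL_H.T                      # responds to vertical edges
SOBEL_D1 = np.array([[-2, -1,  0],
                     [-1,  0,  1],
                     [ 0,  1,  2]], dtype=np.float32)
SOBEL_D2 = np.rot90(SOBEL_D1)            # the other diagonal

def filter2d(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2D cross-correlation with a 3x3 kernel."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.float32)
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i+3, j:j+3] * kernel)
    return out
```

In the actual network the kernel weights are additionally scaled by a learnable parameter (Section 8); here they are fixed constants for clarity.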
3.2 Transformer based Encoder-Decoder
Denoising autoencoders [redcnn, cpce, Enc-decoder], fully convolutional networks [Liang_2020, Jifara2019, pmid31515756], and GANs [pmid29870364, https://doi.org/10.1002/mp.13415] have been successful in medical image denoising, but transformers have not yet been explored for this task, despite their success in other computer vision tasks. Our novel network, Eformer, is a step in that direction. We take inspiration from Uformer [wang2021uformer] for this work. At every encoder and decoder stage, convolutional feature maps are passed through a locally-enhanced window (LeWin) transformer block, which integrates non-overlapping window-based Multi-head Self-Attention (W-MSA) with a Locally-enhanced Feed-Forward Network (LeFF) (see Supplementary Material).
Here, LN denotes layer normalization. As shown in Figure 1, the transformer block is applied before the LC2D block in each encoding stage and after the LC2U block in each decoding stage, and it also serves as the bottleneck layer.
3.3 Downsampling & Upsampling
Pooling layers are the most common way of downsampling the input signal in a convolutional network. They work well in image classification because they capture the essential structural details, but at the cost of losing finer details, which we cannot afford in our task. Hence we choose strided convolutions in our downsampling layer, with fixed kernel size, stride, and padding.
Upsampling can be thought of as unpooling, the reverse of pooling, using simple techniques such as nearest-neighbor interpolation. In our network, we use transpose convolutions [dumoulin2018guide]. A transpose convolution reconstructs the spatial dimensions and learns its own parameters just like a regular convolutional layer. The issue with transpose convolutions is that they can cause checkerboard artifacts, which are undesirable for image denoising. [odena2016deconvolution] states that, to avoid uneven overlap, the kernel size should be divisible by the stride. Hence, in our upsampling layer, we choose a kernel size that is divisible by the stride.
3.4 Residual Learning
The goal of residual learning is to implicitly remove the latent clean image in the hidden layers. We input a noisy image $x = y + n$ to our network, where $x$ is the noisy (in our case, low-dose) image, $y$ is the ground truth, and $n$ is the residual noise. Rather than directly outputting the denoised image $\hat{y}$, the proposed Eformer predicts the residual image $\hat{n}$, i.e., the difference between the noisy image and the ground truth. According to [he2015deep], when the original mapping is close to an identity mapping, the residual mapping is much easier to optimize. Discriminative denoising models aim to learn a mapping function $F(x) \approx y$, whereas we adopt the residual formulation and train our network to learn a residual mapping $R(x) \approx n$, obtaining $\hat{y} = x - R(x)$.
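The residual formulation reduces to a single subtraction at inference time. In the sketch below, `R` is a stand-in for the trained network, not the real model:

```python
# Residual-learning inference: the network predicts the noise, and the
# denoised image is recovered by subtracting it from the input.
import numpy as np

def denoise_residual(x: np.ndarray, R) -> np.ndarray:
    """Given noisy input x and a residual predictor R, return x - R(x)."""
    return x - R(x)

# Toy illustration: if the noise is a constant offset and R predicts it
# perfectly, subtraction recovers the clean image exactly.
clean = np.arange(9.0).reshape(3, 3)
noisy = clean + 0.5
recovered = denoise_residual(noisy, lambda x: np.full_like(x, 0.5))
```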
3.5 Loss Functions
As part of the optimization process, we employ multiple loss functions to achieve the best possible results. We first use the Mean Squared Error (MSE), which computes the pixelwise distance between the output and the ground-truth image, defined as follows.
However, MSE tends to create unwanted artifacts such as over-smoothing and image blur. To overcome this, we additionally employ a ResNet [he2015deep] based Multi-Scale Perceptual (MSP) loss [Liang_2020], described by the following equation.
A ResNet-50 backbone is utilized as the feature extractor. To be specific, the pooling layers of a ResNet-50 pretrained on the ImageNet dataset [5206848imagenet] are removed, retaining the convolutional blocks, whose weights are then frozen. To calculate the perceptual loss, the denoised output $\hat{y} = x - R(x)$ (as described in Section 3.4) and the ground truth $y$ are passed to the extractor. Feature maps are then extracted from four stages of the backbone, as done in [Liang_2020]. This perceptual loss, in combination with MSE, accounts for per-pixel similarity as well as overall structural information. Our final objective is as follows,
where the weighting coefficients are pre-defined constants.
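A hedged NumPy sketch of this combined objective follows. The feature extractor `extract` and the weights `w_mse` / `w_msp` are placeholders; the paper's exact constants and the frozen ResNet-50 are not reproduced here:

```python
# Combined objective: pixelwise MSE plus a multi-scale perceptual term
# averaged over feature maps from several backbone stages.
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a - b) ** 2))

def msp_loss(pred: np.ndarray, gt: np.ndarray, extract) -> float:
    """Average MSE over feature maps returned by a frozen extractor."""
    feats_p, feats_g = extract(pred), extract(gt)
    return sum(mse(p, g) for p, g in zip(feats_p, feats_g)) / len(feats_p)

def total_loss(pred, gt, extract, w_mse=1.0, w_msp=0.1) -> float:
    # w_mse / w_msp are illustrative weights, not the paper's constants.
    return w_mse * mse(pred, gt) + w_msp * msp_loss(pred, gt, extract)
```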
3.6 Overall Network Architecture
Composing the aforementioned modules, our pipeline can be described as follows. An input image is first passed through a Sobel filter followed by a GeLU activation [hendrycks2020gaussian] to produce an edge-enhanced map $S$. In each encoding stage, we pass the input through a LeWin transformer block, followed by concatenation with $S$ and subsequent convolution operations, similar to [Liang_2020], to produce an encoded feature map. The feature map, along with $S$, is then downsampled using the procedure described in Section 3.3. At the bottleneck, the encoded feature map is passed to another LeWin transformer block, after which it is decoded through the same number of stages as it was encoded. In each decoder stage, after deconvolution, the earlier downsampled $S$ is concatenated with the upsampled feature maps, which are then passed through a convolutional block. The decoder stage can thus be viewed as a mirror of the encoder stage, with a shared $S$. The final feature map produced after decoding is passed through an 'output projection' block to produce the desired residual; this block is a convolutional layer that simply projects the multi-channel feature map to a single-channel grayscale image. In our experiments, the depth of the LeWin block, the number of attention heads, and the number of encoder-decoder stages are fixed hyperparameters. A concise representation of the architecture, which resembles the letter 'E' (hence the name Eformer), can be seen in Figure 1.
4 Results and Discussions
This section highlights the results attained by measuring three different metrics to judge noise reduction and the quality of the reconstructed low-dose CT images. We use the following metrics for evaluation: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Root Mean Square Error (RMSE). PSNR targets noise reduction and measures reconstruction quality. SSIM is a perceptual metric that focuses on the visible structures in an image and measures visual quality. RMSE tracks the absolute pixel-to-pixel error between the two images. We compare our results, examples shown in Figure 3, with architectures that share similarities with our model in the sense that they are based on convolutional architectures. As seen in Table 1, CPCE [cpce], WGAN [arjovsky2017wasserstein], and EDCNN [Liang_2020], like ours, use a combination of commonly used losses to train their models, while REDCNN [redcnn] only uses MSE. Table 2 shows that our proposed models, Eformer and Eformer-Residual, outperform the state-of-the-art methods in both the PSNR and RMSE metrics, indicating efficient denoising, and our comparable performance in SSIM suggests that the visual quality of the image is high and important details are not lost in the reconstruction.
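Two of these metrics admit compact reference implementations, assuming images scaled to [0, 1] (so the peak value is 1.0); SSIM is omitted here because it involves windowed statistics beyond a short sketch:

```python
# RMSE and PSNR for images in [0, 1]. PSNR = 10 * log10(MAX^2 / MSE);
# identical images are reported as infinite PSNR.
import numpy as np

def rmse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.sqrt(np.mean((a - b) ** 2)))

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
    m = float(np.mean((a - b) ** 2))
    if m == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / m))
```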
To conclude, this paper presents a residual-learning-based image denoising model evaluated in the medical domain. We leverage transformers and an edge enhancement module to produce high-quality denoised images, and achieve state-of-the-art performance using a combination of multi-scale perceptual loss and the traditional MSE loss. We believe our work will encourage the use of transformers in medical image denoising. In the future, we plan to explore the capabilities of our model on a multitude of related tasks.
We want to thank the members of the Computer Vision Research Society (CVRS, https://sites.google.com/view/thecvrs) for their helpful suggestions and feedback.
7 Dataset Details
For our research, we utilize the AAPM-Mayo Clinic Low-Dose CT Grand Challenge dataset [mccollough2017low] provided by The Cancer Imaging Archive (TCIA). The dataset contains three types of CT scans, abdomen, chest, and head, collected from 140 patients (48, 49, and 42 patients respectively). The data from each patient comprises low-dose CT scans paired with the corresponding normal-dose CT scans. The low-dose CT scans are synthetic: they are generated by inserting Poisson noise into the projection data to reach a noise level of 25% of the full dose. Each CT scan is given in the DICOM (Digital Imaging and Communications in Medicine) file format, a standard that establishes rules for the exchange of medical images and associated information between different vendors, computers, and hospitals. The format meets health information exchange (HIE) standards and HL7 standards for the transmission of health-related data. A DICOM file consists of a header and image pixel-intensity data: the header stores patient demographics and study parameters in separate 'tags', and the pixel-intensity data contains the pixel data of the CT image. For training our model, we extract the image pixel data from a DICOM file into a NumPy array using the Pydicom library (https://pydicom.github.io/), and the pixel values are then scaled to the range 0 to 1 to avoid heterogeneous spans of pixel data across different CT scans.
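The preprocessing described above can be sketched as follows. The min-max scaling is our reading of "scaled from 0 to 1"; the filename in the commented snippet is hypothetical:

```python
# Scale a CT slice's pixel intensities to [0, 1] so that intensity
# ranges are comparable across different scans.
import numpy as np

def scale_to_unit(pixels: np.ndarray) -> np.ndarray:
    """Min-max scale a pixel array to [0, 1]."""
    pixels = pixels.astype(np.float32)
    lo, hi = float(pixels.min()), float(pixels.max())
    if hi == lo:                      # constant image: map to zeros
        return np.zeros_like(pixels)
    return (pixels - lo) / (hi - lo)

# Reading a slice with pydicom would look like (package required):
# import pydicom
# ds = pydicom.dcmread("slice_0001.dcm")   # hypothetical filename
# img = scale_to_unit(ds.pixel_array)
```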
8 Parameter Details and Network Training
The structure and architecture of the model have been described in Section 3.6 and Figure 1 of the main text. We use the PyTorch framework [paszke2017automatic] for our experiments. The convolutional layers are initialized using the default scheme, except for the Sobel convolutional block, whose filter parameters are constrained to follow the pattern shown in Figure 4, where the scaling factor is a learnable parameter. All experiments were run on a 16 GB NVIDIA Tesla P100 GPU. The model was trained with the ADAM [Adam] optimizer using a fixed learning rate and otherwise default parameters, with each image resized from its original resolution to a fixed input size. The results obtained are shown in Figure 6.
9 LeWin Transformer
To make our submission self-contained, we provide architecture details of the LeWin transformer block [wang2021uformer] in this supplementary material. The LeWin transformer block, pictorially represented in Figure 5, contains two core designs, described below. First, non-overlapping Window-based Multi-head Self-Attention (W-MSA), which works on low-resolution feature maps and is sufficient to learn long-range dependencies. Second, a Locally-enhanced Feed-Forward Network (LeFF), which integrates a convolution operator into a traditional feed-forward network and is vital for learning local context. In LeFF, the image patches are first passed through a linear projection layer followed by depth-wise convolutional layers; the patch features are then flattened and finally passed to another linear layer to match the dimension of the input channels. The corresponding equations are as follows.
Here, $X'_l$ and $X_l$ are the outputs of the W-MSA module and the LeFF module respectively, and LN represents layer normalization. In the W-MSA module, the given 2D feature map is split into non-overlapping windows of size $M \times M$. Self-attention is then performed on the flattened features $X^i$ of each window. Suppose the number of heads is $k$ and the head dimension is $d_k = C/k$. The subsequent computations are,
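The non-overlapping window split used by W-MSA can be sketched in a few lines of NumPy (an illustrative helper, not the authors' code):

```python
# Partition an (H, W, C) feature map into non-overlapping M x M windows,
# flattening each window's spatial positions into a token dimension so
# that self-attention can be applied within each window independently.
import numpy as np

def window_partition(x: np.ndarray, M: int) -> np.ndarray:
    """(H, W, C) -> (num_windows, M*M, C); H and W must be multiples of M."""
    H, W, C = x.shape
    assert H % M == 0 and W % M == 0, "feature map not divisible by window"
    x = x.reshape(H // M, M, W // M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)
```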
Here, $Y^i_k$ denotes the output of the $k$-th head for the $i$-th window. The outputs of all heads are concatenated and then linearly projected to obtain the final result. We formulate the attention calculation in the same manner as [wang2021uformer].
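For completeness, the per-head computation follows the standard scaled dot-product attention, softmax(Q K^T / sqrt(d)) V. A single-head NumPy sketch on one window's flattened features (projection matrices omitted):

```python
# Single-head scaled dot-product attention over one window's tokens.
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Q, K, V: (tokens, d). Returns (tokens, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (tokens, tokens)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # rows sum to 1
    return weights @ V
```

In the multi-head case this runs once per head on $d_k$-dimensional projections, and the head outputs are concatenated and linearly projected as stated above.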