X-ray computed tomography (CT) is a critical medical imaging tool in modern hospitals and clinics. However, the potential radiation risk has raised growing public concern about the use of x-ray CT [1, 5]. Lowering the radiation dose tends to significantly increase the noise and artifacts in the reconstructed images, which can compromise diagnostic information. To reduce noise and suppress artifacts in low-dose CT images, extensive efforts have been made via image post-processing. For example, the non-local means (NLM) method was adapted for CT image denoising [11]. Based on compressed sensing theory, an adapted K-SVD method was proposed in [3] to reduce artifacts in CT images. Moreover, the block-matching 3D (BM3D) algorithm was used for image restoration in several CT imaging tasks [7]. Image quality improvement was clearly demonstrated in those applications; however, over-smoothing and/or residual errors were also observed in the processed images. Despite these efforts, CT image denoising remains challenging because of the non-uniform distribution of CT imaging noise.
With the recent explosive development of deep neural networks, researchers have tried to tackle this denoising problem through deep learning. Dong et al. [6] developed a convolutional neural network (CNN) for image super-resolution and demonstrated a significant performance improvement over traditional methods. The work was then adapted for low-dose CT image denoising, where a similar performance gain was obtained. However, over-smoothing remains a problem in the denoised images, where important textural clues are often lost. The root cause of the problem is the image reconstruction error measurement used in these learning-based methods. As revealed by recent research [9, 10], using the per-pixel mean squared error (MSE) between the recovered image and the ground truth as the reconstruction loss results in over-smoothing and a lack of detail. As an algorithm tries to minimize per-pixel MSE, it can overlook image features that are critical for human perception.
In this paper, we propose a new method for CT image denoising by designing a perceptive deep CNN that relies on a perceptual loss as the objective function. During our research, we observed that minimizing the MSE between the denoised CT image and the ground truth leads to the loss of important details, even though the evaluation numbers based on the peak signal-to-noise ratio (PSNR) are excellent. That is because PSNR is a monotonic function of the per-pixel Euclidean intensity difference. Therefore, a model successfully trained to minimize MSE tends to achieve very high PSNR values. However, from the experts' point of view, the perceptual quality of the denoised images generated by such a model is not necessarily better than that of the original noisy images.
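The link between PSNR and per-pixel MSE can be made concrete with the standard definition PSNR = 10 log10(MAX^2 / MSE). The following is a minimal sketch in plain Python; the function names, the toy 2x2 images, and the peak value `max_val=1.0` are illustrative assumptions and do not come from the experiments in this paper.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized 2-D images (lists of rows)."""
    diffs = [(pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def psnr(a, b, max_val=1.0):
    """PSNR in dB; max_val is the assumed peak intensity of the image range."""
    err = mse(a, b)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / err)

# Toy images (assumed values, for illustration only).
truth = [[0.0, 0.5], [0.5, 1.0]]
close = [[0.1, 0.5], [0.5, 0.9]]   # small per-pixel error
far = [[0.2, 0.3], [0.7, 0.8]]     # larger per-pixel error

print(psnr(truth, close), psnr(truth, far))
```

Because the logarithm is monotonic, any change that lowers the MSE raises the PSNR, regardless of whether the change helps or hurts perceived structure.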
In our proposed method, instead of directly computing the MSE over pixel-to-pixel intensity differences, we compare the denoised output against the ground truth in another high-dimensional feature space, achieving denoising while keeping critical structures at the same time. We introduce a new perceptual similarity as the objective function of the CNN for CT image denoising. The rationale behind our work is two-fold. First, when a human compares two images, the perception is not performed pixel by pixel. Human vision actually extracts features from images and compares them [13]. Therefore, instead of using pixel-wise MSE, we employ another pre-trained deep CNN (the well-known VGG network) for feature extraction and compare the denoised output against the ground truth in terms of the extracted features. Second, from a mathematical point of view, CT images are not uniformly distributed in a high-dimensional Euclidean space. They more likely reside on some low-dimensional manifold. With MSE, we are not measuring the intrinsic similarity between the images, but only their superficial difference, i.e., the Euclidean distance. By comparing images using extracted features, however, we effectively project them onto a manifold and calculate the geodesic distance therein. By measuring the intrinsic similarity between images, our proposed approach can produce results with not only lower noise but also sharper details.
In this section, we first present the loss functions that we use for measuring the image reconstruction error. The proposed denoising deep network is then described.
2.1 Loss Functions
Our proposed method defines the objective loss function of the denoising CNN using feature descriptors. Let $\{\phi_i(I) \mid i = 1, \ldots, N\}$ denote $N$ different feature maps of an image $I$. Each map has the size of $h_i \times w_i \times d_i$, where $h_i$, $w_i$, and $d_i$ denote height, width and depth, respectively. The feature reconstruction loss can then be defined as

$$L_i(\hat{I}, I) = \frac{1}{h_i w_i d_i} \left\| \phi_i(\hat{I}) - \phi_i(I) \right\|_2^2, \qquad (1)$$

where $\hat{I}$ and $I$ are the denoised image and corresponding ground truth, respectively. In our work, the well-known pre-trained VGG network [14] has been used for feature extraction. Although VGG was originally trained for natural image classification, technical analysis shows that many feature descriptors learned by VGG are quite meaningful to humans [12], which suggests that it also learns general perceptual features not specific to any particular kind of image.
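As an illustration of a feature reconstruction loss of this kind (the squared Euclidean distance between feature maps, normalized by their height, width and depth), the sketch below uses a deliberately tiny stand-in for the feature extractor. The two Sobel filters followed by a ReLU in `toy_features` are an assumption made only to keep the example self-contained; they are not the VGG features used in this work, and all function names are hypothetical.

```python
def conv2d_valid(img, kernel):
    """'Valid' 2-D correlation of an image (list of rows) with a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def toy_features(img):
    """Hypothetical stand-in for a pre-trained feature extractor:
    two ReLU-rectified edge maps, i.e. a feature volume of depth d = 2."""
    maps = []
    for kernel in (SOBEL_X, SOBEL_Y):
        response = conv2d_valid(img, kernel)
        maps.append([[max(0.0, v) for v in row] for row in response])
    return maps

def feature_loss(denoised, truth, features=toy_features):
    """Squared L2 distance between feature maps, divided by h * w * d."""
    f_hat, f_ref = features(denoised), features(truth)
    d, h, w = len(f_ref), len(f_ref[0]), len(f_ref[0][0])
    total = sum((f_hat[k][r][c] - f_ref[k][r][c]) ** 2
                for k in range(d) for r in range(h) for c in range(w))
    return total / (h * w * d)

# Toy 4x4 images: a vertical step edge, a flat image, and the edge plus an offset.
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
flat = [[0.5] * 4 for _ in range(4)]
offset = [[v + 0.3 for v in row] for row in edge]

print(feature_loss(flat, edge), feature_loss(offset, edge))
```

With these toy features, the flat image is penalized because the edge is gone, whereas a uniform intensity offset leaves the edge responses essentially unchanged. This only illustrates how a feature-space distance can weigh structure differently from a per-pixel distance; the behavior of real VGG features is more complex.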
2.2 Network Architecture
Our developed network consists of two parts, the CNN denoising network and the perceptual loss calculator, as shown in Fig. 1. To learn to denoise images containing different structures and intensities, a sufficiently deep network is required. In our work, the CNN denoising network was constructed from 8 convolutional layers. Following the common practice in the deep learning community [15], small kernels were used in each convolutional layer. Thanks to the stacked structure, such a network can cover a large enough receptive field efficiently. Each of the first 7 hidden layers of the denoising network had 32 filters. The last layer generates only one feature map with a single filter.
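The remark about stacked small kernels covering a large receptive field can be checked with the usual receptive-field recursion. The sketch below is illustrative only: the kernel size of 3 and the stride of 1 are assumptions, since the text above states only that small kernels were used.

```python
def receptive_field(num_layers, kernel_size, stride=1):
    """Receptive field (in input pixels) of a stack of identical conv layers."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump  # each layer widens the field
        jump *= stride                  # spacing between neighboring outputs
    return rf

# Assumed configuration: 8 layers, 3x3 kernels, stride 1.
stacked = receptive_field(num_layers=8, kernel_size=3)
print(stacked)  # 17

# Weights per input/output channel pair: eight 3x3 kernels vs. one 17x17 kernel.
weights_stacked = 8 * 3 * 3
weights_single = 17 * 17
print(weights_stacked, weights_single)  # 72 289
```

Under these assumptions, eight stacked layers see a 17x17 neighborhood while using roughly a quarter of the weights that a single 17x17 kernel would need per channel pair.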
The second part of the network is the perceptual loss calculator, which is realized using the pre-trained VGG network [14]. A denoised output image from the first part and the ground truth image are fed into the pre-trained VGG network for feature extraction. Then, the objective loss is computed using the extracted features from a specified layer according to Eqn. (1). The reconstruction error is then back-propagated to update the weights of the denoising CNN only, while the VGG parameters are kept fixed.
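The idea of back-propagating a feature-space error through a fixed feature extractor while updating only the denoiser can be sketched with a scalar toy model. Everything here is a simplifying assumption: the "denoiser" is a single weight `w`, the frozen "feature extractor" is a fixed multiplier `a`, and the data, learning rate and step count are made up for illustration.

```python
def train_denoiser(xs, ts, a=2.0, w=0.0, lr=0.02, steps=100):
    """Gradient descent on mean((a*w*x - a*t)^2) with respect to w only.

    a plays the role of the frozen feature extractor: the error passes
    through it via the chain rule, but a itself is never updated.
    """
    for _ in range(steps):
        grad_w = sum(2.0 * a * (a * w * x - a * t) * x
                     for x, t in zip(xs, ts)) / len(xs)
        w -= lr * grad_w  # update the trainable weight only
    return w, a

# Toy data: targets are exactly half of the inputs, so the ideal w is 0.5.
inputs = [1.0, 2.0, 3.0]
targets = [0.5, 1.0, 1.5]
w_trained, a_frozen = train_denoiser(inputs, targets)
print(w_trained, a_frozen)
```

In a deep learning framework the same effect is typically obtained by excluding the feature extractor's parameters from the optimizer or giving them a zero learning rate.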
The VGG network has 16 convolutional layers, each followed by a ReLU layer, and 4 pooling layers. In our experiment, we tested the feature maps generated at the first ReLU layer before the first pooling layer, named relu1_1, and at the first and fourth ReLU layers before the third pooling layer, named relu3_1 and relu3_4, respectively. The corresponding networks are referred to as CNN-VGG11, CNN-VGG31, and CNN-VGG34, respectively.
3.1 Materials and Network Training
In our work, we trained all the networks on an NVIDIA GTX980 GPU using random samples from the cadaver CT image dataset collected at Massachusetts General Hospital (MGH) [16]. These cadavers were repeatedly scanned with a GE Discovery 750 HD scanner at different noise levels, with noise index (NI) values of 10, 20, 30, and 40, respectively. In addition, the projection data were used for CT image reconstruction with two different methods. One is the classic filtered back-projection (FBP) method; the other is a vendor-specific, model-based fully iterative reconstruction (MBIR) technique named VEO (GE Healthcare, Waukesha, WI). The MBIR technique has a strong noise suppression capability, whereas the traditional FBP method does not. In our experiment, we used the FBP reconstruction from the 30NI dataset (high noise level) as the network input and the corresponding VEO reconstruction from the 10NI dataset (low noise level) as the ground truth.
The proposed network was implemented and trained using the Caffe toolbox [8]. In the training phase, we randomly extracted 100,000 image patches from 2,600 CT images. We first trained a CNN with the same structure as shown in Fig. 1 but using the mean squared error (MSE) loss, which is named CNN-MSE. The network was trained for 1,920 epochs. Then, the CNN-MSE weights were used to initialize the CNN-VGG11, CNN-VGG31, and CNN-VGG34 networks. In our experiments, we noticed that the new networks could be trained very quickly. In some cases, only 10 epochs were enough to obtain good results, and further training did not help much.
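Random patch sampling for training can be sketched as follows. This is a plain-Python illustration, not the Caffe data pipeline used in this work; the function names, the patch size of 4, the seed, and the toy 8x8 images are all assumptions. The sketch also assumes that each noisy input image has a pixel-aligned ground truth image of the same size.

```python
import random

def crop(img, top, left, size):
    """Return the size x size window of a 2-D image (list of rows)."""
    return [row[left:left + size] for row in img[top:top + size]]

def extract_patch_pairs(image_pairs, num_patches, patch_size, seed=0):
    """Sample aligned (input, target) patches at random positions.

    image_pairs: list of (noisy, clean) images with identical dimensions.
    patch_size and seed are illustrative parameters, not values from the text.
    """
    rng = random.Random(seed)
    patches = []
    for _ in range(num_patches):
        noisy, clean = rng.choice(image_pairs)
        top = rng.randrange(len(noisy) - patch_size + 1)
        left = rng.randrange(len(noisy[0]) - patch_size + 1)
        # The same window is cut from both images so the pair stays aligned.
        patches.append((crop(noisy, top, left, patch_size),
                        crop(clean, top, left, patch_size)))
    return patches

# Toy data: one 8x8 "noisy" image and a "clean" image offset by 100.
noisy_img = [[r * 8 + c for c in range(8)] for r in range(8)]
clean_img = [[v + 100 for v in row] for row in noisy_img]
pairs = extract_patch_pairs([(noisy_img, clean_img)], num_patches=5, patch_size=4)
print(len(pairs), len(pairs[0][0]), len(pairs[0][0][0]))
```

Fixing the seed makes the sampled patch set reproducible, which is convenient when comparing networks trained with different loss functions on the same patches.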
3.2 Experimental Results
At the validation stage, whole CT images were used as input. We tested the networks using 500 images from the whole-body scans of two cadavers. For comparison, we also tested the classic BM3D method [4] and the recent SRCNN-based approach [2, 6], referred to as CNN-MSE.
Figs. 2 and 4 show two examples of the denoised images. To make the differences clearer, the ROIs indicated by the red rectangles in those figures are zoomed and shown in Figs. 3 and 5, respectively. From these images, it can be seen that the images recovered by CNN-MSE and CNN-VGG11 are over-smoothed, with some details missing. In contrast, CNN-VGG31 and CNN-VGG34 yielded images with better contrast that are more similar to the VEO images. BM3D produced different visual effects on different images. In Fig. 3(h), the nodule indicated by the red arrow was smoothed out, while the streak artifacts were retained in Fig. 5(h). This can be explained by the non-uniformity of the image noise. In addition, although the low-contrast lesions (indicated by the red arrows in Figs. 3 and 5) can be seen in the FBP30NI and VEO30NI images, the blocky and pixelated appearance of those images makes them unacceptable for diagnostic use. The images denoised by CNN-VGG31 provide the best delineation of lesions relative to the VEO10NI ground truth while improving the overall image appearance, which may improve diagnostic confidence.
The traditional metrics of PSNR and SSIM were also used for evaluation, as shown in Table 1. PSNR is a monotonic function of the per-pixel MSE loss. As measured by PSNR, a model trained to minimize the per-pixel loss is therefore expected to outperform a model trained to minimize the feature reconstruction loss. Thus, it is not surprising that CNN-MSE achieves higher PSNR and SSIM than CNN-VGG31 and CNN-VGG34. However, these quantitative values are close, and the results of CNN-VGG31 and CNN-VGG34 are visually much more appealing. Overall, these two networks perform better than CNN-MSE and CNN-VGG11.
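For reference, the SSIM index combines the means, variances and covariance of two images. The sketch below computes a simplified, single-window (global) SSIM in plain Python. It is an assumption-laden illustration: standard SSIM implementations average the index over local windows, and the toy images and peak value `max_val=1.0` are made up rather than taken from Table 1.

```python
def global_ssim(a, b, max_val=1.0):
    """Single-window SSIM over whole images (a simplification of local SSIM)."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Stabilizing constants from the usual SSIM definition (K1=0.01, K2=0.03).
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

# Toy images (assumed values, for illustration only).
reference = [[0.0, 0.2], [0.8, 1.0]]
inverted = [[1.0, 0.8], [0.2, 0.0]]   # same statistics, opposite structure

print(global_ssim(reference, reference), global_ssim(reference, inverted))
```

Unlike PSNR, the index depends on the covariance between the two images, so an image with the same mean and variance but reversed structure scores much lower.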
In our experiments, we tested three feature maps of the VGG network. Generally speaking, lower-level layers of VGG extract primitive features, while higher-level layers yield more sophisticated features. This explains why CNN-VGG11 has a visual effect similar to that of CNN-MSE, while CNN-VGG31 and CNN-VGG34 preserve more details.
As for the computational cost, it took about 16 hours to train the CNN-MSE network and 10 minutes to fine-tune each of the CNN-VGG networks on a GTX980 GPU. After the networks were trained, restoring a single image took less than 5 seconds. Thus, compared with the typical time of CT image reconstruction, the computational cost of image denoising with deep neural networks is unlikely to be a problem in clinical applications.
In this work, we have proposed a convolutional neural network for CT image denoising with a perceptual loss measure, which is defined as the MSE between the feature maps of the CNN output and those of the ground truth. The experimental results show that the proposed network increases the images' PSNR and SSIM and that the perceptual regularization helps prevent the images from being over-smoothed and losing structural details. In our future work, we will refine, validate, and optimize our perceptive CNN with a larger dataset. More importantly, we will perform a reader study to compare the radiological reading reports with our deep learning results.
- [1] Brenner, D.J., Hall, E.J.: Computed tomography - an increasing source of radiation exposure. New England Journal of Medicine 357(22), 2277–2284 (2007)
- [2] Chen, H., Zhang, Y., Zhang, W., Liao, P., Li, K., Zhou, J., Wang, G.: Low-dose CT denoising with convolutional neural network (2016), arXiv:1610.00321
- [3] Chen, Y., Yin, X., Shi, L., Shu, H., Luo, L., Coatrieux, J.L., Toumoulin, C.: Improving abdomen tumor low-dose CT images using a fast dictionary learning based processing. Physics in Medicine and Biology 58(16), 5803 (2013)
- [4] Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: BM3D image denoising with shape-adaptive principal component analysis. In: SPARS (2009)
- [5] De Gonzalez, A.B., Darby, S.: Risk of cancer from diagnostic x-rays: estimates for the UK and 14 other countries. The Lancet 363(9406), 345–351 (2004)
- [6] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016)
- [7] Feruglio, P.F., Vinegoni, C., Gros, J., Sbarbati, A., Weissleder, R.: Block matching 3D random noise filtering for absorption optical projection tomography. Physics in Medicine and Biology 55(18), 5401 (2010)
- [8] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding (2014), arXiv:1408.5093
- [9] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution (2016), arXiv:1603.08155
- [10] Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., Shi, W.: Photo-realistic single image super-resolution using a generative adversarial network (2016), arXiv:1609.04802
- [11] Ma, J., Huang, J., Feng, Q., Zhang, H., Lu, H., Liang, Z., Chen, W.: Low-dose computed tomography image restoration using previous normal-dose scan. Medical Physics 38(10), 5713–5731 (2011)
- [12] Mahendran, A., Vedaldi, A.: Visualizing deep convolutional neural networks using natural pre-images. Int J Comput Vis 120, 233–255 (2016)
- [13] Nixon, M., Aguado, A.S.: Feature Extraction & Image Processing. Academic Press, 2nd edn. (2008)
- [14] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014), arXiv:1409.1556
- [15] Srinivas, S., Sarvadevabhatla, R.K., Mopuri, K.R., Prabhu, N., Kruthiventi, S.S.S., Babu, R.V.: A taxonomy of deep convolutional neural nets for computer vision. CoRR (2016), arXiv:1601.06615
- [16] Yang, Q., Kalra, M.K., Padole, A., Li, J., Hilliard, E., Lai, R., Wang, G.: Big data from CT scanning. JSM Biomedical Imaging Data Papers 2(1), 1003 (2015)