Medical images provide rich visual insight into internal organs that are otherwise hidden from view. They help doctors make correct diagnoses and serve as valuable reference material for treatment. Moreover, with the rapid development of artificial intelligence (AI), many breakthrough applications have been built on medical image data [ronneberger2015u, gulshan2016development, rajpurkar2017chexnet, esteva2017dermatologist, yala2019deep].
However, obtaining medical images, especially endoscopic or throat images, is never an easy task. In practice, such images often suffer from noise, haze, uneven illumination, and lack of focus because of the difficult shooting conditions inside the body, and these defects can greatly affect the diagnostic process. Several studies that apply machine learning to endoscopic and throat images report that their systems are highly sensitive to image conditions [askarian2019novel, tobias2019throat, he2018hookworm, hirasawa2018application, takiyama2018automatic]. Poor image quality can easily lead to misdetection, which makes automated diagnosis very challenging.
We are developing a special camera device to support doctors in diagnosing oral and throat diseases. In our experience, the environment inside a patient’s palate contains many factors that reduce image quality, such as haze caused by the patient’s breath on the camera and lack of focus. Fig. 1 shows examples of throat images with undesirable quality, which hinder doctors in making medical decisions. A method that improves the quality of such medical images is therefore essential to support diagnosis. We believe that this problem can be addressed by applying image dehazing techniques.
Recent works based on convolutional neural networks (CNNs) have shown tremendous success in recovering image quality from very dense haze and noise. These dehazing techniques can be divided into two major approaches: the supervised approach [cai2016dehazenet, li2017aod, ren2018gated, yang2018proximal, yuan2019single] and the unsupervised approach [engin2018cycle, yang2018towards, dudhane2019cdnet, huang2019towards]. The former normally achieves compelling results thanks to the modeling power of CNNs. However, it requires a large number of paired ground-truth images for supervision, which are rarely available in practice. The latter offers a more practical setting for image dehazing because it removes the need for paired training data. These unsupervised methods are all built on the success of CycleGAN [zhu2017unpaired], an image-to-image translation method based on generative adversarial networks (GANs) [goodfellow2014generative]. CycleGAN introduced the cycle-consistency constraint: an image translated from one domain to the other should be identical to its original form when translated back.
Despite their impressive results, these methods have two main problems when applied to our practical throat image data. First, supervised and several unsupervised studies still assume that the hazy training images contain a single kind of haze generated by the atmospheric scattering model [narasimhan2000chromatic, narasimhan2002vision]. They may therefore be impractical when the disturbance deviates from this assumption (e.g., when the shooting environment changes because of differences in equipment, camera settings, or protocols). Second, naïve CycleGAN is reported not to work well on high-resolution data [li2018unsupervised], and it does not generate output of sufficient resolution for our purposes. We note that [engin2018cycle] suggested applying Laplacian upscaling to the output of CycleGAN to obtain higher-resolution results. However, the resulting images are normally overly smooth and sometimes fail to represent detailed structures accurately. These problems make it difficult for doctors to diagnose from throat images. A framework that generates clean, high-resolution throat images from original low-quality (LQ) images could be a great tool for supporting doctors in making medical decisions.
In this paper, we propose a medical image improvement framework named MIINet to help doctors make medical diagnostic decisions. Our MIINet consists of two modules: the image dehazing module (IDM) and the image super-resolution (ISR) module. The IDM is based on the CycleGAN [zhu2017unpaired] model and aims to translate images from the LQ domain to the high-quality (HQ) domain. In this work, we introduce a new loss term based on the perceptual loss function [johnson2016perceptual] that aims to preserve the attributes of the original input image, such as structure, color, and texture. This term is essential because that original information is crucial in medical diagnosis. Besides the IDM, we introduce a CNN-based ISR module that enlarges the output of our IDM to obtain high-resolution results. The ISR module is optional and is used when doctors need to enlarge images to see more diagnostic detail.
Our contributions can be summarized as follows:
(1) We propose MIINet, which improves the quality of practical LQ throat images while preserving the structure of the involved areas.
(2) With the introduction of the ISR module, our MIINet is able to produce high-resolution throat images, making disease diagnosis easier for doctors.
(3) The dehazed throat images obtained by our MIINet achieve a significantly higher mean doctor opinion score (MDOS) of 4.11, compared with 2.36 for the original LQ images, in assessing the quality and the reproducibility of the images.
2 Proposed Method – MIINet
The proposed MIINet consists of two modules: (1) the image dehazing module (IDM), and (2) the image super-resolution (ISR) module. Fig. 2 shows the schematic of our framework. Given an original LQ throat image as input, our IDM calibrates and converts that image into an HQ clean image. Since the output of our IDM is relatively small, it is fed to the ISR module to be enlarged to a higher resolution with upscaling, which gives doctors a better visual inspection.
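The two-stage flow described above can be sketched as a simple composition. This is only an illustrative sketch: `improve_image`, `toy_idm`, and `toy_isr` are hypothetical names, and the stand-in functions are toy placeholders (an identity "dehazer" and a 2x nearest-neighbour upscaler), not the trained networks or the scaling factor used in this work.

```python
from typing import Callable, List, Optional

Image = List[List[float]]  # toy grayscale image stored as rows of pixel values


def improve_image(
    lq_image: Image,
    idm: Callable[[Image], Image],
    isr: Optional[Callable[[Image], Image]] = None,
) -> Image:
    """Run the dehazing module, then the super-resolution module if one is given."""
    clean = idm(lq_image)
    return isr(clean) if isr is not None else clean


def toy_idm(img: Image) -> Image:
    """Placeholder for the IDM (assumption): returns an unchanged copy."""
    return [row[:] for row in img]


def toy_isr(img: Image) -> Image:
    """Placeholder for the ISR module (assumption): 2x nearest-neighbour upscaling."""
    out: Image = []
    for row in img:
        wide = [p for p in row for _ in range(2)]  # repeat each pixel horizontally
        out.append(wide)
        out.append(wide[:])  # repeat each row vertically
    return out
```

Passing `isr=None` reflects the optional role of the ISR module: the dehazed image is returned as-is unless enlargement is requested.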
2.1 The Image Dehazing Module – IDM
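Our IDM is an improved version of CycleGAN [zhu2017unpaired] for unpaired throat image improvement. It consists of a mapping function $G: X \rightarrow Y$ that translates images from the source domain ($X$) to the target domain ($Y$), and an inverse mapping function $F: Y \rightarrow X$ that enforces cycle-consistency. Corresponding to the two generators are two adversarial discriminators $D_X$ and $D_Y$, where $D_Y$ tries to discriminate the real image $y$ from the generated image $\hat{y}$ with $\hat{y} = G(x)$. Similarly, $D_X$ distinguishes the real image $x$ from the generated image $\hat{x} = F(y)$. In this work, we assume $X$ is the LQ image domain while $Y$ is the HQ image domain.
Fig. 3 shows the dataflow of the translation $X \rightarrow Y$. Given an LQ image $x$, the generator $G$ transforms it into an HQ clean image $\hat{y}$. The images $\hat{y}$ and $y$ are then fed into the discriminator $D_Y$. Note that the translation $Y \rightarrow X$ is symmetric to the translation $X \rightarrow Y$.
Based on the GAN literature [goodfellow2014generative], the adversarial losses for the mapping functions $G$ and $F$ are $\mathcal{L}_{GAN}(G, D_Y, X, Y)$ and $\mathcal{L}_{GAN}(F, D_X, Y, X)$, respectively, where:
$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y}[\log D_Y(y)] + \mathbb{E}_{x}[\log(1 - D_Y(G(x)))] \tag{1}$$
$$\mathcal{L}_{GAN}(F, D_X, Y, X) = \mathbb{E}_{x}[\log D_X(x)] + \mathbb{E}_{y}[\log(1 - D_X(F(y)))] \tag{2}$$
Note here that $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$. The cycle-consistency loss is formulated as follows:
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x}[\lVert F(G(x)) - x \rVert_1] + \mathbb{E}_{y}[\lVert G(F(y)) - y \rVert_1] \tag{3}$$
As mentioned earlier, preserving the original attributes of input images (i.e., structure, texture, color) is crucial in medical diagnosis. Therefore, we introduce a new loss term $\mathcal{L}_{perceptual}$ based on the perceptual loss [johnson2016perceptual]. To ensure that the attributes of the original input and the output are as similar as possible, we minimize the L1 distance between the features that a CNN model extracts from the input and from the generated image. Based on our preliminary experiments, we use the pooling layer of the ImageNet [deng2009imagenet] pre-trained VGG16 [Simonyan15] model to extract the features. $\mathcal{L}_{perceptual}$ is defined as:
$$\mathcal{L}_{perceptual}(G, F) = \mathbb{E}_{x}[\lVert \phi(G(x)) - \phi(x) \rVert_1] + \mathbb{E}_{y}[\lVert \phi(F(y)) - \phi(y) \rVert_1] \tag{4}$$
where $\phi$ denotes the features extracted from the VGG16 model. Finally, our full objective function can be summed up as:
$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda_{cyc}\,\mathcal{L}_{cyc}(G, F) + \lambda_{per}\,\mathcal{L}_{perceptual}(G, F) \tag{5}$$
where $\lambda_{cyc}$ and $\lambda_{per}$ are the coefficients that control the balance of the different loss terms.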
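To make the weighted combination of loss terms concrete, the following is a minimal sketch of the generator-side objective on toy data. Everything in it is an assumption for illustration: `G` and `F` stand for the two generators, `toy_features` is a stand-in for the VGG16 feature extractor, the adversarial terms are passed in as precomputed numbers, the perceptual term is applied in both translation directions, and the default weights 10.0 and 1.5 (the values reported in Section 3.2) are assigned to the cycle and perceptual terms in that order.

```python
from typing import Callable, List

Vector = List[float]  # toy flattened image or feature map


def l1_mean(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length vectors."""
    assert len(a) == len(b)
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def toy_features(x: Vector) -> Vector:
    """Placeholder for a pretrained feature extractor: averages neighbouring pairs."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def generator_objective(
    x: Vector,
    y: Vector,
    G: Callable[[Vector], Vector],
    F: Callable[[Vector], Vector],
    adv_g: float,
    adv_f: float,
    phi: Callable[[Vector], Vector] = toy_features,
    lambda_cyc: float = 10.0,
    lambda_per: float = 1.5,
) -> float:
    """Adversarial terms plus weighted cycle-consistency and perceptual terms."""
    fake_y, fake_x = G(x), F(y)
    # Cycle-consistency: translating forth and back should recover the input.
    cyc = l1_mean(F(fake_y), x) + l1_mean(G(fake_x), y)
    # Perceptual term: features of each output should stay close to its input.
    per = l1_mean(phi(fake_y), phi(x)) + l1_mean(phi(fake_x), phi(y))
    return adv_g + adv_f + lambda_cyc * cyc + lambda_per * per
```

With identity generators both the cycle and perceptual terms vanish, so the perceptual weight only matters when a generator actually changes the feature content of its input, which is the behaviour the extra term is meant to penalize.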
2.2 The Image Super-Resolution Module – ISR
Our ISR module is a GAN-based single image super-resolution (SISR) model, which aims to learn an end-to-end mapping function to recover a high-resolution (HR) image from a single low-resolution (LR) image [dong2015image, ledig2017photo, wang2018esrgan]. Many SISR models have also been proposed and widely used in many practical applications ranging from medical imaging [dalca2018medical, zhao2019channel], security and surveillance [bulat2018super], satellite imaging [rangnekar2017aerial], to agriculture [cap2019super].
In this work, we propose an SISR module, named throat image super-resolution (ISR), for enlarging the clean throat images output by our IDM. Our ISR module is built on ESRGAN [wang2018esrgan], an excellent SR model that generates results of realistic perceptual quality and has achieved impressive performance on many benchmarks [blau20182018]. Similar to ESRGAN, our ISR module consists of two networks: a generator, which generates super-resolved images from LR images, and a discriminator, which discriminates the HR images from the super-resolved ones. We use the architecture of the generator, but we design our network to take the input of instead of the original as in ESRGAN, since this setting helped our model gain slightly better performance in our preliminary experiments. The two networks are then trained together in an alternating manner to solve a minimax problem [goodfellow2014generative]. For more technical training details, please refer to the original ESRGAN article [wang2018esrgan].
3 Experimental Results
3.1 Throat Image Dataset
In this work, we collected 1,600 throat images from over 160 patients, including both ill-conditioned images and clean images. They were taken by a special camera designed for throat diagnosis and each of which has the size of pixels. Experts were asked to manually inspect and carefully select 200 images affected by haze and lack of focus (see Fig. 1) as low-quality images, and we refer to them as the LQ Throat dataset. Note that those images are the most difficult cases for physicians to diagnose. From this LQ Throat dataset, 100 images are used for training and the other 100 for testing. The remaining 1,400 images are clean and of high quality. We refer to them as the HQ Throat dataset.
3.2 Training The IDM
Since the LQ Throat and HQ Throat datasets differ greatly in size, we randomly selected 100 images from the HQ Throat dataset (i.e., the same number as the 100 LQ training images) to train our IDM. We then applied a combination of data augmentation techniques, such as horizontal flip, random scale, and random resize, to both datasets beforehand. Since the IDM (like other image-to-image translation GAN models such as CycleGAN) cannot handle high-resolution data due to the limitation of available GPU memory, we resized input images to the size of pixels before training. As a result, each dataset has 2,300 images after data augmentation.
We applied the same training procedure as described in CycleGAN [zhu2017unpaired] to train our MIINet. The Adam optimizer [kingma2015adam] was used to train the network. We set the two loss coefficients in Eq. (5) to 10.0 and 1.5, respectively. The training process finished after 400 epochs. Please refer to [zhu2017unpaired] for more training details.
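The balancing and augmentation steps can be sketched as follows. This is a hedged illustration only: `balance_hq` and `horizontal_flip` are hypothetical helper names, image identifiers are plain integers, the fixed seed is an arbitrary choice, and only the horizontal flip is shown (random scale and resize are omitted).

```python
import random
from typing import List

Image = List[List[float]]  # toy image stored as rows of pixel values


def balance_hq(hq_ids: List[int], n_lq: int, seed: int = 0) -> List[int]:
    """Randomly pick as many HQ images as there are LQ training images."""
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return rng.sample(hq_ids, n_lq)  # sampling without replacement


def horizontal_flip(img: Image) -> Image:
    """Mirror an image left to right by reversing every row."""
    return [row[::-1] for row in img]


# 1,400 HQ images are available, 100 LQ images are used for training.
selected = balance_hq(list(range(1400)), n_lq=100)
```

Sampling without replacement keeps the two unpaired training sets the same size before augmentation is applied to both.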
3.3 Training The ISR Module
In this paper, we built our ISR module to super-resolve the output of the IDM. The scaling factor of was used for enlarging HR from LR throat images. We used the HQ Throat dataset described in Section 3.1 to train our ISR model. During training, the HR images were obtained by randomly cropping from training images with the size of . The LR images areas we observed this helped our ISR module to generate better visual results. Note again that the ISR module is optional and is used when doctors need to enlarge images to see more diagnostic detail. Since the HR (clean) version of the input LR throat images is unavailable, we do not report numerical results for our ISR module in this paper. The training details are the same as in the ESRGAN literature [wang2018esrgan], and training was completed after 400 epochs.
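The construction of HR/LR training pairs by random cropping can be sketched as below. This is a sketch under stated assumptions, not the procedure used in this work: the helper names are hypothetical, the crop size and scale passed by the caller are arbitrary illustration values, and box averaging is used as the downsampling method only because it is easy to write without external libraries.

```python
import random
from typing import List, Tuple

Image = List[List[float]]  # toy grayscale image stored as rows of pixel values


def random_crop(img: Image, size: int, rng: random.Random) -> Image:
    """Cut a random size x size patch out of the image."""
    top = rng.randint(0, len(img) - size)  # randint includes both endpoints
    left = rng.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]


def box_downsample(img: Image, scale: int) -> Image:
    """Shrink an image by averaging each scale x scale block (assumed method)."""
    h, w = len(img) // scale, len(img[0]) // scale
    return [
        [
            sum(img[r * scale + i][c * scale + j]
                for i in range(scale) for j in range(scale)) / (scale * scale)
            for c in range(w)
        ]
        for r in range(h)
    ]


def make_pair(img: Image, hr_size: int, scale: int,
              rng: random.Random) -> Tuple[Image, Image]:
    """Return an (LR, HR) training pair cropped from one clean image."""
    hr = random_crop(img, hr_size, rng)
    return box_downsample(hr, scale), hr
```

For example, `make_pair(image, 8, 4, random.Random(0))` on a 16 x 16 toy image yields an 8 x 8 HR patch and its 2 x 2 LR counterpart.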
3.4 The Mean Doctor Opinion Score
Since there are no quantitative metrics for assessing throat image quality for diagnostic purposes, we introduce a new evaluation criterion called the mean doctor opinion score (MDOS), based on the mean opinion score, to evaluate the quality of throat images. Specifically, only experienced doctors were asked to give the scores. We asked each doctor to assess a given image on two aspects: the quality (i.e., how good is this image for diagnosis?) and the reproducibility (i.e., how well does this image preserve the structure, texture, and color of the original throat image?). We should note that scores for the original LQ and HQ throat images are based on the quality aspect only.
We asked three specialized doctors to assign a score from 1 (bad quality) to 5 (excellent quality) to the throat images. The doctors rated three versions of each of the 100 test images from the LQ Throat dataset (i.e., the original LQ throat image and the images generated by CycleGAN and MIINet, respectively) and an additional 100 HQ images. Each doctor thus rated 400 instances.
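The rating workload and the score aggregation can be sketched as follows. This is a minimal sketch that assumes MDOS is a plain average over all ratings from all doctors; the function name `mdos` is hypothetical and the scores in the usage example are made-up toy values, not results from this study.

```python
from statistics import mean
from typing import Dict, List


def mdos(scores_by_doctor: Dict[str, List[int]]) -> float:
    """Average all 1-5 ratings over every doctor and every image."""
    all_scores = [s for scores in scores_by_doctor.values() for s in scores]
    if any(not 1 <= s <= 5 for s in all_scores):
        raise ValueError("scores must lie between 1 and 5")
    return mean(all_scores)


# Three versions of each of the 100 test images, plus 100 HQ reference images.
ratings_per_doctor = 3 * 100 + 100

# Toy usage with invented scores for two images rated by three doctors.
toy_score = mdos({"doctor_a": [4, 5], "doctor_b": [3, 4], "doctor_c": [4, 4]})
```

The count `ratings_per_doctor` reproduces the 400 instances each doctor rated.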
For comparison purposes, we also trained a CycleGAN model and evaluated its dehazed images. Comparisons of the original LQ images, the images generated by CycleGAN and MIINet, and the HQ images are shown in Fig. 4. Our proposed MIINet successfully generates clean versions of the original LQ images and preserves the original attributes (i.e., structure, color, texture) much better than the CycleGAN model. Note that the HQ images provided in the examples have no association with the rest of the images; we evaluated them as a reference to give a more intuitive understanding of the MDOS in our experiment. Our MIINet also significantly improved the MDOS over the LQ images and is better than CycleGAN, as shown in Table 1 and Fig. 5.
We confirmed the effectiveness of our MIINet for supporting doctors in image-based throat diagnosis using the MDOS test. From the results in Figs. 4 and 5 and Table 1, it is clear that the original LQ images yield the lowest MDOS since they are affected by negative factors such as haze, uneven illumination, and lack of focus, which make it difficult for doctors to make their decisions.
As for CycleGAN, although it yields much better visual quality than the original LQ images, its score distribution differs significantly from that of our MIINet (see Fig. 5) because CycleGAN could not preserve the original attributes (i.e., structure, color, texture) of the LQ throat images. The visual results in Fig. 4 show that CycleGAN either changes the color or generates structures and shapes that differ considerably from the input images. This is because the original CycleGAN only learns to generate images that look close to samples from the target domain and has no mechanism to preserve those original attributes. We should note that the color distribution of the HQ Throat dataset is quite different from that of the LQ Throat dataset. Thus, CycleGAN generated outputs whose color is similar to that of the target domain. Keeping a similar structure and color is very important for doctors in making their decisions, and therefore the images generated by CycleGAN are not preferable.
According to the doctors’ feedback, the images generated by MIINet are recommended for supporting throat diagnosis. Thanks to the introduction of the perceptual loss, our MIINet not only learns to generate images of compelling quality but also helps preserve the original attributes of the inputs, significantly improving the MDOS from 2.36 for the original LQ images to 4.11.
Although our system has achieved a promising result, there are several cases in which CycleGAN generates images with slightly better visual focus than our MIINet, as shown in Fig. 6. This is the trade-off of adding the perceptual loss to the objective function of CycleGAN: MIINet is forced to keep the characteristics of the original inputs, while CycleGAN has more freedom to generate outputs close to the HQ Throat dataset. Despite that fact, it is worth mentioning that the MDOS of MIINet-generated images is in most cases higher than that of CycleGAN since the original attributes have been preserved. For better medical decisions, doctors recommend using the results from both CycleGAN and MIINet when diagnosing throat images if necessary. Moreover, more objective quantitative evaluations beside the MDOS metric could be useful for our framework, and we intend to develop them in future work.
In this paper, we proposed the medical image improvement method (MIINet) to improve the quality of throat images for supporting medical diagnosis. With the introduction of the simple but effective perceptual loss, our MIINet largely improved the quality of original LQ throat images and achieved a promising result on the real-world throat image dataset. From the results, we believe that our proposed MIINet could be a useful tool for supporting doctors in making medical decisions and has a potential impact on different types of medical images.
This work was done while the first author did a research internship at Aillis Inc., Japan. We would like to thank all researchers, especially Dr. Sho Okiyama, Memori Fukuda, and Kazutaka Okuda, for their valuable comments and feedback.