The analysis of multi-modality positron emission tomography and computed tomography (PET-CT) images requires combining the sensitivity of PET to detect abnormal regions with anatomical localization from CT. However, current methods for PET-CT image analysis either process the modalities separately or fuse information from each modality based on knowledge about the image analysis task. These methods generally do not consider the spatially varying visual characteristics that encode different information across the different modalities, which have different priorities at different locations. For example, a high abnormal PET uptake in the lungs is more meaningful for tumor detection than physiological PET uptake in the heart. Our aim is to improve fusion of the complementary information in multi-modality PET-CT with a new supervised convolutional neural network (CNN) that learns to fuse complementary information for multi-modality medical image analysis. Our CNN first encodes modality-specific features and then uses them to derive a spatially varying fusion map that quantifies the relative importance of each modality's features across different spatial locations. These fusion maps are then multiplied with the modality-specific feature maps to obtain a representation of the complementary multi-modality information at different locations, which can then be used for image analysis, e.g. region detection. We evaluated our CNN on a region detection problem using a dataset of PET-CT images of lung cancer. We compared our method to baseline techniques for multi-modality image analysis (pre-fused inputs, multi-branch techniques, multi-channel techniques) and demonstrated that our approach had a significantly higher accuracy (p < 0.05) than the baselines.
Medical imaging is a cornerstone of modern healthcare providing unique diagnostic, and increasingly therapeutic, capabilities that affect patient care. The range of medical imaging modalities is wide but in essence they provide anatomical and functional information about structure and physiopathology. The multi-modality 18F-fluorodeoxyglucose (FDG) positron emission tomography and computed tomography (PET-CT) scanner is regarded as the imaging device of choice for the diagnosis, staging, and assessment of treatment response in many cancers. PET-CT combines the sensitivity of PET to detect regions of abnormal function and the anatomical localization provided by CT. With PET, sites of disease usually display greater FDG uptake (glucose metabolism) than normal structures. The spatial extent of the disease within a particular structure, however, cannot be accurately determined due to the inherently lower resolution of PET when compared to CT and MR imaging, tumor heterogeneity, and the partial volume effect. CT provides the anatomical localization of sites of abnormal FDG uptake in PET and so adds precision in the imaging interpretation. One example clinical domain that has benefited greatly from PET-CT imaging is the evaluation of non-small cell lung cancer (NSCLC), the most common type of lung cancer. In NSCLC, the extent of the disease at diagnosis is the most important determinant of patient outcome; PET-CT is able to detect sites of disease where there are no abnormalities in the underlying structure on CT, hence its value in patient management [5, 6, 7, 8].
The role of PET-CT in cancer care has provoked extensive research into methods to detect, classify, and retrieve PET-CT images [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. These methods are divided into two main categories: (i) methods that process each modality separately and then combine the modality-specific features [9, 10, 11, 12, 13, 14, 15, 16, 17, 18], and (ii) methods that combine or fuse complementary features from each modality [19, 20, 21, 22]. While our experimental evaluation in this paper is focused on region detection, we provide an overview of the broader field below and in Section II.
Methods that process each modality separately are inherently limited when the intent is to consider both the function and anatomical extent of disease. For example, in a chest CT depicting a lung tumor that is causing collapse in adjacent lung tissue, both the tumor and the collapse can appear identical. Similarly, some areas of high FDG uptake on PET images may be linked to normal physiological uptake, such as in the heart, and these regions need to be filtered out based upon knowledge about anatomical characteristics from CT to differentiate them from abnormal PET regions [23, 24, 25, 26]. In contrast, methods that fuse information from the two modalities often use a priori knowledge about characteristics of the different modalities to prioritize information from the two modalities for different tasks. For example, Song et al.  used both PET and CT for lesion characterization but disregarded PET features when computing spatial appearance features due to the low spatial resolution of PET. Alternatively, they may fuse information using a representation that models semantically-derived relationships between the two modalities. For example, a graph structure based upon the criteria from cancer staging manuals  was previously used to associate PET tumor features with the CT features of nearby anatomical structures . As such, these fusion methods are highly dependent upon an external predefined specification of the relationship between the features from both modalities. Hence, the ability to derive an application-specific fusion would reduce this dependency.
Current image fusion strategies in the general (non-medical) domain derive spatially varying fusion weights from the local visual characteristics of the different image data. Features such as pixel variance, contrast, and color saturation are used to derive task-specific fusion ratios for different regions of interest (ROIs) within the images [28, 29]. These fusion methods can thus adapt to and prioritize different content at different locations in the images according to the underlying image features that are relevant to the different images being analyzed. This results in the capacity to enhance specific information for different image data. Adapting such spatially varying pixel-level fusion for PET-CT image analysis will enable the adaptive fusion of visual characteristics of different diseases (e.g. uptake in homogeneous vs. heterogeneous tumors) and different anatomical locations (e.g. tissue density in the lungs vs. mediastinum). Our hypothesis is that a spatially varying fusion of multi-modality image data that is derived from the underlying visual features will enable better integration of the complementary information in multi-modality images for automatic image analysis applications such as region detection.
Convolutional neural networks (CNNs) are deep learning methods for object detection, classification, and analysis of image data. CNNs have shown superiority across this spectrum when compared to non-deep learning methods in various benchmarks, e.g., the ImageNet Large Scale Visual Recognition Challenge. This dominance relates to the ability of CNNs to implicitly learn image features that are ‘meaningful’ for a given task directly from the image data. In medical image analysis, CNNs have shown improved object detection, classification, and segmentation performance compared to traditional approaches such as support vector machines [34, 35, 36, 37, 38, 39, 40, 41].
For multi-modality medical image analysis, many investigators have used CNN-derived features, directly or with some tuning, that were obtained from training on natural (photographic) images [42, 10, 19, 11, 13]. CNNs that were designed for multi-modality medical images were either used as a filter after modality-specific processing, or alternatively focused on images of the same modality obtained with different acquisition protocols, e.g., T1- and T2-weighted magnetic resonance (MR) images showing different tissue properties [39, 40]. Other deep learning approaches have included auto-encoders to learn shared representations that capture the similarities in features across multiple modalities. In these studies, however, CNNs were generally used as feature extractors and classifiers without consideration of how the features from each modality were combined, thus relying either on pre-fusion of the input data or using multi-channel approaches where each input modality was initially processed by an independent CNN kernel (or weight tensor). In both circumstances, the images or features were fused without consideration of the spatially varying visual characteristics at different image locations. We provide further description of related studies in Section II.
In this paper, our aim is to improve fusion of the complementary information in multi-modality images for automatic medical image analysis. We present a new CNN that learns to fuse complementary anatomical and functional data from PET-CT images in a spatially varying manner, for the detection of different ROIs. The novelty of our CNN is its ability to produce a fusion map that explicitly quantifies the fusion weights for the features in each modality. This is in contrast to CNNs that use multi-channel inputs [19, 39] or modality-specific encoder branches [9, 41, 13], where modalities are implicitly fused. We employ an experimental comparison of our CNN with these baseline methods to empirically support our hypothesis that the spatially varying feature fusion enabled by our CNN enhances the detection of different ROIs in PET-CT lung cancer images.
The computerized analysis of multi-modality medical imaging has been a widely pursued area of research. In recent work, Bagci et al. proposed a method to simultaneously delineate ROIs in PET, PET-CT, PET-MR imaging, and fused MR-PET-CT images using a random walk segmentation algorithm with an automated similarity-based seed selection process. Zhao et al. combined dynamic thresholding, watershed segmentation, and support vector machine (SVM) classification to classify solitary pulmonary nodules on the basis of CT texture features and PET metabolic features. Similarly, Lartizien et al. used texture feature selection and SVM classification for staging of lymphoma patients based on their PET-CT imaging data. Y. Song et al. and Q. Song et al. used the context of PET and CT regions to characterize tumors with spatial and visual consistency. Han et al. segmented tumors from PET-CT images, formulating the problem as a Markov Random Field with modality-specific energy terms for PET and CT characteristics. In our prior work [20, 21], we used multi-stage discriminative models to classify ROIs in thoracic PET-CT images and in full-body lymphoma studies. In our PET-CT retrieval research [22, 43], we have also derived a graph-based model that attempts to bridge the semantic gap by modeling the spatial characteristics that are important for lung cancer staging.
Recently, the strength of deep learning methods for feature learning and pattern recognition [31, 30], and the strength of CNNs for image processing [32, 33], have provoked their application to medical image analysis. Many initial studies validated the applicability of deep learning to the medical domain using approaches that transferred features learned from a non-medical domain and tuned them to a specific medical task [34, 35], e.g., classification of the modality of the medical images depicted in research literature and the localization of planes in fetal ultrasound images. Later studies designed new CNNs for specific clinical challenges, such as the classification of interstitial lung disease in CT images.
In the multi-modality medical image analysis domain, MR images obtained with different acquisition protocols have been treated as multi-modality images, with the reasoning that the different MR images showed different aspects of the same anatomical structure [39, 40, 41]. Zhang et al. designed a CNN-based segmentation approach for brain MR images. Tseng et al. segmented ROIs with complementary features that were learned via a convolution across the different MR images. Van Tulder and de Bruijne used an unsupervised approach to learn a shared data representation of MR images, which acted as a robust feature descriptor for classification applications. Liu et al. used a convolutional autoencoder to detect air, bone, and soft tissue for attenuation correction in PET-MR images. Teramoto et al. used a CNN as a second stage classifier to determine if candidate lung nodules in PET-CT were false positives. Bi et al. used domain-transferred CNNs to extract PET features for PET-CT lymphoma classification. Bradshaw et al. fine-tuned a CNN, which was pre-trained on ImageNet data, for PET-CT images in a multi-channel input approach, using a CT slice and two maximum intensity projections of the PET data as the inputs. Xu et al. cascaded two V-Nets to detect bone lesions, using CT alone as the input to the first V-Net and a pre-fused PET-CT image for the second. Similarly, Zhong et al. trained one U-Net for PET and one for CT, combining the results using a graph cut algorithm. None of this prior work, however, considered how the visual characteristics, specific to each image at different locations, could be integrated in a spatially varying manner.
Our dataset comprised 50 FDG PET-CT scans of patients with pathologically proven NSCLC. The studies were acquired on a Biograph 128-slice mCT (PET-CT scanner; Siemens Healthineers, Hoffman Estates, IL, USA). The mCT is a high-resolution tomograph with high-definition reconstruction, time-of-flight and flow motion characteristics. The studies were performed in the Department of Molecular Imaging at the Royal Prince Alfred Hospital, Sydney, Australia. Each study comprised one CT volume and one PET volume: the CT resolution was pixels at 0.98 mm × 0.98 mm, the PET resolution was 200 × 200 pixels at 4.07 mm × 4.07 mm, with a slice thickness and an interslice distance of 3 mm. Studies contained between 1 and 7 tumors (inclusive) in the thorax; we only used the slices from the thorax containing the ROIs (852 total slices). All data were de-identified. All images were rescaled to a resolution of pixels (x-y axes). The PET images were normalized by a transformation to the standardized uptake value (SUV) to account for variation in PET tracer uptake related to isotope dose and patient body mass.
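As an illustration only, the SUV normalization can be sketched in its common body-weight form. The text does not state which SUV variant or decay correction was used, so the function name, its arguments, and the example values below are hypothetical; the sketch assumes the injected dose has already been decay-corrected to scan time.

```python
def suv_body_weight(activity_bq_per_ml, injected_dose_bq, body_weight_kg):
    """Convert a PET activity concentration to a body-weight SUV.

    Assumptions (not taken from the paper): the dose is already
    decay-corrected to scan time and tissue density is about 1 g/mL,
    so the result is effectively dimensionless.
    """
    if injected_dose_bq <= 0 or body_weight_kg <= 0:
        raise ValueError("dose and body weight must be positive")
    body_weight_g = body_weight_kg * 1000.0
    return activity_bq_per_ml * body_weight_g / injected_dose_bq


# Hypothetical voxel: 5 kBq/mL, 350 MBq injected dose, 70 kg patient.
example_suv = suv_body_weight(5000.0, 350e6, 70.0)
```

An SUV of 1 corresponds to tracer that is evenly distributed through the body, which is why the normalization makes uptake comparable across patients.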
The ground truth was derived from the diagnostic imaging report which detailed the locations of the primary tumor and any involved thoracic lymph nodes. All reports were done by a single, experienced imaging specialist who has read over 80,000 PET and PET-CT scans. We used the report findings to drive a semi-automatic process for ROI labeling. We applied a commonly used adaptive thresholding algorithm  to extract the lung ground truth from the CT volume. Similarly, we used connected thresholding to coarsely determine the mediastinum. We extracted the tumor ground truth using 40% peak SUV connected thresholding to detect the ‘hot spots’ identified in the diagnostic reports . Minor manual adjustments of the thresholding parameters were done to ensure the ROIs corresponded to areas described in the report.
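The 40% peak SUV connected thresholding step can be sketched as a simple region growing procedure. This is a minimal 2D illustration under stated assumptions, not the pipeline used for the ground truth: the seed is assumed to be the peak pixel of a reported hot spot, 4-connectivity is assumed, and the function name and toy values are hypothetical.

```python
from collections import deque

import numpy as np


def connected_threshold(suv, seed, fraction=0.4):
    """Grow a region from `seed` over pixels >= fraction * SUV at the seed.

    `seed` is assumed to be the (row, col) of the hot spot's peak SUV.
    """
    threshold = fraction * suv[seed]
    mask = np.zeros(suv.shape, dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbours (4-connectivity, an assumption).
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < suv.shape[0] and 0 <= nc < suv.shape[1]
            if inside and not mask[nr, nc] and suv[nr, nc] >= threshold:
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask


# Toy SUV slice: a hot spot around (1, 2) and an unconnected bright pixel.
suv_slice = np.array([
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 5.0, 10.0, 1.0, 1.0],
    [1.0, 4.0, 3.0, 1.0, 6.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
])
tumor_mask = connected_threshold(suv_slice, seed=(1, 2))
```

The bright pixel at (2, 4) exceeds the threshold but is not connected to the seed, so it is excluded; this is the property that lets the report-driven seeds separate tumors from unrelated uptake.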
We randomly divided the 50 PET-CT studies into a training set of 40 studies (690 slices) and a separate test set of 10 studies (162 slices). The training set was further randomly subdivided into two groups, each comprising 20 studies, for use in two-fold parameter validation (see Section III-F).
Fig. 1 shows the architecture of our proposed CNN; note that the number alongside each feature map in the figure refers to the number of output channels in the feature map. Our CNN comprises four main components: two encoders (one for each modality), one co-learning and fusion component, and a reconstruction component. The purpose of the two encoders is to derive the image features that are most relevant to each specific image modality. The co-learning component uses the modality-specific features produced by the encoders to derive a spatially varying fusion map to weight the modality-specific features at different locations. Finally, the reconstruction component integrates the modality-specific fused features across multiple scales to produce the final prediction. The structure and behavior of these components are described in detail in the following subsections.
Our CNN contains an encoder for PET images and a separate encoder for CT images. The purpose of each encoder is to extract the visual features that are relevant to the input image modality. Thus, the encoders were designed with stacked convolutional layers in a similar manner to the deep CNNs that have achieved high accuracy in image classification tasks, e.g., AlexNet and VGGNet [48, 49]. As shown in Fig. 1, each encoder comprises four blocks that each contain two convolutional layers for feature map generation and a max pooling layer to down-sample the feature maps.
A consequence of this stacked structure is that as the weights of each layer change, the distribution of the outputs they produce also changes, potentially influencing the convolutional layers later in the network. During training, this means that even small changes in the weights of one layer may cascade and be amplified in deeper layers, requiring the layers to continuously adapt to new input distributions. As our CNN includes inputs from two different imaging modalities, the co-learning and reconstruction components will be affected by the cascading weight changes from both encoders, which will slow convergence and thus hinder the learning process.
Let $F = W \ast X + b$ be the output feature map of a convolutional layer, where $X$ is the input to the convolution layer, $\ast$ is the convolution operation, $W$ is the learned weights of the convolution layer, and $b$ is the learned bias of the convolution layer. We use a batch normalization layer to normalize every dimension of the output feature map $F$ to a distribution with zero mean and unit variance, which acts to reduce the impact when the feature map is used as an input for subsequent convolutional layers.

The normalized feature map is then passed through a Leaky ReLU activation function:

$$f(\hat{x}) = \begin{cases} \hat{x}, & \hat{x} > 0 \\ \alpha \hat{x}, & \text{otherwise} \end{cases} \qquad (1)$$

where $\hat{x}$ is a normalized feature and $\alpha$ is a parameter controlling the ‘leakiness’ of the activation function, with the constraint that $0 < \alpha < 1$. The Leaky ReLU activation avoids the dead neuron problem that can occur with the standard ReLU function, where some weights in $W$ can be updated to a value where their training gradients are forever stuck at 0, thus preventing the weights from being updated in the future. The parameter $\alpha$ enables the introduction of a small non-zero gradient when $\hat{x} < 0$, thereby preventing the weights from being stuck at an unrecoverable value. For simplicity of notation, we refer to the output of a convolutional layer by $F$ as the feature map generated from $X$ after convolution, batch normalization, and activation.
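The normalization and activation applied after each convolution can be sketched in NumPy as follows. This is a simplified illustration: it normalizes each channel over a batch and omits the learnable scale and shift parameters that batch normalization layers usually include, the tensor shapes are hypothetical, and the leakiness of 0.1 is taken from the parameter table.

```python
import numpy as np


def batch_norm(features, eps=1e-5):
    """Normalize each channel to zero mean and unit variance.

    `features` has shape (batch, height, width, channels). The learnable
    scale and shift of a full batch normalization layer are omitted.
    """
    mean = features.mean(axis=(0, 1, 2), keepdims=True)
    var = features.var(axis=(0, 1, 2), keepdims=True)
    return (features - mean) / np.sqrt(var + eps)


def leaky_relu(x, alpha=0.1):
    """Pass positive values through; scale negative values by alpha."""
    return np.where(x > 0, x, alpha * x)


rng = np.random.default_rng(0)
conv_out = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 8, 64))  # toy conv output
normalized = batch_norm(conv_out)
activated = leaky_relu(normalized)
```

Because negative inputs keep a slope of 0.1 rather than 0, a unit that is pushed into the negative range still receives a gradient and can recover.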
The co-learning component consists of two parts: (i) a co-learning unit, which can be thought of as a CNN that learns to derive spatially varying fusion maps, and (ii) a fusion operation that uses the fusion maps to prioritize different features. Fig. 2 shows a conceptual example of the function of the multi-modality co-learning unit. The inputs to the co-learning unit are two feature maps $F_{CT}$ and $F_{PET}$ (each from a block of one modality-specific encoder), each of size $w \times h \times c$ with $w$ width, $h$ height, and $c$ channels. These feature maps are stacked to form $S$, a $w \times h \times m \times c$ tensor with $m = 2$ number of modalities. The channels of $S$ are then convolved with the channels of a learnable 3D kernel $K$ of size $k \times k \times m$, where $k$ is the width and height of the kernel, and $m$ is the number of modalities.

By performing a 3D convolution without padding the modality dimension, we obtain for a given output channel $j$ a feature map $Z_j$ with a singleton third dimension, where the value at location $(x, y)$ is determined from the neighborhood of both $F_{CT}$ and $F_{PET}$:

$$Z_j(x, y) = \sum_{i=1}^{c} \sum_{u=1}^{k} \sum_{v=1}^{k} \sum_{t=1}^{m} K_{j,i}(u, v, t)\, S_i\left(x + u - \lceil k/2 \rceil,\; y + v - \lceil k/2 \rceil,\; t\right) \qquad (2)$$

where $S_i$ is the $i$-th channel of $S$ and $K_{j,i}$ is the part of the kernel that links input channel $i$ to output channel $j$. We then squeeze the singleton third dimension to obtain an output feature map $Z$ of size $w \times h \times 2c$, with the same width and height as the two modality-specific input feature maps $F_{CT}$ and $F_{PET}$ and double the number of channels, which is important for the weighting of the modality-specific feature maps by the co-learned fusion maps as described below.
Our intention is that the co-learned fusion map controls the level of importance given to information from each modality at each location, in contrast to the global fusion ratio in PET-CT pixel intermixing [54, 55, 56]. Thus the co-learned fusion maps directly affect the input distribution of the learnable layers that immediately follow the co-learning unit. Hence, we do not normalize the output of the 3D convolution. As with the encoders (see Section III-C), we used a Leaky ReLU activation function to obtain the multi-modality co-learned fusion map:

$$M = f(Z + b_M) \qquad (3)$$

where $Z$ is the output of the 3D convolution, $f$ is the Leaky ReLU activation, and $b_M$ are the learned biases. Note that the multi-modality fusion map $M$ is obtained by the co-learning unit based on the spatial integration of the features from both modalities, since the 3D convolution operation considers the 3D neighborhood defined by the width, height, and modality of the stacked feature map $S$.
The fusion operation (depicted in Fig. 3) integrates the modality-specific feature maps according to the values (coefficients) in the multi-modality fusion map, as follows:

$$F_{fused} = M \odot \left(F_{CT} \oplus F_{PET}\right) \qquad (4)$$

where $F_{fused}$ is the fused co-learned feature map, $\oplus$ is the stacking operation, and $\odot$ is an element-wise multiplication. This process merges the two modality-specific feature maps $F_{CT}$ and $F_{PET}$ and weights them by the co-learned multi-modality fusion map $M$, similar to pixel intermixing. Our CNN (Fig. 1) generates four fused feature maps, one for each pair of encoder blocks. These fused feature maps are passed to the reconstruction part of the CNN (see Section III-E).
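The co-learning unit and the fusion operation can be sketched together in NumPy. This is an illustrative reading of the description rather than the authors' implementation: it assumes zero padding of the spatial dimensions only, a standard 3D convolution that sums over input channels, and that the stacking in the fusion step is a concatenation along the channel axis so that it matches the doubled channel count. All function names and the toy sizes are hypothetical.

```python
import numpy as np


def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)


def colearn_fusion_map(f_ct, f_pet, kernel, bias, alpha=0.1):
    """Derive a spatially varying fusion map from two (H, W, C) feature maps.

    `kernel` has shape (k, k, 2, C, 2C): spatial size k, depth 2 over the
    modality axis, C input channels and 2C output channels.
    """
    h, w, _ = f_ct.shape
    k = kernel.shape[0]
    pad = k // 2
    stacked = np.stack([f_ct, f_pet], axis=2)                 # (H, W, 2, C)
    # Pad width and height only; the modality axis is left unpadded, so the
    # depth-2 kernel produces a single (singleton) output along that axis.
    padded = np.pad(stacked, ((pad, pad), (pad, pad), (0, 0), (0, 0)))
    out = np.zeros((h, w, kernel.shape[-1]))
    for u in range(k):
        for v in range(k):
            patch = padded[u:u + h, v:v + w]                  # (H, W, 2, C)
            # Sum over the modality and input-channel axes at this offset.
            out += np.tensordot(patch, kernel[u, v], axes=([2, 3], [0, 1]))
    return leaky_relu(out + bias, alpha)                      # no normalization


def fuse(f_ct, f_pet, fusion_map):
    """Weight the channel-concatenated features element-wise by the fusion map."""
    return fusion_map * np.concatenate([f_ct, f_pet], axis=-1)


rng = np.random.default_rng(0)
f_ct = rng.normal(size=(6, 6, 4))      # toy CT encoder features
f_pet = rng.normal(size=(6, 6, 4))     # toy PET encoder features
kernel = rng.normal(scale=0.1, size=(3, 3, 2, 4, 8))
bias = np.zeros(8)
fusion_map = colearn_fusion_map(f_ct, f_pet, kernel, bias)
fused = fuse(f_ct, f_pet, fusion_map)  # shape (6, 6, 8)
```

Each value of the fusion map depends on a 3 × 3 neighborhood in both modalities, yet it weights a single channel of a single modality at one location. That is the difference from a global intermixing ratio.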
The reconstruction part of our CNN creates a prediction map of the ROIs within the PET-CT image. It does this by integrating the co-learned feature maps from different encoder blocks and upsampling them to the dimensions of the original inputs. Similar to the encoders, the reconstruction component comprises four blocks, each with one upsampling layer, one deconvolutional layer, and two convolutional layers.
The input to a reconstruction block is the output co-learned feature map from a co-learning unit stacked with the output of any prior reconstruction block. The upsampling layer first doubles the width and height of the stacked feature map using nearest neighbor interpolation to enable eventual reconstruction of the detected regions at the same scale as the original input. The deconvolution layer merges the information from the stacked modality-specific feature maps, which is further refined by the two convolutional layers. The concept behind each reconstruction block is to generate higher dimensional feature maps that better correspond to the features for different ROIs by merging lower dimensional information with features that were fused from multiple image modalities. As with the modality-specific encoders (see Section III-C), we use batch normalization and Leaky ReLU activations.
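The nearest neighbor upsampling at the start of each reconstruction block can be illustrated with a short NumPy sketch; the function name and the toy input are hypothetical.

```python
import numpy as np


def upsample_nearest_2x(feature_map):
    """Double the height and width of an (H, W, C) map by repeating pixels."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)


small = np.arange(4, dtype=float).reshape(2, 2, 1)   # one-channel 2 x 2 map
large = upsample_nearest_2x(small)                   # shape (4, 4, 1)
```

Each input pixel becomes a 2 × 2 block of identical values, so the operation adds no new information; the subsequent deconvolution and convolution layers are what refine the enlarged map.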
After the last reconstruction block, the output feature map has the same width and height as the input PET-CT image, with 64 channels in the third dimension. This is analogous to a final 64-dimensional feature vector for each pixel in the original image. We then use a 1 × 1 convolution to map these feature vectors into the number of ROIs (4 in our experimental set-up), obtaining for each pixel a vector $v$ corresponding to the observed activations for each ROI class. Finally, we transform these observations into a probability or prediction map that corresponds to the likelihood of the pixel belonging to a particular ROI class using the softmax function:

$$P(r \mid v) = \frac{e^{v_r}}{\sum_{j=1}^{R} e^{v_j}} \qquad (5)$$

where $P(r \mid v)$ is the probability that the pixel with observation vector $v$ belongs to the ROI $r$, $v_r$ is the $r$-th element of vector $v$ and is the activation corresponding to ROI $r$, and $R$ is the total number of ROIs. Fig. 4 is an example of the prediction maps generated for four ROI classes: lung fields, mediastinum or soft tissue, tumors, and background (all other ROIs).
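The final per-pixel classification can be sketched as a 1 × 1 convolution, which is a linear map applied independently at every pixel, followed by a softmax over the ROI classes. The weights here are random placeholders and the 8 × 8 spatial size is hypothetical; only the 64 input channels and 4 classes come from the text.

```python
import numpy as np


def pixelwise_softmax(features, weights, bias):
    """Apply a 1 x 1 convolution and a softmax over classes at every pixel.

    features: (H, W, C), weights: (C, R), bias: (R,). Returns (H, W, R).
    """
    logits = features @ weights + bias
    # Subtracting the per-pixel maximum leaves the softmax unchanged but
    # avoids overflow in the exponential.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 64))        # toy reconstruction output
weights = rng.normal(scale=0.1, size=(64, 4))
bias = np.zeros(4)
probs = pixelwise_softmax(features, weights, bias)
labels = probs.argmax(axis=-1)                # most likely ROI per pixel
```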
We trained our CNN using mini-batch stochastic gradient descent with momentum, using the following loss function and training parameters. We used the training data as specified in Section III-A; to improve the robustness of our training and to avoid overfitting, we applied data augmentation through the standard technique of random cropping and flipping of training samples [48, 36].
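One practical point in augmenting PET-CT data is that the same random crop and flip must be applied to the PET slice, the CT slice, and the label map so that they remain aligned. The sketch below illustrates this; the crop size, the horizontal flip axis, and the 0.5 flip probability are assumptions, as the text does not specify them.

```python
import numpy as np


def random_crop_flip(pet, ct, label, crop, rng):
    """Apply one shared random crop and horizontal flip to aligned 2D arrays."""
    h, w = label.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    window = (slice(top, top + crop), slice(left, left + crop))
    pet_c, ct_c, label_c = pet[window], ct[window], label[window]
    if rng.random() < 0.5:  # assumed flip probability
        pet_c, ct_c, label_c = pet_c[:, ::-1], ct_c[:, ::-1], label_c[:, ::-1]
    return pet_c, ct_c, label_c


rng = np.random.default_rng(0)
image = np.arange(64).reshape(8, 8)   # stand-in for PET, CT, and labels
pet_aug, ct_aug, label_aug = random_crop_flip(image, image, image, crop=6, rng=rng)
```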
|Parameter|Value|
|---|---|
|2D convolution kernel size|3 × 3|
|2D deconvolution kernel size|3 × 3|
|max pool size|2 × 2|
|number of channels|64|
|3D convolution kernel size|3 × 3 × 2|
|ReLU leakiness|0.1|
|regularization strength|0.1|
Metrics [Mean ± Standard Deviation %]

|ROI|Method|Precision|Sensitivity|Specificity|Accuracy|
|---|---|---|---|---|---|
|Lungs|TB|77.46 ± 6.10*|99.58 ± 0.51|98.29 ± 0.49*|98.36 ± 0.46*|
|Lungs|TC|74.58 ± 6.87*|99.68 ± 0.79|97.96 ± 0.66*|98.07 ± 0.60*|
|Lungs|FS|80.19 ± 6.40*|98.60 ± 1.47*|98.56 ± 0.50*|98.57 ± 0.47*|
|Lungs|our method|82.57 ± 5.62|99.51 ± 0.82|98.76 ± 0.40|98.81 ± 0.37|
|Mediastinum|TB|62.86 ± 13.46|92.00 ± 8.33*|99.06 ± 0.60|98.97 ± 0.59|
|Mediastinum|TC|62.17 ± 14.80|93.55 ± 5.47*|99.09 ± 0.43|98.99 ± 0.46|
|Mediastinum|FS|56.17 ± 14.27*|89.00 ± 17.30*|98.82 ± 0.58*|98.69 ± 0.66*|
|Mediastinum|our method|64.75 ± 12.87|95.74 ± 7.90|99.08 ± 0.65|99.04 ± 0.64|
|Tumor|TB|57.70 ± 28.72*|61.34 ± 33.49*|99.86 ± 0.12*|99.71 ± 0.29*|
|Tumor|TC|57.14 ± 27.83*|63.16 ± 36.14*|99.82 ± 0.20*|99.72 ± 0.23*|
|Tumor|FS|45.30 ± 24.43*|82.86 ± 22.66*|99.73 ± 0.20*|99.67 ± 0.21*|
|Tumor|our method|71.17 ± 26.61|75.99 ± 27.53|99.91 ± 0.12|99.83 ± 0.15|
|Background|TB|99.86 ± 0.19*|97.27 ± 0.79*|98.50 ± 1.80*|97.37 ± 0.78*|
|Background|TC|99.89 ± 0.18|96.88 ± 0.96*|98.85 ± 1.69|97.04 ± 0.92*|
|Background|FS|99.81 ± 0.29*|97.12 ± 22.66*|98.02 ± 0.20*|97.20 ± 0.95*|
|Background|our method|99.91 ± 0.08|97.74 ± 0.79|99.04 ± 0.89|97.85 ± 0.74|
|All ROIs|TB|73.77 ± 5.54*|96.46 ± 3.53*|99.09 ± 0.26*|99.02 ± 0.32*|
|All ROIs|TC|71.62 ± 6.20*|97.22 ± 2.51*|98.97 ± 0.30*|98.93 ± 0.33*|
|All ROIs|FS|73.10 ± 5.74*|96.25 ± 3.24*|99.05 ± 0.29*|98.98 ± 0.33*|
|All ROIs|our method|78.16 ± 5.17|98.00 ± 1.80|99.26 ± 0.26|99.23 ± 0.29|

* $p < 0.05$, derived from a $t$-test.
We modified the well-established categorical cross-entropy loss function for training our CNN. Let be the set of pixel observations in an image and be the true class of , from a set of ROIs. Then our loss is given by:
is the class specific scaling, and
is the cross-entropy loss . Under this formulation, is the number of pixels in the true ROI, is the number of pixels in ROI , is an indicator function that is 1 when and 0 otherwise, is defined by Equation 5, is the regularization strength, and is the -th weight in , the set of all weights in the CNN. The distribution of the number of pixels in each class varies depending on the particular ROI (e.g., there are many more lung pixels than there are tumor pixels). As such, in Equation 6 acts as a scaling coefficient for the cross entropy loss ; this formulation is designed to reduce any bias that may be caused by ROIs with different sizes (e.g., tumor ROIs are often much smaller than lung fields) . The final term in Equation 6 is a regularization to reduce overfitting. Our aim was to ensure that the convolution kernel weights (and as a consequence, the features) corresponding to one modality did not overpower the weights (and the features) of the other. As such, we used an -regularization, which acts to prioritize lower weights across the entirety of .
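The structure of this loss, a class-scaled cross-entropy plus a weight penalty, can be sketched as below. The exact scaling and regularization terms are not recoverable from the text as presented here, so the sketch makes loud assumptions: inverse-frequency class scaling, averaging over pixels, and a squared (L2) weight penalty. It is not the authors' formulation, and all names are hypothetical.

```python
import numpy as np


def weighted_cross_entropy(probs, labels, weights, reg_strength=0.1):
    """Class-scaled cross-entropy with a weight penalty.

    probs: (N, R) softmax outputs, labels: (N,) integer true classes,
    weights: list of arrays holding the network weights.
    """
    n_classes = probs.shape[1]
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    # ASSUMPTION: inverse-frequency scaling, so small ROIs (e.g. tumors)
    # contribute as much as large ones (e.g. lung fields).
    class_scale = counts.sum() / (n_classes * np.maximum(counts, 1.0))
    # Cross-entropy of each pixel: negative log-probability of its true class.
    per_pixel = -np.log(probs[np.arange(labels.size), labels] + 1e-12)
    data_term = np.mean(class_scale[labels] * per_pixel)  # ASSUMPTION: mean
    # ASSUMPTION: squared (L2) penalty over all weights.
    penalty = reg_strength * sum(np.sum(w ** 2) for w in weights)
    return data_term + penalty


# Toy case: three pixels of class 0, one of class 1, uniform predictions.
labels = np.array([0, 0, 0, 1])
probs = np.full((4, 2), 0.5)
loss = weighted_cross_entropy(probs, labels, weights=[np.array([1.0, -1.0])])
```

In the toy case the single class-1 pixel receives three times the scale of each class-0 pixel, so both classes contribute equally to the data term despite the imbalance.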
We implemented our CNN using TensorFlow 1.4 on a machine running Ubuntu 14.04 with CUDA 8.0 and cuDNN. Training was performed on an 11GB NVIDIA GTX 1080 Ti.
We compared our fusion method to several baseline strategies for fusing information from multi-modality images. To limit the number of variable changes in our experimentation, for all baselines we used a similar architecture as in our method (Fig. 1), replacing the co-learning component with a fusion strategy from the literature. The baselines were:
A two-channel (TC) input CNN, implementing a fusion strategy where each modality was treated as different channels of a single input [39, 19]. The CNN was similar to a single encoder form of the architecture in Fig. 1, with no co-learning component and the CT and PET modalities input as separate channels.
We used the same training and test datasets for all experiments. We used greyscale inputs for both modalities, as was common in the baseline fusion strategies [39, 19, 9] and other multi-modality CNN research [10, 42, 39, 40]. Our comparisons used the following metrics based on per-pixel overlap with the ground truth (GT): precision, sensitivity (recall), specificity, and accuracy. We computed the $p$-value for these comparisons with the two-sample $t$-test.
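The four per-pixel metrics follow from the counts of true and false positives and negatives for a given ROI class, treating all other classes as negative. A minimal sketch, with hypothetical function names and toy label maps:

```python
import numpy as np


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return float(numerator) / float(denominator) if denominator else float("nan")


def per_class_metrics(pred, truth, roi):
    """Per-pixel metrics for one ROI class against the ground truth."""
    p, t = pred == roi, truth == roi
    tp = np.sum(p & t)      # predicted ROI, truly ROI
    fp = np.sum(p & ~t)     # predicted ROI, truly something else
    fn = np.sum(~p & t)     # missed ROI pixels
    tn = np.sum(~p & ~t)    # correctly rejected pixels
    return {
        "precision": _ratio(tp, tp + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
    }


truth = np.array([[0, 0, 1, 1], [0, 2, 2, 1]])
pred = np.array([[0, 1, 1, 1], [0, 2, 0, 1]])
metrics = per_class_metrics(pred, truth, roi=1)
```

For small ROIs such as tumors, the true negatives dominate the pixel count, so specificity and accuracy stay high even when precision and sensitivity are poor.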
Table II shows the precision, sensitivity, specificity, and accuracy of our method when compared with the baseline CNNs; the data are presented for each of the four types of ROI and collectively for all ROIs. Our co-learning method has higher mean accuracy when compared to all baselines for all individual ROIs and overall. The improvement in accuracy offered by our method is statistically significant ($p < 0.05$) for all cases except for the mediastinum. Our co-learning fusion method improves upon the baselines in 17 of the 20 metrics and 13 of these improvements are statistically significant. The largest overall improvement was in the precision metric, indicating that our method resulted in an increase in the ratio of true positives to false positives.
Fig. 5 is a visual comparison of the ROIs detected by our method and by the baselines; a larger version is included as Fig. S3 in the Supplementary Materials. The figure shows that our method consistently detected regions that were a similar size to the ground truth. In contrast, the TC baseline detected fewer pixels (as shown by the tumor region) while the TB and FS baselines detected more pixels than were within the region. In particular, the TB CNN gave pixels within the chest wall a high probability of being within the mediastinum.
Fig. 6 depicts the co-learned fusion maps that were derived for an image with a single tumor; a larger version is included as Fig. S4 in the Supplementary Materials. In the figure, each feature map channel has been independently normalized so that its real-valued pixels could be viewed in the paper. In any particular channel, a higher absolute intensity implies a greater importance placed on that pixel during fusion. The figure shows how different information is prioritized differently for each region. For example, the 7th CT fusion channel (row 1, column 7) places a greater emphasis on the lungs while the 26th PET fusion channel (row 8, column 2) places the greatest emphasis on the tumor. The figure also indicates that the fusion weights are derived from features of both modalities. For example, the 7th CT fusion channel (row 1, column 7) emphasizes the lungs including the area that contains the tumor. Meanwhile, the 13th CT fusion channel (row 2, column 5) also emphasizes the lungs but de-emphasizes the area containing the tumor. Further analyses are included in Section SIII of the Supplementary Materials.
Our findings show that our co-learning method for feature fusion results in improved overall accuracy (see Table II) and a more consistent detection of regions (see Fig. 5) when compared with the baseline CNNs. We attribute these findings to the ability of the co-learning unit to derive a spatially varying fusion map that more precisely integrates functional and anatomical visual features across different regions.
Our co-learning CNN achieved a higher precision, sensitivity, specificity, and accuracy than the TB CNN for fusion across all regions and also overall. Our explanation for this outcome is that the design of our CNN explicitly fuses features at multiple scales through the multiple co-learning units, which prevents information loss that can occur from the standard pooling (downsampling) operations used for feature map dimensionality reduction in CNNs. In contrast, the TB CNN implements a late fusion approach in which modality-specific feature maps are merged just prior to the reconstruction, meaning that useful complementary information could possibly have already been lost. An examination of Fig. 5 shows that the TB CNN tends to have larger predicted regions compared to the GT (e.g. larger tumor area, additional regions in mediastinum), indicating that the lost complementary information makes the TB output less precise.
In a similar fashion, the TC CNN implements an early fusion approach in which no modality-specific feature maps are derived and where the first convolutional layer combines both modalities to derive fused feature maps. However, as indicated by the metrics in Table II and the images in Fig. 5, this tends to prioritize information from one modality at the expense of information from the other modality. The clearest example is in the less precise detection of the tumor region, which is barely noticeable in Fig. 5; only the part of the tumor with peak SUV (highest radiotracer uptake) is detected and the more subtle tumor regions are missed altogether.
The FS baseline is another variant of early fusion; the PET and CT modalities are pre-fused via pixel intermixing and the intermixed image is used as the input. It shares a similar weakness to the TC CNN in that the pre-fusion acts to prioritize information from one modality at the expense of the other, resulting in high precision for the lungs (80.19% in Table II) but much lower precision for the other regions. Examination of Fig. 5 shows that the tumor and mediastinum regions detected by the FS CNN are larger than the GT, indicating that there are a greater number of false positives.
All approaches examined (baselines and our method) had consistently high overall specificity (Table II). This is expected due to the large background region in PET-CT images, caused by areas that are outside the field of view of the scanner. All the methods achieved a high precision in detecting the background, correctly recognizing that background regions are distinct from other regions (i.e., they are true negatives). While all methods were able to discriminate between the background and other regions, our method had the best ability to discriminate between all the different regions.
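The effect of a large background on specificity can be illustrated with a small worked example. The sketch below assumes the usual one-vs-rest definitions of precision, sensitivity, specificity, and accuracy computed per pixel; the function name `region_metrics`, the toy labels, and the 10 x 10 image are hypothetical and are not taken from our experiments.

```python
import numpy as np

def region_metrics(gt, pred, label):
    """One-vs-rest pixel metrics for a single region label (assumed definitions)."""
    g = (gt == label)
    p = (pred == label)
    tp = np.sum(g & p)      # region pixels predicted as the region
    fp = np.sum(~g & p)     # other pixels predicted as the region
    fn = np.sum(g & ~p)     # region pixels that were missed
    tn = np.sum(~g & ~p)    # other pixels correctly left out
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Toy slice: label 0 = background, label 1 = tumor (hypothetical labels).
gt = np.zeros((10, 10), dtype=int)
gt[4:6, 4:6] = 1             # 4 true tumor pixels
pred = np.zeros((10, 10), dtype=int)
pred[4:6, 4:8] = 1           # 8 predicted tumor pixels: 4 correct, 4 false positives

tumor = region_metrics(gt, pred, 1)
print(tumor)
```

In this toy case the predicted tumor is twice the size of the ground truth, so precision is only 0.5, yet specificity remains about 0.96 because the 92 correctly rejected pixels dominate the denominator. This is why specificity alone does not separate the methods.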
The manner of feature fusion is a key difference in our CNN versus the baseline CNNs. Our CNN derives a fusion map for each image that is explicitly multiplied across the feature maps of the different modalities (see Equation 4), thereby acting as feature weights. As such, our method can potentially derive different fusion maps for different input PET-CT images, prioritizing different characteristics at different locations. In contrast, all the baseline CNNs use the convolution operation to fuse the different modalities; each channel is convolved with its own learned kernel and the results of each channel’s convolutions are added together. Our CNN also involves such convolutions but they occur after the prioritization of information by multiplication with the fusion map.
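The distinction between the two styles of fusion can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not our implementation: the feature maps are random stand-ins, the fusion map is produced by a hypothetical 1 x 1 mixing of both modalities' features (the actual co-learning unit is defined by Equation 4 and the released source code), and all names and sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 6, 5, 4                       # hypothetical spatial size and channels per modality

def colearn_fuse(f_ct, f_pet, w_fusion):
    """Weight modality features element-wise by an input-dependent fusion map."""
    stacked = np.concatenate([f_ct, f_pet], axis=-1)    # (H, W, 2C)
    # Assumption: the fusion map is a per-pixel linear mixing of both modalities.
    fusion_map = stacked @ w_fusion                      # (H, W, 2C)
    return fusion_map, fusion_map * stacked              # element-wise multiplication

def conv_fuse(f_ct, f_pet, w_conv):
    """Baseline-style fusion: a 1x1 convolution that sums weighted channels."""
    stacked = np.concatenate([f_ct, f_pet], axis=-1)
    return stacked @ w_conv                              # same kernel at every pixel

f_ct = rng.standard_normal((H, W, C))    # stand-in CT feature maps
f_pet = rng.standard_normal((H, W, C))   # stand-in PET feature maps
w_fusion = rng.standard_normal((2 * C, 2 * C))
w_conv = rng.standard_normal((2 * C, C))

fusion_map, fused = colearn_fuse(f_ct, f_pet, w_fusion)
baseline = conv_fuse(f_ct, f_pet, w_conv)

# A different input image yields a different fusion map, even with fixed weights.
other_map, _ = colearn_fuse(f_ct + 1.0, f_pet, w_fusion)
print(fused.shape, baseline.shape, np.allclose(fusion_map, other_map))
```

With fixed learned weights, the convolutional baseline applies the same channel weighting at every pixel of every image, whereas the multiplicative weights in `fusion_map` vary with both location and input.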
The fusion maps shown in Fig. 6 indicate that our method prioritizes different information at different locations. For example, the 7th CT fusion channel (row 1, column 7) and the 13th CT fusion channel (row 2, column 5) emphasize the lung fields relative to the area containing the tumor. We suggest the co-learning unit has produced these specific fusion channels because (in combination with other channels) they contain information to distinguish the lung fields from any tumors they may contain. Similar patterns are noticed in the fusion maps of other PET-CT images. While it may appear that several channels in the fusion map are redundant (similar in appearance to other channels), this is merely a visualization issue caused by normalizing 32-bit floating point greyscale images for display within the paper. As shown in Fig. S4 in the Supplementary Materials, PET fusion channels 33 to 37 (row 13, columns 1 to 5) appear visually similar but closer examination of the distribution of fusion weights within the images indicates that each channel prioritizes information in subtly different ways. Section SIII in the Supplementary Materials contains a detailed example showing the differences in these visually similar fusion channels and their impact when considering inputs with heterogeneous tumors. We suggest that the capacity of our co-learning CNN to derive these subtly different fusion weights enables more precise integration of the complementary information in each modality.
In our experiments, we compared our co-learning concept for fusion to other fusion approaches. To focus mainly on the differences in the approach to fusion, we built variant baseline CNNs that were similar to our own CNN but that implemented fusion in a different manner. This was done so that the main difference between the baselines and our CNN was the presence of our co-learning component, limiting the number of architectural differences. It also meant that we could use similar hyperparameters for fairer experimental comparisons. Our findings indicate that the addition of the co-learning component improved the final results and as such we suggest that other CNNs may also see improvements if they were to follow a similar conceptual approach for feature fusion; we have left this for future research.
Similarly, we suggest our co-learning CNN could also be extended or adapted to be better optimized for different datasets and applications. Such extensions could include improved encoders that go beyond stacked CNNs by borrowing designs from Residual, Inception, or other newer CNN architectures; enhanced application-specific encoders would better optimize the feature extraction for different applications. The co-learning unit could likewise be adapted, for example by using multiple stacked convolutions to derive fusion maps with even finer refinements. Finally, it is expected that the final blocks of the reconstruction component will be redesigned for different applications, e.g., by using fully connected layers or global average pooling for classification applications. We will examine some of these extensions and adaptations in our future research.
In addition, we used greyscale inputs for all experiments rather than use color lookup tables (CLUTs) for PET. CLUTs are sometimes used to enhance the appearance of functional information, particularly in image visualization. Our experimental aim was to focus on how the information from each modality was prioritized and the colorization of PET may have biased the functional information. However, we acknowledge that color information may provide additional visual features and we will explore this in a future study.
The evaluation of our co-learning CNN examined its performance on PET-CT lung cancer images. Other datasets of different body regions or diseases may prioritize a different set of anatomical and functional characteristics. Our findings show that our CNN was able to detect regions that were mainly dependent on single modality information (e.g., lungs relying mainly on anatomical information from CT) as well as those that depended on multi-modality information (e.g., tumors adjacent to the different anatomical structures). This suggests that our method can be trained to adapt to the underlying multi-modality information important for different regions in different datasets. We will examine the behavior of our CNN across different datasets in future work.
We presented a new supervised CNN for fusing complementary information from multi-modality images. Our CNN leveraged modality-specific features to derive a spatially varying fusion map that quantified the importance of each modality’s features across different spatial locations. Our findings from region detection experiments on PET-CT lung cancer images demonstrated that our approach achieved a significantly higher accuracy (p < 0.05) than several baseline CNN-based methods for multi-modality image analysis. We suggest that our conceptual approach of having a specific CNN architectural component to derive explicit fusion maps could be a useful technique for medical image analysis applications that require considering complementary information from different image modalities, e.g. PET-CT and PET-MR.
Int J Comput Vision, vol. 115, no. 3, pp. 211–252, 2015.
H. C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, “Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning,” IEEE T Med Imaging, vol. 35, no. 5, pp. 1285–1298, 2016.
A. Kumar, J. Kim, M. Fulham, and D. Feng, “Efficient PET-CT image retrieval using graphs embedded into a vector space,” in IEEE EMBC, 2014, pp. 1901–1904.
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, 2016, pp. 265–283.
C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proc. AAAI Conf Artificial Intelligence, vol. 4, 2017, p. 12.
We used a two-fold cross-validation approach to verify that the architecture and training parameters were appropriate for our dataset. The two folds each comprised the slices from 20 studies, taken from the 40 studies in the full training dataset (see Section III-A in the main body of the paper). Slices from the same study were restricted to the same fold, i.e., no slices from one study appeared in both folds. As the studies in the folds were randomly selected, each fold contained a different number of individual slices. For this reason, the second fold had a larger number of iterations compared to the first fold over the same number of epochs. Fig. S1 shows the TensorBoard logs of the training accuracy across both folds, recorded every 20 iterations. Similarly, Fig. S2 shows the TensorBoard logs of the validation accuracy across both folds, recorded every 20 iterations. The figures show very similar levels of training and validation accuracy across both folds (0.943 training, 0.951 validation for Fold 1; 0.951 training, 0.947 validation for Fold 2), despite the difference in fold size, indicating that the parameters of our architecture resulted in stable learning.
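A study-level split of this kind can be sketched as follows. The function name `two_fold_by_study`, the seed, and the synthetic slice counts are hypothetical; the sketch only illustrates the constraint that all slices of a study fall in the same fold, and it is not the script used to produce our folds.

```python
import random

def two_fold_by_study(slice_study_ids, seed=0):
    """Split slice indices into two folds so that no study spans both folds."""
    studies = sorted(set(slice_study_ids))
    rng = random.Random(seed)
    rng.shuffle(studies)                       # random assignment of studies
    half = len(studies) // 2
    fold_of = {s: (0 if i < half else 1) for i, s in enumerate(studies)}
    folds = ([], [])
    for idx, study in enumerate(slice_study_ids):
        folds[fold_of[study]].append(idx)      # each slice follows its study
    return folds

# Synthetic example: 40 studies with varying numbers of slices per study.
slice_study_ids = [s for s in range(40) for _ in range(10 + s % 7)]
fold_a, fold_b = two_fold_by_study(slice_study_ids)

studies_a = {slice_study_ids[i] for i in fold_a}
studies_b = {slice_study_ids[i] for i in fold_b}
print(len(studies_a), len(studies_b), len(fold_a), len(fold_b))
```

Because studies contribute different numbers of slices, the two folds hold 20 studies each but generally differ in slice count, which is why the number of iterations per epoch differs between folds.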
Figure S3 is a larger version of Figure 5 from the main text to show the results in greater detail.
Figure S4 is a larger version of Figure 6 from the main text, showing all 128 fusion maps from the first co-learning unit.
Figure S5 is an analysis of the fusion maps generated by our co-learning units. In this example, we examine the distribution of pixel values inside the tumor region for the PET image and for three channels of the generated fusion map.
The tumor region has high intensity compared to the other parts of the image, which makes it difficult to visually ascertain the differences among the fusion maps. We therefore used the ground truth tumor region and calculated the intensity histogram for the pixels within the tumor in the PET image and in the fusion maps; these histograms are also shown in Figure S5.
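The masked histogram described above can be sketched with NumPy. The image, the mask, the bin count, and the function name `masked_histogram` are assumptions made for illustration; the synthetic values below are not patient data.

```python
import numpy as np

def masked_histogram(image, mask, bins=20):
    """Histogram of the pixel values that fall inside a boolean region mask."""
    values = image[mask]                       # 1-D array of in-region pixels
    counts, edges = np.histogram(values, bins=bins)
    return counts, edges

rng = np.random.default_rng(0)
image = rng.random((64, 64))                   # stand-in for a PET slice or fusion channel
mask = np.zeros((64, 64), dtype=bool)
mask[20:30, 25:40] = True                      # stand-in ground truth tumor region

counts, edges = masked_histogram(image, mask)
print(counts.sum(), mask.sum())
```

Applying the same mask to the PET slice and to each fusion channel gives histograms over an identical set of pixels, so differences between them reflect the values rather than the region.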
The histogram of the PET image shows that the tumor is heterogeneous, with a maximum SUV of approximately 20. The mode of the tumor pixels lies toward the lower end of the distribution (approximate SUV of 10), below the mean SUV of 12. Overall, the distribution of the tumor is skewed towards the lower SUV values.
The fusion maps all have distributions that are different to the tumor’s original SUV distribution and are also distinct from each other. The tumor region within Fusion Map A has a relatively homogeneous distribution where the weights are clustered: the minimum and maximum fusion weights (coefficients) are within two standard deviations of the mean. A potential interpretation is that this fusion map differentiates the tumor region from the surrounding non-tumor areas, which have lower intensity as seen in the image. In contrast, Fusion Maps B and C have histograms showing heterogeneous distributions; neither of these matches the tumor’s original SUV distribution, implying that they are each prioritizing different aspects of the tumor region.
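The homogeneity criterion used for Fusion Map A (minimum and maximum within two standard deviations of the mean) can be expressed as a simple check. The helper name `within_two_sigma` and the example values are hypothetical and are chosen only to show one clustered and one heterogeneous case.

```python
import numpy as np

def within_two_sigma(values):
    """True if both the minimum and maximum lie within mean +/- 2 standard deviations."""
    mu = values.mean()
    sigma = values.std()
    return bool(values.min() >= mu - 2 * sigma and values.max() <= mu + 2 * sigma)

# Clustered weights, as described for a homogeneous fusion channel.
clustered = np.array([0.48, 0.50, 0.52, 0.50, 0.49, 0.51])
# Mostly low weights with one extreme value, a simple heterogeneous case.
heterogeneous = np.array([0.1] * 20 + [5.0])

print(within_two_sigma(clustered), within_two_sigma(heterogeneous))
```

The check is a coarse summary; it complements rather than replaces inspection of the full histograms in Figure S5.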
The source code and documentation for our CNN can be found on https://github.com/ashnilkumar/colearn.