In the field of computer vision and image recognition, deep convolutional neural networks (DCNNs) have been the primary model, owing to AlexNetkrizhevsky2012imagenet
having had great success during the ImageNet competitions in 2012. DCNNs are thus becoming the de facto solution for image recognition tasks. The DCNN is a multi-layered neural network that has the same architecture as Neocognitronfukushima1980neocognitron ; fukushima1988neocognitron , inspired by biological human visual systems. The brain’s vision center has a hierarchical mechanism that understands visual stimulus hubel1962receptive . The DCNN uses a similar hierarchical structure to extract features by using stacks of “convolution” and “spatial pooling” operations. The distinctive feature of a DCNN is its automation of obtaining tack feature representations, which suits the given tasks. Whereas DCNNs provide significant performance with image recognition tasks, they require massive amounts of training data compared to conventional machine learning models. The deep network structure exhibits higher expressive power than shallow models, which have the same complexity ba2014deep . Alternatively, most deep models have a large number of free parameters. Han et al. reported that deep neural networks require one-tenth of the number of free parameters training data needed to obtain the good generalization ability han2015learning . However, when the acquisition of a training dataset is difficult (e.g., medical imagery), the data will sometimes be insufficient. Generally, for learning approaches, the amount of training data has a strong effect on model performance. Deficient training data sometimes causes generalization problems such as overfittings.
A conventional approach for overcoming data deficiency is transfer learning pan2010survey . This is a learning technique that reutilizes knowledge gained from other learning tasks, called the “source domain,” to improve model performance in the desired task, called the “target domain
.” In the case of transfer learning for an image classification task, the model will first be trained to classify the source domain. Then, it will be trained for the target domain. In the case of DCNNs, we expect feature extraction to be improved by reutilizing its feature extraction capability. Note that this paper distinguishes two common styles of transfer learning. One is “fine-tuning,” which retrains only the classification part while maintaining the feature extraction part. In other words, the fine-tuning style assumes that the feature extraction part has enough ability to represent input signals. Another is “feature transfer,” which retrains the entire DCNN, containing the feature extraction layers, to adopt the feature extraction part for target task. This paper focuses on the latter case of transfer learning. In most transfer learning approaches for image recognition tasks, massive natural image datasets, such as ImageNetdeng2009imagenet , are used as the source domain tajbakhsh2016convolutional . The reason a natural image dataset is usually adopted is that of the availability of pre-trained models and their known performance. However, the appropriateness of utilizing a natural image dataset when the target domain greatly differs from the natural images is slightly questionable, because features of the source domain do not appear in the target domain. Azizpour et al. suggested that the possibility of knowledge transfer is affected by similarities between the source and target domains. They reported that it is preferable that transfer learning takes in similar data azizpour2016factors . However, only a few studies have focused on model performance variation by changing source and target domains, and their scope of tasks was limited to object recognition.
This paper proposes a two-stage feature transfer method that focuses on textural image recognition. By this method, the DCNN will successively be trained with natural and textural images as an initial state. Afterward, all of the DCNN, which includes not only classification part but also feature extraction part, will be trained again with the textural target domain. We will show that this type of successive and multi-domain feature transfer improves the generalization performance of the model and provides robustness with a decrease in the size of the training dataset. Moreover, we discuss the why feature transfer on DCNNs works so well. We visualize how feature representations of DCNNs come from different feature transfer processes and reveal that feature transfer improves feature representations of DCNNs, corresponding to both source domains.
In our experiment, we apply two-stage feature transfer to a classification task of textural X-ray high resolution computed tomography (HRCT) images of diffuse lung diseases (DLDs) and show performance improvements.
2 Related Works and Contributions
shouno2015transfer ; gao2016holistic applied a feature transfer to the classification of DLDs and used conventional single-staged feature transfer, which uses a natural image dataset. They reported that feature tranfer improves the classification performance over learning from scratch. However, the appropriateness of the source domain was not discussed, despite noting that the targets were textural. christodoulidis2017multi proposed an ensemble method that used multiple models trained with different domains for lung disorder classification. The term, “transfer learning,” references fine-tuning. The essence of this method entails ensemble modeling, rather than an actual transfer process. A notable study of transfer learning in the field of medical image analysis, tajbakhsh2016convolutional , systematically surveyed and analyzed the effects of transfer learning for various types of medical images, including textural images. They compared transfer learning from natural images and several modern parameter initialization methods in various medical image classification tasks, which had limited amounts of training data. They concluded that transfer learning from natural images to medical images is possible and meaningful, despite the large difference between the source and target domains. Nonetheless, the reason transfer learning works in DCNNs is still not fully understood.
In this paper, we study two-stage feature transfer, focusing on diffuse lung disease classification, making the following contributions.
We demonstrate the superiority of feature transfer over fine-tuning by comparing the model performance under the same source domains.
We demonstrate how the source domain of feature transfer affects the performance of DCNNs by comparing learning-from-scratch, single-stage feature transfer, and our proposed method.
We show that transfer learning provides robust performance with a decrease in the size of the training dataset.
We analyze how feature representations in intermediate DCNN layers of change corresponding to the transfer processes of the feature visualization method. This change implies a DCNN mechanism of feature transfer that has not been fully researched.
3 Deep Convolutional Neural Networks (DCNNs)
DCNNs are well-known deep learning models, which are a type of multi-layered neural network, widely used in computer vision. The most common DCNNs consist of “convolutions” and “spatial pooling” layers, which serve as feature extractors, and fully-connected layers, which serve as classifiers. The set of convolution and pooling layers are defined as “stages,” in the same manner described byfukushima1988neocognitron . The stages deform the input pattern into an intermediate representation, serving as a feature-extractor. Generally, DCNNs, which have several input channels, take 2D images and repeatedly transform them into feature maps via a stack of stages. Fig. 1 shows a schematic diagram of a typical DCNN.
To understand the feature extraction of DCNNs, let us consider the activation of -th stage. Here, we denote as an -th channel activation, at the location, , in the -th stage. Convolution layers provide convolutional filtering to derive feature maps (i.e., activations) from previous stages. The activation of the convolution layer is written as
where is the channel of the derived feature map, and is the convolution kernel (i.e., a “filter tensor”). Eq. 1
shows that the convolution layer makes a feature map as an inner product of a filter tensor,
, and all regions of input. Most neural networks modulate responses of each layer with an activation function to provide a non-linearity. We chose the rectified linear unit (ReLU), commonly used in deep neural networks, as the activation function. Following the convolution layer, all feature maps,, are modulated with ReLU.
The pooling layer gathers spatial neighbors to reduce the repercussions of local pattern deformations and the dimensionality of the feature map. The response to the pooling layer of the feature map, , is computed as
where is the spatial neighbor at location,
, in the feature map. This type of pooling operation, which uses the maximum value of spatial neighbors as a representative value, is called “max-pooling.”
These layers appear as early DCNN layers, which sum inputs and provide well-posed inputs for a given task. Trainable parameters of these formulations are the filter tensors, .
, extracted feature representations of the input image appear as the first fully-connected layer, “Fc 6.” Layers, “Fc7” and “Fc8” comprise a multi-layered perceptron, which plays the role of classifier.
The most remarkable trait of DCNNs is its effective feature representation, corresponding to tasks that are obtained as an intermediate representation of the feature extraction parts, consisting of convolution and spatial-pooling layers. These are obtained via a back-propagation algorithm, which minimizes classification errors.
4.1 Two-Stage Feature Transfer
Transfer learning is a technique that reutilizes feature expressions that come from similar tasks pan2010survey . This paper proposes a two-stage feature transfer method focused on textural recognition tasks.
Fig. 2 shows the schematic diagram of two-stage feature transfer, which, for DCNNs, means the reutilization of the feature extraction parts of the pre-trained network.
These parts consist of convolution layers and no classification layers.
Thus, the fully-connected layers (i.e., “Fc7” and “Fc8”) are cut off from their connections, as shown in Fig. 1.
After reconfiguring the network, we randomly initialize connections of the classifier part
111 Fully-connected weights, without a softmax layer (e.g., “Fc8”) can be reused as the initial state for the transfer. In our experiment, however, the resulting performance has been worsened.
Fully-connected weights, without a softmax layer (e.g., “Fc8”) can be reused as the initial state for the transfer. In our experiment, however, the resulting performance has been worsened.and train the entire DCNN again using back-propagation. Thus, feature transfer utilizes the feature extraction parts from other domains as its initial state.
In our proposed method, we first train the DCNN with massive natural images in the same manner as conventional feature transfer. At this stage, we expect that all connections are well-trained for extracting visual features from input images of natural scenes, such as edge structures krizhevsky2012imagenet ; zeiler2014visualizing . Second, we apply feature transfer again, using the texture image dataset and natural images to acquire better feature representation and fitting for the textural images, which do not appear in the natural images.
4.2 Feature Visualization
For analysis, to understand the mechanism of knowledge transfer in DCNNs, and to reveal how feature transfer influences improvements, we should discuss what is attended by the DCNN feature extraction process. We adopt DeSaliNet, proposed by mahendran2016salient , as our feature visualization method. This includes similar methods proposed by zeiler2014visualizing and simonyan2013deep as its special cases. DeSaliNet reveals which input component influences the feature representation of the feature extraction parts. Fig. 3 shows the process flow of a feature visualization using DeSaliNet.
The main idea of DeSaliNet is to propagate the feature map backward into the input space. DeSaliNet construes DCNN operations as functions and describes itself as a composite function. Let be a map to the -th layer’s feature map that we want to visualize. can thus be denoted by each layer activation, up to the -th layer, as
where is the layer type, such as convolution, max-pooling, and ReLU. Here, we also denote the “backward path,” , which is illustrated on the left side of Fig. 3 as an inverse map of .
where denotes inverse maps associated with its corresponding layer, . Details of each inverse map are discussed in Appendix A.1. Then, the visualization result, , is obtained as a member of the input space.
The origin of the visualization method, based on backward propagation, is the selective attention modelfukushima1987neural ; shuono1995connected . This type of feature visualization enables us to analyze what component is paid attention to in input images, in contrast to saliency maps simonyan2013deep , which analyze where it is paid attention to. Textural images are “what-based,” because the textural images do not have locality as a characteristic.
5.1 Target Domain
We examined the effectiveness of our proposed two-stage feature transfer method with the classification of X-ray and HRCT DLDs. DLD is a collective term for lung disorders that can spread to large areas of the lung. X-ray HRCT is effective for finding early-stage DLDs when they are small and mild. DLD conditions are seen as textural patterns on HRCTs. In this work, these patterns are classified into seven classes: consolidations (CON), ground-glass opacities (GGO), honeycombing (HCM), reticular opacities (RET), emphysematous changes(CON), nodular opacities (NOR), and normal (NOR). These categorizations were introduced by uchiyama2003quantitative . Fig. 4 shows portions of HRCT images for each class.
The DLD image dataset was acquired from Osaka University Hospital, Osaka, Japan. We collected 117 HRCT scans from different subjects. Each slice was converted to gray-scale images with a resolution of 512 × 512 pixels and slice-thickness of 1.0 [mm]. Lung region slices were annotated for their seven types of patterns by experienced radiologists. The annotation region shapes and their labels were the results of diagnoses by three physicians. The annotated CT images were partitioned into regions of interest (ROI) patches, which were 32 32 pixels, corresponding to about 4 [cm]. This is a small ROI size for DCNN input. Thus, we magnified them by 224
224 pixels using bicubic interpolation. Therefore, from these operations, we collected 169 patches for CON, 655 for GGO, 355 HCM, 276 for RET, 4702 for RET, 827 for NOD, and 5726 for NOR. We then divided these patches for DCNN training and for an evaluation, because each class does not contain patches from the same patients. For the training, we used 143 CONs, 609 GGOs, 282 HCMs, 210 RETs, 4406 EMPs, 762 NODs, and 5371 NORs. The remaining 26 CONs, 46 GGOs, 73 HCMs, 66 RETs, 296 EMPs, 65 NODs, and 355 NORs were used for the evaluation.
5.2 Source Domains
Two-stage feature transfer uses both natural image and texture datasets. We used ILSVRC2012 dataset, which is a subset of ImageNet deng2009imagenet , as the natural image dataset. We also used the Columbia-Utrecht Reflectance and Texture Database (CUReT) dana1999reflectance as the texture dataset, as provided by Columbia University and Utrecht University. Fig. 5 shows examples of textural images in CUReT database. The database contains macro photographs of 61 classes of real-world textures. Each class has approximately 200 samples. Each sample was imaged under various combinations of illumination and viewing angles. To train DCNNs, we cropped the textured regions and resized them into 224 224 to accommodate network input.
. We trained the network using momentum stochastic gradient descent with a momentum ofand a dropout rate of . When the network was trained for the first time, we set the learning rate to 0.05. Otherwise, we set the learning rate to 0.0005 because it is reported that small learning rate is preferable for pre-trained networks in tajbakhsh2016convolutional . We trained the network until training loss plateaus, as to steadily converge the network parameters.
For evaluation metrics, we used accuracy, recall, precision, and F1-score. Accuracy is the proportion of correct predictions to the total number of predictions. Recall is the fraction of samples collectively classified over the number of samples of its class. Precision is the fraction of samples correctly classified as class,, over all samples classified as a class .
In our experiment, we compared models from different learning processes, as follows.
Learning a randomly initialized model from scratch in the most naive way (i.e., no feature transfer)
Feature transfer from textural images, i.e., CUReT database
Feature transfer from natural images, i.e., ILSVRC 2012 dataset
Two-stage feature transfer, training the DCNN from ILSVRC 2012 and CUReT, sequentially (proposed)
6.1 Classification Performance
First, we compared the classification performance of each models (1) (4). In addition to this, to reveal the effectiveness of feature transfer, we also compared to fine-tuning models as follows:
Fine-tuning from natural images (ILSVRC 2012 dataset)
Fine-tuning from textural images (CUReT database)
Results are shown in Table 1. Feature transfer models (1) (4) surpass fine-tuning models (a) and (b) at all classification performances. This suggests that the feature representation obtained in natural and textural images does not suit for DLD classification, in other words, feature extraction part ought to be retrained with the target domain. Also, the two-stage feature transfer (4) displays the best performance. However, despite feature transfer, the model (2) using CUReT performed worse than learning from scratch. This implies that CUReT, by itself, is useless as the source domain for conventional feature transfer.
|Transfer||None||single-stage (conventional)||two-stage (proposed)||fine-tuning|
6.2 Model Robustness for amounts of Training data
Moreover, we demonstrated how the robustness of each model, with respect the decrease in the amount of training data, improved. We transitioned the accuracies and losses of the softmax layer (i.e., “Fc8” in Fig. 1) by changing the amount of DLD training samples by the ratio, , from 20[%] to 100[%] 222For example, when and , the amounts of training DLD examples are and , respectively. Proportions of each class are retaining.. Fig. 7 shows the models’ performance comparison. In all cases, two-stage feature transfer showed the best performance for both accuracy and loss, especially in the case of a small training dataset.
Fig. 7 shows the fluctuation of model performance with a decline in the amount of DLDs images. To quantify the degree of model robustness, we assumed that these variations have linearity to the amount of data, and compared slopes,
, of the linear regression model:, where is the percentage of data, and is the intercept coefficient. Clearly, a small absolute value of slope indicates that the model is more robust with . All feature transfer models show better results than learning from scratch, as shown in Table 2. two-stage feature transfer showed the best robustness, both with accuracy and with loss.
|Slopes of accuracies||1.3560||0.9920||0.9475||0.7479|
|Slopes of losses||-0.5054||-0.4938||-0.3221||-0.2230|
7 Analysis of the Feature Extraction
Fig. 8 shows the visualization results of extracted features (i.e., the activation of Fc6 in Fig. 1) for each model, from (1) to (4). Model (1), learned from scratch, did not show salient activities in any region of input. This suggests that the model could not extract meaningful features from inputs because of the lack of training data. Model (2), transferred from textural images, showed activation in the regions where the textural structure appeared (e.g., pits of CON or cyst wall contours of HCM). Alternatively, model (3), transferred from natural images, showed activations in the regions where edge structures appeared (e.g., entire of CON or bottom right of RET, which both colored in blue). It is intuitive, considering that the models trained for natural images show an activation for edge structures (e.g., the object contours and lines), as reported by most studies on the visualization of DCNNs zeiler2014visualizing ; simonyan2013deep ; mahendran2016salient . Interestingly, Model (4), which came from two-stage feature transfer, responded to both edge and textural structures. The models (2) and (3) show the strong responses to the edge and textural regions respectively. In contrast, we can see these models show weak responses to the opposite regions. Given the results of (2) and (3), such feature representations seem to be additively obtained from both natural images and textural domains during two-stage feature transfer. Performance improvements occur because the DCNNs obtain better feature representation, which suits textural patterns with the two-stage feature transfer.
We proposed a two-stage feature transfer, which improved the performance of DCNNs for classification tasks of textural images, as an extension of conventional transfer learning methods, which use a single domain as the source. We applied two-stage feature transfer to the classification of HRCT images of lung diseases and demonstrated that two-stage feature transfer improves classification performance and robustness while decreasing the amount of training data, compared to learning from scratch and conventional transfer learning. To assess these improvements, we analyzed and compared each feature representation using a feature visualization method. Two-stage feature transfer seems to have provided appropriate feature representations for both edge and textural structures transferred from natural images and textural images, respectively. These results indicate the consequence of source domain selection.
This work is partially supported by Grant-in-Aids for Scientific Research KAKENHI (C) 16K00328, and Innovative Areas 16H01452, MEXT, Japan. We really appreciate Prof. Honda, Osaka University Hospital providing the HRCT images of DLDs.
Appendix A DeSaliNet’s Inverse Maps
This appendix provides details of the inverse maps used in DeSaliNet mahendran2016salient . In Eq. 5, each denotes the inverse map of each forward operation, , where the is a layer type of the -th stage. DeSaliNet considers only the case where , otherwise the layers be ignored. This results in an identity map.
- Convolution layer
- Max-pooling layer
The inverse map of the max-pooling layer, , is denoted as
where contains the stored maximum value locations of forward calculation in max-pooling. The pooled map is sparsely restored into a maximum position.
- Rectifying layer
The inverse map of the ReLU layer, , is denoted as
where is stored positive locations in the forward ReLU caculation, where zero was not modulated.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.Biological cybernetics, 36(4):193–202, 1980.
-  Kunihiko Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural networks, 1(2):119–130, 1988.
-  David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106–154, 1962.
-  Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.
-  Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
-  Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  Nima Tajbakhsh, Jae Y Shin, Suryakanth R Gurudu, R Todd Hurst, Christopher B Kendall, Michael B Gotway, and Jianming Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE transactions on medical imaging, 35(5):1299–1312, 2016.
-  Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, Atsuto Maki, and Stefan Carlsson. Factors of transferability for a generic convnet representation. IEEE transactions on pattern analysis and machine intelligence, 38(9):1790–1802, 2016.
-  Hayaru Shouno, Satoshi Suzuki, and Shoji Kido. A transfer learning method with deep convolutional neural network for diffuse lung disease classification. International Conference on Neural Information Processing, pages 199–207, 2015.
-  Mingchen Gao, Ulas Bagci, Le Lu, Aaron Wu, Mario Buty, Hoo-Chang Shin, Holger Roth, Georgios Z Papadakis, Adrien Depeursinge, Ronald M Summers, et al. Holistic classification of ct attenuation patterns for interstitial lung diseases via deep convolutional neural networks. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, pages 1–6, 2016.
-  S. Christodoulidis, M. Anthimopoulos, L. Ebner, A. Christe, and S. Mougiakakou. Multisource transfer learning with convolutional neural networks for lung pattern analysis. IEEE Journal of Biomedical and Health Informatics, 21(1):76–84, Jan 2017.
-  Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
-  Aravindh Mahendran and Andrea Vedaldi. Salient deconvolutional networks. In European Conference on Computer Vision, pages 120–135. Springer, 2016.
-  Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
-  Kunihiko Fukushima. Neural network model for selective attention in visual pattern recognition and associative recall. Applied Optics, 26(23):4985–4992, 1987.
-  Hayaru Shuono and Kunihiko Fukushima. Connected character recognition in cursive handwriting using selective attention model with bend processing. Systems and computers in Japan, 26(10):35–46, 1995.
-  Yoshikazu Uchiyama, Shigehiko Katsuragawa, Hiroyuki Abe, Junji Shiraishi, Feng Li, Qiang Li, Chao-Tong Zhang, Kenji Suzuki, et al. Quantitative computerized analysis of diffuse lung disease in high-resolution computed tomography. Medical Physics, 30(9):2440–2454, 2003.
-  Kristin J Dana, Bram Van Ginneken, Shree K Nayar, and Jan J Koenderink. Reflectance and texture of real-world surfaces. ACM Transactions on Graphics (TOG), 18(1):1–34, 1999.