Transfer Learning for Endoscopic Image Classification

08/24/2016 ∙ by Shoji Sonoyama, et al. ∙ Hiroshima University

In this paper we propose a method for transfer learning of endoscopic images. For transferring between features obtained from images taken by different (old and new) endoscopes, we extend the Max-Margin Domain Transfer (MMDT) proposed by Hoffman et al. in order to use L2 distance constraints as regularization, called Max-Margin Domain Transfer with L2 Distance Constraints (MMDTL2). Furthermore, we develop the dual formulation of the optimization problem in order to reduce the computation cost. Experimental results demonstrate that the proposed MMDTL2 outperforms MMDT for real data sets taken by different endoscopes.




I Introduction

Nowadays in many hospitals, an endoscopic examination (colonoscopy) using a Narrow Band Imaging (NBI) system is widely performed to diagnose colorectal cancer [1], which is one of the major causes of cancer death [2]. During examinations, endoscopists observe and examine a polyp based on its visual appearance, such as NBI magnification findings [3, 4]. To support diagnosis during examinations, a computer-aided diagnosis system based on the texture appearance of polyps would be helpful, and therefore patch-based classification methods for endoscopic images have been proposed [5, 6, 7, 8, 9, 10, 11].

The problem we address in this paper is the inconsistency between training and testing images [12]. As in other common machine learning approaches, the training of classifiers assumes that the distributions of features extracted from the training and testing image datasets are the same. However, different endoscopes may be used to collect the training and testing datasets, which violates this assumption. One reason is the rapid development of medical devices (endoscopes in our case): hospitals may introduce new endoscopes at some point after the training images were taken. Another reason is that classifiers may be trained with a data set collected by a certain type of endoscope in one hospital, while another hospital may want to use the classifiers for images taken by a different endoscope. In general, this kind of inconsistency can degrade the classification performance, so collecting new images for a new training dataset may be necessary. This is however impractical for our medical images: it is not feasible to collect a sufficient set of images for all types and manufacturers of endoscopes.

Figure 2 shows an example of how the appearance of texture differs between endoscope systems. Both images show the same scene, a printed sheet of a colorectal polyp image, taken by different endoscope systems at almost the same distance from the sheet. Even for the same manufacturer (Olympus) and the same modality (NBI), the images differ from each other in resolution, image quality and sharpness, brightness, viewing angle, and so on. This kind of difference may affect the classification performance.

In order to tackle this problem, we have proposed a method [12] based on transfer learning [13, 14, 15, 16] for estimating a transformation matrix between feature vectors of training and testing data sets taken by different (old and new) devices. In this prior work, we formulated the problem as a constrained optimization, and developed an algorithm to estimate the linear transformation. The problem however is that corresponding data sets are required: each test image (taken by a new device) must have a corresponding training image (taken by an old device), and moreover these images must capture the same polyp in order to estimate the linear transformation. This restriction is very strong and even impractical.

In this paper, we propose another method for the task that does not require image-by-image correspondences between the training and test data sets. More specifically, we extend Max–Margin Domain Transfer (MMDT) [17] to use L2 distance constraints as a regularization, and call the result Max–Margin Domain Transfer with L2 Distance Constraints (MMDTL2). This extension is our first contribution. The second contribution is the derivation of the dual problem of the original primal problem, which greatly reduces the computation cost. Experimental results with real endoscopic images show that our proposed method, MMDTL2, outperforms the previous method, MMDT.

Fig. 1: NBI magnification findings [18].
Fig. 2: An example of the difference in appearance between endoscope systems. (a) An image taken by an older system (video system center: OLYMPUS EVIS LUCERA CV-260, endoscope: OLYMPUS EVIS LUCERA CF-H260AZL/I [19]). (b) An image of the same scene taken by a newer system (video system center: OLYMPUS EVIS LUCERA ELITE CV-290, endoscope: OLYMPUS EVIS LUCERA ELITE CF-HQ290ZL/I [20]).

Here we formulate the problem setting. We are given a data set of images taken by an older device, together with the corresponding feature vectors and labels. Following the terminology of domain adaptation, we call this data set the “source” (or source domain) because it is the set or domain from which features are transferred. We are also given another data set of images taken by a newer device, with corresponding features and labels. We call this the “target” (or target domain) because it is the destination to which features are transferred. If the target data set were large enough, we could train classifiers by using the target data set alone as a training set. In our task, however, we assume that the target data set is small compared to the source data set, because images are usually collected before the imaging device is changed, yet we still want good performance on the new device. This is the situation we are currently facing.

II Previous work: MMDT

A method for transfer learning proposed by Hoffman et al. [17], called MMDT, simultaneously estimates the linear transformation between source features and target features and learns the parameters of each binary SVM (one per class in a multi-class problem) by solving the following optimization problem (note that the following form is obtained after introducing slack variables).


Here each binary SVM classifier, one per class, is parameterized by the normal and the bias of its hyperplane. They solve the optimization problem by alternately estimating the linear transformation and the classifier parameters.
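For reference, the MMDT problem of Hoffman et al. [17] can be sketched in soft-margin form as follows. This is a hedged reconstruction rather than the exact display used in this paper: the symbol names ($W$ for the transformation, $\theta_k$ and $b_k$ for the hyperplane of the $k$-th classifier, $\xi$ for the slack variables, $C_S$ and $C_T$ for the source and target penalty weights, $n$ and $m$ for the numbers of source and target samples) are assumptions introduced here.

```latex
\begin{align*}
\min_{W,\,\theta_k,\,b_k,\,\xi}\quad
  & \frac{1}{2}\|W\|_F^2
    + \sum_{k=1}^{K}\Bigl[\frac{1}{2}\|\theta_k\|_2^2
    + C_S \sum_{i=1}^{n} \xi^{s}_{i,k}
    + C_T \sum_{j=1}^{m} \xi^{t}_{j,k}\Bigr] \\
\text{s.t.}\quad
  & y^{s}_{i,k}\bigl(\theta_k^{\top} x^{s}_{i} - b_k\bigr) \ge 1 - \xi^{s}_{i,k}, \\
  & y^{t}_{j,k}\bigl(\theta_k^{\top} W x^{t}_{j} - b_k\bigr) \ge 1 - \xi^{t}_{j,k}, \\
  & \xi^{s}_{i,k} \ge 0,\quad \xi^{t}_{j,k} \ge 0 .
\end{align*}
```

With the classifier parameters fixed, the problem is a quadratic program in $W$ and the target slacks; with $W$ fixed, it decouples into $K$ standard SVM problems, which is what makes the alternating scheme natural.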


A problem with this method is that the estimated linear transformation can be degenerate because the Frobenius norm is minimized. An example is shown in Figure 3, where target distributions of a spherical shape are transformed into a collinear distribution and are not similar to the source distributions. Because of this effect, the margin after the transformation is smaller than the one between classes of the source distribution. We show in the experimental results that this is problematic.
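The degeneracy described above can be illustrated with a small, purely hypothetical example (the sample points and matrices below are made up for illustration and are not taken from the experiments). A rank-one matrix has a smaller Frobenius norm than the identity, yet it collapses every point onto a single line, so a Frobenius penalty alone does nothing to discourage such a collapse.

```python
import math


def apply(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def frobenius_norm(W):
    """Square root of the sum of squared entries of W."""
    return math.sqrt(sum(w * w for row in W for w in row))


# Hypothetical target samples spread in all directions around the origin.
target = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (0.7, 0.7)]

identity = [[1.0, 0.0], [0.0, 1.0]]   # keeps the shape of the distribution
rank_one = [[0.5, 0.5], [0.5, 0.5]]   # projects every point onto the line y = x

transformed = [apply(rank_one, x) for x in target]

# Every transformed point has equal coordinates, i.e. lies on the line y = x.
collinear = all(abs(p[0] - p[1]) < 1e-12 for p in transformed)

print("collinear after transform:", collinear)
print("||rank_one||_F =", frobenius_norm(rank_one))
print("||identity||_F =", frobenius_norm(identity))
```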

Fig. 3: A toy example of a transformation estimated by MMDT and MMDTL2. (left) Source and target distributions of a two-class problem. Red and blue markers are different classes; circles are samples in the source data set, and crosses are samples in the target data set. Triangle markers are transformed target samples (here they are all zero because of the initial value of the transformation). (middle) The transformed target samples are aligned on a line and distributed far away from the source samples. (right) The transformed target samples are close to the source distribution.

III Proposed method: MMDTL2

In order to solve the problem of MMDT described above, we add distance constraints to ensure that the source and transformed target distributions are similar to each other. The optimization problem for estimating the transformation is now formulated as follows:


where the distance term is scaled by a scalar, and each pair of samples entering the distance term has its own weight.
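To make the structure of such an objective concrete, the following sketch evaluates one plausible form of it for a single binary classifier with fixed hyperplane parameters: a Frobenius term, a hinge loss on transformed target samples, and a weighted squared L2 distance between source samples and transformed target samples. Every name here (`mmdtl2_objective`, `theta`, `b`, `S`, `c_f`, `c_t`, `D`) and the exact weighting of the terms are assumptions made for illustration, not the formulation of the paper.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mmdtl2_objective(W, theta, b, Xs, Xt, yt, S, c_f=1.0, c_t=1.0, D=1.0):
    """Assumed objective: Frobenius term + hinge loss + weighted L2 distances.

    S[i][j] is the (assumed) weight between source sample i and target sample j.
    """
    frobenius = sum(w * w for row in W for w in row)
    hinge = sum(
        max(0.0, 1.0 - y * (dot(theta, matvec(W, x)) - b))
        for x, y in zip(Xt, yt)
    )
    distance = sum(
        S[i][j] * squared_distance(matvec(W, xt), xs)
        for i, xs in enumerate(Xs)
        for j, xt in enumerate(Xt)
    )
    return 0.5 * c_f * frobenius + c_t * hinge + 0.5 * D * distance


# Tiny hypothetical problem: one source and one target sample of the same class.
Xs, Xt, yt, S = [(2.0, 0.0)], [(2.0, 0.0)], [1], [[1.0]]
theta, b = (1.0, 0.0), 0.0

value_identity = mmdtl2_objective([[1.0, 0.0], [0.0, 1.0]], theta, b, Xs, Xt, yt, S)
value_zero = mmdtl2_objective([[0.0, 0.0], [0.0, 0.0]], theta, b, Xs, Xt, yt, S)
print(value_identity, value_zero)
```

In this toy setting the all-zero transformation has the smallest Frobenius norm but pays both the hinge loss and the distance penalty, so the distance term pulls the solution away from collapsed transformations.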

For simplicity, we hereafter restrict the last column of the transformation matrix to be zero. Then, by rewriting the problem, we obtain the following standard form of a quadratic program:


where and so that , and


This quadratic program however is not efficient to solve, because its variables include every entry of the transformation matrix, and the cost of solving a quadratic program grows rapidly with the number of variables (e.g., [21]). This means that the primal problem quickly becomes intractable as the feature dimension increases. We therefore next derive the dual of this primal problem in order to reduce the computation cost.

To this end, we first obtain the Lagrangian


where the newly introduced variables are the Lagrange multipliers. We then take the derivatives of the Lagrangian with respect to the primal variables and set them to zero:


After that, we substitute Eqs. (9) and (10) into Eq. (5) and rearrange the result in terms of the dual variables, which finally gives the following quadratic program.


Afterwards, we compute the transformation from the estimated dual variables by using Eq. (12) as follows:




This dual formulation is much smaller than the primal problem, because its number of variables no longer depends on the number of entries of the transformation matrix and is in general much smaller.
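The steps described above (form the Lagrangian, set the derivatives to zero, substitute back) can be illustrated on a generic soft-margin quadratic program. The symbols below ($w$, $V$, $q$, $\phi_j$, $y_j$, $b$, $c$, $\alpha_j$, $\mu_j$) are assumptions chosen for this sketch, with $V$ symmetric positive definite; they are not the quantities of Eqs. (5) to (12), which would have to be matched term by term.

```latex
\begin{align*}
&\text{Primal:} &
  &\min_{w,\,\xi}\ \tfrac{1}{2} w^{\top} V w + q^{\top} w + c \sum_j \xi_j
   \quad \text{s.t.}\quad y_j(\phi_j^{\top} w - b) \ge 1 - \xi_j,\ \ \xi_j \ge 0 \\
&\text{Lagrangian:} &
  &L = \tfrac{1}{2} w^{\top} V w + q^{\top} w + c \sum_j \xi_j
     - \sum_j \alpha_j \bigl( y_j(\phi_j^{\top} w - b) - 1 + \xi_j \bigr)
     - \sum_j \mu_j \xi_j \\
&\text{Stationarity in } w: &
  &\frac{\partial L}{\partial w} = V w + q - \sum_j \alpha_j y_j \phi_j = 0
   \ \Rightarrow\ w = V^{-1}\bigl(a(\alpha) - q\bigr),
   \quad a(\alpha) = \sum_j \alpha_j y_j \phi_j \\
&\text{Stationarity in } \xi_j: &
  &\frac{\partial L}{\partial \xi_j} = c - \alpha_j - \mu_j = 0
   \ \Rightarrow\ 0 \le \alpha_j \le c \\
&\text{Dual:} &
  &\max_{\alpha}\ -\tfrac{1}{2} \bigl(a(\alpha) - q\bigr)^{\top} V^{-1} \bigl(a(\alpha) - q\bigr)
   + \sum_j \alpha_j (1 + y_j b)
   \quad \text{s.t.}\quad 0 \le \alpha_j \le c
\end{align*}
```

In this generic form the dual has one box-constrained variable per margin constraint, and the primal solution is recovered in closed form from the stationarity condition in $w$.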

IV Experimental results

In this section, we show two kinds of experimental results. First, we compare the computation cost of the primal and dual problem formulations. Second, we compare the proposed MMDTL2 with MMDT.

We use two different data sets. The source data set has 400 NBI images (Type A: 200, Type B,C3: 200) taken by an OLYMPUS EVIS LUCERA endoscope [19], and the target data set has 180 NBI images (Type A: 90, Type B,C3: 90) taken by a newer OLYMPUS EVIS LUCERA ELITE endoscope [20].

IV-A Comparison between primal and dual

Table I shows the computation time of the primal and dual formulations for a fixed feature dimension. The setup time counts how long it takes to compute the coefficients of the quadratic program in Eq. (5) for the primal, or in Eq. (11) for the dual. The optimization time is the computation time to solve the quadratic program, and the calculation time is the time to recover the transformation from the dual solution, which is meaningful only for the dual. In this setting, the dual formulation of MMDTL2 is about 12.6 times faster than the primal in total time (6799.74 versus 538.26 seconds).

                primal       dual
setup           402.90     538.18
optimization   6396.84      0.060
calculation        N/A      0.023
total          6799.74     538.26
TABLE I: Computation time (in seconds) of MMDTL2.
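The totals and the speed-up in Table I can be re-derived with a few lines of arithmetic. The sketch below uses only the numbers from the table; the dictionary layout and variable names are illustrative, and the N/A entry for the primal is treated as zero.

```python
# Timings in seconds, copied from Table I (primal "calculation" is N/A -> 0.0).
timings = {
    "primal": {"setup": 402.90, "optimization": 6396.84, "calculation": 0.0},
    "dual": {"setup": 538.18, "optimization": 0.060, "calculation": 0.023},
}

totals = {name: sum(parts.values()) for name, parts in timings.items()}
speedup = totals["primal"] / totals["dual"]
dual_setup_share = timings["dual"]["setup"] / totals["dual"]

print(f"primal total: {totals['primal']:.2f} s")
print(f"dual total:   {totals['dual']:.2f} s")
print(f"speed-up:     {speedup:.1f}x")
print(f"setup share of dual total: {dual_setup_share:.2%}")
```

The last figure shows that almost all of the remaining dual time is spent in setup rather than in the optimization itself.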

IV-B Performance comparison

In this experiment, we compare the performance of the proposed method under the following settings, which are similar to those in [12]. These settings are shown in Figure 4 in terms of 10-fold cross validation.


Baseline

For comparison, we first perform an experiment without any transfer learning by using the source dataset only. In this case, the source dataset is divided into 10 folds, that is, a normal 10-fold cross validation on the source dataset.

Source only

The second setting also does not use any transfer learning, but involves both the source and target datasets. Source images are used for training, and the resulting classifier is simply used to classify target (test) images. We divide the source and target datasets into 10 folds (each source fold corresponds to a target fold). Then 9 source folds are used for training, and the remaining one target fold is used for testing.

Not transfer

The third setting does not use any transfer learning either. Unlike the previous settings, however, target images are also used for training. We divide the source and target datasets into 10 folds, then 18 folds (9 source and 9 target folds) are used for training a classifier. The remaining target fold is used for testing by simply applying the trained classifier. Note that the number of training images is now doubled while the number of testing images remains the same.


MMDT

The fourth setting is MMDT, the existing method. We use 9 source and 9 target folds, 18 folds in total, for training, which estimates both the linear transformation and the classifier. Features in the 9 target folds are transformed by the estimated linear transformation. The remaining target fold is used for testing; its features are transformed in the same way as in training, and then the trained classifier is applied.


MMDTL2

The last setting is MMDTL2, the proposed method. The setting is the same as for MMDT.
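The five settings differ mainly in which folds are used for training and testing. The sketch below reproduces only this data split, using the data set sizes stated earlier (400 source and 180 target images). The function names, the round-robin fold assignment, and the setting labels are assumptions for illustration; feature transformation and classifier training are deliberately left out.

```python
def make_folds(n_items, n_folds=10):
    """Assign item indices 0..n_items-1 to folds in round-robin order."""
    return [list(range(k, n_items, n_folds)) for k in range(n_folds)]


def split(setting, source_folds, target_folds, held_out):
    """Return (source training ids, target training ids, test domain, test ids)."""
    src_train = [i for k, fold in enumerate(source_folds) if k != held_out for i in fold]
    tgt_train = [j for k, fold in enumerate(target_folds) if k != held_out for j in fold]
    if setting == "baseline":
        # Ordinary cross validation inside the source data set.
        return src_train, [], "source", source_folds[held_out]
    if setting == "source only":
        # Train on source folds, test on the held-out target fold.
        return src_train, [], "target", target_folds[held_out]
    if setting in ("not transfer", "MMDT", "MMDTL2"):
        # Train on 9 source + 9 target folds, test on the held-out target fold.
        return src_train, tgt_train, "target", target_folds[held_out]
    raise ValueError(f"unknown setting: {setting}")


source_folds = make_folds(400)
target_folds = make_folds(180)

for setting in ("baseline", "source only", "not transfer", "MMDT", "MMDTL2"):
    src, tgt, domain, test = split(setting, source_folds, target_folds, held_out=0)
    print(f"{setting:12s} train: {len(src)} source + {len(tgt)} target, "
          f"test: {len(test)} {domain}")
```

The last three settings share the same split; they differ only in whether, and how, the target features are transformed before the classifier is trained and applied.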

Fig. 4: Performance with different experiment settings.
Fig. 5: Performance with different experiment settings. The horizontal axis is the number of features.

Figure 5 shows the performance of the different settings. As expected, the source only setting is slightly worse than the baseline because the distributions of the source and target data sets actually differ. The results of MMDT are worse in all cases, which might be because the estimated linear transformations are degenerate. The proposed MMDTL2 is much better than MMDT because of the additional constraints, and is comparable to the baseline in higher dimensions. Also, MMDTL2 is expected to behave better than the not transfer setting when .

V Conclusions

We have proposed MMDTL2, which extends MMDT by adding distance constraints, for transfer learning of medical images. We have also derived the dual formulation of the quadratic program in order to reduce the computation cost. The proposed method still incurs a high computation cost for setting up the quadratic program, so we will further reduce this cost in order to handle feature vectors of much larger dimension.


  • [1] S. Tanaka, T. Kaltenbach, K. Chayama, and R. Soetikno, “High-magnification colonoscopy (with videos),” Gastrointest Endosc, vol. 64, pp. 604–13, Oct 2006.
  • [2] Cancer Research UK, “Bowel cancer statistics.”
  • [3] H. Kanao, S. Tanaka, S. Oka, M. Hirata, S. Yoshida, and K. Chayama, “Narrow-band imaging magnification predicts the histology and invasion depth of colorectal tumors,” Gastrointest Endosc, vol. 69, pp. 631–636, Mar 2009.
  • [4] S. Oba, S. Tanaka, S. Oka, H. Kanao, S. Yoshida, F. Shimamoto, and K. Chayama, “Characterization of colorectal tumors using narrow-band imaging magnification: combined diagnosis with both pit pattern and microvessel features,” Scand J Gastroenterol, vol. 45, pp. 1084–92, Sep 2010.
  • [5] M. Häfner, A. Gangl, M. Liedlgruber, A. Uhl, A. Vécsei, and F. Wrba, “Classification of endoscopic images using Delaunay triangulation-based edge features,” in Image Analysis and Recognition (A. Campilho and M. Kamel, eds.), vol. 6112 of Lecture Notes in Computer Science, pp. 131–140, Springer Berlin Heidelberg, 2010.
  • [6] M. Häfner, A. Gangl, M. Liedlgruber, A. Uhl, A. Vecsei, and F. Wrba, “Endoscopic image classification using edge-based features,” in Pattern Recognition (ICPR), 2010 20th International Conference on, pp. 2724–2727, Aug 2010.
  • [7] R. Kwitt, A. Uhl, M. Häfner, A. Gangl, F. Wrba, and A. Vécsei, “Predicting the histology of colorectal lesions in a probabilistic framework,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pp. 103–110, June 2010.
  • [8] S. Gross, T. Stehle, A. Behrens, R. Auer, T. Aach, R. Winograd, C. Trautwein, and J. Tischendorf, “A comparison of blood vessel features and local binary patterns for colorectal polyp classification,” 2009.
  • [9] T. Stehle, R. Auer, S. Gross, A. Behrens, J. Wulff, T. Aach, R. Winograd, C. Trautwein, and J. Tischendorf, “Classification of colon polyps in NBI endoscopy using vascularization features,” 2009.
  • [10] J. J. W. Tischendorf, S. Gross, R. Winograd, H. Hecker, R. Auer, A. Behrens, C. Trautwein, T. Aach, and T. Stehle, “Computer-aided classification of colorectal polyps based on vascular patterns: a pilot study,” Endoscopy, vol. 42, pp. 203–7, Mar 2010.
  • [11] T. Tamaki, J. Yoshimuta, M. Kawakami, B. Raytchev, K. Kaneda, S. Yoshida, Y. Takemura, K. Onji, R. Miyaki, and S. Tanaka, “Computer-aided colorectal tumor classification in NBI endoscopy using local features,” Medical Image Analysis, vol. 17, no. 1, pp. 78–100, 2013.
  • [12] S. Sonoyama, T. Hirakawa, T. Tamaki, T. Kurita, B. Raytchev, K. Kaneda, T. Koide, S. Yoshida, Y. Kominami, and S. Tanaka, “Transfer learning for bag-of-visual words approach to NBI endoscopic image classification,” in Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE, pp. 785–788, Aug 2015.
  • [13] S. J. Pan and Q. Yang, “A survey on transfer learning,” Knowledge and Data Engineering, IEEE Transactions on, vol. 22, no. 10, pp. 1345–1359, 2010.
  • [14] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: transfer learning from unlabeled data,” in Proceedings of the 24th international conference on Machine learning, pp. 759–766, ACM, 2007.
  • [15] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, “Boosting for transfer learning,” in Proceedings of the 24th international conference on Machine learning, pp. 193–200, ACM, 2007.
  • [16] D. Silver, G. Bakir, K. Bennett, R. Caruana, M. Pontil, S. Russell, and P. Tadepalli, “NIPS workshop on “Inductive transfer: 10 years later”,” Whistler, Canada, 2005.
  • [17] J. Hoffman, E. Rodner, J. Donahue, K. Saenko, and T. Darrell, “Efficient learning of domain-invariant image representations,” in International Conference on Learning Representations, 2013.
  • [18] H. Kanao, S. Tanaka, S. Oka, M. Hirata, S. Yoshida, and K. Chayama, “Narrow-band imaging magnification predicts the histology and invasion depth of colorectal tumors,” Gastrointestinal endoscopy, vol. 69, no. 3, pp. 631–636, 2009.
  • [21] Y. Ye and E. Tse, “An extension of Karmarkar’s projective algorithm for convex quadratic programming,” Mathematical Programming, vol. 44, no. 1-3, pp. 157–179, 1989.