In many hospitals, endoscopic examinations (i.e., colonoscopies) using narrow band imaging (NBI) systems are widely performed to diagnose colorectal cancer , which is a major cause of cancer deaths worldwide . During examinations, endoscopists observe and examine a polyp based on its visual appearance, including via NBI magnification findings [3, 4], as shown in Figure 1. To support proper diagnosis during examinations, a computer-aided diagnostic system based on the textural appearance of polyps would be helpful; thus, numerous patch-based classification methods for endoscopic images have been proposed [5, 6, 7, 8, 9, 10, 11].
This paper focuses on the inconsistencies between training and test images. As with other frequently used machine learning approaches, training classifiers assumes that the distribution of features extracted from both training and test image datasets are the same; however, different endoscope systems may be used to collect training and test datasets, causing such an assumption to be violated. Further, given the rapid development of medical devices (i.e., endoscopies in this case), hospitals can introduce new endoscopes after training images have already been taken. In addition, classifiers may be trained with a training dataset collected by a certain type of endoscope in one hospital, while another hospital might use the same classifiers for images taken by a different endoscope. In general, such inconsistencies lead to the deterioration of classification performance; hence, collecting new images for a new training dataset may be necessary or is at least preferred. However, this is not the case with medical images. It is impractical to collect enough sets of images for all types and manufacturers of endoscopes. Active learning deals with small samples; however, this method is not helpful because it selects small samples from unlabeled large training samples.
Figure 2 shows an example of differences between textures captured by different endoscope systems. More specifically, the images shown in Figures 2(a) and 2(b) are the same scene from a printed sheet of a colorectal polyp image taken by different endoscope systems at approximately the same distance to the sheet from the endoscopes. Even for the same manufacture (e.g., Olympus) and the same modality (e.g., NBI), images may differ in terms of factors such as resolution, image quality, sharpness, brightness, and viewing angle. These differences may impact classification performance.
To address this problem, Sonoyama et al. 
proposed a method based on transfer learning[14, 15, 16, 17]
to estimate a transformation matrix between feature vectors of training and test datasets captured by different (i.e., old and new) devices. In this prior study, we formulated the problem as a constraint optimization problem and developed an algorithm to estimate a linear transformation; however, a key limitation is that corresponding datasets are required, i.e., each test image (i.e., taken by a new device) must have a corresponding training image (i.e., taken by an old device). Further, these images must capture the same polyp to properly estimate the linear transformation. These restrictions are rather strong, causing our system to be somewhat impractical.
Therefore, this paper proposes an improved method for a task that does not require image-by-image correspondences between training and test datasets. More specifically, we extend the transfer learning method proposed by Hoffman et al. [21, 22], called maximum margin domain transfer (MMDT). Their approach was a domain adaptation technique to handle the domain shift problem, in which distributions of classes (or categories) in one domain, called the source, change in another domain, called the target. This situation occurs in various applications, and hence, domain adaptation and transfer learning have already been widely studied.
Compared to previous studies, MMDT had the following advantages: (1) applicability to multiclass problems with one-vs-rest support vector machines (SVMs); (2) the ability to use different feature dimensions in the source and target domains; (3) the ability to handle unlabeled target samples by estimating global class-independent affine transformation matrix ; and (4) scalability to the number of constraints, i.e., MMDT solves constraints as compared to in the previous studies, where is the number of classes and and are the feature dimensions of the source and target domains.
In this paper111A conference version of this paper was presented . This paper extends that version with more rigorous derivations, the compact form of the dual formulation, and the kernelization of the method, as well as experiments from different aspects., we therefore propose a non-trivial extension to MMDT for handling inconsistencies between NBI endoscopic devices. We summarize the contributions of this paper as follows.
First, we add distance constraints to MMDT; our method is thus called MMDTL2. The original formulation of MMDT uses the Frobenius norm of transformation as a regularizer; however, pulling transformation
into a zero matrix is not intuitive and might not have a good effect on transferring samples. Other regularizers were discussed by
, e.g., an identity matrix when, but no examples were given for cases of . For the latter cases, the target samples in one category should be transformed into the same category as that of the source domain. Therefore, we propose using the distances between the source and transformed target samples to regularize .
Third, we derive the dual QP problem of MMDTL2, where the QP form mentioned above is the primal form. The computational costs of MMDT and the primal QP form of MMDTL2 can be very large for samples with large dimensions because:
for MMDT, affine transformation matrix , which is of size (where and are the feature dimensions of the source and target domains), can be very large;
for the primal QP form of MMDTL2, a matrix of size appears in the computation, which is much larger than .
In contrast, our derived dual QP form of MMDTL2 is more scalable because it involves a matrix, where is the number of target samples and is the number of classes. Typically, is limited or much smaller than the number of source samples, which is common and therefore reasonable for domain adaptation problems, including our problem of NBI device inconsistency.
Finally, we show that the primal QP solution can be converted from the dual QP solution with much lower computational cost. With our derived formula, this conversion needs matrices of size at most, while without elaboration, a matrix of size appears in the conversion.
In addition, we derive a kernelization of the dual QP formulation, which enables us to use a nonlinear transformation as for better performance.
Note that our dual form is different from the dual form derived by Rodner et al. 
. Their motivation was to make MMDT scalable in terms of the number of target samples, because in their study, they attempted to adapt large datasets such as ImageNet. Therefore, their dual form still suffers from the large feature dimensions of the source and target samples. In contrast, our formulation is scalable in terms of feature dimensions.
The rest of the paper is organized as follows. We formulate problems of MMDT and MMDTL2 in Section 2, and then derive the primal form in Section 3. In Section 4, we show the dual form, and in Section 5, we obtain the primal solution from the dual solution. In Section 6, we present our experimental results using datasets of actual NBI endoscopic images. Finally, in Section 7, we conclude this paper and provide avenues for future work.
Ii Problem formulation
In this section, we introduce the problems of MMDT  and our proposed MMDTL2.
Problem 1 (Mmdt).
Suppose we are given a training set in the source domain and another set in the target domain for a -class classification problem.
MMDT solves the following optimization problem:
where and are weights and
are hinge loss functions.
Here, is an SVM hyperplane parameter
are hinge loss functions. Here,
is an SVM hyperplane parameter (including weightsand bias , i.e., for the th class stacked into a matrix , is a label in terms of the th hyperplane, and and are weights.
Note that denotes an augmented vector with 1 as its last element, i.e., .
In this problem, we simultaneously obtain SVM classifiers and transformation . One-vs-all SVMs are used for multiclass classification; thus, hyperplane parameters are obtained in the source domain. Target samples are transformed by from the target domain to the source domain, and then the loss function causes them to be classified by the SVMs.
Because this problem is non-convex, an alternating optimization approach was used in .
Problem 2 (MMDT with iteration).
MMDT solves problem 1 by iteratively solving subproblems
As noted in the Introduction, the use of Frobenius norm is not intuitive for the transformation matrix. Further, it might not be a good choice for small values of because the obtained solution is pulled toward a zero matrix; however, the use of large values of impacts the SVM subproblem (5) because C-SVM solvers are known to be unstable for large values of parameter .
In this paper, we therefore propose the following problem, which we call MMDTL2. MMDTL2 incorporates additional constraints of distances to pull target samples to source samples of the same category.
Problem 3 (Mmdtl2).
MMDT with constraints solves problem 1 by iteratively solving subproblems
with the same initialization as problem 2, where
where is a weight between samples and and is the balance weight.
MMDTL2 reduces to MMDT when and .
We add weight to the first term of Eq. (7) because is used for both Eqs. (7) and (8). In both equations, modifies the balance between the first and second terms. Therefore, without , the first terms of Eqs. (7) and (8) have the same weight, which might not be suitable in general. We will investigate the effect of in section 4.4.
Iii Proposed method
In this section, we highlight the main results only. Full derivations and proofs are given in the appendix.
Iii-a Primal problem
First, we rephrase subproblem (7) with inequality constraints instead of loss functions.
Problem 4 (Estimation of ).
We want to find that minimizes the following objective function:
Iii-B Primal QP problem
The problem can be written in the form of a canonical QP problem.
The problem can be written as
where variables are defined in the proof.
This primal QP problem involves very large matrices that are impractical to compute. More precisely, is a matrix of size , which can be very large when the dimensions of features ( and ) are large. Next, we therefore derive the dual form of the problem, which we expect to be less expensive to compute.
Iii-C Dual problem
The dual of the problem is given in the corollary below.
The dual form of the original primal problem is given by
Iii-D Retrieving the primal solution
After solving the dual problem with a QP solver, we need to convert the dual solution to and by
then finally to .
Here, we again face the problem of large matrix , that should not be used. In the corollary below, we show primal solution directly, i.e., avoiding conversions from to , and then to .
The solution to the primal problem is given by
where is element-wise multiplication and variables are given in the proof.
We derive the kernel version of the dual formulation. To do so, we apply to target sample by multiplying it from the left, i.e., ; Therefore, all computations with target samples are inner products, which means we can use kernels to replace the inner products.
To transform target sample with , we have
Iv Results and discussions
In this section, we show experimental results by using datasets consisting of two NBI endoscopic devices to understand how our proposed method behaves given differing numbers of target domain training samples.
We used the following parameter values throughout the experiments unless stated otherwise: for MMDTL2 based on our preliminary experiments, and and for MMDT, as used in the code of the original MMDT [21, 25, 22].
Iv-a NBI endoscopic image dataset
The NBI dataset used in this experiment consisted of two domains. For the first (i.e., source) domain, we used the NBI image dataset consisting of 908 NBI patches collected from endoscopic examinations at Hiroshima University by using OLYMPUS EVIS LUCERA endoscope system ; patches were labeled based on NBI magnification findings [3, 4], which categorizes appearances of tumors into types A, B, and C, with type C further sub-classified into C1, C2, and C3 based on microvessel structures (see Figure 1). In this study, we used only types A, B, and C3 in accordance with our previous work [11, 26, 13, 27]. In general, a patch is trimmed from a larger frame of the entire endoscopic image such that the trimmed rectangular region represents the typical texture pattern of the colorectal polyp appearing in the frame. To align the size of patches between source and target domains, we further trimmed the center of each patch to and created 734 patches with 289 in type A, 365 in type B, and 80 in type C3. Note that this study was conducted with the approval from the Hiroshima University Hospital ethics committee, and informed consents were obtained from the patients and/or family members for the endoscopic examinations.
For the second (i.e., target) domain, we used another NBI image dataset consisting of 279 patches with 92 in type A and 187 in type B. These images were taken by OLYMPUS EVIS LUCERA ELITE endoscope system , which is a newer model than that of the system of the source domain. Due to the limited number of endoscopic examinations using this newer endoscope system, we trimmed the center square to from video frames of 41 NBI examination videos; hence, there are two factors of domain shift here: (1) the NBI endoscopic devices and (2) the differences between still images (i.e., source) and video frames (i.e., target). Note that type C3 samples obtained using the new endoscopy are not yet available because the number of type C3 samples, which corresponds to a developed cancer , is smaller in general, and the new endoscopy of our facility has not yet had the opportunity to capture an image of a developed cancer of this type.
From these two domains, we computed convolutional neural network (CNN) features extracted using CaffeNet; more specifically, this is the fc6 feature of 4096 dimensions, which is known to work well for many tasks . These features were used without dimensionality reduction.
To see the effect of the number of target samples, we prepared training/test sets as follows. First, we randomly split the source and target domain samples into half; we therefore had a source training set of 367 source samples and a target training set of 139 target samples. Next, we kept a specified number of target samples per category (up to 40) in each of the target training sets, discarding the rest. For the test set, we used 140 target samples. We created 10 training/test sets and reported average performance.
Figure 3 shows performance results of different methods over the different numbers of target training samples. As a baseline, we also show two SVM results, i.e., source SVM and target SVM. Source SVM estimates an SVM classifier with source and target training samples (and only source samples when no target samples are used). Target SVM does the same, but only with target training samples. MMDT results were obtained using the code provided by , just as in our first experiment. There were three results for MMDTL2, i.e., a linear version and two kernel versions with RBF and polynomial kernels.
Even with no target samples, source SVM performed relatively better and increased its performance as the number of target training samples increased. Target SVM started below 70%, but caught up to source SVM when 20 training target samples were given. The results indicate that our proposed MMDTL2 behaves between source and target SVMs. When one or two target samples are given, MMDTL2 behaves similarly to target SVM and far below the source SVM, but it quickly gains from the benefits of adaptation. With the RBF kernel, MMDTL2 is the best when 10 and 15 training samples are given, but the linear and polynomial kernels become better when more samples are given. MMDTL2 with the linear kernel and target SVM approach one another, which is expected because a sufficient number of target training samples are considered to be the best for classifying target domain samples. Overall, MMDTL2 with a polynomial kernel works the best.
|1||77.31 1.05 ++||69.01 5.90||68.30 3.10||70.28 5.31||65.60 7.46||70.99 4.81|
|2||78.01 1.34 ++||61.99 3.52||79.65 4.22||68.65 3.18 ++||59.65 4.16||71.28 2.96 ++|
|3||80.50 1.67 ++||67.30 4.98||75.67 5.11 ++||72.13 5.01||70.07 5.63||73.76 4.91 +|
|4||82.69 1.94 ++||71.21 5.98||76.38 4.92 ++||74.54 5.80||73.12 6.38||75.96 5.38|
|5||84.33 1.69 ++||77.80 3.39||77.87 4.17||81.91 2.68 +||81.99 2.69 +||82.62 2.58 ++|
|10||89.15 0.97 ++||85.67 1.83||77.23 3.52||88.09 1.73 +||90.07 2.22 ++||88.09 1.76 +|
|15||91.21 0.64||90.21 1.39||80.00 3.06||91.21 1.33||93.40 0.89 **++||91.56 1.02|
|20||91.77 0.61||92.41 0.93||87.73 2.49||93.55 0.74 **+||92.84 0.58 **||93.90 0.69 **++|
|25||91.99 0.59||92.77 0.78 *||86.74 3.23||93.12 0.69 **||92.55 0.70||93.90 0.66 **++|
|30||92.34 0.56||93.48 0.63 **||87.09 2.23||93.76 0.30 **||92.91 0.76||94.26 0.40 **++|
|35||92.62 0.62||93.97 0.53 **||91.21 0.89||94.26 0.48 **||92.91 0.63||95.18 0.48 **++|
|40||93.26 0.57||94.61 0.35 **||90.07 1.05||94.96 0.48 **||92.62 0.65||95.96 0.34 **++|
To make the discussion above more quantitative, we performed the one-tailed Welch’s unequal variances-test [30, 31] to show the statistical significance of the proposed method with respect to sourceSVM and targetSVM. Table I shows the same performance results presented in Figure 3, but with results of the statistical test. Tests with 1% and 5% significance levels (i.e., and ) are indicated by and , respectively, with respect to sourceSVM. Similarly, and indicate the significance of test results with respect to targetSVM. MMDTL2 with a polynomial kernel is better than sourceSVM and targetSVM when ( is is the number of target samples given per category). MMDTL2 with the RBF kernel works well only when . MMDTL2 is better than sourceSVM when .
Iv-B NBI endoscopic image dataset with high-dimension features
Figure 4 shows performance results of the given methods as in Figure 3 with the same protocols for training and evaluation. The difference here is the set of features used; more specifically, we use the conv3 features of 64,896 dimensions rather than fc6. For NBI patch classification problems , we have seen that conv3 features worked better than fc6 features. Therefore we choose conv3 features for this experiment. However, transformation matrix for the conv3 features could be very large without our efficient dual formulation. The ability to handle such large dimensions of features is the key advantage of our proposed dual MMDTL2. In contrast, it is not possible to train MMDT because it involves a matrix of size for storage (approximately 31 GB for double-type elements) and requires more space for working memory. The test phase is also impractical because MMDT uses expertly for converting target samples. Figure 4 shows that the sourceSVM is the best when a few target samples are available, just as in Figure 3. Further, our proposed MMDTL2 with a polynomial kernel becomes better at the right half of the plot, and the differences between sourceSVM and targetSVM are much more significant than those in Figure 3.
Table II shows the same performance results as shown in Figure 4 along with the results of the one-tailed Welch’s -test [30, 31]. The meaning of the marks is the same as that in Table I. Again, MMDTL2 with a polynomial kernel is better than sourceSVM and targetSVM when . MDTL2 with the linear kernel becomes better than sourceSVM and targetSVM when .
|1||83.97 1.25 ++||55.18 6.74||57.23 5.67||52.41 7.46||51.70 6.60|
|2||83.76 1.02 ++||50.50 3.98||56.95 4.39 ++||52.27 4.43||52.84 4.39|
|3||84.75 0.84 ++||56.38 4.13||62.13 3.84 ++||60.28 4.04||60.07 4.04|
|4||85.82 1.12 ++||63.33 5.08||69.01 5.06 +||68.94 5.42||68.09 5.26|
|5||86.74 0.96 ++||68.09 3.39||75.60 2.75 ++||78.94 2.64 ++||75.67 2.89 ++|
|10||89.72 0.78 ++||77.02 1.53||81.84 1.65 ++||90.28 2.41 ++||84.54 1.74 ++|
|15||90.78 0.83 ++||86.03 1.01||88.44 1.09 ++||92.62 1.06 **++||91.28 1.07 ++|
|20||91.63 0.78 ++||89.15 1.14||91.13 1.10 ++||92.70 0.50 **++||93.69 0.91 **++|
|25||92.48 0.64 ++||90.71 1.14||92.27 0.91 ++||91.63 0.98||94.18 0.88 **++|
|30||93.12 0.83||92.06 1.19||93.19 0.91||94.04 0.92 ++||94.75 0.88 **++|
|35||93.40 0.69||93.62 0.90||94.68 0.68 **+||92.55 1.03||96.03 0.62 **++|
|40||93.76 0.68||94.61 0.69||95.46 0.69 **+||93.40 1.00||96.52 0.52 **++|
Iv-C NBI endoscopic image dataset with features of different dimensions
Figure 5 shows another set of performance results. Here, we use different feature dimensions for source and target samples. In particular, we use the conv3 features of 64,896 dimensions for the source samples and the conv5 features of 43,264 dimensions for the target samples. In this case, MMDT does not work because of its high memory cost as described in the previous section, and furthermore, sourceSVM cannot be used because of the difference of feature dimensions. Table III shows the same performance results along with the results of the one-tailed Welch’s -test [30, 31] using the same marks as those in Table I. MMDTL2 with a polynomial kernel is better than targetSVM again, when . MMDTL2 with the linear and RBF kernels becomes better than targetSVM when .
|1||64.33 5.86||68.30 5.39||48.44 5.04||50.43 5.32|
|2||55.67 4.81||61.56 4.70 +||47.45 5.69||47.66 5.68|
|3||55.39 4.40||61.13 4.83 +||55.18 5.46||54.54 5.32|
|4||64.18 5.97||68.30 5.90||64.33 7.13||63.62 7.03|
|5||69.22 4.51||72.84 3.91||75.53 3.68 ++||70.85 4.46|
|10||79.93 2.21||83.40 2.28 ++||89.65 2.34 ++||83.69 2.19 ++|
|15||86.24 1.41||88.01 1.28 +||92.98 1.28 ++||88.87 1.35 ++|
|20||88.65 1.16||90.07 1.21 +||93.33 0.75 ++||91.63 0.98 ++|
|25||89.43 1.17||90.78 1.21 +||93.19 0.82 ++||92.48 0.89 ++|
|30||91.28 1.32||92.13 1.14||92.91 0.66 ++||93.19 0.99 ++|
|35||92.48 0.69||92.91 0.73||92.84 0.45||94.04 0.45 ++|
|40||93.55 0.55||93.76 0.57||92.84 0.84||94.61 0.48 ++|
Iv-D Effects of parameter values
We introduced parameters and for our MMDTL2 formulation in Section 2. Figure 6 shows performance results for different values of , whereas other parameters are fixed to their default values. The results are almost identical when ; however, the performance becomes unsatisfactory for smaller values of . This is because the inverse of is used in the formulation of MMDTL2 (see the Appendix). This suggests that a small amount of regularization for might help improve performance. Note that the instability of the results when may be due to the fact that variables of single precision have 8 digits of precision, and the inverses of smaller values lead to numerical instability. Therefore, we can say that results are insensitive to the choice of as long as larger values are used.
Figure 7 shows the performance results for different values of , whereas other parameters are fixed to their default values. To see the differences between the different values, the results for 20 target samples are shown in Figure 8. The results are almost identical when ; however, the performance decreases for . A detailed investigation of this effect is one of our future tasks; however, it is clear that the performance is better and stable for larger values of .
Iv-E Computation time
Figure 9 shows the computation time for the training phases of each method over different feature dimensions () using MATLAB implementations on an Intel Core i7 3.4 GHz CPU with 16 GB memory. The primal MMDTL2 involves quite a large matrix for , and it is not practical when in terms of memory and computation time. The cost of the dual MMDTL2 depends on the number of target samples; however, it is quite fast compared to the primal MMDTL2 for both 10 and 40 target training samples. The dual MMDTL2 can deal with much higher feature dimensions, while MMDT exceeds the memory limitation when because matrix must be explicitly stored.
In this paper, we proposed MMDT with constraints, i.e., MMDTL2, deriving the dual formulation with much lesser computational costs as compared to the naive QP problem. Further, we showed the kernelization of our method. Experimental results with NBI datasets from two different endoscopic devices showed that our proposed MMDTL2 with linear and polynomial kernels performed better than the given baselines (i.e., source and target SVMs). Our future work includes using other loss functions for problem formulation. We observed that the one-vs-rest multiclass classification by SVMs was a performance bottleneck of MMDTL2 in our experiments. Therefore, instead of relying on maximum margin loss functions, multiclass logistic loss might be better here. In the future, we plan to explore this idea and report performance results for the NBI dataset as well. In addition, we plan to investigate parameter tuning with cross validation and compare the proposed method with other adaptation methods such as that in .
Appendix A Primal problem
In this section, we rephrase subproblem (7) with inequality constraints instead of loss functions.
Problem 4 (Estimation of ).
We want to find that minimizes the following objective function:
First, we rewrite the objective function in a matrix form. To this end, we introduce the vec operator and some formulas below.
A-a Operator vec
Here, we define a vectorized operator for rearranging matrix-vector products.
For a given matrix , denoted by a set of row vectors as
we define operator , which vectorizes in the row-major order as
This definition is different from the one used in the literature, which is defined in the column-major order, for example, in .
Next, we can rewrite matrix-vector multiplications using the vec operator, as summarized in the following lemma.
For given matrix and vectors and , the following equations hold:
, is an identity matrix
and is the tensor product.
is the tensor product.
First, we have
Using this equation, we have
Also, we have
For later use, we also define
A-B Rewriting terms with vec operator
In this subsection, we rewrite the term in the objective function using the lemma below.
The constraint term of MMDTL2 can be written as
where , , and are given in the proof below.
A single term can be rewritten with lemma 1 as
By summing the terms with weights, we have
Here, , , and are the corresponding factors; we further rewrite them into the simpler forms shown below.
Here, we use data matrices
Next, we rewrite the conditions in the problem as shown below.
The condition in problem 4 can be written as
A-C Primal QP problem
In this subsection, we write the problem in the form of a canonical QP problem.
Problem 4 can be written as