# Multiple-Instance Learning by Boosting Infinitely Many Shapelet-based Classifiers

We propose a new formulation of Multiple-Instance Learning (MIL). In typical MIL settings, a unit of data is given as a set of instances called a bag, and the goal is to find a good classifier of bags based on the similarity to a single shapelet (or pattern) or to finitely many shapelets, where the similarity of a bag to a shapelet is the maximum similarity of the instances in the bag. Classifiers based on a single shapelet are not sufficiently strong for certain applications. Additionally, previous work with multiple shapelets has heuristically chosen some of the instances as shapelets, with no theoretical guarantee of generalization ability. Our formulation provides a richer class of final classifiers based on infinitely many shapelets. We provide an efficient algorithm for the new formulation, in addition to a generalization bound. Our empirical study demonstrates that our approach is effective not only for MIL tasks but also for Shapelet Learning for time-series classification.


## 1 Introduction

Multiple-Instance Learning (MIL) is a fundamental framework of supervised learning with a wide range of applications, such as prediction of molecule activity and image classification. Since the notion of MIL was first proposed by Dietterich et al. (1997), MIL has been extensively studied in both theoretical and practical aspects (Gärtner et al., 2002; Andrews et al., 2003; Sabato and Tishby, 2012; Zhang et al., 2013; Doran and Ray, 2014; Carbonneau et al., 2018).

A standard MIL setting is described as follows: a learner receives sets called bags, each of which contains multiple instances. In the training phase, each bag is labeled but the instances are not labeled individually. The goal of the learner is to obtain a hypothesis that predicts the labels of unseen bags correctly (although there are settings where instance-label prediction is also considered, we focus only on bag-label prediction in this paper). One of the most common hypotheses used in practice has the following form:

$$h_{u}(B) = \max_{x \in B} \langle u, \Phi(x) \rangle, \qquad (1)$$

where $\Phi$ is a feature map and $u$ is a feature vector which we call a *shapelet*. In many applications, $u$ is interpreted as a particular “pattern” in the feature space and the inner product $\langle u, \Phi(x) \rangle$ as the similarity of $x$ from $u$. Note that we use the term “shapelet” following the terminology of Shapelet Learning, a framework for time-series classification, although it is often called a “concept” in the MIL literature. Intuitively, this hypothesis evaluates a given bag by the maximum similarity between the shapelet $u$ and the instances in the bag.

Multiple-Instance Support Vector Machine (MI-SVM), proposed by Andrews et al. (2003), is a widely used algorithm that uses this hypothesis class and learns $u$. It is well known that MIL algorithms using this hypothesis class perform well in practice on various multiple-instance datasets. Moreover, a generalization error bound for the hypothesis class is given by Sabato and Tishby (2012).
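As a concrete illustration of hypothesis (1), the following sketch evaluates a bag by the maximum kernelized similarity to a shapelet represented by a kernel expansion $u = \sum_z \alpha_z \Phi(z)$ (so that $\langle u, \Phi(x) \rangle = \sum_z \alpha_z K(z,x)$). The function names `gaussian_kernel` and `h_u` are ours, not from the paper, and the Gaussian kernel is one illustrative choice:

```python
import numpy as np

def gaussian_kernel(z, x, gamma=1.0):
    # K(z, x) = exp(-gamma * ||z - x||^2)
    return float(np.exp(-gamma * np.sum((np.asarray(z) - np.asarray(x)) ** 2)))

def h_u(bag, alphas, anchors, gamma=1.0):
    # With u = sum_z alpha_z Phi(z), the score <u, Phi(x)> equals
    # sum_z alpha_z K(z, x); the bag score is the max over its instances.
    return max(
        sum(a * gaussian_kernel(z, x, gamma) for a, z in zip(alphas, anchors))
        for x in bag
    )
```

A bag is scored by its single most shapelet-like instance; all other instances are ignored, which is exactly why a single shapelet can be too weak a classifier.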

However, in some domains such as image recognition and document classification, the hypothesis class of (1) is said to be not effective enough (see, e.g., Chen et al., 2006). To employ MIL in such domains more effectively, Chen et al. (2006) propose to use a convex combination of various shapelets in a finite set $U$, defined based on all instances that appear in the training sample:

$$g(B) = \sum_{u \in U} w_u \max_{x \in B} \langle u, \Phi(x) \rangle, \qquad (2)$$

where $w$ is a probability vector over $U$. They demonstrate that this hypothesis with the Gaussian kernel performs well in image recognition. However, no theoretical justification is known for the hypothesis class of type (2) with the finite set $U$ made from the empirical bags. By contrast, for sets $U$ of infinitely many shapelets with bounded norm, the generalization bounds of Sabato and Tishby (2012) are applicable to the hypothesis class (2) as well, but their result does not provide a practical formulation such as MI-SVM.

### 1.1 Our Contributions

In this paper, we propose an MIL formulation with the hypothesis class (2) for sets of infinitely many shapelets. More precisely, we formulate a 1-norm regularized soft margin maximization problem to obtain linear combinations of shapelet-based hypotheses.

Then, we design an algorithm based on Linear Programming Boosting (LPBoost; Demiriz et al., 2002) that solves the soft margin optimization problem via a column generation approach. Although the sub-problems (weak learning problems) become optimization problems over an infinite-dimensional space, we show that an analogue of the representer theorem holds and allows us to reduce each of them to a non-convex optimization problem (a difference-of-convex program, DC program for short) over a finite-dimensional space. While it is difficult to solve the sub-problems exactly due to non-convexity, various techniques have been investigated for DC programs (e.g., Tao and Souad, 1988; Yu and Joachims, 2009), and in practice good approximate solutions can be found efficiently in many cases.

Furthermore, we prove a generalization error bound of hypothesis class (2) with infinitely large sets $U$. In general, our bound is incomparable with those of Sabato and Tishby (2012), but ours has a better rate in terms of the sample size $m$.

We introduce an important application of our result: shapelet learning for time-series classification (details are shown later). In the time-series domain, most shapelet learning algorithms have been designed heuristically. Our proposed algorithm is thus the first algorithm for shapelet learning in time-series classification with a theoretical guarantee on generalization performance.

Finally, the experimental results show that our approach performs favorably against a baseline for shapelet-based time-series classification tasks and outperforms baselines on several MIL tasks.

### 1.2 Comparison to Related Work

There are many MIL algorithms with hypothesis classes different from (1) or (2) (e.g., Gärtner et al., 2002; Zhang et al., 2006; Chen et al., 2006). Many of them adopt approaches to the bag-labeling hypothesis that differ from shapelet-based classifiers (e.g., Zhang et al. (2006) use a Noisy-OR based hypothesis, and Gärtner et al. (2002) propose a new kernel called a set kernel).

Sabato and Tishby (2012) proved generalization bounds of hypothesis classes for MIL, including those of (1) and (2) with infinitely large sets $U$. They also proved PAC-learnability of the class (1) using a boosting approach under some technical assumptions. Their boosting approach differs from ours in that they assume the labels are consistent with some hypothesis of the form (1), while we consider arbitrary distributions over bags and labels.

It is known that other boosting-based methods achieve successful results on several MIL tasks (Auer and Ortner, 2004; Andrews and Hofmann, 2004; Zhang et al., 2006). They use hypothesis classes different from ours, under different assumptions.

### 1.3 Connection between MIL and Shapelet Learning for Time Series Classification

Here we briefly mention that MIL with type (2) hypotheses is closely related to Shapelet Learning (SL), a framework for time-series classification that has been extensively studied (Ye and Keogh, 2009; Keogh and Rakthanmanon, 2013; Hills et al., 2014; Grabocka et al., 2014) in parallel to MIL. SL is a notion of learning with a particular method of feature extraction, defined by a finite set $M$ of real-valued “short” sequences called shapelets and a similarity measure $K$ (not necessarily a Mercer kernel) in the following way. A time series $\tau$ can be identified with a bag $B_\tau$ consisting of all subsequences of $\tau$ of length $\ell$. Then, the feature of $\tau$ is a vector of a fixed dimension regardless of the length of the time series $\tau$. When we employ a linear classifier on top of the features, we obtain a hypothesis of the form

$$g(\tau) = \sum_{z \in M} w_z \max_{x \in B_\tau} K(z, x), \qquad (3)$$

which is essentially the same form as (2), except that finding good shapelets $M$ is a part of the learning task, as well as finding a good weight vector $w$. This is one of the most successful approaches to SL (Hills et al., 2014; Grabocka et al., 2014, 2015; Renard et al., 2015; Hou et al., 2016), where a typical choice of $K$ is based on the Euclidean distance. However, almost all existing methods choose shapelets heuristically and have no theoretical guarantee on how good the choice of $M$ is.

Note also that in the SL framework, each $z \in M$ is called a shapelet, while in this paper, we assume that $K$ is a kernel and call any $u$ in the Hilbert space (not necessarily $u = \Phi(z)$ for some $z$) a shapelet.

Curiously, although MIL and SL share similar motivations and hypotheses, the relationship between them has not yet been pointed out. From the shapelet perspective in MIL, hypothesis (1) is regarded as a “single-shapelet” hypothesis, and hypothesis (2) as a “multiple-shapelet” hypothesis. We refer to a linear combination of maximum similarities based on shapelets, such as (2) and (3), as a shapelet-based classifier.
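The reduction from SL to MIL described above can be sketched directly: slide a window of length $\ell$ over the series to build the bag, then evaluate hypothesis (3). The function names `subsequence_bag` and `g_ts` are ours, and the Gaussian kernel stands in for the similarity $K$:

```python
import numpy as np

def subsequence_bag(tau, ell):
    # identify a time series tau with the bag of all its length-ell subsequences
    tau = np.asarray(tau, dtype=float)
    return [tau[j:j + ell] for j in range(len(tau) - ell + 1)]

def g_ts(tau, shapelets, w, ell, gamma=1.0):
    # eq. (3): a weighted sum of maximum Gaussian similarities to each shapelet
    bag = subsequence_bag(tau, ell)
    def K(z, x):
        return float(np.exp(-gamma * np.sum((np.asarray(z) - x) ** 2)))
    return sum(wz * max(K(z, x) for x in bag) for wz, z in zip(w, shapelets))
```

Note that a series of length $L$ yields a bag of $L - \ell + 1$ instances, which is exactly the bag size appearing in Corollary 3 below.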

## 2 Preliminaries

Let $\mathcal{X}$ be an instance space. A bag $B$ is a finite set of instances chosen from $\mathcal{X}$. The learner receives a sequence of labeled bags $S = ((B_1, y_1), \ldots, (B_m, y_m))$ called a sample, where each labeled bag is independently drawn according to some unknown distribution $D$ over $2^{\mathcal{X}} \times \{-1, +1\}$. Let $P_S$ denote the set of all instances that appear in the sample $S$, that is, $P_S = \bigcup_{i=1}^m B_i$. Let $K$ be a kernel over $\mathcal{X}$, which is used to measure the similarity between instances, and let $\Phi : \mathcal{X} \to \mathbb{H}$ denote a feature map associated with the kernel $K$ for a Hilbert space $\mathbb{H}$, that is, $K(x, x') = \langle \Phi(x), \Phi(x') \rangle$ for instances $x, x' \in \mathcal{X}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product over $\mathbb{H}$. The norm induced by the inner product is denoted by $\|\cdot\|_{\mathbb{H}}$, defined as $\|u\|_{\mathbb{H}} = \sqrt{\langle u, u \rangle}$ for $u \in \mathbb{H}$.

For each $u \in \mathbb{H}$, which we call a shapelet, we define a shapelet-based classifier, denoted by $h_u$, as the function that maps a given bag $B$ to the maximum of the similarity scores between the shapelet $u$ and the instances in $B$. More specifically,

$$h_u(B) = \max_{x \in B} \langle u, \Phi(x) \rangle.$$

For a set $U \subseteq \mathbb{H}$, we define the class of shapelet-based classifiers as

$$H_U = \{ h_u \mid u \in U \},$$

and let $\mathrm{conv}(H_U)$ denote the set of convex combinations of shapelet-based classifiers in $H_U$. More precisely,

$$\mathrm{conv}(H_U) = \left\{ \int_{u \in U} w_u h_u \, du \;\middle|\; w_u \text{ is a density over } U \right\} = \left\{ \sum_{u \in U'} w_u h_u \;\middle|\; \forall u \in U',\ w_u \ge 0,\ \sum_{u \in U'} w_u = 1,\ U' \subseteq U \text{ is a finite support} \right\}.$$

The goal of the learner is to find a hypothesis $g \in \mathrm{conv}(H_U)$ so that its generalization error $E_D(g)$ is small. Note that since the final hypothesis is invariant to any scaling of the shapelets, we assume without loss of generality that

$$U = \{ u \in \mathbb{H} \mid \|u\|_{\mathbb{H}} \le 1 \}.$$

Let $E_\rho(g)$ denote the empirical margin loss of $g$ over $S$ with respect to a margin parameter $\rho > 0$, that is, the fraction of labeled bags $(B_i, y_i)$ in $S$ with $y_i g(B_i) < \rho$.

## 3 Optimization Problem Formulation

In this paper, we formulate the problem as soft margin maximization with 1-norm regularization, which ensures a generalization bound for the final hypothesis (see, e.g., Demiriz et al., 2002). Specifically, the problem is formulated as a linear program (over infinitely many variables) as follows:

$$\begin{aligned} \max_{\rho, w, \xi} \quad & \rho - \frac{1}{\nu m} \sum_{i=1}^m \xi_i & (4)\\ \text{sub.\ to} \quad & \int_{u \in U} y_i w_u h_u(B_i)\, du \ge \rho - \xi_i \ \wedge\ \xi_i \ge 0, \quad i \in [m],\\ & \int_{u \in U} w_u\, du = 1,\quad w_u \ge 0,\quad \rho \in \mathbb{R}, \end{aligned}$$

where $\nu \in (0, 1]$ is a parameter. To avoid the integral over the Hilbert space, it is convenient to consider the dual form:

$$\begin{aligned} \min_{\gamma, d} \quad & \gamma & (5)\\ \text{sub.\ to} \quad & \sum_{i=1}^m y_i d_i h_u(B_i) \le \gamma, \quad u \in U,\\ & 0 \le d_i \le 1/(\nu m), \quad i \in [m],\\ & \sum_{i=1}^m d_i = 1,\quad \gamma \in \mathbb{R}. \end{aligned}$$

The dual problem is categorized as a semi-infinite program (SIP) because it contains infinitely many constraints. Note that the duality gap is zero because problem (5) is linear and the optimum is finite (Theorem 2.2 of Shapiro, 2009). We employ column generation to solve the dual problem: solve (5) for a finite subset $U' \subseteq U$, find $u \in U$ whose corresponding constraint is maximally violated by the current solution (the column generation part), and repeat the procedure with $U' \cup \{u\}$ until a certain stopping criterion is met. In particular, we use LPBoost (Demiriz et al., 2002), a well-known and practically fast column generation algorithm. Since the solution is expected to be sparse due to the 1-norm regularization, the number of iterations is expected to be small.
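The restricted master problem solved at each column generation step is an ordinary LP. The following sketch (not the paper's implementation, which uses CPLEX) solves it with `scipy.optimize.linprog`, assuming a matrix `H` whose row $t$ holds the scores $h_{u_t}(B_i)$ of the shapelets generated so far:

```python
import numpy as np
from scipy.optimize import linprog

def restricted_dual(H, y, nu):
    """Solve the restricted master problem of (5) for the columns generated so far.
    H[t, i] = h_{u_t}(B_i) for each generated shapelet u_t; the LP variables are
    (d_1, ..., d_m, gamma), minimizing gamma."""
    T, m = H.shape
    c = np.zeros(m + 1)
    c[-1] = 1.0                                    # objective: min gamma
    A_ub = np.hstack([H * y, -np.ones((T, 1))])    # sum_i y_i d_i h_t(B_i) <= gamma
    b_ub = np.zeros(T)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # sum_i d_i = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0 / (nu * m))] * m + [(None, None)]  # capped distribution d
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                    # (d, gamma)
```

The returned weights $d$ are then handed to the weak learner, which searches for a new shapelet whose constraint is maximally violated; the loop stops when no constraint is violated beyond a tolerance.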

Following the terminology of boosting, we refer to the column generation part as weak learning. In our case, weak learning is formulated as the following optimization problem:

$$\max_{u \in \mathbb{H}} \sum_{i=1}^m y_i d_i \max_{x \in B_i} \langle u, \Phi(x) \rangle \quad \text{sub.\ to} \quad \|u\|_{\mathbb{H}}^2 \le 1. \qquad (6)$$

Thus, we need to design a weak learner that solves (6) for a given sample weighted by $d$. It seems impossible to solve it directly because we have access to the feature map $\Phi$ only through the associated kernel. Fortunately, we prove a version of the representer theorem, given below, which makes (6) tractable.

###### Theorem 1 (Representer Theorem)

The solution of (6) can be written as $u = \sum_{z \in P_S} \alpha_z \Phi(z)$ for some real numbers $\alpha_z$.

Our theorem can be derived from an application of the standard representer theorem (see, e.g., Mohri et al., 2012). Intuitively, we prove it by decomposing the optimization problem (6) into a number of sub-problems, so that the standard representer theorem can be applied to each of them. The details of the proof are given in the supplementary materials. Note that Theorem 1 justifies a simple heuristic common in the literature: choosing the shapelets from the instances in $P_S$.

Theorem 1 says that the weak learning problem can be rewritten as the following tractable form:

OP 1: Weak Learning Problem

$$\begin{aligned} \min_{\alpha} \quad & -\sum_{i=1}^m d_i y_i \max_{x \in B_i} \sum_{z \in P_S} \alpha_z K(z, x)\\ \text{sub.\ to} \quad & \sum_{z \in P_S} \sum_{v \in P_S} \alpha_z \alpha_v K(z, v) \le 1. \end{aligned}$$

Unlike the primal solution $w$, the dual solution $\alpha$ is not expected to be sparse. In order to obtain a more interpretable hypothesis, we propose another formulation of weak learning in which 1-norm regularization is imposed on $\alpha$, so that a sparse solution will be obtained. In other words, instead of $\{\alpha \mid \sum_{z \in P_S} \sum_{v \in P_S} \alpha_z \alpha_v K(z, v) \le 1\}$, we consider the feasible set $\{\alpha \mid \|\alpha\|_1 \le 1\}$, where $\|\alpha\|_1$ is the 1-norm of $\alpha$.

OP 2: Sparse Weak Learning Problem

$$\begin{aligned} \min_{\alpha} \quad & -\sum_{i=1}^m d_i y_i \max_{x \in B_i} \sum_{z \in P_S} \alpha_z K(z, x)\\ \text{sub.\ to} \quad & \|\alpha\|_1 \le 1. \end{aligned}$$

Note that when running LPBoost with a weak learner for OP 2, we obtain a final hypothesis with the same form of generalization bound as the one stated in Theorem 2, which applies to the final hypothesis obtained with a weak learner for OP 1. To see this, consider a scaled feasible set $\{C\alpha \mid \|\alpha\|_1 \le 1\}$ for a sufficiently small $C > 0$, so that the corresponding shapelets $\sum_{z \in P_S} C\alpha_z \Phi(z)$ lie in $U$. Then, since this set is contained in $U$, a generalization bound for $U$ also applies to it. On the other hand, since the final hypothesis is invariant to the scaling factor $C$, the generalization ability is independent of $C$.

## 4 Algorithms

For completeness, we present the pseudo code of LPBoost in Algorithm 1.

For the rest of this section, we describe our algorithms for the weak learners. For simplicity, we denote by $k_x$ the vector given by $(k_x)_z = K(z, x)$ for every $z \in P_S$. Then, the objective function of OP 1 (and OP 2) can be rewritten as

$$\sum_{i: y_i = -1} d_i \max_{x \in B_i} k_x^\top \alpha - \sum_{i: y_i = 1} d_i \max_{x \in B_i} k_x^\top \alpha,$$

which can be seen as a difference of two convex functions of $\alpha$. Therefore, the weak learning problems are DC programs, and thus we can use a DC algorithm (Tao and Souad, 1988; Yu and Joachims, 2009) to find an $\epsilon$-approximation of a local optimum. We employ a standard DC algorithm: at each iteration $t$, we linearize the concave term (the positive-bag sum, with its sign) at the current solution $\alpha_t$, and then update the solution to $\alpha_{t+1}$ by solving the resultant convex optimization problem.
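The DC structure of the objective above can be made concrete in a few lines. This sketch (names `weak_objective` and `bag_kernels` are ours) evaluates the objective as $f(\alpha) - g(\alpha)$ with both parts convex, which is exactly the decomposition the DC algorithm linearizes:

```python
import numpy as np

def weak_objective(alpha, d, y, bag_kernels):
    """Objective of OP 1 / OP 2 written as a difference of convex functions.
    bag_kernels[i] is a matrix whose rows are the vectors k_x = (K(z, x))_{z in P_S}
    for the instances x in bag B_i; each per-bag term max_x k_x^T alpha is convex."""
    f = sum(d[i] * np.max(Ki @ alpha)   # convex part: negative bags
            for i, Ki in enumerate(bag_kernels) if y[i] == -1)
    g = sum(d[i] * np.max(Ki @ alpha)   # subtracted convex part: positive bags
            for i, Ki in enumerate(bag_kernels) if y[i] == 1)
    return f - g
```

In the DC iteration, the subtracted part $g$ is replaced by its linearization at the current $\alpha_t$ (choosing, for each positive bag, the instance attaining the max), yielding the convex subproblem (7).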

In addition, the convex subproblems for OP 1 and OP 2 can be reformulated as a second-order cone program (SOCP) and an LP, respectively, and thus both can be solved efficiently. To this end, we introduce new variables $\lambda_i$ for all negative bags ($i$ with $y_i = -1$) which represent the factors $\max_{x \in B_i} k_x^\top \alpha$. Then we obtain the following problem, equivalent to the subproblem for OP 1:

$$\begin{aligned} \min_{\alpha, \lambda} \quad & \sum_{i: y_i = -1} d_i \lambda_i - \sum_{i: y_i = 1} d_i k_{x_i}^\top \alpha & (7)\\ \text{sub.\ to} \quad & k_x^\top \alpha \le \lambda_i \quad (\forall i: y_i = -1,\ \forall x \in B_i),\\ & \sum_{z \in P_S} \sum_{v \in P_S} \alpha_z \alpha_v K(z, v) \le 1. \end{aligned}$$

It is well known that this is an SOCP problem. Moreover, it is clear that the corresponding subproblem for OP 2 can be formulated as an LP. We describe the algorithm for OP 2 in Algorithm 2.

## 5 Generalization Bound of the Hypothesis Class

In this section, we provide a generalization bound of the hypothesis classes $\mathrm{conv}(H_U)$ for various feature maps $\Phi$ and sets $U$.

By viewing each instance $z \in P_S$ as a hyperplane $\{u \in \mathbb{H} \mid \langle u, \Phi(z) \rangle = 0\}$, we can naturally define a partition of the Hilbert space by the set of all such hyperplanes. Let $\mathcal{I}$ be the set of all cells of the partition. Each cell $I \in \mathcal{I}$ is a polyhedron defined by a minimal set $V_I$ of vectors. Let

$$\mu^* = \min_{I \in \mathcal{I}} \max_{u \in I \cap U} \min_{v \in V_I} |\langle u, v \rangle|.$$

Let $d^*_{\Phi,S}$ be the VC dimension of the set of linear classifiers over the finite set $\{\Phi(z) \mid z \in P_S\}$.

Then we have the following generalization bound on the hypothesis class of (2).

###### Theorem 2

Let $\rho > 0$. Suppose that for any $z \in P_S$, $\|\Phi(z)\|_{\mathbb{H}} \le R$. Then, for any $\delta > 0$, with high probability the following holds for any $g \in \mathrm{conv}(H_U)$:

$$E_D(g) \le E_\rho(g) + O\!\left( \frac{R \sqrt{d^*_{\Phi,S}} \log |P_S|}{\rho \sqrt{m}} \right), \qquad (9)$$

where $d^*_{\Phi,S}$ can be bounded (i) for any $\Phi$ and $U$; (ii) when $\Phi$ is the identity mapping (i.e., the associated kernel is the linear kernel); and (iii) when $\Phi$ satisfies the condition that $\langle \Phi(z), \Phi(x) \rangle$ is monotone decreasing with respect to the distance between $z$ and $x$ (e.g., the mapping defined by the Gaussian kernel), in which case the bound depends on $\mu^*$.

Due to space constraints, we omit the proof; it is given in the supplementary materials.

#### Comparison with the existing bounds

A similar generalization bound can be derived from a known bound on the Rademacher complexity of $H_U$ (Theorem 20 of Sabato and Tishby, 2012) and a generalization bound in terms of the Rademacher complexity of an arbitrary hypothesis class (see Corollary 6.1 of Mohri et al., 2012):

$$E_D(g) \le E_\rho(g) + O\!\left( \frac{\log\left(\sum_{i=1}^m |B_i|\right) \log(m)}{\rho \sqrt{m}} \right).$$

Here, for simplicity, we omit some constants of the bound (Theorem 20 of Sabato and Tishby, 2012). Note that $|P_S| \le \sum_{i=1}^m |B_i|$ by definition. The bound above is incomparable to Theorem 2 in general, as ours involves the parameter $d^*_{\Phi,S}$ while the other has an extra logarithmic term. However, our bound is better in terms of the sample size $m$ by a logarithmic factor when the other parameters are regarded as constants.

## 6 SL by MIL

### 6.1 Time-Series Classification with Shapelets

In the following, we introduce a framework for the time-series classification problem based on shapelets (i.e., the SL problem). As mentioned in the Introduction, a time series $\tau$ can be identified with a bag $B_\tau$ consisting of all subsequences of $\tau$ of length $\ell$. The learner receives a labeled sample of such bags (i.e., labeled time series), each independently drawn according to some unknown distribution with finite support. The goal of the learner is to predict the labels of unseen time series correctly. In this way, the SL problem can be viewed as an MIL problem, and thus we can apply our algorithms and theory.

Note that, for time-series classification, various similarity measures can be represented by a kernel: for example, the Gaussian kernel (which behaves like the Euclidean distance) and the Dynamic Time Warping (DTW) kernel. Moreover, our framework generally applies to non-real-valued sequence data (e.g., text and discrete signals) via a string kernel.

### 6.2 Our Theory and Algorithms for SL

By Theorem 2, we can immediately obtain the generalization bound of our hypothesis class in SL as follows:

###### Corollary 3

Consider a time-series sample of size $m$ in which each time series has length $L$. For any fixed $\ell$, the following generalization error bound holds for all $g \in \mathrm{conv}(H_U)$ in which the length of each shapelet is $\ell$:

$$E_D(g) \le E_\rho(g) + O\!\left( \frac{R \sqrt{d^*_{\Phi,S}} \log\big(m(L - \ell + 1)\big)}{\rho \sqrt{m}} \right).$$

To the best of our knowledge, this is the first result on the generalization performance of SL. Note that the bound also provides a theoretical justification for some existing shapelet-based methods: many of them find effective shapelets among all of the subsequences in the training sample, and the linear convex combinations of hypotheses using such shapelets form a subset of the hypothesis class we provide.

For the time-series classification problem, shapelet-based classification has a great advantage in interpretability and visibility over other time-series classification methods (see, e.g., Ye and Keogh, 2009). Although we use a nonlinear kernel function, we can still observe the important subsequences that contribute to a shapelet by solving OP 2, thanks to the sparsity (see also the experimental results). Moreover, for unseen time-series data, we can observe which subsequences contribute to the predicted class by observing the maximizing subsequences.

## 7 Experiments

In the following experiments, we demonstrate that our methods are practically effective for time-series data and multiple-instance data. Note that we use some heuristics to improve the efficiency of our algorithm in practice (see details in the supplementary materials). We use $k$-means clustering in these heuristics, and thus we show the average accuracies and standard deviations for our results to account for the randomness of $k$-means.

### 7.1 Results for Time-Series Data

We used several binary labeled datasets from the UCR archive (Chen et al., 2015), which are often used as benchmarks for time-series classification methods (our method is applicable to multi-class classification tasks via standard extensions; see, e.g., Platt et al., 2000). The detailed information on the datasets is described on the left-hand side of Table 1. We used the weak learning problem OP 2 because interpretability of the obtained classifier is required in shapelet-based time-series classification. We set the hyper-parameters as follows: the length $\ell$ of the subsequences ($\ell$ corresponds to the dimension of instances in MIL) was searched over fractions of the length $L$ of each time series in the dataset. We used the Gaussian kernel. We found good values of $\ell$, $\nu$, and the kernel parameter through a grid search via five runs of cross-validation. As an LP solver for WeakLearn and LPBoost we used the CPLEX software.

#### Accuracy and efficiency

The classification accuracy results are shown on the right-hand side of Table 1. As a baseline, we used the accuracy of the Shapelet Transform (ST) method (Hills et al., 2014) as reported by Bagnall et al. (2017), who fairly compared many kinds of time-series classification methods and found that ST achieved higher accuracy than the other shapelet-based methods. Our method performed better than ST on five datasets, but worse on the other six. Our conjecture is that one reason for some of the worse results is that ST considers all possible lengths of subsequences as shapelets without limiting the computational cost. The main scheme in ST is the search for effective shapelets, and its time complexity depends on the number of candidate subsequences (see also the real computation times in Hills et al., 2014). We cannot directly compare the time complexity of our method with that of ST, because the complexity of our method mainly depends on the LP solver (empirically, boosting converged within several tens of iterations). Thus, we present the computation time for a single training run with the best parameters in the rightmost column of Table 1. The experiments were carried out on a machine with an Intel Core i7 CPU at 4 GHz and 32 GB of memory. The results demonstrate that our method runs efficiently in practice. As a result, we can say that our method performed favorably with ST even though we limited the length of shapelets in the experiment.

#### Interpretability of our method

In order to show the interpretability of our method, we introduce two types of visualization of our result.

One is the visualization of the characteristic subsequences of an input time series. When we predict the label of a time series $\tau$, we calculate the maximizer in $B_\tau$ for each shapelet. In image recognition tasks, such maximizers are commonly used to observe the sub-images that characterize the class of the input image (e.g., Chen et al., 2006). In time-series classification, the maximizers can likewise be used to observe characteristic subsequences. Figure 1(a) is an example of the visualization of maximizers. Each value in the legend indicates the contribution of the corresponding subsequence: subsequences with positive values contribute to the positive class and subsequences with negative values contribute to the negative class. Such a visualization reveals the subsequences that characterize the class of the input time series.
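The maximizer computation described above can be sketched as follows: for each shapelet, find the start index of the subsequence of the input series attaining the maximum similarity. The function name `maximizer_indices` is ours, and the Gaussian kernel stands in for the learned similarity:

```python
import numpy as np

def maximizer_indices(tau, shapelets, ell, gamma=1.0):
    # start index of the subsequence of tau most similar to each shapelet,
    # under the Gaussian kernel; these are the subsequences one would highlight
    tau = np.asarray(tau, dtype=float)
    subs = [tau[j:j + ell] for j in range(len(tau) - ell + 1)]
    out = []
    for z in shapelets:
        sims = [np.exp(-gamma * np.sum((np.asarray(z) - x) ** 2)) for x in subs]
        out.append(int(np.argmax(sims)))
    return out
```

Plotting each shapelet at its maximizer's position (weighted by its sign and coefficient) yields figures of the kind described here.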

The other is the visualization of a final hypothesis $g$, where the shapelets are expanded over the set of representative subsequences (see details in the supplementary materials). Figure 1(b) is an example of the visualization of a final hypothesis obtained by our method. The colored lines are the subsequences for which both the combination weight and the coefficient $\alpha_z$ were non-zero. Each value in the legend shows the product of the combination weight and the $\alpha_z$ corresponding to that subsequence: positive values on the colored lines indicate the contribution rate for the positive class, and negative values indicate the contribution rate for the negative class. Note that, because it is difficult to visualize shapelets over the Hilbert space associated with the Gaussian kernel, we plotted each of them so as to match the original time series based on the Euclidean distance. Unlike visualization analyses using existing shapelet-based methods (see, e.g., Ye and Keogh, 2009), our visualization (the colored lines and plotted positions) does not strictly represent the meaning of the final hypothesis because of the non-linear feature map. However, we can say that the colored lines represent “important patterns” that certainly make important contributions to classification.

### 7.2 Results for Multiple-Instance Data

As baselines, we selected the MIL algorithms mi-SVM and MI-SVM (Andrews et al., 2003). Both algorithms are now classical but still perform favorably compared with state-of-the-art methods on standard multiple-instance data (see, e.g., Doran, 2015). Moreover, generalization bounds for these algorithms are shown by Sabato and Tishby (2012), because the algorithms obtain a (single) shapelet-based classifier. Hence, the following comparative experiments pit a single shapelet with theoretical generalization ability against infinitely many shapelets with theoretical generalization ability. We combined linear, polynomial, and Gaussian kernels with mi-SVM and MI-SVM, respectively, choosing the regularization parameter, the degree of the polynomial kernel, and the parameter of the Gaussian kernel by grid search. For our method, we used only the Gaussian kernel and chose $\nu$ and the kernel parameter by grid search. Although we ran both the non-sparse and sparse weak learners, interestingly, the sparse version beat the non-sparse version on all datasets; thus, we show only the results of the sparse version because of space limitations. For all these algorithms, we estimated the optimal parameter set via 5-fold cross-validation. We used well-known multiple-instance datasets, shown on the left-hand side of Table 2. The accuracies are the result of 10 runs of 10-fold cross-validation.

The results are shown on the right-hand side of Table 2. Because of space limitations, for the baselines we show only the results of the kernel that achieved the best accuracy. Although the accuracy of our method on the fox dataset was slightly worse, our method significantly outperformed the baselines on the other four datasets.

## References

• Andrews and Hofmann (2004) Stuart Andrews and Thomas Hofmann. Multiple instance learning via disjunctive programming boosting. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 65–72. MIT Press, 2004.
• Andrews et al. (2003) Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 577–584. MIT Press, 2003.
• Auer and Ortner (2004) Peter Auer and Ronald Ortner. A boosting approach to multiple instance learning. In Jean-François Boulicaut, Floriana Esposito, Fosca Giannotti, and Dino Pedreschi, editors, Machine Learning: ECML 2004, pages 63–74, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.
• Bagnall et al. (2017) Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, and Eamonn Keogh. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3):606–660, May 2017. ISSN 1573-756X.
• Bartlett and Mendelson (2003) Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003.
• Carbonneau et al. (2018) Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329 – 353, 2018. ISSN 0031-3203.
• Chen et al. (2015) Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The ucr time series classification archive, July 2015.
• Chen et al. (2006) Yixin Chen, Jinbo Bi, and J. Z. Wang. Miles: Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1931–1947, Dec 2006. ISSN 0162-8828.
• Demiriz et al. (2002) A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3):225–254, 2002.
• Dietterich et al. (1997) Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, January 1997. ISSN 0004-3702.
• Doran (2015) Gary Doran. Multiple Instance Learning from Distributions. PhD thesis, Case Western Reserve University, 2015.
• Doran and Ray (2014) Gary Doran and Soumya Ray. A theoretical and empirical analysis of support vector machine methods for multiple-instance classification. Machine Learning, 97(1-2):79–102, October 2014. ISSN 0885-6125.
• Gärtner et al. (2002) Thomas Gärtner, Peter A. Flach, Adam Kowalczyk, and Alex J. Smola. Multi-instance kernels. In Proceedings 19th International Conference. on Machine Learning, pages 179–186. Morgan Kaufmann, 2002.
• Grabocka et al. (2014) Josif Grabocka, Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. Learning time-series shapelets. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 392–401, 2014.
• Grabocka et al. (2015) Josif Grabocka, Martin Wistuba, and Lars Schmidt-Thieme. Scalable discovery of time-series shapelets. CoRR, abs/1503.03238, 2015.
• Hills et al. (2014) Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery, 28(4):851–881, July 2014. ISSN 1384-5810.
• Hou et al. (2016) Lu Hou, James T. Kwok, and Jacek M. Zurada. Efficient learning of timeseries shapelets. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence,, pages 1209–1215, 2016.
• Keogh and Rakthanmanon (2013) Eamonn J. Keogh and Thanawin Rakthanmanon. Fast shapelets: A scalable algorithm for discovering time series shapelets. In Proceedings of the 13th SIAM International Conference on Data Mining, pages 668–676, 2013.
• Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
• Platt et al. (2000) John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin dags for multiclass classification. In S. A. Solla, T. K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12, pages 547–553. MIT Press, 2000.
• Renard et al. (2015) Xavier Renard, Maria Rifqi, Walid Erray, and Marcin Detyniecki. Random-shapelet: an algorithm for fast shapelet discovery. In 2015 IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA'2015), pages 1–10. IEEE, 2015.
• Sabato and Tishby (2012) Sivan Sabato and Naftali Tishby. Multi-instance learning with any hypothesis class. Journal of Machine Learning Research, 13(1):2999–3039, October 2012. ISSN 1532-4435.
• Schölkopf and Smola (2002) B. Schölkopf and AJ. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, USA, December 2002.
• Shapiro (2009) Alexander Shapiro. Semi-infinite programming, duality, discretization and optimality conditions. Optimization, 58(2):133–161, 2009.
• Suehiro et al. (2017) Daiki Suehiro, Kohei Hatano, Eiji Takimoto, Shuji Yamamoto, Kenichi Bannai, and Akiko Takeda. Boosting the kernelized shapelets: Theory and algorithms for local features. CoRR, abs/1709.01300, 2017.
• Tao and Souad (1988) Pham Dinh Tao and El Bernoussi Souad. Duality in D.C. (Difference of Convex functions) Optimization. Subgradient Methods, pages 277–293. Birkhäuser Basel, Basel, 1988.
• Ye and Keogh (2009) Lexiang Ye and Eamonn Keogh. Time series shapelets: A new primitive for data mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 947–956. ACM, 2009.
• Yu and Joachims (2009) Chun-Nam John Yu and Thorsten Joachims. Learning structural svms with latent variables. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1169–1176, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1.
• Zhang et al. (2006) Cha Zhang, John C. Platt, and Paul A. Viola. Multiple instance boosting for object detection. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1417–1424. MIT Press, 2006.
• Zhang et al. (2013) Dan Zhang, Jingrui He, Luo Si, and Richard Lawrence. Mileage: Multiple instance learning with global embedding. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 82–90, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

## Appendix A Supplementary materials

### a.1 Proof of Theorem 1

###### Definition 1

[The set of mappings from a bag to an instance]
Given a sample $S = ((B_1, y_1), \dots, (B_m, y_m))$. For any $u \in U$, let $\theta_{u,\Phi}$ be a mapping defined by

$$\theta_{u,\Phi}(B_i) := \mathop{\mathrm{argmax}}_{x \in B_i} \langle u, \Phi(x) \rangle,$$

and we define the set of all $\theta_{u,\Phi}$ for $u \in U$ as $\Theta_{U,\Phi}$. For the sake of brevity, $\theta_{u,\Phi}$ and $\Theta_{U,\Phi}$ will be abbreviated as $\theta_u$ and $\Theta$, respectively.
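In words, $\theta_{u,\Phi}$ simply picks from each bag the instance most similar to $u$ in feature space. A minimal sketch of this selection rule, assuming for illustration a linear feature map $\Phi(x) = x$ and toy bags (all names here are illustrative, not from the paper's implementation):

```python
import numpy as np

def theta(u, bag):
    # theta_{u,Phi}(B_i): the instance in the bag maximizing <u, Phi(x)>.
    # Here Phi is taken to be the identity map for simplicity.
    scores = [np.dot(u, x) for x in bag]
    return bag[int(np.argmax(scores))]

u = np.array([1.0, 0.0])
bag = [np.array([0.2, 0.9]), np.array([0.8, 0.1])]
print(theta(u, bag))  # selects the instance [0.8, 0.1]
```

For a kernelized $\Phi$, the inner product above would be replaced by kernel evaluations, but the argmax structure is the same.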

We can rewrite the optimization problem (6) by using $\Theta$ as follows:

$$\begin{aligned}
\max_{\theta \in \Theta}\ \max_{u \in \mathbb{H}:\, \theta_u = \theta}\quad & \sum_{i=1}^m y_i d_i \langle u, \Phi(\theta(B_i)) \rangle \qquad (10) \\
\text{sub.to}\quad & \|u\|_{\mathbb{H}}^2 \le 1.
\end{aligned}$$

Thus, if we fix $\theta \in \Theta$, we obtain a sub-problem. Since the constraint $\theta_u = \theta$ can be written as a finite number of linear constraints, each sub-problem is equivalent to a convex optimization. Indeed, each sub-problem can be written as the equivalent unconstrained minimization (by neglecting constants in the objective)

$$\min_{u \in \mathbb{H}}\ -\sum_{i=1}^m y_i d_i \langle u, \Phi(\theta(B_i)) \rangle + \alpha \|u\|_{\mathbb{H}}^2 + \sum_{i \in [m]} \sum_{x \in B_i} \beta_{i,x} \bigl( \langle u, \Phi(x) \rangle - \langle u, \Phi(\theta(B_i)) \rangle \bigr),$$

where $\alpha$ and $\beta_{i,x}$ are the corresponding positive constants. Now for each sub-problem, we can apply the standard Representer Theorem argument (see, e.g., Mohri et al. (2012)). Let $V$ be the subspace $V = \mathrm{span}\{\Phi(x) : x \in B_i,\ i \in [m]\}$. We denote by $u_V$ the orthogonal projection of $u$ onto $V$; any $u \in \mathbb{H}$ has the decomposition $u = u_V + u_\perp$. Since $u_\perp$ is orthogonal to each $\Phi(x)$, $\langle u, \Phi(x) \rangle = \langle u_V, \Phi(x) \rangle$. On the other hand, $\|u_V\|_{\mathbb{H}} \le \|u\|_{\mathbb{H}}$. Therefore, the optimal solution of each sub-problem has to be contained in $V$. This implies that the optimal solution, which is the maximum over all solutions of the sub-problems, is contained in $V$ as well.
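The projection step of this argument is easy to verify numerically: projecting $u$ onto the span of the feature vectors leaves every inner product $\langle u, \Phi(x)\rangle$ unchanged while never increasing the norm. A small finite-dimensional check (toy data; the projector construction via the pseudoinverse is our choice of illustration, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 6))   # rows: feature vectors Phi(x) of 4 instances in R^6
u = rng.normal(size=6)

# Orthogonal projector onto V = span of the rows of F.
# For A = F^T, the projector onto col(A) is A A^+ (A^+ = Moore-Penrose pseudoinverse).
P = F.T @ np.linalg.pinv(F.T)
u_V = P @ u

# Inner products with every Phi(x) are unchanged under projection ...
assert np.allclose(F @ u, F @ u_V)
# ... while the norm does not increase, so restricting to V loses nothing.
assert np.linalg.norm(u_V) <= np.linalg.norm(u) + 1e-12
```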

### a.2 Proof of Theorem 2

We use $\theta_u$ and $\Theta$ of Definition 1.

###### Definition 2

[The Rademacher and the Gaussian complexity (Bartlett and Mendelson, 2003)]
Given a sample $S = (x_1, \dots, x_m)$, the empirical Rademacher complexity of a class $\mathcal{H}$ w.r.t. $S$ is defined as

$$\mathcal{R}_S(\mathcal{H}) := \frac{1}{m} \mathbb{E}_\sigma \Bigl[ \sup_{h \in \mathcal{H}} \sum_{i=1}^m \sigma_i h(x_i) \Bigr],$$

where $\sigma = (\sigma_1, \dots, \sigma_m)$ and each $\sigma_i$ is an independent uniform random variable in $\{-1, +1\}$. The empirical Gaussian complexity $\mathcal{G}_S(\mathcal{H})$ of $\mathcal{H}$ w.r.t. $S$ is defined similarly, but each $\sigma_i$ is drawn independently from the standard normal distribution.
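Both quantities can be estimated by Monte Carlo when the hypothesis class is finite and the values $h(x_i)$ are precomputed. A sketch under those simplifying assumptions (function and variable names are illustrative):

```python
import numpy as np

def empirical_complexity(H_values, n_trials=4000, gaussian=False, seed=0):
    """Monte Carlo estimate of (1/m) E_sigma[ sup_h sum_i sigma_i h(x_i) ].

    H_values: array of shape (n_hypotheses, m), entry [k, i] = h_k(x_i).
    gaussian=False -> Rademacher (sigma_i uniform in {-1, +1});
    gaussian=True  -> Gaussian   (sigma_i ~ N(0, 1)).
    """
    rng = np.random.default_rng(seed)
    _, m = H_values.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.standard_normal(m) if gaussian else rng.choice([-1.0, 1.0], size=m)
        total += np.max(H_values @ sigma)  # sup over the (finite) class
    return total / (n_trials * m)

# Two hypotheses evaluated on m = 3 points.
H = np.array([[1.0, 1.0, -1.0], [-1.0, 1.0, 1.0]])
print(empirical_complexity(H), empirical_complexity(H, gaussian=True))
```

On this toy class the estimates also illustrate the relation of Lemma 1 below: the Rademacher estimate stays below $\sqrt{\pi/2}$ times the Gaussian one.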

The following bounds are well-known.

###### Lemma 1

[Lemma 4 of Bartlett and Mendelson (2003)] $\mathcal{R}_S(\mathcal{H}) \le \sqrt{\pi/2}\, \mathcal{G}_S(\mathcal{H})$.

###### Lemma 2

[Corollary 6.1 of Mohri et al. (2012)] For fixed $\rho > 0$ and $\delta \in (0,1)$, the following bound holds with probability at least $1 - \delta$: for all $f \in \mathcal{H}$,

$$\mathbb{E}_D(f) \le \mathbb{E}_\rho(f) + \frac{2}{\rho} \mathcal{R}_S(\mathcal{H}) + 3 \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

Deriving a generalization bound based on the Rademacher or the Gaussian complexity is quite standard in the statistical learning theory literature and is applicable to our classes of interest as well. However, the standard analysis provides only sub-optimal bounds.

###### Lemma 3

Suppose that for any $x$, $\|\Phi(x)\|_{\mathbb{H}} \le R$. Then, the empirical Gaussian complexity of $\mathcal{H}_U$ with respect to $S$ is bounded as follows:

$$\mathcal{G}_S(\mathcal{H}_U) \le \frac{R \sqrt{(\sqrt{2} - 1) + 2 \ln |\Theta|}}{\sqrt{m}}.$$

Since $U$ can be partitioned into $\bigcup_{\theta \in \Theta} \{ u \in U : \theta_u = \theta \}$,

$$\begin{aligned}
\mathcal{G}_S(\mathcal{H}_U) &= \frac{1}{m} \mathbb{E}_\sigma \Bigl[ \sup_{\theta \in \Theta}\ \sup_{u \in U : \theta_u = \theta} \sum_{i=1}^m \sigma_i \langle u, \Phi(\theta(B_i)) \rangle \Bigr] \\
&\le \frac{1}{m} \mathbb{E}_\sigma \Bigl[ \sup_{\theta \in \Theta}\ \sup_{u \in U} \Bigl\langle u, \sum_{i=1}^m \sigma_i \Phi(\theta(B_i)) \Bigr\rangle \Bigr] \\
&\le \frac{1}{m} \mathbb{E}_\sigma \Bigl[ \sup_{\theta \in \Theta} \Bigl\| \sum_{i=1}^m \sigma_i \Phi(\theta(B_i)) \Bigr\|_{\mathbb{H}} \Bigr] \\
&= \frac{1}{m} \mathbb{E}_\sigma \Bigl[ \sup_{\theta \in \Theta} \sqrt{ \Bigl\| \sum_{i=1}^m \sigma_i \Phi(\theta(B_i)) \Bigr\|_{\mathbb{H}}^2 } \Bigr] \\
&= \frac{1}{m} \mathbb{E}_\sigma \Bigl[ \sqrt{ \sup_{\theta \in \Theta} \Bigl\| \sum_{i=1}^m \sigma_i \Phi(\theta(B_i)) \Bigr\|_{\mathbb{H}}^2 } \Bigr] \\
&\le \frac{1}{m} \sqrt{ \mathbb{E}_\sigma \Bigl[ \sup_{\theta \in \Theta} \Bigl\| \sum_{i=1}^m \sigma_i \Phi(\theta(B_i)) \Bigr\|_{\mathbb{H}}^2 \Bigr] }. \qquad (11)
\end{aligned}$$

The first inequality is derived from the relaxation of the constraint $\theta_u = \theta$, the second inequality is due to the Cauchy–Schwarz inequality and the fact that $\|u\|_{\mathbb{H}} \le 1$, and the last inequality is due to Jensen's inequality. We denote by $K^{(\theta)}$ the kernel matrix such that $K^{(\theta)}_{ij} = \langle \Phi(\theta(B_i)), \Phi(\theta(B_j)) \rangle$. Then, we have

$$\mathbb{E}_\sigma \Bigl[ \sup_{\theta \in \Theta} \Bigl\| \sum_{i=1}^m \sigma_i \Phi(\theta(B_i)) \Bigr\|_{\mathbb{H}}^2 \Bigr] = \mathbb{E}_\sigma \Bigl[ \sup_{\theta \in \Theta} \sum_{i,j=1}^m \sigma_i \sigma_j K^{(\theta)}_{ij} \Bigr]. \qquad (12)$$

We now derive an upper bound of the r.h.s. as follows. For any $c > 0$,

$$\begin{aligned}
\exp \Bigl( c\, \mathbb{E}_\sigma \Bigl[ \sup_{\theta \in \Theta} \sum_{i,j=1}^m \sigma_i \sigma_j K^{(\theta)}_{ij} \Bigr] \Bigr) &\le \mathbb{E}_\sigma \Bigl[ \exp \Bigl( c \sup_{\theta \in \Theta} \sum_{i,j=1}^m \sigma_i \sigma_j K^{(\theta)}_{ij} \Bigr) \Bigr] \\
&= \mathbb{E}_\sigma \Bigl[ \sup_{\theta \in \Theta} \exp \Bigl( c \sum_{i,j=1}^m \sigma_i \sigma_j K^{(\theta)}_{ij} \Bigr) \Bigr] \\
&\le \sum_{\theta \in \Theta} \mathbb{E}_\sigma \Bigl[ \exp \Bigl( c \sum_{i,j=1}^m \sigma_i \sigma_j K^{(\theta)}_{ij} \Bigr) \Bigr].
\end{aligned}$$

The first inequality is due to Jensen's inequality, and the second inequality is due to the fact that the supremum is bounded by the sum. By using the symmetry of $K^{(\theta)}$, we have the eigendecomposition $K^{(\theta)} = V \Lambda V^\top$, which lets us rewrite $\sigma^\top K^{(\theta)} \sigma = \sum_{k=1}^m \lambda^{(\theta)}_k (v_k^\top \sigma)^2$, where $\lambda^{(\theta)}_1, \dots, \lambda^{(\theta)}_m$ are the eigenvalues of $K^{(\theta)}$ and $V$ is the orthonormal matrix whose $k$-th column $v_k$ is the eigenvector that corresponds to the eigenvalue $\lambda^{(\theta)}_k$. By the reproductive property of the Gaussian distribution, $v_k^\top \sigma$ obeys the same (standard) Gaussian distribution as well. So,
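The algebraic step used here, $\sigma^\top K \sigma = \sum_k \lambda_k (v_k^\top \sigma)^2$ for a symmetric PSD matrix, is easy to confirm numerically (a sanity check on random toy data, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
K = A @ A.T                    # a symmetric PSD "kernel" matrix K^(theta)
lam, V = np.linalg.eigh(K)     # K = V diag(lam) V^T, columns of V orthonormal

sigma = rng.standard_normal(5)
quad = sigma @ K @ sigma
# sigma^T K sigma = sum_k lam_k * (v_k^T sigma)^2
assert np.isclose(quad, np.sum(lam * (V.T @ sigma) ** 2))
# The rotated vector V^T sigma has covariance V^T I V = I, i.e. it is again a
# standard Gaussian vector -- the "reproductive property" invoked in the proof.
```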

$$\begin{aligned}
\sum_{\theta \in \Theta} \mathbb{E}_\sigma \Bigl[ \exp \Bigl( c \sum_{i,j=1}^m \sigma_i \sigma_j K^{(\theta)}_{ij} \Bigr) \Bigr] &= \sum_{\theta \in \Theta} \mathbb{E}_\sigma \bigl[ \exp \bigl( c\, \sigma^\top K^{(\theta)} \sigma \bigr) \bigr] \\
&= \sum_{\theta \in \Theta} \mathbb{E}_\sigma \Bigl[ \exp \Bigl( c \sum_{k=1}^m \lambda^{(\theta)}_k (v_k^\top \sigma)^2 \Bigr) \Bigr] \\
&= \sum_{\theta \in \Theta} \prod_{k=1}^m \mathbb{E}_{\sigma_k} \bigl[ \exp \bigl( c \lambda^{(\theta)}_k \sigma_k^2 \bigr) \bigr] \qquad (\text{replacing } \sigma_k = v_k^\top \sigma) \\
&= \sum_{\theta \in \Theta} \prod_{k=1}^m \int_{-\infty}^{\infty} \exp \bigl( c \lambda^{(\theta)}_k \sigma^2 \bigr) \frac{\exp(-\sigma^2/2)}{\sqrt{2\pi}} \, d\sigma
\end{aligned}$$
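The one-dimensional integral appearing above (against the standard normal density $\exp(-\sigma^2/2)/\sqrt{2\pi}$) is the standard moment generating function of a $\chi^2_1$ random variable; whenever $2 c \lambda^{(\theta)}_k < 1$ it evaluates in closed form to

$$\int_{-\infty}^{\infty} \exp \bigl( c \lambda^{(\theta)}_k \sigma^2 \bigr) \frac{\exp(-\sigma^2/2)}{\sqrt{2\pi}} \, d\sigma = \frac{1}{\sqrt{1 - 2 c \lambda^{(\theta)}_k}},$$

a standard Gaussian computation included here for completeness of the step.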