1 Introduction
Multiple-Instance Learning (MIL) is a fundamental framework of supervised learning with a wide range of applications, such as the prediction of molecular activity and image classification. MIL has been extensively studied from both theoretical and practical aspects
(Gärtner et al., 2002; Andrews et al., 2003; Sabato and Tishby, 2012; Zhang et al., 2013; Doran and Ray, 2014; Carbonneau et al., 2018) since the notion of MIL was first proposed by Dietterich et al. (1997). A standard MIL setting is described as follows: a learner receives sets called bags, each of which contains multiple instances. In the training phase, each bag is labeled, but the instances are not labeled individually. The goal of the learner is to obtain a hypothesis that predicts the labels of unseen bags correctly (although there are settings where instance-label prediction is also considered, we focus only on bag-label prediction in this paper). One of the most common hypotheses used in practice has the following form:
(1) 
where is a feature map and is a feature vector which we call a shapelet. In many applications, is interpreted as a particular “pattern” in the feature space and the inner product as the similarity of from . Note that we use the term “shapelets” following the terminology of Shapelet Learning (SL), a framework for time-series classification, although they are often called “concepts” in the MIL literature. Intuitively, this hypothesis evaluates a given bag by the maximum similarity between the instances in the bag and the shapelet. Multiple-Instance Support Vector Machine (MI-SVM) proposed by
Andrews et al. (2003) is a widely used algorithm that employs this hypothesis class and learns . It is well known that MIL algorithms using this hypothesis class perform well empirically on various multiple-instance datasets. Moreover, a generalization error bound for the hypothesis class is given by Sabato and Tishby (2012). However, in some domains such as image recognition and document classification, the hypothesis class (1) is said not to be effective (see, e.g., Chen et al., 2006). To employ MIL in such domains more effectively, Chen et al. (2006) extend the hypothesis to a convex combination of :
(2) 
for some set of shapelets. In particular, Chen et al. consider , which is constructed from all instances in the training sample. The authors demonstrate that this hypothesis with the Gaussian kernel performs well in image recognition. The generalization bound provided by Sabato and Tishby (2012) is applicable to a hypothesis class of the form (2) for a set of infinitely many shapelets with bounded norm. Therefore, the generalization bound also holds for . However, it has never been theoretically discussed why such a fixed set built from training instances works effectively in MIL tasks.
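To make the two hypothesis forms concrete, the following is a minimal sketch, assuming the Gaussian kernel as the similarity measure; the function names, toy instances, and weights are all illustrative, not part of the original formulation.

```python
import numpy as np

def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel as the instance similarity measure."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def single_shapelet_score(bag, u, gamma=1.0):
    """Form (1): maximum similarity between shapelet u and the instances of a bag."""
    return max(gaussian_kernel(x, u, gamma) for x in bag)

def shapelet_combination_score(bag, shapelets, weights, gamma=1.0):
    """Form (2): convex combination of single-shapelet scores."""
    return sum(w * single_shapelet_score(bag, u, gamma)
               for u, w in zip(shapelets, weights))

# toy bag of two instances; shapelets taken from training instances,
# in the spirit of Chen et al. (2006)
bag = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
shapelets = [np.array([1.0, 1.0]), np.array([-1.0, 0.0])]
weights = [0.7, 0.3]
score = shapelet_combination_score(bag, shapelets, weights)
```

A bag containing an instance identical to a shapelet attains the maximum similarity 1 for that shapelet, so the combined score is bounded by the sum of the weights.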
1.1 Our Contributions
In this paper, we propose an MIL formulation with the hypothesis class (2) for sets of infinitely many shapelets.
The proposed learning framework is theoretically motivated and practically effective. We show a generalization error bound based on the Rademacher complexity (Bartlett and Mendelson, 2003) and large-margin theory. The result indicates that we can achieve a small generalization error by keeping a large margin over a large training sample.
The learning framework can be applied to various kinds of data and tasks because of our unified formulation. The existing shapelet-based methods are formulated for their specific target domains. More precisely, they are formulated using a fixed similarity measure (or distance), and their generalization ability is shown only empirically on their target domains. For example, Chen et al. (2006) and Sangnier et al. (2016) calculated the feature vectors based on the similarity between every pair of instances using the Gaussian kernel. In the time-series domain, shapelet-based methods (Ye and Keogh, 2009; Keogh and Rakthanmanon, 2013; Hills et al., 2014) usually use the Euclidean distance as a similarity measure (or distance). By contrast, our framework employs a kernel function as a similarity measure. Therefore, our learning framework can be applied uniformly whenever we can choose a kernel function suited to the target learning task, for example the Gaussian kernel (which behaves like the Euclidean distance) or the Dynamic Time Warping (DTW) kernel (Shimodaira et al., 2001). Our framework can also be applied to non-real-valued sequence data (e.g., text and discrete signals) using a string kernel. Moreover, the generalization performance is guaranteed theoretically. The experimental results demonstrate that the proposed approach works uniformly for SL and MIL tasks without introducing domain-specific parameters and heuristics, and is competitive with the state-of-the-art shapelet-based methods.
We show that the formulation is tractable. The algorithm is based on Linear Programming Boosting (LPBoost; Demiriz et al., 2002), which solves the soft margin optimization problem via a column generation approach. Although the weak learning problem in the boosting becomes an optimization problem over an infinite-dimensional space, we show that an analog of the representer theorem holds for it, which allows us to reduce it to a non-convex optimization problem (a difference-of-convex program) over a finite-dimensional space. While it is difficult to solve the subproblems exactly because of the non-convexity, it is possible to find good approximate solutions in reasonable time in many practical cases (see, e.g., Le Thi and Pham Dinh, 2018).
Remarkably, our theoretical result gives justification to the heuristic of choosing the shapelets from the training instances. Our representer theorem indicates that at each iteration of boosting, the optimal solution (i.e., shapelet) of the weak learning problem can be written as a linear combination of the feature maps of training instances, that is, . Thus, we obtain a final classifier of the following form
Note that the hypothesis class used in the standard approach (Chen et al., 2006; Sangnier et al., 2016) corresponds to the special case where . This observation would suggest that the standard approach of using is reasonable.
1.2 Comparison to Related Work for MIL
There are many MIL algorithms whose hypothesis classes differ from (1) or (2) (e.g., Auer and Ortner, 2004; Gärtner et al., 2002; Andrews and Hofmann, 2004; Zhang et al., 2006; Chen et al., 2006). These algorithms adopt bag-labeling hypotheses that are diverse and distinct from shapelet-based hypothesis classes (e.g., Zhang et al. (2006) used a Noisy-OR based hypothesis, and Gärtner et al. (2002) proposed a new kernel called a set kernel). Shapelet-based hypothesis classes have the practical advantage of being applicable to SL in the time-series domain (see the next subsection).
Sabato and Tishby (2012) proved generalization bounds of hypothesis classes for MIL, including those of (1) and (2) with infinitely large sets . The generalization bound we provide in this paper is incomparable to theirs. When some data-dependent parameter is regarded as a constant, our bound is slightly better in terms of the sample size by a factor of . They also proved the PAC-learnability of the class (1) using a boosting approach under some technical assumptions. Their boosting approach differs from ours in that they assume the labels are consistent with some hypothesis of the form (1), whereas we consider arbitrary distributions over bags and labels.
1.3 Connection between MIL and Shapelet Learning for Time Series Classification
Here we briefly mention that MIL with type (2) hypotheses is closely related to SL, a framework for time-series classification that has been extensively studied (Ye and Keogh, 2009; Keogh and Rakthanmanon, 2013; Hills et al., 2014; Grabocka et al., 2014) in parallel to MIL. SL is a notion of learning with a feature extraction method, defined by a finite set of real-valued “short” sequences called shapelets. A similarity measure is given by (not necessarily a Mercer kernel) in the following way. A time series can be identified with a bag consisting of all subsequences of of length . The feature of is a vector of a fixed dimension regardless of the length of the time series . When we employ a linear classifier on top of the features, we obtain a hypothesis of the form:
(3)
which is essentially the same form as (2), except that finding good shapelets is a part of the learning task, as well as finding a good weight vector . This approach is one of the most successful for SL (Hills et al., 2014; Grabocka et al., 2014; Grabocka et al., 2015; Renard et al., 2015; Hou et al., 2016), where a typical choice of is . However, almost all existing methods choose shapelets and heuristically, with no theoretical guarantee on how good the choice of is.
Note also that in the SL framework, each is called a shapelet, while in this paper, we assume that is a kernel and any (not necessarily for some ) in the Hilbert space is called a shapelet.
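The identification of a time series with a bag of fixed-length subsequences can be sketched as follows; `series_to_bag` is a hypothetical helper name, with `ell` playing the role of the shapelet length.

```python
import numpy as np

def series_to_bag(series, ell):
    """Identify a time series with the bag of all its length-ell subsequences."""
    s = np.asarray(series, dtype=float)
    return [s[i:i + ell] for i in range(len(s) - ell + 1)]

# a length-4 series yields 3 subsequences of length 2
bag = series_to_bag([1.0, 2.0, 3.0, 4.0], ell=2)
```

Any MIL algorithm over bags then applies unchanged to the resulting subsequence bags.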
Sangnier et al. (2016)
proposed an MIL-based anomaly detection algorithm for time-series data. They gave an algorithm based on LPBoost and a generalization error bound based on the Rademacher complexity
(Bartlett and Mendelson, 2003). Their hypothesis class is the same as that of Chen et al. (2006). However, they did not provide theoretical justification for using a finite set made from the training instances (the authors mentioned this as future work). By contrast, we consider a hypothesis class based on infinitely many shapelets, and our representer theorem guarantees that our learning problem over the infinitely large set is still tractable. As a result, our study justifies the previous heuristics of their approach. There is another line of work that treats shapelets not appearing in the training set. The Learning Time-Series Shapelets (LTS) algorithm (Grabocka et al., 2014) tries to solve a non-convex optimization problem of learning effective shapelets in an infinitely large domain. However, there is no theoretical guarantee on its generalization error. In fact, our generalization error bound applies to their hypothesis class.
For SL tasks, many researchers have focused on improving efficiency (Keogh and Rakthanmanon, 2013; Renard et al., 2015; Grabocka et al., 2015; Wistuba et al., 2015; Hou et al., 2016; Karlsson et al., 2016). However, these methods are specialized to the time-series domain, and their generalization performance has never been discussed theoretically.
Curiously, although MIL and SL share similar motivations and hypotheses, the relationship between them has not yet been pointed out. From the shapelet perspective in MIL, the hypothesis (1) is regarded as a “single-shapelet” hypothesis, and the hypothesis (2) as a “multiple-shapelet” hypothesis. In this study, we refer to a linear combination of maximum similarities based on shapelets, such as (2) and (3), as a shapelet-based classifier.
2 Preliminaries
Let be an instance space. A bag is a finite set of instances chosen from . The learner receives a sequence of labeled bags called a sample, where each labeled bag is independently drawn according to some unknown distribution over . Let denote the set of all instances that appear in the sample , that is, . Let be a kernel over , which is used to measure the similarity between instances, and let denote a feature map associated with the kernel for a Hilbert space , that is, for instances , where denotes the inner product over . The norm induced by the inner product is denoted by and defined as for .
For each , which we call a shapelet, we define a shapelet-based classifier, denoted by , as the function that maps a given bag to the maximum of the similarity scores between the shapelet and over all instances in . More specifically,
For a set , we define the class of shapelet-based classifiers as
and let denote the set of convex combinations of shapelet-based classifiers in . More precisely,
(4) 
The goal of the learner is to find a hypothesis , so that its generalization error is small. Note that since the final hypothesis is invariant to any scaling of , we assume without loss of generality that
Let denote the empirical margin loss of over , that is, .
3 Optimization Problem Formulation
In this paper, we formulate the problem as soft margin maximization with 1-norm regularization, which ensures a generalization bound for the final hypothesis (see, e.g., Demiriz et al., 2002). Specifically, the problem is formulated as a linear program (over infinitely many variables) as follows:
(5)  
sub.to  
where is a parameter. To avoid the integral over the Hilbert space, it is convenient to consider the dual form:
(6)  
sub.to  
The dual problem is categorized as a semi-infinite program because it contains infinitely many constraints. Note that the duality gap is zero because the problem (6) is linear and the optimum is finite (see Theorem 2.2 of Shapiro, 2009). We employ column generation to solve the dual problem: solve (6) for a finite subset , find a shapelet whose corresponding constraint is maximally violated by the current solution (the column generation part), and repeat the procedure with until a certain stopping criterion is met. In particular, we use LPBoost (Demiriz et al., 2002), a well-known and practically fast column generation algorithm. Since the solution is expected to be sparse due to the 1-norm regularization, the number of iterations is expected to be small.
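The column generation control flow can be sketched as follows. This shows only the loop structure: the restricted dual LP solve is abstracted behind a callback (here a trivial stand-in rather than a real LP solver such as CPLEX), candidate hypotheses are represented by precomputed prediction vectors, and all names are illustrative.

```python
import numpy as np

def lpboost_column_generation(y, predictions, solve_restricted,
                              max_iter=100, eps=1e-6):
    """Column-generation skeleton in the style of LPBoost (Demiriz et al., 2002).

    y:                labels of the training bags (+1/-1)
    predictions:      list of prediction vectors, one per candidate hypothesis
    solve_restricted: callback standing in for the restricted dual LP; returns
                      a distribution d over bags and the dual objective gamma
    """
    n = len(y)
    d = np.full(n, 1.0 / n)          # initial distribution over bags
    gamma = 0.0
    selected = []
    for _ in range(max_iter):
        # weak learning: pick the hypothesis whose constraint is most violated
        edges = [float(np.dot(d, y * h)) for h in predictions]
        j = int(np.argmax(edges))
        if edges[j] <= gamma + eps:  # no violated constraint: stop
            break
        selected.append(j)
        d, gamma = solve_restricted(selected)
    return selected

# toy demo: 4 bags, 3 candidate "shapelet classifiers" with precomputed predictions
y = np.array([1.0, 1.0, -1.0, -1.0])
H = [np.array([1.0, 1.0, -1.0, -1.0]),
     np.array([1.0, -1.0, -1.0, 1.0]),
     np.array([-1.0, 1.0, 1.0, -1.0])]

def solve_restricted(selected):
    # stand-in for an LP solver: uniform distribution and the best edge so far
    d = np.full(4, 0.25)
    gamma = max(float(np.dot(d, y * H[j])) for j in selected)
    return d, gamma

cols = lpboost_column_generation(y, H, solve_restricted)
```

In the toy run, the perfectly consistent hypothesis is generated first and the loop then stops, since no remaining constraint is violated.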
Following the boosting terminology, we refer to the column generation part as weak learning. In our case, weak learning is formulated as the following optimization problem:
(7) 
Thus, we need to design a weak learner that solves (7) for a given sample weighted by . However, it seems impossible to solve it directly because we only have access to through the associated kernel. Fortunately, we prove a version of the representer theorem below, which makes (7) tractable.
Theorem 1 (Representer Theorem)
The solution of (7) can be written as for some real numbers .
Our theorem can be derived from a nontrivial application of the standard representer theorem (see, e.g., Mohri et al., 2012). Intuitively, we prove the theorem by decomposing the optimization problem (7) into a number of subproblems, so that the standard representer theorem can be applied to each subproblem. The details of the proof are given in Appendix A.
This result gives justification to the simple heuristic used in the standard approach: choosing the shapelets from the training instances. More precisely, the hypothesis class used in the standard approach (Chen et al., 2006; Sangnier et al., 2016) corresponds to the special case where . Thus, our representer theorem suggests that the standard approach of using is reasonable.
Theorem 1 says that the weak learning problem can be rewritten in the following tractable form:
Op 1
Weak Learning Problem
Unlike the primal solution , the dual solution is not expected to be sparse. To obtain a more interpretable hypothesis, we propose another formulation of weak learning in which 1-norm regularization is imposed on , so that a sparse solution for is obtained. In other words, instead of , we consider the feasible set , where is the 1-norm of .
Op 2
Sparse Weak Learning Problem
Note that when running LPBoost with a weak learner for OP 2, we obtain a final hypothesis with the same form of generalization bound as stated in Theorem 2 for a final hypothesis obtained with a weak learner for OP 1. To see this, consider a feasible space for a sufficiently small , so that . Then, since , a generalization bound for also applies to . On the other hand, since the final hypothesis for is invariant to the scaling factor , the generalization ability is independent of .
4 Algorithms
In this section, we present the pseudocode of LPBoost in Algorithm 1 for completeness. Moreover, we describe our algorithms for the weak learners. For simplicity, we denote by a vector given by for every . The objective function of OP 1 (and OP 2) is rewritten as
which can be seen as a difference of two convex functions and of . Therefore, the weak learning problems are DC programs, and thus we can use the DC algorithm (DCA) (Tao and Souad, 1988; Yu and Joachims, 2009) to find an approximation of a local optimum. We employ a standard DC algorithm: for each iteration , we linearize the concave term at the current solution (which is in our case), and then update the solution to by solving the resulting convex optimization problem .
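The DC iteration can be illustrated on a one-dimensional toy problem: minimizing f(x) = x^4 - x^2 with convex parts g(x) = x^4 and h(x) = x^2. The closed-form inner argmin below is specific to this toy instance, not to our weak learning problem, where the linearized subproblem is an SOCP or LP.

```python
import numpy as np

def dca(grad_h, argmin_linearized, x0, tol=1e-10, max_iter=1000):
    """Generic DC algorithm for minimizing f = g - h (g, h convex).
    At each step the concave part -h is linearized at the current point and
    the resulting convex problem is solved."""
    x = x0
    for _ in range(max_iter):
        x_new = argmin_linearized(grad_h(x))  # argmin_x g(x) - <grad_h(x_t), x>
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# toy instance: g(x) = x^4, h(x) = x^2
grad_h = lambda x: 2.0 * x
# argmin_x x^4 - c*x  solves  4 x^3 = c  in closed form
argmin_linearized = lambda c: np.sign(c) * (abs(c) / 4.0) ** (1.0 / 3.0)

x_star = dca(grad_h, argmin_linearized, x0=1.0)  # a local minimum of x^4 - x^2
```

Starting from x0 = 1, the iterates converge to 1/sqrt(2), a local (here also global) minimizer of x^4 - x^2; DCA in general guarantees convergence only to such critical points, which is why the quality of the initial solution matters (see Section 6.4).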
In addition, the problem for OP 1 can be reformulated as a second-order cone programming (SOCP) problem and that for OP 2 as an LP problem, and thus both problems can be solved efficiently. To this end, we introduce new variables for all negative bags , which represent the factors . Then we obtain the following problem, equivalent to for OP 1:
(8)  
sub.to  
It is well known that this is an SOCP problem. Moreover, it is clear that for OP 2 can be formulated as an LP problem. We describe the algorithm for OP 1 in Algorithm 2.
One may be concerned that the kernel matrix could become large when a sample consists of a large number of bags and instances. However, note that the kernel matrix over , which is used in Algorithm 2, needs to be computed only once at the beginning of Algorithm 1, not at every iteration.
As a result, our learning algorithm outputs a classifier
where and are obtained in the training phase. Therefore, the computational cost of predicting the label of is in the worst case, when all elements of are nonzero. However, when we employ our sparse formulation OP 2, which allows us to find a sparse , the computational cost is expected to be much smaller than in the worst case.
5 Generalization Bound of the Hypothesis Class
In this section, we provide a generalization bound on the hypothesis class for various choices of and .
Let . Let . By viewing each instance as a hyperplane , we can naturally define a partition of the Hilbert space by the set of all such hyperplanes. Let be the set of all cells of the partition, that is, . Each cell is a polyhedron defined by a minimal set satisfying . Let . Let be the VC dimension of the set of linear classifiers over the finite set , given by .
Then we have the following generalization bound on the hypothesis class of (2).
Theorem 2
Let . Suppose that for any , . Then, for any , with high probability the following holds for all with :
(10)
where (i) for any , , (ii) if and is the identity mapping (i.e., the associated kernel is the linear kernel), or (iii) if and satisfies the condition that is monotone decreasing with respect to (e.g., the mapping defined by the Gaussian kernel) and , then .
We show the proof in Appendix B.
Comparison with the existing bounds
A similar generalization bound can be derived from a known bound of the Rademacher complexity of (Theorem 20 of Sabato and Tishby, 2012) and a generalization bound of for any hypothesis class (see Corollary 6.1 of Mohri et al., 2012):
Note that Sabato and Tishby (2012) fixed . For simplicity, we omit some constants of Theorem 20 of Sabato and Tishby (2012). Note that by definition. The bound above is incomparable to Theorem 2 in general, as ours uses the parameter and theirs has an extra term. However, our bound is better in terms of the sample size by a factor of when the other parameters are regarded as constants.
6 SL by MIL
6.1 TimeSeries Classification with Shapelets
In the following, we introduce a framework for the time-series classification problem based on shapelets (i.e., the SL problem). As mentioned in the previous section, a time series can be identified with a bag that consists of all subsequences of of length . The learner receives a labeled sample , where each labeled bag (i.e., labeled time series) is independently drawn according to some unknown distribution over a finite support of . The goal of the learner is to predict the labels of unseen time series correctly. In this way, the SL problem can be viewed as an MIL problem, and thus we can apply our algorithms and theory.
Note that, for time-series classification, various similarity measures can be represented by a kernel, for example the Gaussian kernel (which behaves like the Euclidean distance) and the Dynamic Time Warping (DTW) kernel. Moreover, our framework generally applies to non-real-valued sequence data (e.g., text and discrete signals) using a string kernel.
6.2 Our Theory and Algorithms for SL
By Theorem 2, we can immediately obtain the generalization bound of our hypothesis class in SL as follows:
Corollary 3
Consider a time-series sample of size and length . For any fixed , the following generalization error bound holds for all in which the length of each shapelet is :
To the best of our knowledge, this is the first result on the generalization performance of SL.
Theorem 1 gives justification to the heuristic of choosing the shapelets from the instances appearing in the training sample (i.e., the subsequences, for SL tasks). Moreover, several methods using a linear combination of shapelet-based classifiers (e.g., Hills et al., 2014; Grabocka et al., 2014) are supported by Corollary 3.
For the time-series classification problem, shapelet-based classification has a great advantage in interpretability and visibility over other time-series classification methods (see, e.g., Ye and Keogh, 2009). Although we use a nonlinear kernel function, we can observe important subsequences that contribute to effective shapelets by solving OP 2, because of the sparsity (see also the experimental results). Moreover, for unseen time-series data, we can observe which type of subsequence contributes to the predicted class by observing the maximizer .
6.3 Learning Shapelets of Different Lengths
For time-series classification, many existing methods take advantage of using shapelets of various lengths. Below, we show that our formulation can easily be applied to this case.
A time series can be identified with a bag that consists of subsequences of of all lengths. That is, this is also a special case of MIL in which a bag contains instances of different dimensions.
There is a simple way to apply our learning algorithm to this case: we employ a kernel that supports instance pairs and of different dimensions. Fortunately, such kernels have been studied well in the time-series domain. For example, the DTW kernel and the Global Alignment kernel (Cuturi, 2011) are well-known kernels that support time series of different lengths. However, the size of the kernel matrix over becomes , which in practice requires a high memory cost for large time-series data. Moreover, such kernels generally incur a higher computational cost than standard kernels.
We introduce a practical way to learn shapelets of different lengths based on a heuristic. We decompose the original weak learning problem over the space of different-dimensional data into weak learning problems over each fixed-dimensional data space. For example, we consider solving the following problem instead of the weak learning problem OP 1:
sub.to 
where denotes the set of -dimensional instances (i.e., subsequences of length ) in , and denotes . The total size of the kernel matrices becomes , and thus this method does not require such a large kernel matrix. Moreover, in this way, we do not need a kernel that supports instances of different dimensions. Note that, even using this heuristic, the obtained final hypothesis retains the theoretical generalization performance, because the hypothesis class is still represented in the form of (2). In our experiments, we use this latter method, giving weight to memory efficiency.
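The per-length decomposition can be sketched as follows; `toy_weak_learner` is a stand-in that returns an (objective, shapelet) pair, not the actual OP 1 solver, and all names are illustrative.

```python
import numpy as np

def series_to_multilength_bag(series, lengths):
    """Bag of subsequences of several lengths, grouped by length."""
    s = np.asarray(series, dtype=float)
    return {ell: [s[i:i + ell] for i in range(len(s) - ell + 1)]
            for ell in lengths}

def best_shapelet_over_lengths(bags_by_length, weak_learner):
    """Heuristic of Section 6.3: solve one weak learning problem per
    subsequence length and keep the best objective value overall."""
    results = [weak_learner(ell, subsequences)
               for ell, subsequences in bags_by_length.items()]
    return max(results, key=lambda r: r[0])   # (objective, shapelet) pairs

bag = series_to_multilength_bag([1.0, 2.0, 3.0, 4.0], lengths=[2, 3])

# a toy stand-in weak learner: score a length by its number of subsequences
toy_weak_learner = lambda ell, subs: (float(len(subs)), subs[0])
obj, shapelet = best_shapelet_over_lengths(bag, toy_weak_learner)
```

Since each per-length subproblem only involves instances of one dimension, a standard fixed-dimension kernel suffices, and the kernel matrices stay small.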
6.4 Heuristics for computational efficiency
For practical applications, we introduce some heuristics for improving the efficiency of our algorithm.
Reduction of
Especially for time-series data, the size often becomes large because . Therefore, constructing a kernel matrix over incurs a high computational cost for time-series data. For example, when we consider subsequences as instances for time-series classification, the number of subsequences of the training data becomes very large (e.g., approximately when the sample size is and the length of each time series is , which results in a similarity matrix of size ). However, in most cases, many subsequences in time-series data are similar to each other. Therefore, we use only representative instances instead of the set of all instances . In this paper, we use k-means clustering to reduce the size of . Note that our heuristic approach is still supported by our theoretical generalization error bound: the hypothesis set with the reduced shapelets is a subset of , and the Rademacher complexity of is no larger than that of . Thus, since Theorem 2 holds for the hypothesis class considering the set of all possible shapelets , it also holds for the hypothesis class using the reduced set of shapelets . Although this approach may decrease the training classification accuracy in practice, it drastically decreases the computational cost for a large dataset.
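A minimal sketch of the reduction step, using a plain Lloyd's k-means implemented with NumPy (in practice any k-means library can be used); the data, k, and the function name are illustrative.

```python
import numpy as np

def kmeans_representatives(instances, k, n_iter=50, seed=0):
    """Reduce the instance set to k representative centroids (Lloyd's k-means),
    used in place of the full set of subsequences when building the kernel matrix."""
    X = np.asarray(instances, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each instance to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute centers (keep the old center if a cluster empties)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

# two tight clusters of subsequence-like vectors reduce to two representatives
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
reps = kmeans_representatives(X, k=2)
```

The returned centroids replace the full instance set when forming the kernel matrix, shrinking it from size |X|^2 to k^2.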
Initialization in weak learning problem
A DC program may converge slowly to a local optimum depending on the initial solution. In Algorithm 2, we fix the initial as follows: we initially solve
(11)  
That is, we choose the most discriminative shapelet from as the initial point of for the given . We expect that this speeds up the convergence of the loop of line 3, and that the obtained classifier is better than those of methods that choose effective shapelets from the subsequences.
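The initialization rule can be sketched as follows, again with the Gaussian kernel as the similarity measure; `init_shapelet` is a hypothetical name, and the exhaustive d-weighted scoring over all training instances is written for clarity rather than efficiency.

```python
import numpy as np

def gaussian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def init_shapelet(bags, labels, d, gamma=1.0):
    """Pick the training instance that, used alone as a shapelet, best
    separates the d-weighted bags -- the initial point for the DC iterations."""
    candidates = [x for bag in bags for x in bag]
    def edge(u):
        scores = [max(gaussian_kernel(x, u, gamma) for x in bag) for bag in bags]
        return sum(di * yi * si for di, yi, si in zip(d, labels, scores))
    return max(candidates, key=edge)

# toy demo: one positive bag near 0 and one negative bag near 5
bags = [[np.array([0.0]), np.array([1.0])], [np.array([5.0])]]
labels = [1.0, -1.0]
u0 = init_shapelet(bags, labels, d=[0.5, 0.5])
```

In the toy demo, the instance farthest from the negative bag is selected, since it maximizes the weighted edge.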
7 Experiments
In this section, we show experimental results implying that our algorithm performs comparably with the existing shapelet-based classifiers for both SL and MIL tasks (the code of our method is available at https://github.com/suehiro93/MILIMS_NECO).
7.1 Results for TimeSeries Data
We use several binary-labeled datasets from the UCR datasets (Chen et al., 2015), which are often used as benchmarks for time-series classification methods (note that our method is applicable to multi-class classification tasks by a simple extension, e.g., Platt et al. (2000)). We used the weak learning problem OP 2 because interpretability of the obtained classifier is required in shapelet-based time-series classification.
We compare the following three shapelet-based approaches.

Shapelet Transform (ST) provided by Bagnall et al. (2017)

Learning TimeSeries Shapelets (LTS) provided by Grabocka et al. (2014)

Our algorithm using shapelets of different lengths (Ours)
We used the implementation of ST provided by Löning et al. (2019) and the implementation of LTS provided by Tavenard et al. (2017). The classification rule of Shapelet Transform has the form:
where is a user-defined classification function (the implementation employs a decision forest) and is a shapelet (in the time-series domain). The shapelets are chosen from the training subsequences in some complicated way before learning . The classification rule of Learning Time-Series Shapelets has the form:
where and are learned parameters, and the number of desired shapelets is a hyperparameter.
Below we describe the experimental conditions in detail. For ST, we set the shapelet lengths to , where is the length of each time series in the dataset. ST also requires a time limit for searching shapelets, which we set to 5 hours for each dataset. For LTS, we used the hyperparameter sets (regularization parameter, number of shapelets, etc.) that the authors recommend on their website (http://fs.ismll.de/publicspace/LearningShapelets/), and we found an optimal hyperparameter by cross-validation for each dataset. For our algorithm, we implemented a weak learning algorithm that supports shapelets of different lengths (see Section 6.3). In this experiment, we consider the case in which each bag contains subsequences of lengths . We used the Gaussian kernel , with chosen from . We chose from . We used k-means clustering with respect to each class to reduce . The only parameters we need to tune are and . We tuned them via the procedure given in Appendix B.1. As the LP solver for WeakLearn and LPBoost, we used the CPLEX software. In addition to Ours, LTS also employs k-means clustering, to set the initial shapelets in its optimization algorithm. Therefore, we report the average accuracies of LTS and Ours over the randomness of k-means clustering.
The classification accuracy results are shown in Table 1. We can see that our algorithm achieves performance comparable to ST and LTS. We conducted the Wilcoxon signed-rank test between Ours and each of the others. The p-value of the Wilcoxon signed-rank test between Ours and ST is 0.1247, and that between Ours and LTS is 0.6219. Both p-values are higher than 0.05, and thus we cannot reject the null hypothesis that there is no significant difference between the medians of the accuracies. We can say that our MIL algorithm works well for time-series classification tasks without using domain-specific knowledge.
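For reference, the test used above can be computed as follows. In practice scipy.stats.wilcoxon provides it; the self-contained sketch below computes an exact two-sided p-value by enumerating all sign assignments, so it is only suitable for small numbers of datasets.

```python
import itertools
import numpy as np

def wilcoxon_signed_rank_p(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired accuracies.

    Enumerates all 2^n sign patterns (exact null distribution), so it is
    meant for small n only."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    d = d[d != 0.0]                                # drop zero differences
    n = len(d)
    absd = np.abs(d)
    # ranks of |d|, averaging ties
    order = np.argsort(absd)
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    for v in np.unique(absd):
        ranks[absd == v] = ranks[absd == v].mean()
    w_obs = ranks[d > 0].sum()
    mu = ranks.sum() / 2.0                         # mean of W under the null
    # count sign patterns at least as extreme as the observed statistic
    count = sum(1 for signs in itertools.product([0, 1], repeat=n)
                if abs(sum(r for s, r in zip(signs, ranks) if s) - mu)
                >= abs(w_obs - mu) - 1e-12)
    return count / 2.0 ** n
```

A p-value above the 0.05 threshold, as in the comparisons reported above, means the null hypothesis of equal medians cannot be rejected.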
We would also like to compare the computation time of these methods. We selected datasets on which the three methods achieved similar performance. The experiments were performed on an Intel Xeon Gold 6154 (36-core CPU) with 192GB of memory. Table 2 shows a comparison of the training times. Note again that, for ST, we set the running-time limit for finding good shapelets to 5 hours; this limit is a hyperparameter of the code and is difficult to estimate before the experiments. LTS worked efficiently compared with ST and Ours; however, LTS achieved lower accuracy than ST and Ours. Table 3 shows the testing times of the methods. LTS also worked efficiently here, simply because LTS finds a fixed number of effective shapelets (a hyperparameter), whereas ST and Ours may find a large number of shapelets, which increases the computation time of prediction. For the Wafer dataset, ST and Ours required a large computation time compared with LTS. We cannot fairly compare the efficiency of these methods because the implementation environments (e.g., programming languages) differ. However, we can say that the proposed method overall achieved high classification accuracy with reasonable running time for training and prediction.
Dataset  ST  LTS  Ours 

BeetleFly  0.8  0.765  0.835 
BirdChicken  0.9  0.93  0.935 
Coffee  0.964  1  0.964 
Computers  0.704  0.619  0.623 
DistalPhalanxOutlineCorrect  0.757  0.714  0.802 
Earthquakes  0.741  0.748  0.728 
ECG200  0.85  0.835  0.872 
ECGFiveDays  0.999  0.961  1 
FordA  0.856  0.914  0.89 
FordB  0.74  0.9  0.786 
GunPoint  0.987  0.971  0.987 
Ham  0.762  0.782  0.698 
HandOutlines  0.919  0.892  0.87 
Herring  0.594  0.652  0.588 
ItalyPowerDemand  0.947  0.951  0.943 
Lightning2  0.639  0.695  0.779 
MiddlePhalanxOutlineCorrect  0.794  0.579  0.632 
MoteStrain  0.927  0.849  0.845 
PhalangesOutlinesCorrect  0.773  0.633  0.792 
ProximalPhalanxOutlineCorrect  0.869  0.742  0.844 
ShapeletSim  0.994  0.989  1 
SonyAIBORobotSurface1  0.932  0.903  0.841 
SonyAIBORobotSurface2  0.922  0.895  0.887 
Strawberry  0.941  0.844  0.947 
ToeSegmentation1  0.956  0.947  0.906 
ToeSegmentation2  0.792  0.886  0.823 
TwoLeadECG  0.995  0.981  0.949 
Wafer  1  0.993  0.991 
Wine  0.741  0.487  0.72 
WormsTwoClass  0.831  0.752  0.608 
Yoga  0.847  0.69  0.804 
dataset  #train  length  ST  LTS  Ours 

Earthquakes  
GunPoint  
ItalyPowerDemand  
ShapeletSim  
Wafer 
dataset  #test  length  ST  LTS  Ours 

Earthquakes  
GunPoint  
ItalyPowerDemand  
ShapeletSim  
Wafer 
Interpretability of our method
We now illustrate the interpretability of our method. We use the CBF dataset, which contains three classes (cylinder, bell, and funnel) of time series. We chose it because its discriminative patterns are known to be clear, so we can easily ascertain whether the obtained hypothesis captures the effective shapelets. For simplicity, we obtain a binary classification model for each class by preparing one-vs-others training sets. We used Ours with fixed shapelet length . In the following, we introduce two types of visualization for interpreting a learned model.
One is the visualization of the characteristic subsequences of an input time series. When we predict the label of the time series , we calculate a maximizer in for each , that is, . For image recognition tasks, maximizers are commonly used to observe the subimages that characterize the class of the input image (e.g., Chen et al., 2006). In time-series classification tasks, the maximizers can likewise be used to observe characteristic subsequences. Fig. 1 shows an example of a visualization of maximizers. Each value in the legend indicates . That is, subsequences with positive values contribute to the positive class and subsequences with negative values contribute to the negative class. Such a visualization reveals the subsequences that characterize the class of the input time series. For the cylinder class, although both the positive and negative patterns match almost the same subsequence, the positive pattern is stronger than the negative one, and thus the hypothesis can correctly discriminate the time series. For the bell and funnel classes, we can observe that the highlighted subsequences clearly indicate the discriminative patterns.
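The maximizer computation described above can be sketched as follows. This is a minimal illustration assuming a linear feature map over raw subsequences; the function name and the sliding-window bag construction are our simplifications, not the authors' implementation:

```python
import numpy as np

def find_maximizers(series, shapelets, length):
    """For each shapelet u, find the subsequence of `series` attaining
    max over the bag of <u, x>, where the bag consists of all
    length-`length` sliding-window subsequences of the time series."""
    # Build the bag: every contiguous window of the given length.
    bag = np.array([series[t:t + length]
                    for t in range(len(series) - length + 1)])
    maximizers = []
    for u in shapelets:
        sims = bag @ u                 # inner-product similarity per window
        t_star = int(np.argmax(sims))  # start index of the maximizer
        maximizers.append((t_star, float(sims[t_star])))
    return maximizers
```

Highlighting the window starting at each returned index (colored by the sign of the corresponding weight) reproduces the kind of plot discussed above.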
The other is the visualization of a final hypothesis , where ( is the set of representative subsequences obtained by k-means clustering). Fig. 2 is an example of the visualization of a final hypothesis obtained by our algorithm. The colored lines are all the s in for which both and were nonzero. Each legend value shows the product of and corresponding to . That is, positive values of the colored lines indicate the contribution rate for the positive class, and negative values indicate the contribution rate for the negative class. Note that, because it is difficult to visualize the shapelets over the Hilbert space associated with the Gaussian kernel, we plotted each of them to match the original time series based on the Euclidean distance. Unlike previous visualization analyses (see, e.g., Ye and Keogh, 2009), our visualization does not exactly interpret the final hypothesis because of the nonlinear feature map. However, we can deduce that the colored lines represent “important patterns”, which make significant contributions to classification.
Fig. 1: Visualization of maximizers for the cylinder, bell, and funnel classes.
Fig. 2: Visualization of the final hypotheses for the cylinder, bell, and funnel classes.
7.2 Results for MultipleInstance Data
We selected miSVM and MISVM (Andrews et al., 2003) and MILES (Chen et al., 2006) as the MIL baselines. miSVM and MISVM are classical methods in MIL, but they still perform favorably compared with state-of-the-art methods on standard multiple-instance data (see, e.g., Doran, 2015). The details of the datasets are shown in Table 4.
miSVM and MISVM find a single but optimized shapelet, which is not limited to the instances in the training sample. The classifiers obtained by these algorithms are formulated as:
(12)  \( g(B) = \mathrm{sgn}\left( \max_{x \in B} \langle u, \phi(x) \rangle + b \right) \)
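The single-shapelet bag classifier can be sketched as follows; a minimal illustration assuming a linear feature map (the function name is ours, not from the paper's implementation):

```python
import numpy as np

def misvm_predict(bag, u, b):
    """Bag-level prediction with a single learned shapelet u:
    the bag is scored by its most similar instance, i.e.
    sign( max over x in bag of <u, x> + b )."""
    scores = np.array([x @ u for x in bag])
    return 1 if scores.max() + b > 0 else -1
```

The key point is that only the single best-matching instance in the bag determines the prediction, which is why this class is less expressive than a combination of many shapelets.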
MILES finds multiple shapelets, but they are limited to the instances in the training sample. The classifier of MILES is formulated as follows:
(13)  \( g(B) = \mathrm{sgn}\left( \sum_{x' \in X} w_{x'} \max_{x \in B} K(x', x) + b \right) \), where \(X\) is the set of instances in the training sample.
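The MILES construction above embeds each bag by its maximum similarity to every training instance and then classifies linearly in that embedding. A minimal sketch with a Gaussian kernel follows (names and parameterization are ours; the actual 1-norm SVM training step is omitted):

```python
import numpy as np

def gaussian_kernel(x, xp, gamma):
    """k(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def miles_embed(bag, train_instances, gamma):
    """Embed a bag by its max similarity to each training instance:
    m(bag)[j] = max over x in bag of k(x_j, x)."""
    return np.array([max(gaussian_kernel(x, xj, gamma) for x in bag)
                     for xj in train_instances])

def miles_predict(bag, train_instances, w, b, gamma):
    """Linear classification on the embedding; w is assumed to come
    from a sparse (1-norm) linear learner."""
    return 1 if miles_embed(bag, train_instances, gamma) @ w + b > 0 else -1
```

Since the embedding coordinates are indexed by training instances, the candidate shapelets are restricted to the training sample, in contrast with the single optimized shapelet of (mi)SVM-style methods.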
We used the implementation provided by Doran^{5}^{5}5https://github.com/garydoranjr/misvm for miSVM and MISVM. We combined the Gaussian kernel with miSVM and MISVM. The parameter was chosen from . For our method and MILES^{6}^{6}6MILES uses 1-norm SVM to obtain a final classifier. We implemented 1-norm SVM by using the formulation of Warmuth et al. (2008)., we chose from , and we used only the Gaussian kernel. Furthermore, we chose from . We used k-means clustering with respect to each class to reduce . To avoid the randomness of k-means, we ran the training 30 times and selected the model that achieved the best training accuracy. For efficiency, we employed the weak learning problem OP 2. For all these algorithms, we estimated the optimal parameter set via 5-fold cross-validation. We used well-known multiple-instance datasets, as shown on the left-hand side of Table 5. The reported accuracies are averages over 10 runs of 5-fold cross-validation.
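The per-class clustering used to reduce the candidate instances can be sketched as a plain k-means run separately on each class (a minimal Euclidean-distance version; function names and the fixed-iteration loop are our simplifications):

```python
import numpy as np

def reduce_instances(instances, labels, k, n_iter=50, seed=0):
    """Run k-means within each class and keep the k centroids per class
    as representative instances, shrinking the candidate set."""
    rng = np.random.default_rng(seed)
    reps = []
    for c in np.unique(labels):
        X = instances[labels == c]
        # Initialize centroids from k distinct class members.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each instance to its nearest centroid.
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            assign = d.argmin(axis=1)
            # Move each centroid to the mean of its cluster.
            for j in range(k):
                if np.any(assign == j):
                    centers[j] = X[assign == j].mean(axis=0)
        reps.append(centers)
    return np.vstack(reps)
```

Clustering per class (rather than globally) keeps representatives of both the positive and negative patterns, which matters when one class is much smaller.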
dataset  sample size  # instances  dimension 

MUSK1  
MUSK2  
elephant  
fox  
tiger 
dataset  miSVM  MISVM  MILES  Ours 

MUSK1  
MUSK2  
elephant  
fox  
tiger 
dataset  MILES  Ours 

MUSK1  
MUSK2  
elephant  
fox  
tiger 
dataset  miSVM  MISVM  MILES  Ours 

MUSK1  
MUSK2  
elephant  
fox  
tiger 
dataset  miSVM  MISVM  MILES  Ours 

MUSK1  
MUSK2  
elephant  
fox  
tiger 
The results are shown in Table 5. MILES and Ours achieve significantly better performance than miSVM and MISVM. Ours achieves performance comparable to MILES. Table 6 shows the training accuracies of MILES and Ours. It can be seen that Ours achieves higher training accuracy. This result is theoretically reasonable because our hypothesis class is richer than that of MILES. On the other hand, this also means that Ours carries a higher risk of overfitting than MILES.
Table 7 shows the training time of the four methods. It is clear that MILES and Ours are more efficient than miSVM and MISVM. The main reason is that miSVM and MISVM solve Quadratic Programming (QP) problems, while MILES and Ours solve LP problems. On average, MILES ran more efficiently than Ours. However, for MUSK2, which has a large number of instances, Ours ran more efficiently than MILES.
The testing time of each algorithm is shown in Table 8. We can see that Ours is comparable to the other algorithms.
8 Conclusion and Future Work
We proposed a new MIL formulation that provides a richer class of final classifiers based on infinitely many shapelets. We derived a tractable formulation over infinitely many shapelets with theoretical support, and provided an algorithm based on LPBoost and the DC (Difference of Convex) algorithm. Our result gives theoretical justification to some existing shapelet-based classifiers (e.g., Chen et al., 2006; Hills et al., 2014). The experimental results demonstrate that the proposed approach works uniformly for SL and MIL tasks without introducing domain-specific parameters or heuristics, and is competitive with the baseline shapelet-based classifiers.
For time-series classification in particular, the number of instances usually becomes large. Although we took a heuristic approach in the experiments, we do not consider it an essential solution for improving efficiency. We preliminarily implemented OP 1 with Orthogonal Random Features (Yu et al., 2016), which can approximate the Gaussian kernel accurately. This allows us to solve the primal problem of OP 1 directly and to avoid constructing a large kernel matrix. The implementation improved the efficiency drastically; however, it did not achieve accuracy as high as solving OP 2 with the heuristics. For SL tasks, there are many successful efficient methods using heuristics specialized to the time-series domain (Keogh and Rakthanmanon, 2013; Renard et al., 2015; Grabocka et al., 2015; Wistuba et al., 2015; Hou et al., 2016; Karlsson et al., 2016). We will explore ways to improve efficiency for SL tasks.
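To illustrate the random-feature idea mentioned above, the following sketch uses plain random Fourier features, of which Orthogonal Random Features are a lower-variance variant (the orthogonalization step is omitted here; this is our simplification, not the paper's implementation):

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, seed=0):
    """Map X so that z(x) @ z(y) approximates the Gaussian kernel
    exp(-gamma * ||x - y||^2), letting a linear method replace an
    explicit n-by-n kernel matrix.  Orthogonal Random Features
    additionally orthogonalize the rows of W to reduce variance."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies sampled from the kernel's spectral density.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

With such a map, the primal problem can be solved directly in the feature space, which is why constructing a large kernel matrix becomes unnecessary.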
Moreover, we would like to improve the generalization error bound. Our bound is still incomparable with the existing bound. Since we believe this requires a more involved analysis, we leave it for future work. Our heuristics might reduce the model complexity (i.e., the risk of overfitting); however, we do not yet know theoretically how much the complexity can be reduced by our heuristics. To apply our method to various domains, we would like to explore general techniques for reducing the overfitting risk of our method.
Acknowledgement
This work was supported by JST CREST (Grant Number JPMJCR15K5) and JSPS KAKENHI (Grant Number JP18K18001). In the experiments, we used the computer resources offered under the category of General Projects by the Research Institute for Information Technology, Kyushu University.
References
 Andrews and Hofmann (2004) Andrews, S. and Hofmann, T. (2004). Multiple instance learning via disjunctive programming boosting. In Advances in Neural Information Processing Systems, pages 65–72.
 Andrews et al. (2003) Andrews, S., Tsochantaridis, I., and Hofmann, T. (2003). Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems, pages 577–584.

 Auer and Ortner (2004) Auer, P. and Ortner, R. (2004). A boosting approach to multiple instance learning. In European Conference on Machine Learning, pages 63–74.
 Bagnall et al. (2017) Bagnall, A., Lines, J., Bostrom, A., Large, J., and Keogh, E. (2017). The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3):606–660.
 Bartlett and Mendelson (2003) Bartlett, P. L. and Mendelson, S. (2003). Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482.
 Carbonneau et al. (2018) Carbonneau, M.-A., Cheplygina, V., Granger, E., and Gagnon, G. (2018). Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329–353.
 Chen et al. (2006) Chen, Y., Bi, J., and Wang, J. Z. (2006). MILES: Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1931–1947.
 Chen et al. (2015) Chen, Y., Keogh, E., Hu, B., Begum, N., Bagnall, A., Mueen, A., and Batista, G. (2015). The UCR time series classification archive. www.cs.ucr.edu/~eamonn/time_series_data/.
 Cuturi (2011) Cuturi, M. (2011). Fast global alignment kernels. In International conference on machine learning, pages 929–936.
 Demiriz et al. (2002) Demiriz, A., Bennett, K. P., and Shawe-Taylor, J. (2002). Linear Programming Boosting via Column Generation. Machine Learning, 46(1-3):225–254.
 Dietterich et al. (1997) Dietterich, T. G., Lathrop, R. H., and Lozano-Pérez, T. (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71.
 Doran (2015) Doran, G. (2015). Multiple Instance Learning from Distributions. PhD thesis, Case Western Reserve University.
 Doran and Ray (2014) Doran, G. and Ray, S. (2014). A theoretical and empirical analysis of support vector machine methods for multiple-instance classification. Machine Learning, 97(1-2):79–102.
 Gärtner et al. (2002) Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. J. (2002). Multiinstance kernels. In International Conference on Machine Learning, pages 179–186.
 Grabocka et al. (2014) Grabocka, J., Schilling, N., Wistuba, M., and Schmidt-Thieme, L. (2014). Learning time-series shapelets. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 392–401.
 Grabocka et al. (2015) Grabocka, J., Wistuba, M., and Schmidt-Thieme, L. (2015). Scalable discovery of time-series shapelets. CoRR, abs/1503.03238.
 Hills et al. (2014) Hills, J., Lines, J., Baranauskas, E., Mapp, J., and Bagnall, A. (2014). Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery, 28(4):851–881.
 Hou et al. (2016) Hou, L., Kwok, J. T., and Zurada, J. M. (2016). Efficient learning of time-series shapelets. In AAAI Conference on Artificial Intelligence, pages 1209–1215.
 Karlsson et al. (2016) Karlsson, I., Papapetrou, P., and Boström, H. (2016). Generalized random shapelet forests. Data Mining and Knowledge Discovery, 30(5):1053–1085.
 Keogh and Rakthanmanon (2013) Keogh, E. J. and Rakthanmanon, T. (2013). Fast shapelets: A scalable algorithm for discovering time series shapelets. In International Conference on Data Mining, pages 668–676.
 Le Thi and Pham Dinh (2018) Le Thi, H. A. and Pham Dinh, T. (2018). DC programming and DCA: thirty years of developments. Mathematical Programming, 169(1):5–68.
 Löning et al. (2019) Löning, M., Bagnall, A., Ganesh, S., Kazakov, V., Lines, J., and Király, F. J. (2019). sktime: A unified interface for machine learning with time series.
 Mohri et al. (2012) Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of Machine Learning. The MIT Press.
 Platt et al. (2000) Platt, J. C., Cristianini, N., and ShaweTaylor, J. (2000). Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems, pages 547–553.

 Renard et al. (2015) Renard, X., Rifqi, M., Erray, W., and Detyniecki, M. (2015). Random-shapelet: an algorithm for fast shapelet discovery. In IEEE International Conference on Data Science and Advanced Analytics, pages 1–10.
 Sabato and Tishby (2012) Sabato, S. and Tishby, N. (2012). Multi-instance learning with any hypothesis class. Journal of Machine Learning Research, 13(1):2999–3039.
 Sangnier et al. (2016) Sangnier, M., Gauthier, J., and Rakotomamonjy, A. (2016). Early and reliable event detection using proximity space representation. In International Conference on Machine Learning, pages 2310–2319.
 Schölkopf and Smola (2002) Schölkopf, B. and Smola, A. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press.
 Shapiro (2009) Shapiro, A. (2009). Semiinfinite programming, duality, discretization and optimality conditions. Optimization, 58(2):133–161.
 Shimodaira et al. (2001) Shimodaira, H., Noma, K.-i., Nakai, M., and Sagayama, S. (2001). Dynamic time-alignment kernel in support vector machine. In International Conference on Neural Information Processing Systems, pages 921–928.
 Tao and Souad (1988) Tao, P. D. and Souad, E. B. (1988). Duality in D.C. (Difference of Convex functions) Optimization. Subgradient Methods, pages 277–293.
 Tavenard et al. (2017) Tavenard, R., Faouzi, J., and Vandewiele, G. (2017). tslearn: A machine learning toolkit dedicated to timeseries data. https://github.com/rtavenar/tslearn.
 Warmuth et al. (2008) Warmuth, M., Glocer, K., and Rätsch, G. (2008). Boosting algorithms for maximizing the soft margin. In Advances in Neural Information Processing Systems, pages 1585–1592.
 Wistuba et al. (2015) Wistuba, M., Grabocka, J., and Schmidt-Thieme, L. (2015). Ultra-fast shapelets for time series classification. CoRR, abs/1503.05018.
 Ye and Keogh (2009) Ye, L. and Keogh, E. (2009). Time series shapelets: A new primitive for data mining. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 947–956.
 Yu and Joachims (2009) Yu, C.-N. J. and Joachims, T. (2009). Learning structural SVMs with latent variables. In International Conference on Machine Learning, pages 1169–1176.
 Yu et al. (2016) Yu, F. X. X., Suresh, A. T., Choromanski, K. M., HoltmannRice, D. N., and Kumar, S. (2016). Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975–1983.
 Zhang et al. (2006) Zhang, C., Platt, J. C., and Viola, P. A. (2006). Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems, pages 1417–1424.
 Zhang et al. (2013) Zhang, D., He, J., Si, L., and Lawrence, R. (2013). MILEAGE: Multiple instance learning with global embedding. In International Conference on Machine Learning, pages 82–90.
Appendix A Proof of Theorem 1
First, we give a definition for convenience.
Definition 1
[The set of mappings from a bag to an instance]
Given a sample . For any , let be a mapping defined by , and we define the set of all for as . For the sake of brevity, and will be abbreviated as and , respectively.