## 1 Abstract

Multilabel learning is an important topic in machine learning research. Evaluating models in multilabel settings requires specific cross validation methods designed for multilabel data. In this article, we show a weakness in an evaluation metric widely used in the literature, and we present improved versions of this metric and a general method, optisplit, for optimising cross validation splits. We present an extensive comparison of various types of cross validation methods in which we show that optisplit produces better cross validation splits than the existing methods and that it is fast enough to be used on big Gene Ontology (GO) datasets.

## 2 Introduction

Cross validation is a central procedure in machine learning, statistics and other fields. It evaluates model performance by testing models on data points excluded from the training data set. In standard cross validation a dataset $D$ is split randomly into $k$ non overlapping subsets (folds) $D_1, \dots, D_k$. For each fold $D_i$, a model is trained on the data $D \setminus D_i$ and evaluated on $D_i$. The averaged result over all the folds represents the final performance. Cross validation is typically used when the amount of data is limited. If an abundant amount of data is available, standard training test splits are faster to use and provide sufficiently good estimates of the model’s performance. The typically used random split approach assumes that the positive and negative class distributions are balanced. If the class distributions are imbalanced, the resulting splits may not allow efficient learning. As an example, suppose that in a binary classification setting one randomly generated fold contains all the examples in the data belonging to one of the classes. Then the corresponding training set, consisting of the rest of the folds, does not contain any examples of that class and the model cannot learn anything about the class.
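The random split procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the article; the function names `random_folds` and `train_test_pairs` and the toy label vector are assumptions made for the example.

```python
import random

def random_folds(n_examples, k, seed=0):
    """Shuffle the example indices and deal them into k near-equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[j::k] for j in range(k)]

def train_test_pairs(folds):
    """Yield (train, test) index lists, holding out one fold at a time."""
    for j, test in enumerate(folds):
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train, test

# Toy binary labels (hypothetical): only 2 positives among 20 examples.
labels = [1, 1] + [0] * 18
folds = random_folds(len(labels), k=5)
for train, test in train_test_pairs(folds):
    # Positives available for training vs. positives held out for testing.
    print(sum(labels[i] for i in train), sum(labels[i] for i in test))
```

With so few positives, a random split can place both of them in the same fold, in which case the training set for that fold contains no positive example at all.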

Stratified cross validation methods are variants of cross validation that ensure that the class distributions of the folds are close to the class distributions of the whole data. In this work we focus on multilabel classification, which presents additional challenges. In multilabel classification each example can belong to multiple classes simultaneously. In order to use the data efficiently, the cross validation folds should be formed so that the correct class distributions are maintained for all classes and all folds at the same time. Random splitting has historically been a popular choice on multilabel data, but it has been shown to lead to poor results [sechidis11].

As dataset sizes are growing across application domains, multilabel and extreme classification are of growing interest and importance [Bengio19]. Thus, assessing model quality in these settings is becoming more and more important. However, we show in Section 4.1 and Section 6 that the currently used evaluation metrics and cross validation methods do not lead to optimal cross validation splits. Also, many existing methods are unusably slow for big extreme classification datasets. Here we present an algorithm for multilabel cross validation based on optimising the global distribution of all classes with respect to improved loss functions.

The methods presented here are developed in the context of gene ontology (GO) data [ashburner2000go]. The GO datasets used here are high dimensional and contain over half a million examples. The high number of classes, often in the order of thousands to tens of thousands, presents a challenge for analysis. Furthermore, the classes have a hierarchical tree structure, which causes the class distributions to be strongly imbalanced. As a result, most classes contain a very small number of examples while some classes have a very large number of examples. The high number of small classes, i.e. classes that have few positive examples, makes the negative data abundant for most of the classes and makes it difficult to have enough positive examples to train the models.

## 3 Related work

The most widely known method for generating cross validation splits with balanced class distributions is Iterative Stratification (IS) [sechidis11]. Iterative stratification works by dividing the examples evenly into the folds one class at a time. It always chooses the class with the fewest positive examples for processing, and breaks ties first by the largest number of desired examples and further randomly. Smaller classes are more difficult to balance equally among the folds, so starting from them makes sure they get well distributed. Bigger classes are easier to distribute, so distributing them later is justifiable. Iterative stratification has also been extended to consider second order relationships between labels. This method is known as Second Order Iterative Stratification (SOIS) [szymanski17].
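The first order procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the skmultilearn implementation or the reference algorithm of [sechidis11]: labels are given as one set of label ids per example, all folds are meant to be equally sized, and the function name `iterative_stratification` is chosen only for this example.

```python
import random

def iterative_stratification(Y, k, n_labels, seed=0):
    """Y: list of sets of positive label ids, one set per example."""
    rng = random.Random(seed)
    n = len(Y)
    want_total = [n / k] * k                      # desired examples per fold
    counts = [sum(1 for labels in Y if l in labels) for l in range(n_labels)]
    want = [[counts[l] / k] * k for l in range(n_labels)]  # desired positives
    remaining = set(range(n))
    folds = [[] for _ in range(k)]

    def assign(i, j):
        folds[j].append(i)
        remaining.discard(i)
        want_total[j] -= 1
        for l in Y[i]:
            want[l][j] -= 1

    while remaining:
        # Count the unassigned positives of every label.
        left = [sum(1 for i in remaining if l in Y[i]) for l in range(n_labels)]
        candidates = [l for l in range(n_labels) if left[l] > 0]
        if not candidates:
            break
        label = min(candidates, key=lambda l: left[l])  # rarest label first
        for i in [i for i in remaining if label in Y[i]]:
            # Fold that most needs this label; ties broken by the fold that
            # most needs examples overall, then randomly.
            j = max(range(k),
                    key=lambda j: (want[label][j], want_total[j], rng.random()))
            assign(i, j)
    for i in list(remaining):  # examples without any positive label
        j = max(range(k), key=lambda j: (want_total[j], rng.random()))
        assign(i, j)
    return folds
```

Because the rarest label is processed first, a label with exactly `k` positives ends up with one positive in each fold in this sketch.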

The recently introduced Stratified Sampling (SS) algorithm [merrillees21] is designed to produce balanced train/test splits for extreme classification data with a high number of examples and dimensions. It should be faster to use than iterative stratification variants and produce splits with better distributions. This method calculates for each class the proportion of it in training and test sets. Then for each example a score is calculated over its positive classes and the examples with the highest scores are redistributed from one to the other partition. This method needs three parameters that have to be adjusted according to the data and it does not produce cross validation splits directly but training/test splits.

The partitioning method based on stratified random sampling (PMBSRS) [charte16] uses the similarity of the label distributions between examples to group them and then divides them into the cross validation folds. A similarity score is defined as the product of the relative frequencies of the positive labels present in each example. Then the examples are ordered by the score into a list which is cut into as many disjoint subsets as the desired number of cross validation folds. Each cross validation fold is then generated by randomly selecting items without replacement from each subset so that each fold gets an equal proportion of the samples from each subset. The end result is that each fold contains elements with different scores. As a non iterative method resembling the basic random sample method, this is computationally less expensive than the iterative methods. However, it does not measure the split quality directly but rather aims to ensure that all folds contain an equal amount of differently sized classes.
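The steps of PMBSRS described above can be sketched as follows. This is an illustrative reading of the description, not the implementation of [charte16]; the function name `pmbsrs_folds`, the set-based label representation, and the round-robin dealing used to give each fold an equal share of every subset are assumptions of the sketch.

```python
import random

def pmbsrs_folds(Y, k, seed=0):
    """Y: list of sets of positive label ids, one set per example."""
    rng = random.Random(seed)
    n = len(Y)
    # Relative frequency of each label in the whole data.
    counts = {}
    for labels in Y:
        for l in labels:
            counts[l] = counts.get(l, 0) + 1

    def score(labels):
        # Product of the relative frequencies of the positive labels.
        s = 1.0
        for l in labels:
            s *= counts[l] / n
        return s

    order = sorted(range(n), key=lambda i: score(Y[i]))
    # Cut the ordered list into k contiguous, disjoint subsets.
    subsets = [order[j * n // k:(j + 1) * n // k] for j in range(k)]
    folds = [[] for _ in range(k)]
    turn = 0
    for subset in subsets:
        rng.shuffle(subset)          # random selection without replacement
        for idx in subset:
            folds[turn % k].append(idx)
            turn += 1
    return folds
```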

## 4 Cross validation evaluation metrics

The cross validation evaluation metrics present in the literature are based on either evaluating the quality of the folds directly or applying some model on the folds and comparing the learning results. Here we focus on directly comparing the quality of the folds in order to make the comparisons model independent. The most commonly used evaluation metrics in the literature are the Labels Distribution (LD) measure and the Examples Distribution (ED) measure [sechidis11].

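Let $k$ be the number of cross validation folds and $q$ be the number of classes. $D$ denotes the set of instances and $S_j$, $j = 1, \dots, k$, denotes the folds, which are disjoint subsets of $D$. The subsets of $D$ and $S_j$ containing positive examples of label $i$ are denoted as $D^i$ and $S_j^i$.

We define the positive and negative frequencies for fold $j$ and label $i$ as $p_{ij} = |S_j^i| / |S_j|$ and $1 - p_{ij}$, respectively. Similarly for the whole data, positive frequency as $d_i = |D^i| / |D|$ and negative frequency as $1 - d_i$.

Then, we define

$$\mathrm{LD} = \frac{1}{q} \sum_{i=1}^{q} \left( \frac{1}{k} \sum_{j=1}^{k} \left| \frac{p_{ij}}{1 - p_{ij}} - \frac{d_i}{1 - d_i} \right| \right) \tag{1}$$

and

$$\mathrm{ED} = \frac{1}{k} \sum_{j=1}^{k} \Big| |S_j| - c_j \Big| \tag{2}$$

where $c_j$ is the desired number of examples in fold $j$ (for equally sized folds, $c_j = |D| / k$).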

Intuitively, LD measures how the distribution of the positive and negative examples of each label in each subset compares to the distribution in the whole data (LD compares odds, i.e. positive to negative ratios), and ED measures the deviation between the number of examples of each subset and the desired number of examples. Since the exact equality of the fold sizes is not generally important in practice, the ED score is merely useful in checking that the fold size differences are sufficiently small compared to the data size.

As noted in Section 4.1, it is especially important for training and evaluation of models that the distributions of small classes in the cross validation folds are well balanced. Small classes are hardest to split well, since there are fewer ways to distribute the examples and even small differences in fold distributions give big relative differences. A good cross validation evaluation metric should be able to correctly quantify the quality of the folds even on small classes.
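As a concrete reading of the two measures, the sketch below computes LD as the mean absolute difference between each fold's positive to negative odds and the whole-data odds, and ED as the mean absolute deviation of the fold sizes from equal sizes. The function names and the set-based label representation are assumptions of the sketch, and it assumes that no label is positive for every example of a fold or of the whole data (the odds would otherwise be undefined).

```python
def label_distribution(Y, folds, n_labels):
    """LD. Y: list of sets of positive label ids; folds: lists of indices."""
    n = len(Y)
    total = 0.0
    for i in range(n_labels):
        d_pos = sum(1 for labels in Y if i in labels)
        d_odds = d_pos / (n - d_pos)              # whole-data odds of label i
        per_fold = 0.0
        for fold in folds:
            s_pos = sum(1 for idx in fold if i in Y[idx])
            per_fold += abs(s_pos / (len(fold) - s_pos) - d_odds)
        total += per_fold / len(folds)
    return total / n_labels

def examples_distribution(folds, n):
    """ED, taking n / k as the desired size of every fold."""
    k = len(folds)
    return sum(abs(len(fold) - n / k) for fold in folds) / k

# Toy data (hypothetical): one label, positive on every even index.
Y = [{0} if i % 2 == 0 else set() for i in range(20)]
balanced = [list(range(j * 4, j * 4 + 4)) for j in range(5)]
print(label_distribution(Y, balanced, 1), examples_distribution(balanced, 20))
```

In the toy split every fold holds two positives and two negatives, so both measures are zero.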

In the LD calculation, if we suppose that data size is constant, classes with more positive examples have a larger positive to negative ratio compared to smaller classes. More formally, if the fold's positive frequency $p_{ij}$ equals the whole-data positive frequency $d_i$, the difference is zero and so is the contribution. Otherwise, the odds $x / (1 - x)$ grow faster than the frequency $x$, therefore the same absolute difference of $p_{ij}$ from $d_i$ results in a larger contribution by larger classes.

LD does not account for this as it calculates a simple arithmetic mean over all class specific scores. Therefore, the final score is more affected by larger classes.

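In Section 4.1, we show empirically that LD is not class size independent, i.e. the size of the class affects its LD score. Hence, we propose the relative Labels Distribution (rLD) measure:

$$\mathrm{rLD} = \frac{1}{q} \sum_{i=1}^{q} \left( \frac{1}{k} \sum_{j=1}^{k} \left| \frac{d_i - p_{ij}}{d_i} \right| \right) \tag{3}$$

where $p_{ij}$ and $d_i$ are the positive frequencies of label $i$ in fold $j$ and in the whole data, respectively. rLD is the relative deviation of positive frequencies compared to a flat distribution across folds.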

In addition, we present another metric that is insensitive to the class size, the Delta-Class Proportion (DCP):

$$\mathrm{DCP} = \frac{1}{q} \sum_{i=1}^{q} \left| \max_{j = 1, \dots, k} \frac{|S_j^i|}{|D^i|} - \frac{1}{k} \right| \tag{4}$$

where the first part of the function represents the observed result and $1/k$ is the positive frequency of a flat distribution. Compared to rLD, this metric does not measure the relative distributions of folds other than the largest. Hence, DCP is a coarser metric that is used here for comparison purposes. It should also be noted that DCP measures only the distributions of the positive examples. In practice, when the number of positive examples is smaller and there exists a class imbalance, the negative distributions are effectively balanced.
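The two proposed measures can be sketched in the same style. The sketch assumes that rLD averages, over labels and folds, the relative deviation of the fold's positive frequency from the whole-data positive frequency, and that DCP compares the largest fold's share of a label's positives with 1/k. Function names and the set-based label representation are assumptions, and every label is assumed to have at least one positive example.

```python
def relative_label_distribution(Y, folds, n_labels):
    """rLD. Y: list of sets of positive label ids; folds: lists of indices."""
    n = len(Y)
    total = 0.0
    for i in range(n_labels):
        d = sum(1 for labels in Y if i in labels) / n   # whole-data frequency
        per_fold = 0.0
        for fold in folds:
            p = sum(1 for idx in fold if i in Y[idx]) / len(fold)
            per_fold += abs((d - p) / d)
        total += per_fold / len(folds)
    return total / n_labels

def delta_class_proportion(Y, folds, n_labels):
    """DCP: share of a label's positives in its largest fold, minus 1/k."""
    k = len(folds)
    total = 0.0
    for i in range(n_labels):
        d_pos = sum(1 for labels in Y if i in labels)
        largest = max(sum(1 for idx in fold if i in Y[idx]) for fold in folds)
        total += abs(largest / d_pos - 1 / k)
    return total / n_labels
```

Both functions return zero for a split in which every fold holds the same number of positives of every label and all folds are equally sized.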

### 4.1 Comparison of evaluation metrics

We generated a synthetic dataset to experimentally compare the behavior of the cross validation split evaluation metrics presented in Section 4. The data consisted of a binary target matrix of size . The positive class sizes of the data were set to be an equally spaced sequence from 20 to half of the data size.

Each class was divided into 10 cross validation folds in three ways: 1) all folds were equally sized, 2) there was a 20% increase in one fold, 20% decrease in one fold, and the rest were equal and 3) the class was missing from one fold while the other folds were equally sized. That is, all the classes had the same fold distributions while the absolute class sizes were different. Therefore, a good evaluation metric should give a similar score for all classes. In practical settings it is especially important to correctly divide the smaller classes since the bigger classes are naturally better distributed.

We evaluated the folds using the metrics presented in Section 4. The results confirm that the widely used LD depends on the class size. LD gives smaller values to smaller classes and higher values to bigger classes even though their fold distributions are identical (see Figure 1 (a)). The behaviour of rLD and DCP on synthetic data, Figure 1 (b) and (c), shows that neither rLD nor DCP is affected by the class size. Therefore, we recommend using them instead of LD for measuring cross validation split quality.
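The class size dependence can be reproduced with a short calculation for the third scenario (one fold without any positives, the other nine folds sharing them equally). The sketch below evaluates the per-class LD and rLD terms in closed form, assuming LD compares odds and rLD compares relative positive frequencies; the data size of 100000 examples is a hypothetical value chosen for illustration, and fractional per-fold counts are allowed for simplicity.

```python
def missing_fold_scores(n, c, k=10):
    """Per-class LD and rLD terms for a class with c positives out of n
    examples when one of k equally sized folds has no positives."""
    fold_size = n / k
    positives = [0.0] + [c / (k - 1)] * (k - 1)   # positives in each fold
    d = c / n                                      # whole-data frequency
    ld = sum(abs(s / (fold_size - s) - d / (1 - d)) for s in positives) / k
    rld = sum(abs((d - s / fold_size) / d) for s in positives) / k
    return ld, rld

n = 100000  # hypothetical data size
for c in (20, 1000, 50000):
    ld, rld = missing_fold_scores(n, c)
    print(c, ld, rld)
```

Under these assumptions the rLD term stays at 0.2 for every class size, while the LD term grows by several orders of magnitude between the smallest and the largest class.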

## 5 Algorithm

In this section, we present a new general cross validation method, optisplit, for optimising cross validation splits with respect to any cross validation metric that can produce class specific scores. Optisplit can also be used to generate standard train-test splits by generating a cross validation split whose fold size equals the desired test set size and then forming the test set from one fold and the training set from the remaining folds. The details of optisplit are presented formally in Algorithm 1.

Intuitively, we start by randomly generating the initial subsets, i.e. cross validation folds. Let $M$ be an evaluation metric that evaluates the folds with respect to a single class $i$. We calculate the score vector $A$ of the class specific scores for all classes and calculate the global loss function $L$. Let $L_0$ be the initial value of $L$. Let $i^*$ be the index of the class with the highest score in $A$. We balance the class $i^*$ by moving examples from folds with excess examples to folds without enough examples so that the class distributions are balanced for the class (Function balance in Algorithm 1). After processing the class, the global loss is recalculated. Let $L_1$ be the recalculated loss. If $L_1 \geq L_0$, undo the changes and move to the next worst class. Otherwise keep the modification and continue to the next worst class. Continue this process for all classes for many epochs until either the global loss does not improve any more or a desired max iterations limit is reached. We only allow balancing operations that lead to improvement in the global loss $L$. Balancing a class can change the distribution of other classes. Thus it may be possible to balance later a class that is skipped in the first epoch.

The time complexity of Algorithm 1 consists of processing the classes and, for each class, calculating the loss, finding the class with the highest loss and redistributing the elements. In practice, optisplit could also easily be used on top of another, possibly faster, method to fine tune the results.
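The greedy procedure described above can be sketched as follows. This is a simplified illustration, not the accompanying implementation: it uses an rLD-style class score, takes the global loss to be the sum of the class scores (an assumption, since the loss is not spelled out here), and balances a class by repeatedly moving one positive example from the fold with the most positives to the fold with the fewest. All function names are hypothetical.

```python
import random

def rld_scores(Y, fold_of, k, n_labels):
    """Class specific rLD-style scores for a fold assignment."""
    n = len(Y)
    sizes = [0] * k
    pos = [[0] * k for _ in range(n_labels)]
    for labels, j in zip(Y, fold_of):
        sizes[j] += 1
        for l in labels:
            pos[l][j] += 1
    scores = []
    for l in range(n_labels):
        d = sum(pos[l]) / n
        if d == 0:                       # label without positives: skip
            scores.append(0.0)
            continue
        devs = [abs((d - pos[l][j] / max(sizes[j], 1)) / d) for j in range(k)]
        scores.append(sum(devs) / k)
    return scores

def balance(Y, fold_of, k, label, rng):
    """Even out the positives of `label`; return the moves for undoing."""
    members = [[] for _ in range(k)]
    for i, labels in enumerate(Y):
        if label in labels:
            members[fold_of[i]].append(i)
    for m in members:
        rng.shuffle(m)
    moved = []
    while True:
        big = max(range(k), key=lambda j: len(members[j]))
        small = min(range(k), key=lambda j: len(members[j]))
        if len(members[big]) - len(members[small]) <= 1:
            return moved
        i = members[big].pop()
        members[small].append(i)
        fold_of[i] = small
        moved.append((i, big))

def optisplit_sketch(Y, k, n_labels, max_epochs=10, seed=0):
    rng = random.Random(seed)
    fold_of = [i % k for i in range(len(Y))]
    rng.shuffle(fold_of)                 # random initial folds
    loss = sum(rld_scores(Y, fold_of, k, n_labels))
    for _ in range(max_epochs):
        start = loss
        scores = rld_scores(Y, fold_of, k, n_labels)
        # Visit the classes from the worst score to the best.
        for label in sorted(range(n_labels), key=lambda l: -scores[l]):
            moved = balance(Y, fold_of, k, label, rng)
            new_loss = sum(rld_scores(Y, fold_of, k, n_labels))
            if new_loss < loss:
                loss = new_loss          # keep the improvement
            else:
                for i, old in reversed(moved):
                    fold_of[i] = old     # undo a non-improving change
        if loss >= start:                # no improvement in this epoch
            break
    return fold_of, loss
```

The sketch recomputes all class scores after every balancing step, which keeps the code short but is slower than necessary.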

Unlike some of the competing methods, optisplit does not need any data specific parameters that have to be adjusted for different datasets.

Note that in the accompanying practical implementation the classes that have more positive than negative examples are balanced with respect to the negative distribution. This is not important for GO data but could be useful in some other applications.

## 6 Cross validation experiments

In Section 4, we showed that it is not advisable to use LD as a cross validation split evaluation metric. Most existing multilabel stratified cross validation methods found in the literature (see Section 3) are optimised for producing splits with good LD or similar class size dependent scores. In this section, we examine the performance of the existing methods, namely SOIS, IS, SS and PMBSRS, as well as our own optisplit with respect to the proposed new metrics rLD and DCP. We show that if one wants to get cross validation splits that are good with respect to rLD and DCP, optisplit is the best option available since it can be used to directly optimise them.

In the following experiments we set $k = 5$. All the results presented are averages over 10 runs with different random initialisations. The experiments were run using Python 3.6.8 on a machine with AMD opteron-6736 1.4GHz. Implementations of optisplit and the experiments presented in this article are available at https://github.com/xtixtixt/optisplit.

We used iterative stratification implementations from the popular skmultilearn library [skml] for first order and second order iterative stratification experiments. The Stratified Sampling [merrillees21] implementation used was the one provided with the article. Note that the SS implementation produces train test splits not cross validation splits. In order to compare it to the rest of the methods we have split the data by recursively splitting it to approximately 1/k sized test sets. Thus the method is run 5 times.

The methods compared here can be divided into three categories. IS and SOIS are iterative stratification based methods, SS and optisplit are optimisation based methods and PMBSRS is a random split based method.

We optimised optisplit with respect to both rLD and DCP to compare the effect of the cost function on the outcome.

### 6.1 Datasets

We used a wide range of diverse datasets: bibtex, delicious and mediamill from the MULAN dataset collection [mulan] that have been used to evaluate earlier similar methods. Datasets CC (cellular component), MF (molecular function) and BP (biological process) are our own GO subset datasets used in protein function prediction (see [toronen2018pannzer2, zhou2019cafa] for more info). These are considerably bigger and sparser than the MULAN datasets used here. The dataset Wiki10-31K [Bhatia16] is a large and very sparse extreme classification dataset. Here, classes without any positive or negative examples are excluded. Detailed properties of the datasets are presented in Table 1.

Table 1: Properties of the datasets.

| Data | Examples | Labels | Density | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| bibtex | 7395 | 159 | 0.0151 | 51 | 61 | 82 | 130 | 1042 |
| delicious | 16015 | 983 | 0.0193 | 21 | 58 | 105 | 258 | 6495 |
| mediamill | 43907 | 101 | 0.0433 | 31 | 93 | 312 | 1263 | 33869 |
| CC | 577424 | 1688 | 0.0077 | 5 | 66 | 225 | 891 | 577410 |
| MF | 637552 | 3452 | 0.0028 | 11 | 61 | 150 | 498 | 637533 |
| BP | 666349 | 11288 | 0.0028 | 4 | 41 | 123 | 493 | 666338 |
| Wiki10-31K | 20762 | 30938 | 0.0006 | 2 | 2 | 3 | 6 | 16756 |

### 6.2 Results

Scores of the evaluation metrics for all datasets and methods are presented in Table 2 with the following exceptions: IS and SOIS results are not presented for the biggest datasets BP and Wiki10-31K because their runtime was prohibitively high. Wiki10-31K results are not presented for SS because the implementation used produced an error when run on that particular data. For comparison purposes we have also presented scores for random split (Random).

The results show that optisplit performs better than previous methods with respect to rLD and DCP scores when optimising with either of those. The runtimes of optisplit are also competitive when compared to other top performing methods. Generally, iterative stratification based methods perform quite well but are unusably slow on bigger datasets. Random split based methods are fast but produce poor quality folds compared to more advanced methods. Optimisation based methods (optisplit and SS) usually give best results and their runtimes are in the middle of iterative stratification based and random split based methods.

We can see that DCP and rLD are strongly correlated: the ordering of the methods is similar with respect to both metrics, and optimising optisplit with respect to DCP produces nearly as good rLD results as optimising rLD directly. However, since rLD measures the folds more thoroughly, i.e. it does not just concentrate on the biggest fold, it seems to be a better practical choice than DCP in most cases.

Note that optisplit does not attempt to produce exactly equally sized splits. This results in quite high ED (Example Distribution) scores compared to some other methods. This should not be a problem in practical machine learning settings.

For completeness, we have included LD evaluations in Table 2. As is to be expected from the results presented in Section 4, we can see that the method ordering is often considerably different with respect to LD scores. In smaller and less imbalanced datasets LD gives results more in line with rLD and DCP. For bigger and more imbalanced datasets, where the weakness of LD gets more pronounced, the results differ more significantly. There, LD favours iterative stratification based methods and gives the random split based PMBSRS a noticeably better score than Random, in contrast to the rLD and DCP evaluations.

Table 2: Scores of the evaluation metrics for all datasets and methods.

| Dataset | Method | ED | LD | DCP | rLD | Runtime (s) |
|---|---|---|---|---|---|---|
| bibtex | optisplit (rLD) | 27 | 0.0004 | 0.0073 | 0.0234 | 5 |
| | optisplit (DCP) | 38 | 0.0005 | 0.0068 | 0.0315 | 5 |
| | SOIS | 16 | 0.0005 | 0.0143 | 0.0425 | 5 |
| | IS | 17 | 0.0007 | 0.0206 | 0.0604 | 1 |
| | SS | 57 | 0.0007 | 0.0173 | 0.0465 | 4 |
| | Random | 0 | 0.0022 | 0.0564 | 0.1693 | 1 |
| | PMBSRS | 2 | 0.0022 | 0.0558 | 0.1660 | 1 |
| mediamill | optisplit (rLD) | 53 | 0.0005 | 0.0053 | 0.0187 | 10 |
| | optisplit (DCP) | 7 | 0.0005 | 0.0047 | 0.0176 | 11 |
| | SOIS | 1 | 0.0003 | 0.0205 | 0.0610 | 77 |
| | IS | 1 | 0.0008 | 0.0280 | 0.0854 | 50 |
| | SS | 36 | 0.0009 | 0.0068 | 0.0231 | 31 |
| | Random | 0 | 0.0019 | 0.0379 | 0.1142 | 1 |
| | PMBSRS | 2 | 0.0019 | 0.0386 | 0.1150 | 1 |
| delicious | optisplit (rLD) | 35 | 0.0010 | 0.0223 | 0.0666 | 75 |
| | optisplit (DCP) | 28 | 0.0010 | 0.0215 | 0.0772 | 76 |
| | SOIS | 16 | 0.0012 | 0.0458 | 0.1357 | 381 |
| | IS | 13 | 0.0013 | 0.0489 | 0.1461 | 11 |
| | SS | 75 | 0.0007 | 0.0221 | 0.0625 | 40 |
| | Random | 0 | 0.0015 | 0.0507 | 0.1515 | 1 |
| | PMBSRS | 2 | 0.0015 | 0.0512 | 0.1525 | 1 |
| CC | optisplit (rLD) | 193 | 8.1053 | 0.0065 | 0.0230 | 2762 |
| | optisplit (DCP) | 78 | 8.1195 | 0.0062 | 0.0248 | 2872 |
| | SOIS | 5 | 5.7623 | 0.0305 | 0.0894 | 204823 |
| | IS | 1 | 5.6075 | 0.0448 | 0.1320 | 97120 |
| | SS | 182 | 8.8831 | 0.0133 | 0.0416 | 1118 |
| | Random | 0 | 10.2802 | 0.0455 | 0.1342 | 1 |
| | PMBSRS | 1 | 6.3606 | 0.0448 | 0.1332 | 17 |
| MF | optisplit (rLD) | 607 | 3.2147 | 0.0064 | 0.0229 | 4980 |
| | optisplit (DCP) | 89 | 3.1782 | 0.0063 | 0.0252 | 5209 |
| | SOIS | 84 | 3.0536 | 0.0490 | 0.1450 | 53646 |
| | IS | 1 | 2.7503 | 0.0490 | 0.1451 | 20442 |
| | SS | 656 | 4.7935 | 0.0129 | 0.0400 | 1018 |
| | Random | 0 | 6.2255 | 0.0493 | 0.1465 | 1 |
| | PMBSRS | 1 | 4.8213 | 0.0498 | 0.1480 | 17 |
| BP | optisplit (rLD) | 612 | 2.1053 | 0.0161 | 0.0516 | 59436 |
| | optisplit (DCP) | 173 | 2.0951 | 0.0156 | 0.0576 | 52412 |
| | SS | 279 | 2.9140 | 0.0282 | 0.0857 | 2734 |
| | Random | 0 | 2.896 | 0.0568 | 0.1664 | 1 |
| | PMBSRS | 0 | 2.4250 | 0.0567 | 0.1662 | 21 |
| Wiki10-31K | optisplit (rLD) | 1678 | 0.0002 | 0.2579 | 0.7563 | 3065 |
| | optisplit (DCP) | 367 | 0.0002 | 0.2068 | 0.8033 | 3053 |
| | SS | error | error | error | error | error |
| | Random | 0 | 0.0002 | 0.3008 | 0.9316 | 1 |
| | PMBSRS | 2 | 0.0002 | 0.3013 | 0.9323 | 1 |

## 7 Discussion and future work

In this article, we have shown that the most widely used multilabel cross validation split evaluation metric, LD, does not measure split quality correctly when used on unequally sized classes. In response, we have presented new metrics with better properties and a new general method, optisplit, for generating and optimising multilabel stratified cross validation splits. We have compared optisplit to existing methods and found that it produces better quality cross validation folds with respect to the new metrics than the previous methods and scales well to GO sized datasets. We note for future work that optisplit could be made faster by calculating the loss only for the classes that have been modified in the previous balancing operation. For sparse data, that should allow it to be used even on considerably larger datasets. Also, optisplit currently uses a greedy hill-climbing approach for optimising the target function. A Monte Carlo / simulated annealing based version could achieve even better performance.
