Quantification is the task of estimating the prevalence (frequency) of the classes in an unlabeled sample of data, that is, counting how many data points belong to each class (gonzalez2017review). Several practical applications, in diverse fields, rely on quantifying unlabeled data points. In social sciences, quantification predicts election results by analyzing different data sources that express support for the candidates (hopkins2010method). In natural language processing, it assesses how probable each meaning of a word is (Chan2006). In entomology, it infers the local density of mosquitoes in a specific area covered by an insect sensor (Chen2014).
As is the case with classification, quantifiers generally learn from a labeled sample. In fact, one can achieve quantification by merely counting the output of a classification model. However, this approach produces suboptimal quantification performance (tasche2016does). The challenge of designing accurate counting methods has led to the establishment of quantification as a research area of its own, driven by a thriving community of researchers.
Although this community has been responsible for several novel quantification methods (gonzalez2017review; maletzke2019dys), these methods mainly focus on cases where there is plenty of labeled data for all classes. Furthermore, they assume that the set of classes is known a priori. However, depending on the problem at hand, we may be interested in counting observations that belong to a target class while not having substantial data from the others.
For example, suppose we can use an intelligent sensor to count insects that belong to the Anopheles
mosquito genus, the vector of malaria, an infectious disease that affects more than 200 million people yearly. Even though we aim to count only a single class, the sensor will produce data points for other insect species in its vicinity. Considering that the number of insect species is estimated to be between six and ten million (chapman2013insects), it is unfeasible to build a dataset that reliably represents every non-target species. In another example, we may be interested in counting how many people in a social network would be interested in following a brand account. In this case, we only know which people already follow a certain brand, from which we can induce the behavior of interested people. However, no data is available to model the behavior of users who are not particularly interested in this brand.
In applications like these, we need a method capable of counting the target class while not directly modeling the behavior of other classes. In other words, we cannot assume any available data to be representative of the behavior of future observations that do not belong to the target class.
The previous problem is in fact a major challenge that has been mostly overlooked by the Quantification community. Indeed, to the best of our knowledge, we were the first to address this setting within the Quantification literature (denisOCQ2018). In our previous work, we introduced the task of One-class Quantification (OCQ). In OCQ the goal is, from a training sample containing only data points belonging to the target class (positive data points), to induce a model that can estimate the prevalence of the positive class in a data sample containing unlabeled data points. We proposed two methods for achieving such a goal, the Passive-Aggressive Threshold (PAT) and One Distribution Inside (ODIn).
As previously mentioned, we were the first researchers to define OCQ in the context of the Quantification community. However, it is important to point out that, in the wider context of Machine Learning, OCQ was not the first framework to tackle the problem of counting with only positive labeled data. Work in a research area called Positive and Unlabeled Learning (PUL) also developed methods that solve this problem with publications that go as far back as 2008 (elkan2008learning), under the task named Positive and Unlabeled Prior Estimation (PUPE). The main distinction between the methods proposed for PUPE and for OCQ is that the former do not induce models that can be reapplied for several unlabeled test samples, while the latter do. Thus, to a great extent, both Quantification and PUPE share common goals. However, somewhat surprisingly, they have evolved as disparate academic fields.
One purpose of this paper is therefore to contribute to a better awareness of how each area can enrich the other. More specifically, in our previous work (denisOCQ2018) we proposed PAT and ODIn and compared them solely against baseline and topline approaches under a narrow experimental setup (see Section 5.2). In this paper, we extend our previous efforts by:
Critically reviewing some of the most relevant methods in PUPE literature in detail, thus unifying PUPE and OCQ literature;
Extending our experimental setup to better understand the behavior of the methods under varying circumstances (see Section 5.3);
Comparing our proposals against the actual state-of-the-art rather than baselines, according to quantification error and time cost;
Developing the Exhaustive TIcE (ExTIcE), a variation of the existing Tree Induction for c Estimation (TIcE) (bekker2018estimating).
In our analysis, we discover that PAT outperforms all other methods tested in the majority of settings we evaluated while being orders of magnitude faster.
However, by relying on scores as a proxy for data behavior, PAT's performance decreases when the target class overlaps with other classes to a great extent. To address this problem, we propose Exhaustive TIcE (ExTIcE), an extension of Tree Induction for c Estimation (TIcE) (bekker2018estimating), which can maintain quantification performance even under substantial overlap, as long as there is little to no overlap in some region of the feature space. Although ExTIcE is computationally expensive, it raises promising ideas for future work.
In the next section, we continue this article with a summary of important concepts that are applied throughout the remainder of our work. In Sections 3 and 4, we review the most prominent methods for PUPE and OCQ, respectively, including our proposals ExTIcE, PAT, and ODIn. The reviewed methods are later compared according to the experimental evaluation described in Section 5, whose results are presented and discussed in Section 6. Section 7 discusses the strengths and limitations of the evaluated approaches, as well as ways to compose them, opening possibilities for future research. Finally, in Section 8, we conclude this article with a brief overview of our findings and prospects for future work.
In this section, we introduce relevant definitions used throughout this work and clarify the difference between classification and the quantification tasks that we investigate.
In Sections 2.1 and 2.2, we explain classification and scoring, respectively, which are base tools for several quantification techniques. In Section 2.3, we define the quantification task, explain how it relates to classification, and demonstrate the limitation of achieving quantification by counting classifications. In Section 2.4, we introduce One-class Quantification (OCQ), whose models do not rely on any expectations about the negative class and therefore forgo negative data. In Section 2.5, we explain Positive and Unlabeled Prior Estimation (PUPE), which is similar to OCQ, albeit without requiring an explicitly reusable model. Finally, in Section 2.6, we further differentiate OCQ from PUPE and discuss how their differences impact performance evaluation in the literature.
2.1 Classification
In supervised learning, we are interested in learning from a training sample , where is a vector with attributes in the feature space , and is its respective class label. For the sake of readability, from now on we refer to simply as . Therefore, .
The objective of classification is to correctly predict the class labels of the observations in an unlabeled test sample based on their features. A classifier is formalized in Definition 1.
A classifier is a model induced from such that
which aims to approximate the true labeling function.
In classification, we usually assume that all observations are independent and identically distributed (i.i.d) (upton2014dictionary). “Identically distributed” means that all observations, from either the training or test samples, share the same underlying distribution. “Independently distributed” means that the observations are independent of each other. In other words, the occurrence of one observation does not affect the probability of the occurrence of any other particular observation.
2.2 Scoring
There are different mechanisms employed by classifiers to decide which class is assigned to any given observation. We emphasize one that is frequently adopted for binary classification problems, that is, problems where . In binary classification, one of the two classes is denominated the positive class, while the other is denominated the negative class. In this setting, one can induce a scorer, as formalized in Definition 2.
A scorer is a model induced from such that
which produces a numerical value called score
that correlates with the posterior probability of the positive class. Consequently, the greater the score, the higher the chance that the observation belongs to the positive class.
For classification purposes, if such a score is greater than a certain threshold , the observation is classified as positive. Otherwise, it is classified as negative (flach2012machine). For the sake of brevity, we henceforth refer to scores of negative observations simply as negative scores, and analogously refer to scores of positive observations as positive scores. Such denominations are not to be confused with the sign of the numerical value of the scores. Given a scorer , the classification task is fulfilled as follows:
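As a concrete illustration, the thresholding rule above can be sketched as follows (a minimal sketch; the default threshold of 0.5 is an arbitrary choice for illustration):

```python
import numpy as np

def classify_with_threshold(scores, threshold=0.5):
    """Assign the positive class (1) when the score exceeds the
    threshold, and the negative class (0) otherwise."""
    return np.where(np.asarray(scores) > threshold, 1, 0)
```

Raising the threshold makes the classifier more conservative about predicting the positive class; lowering it has the opposite effect.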
2.3 Quantification
Although quantification and classification share similar characteristics, the main one being the representation of data, their objectives differ. A quantifier need not provide individual class predictions. Instead, it must assess the overall quantity of observations that belong to a specific class or a set of classes (gonzalez2017review). A quantifier is formally defined by Definition 3.
A quantifier is a model induced from that predicts the prevalence of each class in a sample, such that
denotes the universe of possible samples from . For a given test sample , the quantifier outputs a vector , where estimates the prior probability for class , such that . The objective is to be as close as possible to the true prior ratios
of the probability distribution from which the sample was drawn.
Similarly to classification, in quantification we still assume that observations are sampled independently. Additionally, as the main task is to measure the prior probabilities of the classes, it is also assumed that the class distribution can change significantly from the training sample (which supports the induction of the quantifier) to the test sample; otherwise, a quantifier would not be needed.
One straightforward way of achieving quantification is to count the predictions produced by a classifier. This method is called Classify and Count (CC) (Forman2005). Naturally, performing CC with a perfect classifier always produces a perfect quantification. However, accurate quantifiers do not necessarily need to rely on accurate classifiers. Since our objective is purely to count how many observations belong to each class, misclassifications can nullify each other, as illustrated in Table 1.
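A minimal sketch of CC, under the convention that positive predictions are encoded as 1:

```python
import numpy as np

def classify_and_count(predictions, positive_label=1):
    """CC: the estimated prevalence of the positive class is simply the
    fraction of observations the classifier labels as positive."""
    predictions = np.asarray(predictions)
    return float(np.mean(predictions == positive_label))
```

For instance, given true labels [1, 1, 0, 0], a classifier that outputs [1, 0, 1, 0] commits one false negative and one false positive, yet CC still returns the correct prevalence of 0.5, because the two errors cancel out.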
As in Table 1, Figure 1 illustrates a scenario in which we obtain perfect quantification despite imperfect classification, given a classification model based on a scorer. This illustration will come in handy later to visually understand the systematic error of CC.
Despite CC providing perfect quantification under specific test conditions, it is systematically flawed for any classifier that does not consistently achieve perfect classification. It is important to point out that perfect classifiers are rarely achievable in real-world applications. For illustration purposes, consider a case of binary quantification. Let be the estimated proportion of the positive class in an unlabeled test sample, while is the true positive class ratio in the unlabeled test sample. In CC, is estimated as follows:
Observe that we can decompose in terms of how many positive observations were correctly classified and how many negative observations were incorrectly classified, even though these values are not obtainable without access to true labels.
To put it in a probabilistic perspective, let be an alias for , which is the classifier's False Positive Rate (FPR). In other words, it is the proportion of negative observations that are wrongly classified as positive. Analogously, let be an alias for , which is the classifier's True Positive Rate (TPR). In other words, it is the proportion of positive observations that are correctly classified as such. In this context, can be defined as:
From the previous equation, we can derive that the absolute quantification error caused by CC is:
where is an alias for , which is the classifier’s False Negative Rate (FNR) or, in other words, the proportion of positive observations that are wrongly classified as negative.
From Equation 2, observe that the error relates to the absolute difference between the hatched areas (false positive and false negative) in Figure 1. Intuitively, this means that, for a score-based quantifier, it is enough to select a threshold that causes the number of false-negative observations to be the same as the number of false-positive observations. However, those values depend on the true-positive ratio , which is the variable we want to estimate in the first place, thus making this method of choosing a threshold impracticable. Observe that if we do not apply the absolute function, we can easily express as a linear function of :
This implies that , the absolute quantification error generated by CC, grows linearly as the actual positive class ratio moves below or above the value for which quantification is perfect. This effect holds for any classifier in which either or is not null. Figure 2 illustrates such an effect with a real dataset.
To further illustrate the aforementioned effect, Figure 3 depicts a change of in the density function of scores for a score-based classifier. Compared to Figure 1, we can notice that the area related to false positive observations shrank, while the area related to false negative observations expanded, as increased. In general, if the proportion of positive observations is greater than the one that yields perfect quantification, the predicted positive ratio is underestimated, since the number of false negatives becomes greater than the number of false positives. Likewise, if the proportion of positive observations is lower than the one that yields perfect quantification, the predicted positive ratio is overestimated. We point the interested reader to tasche2016does for a thorough investigation of the limitations of quantification without adjustments.
If we extend our analysis of binary CC to the multiclass scenario, a similar systematic error pattern emerges.
Although most quantification algorithms rely, at some point, on classifiers or scorers, there are several ways to minimize the systematic error. In the binary case, if we rewrite Equation 1 to isolate the true positive ratio , we have:
With Equation 4 we conclude that if we know the actual values of TPR and FPR, we can calculate as a function of . That is, we can derive the actual proportion of positive observations from the biased estimated by Classify and Count. This is the principle of Adjusted Classify and Count (ACC) (Forman2005), which is defined in the following equation, where and are estimates of and , respectively:
As ACC derives from Equation 4, it produces perfect quantification when the estimates of FPR and TPR are both correct. However, and are typically estimated empirically with labeled training data and procedures such as -fold cross-validation, which may lead to imperfect estimates.
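The adjustment can be sketched as follows (a minimal sketch; the clipping to [0, 1] and the fallback when TPR equals FPR are practical safeguards, not part of the equation itself):

```python
def adjusted_classify_and_count(p_cc, tpr, fpr):
    """ACC sketch: invert the linear relation between the CC estimate
    and the true positive ratio using estimates of TPR and FPR.
    The result is clipped to [0, 1], since imperfect TPR/FPR estimates
    can push the adjusted value outside the valid range."""
    if tpr == fpr:
        return p_cc  # degenerate classifier: the adjustment is undefined
    p_acc = (p_cc - fpr) / (tpr - fpr)
    return min(max(p_acc, 0.0), 1.0)
```

For example, a classifier with TPR 0.9 and FPR 0.1 that reports a raw CC estimate of 0.6 yields an adjusted estimate of (0.6 - 0.1) / (0.9 - 0.1) = 0.625.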
2.4 One-class Quantification
Now, let us turn our attention to the context in which we want to quantify the observations of only one target class (the positive class). The negative class is a mixture of distributions comprising everything that does not belong to the target class. Each component in this mixture is a negative sub-class. The data we have available for negative sub-classes constitute our partial knowledge about the negative class.
One problem with typical quantification is that if there is an exceedingly large number of negative sub-classes, the ones for which we have data might not be enough to reliably model the general behavior of the negative class.
In addition to that, there are cases where it is difficult to guarantee that an observation belongs to the negative class. For example, suppose that we sell a product online and can track the preferences of our customers via their social media profiles. Our customers can be used as positive training data for the task of identifying who might be interested in purchasing our product. On the other hand, gathering data for the negative class is not as trivial. If we randomly sample online social media profiles, the resulting set would contain people who are uninterested in the product, but also some potential customers. Explicitly gathering data for the negative class could involve an online poll, which is time-consuming and can still generate potentially biased data.
In a second example, suppose that we want to count the number of people that are infected with a disease in a population. Due to procedure costs, people may test for a disease only if they are suspected of having it. In that case, while we can have a sample of people that were positively tested for such a disease, our data for people who were negatively tested may be severely lacking and biased. In such a case, a random sample of people would include both people who are not infected and people who are infected but were never diagnosed.
If we are interested in quantifying only the positive class and we are unable to have a reliable representation of the negative class, we may need to rely solely on positive training data to induce a quantification model.
One-class Quantification (OCQ) is the task of inducing a quantification model with only positive data, as formalized in Definition 4 (denisOCQ2018).
A one-class quantifier is a quantification model induced from a single-class dataset, in which all available labeled examples belong to the same class, say the positive one, and
The one-class quantifier outputs a single probability estimate of the positive class prevalence. Notice, however, that operates over , i.e., a sample with all occurring classes.
Excluding the explicit objective of inducing a model and disregarding training data afterward, OCQ shares the same purpose as Positive and Unlabeled Prior Estimation, which is detailed in the next section.
2.5 Positive and Unlabeled Prior Estimation
Positive and Unlabeled Prior Estimation (PUPE) is a task derived from Positive and Unlabeled Learning (PUL). The main task of the latter is akin to classification. To better explain PUPE, we first briefly introduce PUL.
In the general case of PUL (elkan2008learning), we are provided with two samples of data. One of such samples, , contains only positive (and therefore labeled) observations, whereas the other, , contains unlabeled observations that can be either positive or negative. The objective is to infer the individual labels of the observations in the unlabeled sample. Figure 4 illustrates the general setting of PUL.
Observe that the basic description of PUL does not pose explicit restrictions regarding the proportion of the classes in the unlabeled data. However, knowing such a statistic makes the labeling task easier (elkan2008learning). If the labeling is based on a scorer, for instance, the number of positive observations can be used to set a classification threshold. Unfortunately, the number of positive observations in an unlabeled sample is not readily available, although it can be estimated. In that sense, Positive and Unlabeled Prior Estimation (PUPE) is a sub-task whose sole objective is to predict the proportion of the classes, which can eventually support labeling.
A common assumption across different pieces of work on PUL and PUPE is that the labeled sample is “selected completely at random” from the pool of positive examples. More specifically, this assumption states that each positive observation has a constant probability of being labeled (elkan2008learning). Consider a function that annotates whether a positive observation is labeled, as follows:
In such a case, the assumption specifies that
that is, the probability of being labeled is a constant for any positive x, regardless of its feature values. Note that, by definition,
. By applying Bayes' theorem, it also follows that
from which it follows that (elkan2008learning)
Put simply, the labeled sample is a uniform sample from all available positive observations. More importantly, this assumption, and how the algorithms exploit it, implies that the labeled sample and the positive observations from the unlabeled sample share the same probability distribution. Therefore,
We note that this assumption is also made by OCQ methods, since they aim to induce a model that estimates the probability distribution of the positive class. Despite this shared assumption, there are differences between OCQ and PUPE, which are described in the next section.
2.6 Differences between OCQ and PUPE
Having described OCQ and PUPE, we stress that, from a practical perspective, algorithms from both research areas can solve the same set of problems interchangeably. Therefore, direct comparisons between the methods of the two areas are due. However, while both families of methods can solve the same problems, there is an essential distinction between the problems they aim to solve. PUPE describes the task as containing exactly two samples: there is no particular interest in inducing a single model that can quantify several test samples. This description influenced the development of PUPE techniques and, as a result, all of the examined techniques rely on transductive learning at all stages of the quantification process: they do not produce a single reusable model, and a costly process must be repeated for every test sample that needs to be evaluated.
On the other hand, OCQ methods create a model that estimates the distribution of the positive class, with which it is possible to quantify any given sample at a later time. As we show in this article, this perspective on the problem gives OCQ techniques a sizable advantage over PUPE techniques in terms of the time needed to process a large number of test samples.
We also note that in the PUPE literature the task is often to estimate either or , whereas in OCQ we are interested in estimating . Note that is the chance of an observation belonging to the positive class considering both labeled and unlabeled data. Also, recall that , that is, the ratio of labeled data to unlabeled positive data. Both probabilities depend on the labeled set, which is intended for training.
Meanwhile, converting to the PUPE terminology, , that is, the proportion of positive observations considering only the unlabeled set. This divergence is reflected in how the experimental results are reported. We highlight that, by measuring the error of estimates of either or , the value obtained is highly influenced by the number of labeled observations (which are training data). On the other hand, the size of the training data does not influence evaluation measurements based on
. Thus, given our discussion, we argue that one should adopt evaluation metrics based on to measure the performance of methods in either OCQ or PUPE.
We can convert the estimation to according to the following equation:
where is the number of labeled observations and is the number of unlabeled observations. The min function in the expression limits the result, since the predicted can have a corresponding over one, which would be meaningless for quantification. Observe that is inversely proportional to .
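Under the “selected completely at random” assumption, one consistent reading of this conversion can be sketched as follows (the expected number of unlabeled positives given the label frequency c is n_labeled * (1 - c) / c, which we divide by the unlabeled sample size and cap at 1):

```python
def c_to_alpha(c_hat, n_labeled, n_unlabeled):
    """Convert an estimate of the label frequency c = P(s=1 | y=1)
    into the estimated proportion of positives in the unlabeled sample.
    The min cap keeps the proportion meaningful for quantification."""
    if c_hat <= 0.0:
        return 1.0  # degenerate estimate: cap at the maximum proportion
    alpha = n_labeled * (1.0 - c_hat) / (c_hat * n_unlabeled)
    return min(1.0, alpha)
```

Note the inverse relation: as the estimated c shrinks, the implied number of unlabeled positives, and hence the estimated proportion, grows.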
Finally, we emphasize that, although the general assumption of both OCQ and PUPE is that the negative class cannot be estimated from labeled data, both tasks also assume that the distribution of the negative class does not fully contain the distribution of the positive class. In other words, the positive class is at least partially separable from the negative class. Algorithms may impose stricter versions of this assumption to successfully achieve quantification. For instance, Elkan's algorithm requires a clear separation between negative and positive observations.
3 Methods for Positive and Unlabeled Prior Estimation
In this section, we review six of the most prominent methods in the PUPE literature, highlighting key aspects of their rationale and the implications for practical use, in addition to a seventh method, ExTIcE, which we propose in this paper. We do our best to simplify the rationale behind each method and to offer an intuitive, approachable explanation that unveils the uniqueness of each algorithm. Four of the methods are later used in our experimental evaluation.
3.1 Elkan (EN)
To the best of our knowledge, elkan2008learning were the first to explicitly tackle the prior estimation problem in Positive and Unlabeled Learning as a separate task. They introduce three techniques to estimate , one of which, henceforth called Elkan's method (EN), is their recommended choice. The rationale of this method derives directly from Equation 5. Precisely, the technique estimates in the following two steps:
In the first step, using both unlabeled and labeled datasets together, we train a classification model capable of producing calibrated probabilities, where the class feature is whether the observation belongs to or not. In other words, the classifier aims to predict rather than . As the model is a calibrated scorer, it estimates .
In the second step, in order to estimate and therefore , EN uses as a proxy for the condition of the aforementioned probability. It averages all probabilities obtained for the observations in as follows:
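Both steps can be sketched as follows. We use a hand-rolled logistic regression as a stand-in for the calibrated scorer (an assumption for the sake of a self-contained example; any calibrated probabilistic classifier would do):

```python
import numpy as np

def elkan_c_estimate(X_labeled, X_unlabeled, lr=0.05, epochs=2000):
    """EN sketch. Step 1: train a model to predict whether an
    observation is labeled, i.e., P(s=1 | x), using the labeled sample
    against the unlabeled one. Step 2: average the predicted
    probabilities over the labeled (hence positive) observations to
    estimate c = P(s=1 | y=1)."""
    X = np.vstack([X_labeled, X_unlabeled]).astype(float)
    s = np.concatenate([np.ones(len(X_labeled)), np.zeros(len(X_unlabeled))])
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column

    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):  # plain batch gradient descent on the log-loss
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        w -= lr * Xb.T @ (p - s) / len(s)

    Xlb = np.hstack([np.asarray(X_labeled, dtype=float),
                     np.ones((len(X_labeled), 1))])
    p_labeled = 1.0 / (1.0 + np.exp(-np.clip(Xlb @ w, -30, 30)))
    return float(p_labeled.mean())
```

With, say, 50 labeled positives and an unlabeled mix of 50 positives and 50 negatives drawn from well-separated clusters, the estimate should land near the true value c = 50/100 = 0.5, since the model cannot distinguish labeled from unlabeled positives.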
Figure 5 exemplifies Elkan's algorithm on the same dataset that generated Figure 4. We make two observations based on these figures. First, positive observations, either labeled or unlabeled, share similar sizes in Figure 5. Indeed, as they have the same probability distribution, they also occupy the same region of the feature space uniformly. In such a case, where features are useless to distinguish labeled observations from positive but unlabeled ones, the best possible estimate for the probability of any single positive observation being labeled is the proportion of labeled observations in the shared space, therefore (see Equation 5).
The second important aspect to note in Figure 5 is that, as a negative observation gets farther from the positive cluster, it also gets smaller. This happens because it gets farther from the labeled observations, which are the classification target for the induced model. This remark raises the question of what would happen if there were further spatial overlap between the classes. Notice that EN estimates by averaging for all . This works under the assumption that
While it is true that for every observation in , we emphasize that the classification model learns how to estimate , not . The true value of the former probability is given according to the following equation:
By providing the classifier only with instances from , EN implicitly assumes that , whereas it may not be the case. Indeed, will be significantly lower than when there is overlap between the classes, since in such cases . For this reason, when there is overlap between the classes, EN underestimates and therefore overestimates . As we show in the next sections, newer algorithms handle the possibility of class overlap better than EN by different means.
3.2 PE and pen-L1
du2014class demonstrated that the calculations in EN can be reinterpreted as a minimization of the Pearson divergence (Pd) between and , where the former is estimated from and the latter from . Finally, they introduced PE, which can be simplified with the expression:
The major benefit of PE over EN is that the former drops the need for an intermediate model that accurately estimates the posterior probability, whereas the latter needs a calibrated scorer. However, similarly to EN, PE also overestimates the proportion of positive observations whenever there is overlap between the classes. As PE is a reinterpretation of EN and shares the same caveat regarding overestimation, we do not detail the method any further.
To circumvent the overestimation of PE, christoffel2016class introduced pen-L1, which applies a biased and heavy penalization on that implies that in some regions of the feature space . Such an implication is unrealistic (bekker2018estimating).
3.3 AlphaMax
AlphaMax was introduced by jain2016nonparametric. In their terminology, corresponds to the mixture sample and to the component sample. The AlphaMax algorithm estimates the maximum proportion of in .
To better explain the intuition behind AlphaMax, let be a set that contains all positive instances in , and a set that contains all negative instances in . Finally, let be the density function of the probability distribution of sample . We know that:
Thanks to the assumption of “selected completely at random”, we also know that . In such a case, we can rewrite Equation 7 as follows:
In Equation 8, note that as increases, has to proportionally decrease. The objective of AlphaMax is to determine the maximum possible value of , which is when , for which the equation is still valid.
In practice, however, we cannot split into and , since the data is unlabeled. To circumvent this limitation, AlphaMax constructs two density functions, and , that re-weight the density functions (which estimates the mixture ) and (which estimates the component ), according to a shared weight vector . We emphasize that specifically counterbalances by applying it with , similarly to what happens to the component of . For a given , AlphaMax poses an optimization problem to define , given the constraint that , where are the weights of . For instance, if is estimated using histograms, would be the proportional height of each bin.
The optimization problem tries to maximize a log-likelihood of the mixture (estimation for ) given the weighted participation of the component (estimation for ). It is stated below:
Different values of in the interval are applied in the above optimization problem. While is lower than , it is possible for to counterbalance , keeping the log-likelihood about the same. However, once the applied is greater than , the log-likelihood should decrease. AlphaMax returns the value of that starts the knee in the curve of by , i.e., the value of that precedes a steep decrease in . Figure 6 illustrates that process.
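The knee-selection step can be illustrated with a much simpler stand-in than AlphaMax's actual criterion: given the curve of log-likelihoods over candidate values, return the candidate at the sharpest downward bend, measured by the discrete second difference (this is only an illustration of the idea, not the algorithm's actual rule):

```python
import numpy as np

def knee_alpha(alphas, log_likelihoods):
    """Simplified knee detection: the point of most negative curvature
    in the log-likelihood curve, i.e., where a roughly flat segment
    turns into a steep decrease."""
    curvature = np.diff(log_likelihoods, 2)  # discrete second derivative
    return alphas[int(np.argmin(curvature)) + 1]
```

On a curve that stays flat up to the true proportion and then drops steeply, this picks the last candidate before the decline.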
An updated version called AlphaMaxN (jain2016estimating) specifically tackles the possibility of the labeled set containing false-positive observations. This setting is out of the scope of this paper. However, we note that in the appendix of jain2016estimating there is a mathematically detailed description of the AlphaMax algorithm that is more approachable than the description in its original paper.
Lastly, we emphasize that solving the optimization problem to define is generally a computationally intensive task, and it must be performed several times (once for each value of ).
The algorithms in the KM family (ramaswamy2016mixture) follow a rationale similar to AlphaMax’s. The main difference is that, instead of using log-likelihood to measure the suitability of a candidate value for regarding the mixture sample , they use distances between kernel embeddings. A deeper comprehension of the algorithm requires familiarity with Reproducing Kernel Hilbert Spaces, which is out of the scope of this paper.
There are two variants of KM: KM1 and KM2. They differ in how they select the gradient threshold, which is used in a similar fashion to the “knee” in AlphaMax. KM1 is derived from a theoretical foundation developed by the authors, while KM2 is motivated by empirical evidence.
3.5 Tree Induction for Estimation (TIcE)
Tree Induction for Estimation (TIcE) (bekker2018estimating), like prior PU algorithms, builds on the assumption of “selected completely at random”. Observe that Equation 6 can be rewritten as follows:
From Equation 9, we can derive that a reasonable estimation for is:
where contains all positive instances in . However, notice that, as is unlabeled, we cannot directly embed in any equation, in practice.
Consider a function that produces a sub-sample of that contains all observations that are within the region of the feature space . With such a function, we define as follows:
where contains all negative instances in .
Finally, TIcE is interested in finding a region for which approximates , and therefore . To this end, it needs to downplay the influence of . Notice that the region that maximizes should simultaneously minimize , since the remainder of the ratio in Equation 11 should approximate the constant value according to the assumption of “selected completely at random”. Therefore, TIcE proposes the following optimization problem:
where is a correction factor (more on that later in this section).
We emphasize that, from the optimization task above, one can derive diverse methods that follow markedly distinct approaches. TIcE, in particular, performs a greedy search by inducing a tree, as we describe next.
In simplified terms, to find such a , TIcE uses to greedily induce a tree in which each node is a sub-region of the feature space within the region defined by its parent node. The node that produces the highest (given constraints on the minimum number of observations) is used to assess one estimate of . Several estimates are made via -fold cross validation, and the final estimate is the average of all of them.
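To make the rationale concrete, the sketch below (our own simplification; `tice_like_estimate` is a hypothetical name) performs a single level of median splits and keeps the sub-region with the highest ratio of labeled points, which estimates the label frequency. TIcE itself induces a multi-level tree, applies the correction factor, and averages over folds:

```python
import numpy as np

def tice_like_estimate(X, labeled, min_size=10):
    """One-level sketch of TIcE's search: try a median split on every
    feature and keep the sub-region with the highest proportion of
    labeled (known positive) points, an estimate of the label
    frequency c = P(labeled | positive)."""
    best = labeled.mean()  # ratio over the whole sample (root region)
    for f in range(X.shape[1]):
        median = np.median(X[:, f])
        for region in (X[:, f] <= median, X[:, f] > median):
            if region.sum() >= min_size:
                best = max(best, labeled[region].mean())
    return best

rng = np.random.default_rng(0)
# Positives cluster around +2, negatives around -2; half the positives carry a label.
X = np.vstack([rng.normal(2.0, 0.5, (200, 1)), rng.normal(-2.0, 0.5, (200, 1))])
labeled = np.zeros(400, dtype=bool)
labeled[:100] = True
c_hat = tice_like_estimate(X, labeled)  # ratio inside the mostly-positive region, ≈ 0.5
alpha_hat = labeled.mean() / c_hat      # estimated positive proportion, ≈ 0.5
```

In this toy example, the region above the median contains essentially only positives, half of which are labeled, so the labeled ratio inside it recovers the label frequency and, through it, the positive proportion.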
We note that although TIcE is introduced as a typical tree-induction algorithm, it is more accurate to describe it as either a greedy search or a biased optimization algorithm, since it uses the estimation assessed by only one node in the tree that may not necessarily be a leaf. Indeed, the algorithm actually intends to locate one region within the feature space.
Regarding the computational cost of solving such an optimization problem, the time complexity of TIcE described by bekker2018estimating is an overly optimistic . We provide a better estimate of TIcE’s time complexity; the full analysis is presented in Appendix A.
Regarding the non-optimality of the solution provided by TIcE, the tree-induction approach causes two biases. First, notice that TIcE’s greedy search tries to maximize the local estimate of . This incurs an overestimation of , as we observed in preliminary experiments. To counteract such an overestimation, TIcE subtracts an always non-negative correction factor from . However, we believe that the reasoning behind is inaccurate. According to bekker2018estimating
, the correction relates to the binomial distribution with which each positive observation is labeled (and therefore appears in): each observation is labeled with a fixed chance so that the expectation of is , but a difference can reasonably occur. However, that difference could go both ways, and the ratio obtained from sample data could be lower than the actual . In that case, why is always non-negative? Further investigation is due. We suspect that the overestimation is actually related to a bias in the way the algorithm selects the sub-regions while trying to maximize . To support this suspicion, in Appendix B.1 we compare TIcE against a baseline variation of the algorithm in which all splits are chosen randomly and no correction is applied. We find no statistical difference between TIcE and such a baseline.
To better understand the second bias, we describe next how TIcE splits a node. To reduce computational cost, a maximum number of splits is set, and regions of the feature space are split in order of a priority-queue, so that more promising regions are split first. When TIcE splits a region, the derived sub-regions are added to the priority-queue. However, the algorithm only adds sub-regions created by dividing the feature space along a single feature. More importantly, and contrary to typical tree-induction algorithms for classification, the criterion to choose that feature is based solely on the single most promising sub-region derived from a split, even if every other resulting sub-region is unpromising. We found this bias to be severe and, to back this claim, in Section 6 we compare TIcE against a proposed extension, Exhaustive TIcE (ExTIcE), described in the next section.
3.6 Exhaustive TIcE (ExTIcE)
In this section, we propose Exhaustive TIcE (ExTIcE), an extension of TIcE that aims to reduce its search bias.
ExTIcE’s main distinction from TIcE is that the former adds to the priority-queue all sub-regions created by every feature, while the latter splits a region along only one feature. Despite the name, ExTIcE is not truly exhaustive: it still sets a hard limit on how many splits can be performed, after which the algorithm is interrupted. We note that the limit we apply in this paper is the same one applied in TIcE. However, as TIcE always splits the data using only one feature, sub-regions do not share data points and TIcE usually runs out of data before the limit is reached. Conversely, in ExTIcE, the same data point can appear in several sub-regions. Additionally, many more sub-regions are added to ExTIcE’s priority-queue, even though they will never be split further. For these reasons, ExTIcE is considerably slower than TIcE.
We also note that, although ExTIcE is our attempt to reduce the search bias in TIcE by not ignoring regions of the feature space, the algorithm is still biased since its search mechanism tries to maximize the local estimations of . For that reason, ExTIcE also applies the correction factor .
Finally, as is the case with all other PUPE approaches described so far, ExTIcE does not create a reusable model. In the next section, we describe our other proposals, which were originally presented as One-class Quantification methods and are able to induce reusable models.
4 Methods for One-class Quantification
In this section we introduce two methods for one-class quantification: Passive-Aggressive Threshold (PAT) and One Distribution Inside (ODIn). The main difference from PUPE techniques is that the following methods are inductive, that is, they generate a model that can be reused for multiple test samples.
Both proposals are based on distributions of scores. We emphasize that, as we do not have training data for the negative class, such proposals rely on one-class scorers (OCS). An OCS is a model learned only from positive observations, which outputs, for a previously unseen observation, a numerical value that correlates with the probability of said observation belonging to the positive class. Examples of suitable OCS are One-class SVM (khan2009survey; noumir2012simple), Local Outlier Factor (breunig2000lof), and Isolation Forests (liu2008isolation). In our proposal, we also use the Mahalanobis Distance (mahalanobis1936generalized) as a simple OCS. In this case, the score is the Mahalanobis distance between each observation and the positive data. In all the aforementioned algorithms, the score must be either inverted or multiplied by minus one, since it is originally negatively correlated with the probability of belonging to the positive class.
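As an illustration, a Mahalanobis-based one-class scorer can be sketched as follows. This is our own minimal implementation; the small ridge added to the covariance is a numerical convenience, not part of the original definition:

```python
import numpy as np

def fit_mahalanobis_scorer(pos_X):
    """Fit a one-class scorer on positive observations only: the negated
    Mahalanobis distance to the positive sample. Higher scores mean
    "more likely positive", matching the OCS convention in the text."""
    mean = pos_X.mean(axis=0)
    # A small ridge keeps the covariance invertible on near-singular data.
    cov = np.cov(pos_X, rowvar=False) + 1e-6 * np.eye(pos_X.shape[1])
    inv_cov = np.linalg.inv(cov)

    def score(X):
        diff = X - mean
        d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
        # Negate: raw distances correlate negatively with positivity.
        return -np.sqrt(d2)

    return score

rng = np.random.default_rng(1)
positives = rng.normal(0.0, 1.0, (500, 2))
scorer = fit_mahalanobis_scorer(positives)
# Points near the positive cloud score higher than far-away points.
near, far = scorer(positives), scorer(positives + 10.0)
```

The closure returned by `fit_mahalanobis_scorer` is the reusable model: it can score any number of future test samples without retraining.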
4.1 Passive-Aggressive Threshold
Our first OCQ proposal, introduced in our previous work (denisOCQ2018), Passive-Aggressive Threshold ACC (PAT-ACC or PAT, for short), draws inspiration from Adjusted Classify and Count and from the Conservative Average Quantifier (forman2006quantifying). As discussed in Section 2.3, ACC depends on accurate estimates of TPR and FPR. However, in many applications we cannot reliably measure either TPR or FPR. This is particularly true for tasks derived from One-class Quantification, since the distribution of scores of negative observations varies from sample to sample.
To build intuition for PAT, observe that the influence of the negative distribution on ACC stems from the fact that the most suitable threshold for classification usually cuts through the density function of the negative scores, leaving negative scores on both sides of the threshold, as seen in Figure 1. Although the number of negative observations on the right-hand side of the threshold is expected to be significantly smaller than on the left-hand side, it is still unpredictable whenever the distribution of negative scores changes.
In PAT, we deliberately choose a very conservative classification threshold that tries to minimize the . In other words, we select a threshold for which we expect very few negative observations to be placed on its right-hand side, as illustrated in Figure 7. With such a conservative threshold, we naively assume that there are no false positive observations. Finally, we extrapolate the total number of expected false negative observations from the number of true positive observations.
More formally, we set the threshold according to a quantile for the one-class scores of positive observations in a training set. For example, if , then the threshold is set so that of the training (positive) observations are scored below such a threshold, while are scored above it. Given , we estimate and assume .
After the threshold is set, we perform ACC as usual: we classify all observations in the test sample of size according to this conservative threshold, count the number of positive instances , estimate the positive proportion , and readjust it as follows:
In PAT, is an important parameter. Ideally, it should be set as high as possible, so that we can be more confident about the assumption of , even for non-stationary negative distributions. How high it can be set depends on the test sample size, since a higher implies more intense extrapolation from fewer observations. In previous work (denisOCQ2018), we showed PAT’s performance to be similar to CC’s when approaches 0, as the extrapolation is reduced. We also showed that, although important, is not a sensitive parameter: a broad range of possible values leads to similar quantification errors.
Figure 8: PAT’s quantification error for varying values of in two datasets. The shaded area corresponds to two standard deviations. Source: denisOCQ2018.
Previous work showed the stability of the results for varying (denisOCQ2018), as illustrated in Figure 8. For that reason, instead of picking a single value to be used in our experiments, we adopted a strategy similar to Median Sweep (forman2006quantifying): we apply PAT with varying from to in increments of and take the median of the estimates.
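A minimal sketch of PAT with this median-sweep strategy follows. The code is our own illustration (function and variable names are hypothetical) and assumes scores where higher means more likely positive:

```python
import numpy as np

def pat_median_sweep(pos_train_scores, test_scores, quantiles=None):
    """Sketch of PAT with a median sweep: for each quantile q, set a
    conservative threshold on the positive training scores, assume no
    false positives, extrapolate, and take the median estimate."""
    if quantiles is None:
        quantiles = np.arange(0.25, 0.76, 0.01)  # 25% to 75% in steps of 1 point
    estimates = []
    for q in quantiles:
        threshold = np.quantile(pos_train_scores, q)
        observed = np.mean(test_scores > threshold)  # fraction classified positive
        # Expected TPR at this threshold is (1 - q); rescale as in ACC
        # under the assumption FPR = 0.
        estimates.append(min(1.0, observed / (1.0 - q)))
    return float(np.median(estimates))

rng = np.random.default_rng(7)
pos_train = rng.normal(1.0, 1.0, 2000)               # one-class scores of positives
test = np.concatenate([rng.normal(1.0, 1.0, 900),    # 30% positives
                       rng.normal(-4.0, 1.0, 2100)]) # 70% negatives, low scores
p_hat = pat_median_sweep(pos_train, test)            # ≈ 0.3
```

Because the negatives in this toy sample score far below every threshold in the sweep, the no-false-positive assumption holds and the rescaled counts recover the true proportion.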
Regarding the time complexity of PAT, since we can reuse the scorer model multiple times, we split the analysis into two stages: training and test.
For the training stage, consider to be the time complexity of training a scorer with observations, and the time complexity of scoring one single observation. Suppose that we apply -fold cross validation to obtain the positive scores, with which we model the density function to identify the thresholds associated with different values of . In this case, the complexity to train PAT is the complexity to obtain the scores and sort them in order to identify the thresholds (with binary search):
For the test stage, we can take different approaches depending on whether we use multiple thresholds or only one. If we use only one threshold, then, after scoring all test observations, we can linearly count how many fall below the threshold, totalling a time complexity of . However, if we use multiple thresholds, we can sort the scores and iterate over a pre-sorted list of thresholds, counting how many observations fall below each threshold with binary search. In this case, the time complexity is .
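The multiple-threshold counting step can be performed with `numpy.searchsorted`, which runs one binary search per threshold after the initial sort:

```python
import numpy as np

scores = np.sort(np.array([0.9, 0.1, 0.5, 0.3, 0.7]))  # sorted test scores
thresholds = np.array([0.2, 0.6, 0.8])                  # pre-sorted thresholds
# One binary search per threshold after the O(n log n) sort.
counts_below = np.searchsorted(scores, thresholds, side='right')
print(counts_below)  # → [1 3 4]
```
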
4.2 One Distribution Inside (ODIn)
Our second proposal introduced in our previous work (denisOCQ2018), One Distribution Inside (ODIn), is a Mixture Model (MM) that shares a similar idea with AlphaMax. The main difference between the two algorithms is that ODIn works with univariate distributions (one-class scores), whereas AlphaMax works with multivariate distributions (allowing it to directly work on the feature space).
ODIn searches for the maximum possible scale factor , , for the known distribution of scores from positive training observations, such that it fits inside the distribution of scores from test observations with an overflow no greater than a specified limit. The overflow is the area between the scaled positive distribution curve and the test distribution curve where the former is higher than the latter, as illustrated in Figure 9.
We represent the distributions as normalized histograms with unit area and bins, split by ordered thresholds. The first and last bins are open-ended. This means that all scores lower than the first division fall into the first bin, and all scores higher than the last division fall into the last bin. In our experiments, we set the thresholds between bins, i.e., the score values that separate the bins, as percentiles obtained from the positive training observations. The first and last thresholds are set as the estimates for, respectively, the 0th and 100th percentiles of the scores. The remaining thresholds are set at every percentile, , . For instance, if , the thresholds are at the percentiles . Although, score-wise, the bins do not share the same width, they are expected to be equally filled by observations from the positive distribution. The exceptions are the first and last bins, which are expected to have values close to zero. Figure 10 illustrates this process.
Figure 10: Thresholds for the histogram bins are not uniformly distributed across the scores (a), and yet each bin is filled with the same proportion of data points (b). Source: denisOCQ2018.
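Building such a percentile-binned histogram can be sketched as follows. This is our own illustrative code; the exact treatment of ties and of scores that coincide with the extreme thresholds may differ from the original implementation:

```python
import numpy as np

def percentile_bin_edges(pos_scores, n_bins):
    """Thresholds at evenly spaced percentiles of the positive training
    scores; with n_bins bins there are n_bins - 1 thresholds, and the
    first and last bins are open-ended."""
    return np.percentile(pos_scores, np.linspace(0, 100, n_bins - 1))

def fill_histogram(scores, edges):
    """Normalized histogram: scores below the first threshold fall into
    the open-ended first bin, scores above the last into the last bin."""
    bins = np.searchsorted(edges, scores, side='right')
    counts = np.bincount(bins, minlength=len(edges) + 1)
    return counts / counts.sum()

rng = np.random.default_rng(3)
pos_scores = rng.uniform(0.0, 1.0, 1000)
edges = percentile_bin_edges(pos_scores, n_bins=12)
hist = fill_histogram(pos_scores, edges)
# Interior bins hold roughly equal mass; the open-ended bins are nearly empty.
```

Filling the histogram with the very scores that defined the edges, as above, reproduces the expected shape: roughly uniform interior bins and near-empty open-ended bins.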
The overflow generated by a histogram , at a scale factor , inside a histogram , where both histograms are normalized so that , is formally defined as follows:
Given an overflow limit , which is a parameter, the histogram with scores for positive training observations, and a histogram with scores for the unlabeled test sample , ODIn estimates the proportion of positive observations in as:
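A minimal sketch of the overflow measure and of the search for the largest admissible scale factor follows. This is our own code; since the overflow is monotone non-decreasing in the scale factor, a binary search suffices:

```python
import numpy as np

def overflow(P, U, scale):
    """Area where the scaled positive histogram exceeds the unlabeled
    histogram (both normalized to unit sum)."""
    return np.maximum(scale * P - U, 0.0).sum()

def odin_estimate(P, U, limit, precision=1e-4):
    """Largest scale factor in [0, 1] whose overflow stays within the
    limit, found by binary search (overflow is monotone in the scale)."""
    lo, hi = 0.0, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2.0
        if overflow(P, U, mid) <= limit:
            lo = mid
        else:
            hi = mid
    return lo

P = np.full(10, 0.1)                  # positive-score histogram
Q = np.array([0.5, 0.5] + [0.0] * 8)  # some unknown negative distribution
U = 0.4 * P + 0.6 * Q                 # unlabeled sample: 40% positives
print(round(odin_estimate(P, U, limit=1e-6), 2))  # → 0.4
```

In this toy mixture, the positive histogram fits inside the unlabeled one up to scale 0.4; beyond that, the bins where the negatives contribute nothing start to overflow, so the search stops at the true positive proportion.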
Choosing , although a non-trivial task, is not devoid of useful insights. Histograms with too many bins are negatively affected by two aspects. First, if the sample size is not large enough, histograms with too many bins can become too sparse, each bin can have too low a value, and, ultimately, the can face the curse of dimensionality. Second, a large number of bins carries the implicit assumption of high precision for the scores. On the other hand, if the number of bins is too small, we may be unable to differentiate distributions. We point the interested reader to the work of maletzke2019dys for a more in-depth discussion on the effects of the number of bins in a histogram for quantification.
Although is a parameter, it can be automatically defined using only positive observations. To this end, we estimate the mean and standard deviation of for pairs of histograms derived from samples with only positive observations, at scale factor , and set , where is a parameter. Although we are actively replacing one parameter with another, has clearer semantics and its value is domain-independent: it is the number of standard deviations of the expected average overflow.
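This procedure can be sketched as follows. The code is our own illustration: the number of pairs, the random half-split scheme, the equal-width bins, and the value of the standard-deviation multiplier are illustrative choices, not the paper's prescribed settings:

```python
import numpy as np

def auto_limit(pos_scores, edges, n_pairs=100, z=3.0, seed=0):
    """Estimate the overflow limit from positive data alone: split the
    positive scores into random halves, measure the overflow between the
    two resulting histograms at scale factor 1, and set mean + z * std."""
    rng = np.random.default_rng(seed)
    half = len(pos_scores) // 2
    overflows = []
    for _ in range(n_pairs):
        perm = rng.permutation(len(pos_scores))
        h1, _ = np.histogram(pos_scores[perm[:half]], bins=edges)
        h2, _ = np.histogram(pos_scores[perm[half:]], bins=edges)
        h1 = h1 / h1.sum()
        h2 = h2 / h2.sum()
        # Overflow of one half's histogram inside the other's, at scale 1.
        overflows.append(np.maximum(h1 - h2, 0.0).sum())
    return float(np.mean(overflows) + z * np.std(overflows))

rng = np.random.default_rng(5)
pos_scores = rng.normal(0.0, 1.0, 2000)
edges = np.linspace(-3.0, 3.0, 11)  # 10 equal-width bins, for simplicity
limit = auto_limit(pos_scores, edges)
```

Since both halves come from the same (positive) distribution, the measured overflows capture only sampling noise, which is exactly what the limit should tolerate.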
Similarly to PAT, the time complexity of ODIn should be split into training and test stages. For training, one has to produce the scores using -fold cross validation and create the histogram. This implies sorting the scores to find out the percentiles that split the bins. Therefore, the time complexity of training is:
For the test stage, one has to score all unlabeled observations and fill in the histogram, accordingly. Finally, we note that can be found through Binary Search, with a time complexity of , where is the expected precision. Therefore, the time complexity of the test stage is:
5 Experimental Setup
In this section, we explain the experimental setup and datasets used in our empirical evaluation. In the general setting, for each dataset, we varied the true positive ratio, i.e., the proportion of the positive class in the unlabeled (test) sample, from 0% to 100% with increments of 10%. For a given positive ratio, we performed 5-fold cross validation to generate candidate observations for training and for test. The (labeled) training and (unlabeled) test samples are drawn, without replacement, from the training and test candidate sets, respectively. The training sample only includes positive observations, and the test sample obeys the positive ratio previously set.
We note that, as is the case with the experiments of bekker2018estimating, and contrary to typical experimental settings, the smaller fold of the data (one fifth) is used for training, while the larger (four fifths) is used for testing. In that case, one single test observation may appear across multiple test samples. However, it will not appear more than once in a sample. This usage of the 5-fold cross validation is employed due to the amount of data required to create test samples with varying proportions of positive observations and negative sub-classes. Due to the slowness of some of the algorithms tested, training data size was limited to 500 observations and test data size was limited to 2,000 observations. Final results are reported as Mean Absolute Error (MAE), which is the average of the absolute difference between the predicted positive ratio and the true positive ratio.
Finally, we draw attention to the fact that if we employ a quantifier that always predicts 0.5, and the actual positive ratio is uniformly distributed within the range (as in our experiments), then the MAE obtained over a large enough number of test samples converges to 0.25. This fact indicates that the maximum error we should consider acceptable in our setting is 0.25.
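This baseline is easy to verify numerically. The short simulation below (assuming the constant prediction is 0.5, the constant that minimizes MAE under a uniform true ratio) converges to an MAE of 0.25:

```python
import numpy as np

# Monte Carlo check: a quantifier that always predicts the constant 0.5,
# evaluated against true positive ratios drawn uniformly from [0, 1].
rng = np.random.default_rng(42)
true_ratios = rng.uniform(0.0, 1.0, 1_000_000)
mae = np.abs(0.5 - true_ratios).mean()  # ≈ 0.25, the baseline MAE
```

Analytically, the same value follows from the integral of |0.5 - p| over p uniform in [0, 1], which equals 1/4.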
All code is available on our supplemental material website (supmatweb). Next, we describe the particularities of each experiment.
5.1 Experiment #1
In this experiment, the existence of negative sub-classes is disregarded. The size of the test sample is the minimum between the number of available positive and negative candidates, limited to 2,000. The number of repetitions is five. This experiment is designed to be easy to reproduce and compare, although it supports only a superficial analysis of performance. Our objective with this experiment is to provide a setup similar to those in the current literature and a general analysis of quantification performance.
5.2 Experiment #2
In this experiment, the existence of negative sub-classes is acknowledged. For each test sample, the proportions of the negative sub-classes are randomized and the sample is drawn accordingly. The size of the test sample is the largest that makes the previously set proportions viable, limited to a maximum of 500 observations. To obtain greater variability in the test samples, given the random proportions of sub-classes, the number of repetitions is 30. With this setting, we aim to produce experimental results that better suit our assumption that the negative class varies from sample to sample. Experiment #2 is the same as the one we employed in our previous work to measure the performance of PAT and ODIn (denisOCQ2018). Next, we describe a relevant limitation of this setting and how we overcome it.
5.3 Experiment #3
The uniform randomization of the proportions of negative sub-classes in Experiment #2 has an adverse effect. While the MAE for each individual sub-class proportion is informative of the expected performance for said proportion, the experimental MAE obtained by averaging over all variations of sub-class proportions is bound to converge to the same MAE that would be obtained with balanced test samples, that is, test samples in which every sub-class has the same number of observations.
However, in real-world applications, we do not assume that all classes appear with the same proportion. On the contrary, we assume that the proportions of the sub-classes vary and are unknown beforehand. To better evaluate the methods in this situation, we propose Experiment #3. In this experiment, we map the original dataset onto several datasets, one for each negative sub-class, containing the data points of a single negative sub-class and all positive data points. Each dataset is evaluated individually. The size of the test samples is the minimum between the number of available positive and negative candidates, limited to 2,000. Finally, we evaluate:
- Experiment #3-a – Median
half of the negative classes produced an MAE lower than or equal to the one reported in this experiment;
- Experiment #3-b – 75-percentile
three quarters of the negative classes produced an MAE lower than or equal to the one reported in this experiment;
- Experiment #3-c – Worst case
the result obtained by the single negative class that produced the greatest MAE.
5.4 Experiment #4
The aim of this experiment is to compare execution time of different algorithms.
Due to the slowness of some of the algorithms evaluated, the previous experiments were executed in parallel on a variety of hardware across multiple days. To measure the time consumed by each algorithm in a comparable manner, we performed a diminished version of Experiment #1, executed on a single machine. The differences are: the 5-fold cross validation was interrupted after the evaluation of the first fold, and the experimental setup was evaluated only once instead of five times.
We highlight that the time necessary to quantify each test sample was measured independently and summed at the end, to avoid measuring time spent preparing the samples.
We evaluated the performance of seven algorithms: EN, PE, KM1, KM2, TIcE, ExTIcE, and PAT. All methods were merged into a unified test framework, publicly available as supplemental material (supmatweb).
PAT and ODIn were preliminarily compared in the same setting proposed in Experiment #1, with both methods adopting Mahalanobis distance. PAT was consistently superior to ODIn. As both methods are based on the same rationale of learning the distribution of one-class scores, we kept only PAT in our evaluation against the PUPE techniques, considering it as a representative of such a general approach. The comparison between PAT and ODIn can be found in Appendix B.2.
Given the algorithm’s simplicity, we used our own implementation of EN. elkan2008learning employed a Support Vector Machine calibrated with Platt scaling as the base classifier for EN. In this article, we adopted the same method to keep compatibility between experiments.
EN relies on SVM. We used scikit-learn’s implementation (scikit-learn) with all parameters set to default, except for gamma. Gamma is an important parameter that is usually set to either “auto” or “scale”, and it caused severe differences in the results for some datasets. For this reason, we report results for both settings: ENa refers to the setting where gamma is set to “auto”, and ENs refers to the setting where gamma is set to “scale”.
The code of PE used in our experiments was a direct translation to Python 3 of the original Matlab code provided by du2014class (http://www.mcduplessis.com/index.php/).
Code for pen-L1 is not available on the author’s website. However, comparisons are possible through transitivity and the analysis of previous work (bekker2018estimating). In other words, we assume that if algorithm A performs better than B in our experiments, and B performs better than C in the existing literature, then A performs better than C.
We reached out to AlphaMax’s authors, who attentively provided us with code and instructions to use AlphaMax in our experiments. Unfortunately, fair use of the provided program would require several manual interventions. Given the volume of experiments in our setup, making such interventions would be unfeasible and unfair to the other contenders. Alternatively, results for AlphaMax are available in previous work (ramaswamy2016mixture; bekker2018estimating), so it is possible to draw some conclusions by assuming transitivity.
For KM1 and KM2, we used code provided by their original authors (ramaswamy2016mixture) (http://web.eecs.umich.edu/~cscott/code.html). The code for KM1 and KM2 is a single script that produces results for both variants, since they share a significant part of the computation required to evaluate a sample. For this reason, in Experiment #4, the time spent by both algorithms is aggregated into a single column, KM.
Although we previously tested PAT with different scorers (denisOCQ2018), in our analysis we keep only the results for PAT with the Mahalanobis distance (PATM). We chose PATM as the representative of PAT in our comparisons against PU techniques because the Mahalanobis Distance is the simplest scorer among the ones cited in this work and does not require any parameter, and because having a single version of PAT simplifies our analysis. Another important difference from our previous usage of PAT is that, here, we vary the parameter from 25% to 75% with increments of one percentage point and report the median of all predictions, instead of fixing the parameter to a single value.
As PATM is the only algorithm tested that produces a model that can be reused for several test samples, in Experiment #4 we additionally report the time PATM spends only quantifying the data, disregarding the time spent training.
The implementation of TIcE provided by bekker2018estimating
only supports categorical features after binarization. Furthermore, numerical features should be in the range . Yet, when a numerical feature is selected to split a node, only four sub-regions are created, for the ranges , , and . Since this implementation handles numerical data too simplistically and we only use numerical datasets, we developed our own implementation of TIcE. For each split, we divide the region into two sub-regions with roughly the same number of observations: one with all observations that are below or equal to the median of the splitting feature, and the other with the remaining observations. We note that we sort the data to compute the median for each attribute that is evaluated as a split candidate, and we allow a feature to be used more than once. The sorting could be avoided by keeping simultaneous pre-sorted arrays with references to the observations.
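The median split used by our implementation can be sketched as follows (illustrative code; `median_split` is a hypothetical helper name):

```python
import numpy as np

def median_split(X, mask, feature):
    """Split the region selected by `mask` into two sub-regions of
    roughly equal size at the median of one feature."""
    median = np.median(X[mask, feature])
    left = mask & (X[:, feature] <= median)
    right = mask & (X[:, feature] > median)
    return left, right

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
root = np.ones(len(X), dtype=bool)
left, right = median_split(X, root, feature=0)
print(left.sum(), right.sum())  # → 3 3
```

Representing regions as boolean masks over the full dataset keeps a node's sub-regions disjoint while avoiding data copies at every split.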
Throughout experiments #1 to #3, we additionally compare all algorithms against a hypothetical classify and count approach that uses the Mahalanobis distance as scorer and the best possible threshold for classification. We call such an algorithm Best Fixed Threshold with Mahalanobis (BFTM). To choose the threshold, we evaluate several thresholds based on the percentiles of the positive training data (from 0 to 100, with increments of 1). For each dataset, we evaluate which threshold generated the lowest MAE on the test samples and report such a result. We emphasize that, regardless of the average performance obtained by BFTM, it still is affected by the systemic error explained in Section 3.
In our experiments, we used 10 datasets. Nine are directly derived from real data, and one is generated by a Bayesian network. To maintain consistency, for each dataset, the positive class is the same as in our previous work (denisOCQ2018), where the classes were chosen arbitrarily. These datasets were chosen because all are publicly available and have a large enough number of observations per class to support the data-hungry experimental setup described in Section 5. Each dataset is detailed below:
- Insects v2
sensor data regarding the flight of classes of insects. A class of insect is determined by sex and species. The observations are described by features extracted from a time series obtained from a single sensor. No environmental feature is included. All data was collected within a temperature range from (inclusive) to (inclusive) degrees Celsius. The number of observations per class was limited to observations (a cap reached by seven classes). The class with the fewest observations has . The total number of records is , and the positive class is female Aedes aegypti, with observations;
contains information about the flight of species of insects. As some are discriminated further by sex, the dataset has classes. The positive class is female Aedes aegypti. The data has records represented by features. We find this dataset to be heavily biased regarding the environmental feature temperature. This dataset was kept in our evaluations only to maintain consistency with our previous work (denisOCQ2018);
- Arabic Digit
contains entries described by features for the human speech of Arabic digits. There are classes, and the target class is the digit . This version sets a fixed number of features for every record (hammami2010improved; Lichman:2013);
- BNG (Japanese Vowels)
Bayesian network generated benchmark dataset with speech data regarding Japanese Vowels. There are entries, represented by features, for speakers. The speaker is the class of interest (OpenML2013);
- Anuran Calls (MFCCs)
contains features to represent the sound produced by different species of Anurans (frogs). As the data size is restricted, we only considered the two biggest families of frogs as the classes of the data, ending up with entries. The positive class is the Hylidae family, and the negative class is the Leptodactylidae family (diaz2012compressive; Lichman:2013);
- Handwritten
contains features that represent the handwritten lowercase letters q, p and g. The data has entries and the chosen positive class is the letter q (dmr2018unsupervised);
describes the appearance of the uppercase letters of the alphabet on a black and white display with features. It contains entries and the class of interest is the letter W (frey1991letter; OpenML2013);
- Pen-Based Recognition of Handwritten Digits
handwritten digits represented by features. The digit is the target class. There are entries (alimoglu1996combining; Lichman:2013);
Pulsar candidates collected during the HTRU survey, where pulsars are a type of star. It contains two classes, Pulsar (positive) and not-Pulsar (negative), across entries described by features (lyon2016fifty; Lichman:2013);
- Wine Quality
contains features that describe two types of wine (white and red). The quality information was disregarded, and the target class is red wine. The dataset contains entries (cortez2009modeling; Lichman:2013).
KM1 and KM2 presented a runtime error while processing dataset H (Handwritten). For that reason, the performance of these algorithms is not present in any table for this dataset.
6 Experimental Evaluation
In this section, we present and analyze the results obtained with the experiments explained in the previous section. For all experiments, we present the average rank and, for completeness, a critical difference plot for the Nemenyi test with . This test is intended as a simple way of comparing all algorithms at once. However, we note the limitations of this test, as it only takes the ranks into account and is conservative given the amount of data we have. In some cases, the differences between some results are glaring, even differing by orders of magnitude, and the test fails to recognize the superiority of some approaches. We make particular observations for such cases and perform pair-wise comparisons via the Wilcoxon signed-rank test, when relevant.
Table 2 summarizes our results for Experiment #1, and Figure 11 shows the corresponding critical difference plot. PATM, our proposal, outperformed all PU approaches on 9 out of 10 datasets. It was outperformed (within one standard deviation) by KM1 and ExTIcE only on the Insects v2 dataset, on which PATM ranked third. We observe that, as expected, PATM outperformed BFTM in most cases. Although BFTM is overly optimistic, since its threshold is chosen based on the final results, it still suffers from CC’s systematic error, explained in Section 2.
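To make CC’s systematic error concrete, consider a sketch under our own illustrative assumptions (the TPR and FPR values below are hypothetical): with a fixed classifier, the expected Classify and Count output is a linear function of the true prevalence, so it matches the true value only at a single fixed point and is biased everywhere else.

```python
# Illustrative sketch of Classify and Count's systematic error.
# TPR and FPR are hypothetical fixed classifier rates, not values from the paper.
TPR, FPR = 0.85, 0.10

def cc_estimate(p):
    """Expected Classify-and-Count output when the true positive prevalence is p."""
    return TPR * p + FPR * (1.0 - p)

for p in (0.1, 0.5, 0.9):
    print(f"true p = {p:.1f} -> expected CC estimate = {cc_estimate(p):.3f}")
```

The estimate is pulled toward the fixed point FPR / (1 - TPR + FPR): low prevalences are overestimated and high prevalences underestimated, regardless of how good the classifier is.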
Also as expected, ExTIcE outperformed TIcE on every dataset, since ExTIcE removes a search constraint from TIcE. More noteworthy, however, is the fact that ExTIcE performed better than all PU Learning approaches on all datasets, although a direct comparison against BFTM is inconclusive (the Wilcoxon Rank-Sum test yields a p-value of one).
Regarding Figure 11, we note that although PATM did not differ significantly from ExTIcE, the test only evaluates the average rank of the algorithms. Directly comparing PATM against ExTIcE with the Wilcoxon Rank-Sum test results in a p-value of .
The results from Experiment #2 were, unsurprisingly, very similar to those from Experiment #1. This is because the majority of the datasets used in our experiments are already fairly balanced (regarding the negative sub-classes). For this reason, we do not analyze these results further. They are displayed in Appendix B.3.
Table 3 and Figure 12 present the results for Experiment #3-a. As a recap, the values in the table indicate that, for half of the classes, the MAE obtained is lower than or equal to the value shown. While the rankings are mostly unchanged from Experiments #1 and #2, we observe that for some datasets, especially N, I, and B, the MAE obtained by both ExTIcE and PATM is below half of that obtained in the previous experiments. This evidences a great disparity in how separable the different sub-classes are from the positive class. We can therefore expect PAT’s low errors in Experiment #3-a to be compensated by larger errors as we investigate more difficult sub-classes in Experiments #3-b and #3-c.
Table 4 and Figure 13(a) present the results for Experiment #3-b, and Table 5 and Figure 13(b), for Experiment #3-c. Whereas for the 75-percentile (Experiment #3-b) PATM still maintains a significantly lower MAE than ExTIcE in a pairwise comparison (p-value of according to the Wilcoxon Rank-Sum test), the opposite takes place for the 100-percentile (Experiment #3-c). In fact, due to the poor performance of PATM on datasets N, I, and B, its average rank was despite the fact that the algorithm ranked first on all other datasets. Finally, although the performance of ExTIcE on these same datasets decreased in comparison with the previous experiments, it still outperformed all other approaches. On the remaining datasets, ExTIcE ranked second, only behind PATM.
Particularly for dataset N, observe in Table 5 that the average error obtained by PATM is close to 50%. As the actual positive ratio varied uniformly within the interval during the experiment, such an error indicates that PATM always predicted as either close to zero or close to one. Considering our previous results for PATM on this same dataset, we can infer that the current situation corresponds to the latter case, since the algorithm could previously detect situations where the positive class was not prominent (the error was below the 25% baseline), and the learning process involved only the positive class. In fact, further analysis of our more detailed data (available as supplemental material (supmatweb)) reveals that the average prediction of was , which indicates that observations from the negative class obtained score values at least as large as those of positive observations. From this, we can assume that observations belonging to this negative class are highly similar to at least part of the positive data, a fact that also affected the best Classify and Count approach, BFTM. In the next section, we discuss how and why this scenario affected PATM to a considerably greater degree than ExTIcE. Before that, we present our final analysis, regarding time consumption.
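The error levels discussed above can be verified numerically. Assuming, for illustration, that the true positive ratio is drawn uniformly from [0, 1], a degenerate quantifier that always predicts 1 has an expected absolute error of 0.5, while the trivial midpoint predictor achieves the 25% baseline. A quick stdlib-only check:

```python
import random

random.seed(0)
true_p = [random.random() for _ in range(100_000)]  # positive ratio ~ U(0, 1)

# MAE of two degenerate predictors under a uniform prevalence prior.
mae_always_one = sum(abs(p - 1.0) for p in true_p) / len(true_p)   # approx. 0.50
mae_always_half = sum(abs(p - 0.5) for p in true_p) / len(true_p)  # approx. 0.25

print(round(mae_always_one, 2), round(mae_always_half, 2))
```

This is why an MAE near 50% is diagnostic of a quantifier stuck at one extreme, while an MAE near 25% is no better than always guessing 0.5.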
[Table 6 layout: one row per dataset (Data), with columns ENa, ENs, PE, KM, TIcE, ExTIcE, PATM, and PATM w/o T.]
Table 6 presents the total time, in seconds, required to perform all tasks in Experiment #4. We can see that PE was several orders of magnitude slower than the other approaches. KM and, predictably, ExTIcE were both orders of magnitude slower than TIcE, PATM, and EN. Although TIcE, EN, and PATM were generally within the same order of magnitude, PATM was consistently faster, even when the time necessary to train the scorer is taken into account.
Given the proposed experimental setup, we cannot conclusively claim that EN, TIcE, and PAT always have numerically similar execution times. We note that the training dataset was limited to 500 observations, and the test sample to 2,000. We believe that further experimentation would show both EN and TIcE becoming several orders of magnitude slower than PATM for larger samples, due to the time complexities of the SVM and of TIcE. Additionally, replacing PATM’s Mahalanobis distance with a different dissimilarity function would also impact its execution time.
Elkan’s method (EN) has historical value, as it put the spotlight on Positive and Unlabeled prior estimation, a problem similar to One-class Quantification. EN also introduced a theoretical basis for newer algorithms to improve upon. However, as the results of our data-driven experimentation showed, the method generally performed poorly.
In this context, we would not recommend EN as a first-choice method to address a quantification task. Nevertheless, given that EN is a classical method that can achieve one-class quantification, we argue that it should be used as a baseline when comparing other methods. Neither would we recommend PE, since our experiments showed no statistical evidence of a performance difference between PE and EN. In addition, PE was the slowest among all algorithms tested.
As explained in Section 5.5, BFT represents the best possible Classify and Count derived from a one-class scorer. However, as discussed in Section 2.3, due to the systematic error of CC, in practice BFT will tend to be outperformed by the other methods. Despite this, like EN, we argue that this method can be used as a baseline in the comparison of novel quantification algorithms.
ExTIcE fulfilled its role of showing the potential of TIcE’s underlying search problem. Indeed, the former consistently provided smaller absolute quantification errors than the latter. Nevertheless, our purpose is not to defend ExTIcE’s position and recommend it as a quantifier, but rather to entice the community to further explore, in future work, the region search problem proposed by TIcE. ExTIcE, while less restricted than TIcE, is still limited in a number of ways. For instance, as in TIcE and most other tree-based algorithms, the sub-regions explored only “cut” the feature space along its axes. Additionally, we believe it is possible to create an algorithm from the ideas of TIcE that, similarly to PAT, is capable of inducing a model that can later be used to quantify several test samples without resorting to the training data.
ramaswamy2016mixture argue that “requiring an accurate conditional probability estimate (which is a real valued function over the feature space) for estimating the mixture proportion (a single number) is too roundabout”. On the other hand, we contend that this approach is actually very practical, since there are already a number of methods for this exact purpose that are accessible even to inexperienced practitioners. This approach is also the basis of PAT, which is, in our opinion, notably simpler than KM, while generally providing smaller quantification errors at an unquestionably faster rate.
In our experiments, PAT produced the smallest quantification errors while being the fastest algorithm. For this reason, it is the algorithm we most recommend for practical use.
Notwithstanding the favorable results, we must highlight PAT’s drawbacks, which were evidenced by the evolution of Experiment #3. PAT was developed on the assumption that negative observations can be similar to positive observations only up to a certain degree. The algorithm (indirectly) tries to ignore the presence of negative observations close to the boundaries of the positive class in the feature space by extrapolating the total number of positive observations from only the top-scored ones.
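The extrapolation idea can be sketched as follows. This is a simplified illustration under our own assumptions (synthetic Gaussian scores, a single fixed quantile threshold), not the actual PAT implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical one-class scorer outputs: positives score high, negatives low.
train_pos_scores = rng.normal(2.0, 1.0, size=1000)   # labeled positives
test_scores = np.concatenate([
    rng.normal(2.0, 1.0, size=300),                   # hidden positives
    rng.normal(-1.0, 1.0, size=700),                  # hidden negatives
])

# Pick a threshold that, on the training positives, keeps only the top q fraction.
q = 0.25
t = np.quantile(train_pos_scores, 1.0 - q)

# Extrapolate: if a fraction q of positives exceeds t, the positive ratio is
# roughly (fraction of test observations above t) / q, assuming few negatives
# score above t.
p_hat = min(1.0, np.mean(test_scores > t) / q)
print(f"estimated positive ratio: {p_hat:.2f} (true: 0.30)")
```

By counting only well above the score boundary, the estimate stays insensitive to moderately scored negatives; the trade-off, discussed next, is what happens when negatives reach the top scores.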
However, consider the case where a negative sub-class is partially identical to the positive class, in the sense that a number of negative observations are, individually, identical to or indistinguishable from positive observations. In such a case, the quantification of PAT will likely be affected, since PAT does its computations solely on the observations’ scores. Naturally, the degree to which PAT will be affected depends on the proportion of the aforementioned sub-class within the negative class.
Meanwhile, ExTIcE may be less affected, or not affected at all, by such partially identical classes. Indeed, its search mechanism allows it to completely ignore regions of the feature space where such overlaps are more prevalent, provided there are other regions with less overlap. Figure 14 illustrates this discussion. Notice that ExTIcE would likely only consider the top-right quadrant of the feature space to infer , while PAT would use all scores, even though, in this scenario, negative observations are scored as highly as positive observations.
With these remarks on PAT and ExTIcE in place, we argue that, for practical reasons, such overlaps may indicate a need to revise: (a) whether the negative observations actually should or need to be classified as negative; and (b) the quality of the existing features.
In any case, we can try to minimize the effects on PAT of negative classes that are identical to positive observations by ensembling it with ExTIcE. In this particular scenario, PAT overestimates as a result of negative observations being considered positive. In addition, we noticed that ExTIcE tends to overestimate in general. The latter finding is not straightforward: since ExTIcE biasedly tries to maximize (the proportion of labeled data over all positive data), we would expect (the proportion of unlabeled positive data over all unlabeled data) to be underestimated. However, ExTIcE overestimated in 75% of its predictions in Experiment #3 (considering all classes). Such an overestimation can be justified by TIcE’s correction factor being too heavy. The occasional heavy overestimation of PAT, along with the general overestimation of ExTIcE, favors the approach of taking the minimum between the predictions provided by the two methods.
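The minimum ensemble itself is trivial to implement; a sketch (the numeric predictions below are hypothetical):

```python
def min_ensemble(p_pat: float, p_extice: float) -> float:
    """Combine two prevalence estimates by taking the minimum.

    Motivated by the observation that PAT occasionally overestimates heavily
    while ExTIcE overestimates in general, so the smaller of the two
    predictions tends to be closer to the truth.
    """
    return min(p_pat, p_extice)

# Hypothetical predictions for a sample whose true positive ratio is 0.30:
# PAT is fooled by an indistinguishable negative sub-class; ExTIcE is not.
print(min_ensemble(0.95, 0.38))
```

The cost of this combination is dominated by running ExTIcE, which, as noted below, is considerably slower than PAT.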
[Table 7 layout: results grouped under Experiment #3-a and Experiment #3-c.]
Table 7 presents the results of Experiments #3-a and #3-c for ExTIcE, PATM, and an ensemble that outputs the minimum prediction between the two methods. In Experiment #3-a (median), we can see that, for all but one dataset, the ensemble performs better than ExTIcE and worse than PATM. In the exceptional case of dataset A, the ensemble performed better than both other methods. We note that, in all but one dataset, the performance of the ensemble was numerically closer to that of PATM than to that of ExTIcE. On the other hand, in Experiment #3-c, the ensemble performed better than PATM on multiple datasets. Unlike PATM, the ensemble performed well on the problematic datasets N, I, and B. However, we emphasize that this ensemble imposes a high computational cost due to the use of ExTIcE. Our main purpose is to highlight that it is indeed possible to achieve performance similar to PAT’s while handling the particular case in which it cannot perform well. We expect other, faster methods to be developed in future work.
Finally, both PAT and ExTIcE strongly depend on the assumption that the distribution of the positive class is the same in the training and test samples. Given their strategies, we can safely presume that they would be severely affected if this assumption were false.
In this paper, we described several distinct approaches for the one-class quantification problem, most of which are derived from the area of research known as Positive and Unlabeled Learning.
We empirically showed the superiority of our proposal, Passive Aggressive Threshold (PAT), for one-class quantification problems in which the distribution of the negative class is unknown and overlap with the positive class occurs only up to a reasonable degree. However, we stress that PAT performs poorly in cases where a considerable portion of the negative class is indistinguishable from positive observations.
We also showed how the region search optimization problem behind Tree Induction for Estimation (TIcE) can solve one-class quantification tasks in which a portion of the negative observations is identical to, or indistinguishable from, positive observations. However, this approach still requires further development, as we demonstrated with ExTIcE, our version with lower quantification error.
For future work, we are interested in exploring better one-class scorers for PAT and in developing methods to solve the search problem proposed by TIcE. Regarding the latter objective, we aim to develop methods that can train solely on positive observations and later quantify several independent test samples, so as to qualify as One-class Quantification algorithms.
Acknowledgements. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code PROEX-6909543/D, the Fundação de Amparo a Pesquisa do Estado de São Paulo (FAPESP, grant 2017/22896-7), and the United States Agency for International Development (USAID, grant AID-OAA-F-16-00072).
Appendix A Analysis on TIcE’s time complexity
In this section, we thoroughly analyze the time complexity of TIcE. For this analysis, consider only binary nominal attributes and splits at the median for numerical attributes, so that the data is always split into two slices.
To evaluate the goodness of each attribute when splitting a node , it is necessary to count how many positive observations go to each side of the split. To this end, the code provided by bekker2018estimating uses a data structure called BitArray for this counting, which is instantiated and initialized for each possible split. Although BitArray is highly optimized, especially regarding memory usage, it still performs the counting in , where is the number of observations assessed by the splitting node. Additionally, any data structure that is below for exact counting would still require an initialization that is , since every observation must be processed to give the structure enough information for the counting. This is the same time complexity as a linear count using a standard array. The authors do not comment on alternatives.
We note that it is possible to use another data structure, such as binary decision trees, to obtain the count in . This data structure can be updated after the split for the attribute that caused it: for numerical attributes, this can be done with a Cartesian tree in , and for binary nominal attributes it is unnecessary, since the attribute will not be used again. However, for the remaining attributes, there is no way to quickly place each observation on the correct side of the split, since no relation between the splitting attribute and the other attributes is guaranteed. Therefore, the data structure for each attribute must be updated, resulting in for the split when using such a data structure, where is the number of attributes the node has access to. On the other hand, by ditching this data structure, the split is .
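As a minimal illustration of sub-linear counting for a numerical attribute (a sketch of the general idea, not the structure used by TIcE), the positives in a node can be kept sorted by attribute value, so that the number falling on either side of any candidate split is obtained by binary search:

```python
import bisect

# Hypothetical attribute values of the labeled-positive observations in a node,
# kept sorted once per node.
pos_values = sorted([0.2, 0.5, 0.7, 1.1, 1.4, 2.0, 2.3, 3.1])

def positives_left_of(split: float) -> int:
    """Count positives on the left side of a numerical split via binary search."""
    return bisect.bisect_left(pos_values, split)

# Splitting at 1.25 sends 4 of the 8 positives to the left child.
print(positives_left_of(1.25))  # -> 4
```

The caveat discussed above still applies: after the split is performed, the sorted array for every other attribute must be rebuilt for each child, which erodes the per-query savings.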
Considering that each attribute is used only once, that the data is always split in half, and that there is enough data, the maximum height is , i.e., the total number of attributes, and the complexity of the algorithm is , as shown in Equation 12, where is the total number of observations and is the recurrence relation of the algorithm.
If there is not enough data to use all attributes but, again, each attribute is used only once and each split divides the data in half, the complexity is , since the maximum height is and . Therefore, the general complexity of the algorithm is when each feature is used only once and the data is divided in half, which is significantly higher than the complexity stated by bekker2018estimating. If the attributes can be used more than once and/or the data is not evenly divided after each split, the complexity is even higher. This fact emphasizes the overly optimistic initial assessment of