Scalable Evaluation and Improvement of Document Set Expansion via Neural Positive-Unlabeled Learning

10/29/2019, by Alon Jacovi, et al.

We consider the situation in which a user has collected a small set of documents on a cohesive topic, and they want to retrieve additional documents on this topic from a large collection. Information Retrieval (IR) solutions treat the document set as a query, and look for similar documents in the collection. We propose to extend the IR approach by treating the problem as an instance of positive-unlabeled (PU) learning—i.e., learning binary classifiers from only positive and unlabeled data, where the positive data corresponds to the query documents, and the unlabeled data is the results returned by the IR engine. Utilizing PU learning for text with big neural networks is a largely unexplored field. We discuss various challenges in applying PU learning to the setting, including an unknown class prior, extremely imbalanced data and large-scale accurate evaluation of models, and we propose solutions and empirically validate them. We demonstrate the effectiveness of the method using a series of experiments of retrieving PubMed abstracts adhering to fine-grained topics. We demonstrate improvements over the base IR solution and other baselines. Implementation is available online.




1 Introduction

We are interested in the task of focused document set expansion, in which a user has identified a set of documents on a focused and cohesive topic, and they are interested in finding more documents about the same topic in a large collection. This problem is also known as a “More Like This” (MLT) query in web retrieval. A common way of modeling this problem is to consider the set of documents as a long query, with which Information Retrieval (IR) techniques can score and rank documents. IR literature on document similarity and ranking is vast [Faloutsos:1995:SIR:222929, Mitra:2000:IRD:593956.593986, inter alia]—beyond the scope of this work, and largely orthogonal to it, as will be explained later.

Current methods in document set expansion for very large collections are based on word-frequency or bag-of-words document similarity metrics such as Term Frequency-Inverse Document Frequency (TF-IDF) and Okapi BM25 and its variants [DBLP:journals/ftir/RobertsonZ09, bm25variant], which are considered strong due to their robustness to extreme class imbalance, corpus variance and variable-length inputs, as well as their scalability and efficiency [DBLP:journals/corr/MitraC17]. However, the performance of such solutions is limited, as the models cannot capture local or global relationships between words in documents.

In this work, we examine methods to improve document set expansion by leveraging non-linear models (such as neural networks) under the setting of imbalanced binary text classification. To this end, we look to positive-unlabeled (PU) learning [DBLP:conf/icml/PlessisNS15]: a binary classification setting where a classifier is trained based on only positive and unlabeled data. In the standard document expansion setting, we indeed only possess positive (the document set) and unlabeled (the large document collection) data.

PU learning was originally employed for text classification by DBLP:conf/icml/LiuLYL02, DBLP:conf/ecml/LiL05, DBLP:conf/ijcai/LiL03, using techniques such as EM and SVM. Since then, the setting has been well studied theoretically [DBLP:conf/kdd/ElkanN08, DBLP:conf/icml/PlessisNS15, DBLP:conf/nips/NiuPSMS16], and recently objective functions have been developed to facilitate training of flexible neural networks from PU data [DBLP:conf/nips/KiryoNPS17]. We discuss the PU learning setting in more detail in Section 2, and relevant work on PU learning for text in Section 8.

We are, however, not interested in replacing traditional (term-frequency-based) IR solutions, but rather in improving upon their results by further classifying the outputs of those models. There are two reasons for this approach: (1) traditional IR engines are based on word frequencies, and as a result, cannot capture features based on word order; (2) classification by neural networks does not scale well to "extreme" imbalance. (In practice, an IR task may involve positive documents on the order of hundreds or thousands, and negative documents on the order of dozens of millions. Literature dealing with imbalanced classification traditionally discusses typical ratios of 1:50 and 1:100 [DBLP:journals/corr/abs-1806-00194, DBLP:journals/corr/abs-1804-10851]. To our knowledge, the setting of extreme imbalance has not been discussed in the literature.)

Following these observations, we see traditional IR engines and neural models as complementary to each other. Our proposed solution is a two-step process: first, a BM25-based MLT IR engine retrieves relevant candidates; then, a non-linear PU learning model is trained on this subset of candidates. In this way, each method compensates for the weakness of the other.

As already discussed above, PU learning has recently become viable for deep neural network models. As a result, we are able to leverage it to train models that can capture higher-order features between words. However, the PU learning literature has focused on theoretical analysis and experiments on small models and simple—notably, class-balanced—benchmarks such as MNIST, CIFAR10 and 20News [kato2018learning, DBLP:journals/corr/abs-1810-00846, DBLP:journals/corr/abs-1901-10155]. PU learning has not been extensively tested on imbalanced datasets. Scaling PU solutions to high-dimensional, ambiguous and complex data is a significant challenge. One reason for this is that PU data is, by definition, difficult or sometimes impossible to fully label for exhaustive, large-scale evaluation.

For the purpose of document set expansion, and in particular for fine-grained topics, gathering fully-labeled data for an accurate benchmark is also a challenge. For this reason, we propose to simulate the scenario synthetically but realistically by using the PubMed collection of bio-medical academic papers. PubMed entries are manually assigned multiple terms from Medical Subject Headings (MeSH), a large ontology of medical terms and topics. We can treat a set of MeSH terms as defining a fine-grained topic, and use the MeSH labels for deriving fully-labeled tasks (see examples of MeSH topic conjunctions in Table 1). This results in an evaluation setup which is extensive (allowing for a large variety of different datasets based on different bio-medical topics), flexible (the ability to simulate different biases in the data gathering to account for many possible practical settings), and accurate (fully labeled data).

The contributions of this work are thus:

  1. We propose a method of generating tasks based on PubMed, using conjunctions of MeSH terms for labels—including a variant that allows for limited negative data—as a benchmark for evaluating PU learning for document set expansion, and for long-form text in general.

  2. We apply state-of-the-art PU solutions, which were previously only evaluated on very simple benchmarks, to the tasks and report challenges: no knowledge of the class prior, batch size restrictions, very imbalanced data (small class prior), and limited labeled data. We propose methods for dealing with the above challenges—some of which apply to imbalanced classification universally.

  3. We empirically evaluate the proposed PU solution against the standard approach (selecting the IR engine’s top-$k$ documents), noting that the PU method yields a significant improvement.

| #LP | #U | Topic | nnPU | IR (mean ±std) | Naive | All-positive | Upper-bound |
|-----|----|-------|------|----------------|-------|--------------|-------------|
| 20 | 20,000 | Animals + Brain + Rats | 48.97 | 32.25 ±11.63 | 1.49 | 44.6 | 68.17 |
| 20 | 20,000 | Adult + Middle Aged + HIV Infections | 42.38 | 26.75 ±7.22 | 6.88 | 30.98 | 55.61 |
| 20 | 20,000 | Lymphatic Metastasis + Middle Aged + Neoplasm Staging | 26.95 | 23.27 ±6.63 | 13.26 | 21.38 | 40.34 |
| 20 | 20,000 | Renal Dialysis + Kidney Failure, Chronic + Middle Aged | 49.16 | 41.23 ±8.95 | 0.00 | 28.40 | 58.18 |
| 20 | 20,000 | Average of 10 topics | 33.26 | 26.69 ±7.18 | 2.16 | 26.46 | 50.46 |
| 50 | 20,000 | Animals + Brain + Rats | 60.56 | 32.8 ±10.87 | 5.41 | 45.86 | 70.23 |
| 50 | 20,000 | Adult + Middle Aged + HIV Infections | 42.77 | 31.85 ±10.70 | 12.28 | 40.53 | 58.10 |
| 50 | 20,000 | Lymphatic Metastasis + Middle Aged + Neoplasm Staging | 28.21 | 24.77 ±6.06 | 12.26 | 23.24 | 40.99 |
| 50 | 20,000 | Renal Dialysis + Kidney Failure, Chronic + Middle Aged | 50.09 | 35.78 ±9.13 | 0.00 | 31.81 | 57.58 |
| 50 | 20,000 | Average of 10 topics | 37.36 | 29.07 ±7.75 | 3.01 | 30.41 | 51.09 |
| 50 | 20,000 | Average of 15 topics | 33.82 | 27.55 ±6.20 | 2.12 | 29.02 | 47.41 |
Table 1: Experiment F1 results against the baselines, as average performance across topics as well as four example topics. See Section 6 for details. The IR column reports the mean and standard deviation across top-$k$ cutoffs. The averages of 10 and 15 topics are computed over the same collection of topics, with the average of 15 topics including the 10-topic collection. The nnPU experiments include BER optimization and proportional batching, but no pre-trained embeddings.

2 Background: Positive-Unlabeled Learning

PU learning is the problem of learning a binary classifier from positive and unlabeled data. In this section we briefly describe notation and relevant literature.


We refer to the positive set as P, the labeled positive set as LP, the unlabeled set as U, and the negative set as N. Empirical approximations of expectations and priors are denoted with a hat (e.g., $\hat{R}$, $\hat{\pi}_p$).

2.1 Setting

Let $x \in \mathbb{R}^d$ and $y \in \{+1, -1\}$ be random variables jointly distributed by $p(x, y)$, where $p_p(x) = p(x \mid y = +1)$ and $p_n(x) = p(x \mid y = -1)$ are the class marginals (i.e., the positive and negative class-conditional densities), and let $g: \mathbb{R}^d \to \mathbb{R}$ and $\ell: \mathbb{R} \times \{+1, -1\} \to \mathbb{R}$ be an arbitrary binary decision function and a loss function of $g(x)$ and $y$, respectively. For the purpose of this work, we will use the common sigmoid loss, $\ell_{\mathrm{sig}}(t, y) = \frac{1}{1 + \exp(ty)}$, as we have observed the best empirical performance with this loss. We denote $\pi_p = p(y = +1)$ and $\pi_n = p(y = -1)$ as the class prior probabilities, such that $\pi_p + \pi_n = 1$. The methods described in this section all assume the proportion $\pi_p$ to be known.

Binary classification aims to minimize the risk:

$$R(g) = \mathbb{E}_{(x, y) \sim p(x, y)}\left[\ell(g(x), y)\right].$$

In supervised (positive and negative: PN) learning, both positive and negative samples are available. The supervised classification risk can be expressed via the partial class-specific risks:

$$R_{pn}(g) = \pi_p R_p^+(g) + \pi_n R_n^-(g), \tag{1}$$

where $R_p^+(g) = \mathbb{E}_{x \sim p_p}[\ell(g(x), +1)]$ and $R_n^-(g) = \mathbb{E}_{x \sim p_n}[\ell(g(x), -1)]$.

Notice that under the zero-one loss ($\ell_{01}(t, y) = \frac{1}{2}(1 - \mathrm{sign}(ty))$), the risk corresponds to one minus the accuracy. When training, we use $\ell_{\mathrm{sig}}$, which can be regarded as a soft approximation of this formulation for back-propagation. In practice, the expectations are expressed as averages of losses and optimized with batched gradient descent or similar methods.
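As a concrete illustration, the empirical PN risk with the sigmoid loss can be computed from samples as follows (a minimal NumPy sketch of our own, not the paper's implementation):

```python
import numpy as np

def sigmoid_loss(t, y):
    # l_sig(t, y) = 1 / (1 + exp(t * y)); small when sign(t) matches y
    return 1.0 / (1.0 + np.exp(t * y))

def pn_risk(g_pos, g_neg, pi_p):
    # Empirical PN risk: pi_p * R_p^+(g) + pi_n * R_n^-(g), where
    # g_pos / g_neg hold decision values g(x) on positive / negative samples.
    r_p_pos = sigmoid_loss(g_pos, +1).mean()
    r_n_neg = sigmoid_loss(g_neg, -1).mean()
    return pi_p * r_p_pos + (1.0 - pi_p) * r_n_neg

# A good separator (positive scores on P, negative on N) yields a low risk.
g_pos = np.array([3.0, 4.0, 2.5])
g_neg = np.array([-3.0, -4.0])
risk = pn_risk(g_pos, g_neg, pi_p=0.5)  # close to 0
```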

2.2 Unbiased PU Learning (uPU)

In this work we utilize the case-control variant of PU learning [Ward2009PresenceonlyDA]. (In case-control PU learning, the positive and unlabeled data are collected separately; there are other variants which assume different distributions on the data.) Formally, unlabeled data $x \sim p(x)$ is available instead of negative data $x \sim p_n(x)$, in addition to positive data $x \sim p_p(x)$ as before.

In order to train a binary classifier from PU data, we could naively train a classifier to separate positive from unlabeled samples. This approach will, of course, result in a sub-optimal biased solution, since the unlabeled dataset contains both positive and negative data. DBLP:conf/icml/PlessisNS15 proposed the following unbiased risk estimator to train a binary classifier from PU data. Since $p(x) = \pi_p p_p(x) + \pi_n p_n(x)$, it holds that

$$\pi_n R_n^-(g) = R_u^-(g) - \pi_p R_p^-(g)$$

for any decision function $g$, where $R_u^-(g) = \mathbb{E}_{x \sim p(x)}[\ell(g(x), -1)]$ and $R_p^-(g) = \mathbb{E}_{x \sim p_p}[\ell(g(x), -1)]$, and we can substitute the negative-class expectation in (1):

$$R_{pu}(g) = \pi_p R_p^+(g) + R_u^-(g) - \pi_p R_p^-(g). \tag{2}$$

By further empirically approximating this risk as an average of losses over our available dataset, we arrive at an unbiased risk estimator that can be trained on PU data, referred to as the uPU empirical risk.

Non-negative PU (nnPU)

If the loss is always positive, the risk should be as well. However, DBLP:conf/nips/KiryoNPS17 noted that with stochastic batched optimization, and specifically when very flexible models (such as neural networks) are used, the negative portion of the uPU loss can eventually cause the loss to go negative during training. To mitigate this overfitting phenomenon, they proposed to encourage the loss to stay positive by applying gradient ascent on the negative portion (the term that replaces the negative-class risk in the classification risk) when it becomes negative. This method is referred to as nnPU.
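The forward computation of the uPU and nnPU empirical risks can be sketched as follows (a minimal NumPy illustration of our own; actual nnPU training additionally performs gradient ascent on the clamped term, which this sketch does not show):

```python
import numpy as np

def sigmoid_loss(t, y):
    return 1.0 / (1.0 + np.exp(t * y))

def upu_risk(g_p, g_u, pi_p):
    # uPU: pi_p * R_p^+(g) + R_u^-(g) - pi_p * R_p^-(g),
    # where g_p / g_u are decision values on labeled-positive / unlabeled data.
    r_p_pos = sigmoid_loss(g_p, +1).mean()
    neg_part = sigmoid_loss(g_u, -1).mean() - pi_p * sigmoid_loss(g_p, -1).mean()
    return pi_p * r_p_pos + neg_part

def nnpu_risk(g_p, g_u, pi_p):
    # nnPU: clamp the estimated negative-class part at zero; a negative
    # value signals overfitting and would trigger the gradient-ascent step.
    r_p_pos = sigmoid_loss(g_p, +1).mean()
    neg_part = sigmoid_loss(g_u, -1).mean() - pi_p * sigmoid_loss(g_p, -1).mean()
    return pi_p * r_p_pos + max(0.0, neg_part)

g_p = np.array([2.0, 3.0])            # confident positive scores
g_u = np.array([-2.0, 2.0, -1.0])     # mixed unlabeled scores
r_upu, r_nnpu = upu_risk(g_p, g_u, 0.5), nnpu_risk(g_p, g_u, 0.5)
```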

3 The PubMed Set Expansion Task

In this section we discuss the method of generating an extensive benchmark for evaluating solutions of MLT document set expansion.

We are inspired by the following scenario: a user has a set of documents which all pertain to a latent topic, and is interested in retrieving more documents about that topic from a very large collection. While traditional term-frequency-based IR solutions scale well to extremely large collections of documents, they are imprecise, and their output contains a significant amount of noise. Therefore, an additional step based on PU learning can be utilized to classify the output of the IR model and improve the results.

We are interested in constructing a task for evaluation of this second step. In other words, given an existing black-box IR solution, we would like to use it to produce a dataset for training and evaluation of models which should improve upon the black-box IR solution’s performance.

Due to the varied nature of the setting, it is impractical to acquire full supervision for a large number of topics. Therefore, we propose to generate synthetic tasks inspired by the real use-case application.

3.1 Task Generation Method

We generate the document-set expansion tasks by leveraging the expansive PubMed database: a collection of 29 million bio-medical academic papers. Each document is labeled with MeSH tags, denoting the subjects of the document. A conjunction of MeSH terms defines a fine-grained topic, which we use to simulate a user’s information intent (example conjunctions in Table 1).

The method of generating a PubMed PU task is then:

  1. Input: a set of MeSH terms $T$ (the retrieval topic); $x$, the number of labeled positive documents; a black-box MLT IR engine, along with query parameters.

  2. LP: $x$ randomly selected papers that are labeled with $T$.

  3. U: the documents retrieved by the IR engine when queried with LP.

For the tasks generated and utilized in this paper, we have chosen the MeSH sets manually; the sizes of LP and U used for the training sets are reported in Table 1. For the MLT IR engine we have used the Elasticsearch implementation of BM25, and the top-$k$ scoring documents are retrieved. We make use of the abstracts of the PubMed papers only. See Appendix A for exact details of our method, as well as a comparison to an alternative method for generating censoring PU tasks (explained in the appendix). The code for generating the tasks and the data of our generated tasks are available online; the uploaded dataset contains the paper abstracts, and the PubMed identifiers are also included, so that additional information about each paper, such as the full text, can be retrieved from PubMed if desired.
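The retrieval step can be sketched with an Elasticsearch "More Like This" query. This is our own illustration: the index name (`pubmed`) and field names (`title`, `abstract`) are hypothetical, while the 2.0 title boost and the 20% minimum-should-match parameter mirror the settings described in Appendix A.

```python
# Build a "More Like This" query body for the Elasticsearch query DSL.
def build_mlt_query(seed_texts, top_k=20000):
    return {
        "size": top_k,
        "query": {
            "more_like_this": {
                "fields": ["title^2.0", "abstract"],  # 2.0 boost on titles
                "like": seed_texts,                   # the LP document texts
                "minimum_should_match": "20%",
            }
        },
    }

query = build_mlt_query(["...abstract text of a labeled positive paper..."])
# With an Elasticsearch client, the candidates would then be retrieved via:
# results = es.search(index="pubmed", body=query)
```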

We note that although in essence document set expansion involves using for both training and evaluation (transductive case), we are interested in the case where the PU model is able to generalize to completely unseen data (inductive case). As a result, we split the dataset into training, validation, and test sets, where we use the validation set for hyper-parameter tuning and early-stopping, and evaluate on the test set using the true labels. In other words, we assume a separate (from training) small PU set is available for validation. In our experiments, the size of the validation set is half of the size of the training set. In a deployment setting, the PU model can be used to label the training data.

4 Experiment Details

The rest of this work will reference experiment results. Unless otherwise noted, our base architecture is a single-layer CNN [DBLP:conf/emnlp/Kim14]. The choice of a CNN, over recurrent-based or attention-based models, is due to this architecture achieving the best performance in our experiments. Test-set performance is reported as an average over multiple MeSH topics (as many as our resources allowed). Except for the experiments that use pretrained models, the inputs are tokenized by words, and word embeddings are randomly initialized and trained with the model. More details are available in Appendix B. We stress that our intent in this work is not to report the very best scores possible, but rather to perform controlled experiments to test hypotheses. To this end, many orthogonally beneficial “tricks” from the NLP literature were not utilized. Additionally, nnPU-trained models generally required more diligent hyperparameter tuning due to two additional hyperparameters.

5 PU Learning for Document Set Expansion

In PU classification literature, traditionally small (and in many cases, linear) models have been used on relatively simple tasks, such as CIFAR-10 and 20News. However, performance of existing methods does not scale well to very high-dimensional inputs and state-of-the-art neural models for text classification; applying the PU learning methods described in Section 2 to a more practical setting results in several critical challenges that must be overcome—for example, PU learning methods often assume a known class prior, yet estimation of the class prior, particularly for text, is hard and inaccurate. In this section we discuss various challenges we have encountered in applying PU learning to the PubMed Set Expansion task, along with proposed, empirically validated solutions.

| #LP | Prior used in training | Accuracy | F1 |
|-----|------------------------|----------|-----|
| 20 | true prior $\pi_p$ | 84.27 | 0.0 |
| 20 | 0.5 | 62.09 | 33.26 |
| 50 | true prior $\pi_p$ | 81.71 | 0.0 |
| 50 | 0.5 | 59.92 | 37.36 |
Table 2: Experiments for the PU model, trained with the nnPU loss with either the true class prior (optimizing for accuracy by surrogate) or a prior of 0.5 (optimizing for BER by surrogate). Reported as averages across 10 topics.

5.1 Class Imbalance and Unknown Prior (BER Optimization)

Due to the class imbalance (a very small class prior), the classification risk encourages the model to be biased towards negative-class prediction (by prioritizing accuracy), rather than yielding a model that achieves worse accuracy but better F1. Thus, optimizing for a metric that is similar to F1 or AUC is preferable.

Under a known class prior assumption, DBLP:journals/ml/SakaiNS18 derived a PU risk estimator for optimizing AUC directly. However, $\pi_p$ cannot be assumed to be known in practice. Furthermore, the high dimensionality and lack of a cluster assumption in the input of our task makes estimation of $\pi_p$ difficult and noisy [pmlr-v37-menon15, DBLP:conf/icml/RamaswamyST16, DBLP:conf/nips/JainWR16, DBLP:journals/ml/PlessisNS17].

Following this line of thought, we propose a simple solution to both problems: by assuming a prior of $\pi_p = 0.5$ in the uPU loss regardless of the value of the true prior, we are able to optimize a surrogate loss for the Balanced Error (BER) metric [DBLP:conf/icpr/BrodersenOSB10], defined for a decision function $g$ under the zero-one loss as

$$\mathrm{BER}(g) = \frac{1}{2}\left(R_p^+(g) + R_n^-(g)\right).$$

Effectively, the uPU loss we are optimizing is:

$$\hat{R}_{ber}(g) = \frac{1}{2}\hat{R}_p^+(g) + \hat{R}_u^-(g) - \frac{1}{2}\hat{R}_p^-(g).$$

When using the zero-one loss ($\ell_{01}$), the binary classification risk with a prior of 0.5 is equivalent to BER, while BER minimization is equivalent to AUC maximization [pmlr-v37-menon15]. Since back-propagation requires a surrogate loss in place of $\ell_{01}$, such as $\ell_{\mathrm{sig}}$, the BER and AUC metrics are no longer inversely equivalent; however, we have found BER optimization to perform well in practice.
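On labeled validation data, BER itself is straightforward to compute directly from decision values (a minimal NumPy sketch of our own):

```python
import numpy as np

def ber(y_true, scores):
    # Balanced Error Rate: the mean of the per-class zero-one error rates,
    # i.e. BER(g) = 0.5 * (P(g(x) <= 0 | y = +1) + P(g(x) > 0 | y = -1)).
    pos, neg = scores[y_true == 1], scores[y_true == -1]
    fnr = np.mean(pos <= 0)  # positives predicted negative
    fpr = np.mean(neg > 0)   # negatives predicted positive
    return 0.5 * (fnr + fpr)

y = np.array([1, 1, -1, -1, -1, -1])
s = np.array([2.0, -1.0, -2.0, -3.0, 1.0, -0.5])
print(ber(y, s))  # 0.5 * (1/2 + 1/4) = 0.375
```

Note that BER weighs both classes equally regardless of the class prior, which is exactly why it is a better fit than accuracy under extreme imbalance.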


Table 2 shows a performance comparison in which the models trained using a prior of 0.5 achieved stronger F1 performance despite weaker accuracy.

5.2 Small Batch Size (Proportional Batching)

The large memory requirements of state of the art neural models such as Transformer [DBLP:conf/nips/VaswaniSPUJGKP17] and BERT [DBLP:journals/corr/abs-1810-04805], as discussed in the next subsection, coupled with the need to run on GPU, restrict the batch sizes that can be used.

This presents a challenge: when the loss function is composed of losses for multiple classes, and stochastic batched optimization is used, each batch should contain a proportionate amount of data from each class relative to the entire dataset. When the classes are greatly imbalanced, this imposes a lower bound on the batch size if each batch is to contain one positive example or more. For example, for a dataset which contains 50 positive and 10,000 unlabeled samples, each batch which contains a positive sample must have 200 unlabeled samples, implying a minimum batch size of 201. In practice, we were limited to the vicinity of 20 samples per batch when training large Transformer models.

Using a smaller batch size than the lower bound (in the case of the example, 20 as opposed to 201) implies that the vast majority of batches will not have labeled positive samples. This damages performance in two ways:

  1. The model overfits to the unlabeled data. Since the unlabeled samples are treated as discounted negative samples by the uPU loss, this results in a model that exhibits very low recall (i.e., the model is biased towards predicting the negative class).

  2. Inconsistent early-stopping performance. The uPU validation loss is a weighted sum of the positive and unlabeled components; because the loss of the unlabeled component is often much smaller than that of the positive component, and due to the usage of a surrogate prior of 0.5, early stopping becomes unstable. (This problem could be solved by limiting the batch size to 1 and calculating the validation-set uPU loss for every sample separately, but that is an extremely inefficient solution.)

To solve both of these problems, we propose to increase the sampling frequency of the positive class inversely to its frequency in the dataset. In practice, this solution simply enforces each batch to contain a rounded-up proportion of samples from each class. In the example above, every batch of 20 samples will have 1 positive and 19 unlabeled samples. As we “run out” of positive samples before unlabeled samples, we define an epoch as a single loop through the positive set.

The implication of increasing the sampling frequency is essentially that the positive component of the uPU loss receives a stronger weight. In our running example, the sampling frequency was increased roughly tenfold (from a 1:200 to a 1:19 ratio of positive to unlabeled samples per batch). For a sampling frequency increase by a factor of $\gamma$, the uPU loss becomes:

$$\hat{R}(g) = \gamma\left(\pi_p \hat{R}_p^+(g) - \pi_p \hat{R}_p^-(g)\right) + \hat{R}_u^-(g).$$

This, intuitively, counteracts the overfitting problem caused by the abundance of stochastic update steps on entirely unlabeled-class batches. The issue of unstable validation uPU loss is solved as well, since every batch must contain both positive and unlabeled samples, in a ratio that is consistent between the training and validation sets (and thus the validation uPU loss remains a reliable validation metric).

The issue of overfitting in this case is derived from a more general problem: overfitting to the “bigger” class in stochastic optimization of extremely imbalanced data, whenever the loss can be decomposed into per-class components (as is the case for the cross-entropy loss). For this reason, our solution also improves ordinary imbalanced classification under batch size restrictions.
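A proportional batch sampler of the kind described above can be sketched as follows (our own minimal illustration, not the paper's exact implementation):

```python
import random

def proportional_batches(pos, unl, batch_size, pos_per_batch=1):
    # Force every batch to contain `pos_per_batch` positive samples by
    # oversampling the (much smaller) positive set. One epoch is a single
    # pass through the positive set; both lists are shuffled in place.
    random.shuffle(pos)
    random.shuffle(unl)
    u_per_batch = batch_size - pos_per_batch
    u_idx = 0
    for p_idx in range(0, len(pos), pos_per_batch):
        p_batch = pos[p_idx:p_idx + pos_per_batch]
        # Cycle through the unlabeled set, wrapping around when exhausted.
        u_batch = [unl[(u_idx + j) % len(unl)] for j in range(u_per_batch)]
        u_idx += u_per_batch
        yield p_batch, u_batch

# With 50 positives, 10,000 unlabeled samples and batch size 20, every
# batch holds 1 positive and 19 unlabeled samples, as in the running example.
batches = list(proportional_batches(list(range(50)), list(range(10000)), 20))
```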

| Setting | Class Ratio | Batch Size | Proportional Batching | F1 |
|---------|-------------|------------|-----------------------|-----|
| PN (P:N) | 15:85 | 512 | No | 32.55 |
| PN (P:N) | 15:85 | 16 | No | 5.55 |
| PN (P:N) | 15:85 | 16 | Yes | 41.61 |
| PU (LP:U) | 2:100 | 512 | No | 22.77 |
| PU (LP:U) | 2:100 | 16 | No | 0.0 |
| PU (LP:U) | 2:100 | 16 | Yes | 22.35 |
Table 3: Evaluation of the sampling frequency increase method for mitigating overfitting to the bigger class in imbalanced classification with small batch sizes. Results show that proportional batching dramatically improves results under batch size constraints for both the ordinary supervised (PN) and PU settings.


Table 3 shows the effect of the increased sampling frequency method in ordinary imbalanced binary classification, as well as in nnPU training. In the small batch size experiments, the method causes an increase in recall, showing that the model is less inclined towards the “bigger” (in our case, the negative) class. The results apply in both the PN and PU settings, showing that proportional batching can be beneficial to any imbalanced classification task under batch size restrictions.

| Embeddings | F1 |
|------------|-----|
| Random (trainable) | 24.65 |
| SciBERT | 27.92 |
Table 4: Performance comparison using SciBERT pretrained embeddings. The numbers reported are the average of three topics.

5.3 Limited Data

A defining challenge of document set expansion tasks, when observed through the lens of imbalanced classification, is the very small class prior and small amount of labeled positive data. Although BER optimization mitigates the issue of the class imbalance, the issue of very little labeled data remains. To this end, we investigate pretraining as a solution.

We utilize SciBERT [DBLP:journals/corr/abs-1903-10676] as a source of pretrained contextual embeddings for the PubMed domain. For PubMed abstracts that go above the 512 word-piece limit of SciBERT, we utilize a sliding-window approach that averages all embeddings for a word-piece that appeared in multiple windows.
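The sliding-window averaging can be sketched as follows (a minimal NumPy illustration of our own; `embed_fn` stands in for a SciBERT-style encoder, and the window and stride sizes are illustrative assumptions):

```python
import numpy as np

def window_average_embeddings(embed_fn, tokens, window=512, stride=256):
    # For sequences longer than the encoder's length limit, embed
    # overlapping windows and average the vectors each token receives
    # across all windows it appears in. `embed_fn` maps a list of tokens
    # to a (len(tokens), dim) array of embeddings.
    dim = embed_fn(tokens[:min(window, len(tokens))]).shape[1]
    sums = np.zeros((len(tokens), dim))
    counts = np.zeros(len(tokens))
    start = 0
    while True:
        chunk = tokens[start:start + window]
        sums[start:start + len(chunk)] += embed_fn(chunk)
        counts[start:start + len(chunk)] += 1
        if start + window >= len(tokens):
            break
        start += stride
    return sums / counts[:, None]

# Fake encoder for demonstration: every token maps to a constant vector.
fake_encoder = lambda toks: np.ones((len(toks), 4))
out = window_average_embeddings(fake_encoder, list(range(1000)))
```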


Table 4 details experimental results. SciBERT embeddings show improvement over random trainable embeddings.

6 Effectiveness of PU Learning

In this section we evaluate the viability of our proposed solution. All experiments in this section use BER optimization and proportional batching as described in Section 5, but no pre-trained embeddings.

As an anchor for comparison, we use the following reference. Upper-bound: an identical model, trained on the same training data with full supervision using the true labels. This reference can be regarded as the upper-bound performance in the ideal case.

As well as the following baselines:

  • IR: The top-$k$ documents of the IR engine’s output are selected as positive documents, while the rest are treated as negative, for a range of values of $k$; F1 mean and standard deviation are reported across this range. (We note that the comparison here should be made to the specific IR engine which produced the dataset of the PU model, as the PU model benefits greatly from better performance of the IR engine.)

  • All-positive: Classifying all samples as the positive class.

  • Naive: Supervised learning between the labeled positive set (as P) and the unlabeled set (as N).

The IR baseline is the main alternative to our approach. The all-positive (minority) and naive baselines are very simplistic “lower-bound” models to be compared against.

Experiments in Table 1 show a significant increase in F1 performance as an average across many topics, against all baselines.

Figure 1 shows the performance of the IR and PU models against the upper-bound as a function of the amount of labeled data. The reported values are the distances between F1 scores, normalized by their sum. The figure shows that the PU model’s performance grows closer to the upper-bound as more labeled samples are added. In comparison, the IR model’s improvement diminishes greatly past 300 labeled samples.
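Concretely, the normalized distance plotted here is simply (a sketch of our own):

```python
def normalized_f1_gap(f1_upper, f1_model):
    # Absolute F1 difference normalized by the sum of the two scores;
    # 0 means the model matches the upper bound.
    return abs(f1_upper - f1_model) / (f1_upper + f1_model)

print(normalized_f1_gap(50.0, 30.0))  # 20 / 80 = 0.25
```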

Figure 1: The F1 absolute difference, normalized by the sum of the two F1 scores, between the upper-bound and nnPU as a function of the amount of labeled positive samples, as well as between the IR top-$k$ baseline (mean and standard deviation) and the upper-bound. Numbers are the average of five topics.

7 Using Negative Data

The document set expansion scenario may allow for cases where a limited amount of negative data can be collected. For example, the user may possess some number of relevant negative documents which were acquired alongside the positive documents, prior to training; alternatively, the user may label some documents from the IR engine’s output as they appear. Therefore, it is of interest to augment the task with a supplement of biased labeled negative data—i.e., negative documents which were not sampled from the true negative distribution, but were selected with some bias, such as their length, popularity (for example, the number of citations), or their placement within the IR engine’s rankings.

In this work, we consider a bias from document character length, randomly sampling abstracts that are below a certain amount of characters. Alternative bias methods are discussed in Appendix A.

PNU Learning

When it is possible to obtain negative data in a limited capacity, it can be incorporated into training. When the negative data is sampled simply from $p_n(x)$, i.e., it is unbiased negative data, it is possible to use PNU classification [DBLP:conf/icml/SakaiPNS17], which is a linear combination of the PN risk $R_{pn}$ and the PU risk $R_{pu}$:

$$R_{pnu}(g) = \eta R_{pn}(g) + (1 - \eta) R_{pu}(g), \quad \eta \in [0, 1]. \tag{5}$$

In other words, multi-task learning of the PN and PU losses.
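This combined objective can be sketched in NumPy as follows (our own illustration; `eta` is the combination weight between the PN and PU risks, and its value is a hyperparameter):

```python
import numpy as np

def sigmoid_loss(t, y):
    return 1.0 / (1.0 + np.exp(t * y))

def pnu_risk(g_p, g_n, g_u, pi_p, eta=0.5):
    # PNU risk: eta * R_pn(g) + (1 - eta) * R_pu(g), where g_p / g_n / g_u
    # are decision values on positive, negative and unlabeled samples.
    r_pn = (pi_p * sigmoid_loss(g_p, +1).mean()
            + (1 - pi_p) * sigmoid_loss(g_n, -1).mean())
    r_pu = (pi_p * sigmoid_loss(g_p, +1).mean()
            + sigmoid_loss(g_u, -1).mean()
            - pi_p * sigmoid_loss(g_p, -1).mean())
    return eta * r_pn + (1 - eta) * r_pu

g_p, g_n = np.array([1.0, 2.0]), np.array([-1.0, -2.0])
g_u = np.array([0.5, -0.5, -1.5])
risk = pnu_risk(g_p, g_n, g_u, pi_p=0.3)
```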

We note that, to our knowledge, PNU learning had not been successfully applied to deep models in the literature prior to this work. We apply the same solution to the case of biased negative samples.

In our experiments, we have increased the sampling frequency of the negative samples (as described in Section 5.2) such that every batch has the same (non-zero) number of positive and negative samples, and each class of samples (positive, negative, and unlabeled) is shuffled and looped over separately.


Tables 5 and 6 show the results of PNU learning for the biased and unbiased settings. We note multiple conclusions from these results:

  1. There is an improvement in performance when using unbiased negative samples, and no improvement when using biased negative samples (Table 5).

  2. In the case of unbiased samples, performance between the PU and PN models is comparable, and a simple ensemble of the PN and PU models out-performs multi-task PNU learning (Table 6a).

  3. The PN+PU ensemble out-performs a PU ensemble, validating that the performance increase is not solely due to the ensembling method (Table 6a).

  4. In the case of biased negative samples, where the performance of the PN model is severely lower than the PU model, PNU multi-task learning slightly out-performs ensembling (Table 6b).

| Setting | Precision | Recall | F1 |
|---------|-----------|--------|-----|
| PU | 29.35 | 71.83 | 40.78 |
| PN (unbiased N) | 33.83 | 70.40 | 42.14 |
| PN (biased N) | 19.34 | 90.62 | 31.29 |
Table 5: Experiments for five topics. All experiments used the same amounts of labeled positive and negative data for training and validation across settings (with the unlabeled set additionally used in the PU setting).
| Setting | Model | (a) Unbiased N, F1 | (b) Biased N, F1 |
|---------|-------|--------------------|-------------------|
| PNU | Ensemble (PN+PU, 1+1) | 42.31 | 37.63 |
| PNU | Multi-task | 41.50 | 41.48 |
| PU | Ensemble (3) | 41.25 | |
Table 6: Average performance on the same five topics as in Table 5, using the same data sizes. Bias selection for N was performed by character length. “Multi-task” refers to Equation (5).

8 Related Work

Linear PU models have been extensively used for text classification [DBLP:conf/aaai/LiuLLY04, DBLP:conf/micai/YuZP05, DBLP:conf/dasfaa/CongLWL04, DBLP:conf/ecml/LiL05], using EM and SVM algorithms. In particular, the 20News corpus has often been leveraged to build PU tasks for evaluating those models [DBLP:conf/icml/LeeL03, DBLP:conf/ijcai/LiLN07]. DBLP:conf/acl/LiZLN10 evaluated EM-based PU models against distributional similarity for entity set expansion. DBLP:conf/emnlp/LiLN10 proposed that PU learning may out-perform using negative data when the negative data’s distribution significantly differs between training and deployment.

DBLP:journals/ml/PlessisNS17, DBLP:journals/corr/abs-1809-05710 describe methods of estimating the class prior from PU data under certain distributional assumptions. DBLP:journals/corr/abs-1810-00846 introduced PUbN as an alternative PU-based loss for learning with biased negative samples. PUbN involves a two-step method in which the marginal probability of a sample being labeled (positive or negative) is approximated using a neural model, and then used in the loss. In our experiments, PUbN consistently overfit to the majority baseline. We suspect that this results from inaccurate estimation of the labeling probability due to the difficulty of the task.

9 Conclusion

We have proposed a two-stage solution to document set expansion—the task of retrieving documents from a large collection based on a small set of documents pertaining to a latent fine-grained topic—as a method of improving and expanding upon current IR solutions, by training a PU model on the output of a black-box IR engine. In order to accurately evaluate this method, we synthetically generated tasks by leveraging the PubMed database, using MeSH term conjunctions to denote latent topics. Finally, we discussed challenges in applying PU learning to this task, namely an unknown class prior, extremely imbalanced data and batch size restrictions; proposed solutions (one of which, “Proportional Batching”, applies to imbalanced classification in general, as we empirically validate); and provided empirical evaluation against multiple baselines which showcases the effectiveness of the approach.

Future Work

Stronger class prior estimation, through assumptions specific to this task, may facilitate direct AUC optimization. Additionally, methods of increasing precision may be considered, such as data augmentation via heuristics or adversarial training.


Appendix A PubMed Set Expansion Task Generation

In this section we discuss details of the PubMed Set Expansion task generation process.


For this work, we indexed the January 2019 version of PubMed in an Elasticsearch (version 6.5.4) index. We discard all papers in PubMed that do not have MeSH terms or abstracts (of which there are few). The title and abstract of each paper are tokenized using the Elasticsearch English tokenizer, with term vectors. The title receives a 2.0 score boost during retrieval. For retrieval, we use the Elasticsearch “More Like This” query with the default implementation of BM25 and a “minimum should match” parameter of 20%, meaning that papers which do not share at least 20% of their terms with the query are dropped. This parameter was set in the interest of efficiency, as the query is otherwise very slow.
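A query along these lines can be sketched as an Elasticsearch query body as follows. This is illustrative only: the field names (`title`, `abstract`), the index name, and the encoding of the title boost as a boosted `bool`/`should` clause are assumptions, not the exact configuration used.

```python
def build_mlt_query(doc_ids, index="pubmed", min_should_match="20%"):
    """Sketch of a "More Like This" retrieval query over query documents.

    Each query document is referenced by its id in the index; the title
    clause carries the 2.0 boost, and terms below the minimum-should-match
    threshold cause a document to be dropped.
    """
    likes = [{"_index": index, "_id": doc_id} for doc_id in doc_ids]

    def mlt_clause(fields, boost=1.0):
        return {
            "more_like_this": {
                "fields": fields,
                "like": likes,
                "minimum_should_match": min_should_match,
                "boost": boost,
            }
        }

    return {
        "query": {
            "bool": {
                "should": [
                    mlt_clause(["title"], boost=2.0),
                    mlt_clause(["abstract"]),
                ]
            }
        }
    }
```

The body would then be passed to the Elasticsearch search endpoint with the desired result size (the number of unlabeled documents to collect).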

Table 7 contains statistics about sample topics.

Topic | #LP | #U | Precision | Recall
Liver + Rats, Inbred Strains + Rats | 20 | 10,000 | 17.45 | 15.59
Liver + Rats, Inbred Strains + Rats | 50 | 10,000 | 16.55 | 14.82
Adult + Middle-Aged + HIV Infections | 20 | 20,000 | 18.33 | 20.06
Adult + Middle-Aged + HIV Infections | 50 | 20,000 | 25.42 | 27.85
Table 7: Dataset sizes for two example PubMed Set Expansion tasks based on the given topics, each composed of three MeSH terms. The reported sizes are for the training set: #LP is the number of labeled positive documents and #U the number of unlabeled documents. Precision denotes the proportion of P samples in U, and recall denotes the proportion of retrieved P samples out of all positive documents in PubMed.

Censoring PU learning

An alternative, easier scenario for the Document Set Expansion task is one in which the LP data was sampled and labeled from the U distribution, termed censoring PU learning. To model this case, the task can be generated in the following way:

  1. Input: a set of MeSH terms (the retrieval topic); the number of labeled positive documents; a black-box MLT IR engine, along with query parameters.

  2. P: all papers that are labeled with the given MeSH terms.

  3. LP: the requested number of papers selected at random from P.

Experimentally, the F1 performance of all models (PU and PN) was greatly improved in this setting, compared to the case-control tasks described in the main work. All methods discussed in this work apply to the censoring setting, as it is a special case of the case-control setting.
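The generation steps above can be sketched over an in-memory corpus as follows (the IR retrieval of U is treated as a black box and omitted; the dict layout and names such as `mesh` and `n_lp` are illustrative):

```python
import random

def make_censoring_task(papers, topic_terms, n_lp, seed=0):
    """Generate the P and LP sets for a censoring-style PU task.

    papers:      iterable of dicts, each with a "mesh" collection of MeSH terms
    topic_terms: the MeSH-term conjunction defining the latent topic
    n_lp:        number of labeled-positive documents to sample
    """
    topic = set(topic_terms)
    # P: all papers annotated with every term in the topic conjunction
    positives = [p for p in papers if topic <= set(p["mesh"])]
    # LP: sampled uniformly at random from P (the censoring assumption)
    labeled_pos = random.Random(seed).sample(positives, n_lp)
    return positives, labeled_pos
```

The key difference from the case-control setting in the main text is the last step: here LP is a uniform sample of P, rather than being drawn from a separate labeling process.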


It is possible to simulate bias in the sampling of documents according to many heuristics and assumptions. For example, it may be assumed that the user is more likely to label documents that are shorter, or documents that are more famous (as indicated by the number of citations in PubMed). Additional possible conditions include the ranking of the IR engine, in two possible ways: 1. the user may submit labels after the IR query while viewing the results, in which case they are more likely to label documents that are ranked higher; 2. in the case of an IR engine modeled by bag-of-words (such as BM25), documents that rank lower can be assumed to have less relevant vocabulary overlap with the positive class, such that they may be easier to label at a glance. Figure 2 shows a typical class distribution according to BM25 rank for a sample PubMed Set Expansion task.
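One way to simulate such rank-dependent labeling bias is to make the sampling probability a decreasing function of IR rank. The sketch below uses a 1/log weighting purely as an illustrative choice; any decreasing weight function would do.

```python
import math
import random

def rank_biased_sample(ranked_positive_ids, n_lp, seed=0):
    """Sample labeled positives with probability decreasing in IR rank.

    ranked_positive_ids: positive documents ordered by IR rank (best first)
    n_lp:                number of labeled-positive documents to draw
    """
    rng = random.Random(seed)
    # higher-ranked documents are more likely to be noticed and labeled
    weights = [1.0 / math.log(2.0 + rank)
               for rank in range(len(ranked_positive_ids))]
    chosen = set()
    while len(chosen) < n_lp:
        # rejection-style draw without replacement under the rank weights
        chosen.add(rng.choices(range(len(ranked_positive_ids)),
                               weights=weights, k=1)[0])
    return [ranked_positive_ids[i] for i in sorted(chosen)]
```

Length- or citation-based bias can be simulated the same way by swapping the weight function.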

Figure 2: Two histograms, of positive and negative documents respectively, by their BM25 score. The horizontal axis denotes buckets of BM25 scores, and the vertical axis is the number of samples in each bucket.

Appendix B Experiment Details

The experiments were implemented in PyTorch version 1.0.1.post2 and AllenNLP version 0.8.3 (unreleased). The neural models used a CNN encoder with max-pooling, with 100 filters for the title and 200 filters for the abstract, split evenly between window sizes of 3 and 5. The choice of a CNN (over recurrent- or attention-based models) is due to this architecture achieving the best performance in practice. For the contextual embeddings, SciBERT-base was used. The learning rate for the model with no pretraining is 0.001, while the learning rate for the SciBERT model is 0.00005. The nnPU parameters beta and gamma were set to 0 and tuned over the validation set loss, respectively. In all cases of nnPU training we used the largest batch size possible, which was 1000 for the CNN model with no pretraining, and between 16 and 25 for the SciBERT model. In the case of the SciBERT model, we ignored training and validation samples longer than 600 words, tokenized by the AllenNLP default WordTokenizer, to avoid long outliers which greatly limit the batch size. This was not performed on the test set, to maintain an unbiased comparison.

Experiment Topics

(†)

  1. Animals + Brain + Rats.

  2. Adult + Middle Aged + HIV Infections.

  3. Lymphatic Metastasis + Middle Aged + Neoplasm Staging.

  4. Base Sequence + Molecular Sequence Data + Promoter Regions, Genetic.

  5. Renal Dialysis + Kidney Failure, Chronic + Middle Aged.

  6. Aged + Middle Aged + Laparoscopy.

  7. Apoptosis + Cell Line, Tumor + Cell Proliferation.

  8. Disease Models, Animal + Rats, Sprague-Dawley + Rats.

  9. Liver + Rats, Inbred Strains + Rats.

  10. Dose-Response Relationship, Drug + Rats, Sprague-Dawley + Rats.

(‡)

  1. Female + Infant, Newborn + Pregnancy.

  2. Molecular Sequence Data + Phylogeny + Sequence Alignment.

  3. Cells, Cultured + Mice, Inbred C57BL + Mice.

  4. Dose-Response Relationship, Drug + Rats, Sprague-Dawley + Rats.

  5. Brain + Magnetic Resonance Imaging + Middle Aged.