1 Introduction
In supervised learning, the labels provided by a group of annotators are typically aggregated into a single label, which is regarded as ground truth. The underlying assumption that one label fits all is rarely questioned. However, the process of labeling is often subjective, i.e., based on the personal experiences of the humans who label the data
[ovesdotter-alm-2011-subjective], such as when the task is to predict beauty in images or rate movies [Geng2015]. Genuine disagreement is also common in seemingly “more objective” tasks, for instance, in assessing the mental state, beliefs, or other hidden states based on observable data [Liu2019HCOMP, Homan2015]. Nevertheless, the negative impacts of AI systems trained on too narrow a segment of a population are increasingly felt [Neff2016, vincent_2016, Yang_2017, pmlr-v81-buolamwini18a, Wired_2018].

Label distribution learning
(LDL) seeks to predict, for each data item, a probability distribution over the set of labels
[Geng2016]. LDL can capture for each data item the diversity of opinions among the human annotators. In contrast, almost all of the prior work in LDL has taken the label distributions found in the training data to be accurate, when in fact annotations obtained from crowdsourcing sites or social media—a common though certainly not exclusive source of label distributions—are most often samples of a larger pool of annotators, who may not themselves be representative of the actual target populations’ opinions. Furthermore, outside of a limited number of cases—such as movie ratings, where annotations are abundant and convenient—the sample size of annotations for each data item is far too small (often as small as five annotations per item) to reliably represent the true distributions of the annotator pool’s beliefs and opinions.
Liu et al. [Liu2019HCOMP] propose a method for sharing labels among items with similar label distributions, under the assumption that, relative to a specific labeling task, there are only a small number of possible interpretations of any data item, even at the population level. A goal in their work was to address the resource bottleneck of human annotation from a large sample by sharing the existing labels. They introduce the concept of population-level label distribution learning (PLDL), i.e., LDL that explicitly models label distributions as population samples. They explored PLDL by using different methods of clustering in the label space. However, they do not use population models to evaluate their algorithm’s performance or select models.
Our main contribution is to introduce new model selection techniques based on population-level hypothesis tests. These techniques use traditional frequentist statistical approaches to hypothesize that items are similar if their labels can be regarded as samples from a common source. We use these methods to explore, from a sampling perspective, the clustering approach of Liu et al.
[Liu2019HCOMP]. Additionally, we consider a new, bottom-up approach, called neighborhood-based pooling (NBP) (Figure 1), for improving the ground truth estimates of labels for PLDL. Our approach is based on the idea that items with very similar labels may have essentially the same meaning, relative to an annotation task, and thus their labels may be shared, but that the space of label distributions is smoother and less sharply clustered than Liu et al. assume.
2 Related Work
Disagreement during labeling tasks is a well-studied problem [Malossini2006, Marcus1993, ovesdotter-alm-2011-subjective]. Snow et al. [Snow2008], using multiple crowdsourced workers, observed that individuals (even experts) have personal biases when labeling. Such biases contribute, through disagreement, to diversity among labels, which nondistributional learning cannot account for. Aroyo [Aroyo2014] observed that when human annotators agree with one another, they perform at a level comparable to experts, and when they disagree, it is often for a good reason. In conventional single-label learning applications, researchers consider the majority label as the ground truth, disregarding the disagreements discussed above.
Geng [Geng2016] introduced label distribution learning, a learning paradigm in which probability distributions are the objects to be predicted. LDL takes into account the diversity, disagreement, ambiguity, and uncertainty among annotators in its predictions. He and colleagues studied LDL using a variety of learning strategies: problem transformation, algorithm adaptation, and specialized algorithms. LDL shares some similarities with learning over probability distributions, which has a long history of research [Sheng2008, Goyal2018, Algorithm1979]; both use probability distributions, but LDL interprets them as ground truth while the others use them to model uncertainty. Geng and other researchers have studied applications of LDL in various settings; some involve predicting population-level labels [Geng2015, Geng2014, Ren2017], while others do not [Gao2017, Ling2018]. These studies acknowledge the need for a large number of labels to train on in order to improve the estimated distributions. We use the natural scene and facial expression datasets used by Geng [Geng2014] for our experiments (see Section 4).
In contrast to supervised learning, semi-supervised learning (SSL) is a combination of supervised and unsupervised learning
[Chapelle2006]. SSL uses both labeled and unlabeled data to improve learning, and it has a long history with a variety of learning methods [Iscen2019, zhu05survey]. Clustering is one such semi-supervised method, based on the assumption that if two items belong to the same cluster, they share the same label [Chapelle2006]. Liu et al. [Liu2019HCOMP] use a similar approach for PLDL (they also formally introduce the PLDL problem). They explored clustering as an unsupervised learning method to improve the quality of ground truth estimation. They collected a crowdsourced dataset in which five to ten human annotators labeled each data item. Then they “[clustered] together semantic similar data items, and then [pooled] together all the labels in each cluster into a single, larger sample” [Liu2019HCOMP], under the assumption that this sample represented the population-level beliefs about each item in the cluster. They compared and contrasted the performance of a variety of clustering methods. This kind of approach is more typically applied to items with no or unreliable labels, where similarity between data items is necessarily determined in the feature space of the data items, not in the label space. However, Liu et al. [Liu2019HCOMP] observe that, if all the items already have labels, one can determine similarity in the label space alone. Their work has two limitations: (1) though their approach was based on statistical principles, they do not exploit this connection in their analysis, and (2) they consider only a relatively small number of clusters per learning problem (under the assumption that for classification problems the number of distinct answer distributions is necessarily limited).
Zhang [Min-LingZhang2005] introduces multi-instance multi-label (MIML) k-NN, a non-parametric learning method that uses the k nearest neighbors of an item to predict multiple labels. However, k-NN selects the k closest neighbors of an item regardless of how close or far away they are from it. The drawback of this approach is that it does not take into account the degree of similarity between the item and its neighbors.
3 Methods
3.1 Label Distribution Learning
Wang and Geng [wang2019theoretical] developed a theory for label distribution learning, proving a number of theorems about error functions for LDL. Their theory presents label distributions as ground truth objects, without considering that they may be merely estimates of an underlying ground distribution. Here, we present a theory for LDL for the special case in which the label distribution is a sample from an underlying population; it explicitly accounts for this population, along with noise in the sampling process and the level of reliability of the annotators.

First, we introduce some notation. For any probability distribution $D$ over a set $Z$, let $\mathbf{z}_D$ denote a random variable taking values in $Z$ and distributed according to $D$; if the context is clear, we drop the “$D$” subscript. For random variables $\mathbf{z}$ and $\mathbf{z}'$, let $\mathbf{z} \mid \mathbf{z}'$ denote the distribution of $\mathbf{z}$ conditioned on $\mathbf{z}'$, which is itself a random variable. Finally, for any set $Y$, let $\Delta(Y)$ denote the space of all probability distributions over $Y$.

Now, to formally define PLDL, let $D$ be a probability distribution over $X \times A \times Y$, where $X$ is the feature space of a data set of interest, $A$ is the agent (or annotator) space, and $Y$ is a label space. Let $\mathbf{x}$, $\mathbf{a}$, and $\mathbf{y}$ be random variables representing the marginal distributions of $X$, $A$, and $Y$, respectively. We assume that $\mathbf{a}$ is independent of $\mathbf{x}$.

Let $S = ((x_1, a_1, y_1), \ldots, (x_n, a_n, y_n))$ be a sample drawn from $D$ according to Figure 1, and let $\mathcal{H}$ be a hypothesis space, where each $h \in \mathcal{H}$ is a function $h : X \rightarrow \Delta(Y)$. The empirical risk minimization (ERM) problem for PLDL is to find a hypothesis $h^*$ that minimizes $L(h, S)$, the loss function applied to $h$ on $S$:

$$h^* = \arg\min_{h \in \mathcal{H}} L(h, S) \qquad (1)$$

(Digression: in the multi-label setting one can take the label space to be the set of label subsets and treat it as a single-label learning problem; however, it is often beneficial to exploit the set structure of the multi-label setting in designing a machine learning solution.) $L$ is a function that is small when each predicted distribution $h(x_i)$ is close to the corresponding empirical label distribution and zero whenever they are equal. For the sake of discussion, we will take $L$ to be the expected Kullback-Leibler (KL) divergence, where for any two probability distributions $P$ and $Q$ over $Y$,

$$D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{\mathbf{y} \sim P}\!\left[\log \frac{P(\mathbf{y})}{Q(\mathbf{y})}\right] \qquad (2)$$
KL divergence is widely used in machine learning, especially in belief modeling, and as a loss function in many settings. Here, it can be roughly interpreted as the expected number of bits per item needed to correct whenever a sample from is mistaken for a sample from , and this seems to capture intuitively the notion of error when comparing two probability distributions.
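For concreteness, the following minimal sketch (ours, not from the paper; the smoothing constant `eps` is an assumption made to avoid infinite divergence when a label has zero observed probability) computes Eq. 2 for two empirical label distributions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions over the same label set.
    np.log gives nats; substitute np.log2 to measure the divergence in bits."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Two items, each labeled by 5 annotators over 3 label choices.
dist_a = np.array([4, 1, 0]) / 5   # empirical label distribution of item a
dist_b = np.array([3, 1, 1]) / 5   # empirical label distribution of item b
print(kl_divergence(dist_a, dist_b))
```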
3.2 Estimating Ground Truth via Label Pooling
A common problem in population-based label distribution learning is that datasets frequently have only a small number of annotations per data item, i.e., each item $x$ occurs at most $m$ times in $S$, where $m$ is typically too small to reliably estimate the ground truth label distribution for $x$. For simplicity’s sake we will assume hereafter that each item occurs exactly $m$ times.
Liu et al. [Liu2019HCOMP] explore the idea that similarly-labeled data items may have similar meanings at the population level and could thus be seen as samples of a common underlying source. We generalize the idea of pooling and extend it to neighborhood-based approaches, but first we need to formally define pooling. A pooling consists of an integer $k$, a collection of pools $C_1, \ldots, C_k$ whose union covers the data items, and a mapping $c$ that assigns each data item $x_i$ to a pool $C_{c(i)}$. After learning a model that best fits the data, each data item $x_i$ is then associated with the marginal label distribution (which Liu et al. call the refined label distribution) of $x_i$’s pool $C_{c(i)}$.
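To make the pooling operation concrete, here is a minimal sketch (ours; the count-matrix representation and names are assumptions) that computes the refined label distribution of every item from raw annotation counts and a pool assignment.

```python
import numpy as np

def refine_labels(label_counts, pool_of):
    """label_counts: (n_items, n_labels) annotation counts per item.
    pool_of: (n_items,) pool index assigned to each item.
    Returns (n_items, n_labels) refined label distributions."""
    counts = np.asarray(label_counts, dtype=float)
    pool_of = np.asarray(pool_of)
    refined = np.empty_like(counts)
    for pool_id in np.unique(pool_of):
        members = pool_of == pool_id
        pooled = counts[members].sum(axis=0)       # pool all labels in the pool
        refined[members] = pooled / pooled.sum()   # one shared distribution per pool
    return refined
```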
3.2.1 Cluster-Based Pooling
Liu et al. [Liu2019HCOMP] introduce pooling methods based on generative hierarchical probabilistic models. Each of these models assumes that the empirical label distributions were generated by choosing a cluster according to a distribution (or in the case of latent Dirichlet allocation (LDA) [blei2003latent], each data item has its own distribution ) over the pools, and then choosing the labels via a distribution associated with the chosen cluster (or in the case of LDA, choosing a new “pool” for each label).
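As a hedged illustration only (a simple stand-in, not the authors’ generative models with Dirichlet priors), the sketch below clusters items by their empirical label distributions with scikit-learn’s GaussianMixture and pools the labels within each cluster; the number of clusters and random seed are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_pooling(label_counts, n_clusters=10, seed=0):
    """Cluster items in label space and pool the labels within each cluster."""
    counts = np.asarray(label_counts, dtype=float)
    dists = counts / counts.sum(axis=1, keepdims=True)   # empirical label distributions
    pool_of = GaussianMixture(n_components=n_clusters,
                              random_state=seed).fit_predict(dists)
    refined = np.empty_like(dists)
    for c in np.unique(pool_of):
        members = pool_of == c
        pooled = counts[members].sum(axis=0)
        refined[members] = pooled / pooled.sum()
    return refined, pool_of
```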
For comparison purposes, we consider four of the models used by Liu et al. [Liu2019HCOMP]: a (finite) multinomial mixture model (F) with a Dirichlet prior over the cluster proportions, where each cluster distribution is a multinomial distribution with Dirichlet priors; a Gaussian mixture model (G) without Dirichlet priors; k-means (K); and LDA (L).

3.2.2 Neighborhood-based Pooling
Neighborhood-based pooling (NBP) creates for each data item a single pool consisting of all items whose empirical label distributions lie within a distance $\epsilon$ of the item’s own empirical label distribution, where $\epsilon$ is a hyperparameter called the neighborhood radius (see Algorithm 2). As the distance, we use KL divergence. We also considered Euclidean distance, Chebyshev distance, and the Canberra metric; however, our results using these measures were similar enough to our results with KL divergence (Figure 2). Additionally, KL divergence has a meaningful interpretation in the context of statistical estimation: it is the expected number of bits per item needed to make a multinomial sample from one distribution appear to be from the other.
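The following minimal sketch (ours; the direction of the KL divergence, the smoothing constant, and the O(n²) pairwise computation are all assumptions made for clarity) builds each item’s neighborhood pool directly from the raw annotation counts.

```python
import numpy as np

def nbp_refine(label_counts, radius, eps=1e-12):
    """Neighborhood-based pooling: each item pools labels from every item whose
    empirical label distribution lies within KL divergence `radius` of its own."""
    counts = np.asarray(label_counts, dtype=float)
    dists = counts / counts.sum(axis=1, keepdims=True)
    p = (dists + eps) / (dists + eps).sum(axis=1, keepdims=True)
    # pairwise[i, j] = D_KL(P_i || P_j); the direction is our choice.
    pairwise = np.sum(p[:, None, :] * np.log(p[:, None, :] / p[None, :, :]), axis=2)
    refined = np.empty_like(dists)
    for i in range(len(counts)):
        neighbors = pairwise[i] <= radius          # always includes item i itself
        pooled = counts[neighbors].sum(axis=0)
        refined[i] = pooled / pooled.sum()
    return refined
```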

3.3 Hyperparameter Selection for (and Evaluation of) Pooling Models
To evaluate our label pooling methods and select hyperparameters (the number of pools $k$ for the clustering methods or the neighborhood radius $\epsilon$ for NBP), we consider two loss functions. First, we use the mean KL divergence between the empirical and pooled label distributions of each item in the evaluation set:

$$L = \frac{1}{n} \sum_{i=1}^{n} D_{\mathrm{KL}}\!\left(\hat{P}_i \,\|\, \tilde{P}_i\right) \qquad (3)$$

where $\hat{P}_i$ is the empirical label distribution of item $x_i$ and $\tilde{P}_i$ is its pooled (refined) label distribution.
Second, we consider the probability of the given label set, given the pools. Additionally, the clustering methods all rely on stochastic optimization. To avoid overfitting, for a range of cluster sizes $k$ (based on the range selected by Liu et al. [Liu2019HCOMP]), we ran 100 trials on the training set (since we are using clustering here for sample estimation and not prediction, it is valid and proper to test on the training set) and picked, for each value of $k$, the model with the median loss. Table 2 shows the number of clusters selected on each label set.
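As an illustration of this protocol (ours; k-means stands in for the full set of clustering models, and the loss is the mean KL divergence of Eq. 3), the sketch below runs repeated randomly seeded trials for each candidate number of clusters and keeps the median-loss trial.

```python
import numpy as np
from scipy.stats import entropy            # entropy(p, q) = D_KL(p || q)
from sklearn.cluster import KMeans

def median_trial_per_k(dists, counts, k_values, n_trials=100, smooth=1e-12):
    """For each k, run n_trials clusterings on the label distributions and keep
    the trial whose mean KL loss (empirical vs. pooled) is the median."""
    chosen = {}
    for k in k_values:
        trials = []
        for seed in range(n_trials):
            assign = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(dists)
            refined = np.empty_like(dists)
            for c in np.unique(assign):
                members = assign == c
                pooled = counts[members].sum(axis=0)
                refined[members] = pooled / pooled.sum()
            loss = np.mean([entropy(d + smooth, r + smooth) for d, r in zip(dists, refined)])
            trials.append((loss, assign))
        trials.sort(key=lambda t: t[0])
        chosen[k] = trials[n_trials // 2]   # (median loss, pool assignment)
    return chosen
```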
Selecting the best hyperparameter by the raw loss function (Eq. 3) may not be the best choice here, in part because it measures only how well each hyperparameter setting fits the training data, but also because the ground truth is, for us, unobservable. Sometimes, even the best models are still not adequate. In unsupervised problems such as pooling, there are few settings where widely agreed-upon numerical measures are sufficient to judge the quality of a model [Halkidi2001].
However, in the case of PLDL we have a population of annotators to work with. This enables us to use frequentist statistical methods such as hypothesis testing to assess the quality of our pooling models. The basic idea is to generate from each given model random samples that are the same size as our training sample. If the model is a good fit for our training sample, then statistics from the training sample should be in line with those generated directly by the model.
So, for each model, we run two simulation-based statistical tests. First, we generate 1000 synthetic label sets, each the size of our training set, based on the pooling model we are testing; we then compare the loss function on our test data to the distribution of loss functions on the synthetic sets. As an additional test, we compare these to another sample of 1000 synthetic label sets based on a bootstrap sample from the actual data. See Algorithms 3–4.
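Algorithms 3–4 are not reproduced here, but the core of the model-based sampler can be sketched as follows (ours, with assumed names): for each item, draw the same number of labels from its refined distribution, and record the pooling model’s loss on each synthetic label set.

```python
import numpy as np
from scipy.stats import entropy            # entropy(p, q) = D_KL(p || q)

def simulated_losses(refined, n_annotations, n_sets=1000, seed=0, smooth=1e-12):
    """Loss of a pooling model on synthetic label sets drawn from the model itself.
    refined: (n_items, n_labels) refined label distributions."""
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(n_sets):
        synth = np.stack([rng.multinomial(n_annotations, p) for p in refined])
        synth_dists = synth / synth.sum(axis=1, keepdims=True)
        losses.append(np.mean([entropy(s + smooth, r + smooth)
                               for s, r in zip(synth_dists, refined)]))
    return np.array(losses)
```

The loss on the real data can then be compared against the empirical distribution of these synthetic losses.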
We consider a range of reasonable values for $\epsilon$ based on our NBP and bootstrap samplers. Table 3 shows the neighborhood sizes ($\epsilon$) selected on each of the label sets. The selected $\epsilon$ is the elbow point of a piecewise linear regression line (see Figure 7 and Figure 7).

4 Experiments
4.1 Data and Labels
All of the datasets we consider have labels that were produced by humans, and in many cases so was the data. We consulted with our institutional review board (IRB), which determined that the data was both secondary and publicly available, and thus exempt from IRB review. For each dataset, we used a 50/25/25 train/dev/test split. These datasets have been used in prior LDL-related research [Liu2019HCOMP, Geng2015].
Jobs dataset - JQ1, JQ2, and JQ3 We used a dataset of 2,000 job-related tweets collected and annotated by Liu et al. [P16-1099]. The dataset contains responses from three annotation tasks. Originally, five crowdsourced annotators each from Mechanical Turk and FigureEight (10 annotators total) labelled the data. The task for JQ1 was to identify the point of view of the tweet (i.e., 1st person, 2nd person, 3rd person, unclear, or not job related). JQ2 was to capture the employment status of the subject of the tweet (i.e., employed, not in labor force, not employed, unclear, or not job-related). The final task was to identify whether there was any mention of an employment transition event in the tweet (i.e., getting hired/job seeking, getting fired, quitting a job, losing job some other way, getting promoted/raised, getting cut in hours, complaining about work, offering support, going to work, coming home from work, none of the above but job related, or not job-related).
Natural Scenes (NS) dataset (The datasets are available for download at http://ldl.herokuapp.com/download.) We also use the natural scenes (NS) dataset from [Geng2016]. This set contains 2,000 images of natural scenes. Each image has a label distribution over nine labels (i.e., plant, sky, cloud, snow, building, desert, mountain, water, and sun), with labels collected from ten human annotators. Each image has 36 associated features.
BU-3DFE Facial Expression (FE) dataset This dataset contains 2,500 facial expression images, each of which is associated with a label distribution over six label categories (i.e., happiness, sadness, surprise, fear, anger, and disgust). The dataset was collected by Yin et al. [face_dataset]. For each image, the labels come from 23 human annotators.
4.2 Experiments on Label Pooling
In evaluating our pooling results according to the simulation-based methods described in Section 3.3, we noticed that, frequently, the values of the loss function on the synthetic data were either all greater or all less than their corresponding values in the training data (recall that we evaluated the model on the training data because, given its purpose of estimating population statistics from samples rather than predicting individual values on new items, held-out data was not necessary). This made the test defined by Algorithm 3 meaningless. Thus, rather than use it, we instead compute, for each parameter value to be tested and its corresponding pooling, the difference between the mean loss $\bar{L}_{\mathrm{syn}}$ on the synthetic data and the loss $L_{\mathrm{train}}$ on the training data, divided by the standard deviation $\sigma_{\mathrm{syn}}$ of the synthetic data loss:

$$\frac{\bar{L}_{\mathrm{syn}} - L_{\mathrm{train}}}{\sigma_{\mathrm{syn}}} \qquad (4)$$
and seek the parameter that minimizes this quantity. The optimal number of clusters ($k$) for each sampler is given in Table 1 and Table 2. For the JQ1, JQ2, and JQ3 datasets, the F clustering models outperformed the other clustering methods based on these results, followed by L clustering on the same label sets. As expected, K and G come last. Figure 7 shows the histograms of the loss functions on the JQ1 dataset for L clustering, while the standardized difference of Eq. 4 is shown in Figure 7 and Figure 7 for the bootstrap and cluster samplers. In contrast to the jobs dataset (JQ1, JQ2, and JQ3), for NS and FE, G outperformed the other models, followed by either F (for NS) or K (for FE). These results are related to the structure of the label distributions in each dataset.
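A minimal sketch of Eq. 4 and the resulting selection rule, assuming the synthetic losses have already been computed (for example, with the sampler sketched in Section 3.3):

```python
import numpy as np

def standardized_difference(train_loss, synthetic_losses):
    """Eq. 4: (mean synthetic loss - training loss) / std of the synthetic losses."""
    synthetic_losses = np.asarray(synthetic_losses, dtype=float)
    return (synthetic_losses.mean() - train_loss) / synthetic_losses.std()

# Hypothetical selection over candidate cluster counts:
# scores = {k: standardized_difference(train_loss[k], sim_losses[k]) for k in k_values}
# best_k = min(scores, key=scores.get)
```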
The hyperparameter for NBP is the neighborhood size ($\epsilon$). To build the synthetic datasets for NBP, we use the bootstrap and NBP-based samplers defined in Algorithms 3–4. Figure 7 and Figure 7 show the standardized difference with piecewise linear fitting for the bootstrap and NBP samplers. Table 3 gives the optimum neighborhood sizes ($\epsilon$) based on each sampling method. Due to the nature of NBP, we also report the median of all the neighborhood sizes. The sizes identified using the NBP sampler outperformed the sizes identified using the bootstrap sampler. For FE, both of the identified values utilized the entire dataset available for pooling, while, in contrast, the other datasets utilized approximately a quarter of the entire set.
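For the piecewise linear (elbow) selection of the radius, a brute-force two-segment least-squares fit is one simple possibility; this sketch is our stand-in for whatever fitting routine was actually used.

```python
import numpy as np

def elbow_point(radii, scores):
    """Fit two line segments over every candidate breakpoint and return the radius
    at the breakpoint with the smallest combined squared error (the 'elbow')."""
    radii, scores = np.asarray(radii, float), np.asarray(scores, float)
    best_err, best_radius = np.inf, radii[0]
    for b in range(2, len(radii) - 1):          # keep at least two points per segment
        err = 0.0
        for xs, ys in ((radii[:b], scores[:b]), (radii[b:], scores[b:])):
            coef = np.polyfit(xs, ys, 1)        # least-squares line for this segment
            err += float(np.sum((np.polyval(coef, xs) - ys) ** 2))
        if err < best_err:
            best_err, best_radius = err, radii[b]
    return best_radius
```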
| Model | JQ1 | JQ2 | JQ3 | NS | FE |
|---|---|---|---|---|---|
| FMM | 29 | 12 | 11 | 36 | 16 |
| | 0.201 | 0.151 | 0.347 | 0.340 | 0.080 |
| GMM | 2 | 3 | 5 | 6 | 37 |
| | 0.700 | 0.639 | 1.416 | 0.215 | 0.008 |
| K-Means | 5 | 16 | 36 | 6 | 22 |
| | 0.716 | 0.809 | 1.215 | 0.654 | 0.049 |
| LDA | 2 | 12 | 7 | 4 | 35 |
| | 0.428 | 0.201 | 0.587 | 0.443 | 0.064 |
| Model | JQ1 | JQ2 | JQ3 | NS | FE |
|---|---|---|---|---|---|
| FMM | 14 | 7 | 35 | 7 | 16 |
| | 0.193 | 0.170 | 0.269 | 0.935 | 0.080 |
| GMM | 4 | 2 | 6 | 17 | 37 |
| | 0.819 | 0.777 | 1.225 | 0.285 | 0.008 |
| K-Means | 10 | 15 | 33 | 12 | 22 |
| | 0.712 | 0.797 | 1.126 | 0.675 | 0.049 |
| LDA | 7 | 5 | 5 | 4 | 35 |
| | 0.243 | 0.237 | 0.625 | 0.602 | 0.064 |
| | JQ1 | JQ2 | JQ3 | NS | FE |
|---|---|---|---|---|---|
| | 5 | 6 | 8 | 4 | 4.5 |
| | 0.358 | 0.358 | 0.704 | 0.568 | 0.080 |
| | 967 | 922 | 863.5 | 548.5 | 1,250 |
| | 1,000 | 1,000 | 1,000 | 1,000 | 1,250 |
| | 3 | 2 | 2 | 2 | 3 |
| | 0.317 | 0.257 | 0.409 | 0.444 | 0.080 |
| | 875 | 645 | 293 | 265 | 1,250 |
| | 1,000 | 1,000 | 1,000 | 1,000 | 1,250 |
4.3 Supervised Learning Experiments
Following the same experimental process as Liu et al. [Liu2019HCOMP], we use the refined labels to train and test supervised learning models. Here, we tested two supervised learning models from Liu et al. (a CNN, Table 4, and an LSTM, Table 5) on our held-out data, with the goal of predicting the pooled label distribution of each data item. Note that the input features differ among the datasets. The JQ1, JQ2, and JQ3 datasets are text-based [Liu2019HCOMP]
and the NS and FE features are the vectorized representations used by Geng [Geng2015]. We evaluated the predictions against the refined labels generated by the clustering and NBP methods, (1) as a traditional supervised learning problem, by measuring the accuracy, and (2) as a probability distribution problem, using the KL divergence.

4.3.1 Results from Label Clustering
Looking at the KL-divergence results, the CNNs (Table 4) outperformed the LSTM models. In all instances, the labels refined through clustering outperformed the baseline. The labels refined using the G models outperformed the other models for the JQ1 and JQ2 datasets, while for JQ3 and NS, K outperformed the others. For the FE dataset, the labels refined by the F model performed better than the other models. These results could be related to the size of the label spaces: JQ3 had the largest label space, while the others were within a similar range. Moreover, in JQ3, human annotators were allowed to provide multiple labels per data item, in contrast to the other tasks, where only one label choice was allowed per item. For the LSTMs, the observations were similar to those for the CNNs, except for JQ1, where L outperformed the other models.
4.3.2 Results from Label Pooling
In our experiments, we used the two radii picked using the bootstrap and NBP samplers. Looking at the mean KL-divergence for the CNNs, NBP outperformed the clustering models. The accuracy results were consistent with the KL-divergence results in the majority of cases, with the exception of the FE dataset. As with the clustering approaches, the results from the CNNs outperformed those obtained with the LSTMs. These observations are discussed further in the next section.
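For reference, a minimal sketch (ours) of the two evaluation views used in Tables 4–5; the direction of the KL divergence and the argmax-based accuracy are assumptions, since the exact definitions are not spelled out above.

```python
import numpy as np
from scipy.stats import entropy            # entropy(p, q) = D_KL(p || q)

def evaluate_predictions(predicted, refined, smooth=1e-12):
    """predicted, refined: (n_items, n_labels) label distributions."""
    mean_kl = np.mean([entropy(r + smooth, p + smooth)
                       for r, p in zip(refined, predicted)])
    accuracy = np.mean(predicted.argmax(axis=1) == refined.argmax(axis=1))
    return mean_kl, accuracy
```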
| Dataset | KL: Raw Labels | KL: F | KL: G | KL: K | KL: L | KL: NBP | KL: NBP | Acc: Raw Labels | Acc: F | Acc: G | Acc: K | Acc: L | Acc: NBP | Acc: NBP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jobs dataset - Supervised Learning Classification (CNN) | | | | | | | | | | | | | | |
| JQ1 | 1.088 | 0.346 | 0.265 | 0.270 | 0.370 | 0.021 | 0.028 | 0.537 | 0.747 | 0.575 | 0.677 | 0.727 | 1.000 | 0.974 |
| JQ2 | 1.072 | 0.483 | 0.148 | 0.306 | 0.738 | 0.064 | 0.032 | 0.477 | 0.720 | 1.000 | 0.652 | 0.648 | 0.916 | 1.000 |
| JQ3 | 1.440 | 0.772 | 0.366 | 0.341 | 1.033 | 0.206 | 0.117 | 0.271 | 0.313 | 0.467 | 0.553 | 0.330 | 0.528 | 0.600 |
| Natural Scenes dataset - Supervised Learning Classification (CNN) | | | | | | | | | | | | | | |
| NS | 0.985 | 0.589 | 0.469 | 0.235 | 0.551 | 0.117 | 0.192 | 0.282 | 0.816 | 0.186 | 0.340 | 0.828 | 0.529 | 0.044 |
| Facial Expressions dataset - Supervised Learning Classification (CNN) | | | | | | | | | | | | | | |
| FE | 0.081 | 0.0009 | 0.071 | 0.081 | 0.081 | 2.927 | 1.045 | 0.259 | 0.353 | 0.315 | 0.259 | 0.227 | 1.000 | 1.000 |
| Dataset | KL: Raw Labels | KL: F | KL: G | KL: K | KL: L | KL: NBP | KL: NBP | Acc: Raw Labels | Acc: F | Acc: G | Acc: K | Acc: L | Acc: NBP | Acc: NBP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jobs dataset - Supervised Learning Classification (LSTM) | | | | | | | | | | | | | | |
| JQ1 | 0.907 | 0.874 | 0.918 | 0.788 | 0.770 | 0.550 | 0.533 | 0.865 | 0.874 | 0.794 | 0.861 | 0.844 | 0.987 | 1.000 |
| JQ2 | 0.989 | 1.038 | 0.624 | 0.746 | 1.084 | 0.659 | 0.669 | 0.841 | 0.861 | 1.000 | 0.855 | 0.832 | 0.964 | 1.000 |
| JQ3 | 1.567 | 1.358 | 1.003 | 0.910 | 1.456 | 0.789 | 0.748 | 0.643 | 0.621 | 0.642 | 0.718 | 0.612 | 0.782 | 0.821 |
5 Discussion
Discrepancies in the results obtained from different clustering models correspond to discrepancies in the label distributions of the different datasets and in the behaviors of the different clustering models. For instance, the label distributions in the jobs dataset (Figure 10) tend to be skewed towards one label choice, while this was not the case for NS (Figure 10) and FE (Figure 10). This skewing in the jobs dataset label distributions may be due to how the human annotators interpreted the questions asked about the data.

In contrast to the clustering methods, NBP as a whole showed promising results throughout the experiments. On the NS and FE datasets, it showed a pattern of increasing KL divergence as the radius increased. This behavior seems to depend on how the label distributions are structured. For instance, the skewed label distributions in the jobs dataset result in some items having no neighbors at smaller radii.
The jobs dataset by Liu et al. [Liu2019HCOMP] contains labels obtained from two crowdsourcing platforms. Manual inspection suggests there exist population-level disparities in the distributions provided by each. This phenomenon could be studied more rigorously by modeling user behaviors and traits to improve PLDL. One approach, for instance, could involve measuring inter-annotator agreement with measures such as Cohen’s Kappa [LAZAR2017299] to either resolve conflicts during annotation or use them for refining the distribution estimates. However, a challenge would be acquiring a dataset that contains annotator information, as such information is generally removed in publicly available datasets.
Table 3 raises questions about hyperparameter selection for NBP. As our main goal is to share labels between neighbors, looking at the median number of neighbors per data item presents some insights. All the datasets had a median of at least a quarter of the entire dataset, except for FE. One reason could be that FE is, among the datasets we consider, the one with the largest population of annotators.
While single-label learning has a number of established performance measures, such measures are not as well established in label distribution learning. Geng [Geng2015] analyzed 41 different measures and identified six (KL-divergence, Chebyshev, Clark, Canberra, Cosine similarity, and Intersection) as most effective. In our work, we use KL-divergence, one of the identified measures; it measures information loss between probability distributions. The use of our sampling techniques for model evaluation and hyperparameter selection contributes to the establishment of standard procedures for LDL evaluation.
6 Conclusion
Gathering labeled data is an evident resource bottleneck for population-based label distribution learning (PLDL). We introduced and studied new methods for refining the label distributions of data items for PLDL based on the labels of their neighbors. Neighborhood-based pooling (NBP) is a semi-supervised learning approach that uses information-theoretic measures to pool similar label distributions. We compared NBP as a pooling technique for supervised learning to clustering methods introduced in prior research. We also introduced new methods based on population-level hypothesis testing for selecting models for label refinement. Our results show that NBP is a feasible approach for refining label distribution estimates.
7 Acknowledgments
We thank the anonymous reviewers for their helpful feedback and suggestions. Many thanks to Ifeoma Nwogu, Cecilia O. Alm, Victoria Maung, James Spann, and Nandula Perera for their contributions and conversations.
References
Appendix A Appendix


