AgreementLearning: An End-to-End Framework for Learning with Multiple Annotators without Groundtruth

by Chongyang Wang, et al.
University of Cambridge

The annotation of domain experts is important for medical applications where the objective groundtruth is ambiguous to define, e.g., the rehabilitation for some chronic diseases and the prescreening of some musculoskeletal abnormalities without further medical examinations. However, improper use of the annotations may hinder the development of reliable models. On one hand, forcing the use of a single groundtruth generated from multiple annotations is less informative for the modeling. On the other hand, feeding the model with all the annotations without proper regularization is noisy given existing disagreements. To address these issues, we propose a novel agreement learning framework to tackle the challenge of learning from multiple annotators without objective groundtruth. The framework has two streams, with one stream fitting the multiple annotators and the other stream learning agreement information between the annotators. In particular, the agreement learning stream produces regularization information for the classifier stream, tuning its decision to be better in line with the agreement between the annotators. The proposed method can be easily plugged into existing backbones developed with majority-voted groundtruth or multiple annotations. Experiments on two medical datasets demonstrate improved agreement levels with annotators.





The knowledge of domain experts is essential for training deep learning models in medical applications, especially when the objective groundtruth is not apparent to non-experts or is ambiguous given the input data alone. That is, the decision-making, i.e. the detection, classification, and segmentation process, is based not only on the presented data but also on the expertise or experience of the annotator. An important part of supervised learning is therefore to fit the model to the domain experts’ annotations.

Figure 1: The proposed framework regularizes the classifier that fits with all the annotators with the estimated agreement information between annotators.

Figure 2: An overview of the proposed agreement learning framework, which comprises i) (above) the classifier stream that fits with all the annotators; and ii) (below) the agreement learning stream that learns to estimate the agreement between annotators and leverage such information to regularize the classifier.

In supervised learning, the input normally comprises pairs of $(x_i, y_i)$, where $x_i$ and $y_i$ are respectively the data and the label of the $i$-th sample. Given the annotations provided by multiple annotators, typical methods aim to provide a single set of groundtruth labels. Therein, a common practice is to aggregate these multiple annotations with majority voting Surowiecki (2005). This approach is more meaningful when a larger number of annotations per $x_i$ is available and no obvious differences in expertise are assumed between annotators Lampert et al. (2016). However, collecting annotations from domain experts is costly, while differences in expertise or experience sometimes exist. Besides majority voting, some have tried to estimate the groundtruth label using STAPLE Warfield et al. (2004), which is based on the Expectation-Maximization (EM) algorithm. Nevertheless, such a method is sensitive to the variance in the annotations and the data size Lampert et al. (2016); Karimi et al. (2020). When the number of annotations per $x_i$ is modest, efforts are put into creating models that utilize all the annotations with multi-score learning Meng et al. (2011) or soft labels Hu et al. (2016). Recent approaches have instead focused on leveraging or learning the expertise of the annotators while training the model Long et al. (2013); Long and Hua (2015); Healey (2011); Guan et al. (2018); Ji et al. (2021); Yan et al. (2014, 2010); Tanno et al. (2019); Zhang et al. (2020). A basic idea is to refine the classification or segmentation towards the underlying groundtruth by modeling the annotators.
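To make the two aggregation styles mentioned above concrete, the following minimal sketch contrasts a majority-voted label with a soft label for one sample. The function names and the tie-breaking rule are assumptions made for illustration only; they are not taken from any of the cited works.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the most frequent label among the available annotations.

    Ties are broken toward the smallest label, an arbitrary choice
    made only for this sketch.
    """
    counts = Counter(annotations)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def soft_label(annotations, positive=1):
    """Return the fraction of annotators that chose the positive class."""
    return sum(1 for a in annotations if a == positive) / len(annotations)


# Three annotators labelled the same sample.
votes = [1, 1, 0]
hard = majority_vote(votes)  # single aggregated label
soft = soft_label(votes)     # keeps the disagreement as a fraction
```

The hard label discards the dissenting vote, while the soft label retains it, which is the information loss that motivates learning from all the annotations.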

In this paper, we focus on a hard situation where the groundtruth is ambiguous to define, for example because no objective groundtruth exists in a specific scenario. For instance, in the analysis of bodily movement behavior for chronic-pain (CP) rehabilitation, the self-awareness of people with CP about their exhibited pain or fear-related behaviors is low, thus physiotherapists play a key role in judging it Felipe et al. (2015); Singh et al. (2016). However, since the physiotherapists assess the behavior on the basis of visual observations, they may disagree on the judgment or groundtruth. Similarly, in abnormality prescreening for bone X-rays, except for abnormalities like fractures and hardware implantation that are obvious and deterministic, other types like degenerative diseases and miscellaneous abnormalities are mainly diagnosed with further medical examinations Rajpurkar et al. (2017). The objective of modeling in this paper is therefore to improve the overall agreement between the model and the annotators. An overview of our proposed agreement learning framework is shown in Fig. 2. Our contributions are four-fold:

  • We propose a novel agreement learning framework to directly leverage the agreement information stored in the annotations from multiple annotators to regularize the behavior of the classifier.

  • To improve the robustness of agreement learning, we propose a general agreement distribution and an agreement regression loss to model the uncertainty in annotations.

  • To regularize the classifier, we propose a regularization function to tune the classifier to better agree with all the annotators.

  • Our method noticeably improves existing backbones for better agreement levels with all the annotators on classification tasks in two medical datasets, involving data of body movement sequences and bone X-rays.

Related Work

Annotator Modeling.

The leveraging or learning of annotators’ expertise for better modeling is usually implemented in a two-step or multiphase manner, or integrated to run simultaneously. For the first category, one way to acquire the expertise is by referring to prior knowledge about the annotation, e.g. the years of experience of each annotator, and the discussion held on the disagreed annotations. With such prior knowledge, the studies in Long et al. (2013); Long and Hua (2015); Healey (2011) propose to distill the annotations, deciding which annotator to trust for disagreed samples. Without access to such prior knowledge, the expertise or behavior of an annotator can also be modeled given the annotation and the data, which could be used to weight each annotator in the training of a classification model Guan et al. (2018), or adopted to refine the segmentation learned from multiple annotators Ji et al. (2021). Closer to ours are the methods that simultaneously model the expertise of annotators while training the classifier. Previous efforts include probabilistic models Yan et al. (2014, 2010) driven by EM algorithms, and multi-head models that directly model annotators as confusion matrices estimated in comparison with the underlying groundtruth Tanno et al. (2019); Zhang et al. (2020). All these methods assume the existence of an underlying groundtruth for each $x_i$, of which the annotations are noisy estimations. Furthermore, during the evaluation, such approaches usually compare the model with the objective groundtruth (e.g., the biopsy result in cancer screening) or the groundtruth agreed by extra or left-out annotators. However, when objective groundtruth does not exist, how to learn from the subjective annotations remains an open question.

Uncertainty Modeling.

Uncertainty modeling is a popular topic in the computer vision domain, especially for tasks of semantic segmentation and object detection. Therein, the proposed methods can be categorized into two groups: i) the Bayesian methods, where parameters of the posterior distribution (e.g. mean and variance) of the uncertainty are estimated with Monte Carlo dropout Leibig et al. (2017); Kendall et al. (2017); Ma et al. (2017) and parametric learning Hu et al. (2020); Charpentier et al. (2020) etc.; and ii) 'non-Bayesian' alternatives, where the distribution of uncertainty is learned with ensemble methods Lakshminarayanan et al. (2016), variance propagation Postels et al. (2019), and knowledge distillation Shen et al. (2021) etc. Besides their complex and time-consuming training or inference strategies, another characteristic of these methods is the dependence on Gaussian or Dirac delta distributions as the prior assumption.

Model Evaluation without Groundtruth.

In the context of modeling multiple annotations without groundtruth, typical measures for evaluation are the metrics of agreement. For example, Kleinsmith et al. (2011) uses metrics of agreement, e.g. Cohen’s Kappa Cohen (1960) and Fleiss’ Kappa Fleiss (1971), to compare the agreement level between a system and an annotator with the agreement level between other unseen annotators, in a cross-validation manner. Therein, it remains unknown how well the model performs when learned from all the annotators. To this end, Lovchinsky et al. (2019) proposes a metric named discrepancy ratio. In short, the metric compares the performance of the model-annotator pairs with that of the annotator-annotator pairs, where the performance can be computed as discrepancy, e.g. with absolute error, or as agreement, e.g. with Cohen’s Kappa. In this paper, we use Cohen’s Kappa together with such a metric to evaluate the performance of the model. We refer to this as the agreement ratio.


The core of our proposed method is to learn to estimate the agreement between different annotators based on their raw annotations, and simultaneously utilize the agreement estimation to regularize the training of the final classification tasks. Therein, different components of the proposed method concern: i) the learning of agreements between annotators; ii) regularizing the classifier with such information; and iii) alleviating the possible imbalances during training. In testing or inference, the model estimates the annotators’ agreement level based on the current data input, which is then used to aid the classification. In this paper, we consider a dataset comprising $N$ samples $X = \{x_i\}_{i=1,\dots,N}$, with each sample $x_i$ being an image or a timestep in a body movement data sequence. For each sample $x_i$, $r_i^j$ denotes the annotation provided by the $j$-th annotator, with $\alpha_i \in [0, 1]$ being the agreement computed between annotators. For a binary task, $r_i^j \in \{0, 1\}$. With such a dataset $\mathcal{D} = \{x_i, r_i^j\}_{i=1,\dots,N}^{j=1,\dots,J}$, the proposed method aims to improve the agreement level with all the annotators. It should be noted that, for each sample $x_i$, the method does not expect the annotations to be available from all the annotators.

Learning Agreement with Uncertainty Modeling

To enable a robust learning of the agreement between annotators, we consider modeling the uncertainty that could exist in the annotations. In our scenarios, the uncertainty comes from the annotators’ varying expertise exhibited in their annotations across different data samples, which may not follow specific prior distributions. Inspired by the study of Li et al. (2020) that proposed to use a general distribution for uncertainty modeling in the bounding box regression of object detection, without relying on any prior distributions, we further propose a general agreement distribution $G(y_i)$ for agreement learning (see the upper part of Fig. 3). Therein, the distribution values are the possible agreement levels $y_i$ between annotators with a range of $[y_i^0, y_i^n]$, which is further discretized into $\{y_i^0, y_i^1, \dots, y_i^{n-1}, y_i^n\}$ with a uniform interval of 1. The general agreement distribution has a property $\sum_{k=0}^{n} G(y_i^k) = 1$, which can be implemented with a softmax layer with $n+1$ nodes. The predicted agreement $\hat{y}_i$ for regression can be computed as the weighted sum of all the distribution values

$$\hat{y}_i = \sum_{k=0}^{n} G(y_i^k)\, y_i^k. \tag{1}$$

Figure 3: The learning of the agreement between annotators is modeled with a general agreement distribution using agreement regression loss (above), with the X axis of the distribution being the agreement levels and the Y axis being the respective probabilities. The learning can also be implemented as a linear regression task with RMSE (below).
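As a minimal sketch of the general agreement distribution, the snippet below turns $n+1$ raw scores into a distribution with a softmax and takes the probability-weighted sum of the levels as the predicted agreement. The function names and the rescaling of the levels to $[0, 1]$ (level $k$ mapped to $k/n$) are assumptions made for illustration, not details taken from the method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_agreement(logits):
    """Expected agreement level under the discretized distribution.

    len(logits) == n + 1, i.e. n bins give n + 1 softmax nodes.
    Assumption: level k is rescaled to k / n so the result lies in [0, 1].
    """
    n = len(logits) - 1
    probs = softmax(logits)
    levels = [k / n for k in range(n + 1)]
    return sum(p * y for p, y in zip(probs, levels))


# 20 bins (21 nodes) with uniform scores: the expectation sits mid-range.
uniform = predicted_agreement([0.0] * 21)
# Mass concentrated on the last node: the expectation approaches 1.
peaked = predicted_agreement([0.0] * 20 + [50.0])
```

Because the whole probability vector is kept, both the flatness and the skew of the distribution remain available to later layers, in addition to the scalar expectation.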

For training the predicted agreement value $\hat{y}_i$ towards the target agreement $\alpha_i$, inspired by the effectiveness of quantile regression in understanding the property of conditional distribution Koenker and Hallock (2001); Hao et al. (2007); Fan et al. (2019), we propose a novel Agreement Regression (AR) loss defined by

$$\mathcal{L}_{AR}(\hat{y}_i, \alpha_i) = \max\left[\alpha_i(\hat{y}_i - \alpha_i),\ (\alpha_i - 1)(\hat{y}_i - \alpha_i)\right]. \tag{2}$$

Compared with the original quantile regression loss, the quantile $q$ is replaced with the agreement $\alpha_i$ computed at the current input sample $x_i$. The quantile $q$ is usually fixed for a dataset, so as to understand the underlying distribution of the model’s output at a given quantile. By replacing $q$ with $\alpha_i$, we optimize the general agreement distribution to focus on the given agreement level dynamically across samples.
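A pinball-style loss in which the fixed quantile is swapped for the per-sample agreement can be sketched as follows. The function name and the sign convention (the residual taken as prediction minus target) are assumptions of this sketch; the opposite convention would swap which side of the target is penalized more heavily.

```python
def agreement_regression_loss(y_hat, alpha):
    """Pinball-style loss with the quantile replaced by the agreement alpha.

    y_hat: predicted agreement, alpha: target agreement, both in [0, 1].
    Assumed residual convention: diff = y_hat - alpha.
    """
    diff = y_hat - alpha
    return max(alpha * diff, (alpha - 1.0) * diff)


# With a high target agreement (0.8), overshooting by 0.1 costs 0.8 * 0.1,
# while undershooting by 0.1 costs 0.2 * 0.1 under this convention.
over = agreement_regression_loss(0.9, 0.8)
under = agreement_regression_loss(0.7, 0.8)
exact = agreement_regression_loss(0.8, 0.8)
```

The asymmetry changes from sample to sample because it is driven by each sample's own agreement level rather than by one quantile chosen for the whole dataset.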

In Li et al. (2021), the authors proposed to use the top-$k$ values of the distribution and their mean to indicate the shape (flatness) of the distribution, which provides the level of uncertainty in object classification. In our case, all probabilities of the distribution are used to regularize the classifier. While this also informs the shape of the distribution from the perspective of uncertainty modeling, it additionally reveals the skewness, which reflects the high or low agreement level learned at the current data sample. Two fully-connected layers with ReLU and Sigmoid activations, respectively, are then used to process such information and produce the agreement indicator $\tilde{y}_i$ for regularization.
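The mapping from the full probability vector to a scalar indicator can be sketched as a two-layer head, ReLU followed by Sigmoid. All weights, layer sizes, and function names below are hypothetical placeholders chosen for illustration; in practice the weights would be learned.

```python
import math


def dense(inputs, weights, biases):
    """Fully-connected layer: weights[o][i] maps input i to output o."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def agreement_indicator(probs, w1, b1, w2, b2):
    """Two-layer head: ReLU hidden layer, then a single Sigmoid output."""
    hidden = [max(0.0, h) for h in dense(probs, w1, b1)]  # ReLU
    (logit,) = dense(hidden, w2, b2)                      # one output node
    return 1.0 / (1.0 + math.exp(-logit))                 # Sigmoid


# Toy 3-node distribution and hand-picked (hypothetical) weights.
probs = [0.2, 0.3, 0.5]
w1, b1 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[-1.0, 1.0]], [0.0]
indicator = agreement_indicator(probs, w1, b1, w2, b2)
```

With these toy weights the head compares the mass on the lowest and highest levels, so a distribution skewed towards high agreement yields an indicator above 0.5.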

Learning Agreement with Linear Regression.

Straightforwardly, we can also formulate the agreement learning as a plain linear regression task, modelled by a fully-connected layer with a Sigmoid activation function (see the lower part of Fig. 3). Then, the predicted agreement $\hat{y}_i$ is directly taken as the agreement indicator $\tilde{y}_i$ for regularization. Given the predicted agreement $\hat{y}_i$ and target agreement $\alpha_i$ at each input sample $x_i$, by using Root Mean Squared Error (RMSE), the linear regression loss is computed as

$$\mathcal{L}_{RMSE}(\hat{y}, \alpha) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - \alpha_i\right)^2}. \tag{3}$$


It should be noted that, the proposed AR loss can also be used for this linear regression variant, which may help optimize the underlying distribution towards the given agreement level. In the experiments, an empirical comparison between different variants for agreement learning is conducted.
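The linear variant can be sketched in a few lines: a single Sigmoid unit produces the predicted agreement, and RMSE measures its distance from the targets over a batch. The feature vector, weights, and function names are hypothetical; only the Sigmoid output and the RMSE criterion come from the description above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_agreement(features, weights, bias):
    """Single fully-connected unit with a Sigmoid activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def rmse_loss(predicted, targets):
    """Root mean squared error between predicted and target agreements."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Two samples with hypothetical features; targets are agreement levels.
preds = [linear_agreement([1.0, 2.0], [0.0, 0.0], 0.0),
         linear_agreement([0.5, 0.5], [2.0, 2.0], 0.0)]
loss = rmse_loss(preds, [0.5, 1.0])
```

Here the scalar output is passed on unchanged as the indicator, so any regression error reaches the classifier directly.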

Regularizing the Classifier with Agreement

Since the high-level information implied by the agreement between annotators could provide extra hints in classification tasks, we utilize the agreement indicator $\tilde{y}_i$ to regularize the classifier training towards providing outcomes that are more in agreement with the annotators. Given a binary classification task (a multi-class task can be decomposed into several binary ones), at input sample $x_i$, we denote the original predicted probability towards the positive class of the classifier to be $\hat{p}_\theta(x_i)$. The general idea is that, when the learned agreement indicator is i) at chance level, i.e. $\tilde{y}_i = 0.5$, $\hat{p}_\theta(x_i)$ shall stay unchanged; ii) biased towards the positive/negative class, the value of $\hat{p}_\theta(x_i)$ shall be regularized towards the respective class. For these, we propose a novel regularization function written as

$$\tilde{p}_\theta(x_i) = \frac{\hat{p}_\theta(x_i)\, e^{\lambda(\tilde{y}_i - 0.5)}}{\hat{p}_\theta(x_i)\, e^{\lambda(\tilde{y}_i - 0.5)} + \left(1 - \hat{p}_\theta(x_i)\right) e^{\lambda(0.5 - \tilde{y}_i)}}, \tag{4}$$

where $\tilde{p}_\theta(x_i)$ is the regularized probability towards the positive class of the current binary task, and $\lambda$ is a hyperparameter controlling the scale at which the original predicted probability $\hat{p}_\theta(x_i)$ changes towards $\tilde{p}_\theta(x_i)$ when the agreement indicator increases/decreases. Fig. 4 shows the property of the function: for the original predicted probability $\hat{p}_\theta(x_i)$, the function with larger $\lambda$ augments the effect of the learned agreement indicator $\tilde{y}_i$, so that the output is regularized more strongly towards the class that the indicator favors; when $\tilde{y}_i$ is at 0.5, where annotators are unable to reach an above-chance opinion about the task, the regularized probability stays unchanged with $\tilde{p}_\theta(x_i) = \hat{p}_\theta(x_i)$.

Figure 4: The property of the regularization function. X and Y axes are the agreement indicator $\tilde{y}_i$ and regularized probability $\tilde{p}_\theta(x_i)$, respectively. $\tilde{p}_\theta(x_i)$ is regularized towards the class for which $\tilde{y}_i$ is high, with the scale controlled by $\lambda$.
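One function with the stated properties (unchanged output at an indicator of 0.5, a push towards the favored class otherwise, and a strength set by a scale hyperparameter) is an exponential re-weighting of the two class probabilities. The sketch below assumes that form; the names `regularize`, `p_hat`, `y_tilde`, and `lam` are illustrative.

```python
import math


def regularize(p_hat, y_tilde, lam):
    """Re-weight the positive-class probability by the agreement indicator.

    p_hat:   original probability towards the positive class, in [0, 1]
    y_tilde: agreement indicator in [0, 1]; 0.5 is chance level
    lam:     scale hyperparameter; larger values push harder
    """
    pos = p_hat * math.exp(lam * (y_tilde - 0.5))
    neg = (1.0 - p_hat) * math.exp(lam * (0.5 - y_tilde))
    return pos / (pos + neg)


unchanged = regularize(0.3, 0.5, 3.5)   # chance-level indicator
pushed_up = regularize(0.5, 0.9, 3.5)   # indicator favors the positive class
pushed_down = regularize(0.5, 0.1, 3.5) # indicator favors the negative class
```

At an indicator of 0.5 both exponentials equal 1, so the function reduces to the identity, and the output always stays a valid probability because it is a ratio of non-negative terms.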

Alleviating Imbalances in Training

In this subsection, two types of imbalance are targeted for the classifier stream that learns from multiple annotators. The first is the class imbalance present in the annotation of each annotator, while the other imbalance happens when the model is biased towards a small group of the annotators.

Annotation Balancing for Each Annotator.

For the classifier stream, given the regularized probability $\tilde{p}_\theta(x_i)$ at the current input sample $x_i$, the classifier is updated using the sum of the losses computed according to the available annotation $r_i^j$ from each annotator. Due to the varying expertise of annotators as well as the nature of the task (i.e., positive samples are sparse), the annotation from each annotator could be noticeably imbalanced. To address this problem, we adopt the Focal Loss (FL) Lin et al. (2017) proposed for dealing with class imbalances in object detection, which can be written as

$$\mathcal{L}_{FL}(p, g) = -\left|g - p\right|^{\gamma}\left(g \log(p) + (1 - g)\log(1 - p)\right), \tag{5}$$

where $p$ is the predicted probability of the model towards the positive class at the current data sample, $g \in \{0, 1\}$ is the binary groundtruth, and $\gamma \geq 0$ is the focusing parameter used to control the threshold for judging the well-classified. A larger $\gamma$ leads to a lower threshold so that more samples would be treated as well-classified and down-weighted. In our scenario, the FL function is integrated into the following loss function to compute the loss for each annotator

$$\mathcal{L}_\theta^j = \frac{1}{N^j}\sum_{i=1}^{N^j} \mathcal{L}_{FL}\left(\tilde{p}_\theta(x_i), r_i^j\right), \tag{6}$$

where $N^j$ is the number of samples that have been labelled by the $j$-th annotator. By default, the losses computed from all the annotators are averaged to be the final loss of the classifier

$$\mathcal{L}_\theta = \frac{1}{J}\sum_{j=1}^{J} \mathcal{L}_\theta^j. \tag{7}$$
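The per-annotator focal loss and its average over annotators can be sketched as below. Missing annotations are represented by `None`, the clipping constant `eps` is added for numerical safety, and averaging over each annotator's labelled samples is an assumption of this sketch; the function names are illustrative.

```python
import math


def focal_loss(p, g, gamma, eps=1e-7):
    """Binary focal loss: cross-entropy scaled by |g - p| ** gamma."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    cross_entropy = -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
    return abs(g - p) ** gamma * cross_entropy


def classifier_loss(probs, annotations, gammas):
    """Average the per-annotator focal losses.

    probs[i]:          regularized positive-class probability of sample i
    annotations[j][i]: 0/1 label from annotator j, or None if not labelled
    gammas[j]:         focusing parameter used for annotator j
    """
    per_annotator = []
    for labels, gamma in zip(annotations, gammas):
        losses = [focal_loss(p, r, gamma)
                  for p, r in zip(probs, labels) if r is not None]
        if losses:  # skip annotators with no labels in this batch
            per_annotator.append(sum(losses) / len(losses))
    if not per_annotator:
        raise ValueError("no annotations available")
    return sum(per_annotator) / len(per_annotator)


# Two samples; the second annotator labelled only the first one.
total = classifier_loss([0.9, 0.2], [[1, 0], [1, None]], [2.0, 2.0])
```

With `gamma` set to 0 the focal term equals 1 and the loss reduces to the plain logarithmic loss, which makes the down-weighting of easy samples straightforward to compare.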


Additionally, searching for $\gamma$ manually for each annotator could be cumbersome, especially for a dataset labeled by numerous annotators. In this paper, to save such effort, we compute $\gamma$ for each annotator given the number of samples per class of each binary task. The hypothesis is that, for annotations biased more towards one class, $\gamma$ should be set larger, since a larger number of samples tend to be well-classified. Following Cui et al. (2019), we leverage the effective number of samples to compute each $\gamma$ as


where is the number of samples for the majority class in the current binary task annotated by annotator , .

In de La Torre et al. (2018), a Weighted Kappa Loss (WKL) was used to compute the agreement-oriented loss between the output of a model and the annotation of an annotator. As developed from Cohen’s Kappa, this loss may guide the model to pay attention to the overall agreement instead of the local accuracy. We compare this loss function to the logarithmic one in the experiments. The value range of this loss is $(-\infty, \log 2]$, thus a Sigmoid function is applied before we sum the loss from each annotator.

Model Balancing for all the Annotators.

For the classifier stream, according to Equation 7, losses computed from all the annotators are averaged to be reduced by the model. As such, the model may tend to focus on reducing the loss computed from a few of the annotators. In other words, such reduction to the overall loss may not promise a good agreement with all the annotators. Therefore, by adding to Equation 7, we propose a simple and efficient Annotator Focal Loss (AFL) to balance the learning of the classifier as


where is another focusing parameter similar to the one used in FL, with . Here, the average loss serves as a threshold to judge the balance of the model in the learning from different annotators. Specifically, loss computed from an annotator that is comparatively higher than such average loss is multiplied by a larger factor, to which the model would pay more attention.


In this section, we evaluate our proposed method on two medical datasets, where multiple human experts are involved in the annotation processes and the objective groundtruth is ambiguous to define.


Two medical datasets are selected to evaluate the proposed framework, involving data of body movement sequences and bone X-rays.


The EmoPain Aung et al. (2015) dataset contains skeleton-like movement data collected from 18 participants with CP and 12 healthy participants while they perform a variety of full-body physical rehabilitation activities (e.g. stretching forward and sitting down). In total, we have 46 activity sequences collected from these 30 participants, with each sequence lasting for about 10 minutes (or 36,000 samples). A binary task is included to detect the presence of protective behavior (e.g. hesitation, guarding) Keefe and Block (1982) exhibited by participants with CP during the performances. The detection of such behavior could be leveraged to generate automatic feedback and inform therapeutic personalized interventions Wang et al. (2021a). Four experts were recruited to provide the binary annotations of the presence or absence of protective behavior per timestep for each CP participant data sequence.

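Framework/Annotator | Classifier loss | Agreement learning | EmoPain | MURA
Majority-voting | CE | | 1.0417 | 0.7616
Majority-voting | WKL | | 1.0452 | 0.7638
Learn-from-all | | | 0.9733 | 0.7564
Learn-from-all | | | 1.0067 | 0.7606
Learn-from-all | | | 1.0189 | 0.7665
Learn-from-all | | | 1.0442 | 0.7667
Learn-from-all | | | 1.0456 | 0.7636
AgreementLearning (Ours) | CE | Linear | 1.0477 | 0.7727
AgreementLearning (Ours) | CE | Distributional | 1.0508 | 0.7796
AgreementLearning (Ours) | WKL | Linear | 1.0454 | 0.7707
AgreementLearning (Ours) | WKL | Distributional | 1.0482 | 0.7773
Annotator 1 | | | 0.9613 | 1.0679
Annotator 2 | | | 1.0231 | 0.9984
Annotator 3 | | | 1.0447 | 0.9743
Annotator 4 | | | 0.9732 | 0.9627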
Table 1: The ablation experiment on the EmoPain and MURA datasets. Majority-voting refers to the method using the majority-voted groundtruth for training. CE and WKL refer to the logarithmic and weighted kappa loss functions used in the classifier stream, respectively. Linear and Distributional refer to the agreement learning stream with linear regression and general agreement distribution, respectively. The best performance in each framework/annotator set is marked in bold for each dataset.


The MURA dataset Rajpurkar et al. (2017) comprises 40,561 radiographic images of 7 upper extremity types (i.e., shoulder, humerus, elbow, forearm, wrist, hand, and finger), and is used for the binary classification of abnormality. This dataset is officially split into training (36,808 images), validation (3,197 images), and testing (556 images) sets, with no overlap in subjects. The training and validation sets are publicly available, with each image labelled by a radiologist. While some abnormalities like fractures and hardware implantation are deterministic, others like degenerative diseases and miscellaneous abnormalities are mostly determined given further examination. Thus, at the prescreening stage, such abnormality classification relies on the expertise of the expert. In the testing set, the authors recognized this by recruiting six additional radiologists for annotation, and defined the groundtruth with majority voting among three randomly picked radiologists. The remaining three radiologists achieved Cohen’s Kappa with such groundtruth of 0.731, 0.763, and 0.778, respectively. To simulate the opinions of different experts for the data we have access to, three virtual annotators are purposely created to reach Cohen’s Kappa with the existing annotator of 0.80, 0.75, and 0.70, respectively.

Implementation Details

For experiments on the EmoPain dataset, the state-of-the-art HAR-PBD network Wang et al. (2021a) is adopted as the backbone, and Leave-One-Subject-Out validation is conducted across the participants with CP. The average of the performances achieved on all the folds is reported. The training data is augmented by adding Gaussian noise and cropping, as seen in Wang et al. (2021b). The number of bins used in the general agreement distribution is set to 20, i.e., the respective softmax layer has 21 nodes. The scale hyperparameter used in the regularization function is set to 3.5. The focusing parameter used in the annotator focal loss is set to 1.5.

For experiments on the MURA dataset, the DenseNet-169 network Huang et al. (2017) pretrained on the ImageNet dataset Deng et al. (2009) is used as the backbone. The original validation set is used as the testing set, where the first view (image) from each of the 7 upper extremity types of a subject is used. Images are all resized to a common size, while images in the training set are further augmented with random lateral inversions and rotations of up to 30 degrees. The number of bins is set to 5, and the scale hyperparameter of the regularization function is set to 3.0, with the focusing parameter of the annotator focal loss set to 1.0.

For all the experiments, the classifier stream is implemented with a fully-connected layer using a Sigmoid activation for the binary classification task. Adam Kingma and Ba (2014) is used as the optimizer with a learning rate of 1e-4, and the number of epochs is set to 100. The search on hyperparameters (i.e., the number of bins of the general agreement distribution, the scale hyperparameter of the regularization function, and the focusing parameter of the annotator focal loss) is included in the supplementary material. For the classifier stream, the logarithmic loss is adopted by default as used in Equations 5, 6, 7, and 9, while the WKL loss is used for comparison when mentioned. For the agreement learning stream, the AR loss is used for the distributional variant, while the RMSE is used for the linear regression variant. We implement our method with the TensorFlow deep learning library on a PC with an RTX 3080 GPU and 32 GB memory.

Agreement Computation

For a binary task, the agreement level $\alpha_i$ is computed as

$$\alpha_i = \frac{1}{\hat{J}}\sum_{j=1}^{\hat{J}} w_i^j\, r_i^j,$$

where $\hat{J}$ is the number of annotators that have labelled the sample $x_i$. In this way, $\alpha_i \in [0, 1]$ stands for the agreement of annotators towards the positive class of the current binary task. In this work, we assume each sample was labelled by at least one annotator. $w_i^j$ is the weight for the annotation provided by the $j$-th annotator, which could be used to reflect the different levels of expertise of the annotators. The weight can be set manually given prior knowledge about the annotator, or used as a learnable parameter for the model to estimate. In this work, we treat annotators equally by setting $w_i^j$ to 1. We leave the discussion on other situations to future work.
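The target agreement for one sample can be sketched as the weighted mean of the binary annotations that are actually available. Representing a missing annotation with `None` and the function name `agreement_level` are assumptions of this sketch; weights default to 1, matching the equal treatment of annotators described above.

```python
def agreement_level(annotations, weights=None):
    """Agreement towards the positive class for a single sample.

    annotations: one entry per annotator, 0/1 or None when not labelled
    weights:     optional per-annotator weights; defaults to 1 for everyone
    """
    available = [(r, 1.0 if weights is None else weights[j])
                 for j, r in enumerate(annotations) if r is not None]
    if not available:
        raise ValueError("each sample needs at least one annotation")
    return sum(w * r for r, w in available) / len(available)


full = agreement_level([1, 1, 0, 1])           # all four annotators labelled
partial = agreement_level([1, None, 0, None])  # only two annotators labelled
```

A value of 0.5 marks a sample on which the available annotators are evenly split, which is the chance level referred to in the regularization step.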


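Following Lovchinsky et al. (2019), we evaluate the performance of a model by using the agreement ratio $\Delta$ written as

$$\Delta = \frac{\frac{1}{J}\sum_{j=1}^{J} \kappa\left(\tilde{P}_\theta, R^j\right)}{\frac{1}{C_J^2}\sum_{j < j'} \kappa\left(R^j, R^{j'}\right)},$$

where $\tilde{P}_\theta$ denotes the predictions of the model and $R^j$ the annotations of the $j$-th annotator. The numerator computes the average agreement for the pairs of predictions of the model and annotations of each annotator, and the denominator computes the average agreement between annotators, with $C_J^2$ denoting the number of different annotator pairs. $\kappa$ is Cohen’s Kappa. The agreement ratio is larger than 1 when the model achieves better performance than the average annotator.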
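The metric as described in words, the average model-to-annotator kappa divided by the average annotator-to-annotator kappa, can be sketched as follows. This sketch assumes binarized model predictions, that every annotator labelled every sample, and that kappa is used without any further transformation; the function names are illustrative.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two equally long label sequences."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    expected = sum((a.count(label) / n) * (b.count(label) / n)
                   for label in set(a) | set(b))
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, and 1.0 is returned by convention in this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def agreement_ratio(predictions, annotations):
    """Average model-annotator kappa over average annotator-annotator kappa."""
    model_side = sum(cohen_kappa(predictions, r)
                     for r in annotations) / len(annotations)
    pairs = list(combinations(annotations, 2))
    annotator_side = sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
    return model_side / annotator_side


# Three toy annotators over eight samples.
r1 = [1, 1, 0, 0, 1, 0, 1, 0]
r2 = [1, 1, 0, 0, 1, 0, 0, 0]
r3 = [1, 0, 0, 0, 1, 0, 1, 1]
ratio = agreement_ratio(r1, [r1, r2, r3])
```

In this toy case the "model" simply copies the first annotator, so its average kappa with the annotators (0.75) exceeds the average pairwise kappa (0.5) and the ratio is above 1.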


As shown in the first section of Table 1, models trained with majority-voted groundtruth produce agreement ratios of 1.0417 and 0.7616 with annotation (class) balancing on the EmoPain and MURA datasets, respectively. For the EmoPain dataset, such performance (1.0417) suggests that the model is better than the average performance of annotators, while the performance on the MURA dataset (0.7616) is below the average performance. In this section, we first demonstrate the improvements introduced by our method. Then, we study the impact of the proposed AR loss.

The Impact of Balancing Methods.

As shown in the second section of Table 1, directly exposing the model to all the annotations is harmful, with performances lower than the majority-voting framework of 0.9733 and 0.7564 achieved on the two datasets, respectively. By using the two balancing methods during training, the performance on the EmoPain dataset is improved to 1.0189 but is still lower than what can be achieved using majority-voted groundtruth, while a better performance of 0.7665 on the MURA dataset is achieved. These results show the importance of balancing for the learn-from-all framework. The performance of the majority-voting (1.0452/0.7638) and learn-from-all (1.0456/0.7667) frameworks is further improved by using the WKL loss, which was designed to optimize a model at the global agreement with an annotator rather than the local accuracy.

The Impact of Agreement Learning.

For both datasets, as shown in the third section of Table 1, with our proposed agreement learning method using the general agreement distribution, the best performances of 1.0508 and 0.7796 are achieved, respectively. In addition, the combination of general agreement distribution and AR loss shows better performance than the methods with linear regression and RMSE on both datasets (1.0477 and 0.7727). Such results could be due to the fact that the agreement indicator produced from the linear regression is directly the estimated agreement value, which could be largely affected by the errors made during agreement learning. In contrast, with the general agreement distribution, the information passed to the classifier is the shape and skewness of the distribution. Thus, it is more tolerant to any errors made by the weighted sum that is used for the regression in the agreement learning. Such an advantage can also be taken as a way to capture the uncertainty that may exist in the annotations.

Comparing the results with the Annotators.

For the last section of Table 1, the annotation of each annotator is used to compute the agreement ratio against the other annotators. For the EmoPain dataset, the best methods in the majority-voting (1.0452) and learn-from-all (1.0456) frameworks show competitive, if not better, performance than annotator 3 (1.0447), who has the best agreement level with all the other annotators. The proposed method raises the agreement ratio with all the annotators further to 1.0508. Such performance suggests that, when adopted in real life, the model is able to analyze the movement behavior of people with CP at a level that is highly in agreement with the human experts. However, for the MURA dataset, the best performance achieved so far by the agreement learning framework (0.7796) is still lower than 1. This suggests that, in the current task setting, the agreement between the model and the annotators is around 22% lower than the average agreement between the annotators themselves. One reason could be the difficulty of the task: as shown in Rajpurkar et al. (2017), the same backbone achieved a performance similar to, if not better than, the other radiologists for only one (wrist) of the seven upper extremity types. In this paper, the testing set comprises all the extremity types, which makes the experiment even more challenging. Future work may explore better backbones to tackle this.

The Impact of Agreement Regression Loss.

The proposed AR loss can be used for both the linear and distributional agreement learning. However, as seen in Table 2 and Table 3, the performance of linear agreement learning is better with RMSE rather than with the AR loss. The design of the AR loss assumes the loss computed for a given quantile is in accord with its counterpart of agreement level. Thus, such result may be due to the gap between the quantile of the underlying distribution of the linear regression and the targeted agreement level. Therefore, the resulting agreement indicator passed to the classifier may not reflect the actual agreement level. Instead, for linear regression, a vanilla loss like RMSE promises that the regression value is fitting towards the agreement level. The proposed general agreement distribution directly adopts the range of agreement levels to be the distribution values, which helps to narrow the gap when AR loss is used. Therein, the agreement indicator is extracted from the shape and skewness of such distribution (probabilities of all distribution values), which could better reflect the agreement level when updated with AR loss. As shown, the combination of distributional agreement learning and AR loss achieves the best performance in each dataset.

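| Classifier Loss | Agreement Learning Type | Agreement Learning Loss | Agreement Ratio |
|---|---|---|---|
| CE | Linear | RMSE | 1.0477 |
| CE | Linear | AR | 0.9976 |
| CE | Distributional | RMSE | 1.0289 |
| CE | Distributional | AR | 1.0508 |
| WKL | Linear | RMSE | 1.0454 |
| WKL | Linear | AR | 1.035 |
| WKL | Distributional | RMSE | 1.0454 |
| WKL | Distributional | AR | 1.0482 |
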
Table 2: The experiment on the EmoPain dataset for analyzing the impact of Agreement Regression (AR) loss on agreement learning. The best performance in each agreement learning type is marked in bold.
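| Classifier Loss | Agreement Learning Type | Agreement Learning Loss | Agreement Ratio |
|---|---|---|---|
| CE | Linear | RMSE | 0.7727 |
| CE | Linear | AR | 0.7698 |
| CE | Distributional | RMSE | 0.7729 |
| CE | Distributional | AR | 0.7796 |
| WKL | Linear | RMSE | 0.7707 |
| WKL | Linear | AR | 0.7674 |
| WKL | Distributional | RMSE | 0.7724 |
| WKL | Distributional | AR | 0.7773 |
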
Table 3: The experiment on the MURA dataset for analyzing the impact of Agreement Regression (AR) loss on agreement learning. The best performance in each agreement learning type is marked in bold.


Conclusion

In this paper, we targeted the scenario of learning with multiple annotators where the groundtruth is ambiguous to define. Two medical datasets in this scenario were adopted for the evaluation. We showed that backbones developed with majority-voted groundtruth or multiple annotations can be easily enhanced to achieve better agreement levels with the annotators by leveraging the underlying agreement information stored in the annotations. For agreement learning, our experiments demonstrate the advantage of learning with the proposed general agreement distribution and agreement regression loss over other possible variants. Since only binary tasks were considered in this paper, future work may extend it to verify its efficacy on datasets with multiple classes. Additionally, the learning of annotators' expertise seen in Tanno et al. (2019); Zhang et al. (2020); Ji et al. (2021) could be leveraged to weight the agreement computation and learning proposed in our method for cases where annotators are treated differently.


  • M. S. Aung, S. Kaltwang, B. Romera-Paredes, B. Martinez, A. Singh, M. Cella, M. Valstar, H. Meng, A. Kemp, M. Shafizadeh, et al. (2015) The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal emopain dataset. IEEE transactions on affective computing 7 (4), pp. 435–451. Cited by: EmoPain..
  • B. Charpentier, D. Zügner, and S. Günnemann (2020) Posterior network: uncertainty estimation without ood samples via density-based pseudo-counts. arXiv preprint arXiv:2006.09239. Cited by: Uncertainty Modeling..
  • J. Cohen (1960) A coefficient of agreement for nominal scales. Educational and psychological measurement 20 (1), pp. 37–46. Cited by: Model Evaluation without Groundtruth..
  • Y. Cui, M. Jia, T. Lin, Y. Song, and S. Belongie (2019) Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277. Cited by: Annotation Balancing for Each Annotator..
  • J. de La Torre, D. Puig, and A. Valls (2018) Weighted kappa loss function for multi-class classification of ordinal data in deep learning. Pattern Recognition Letters 105, pp. 144–154. Cited by: Annotation Balancing for Each Annotator..
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: Implementation Details.
  • C. Fan, Y. Zhang, Y. Pan, X. Li, C. Zhang, R. Yuan, D. Wu, W. Wang, J. Pei, and H. Huang (2019) Multi-horizon time series forecasting with temporal attention learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2527–2535. Cited by: Learning Agreement with Uncertainty Modeling.
  • S. Felipe, A. Singh, C. Bradley, A. C. Williams, and N. Bianchi-Berthouze (2015) Roles for personal informatics in chronic pain. In 2015 9th International Conference on Pervasive Computing Technologies for Healthcare, pp. 161–168. Cited by: Introduction.
  • J. L. Fleiss (1971) Measuring nominal scale agreement among many raters.. Psychological bulletin 76 (5), pp. 378. Cited by: Model Evaluation without Groundtruth..
  • M. Guan, V. Gulshan, A. Dai, and G. Hinton (2018) Who said what: modeling individual labelers improves classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: Introduction, Annotator Modeling..
  • L. Hao and D. Q. Naiman (2007) Quantile regression. Sage. Cited by: Learning Agreement with Uncertainty Modeling.
  • J. Healey (2011) Recording affect in the field: towards methods and metrics for improving ground truth labels. In International conference on affective computing and intelligent interaction, pp. 107–116. Cited by: Introduction, Annotator Modeling..
  • N. Hu, G. Englebienne, Z. Lou, and B. Kröse (2016) Learning to recognize human activities using soft labels. IEEE transactions on pattern analysis and machine intelligence 39 (10), pp. 1973–1984. Cited by: Introduction.
  • P. Hu, S. Sclaroff, and K. Saenko (2020) Uncertainty-aware learning for zero-shot semantic segmentation. Advances in Neural Information Processing Systems 33. Cited by: Uncertainty Modeling..
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: Implementation Details.
  • W. Ji, S. Yu, J. Wu, K. Ma, C. Bian, Q. Bi, J. Li, H. Liu, L. Cheng, and Y. Zheng (2021) Learning calibrated medical image segmentation via multi-rater agreement modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12341–12351. Cited by: Introduction, Annotator Modeling., Conclusion.
  • D. Karimi, H. Dou, S. K. Warfield, and A. Gholipour (2020) Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Medical Image Analysis 65, pp. 101759. Cited by: Introduction.
  • F. J. Keefe and A. R. Block (1982) Development of an observation method for assessing pain behavior in chronic low back pain patients.. Behavior therapy. Cited by: EmoPain..
  • A. Kendall, V. Badrinarayanan, and R. Cipolla (2017) Bayesian segnet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In British Machine Vision Conference, Cited by: Uncertainty Modeling..
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Implementation Details.
  • A. Kleinsmith, N. Bianchi-Berthouze, and A. Steed (2011) Automatic recognition of non-acted affective postures. IEEE Transactions on Systems, Man, and Cybernetics, Part B 41 (4), pp. 1027–1038. Cited by: Model Evaluation without Groundtruth..
  • R. Koenker and K. F. Hallock (2001) Quantile regression. Journal of economic perspectives 15 (4), pp. 143–156. Cited by: Learning Agreement with Uncertainty Modeling.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2016) Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474. Cited by: Uncertainty Modeling..
  • T. A. Lampert, A. Stumpf, and P. Gançarski (2016) An empirical study into annotator agreement, ground truth estimation, and algorithm evaluation. IEEE Transactions on Image Processing 25 (6), pp. 2557–2572. Cited by: Introduction.
  • C. Leibig, V. Allken, M. S. Ayhan, P. Berens, and S. Wahl (2017) Leveraging uncertainty information from deep neural networks for disease detection. Scientific reports 7 (1), pp. 1–14. Cited by: Uncertainty Modeling..
  • X. Li, W. Wang, X. Hu, J. Li, J. Tang, and J. Yang (2021) Generalized focal loss v2: learning reliable localization quality estimation for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11632–11641. Cited by: Learning Agreement with Uncertainty Modeling.
  • X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang (2020) Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. arXiv preprint arXiv:2006.04388. Cited by: Learning Agreement with Uncertainty Modeling.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: Annotation Balancing for Each Annotator..
  • C. Long, G. Hua, and A. Kapoor (2013) Active visual recognition with expertise estimation in crowdsourcing. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3000–3007. Cited by: Introduction, Annotator Modeling..
  • C. Long and G. Hua (2015) Multi-class multi-annotator active learning with robust gaussian process for visual recognition. In Proceedings of the IEEE international conference on computer vision, pp. 2839–2847. Cited by: Introduction, Annotator Modeling..
  • I. Lovchinsky, A. Daks, I. Malkin, P. Samangouei, A. Saeedi, Y. Liu, S. Sankaranarayanan, T. Gafner, B. Sternlieb, P. Maher, et al. (2019) Discrepancy ratio: evaluating model performance when even experts disagree on the truth. In International Conference on Learning Representations, Cited by: Model Evaluation without Groundtruth., Metric.
  • L. Ma, J. Stückler, C. Kerl, and D. Cremers (2017) Multi-view deep learning for consistent semantic mapping with rgb-d cameras. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 598–605. Cited by: Uncertainty Modeling..
  • H. Meng, A. Kleinsmith, and N. Bianchi-Berthouze (2011) Multi-score learning for affect recognition: the case of body postures. In International Conference on Affective Computing and Intelligent Interaction, pp. 225–234. Cited by: Introduction.
  • J. Postels, F. Ferroni, H. Coskun, N. Navab, and F. Tombari (2019) Sampling-free epistemic uncertainty estimation using approximated variance propagation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2931–2940. Cited by: Uncertainty Modeling..
  • P. Rajpurkar, J. Irvin, A. Bagul, D. Ding, T. Duan, H. Mehta, B. Yang, K. Zhu, D. Laird, R. L. Ball, et al. (2017) Mura: large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957. Cited by: Introduction, MURA., Comparing the results with the Annotators..
  • Y. Shen, Z. Zhang, M. R. Sabuncu, and L. Sun (2021) Real-time uncertainty estimation in computer vision via uncertainty-aware distribution distillation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 707–716. Cited by: Uncertainty Modeling..
  • A. Singh, S. Piana, D. Pollarolo, G. Volpe, G. Varni, A. Tajadura-Jimenez, A. C. Williams, A. Camurri, and N. Bianchi-Berthouze (2016) Go-with-the-flow: tracking, analysis and sonification of movement and breathing to build confidence in activity despite chronic pain. Human–Computer Interaction 31 (3-4), pp. 335–383. Cited by: Introduction.
  • J. Surowiecki (2005) The wisdom of crowds. Anchor. Cited by: Introduction.
  • R. Tanno, A. Saeedi, S. Sankaranarayanan, D. C. Alexander, and N. Silberman (2019) Learning from noisy labels by regularized estimation of annotator confusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11244–11253. Cited by: Introduction, Annotator Modeling., Conclusion.
  • C. Wang, Y. Gao, A. Mathur, A. C. D. C. Williams, N. D. Lane, and N. Bianchi-Berthouze (2021a) Leveraging activity recognition to enable protective behavior detection in continuous data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 5 (2). Cited by: EmoPain., Implementation Details.
  • C. Wang, T. A. Olugbade, A. Mathur, A. C. D. C. Williams, N. D. Lane, and N. Bianchi-Berthouze (2021b) Chronic pain protective behavior detection with deep learning. ACM Transactions on Computing for Healthcare 2 (3), pp. 1–24. Cited by: Implementation Details.
  • S. K. Warfield, K. H. Zou, and W. M. Wells (2004) Simultaneous truth and performance level estimation (staple): an algorithm for the validation of image segmentation. IEEE transactions on medical imaging 23 (7), pp. 903–921. Cited by: Introduction.
  • Y. Yan, R. Rosales, G. Fung, M. Schmidt, G. Hermosillo, L. Bogoni, L. Moy, and J. Dy (2010) Modeling annotator expertise: learning when everybody knows a bit of something. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 932–939. Cited by: Introduction, Annotator Modeling..
  • Y. Yan, R. Rosales, G. Fung, R. Subramanian, and J. Dy (2014) Learning from multiple annotators with varying expertise. Machine learning 95 (3), pp. 291–327. Cited by: Introduction, Annotator Modeling..
  • L. Zhang, R. Tanno, M. Xu, C. Jin, J. Jacob, O. Ciccarelli, F. Barkhof, and D. C. Alexander (2020) Disentangling human error from the ground truth in segmentation of medical images. arXiv preprint arXiv:2007.15963. Cited by: Introduction, Annotator Modeling., Conclusion.


Chongyang Wang is supported by the UCL Overseas Research Scholarship (ORS) and Graduate Research Scholarship (GRS). This work was supported by the National Key R&D Program of China (2020YFB1313300).