Joint Multi-Dimensional Model for Global and Time-Series Annotations

05/06/2020 · by Anil Ramakrishna, et al.

Crowdsourcing is a popular approach to collecting annotations for unlabeled data instances. It involves collecting a large number of annotations from several, often naive, untrained annotators for each data instance, which are then combined to estimate the ground truth. Further, annotations of constructs such as affect are often multi-dimensional, with annotators rating multiple dimensions, such as valence and arousal, for each instance. Most annotation fusion schemes, however, ignore this aspect and model each dimension separately. In this work we address this by proposing a generative model for multi-dimensional annotation fusion, which models the dimensions jointly, leading to more accurate ground truth estimates. The proposed model is applicable to both global and time-series annotation fusion problems and treats the ground truth as a latent variable distorted by the annotators. The model parameters are estimated using the Expectation-Maximization algorithm, and we evaluate its performance using synthetic data and real emotion corpora as well as an artificial task with human annotations.







1 Introduction

Crowdsourcing is a popular tool used in collecting human judgments on subjective constructs such as emotion. Typical examples include annotations of images and video clips with categorical emotions or with continuous affective dimensions such as valence or arousal. Online platforms such as Amazon Mechanical Turk (MTurk) have risen in popularity owing to their inexpensive annotation costs and their ability to scale efficiently.

Crowdsourcing is also a popular approach to collecting labels for training supervised machine learning algorithms. Such labels are typically obtained from domain experts, which can be slow and expensive. For example, in the medical domain, it is often expensive to collect diagnosis information given laboratory tests since this requires judgments from trained professionals. On the other hand, unlabeled patient data may be easily available. Crowdsourcing has been particularly successful in such settings with easy availability of unlabeled data instances, since we can collect a large number of annotations from untrained and inexpensive workers over the Internet, which, when combined, may be comparable to or even better than expert annotations.


A typical crowdsourcing setting involves collecting annotations from a large number of workers; hence there is a need to combine them robustly to estimate the ground truth. The most common approach is to take simple averages for continuous annotations or perform majority voting for categorical annotations. However, this assumes uniform competency across all the workers, which is not always guaranteed or justified. Several alternative approaches have been proposed to address this challenge, each assuming a specific function modeling the annotators' behavior. In practice, it is common to collect annotations on multiple questions for each data instance in order to reduce costs and the annotators' mental load, or even to improve annotation accuracy. For example, if we are annotating valence and arousal for a given data instance (such as a single image or video segment), collecting annotations on both dimensions in one session per instance may be preferable to collecting valence annotations for all instances followed by arousal.

Such a joint annotation task may entail task-specific or annotator-specific dependencies between the annotated dimensions. In the aforementioned example, task-specific dependencies may occur due to inherent correlations between the valence and arousal dimensions depending on the experimental setup. Annotator-specific dependencies may occur due to a given annotator's (possibly incorrect or incomplete) understanding of the annotation dimensions. Hence it is of relevance to model the dimensions jointly. However, most state-of-the-art models in annotation fusion combine the annotations by treating the different dimensions independently.

Joint modeling of the annotation dimensions may result in more accurate estimates of the ground truth as well as in giving a better picture of the annotators’ behavior. In this work, we address this goal by proposing a multi-dimensional model which makes use of any potential relationships between the annotation dimensions while combining them. The model we propose is applicable to both the global annotation setting (such as while collecting emotion annotations on a picture, judgment about the overall tone of a conversation, etc.) as well as time series annotations (for example, time continuous annotations of audio/video clips on dimensions such as engagement or affect). Our model treats the hidden ground truth as latent variables and estimates them jointly with the annotator parameters using the Expectation Maximization (EM) algorithm [12]. We evaluate the model in both settings with both synthetic and real emotion corpora. We also create an artificial annotation task with controlled ground truth which is used in the model evaluation for both settings.

The main contributions of this work are as follows:

  1. We propose a unified model to capture relationships between annotation dimensions. For ease of exposition we focus on the linear case in this paper.

  2. The linear model we propose results in an annotator specific matrix which captures this annotator level relationship between the annotation dimensions.

  3. We create a novel multi-dimensional annotation task with controlled ground truth and use it to evaluate both the global and time series annotation settings of the model.

The rest of the paper is organized as follows. In Section 2, we review related work, and we motivate the problem in Section 3. In Section 4, we describe the proposed model and provide equations for parameter estimation using the EM algorithm (derivations are deferred to the appendix). We evaluate the model in Section 5 and provide conclusions in Section 6.

2 Related work

Several authors, most notably [45], assert the benefits of aggregating opinions from many people, which, under certain conditions, can be better than relying on a small number of experts. Often referred to as the wisdom of crowds, this approach has been remarkably popular in recent times, especially in fields such as psychology and behavioral sciences where a ground truth may not be easily accessible or may not exist. This popularity can be largely attributed to online crowdsourcing platforms such as MTurk that connect researchers with low-cost workers from around the globe. Along with cost, scalability is another major appeal of such tools, leading to their frequent use in machine learning for large-scale annotation of data instances such as images [13], audio/video clips [46] and text snippets [42].

Figure 1 shows a common setting in the crowdsourcing paradigm. For each data instance $i$, annotator $k$ provides a noisy annotation $a_i^{kd}$ which depends on the ground truth $y_i^d$, where $d$ is the dimension being annotated. Since we collect several annotations for each $y_i^d$, we need to aggregate them to estimate the unknown ground truth. The most common technique used in this aggregation is to take the average value in the case of numeric annotations or perform majority voting in the case of categorical annotations, as shown in Equation 1.


$$\hat{y}_i = \frac{1}{K}\sum_{k=1}^{K} a_i^k \qquad \text{or} \qquad \hat{y}_i = \operatorname*{arg\,max}_{c}\sum_{k=1}^{K}\mathbb{1}\{a_i^k = c\} \tag{1}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function.

While simple and easy to implement, this approach assumes consistent reliability among the different annotators, which is often unrealistic, especially on online platforms such as MTurk. To address this, several approaches have been suggested that account for annotator reliability in estimating the ground truth.
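As a concrete baseline, the simple fusion rules of Equation 1 can be sketched as follows. This is illustrative NumPy code; the array layout (one row per annotator) and function names are our own convention, not notation from the paper.

```python
import numpy as np

def fuse_continuous(annotations):
    """Average continuous annotations; rows are annotators, columns are
    instances. NaN marks instances an annotator skipped."""
    return np.nanmean(annotations, axis=0)

def fuse_categorical(annotations):
    """Majority vote over categorical labels, one row per annotator.
    Ties resolve to the smallest label value."""
    fused = []
    for col in annotations.T:
        values, counts = np.unique(col, return_counts=True)
        fused.append(values[np.argmax(counts)])
    return np.array(fused)
```

Both rules weight every annotator equally, which is exactly the assumption the models below relax.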

Fig. 1: Plate notation for a basic annotation model. $y_i^d$ is the latent ground truth for the given data instance (for the $d$-th question) and $a_i^{kd}$ is the rating provided by the $k$-th annotator.

Early efforts to capture reliability in annotation modeling [11], [41] assumed specific structure to the functions modeled by each annotator. Given a set of annotations along with the corresponding function parameters, the ground truth is estimated using the Maximum A Posteriori (MAP) estimator.



$$\hat{y}_i = \operatorname*{arg\,max}_{y}\; p(y)\prod_{k=1}^{K} p(a_i^k \mid y) \tag{2}$$

where $p(y)$ is the prior probability of the ground truth.

In [11], the categorical ground truth label is modified probabilistically by annotator $k$ using a stochastic matrix $\Pi^k$, as shown in Equation 3, in which each row is a multinomial conditional distribution given the ground truth:

$$p(a_i^k = c' \mid y_i = c) = \Pi^k_{c,\, c'} \tag{3}$$


Given annotations $a_i^k$ from the $K$ different annotators, their parameters $\Pi^k$ and the prior distribution of labels $p(y)$, the ground truth is estimated using MAP estimation as before:

$$\hat{y}_i = \operatorname*{arg\,max}_{c}\; p(y = c) \prod_{k=1}^{K} \Pi^k_{c,\, a_i^k} \tag{4}$$


The above expression makes a conditional independence assumption for the annotations given the ground truth label. Since we do not typically have the annotator parameters $\Pi^k$, these are estimated using the EM algorithm.
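To make this MAP estimate concrete, the following sketch computes the posterior over a categorical ground truth for one instance, assuming known confusion matrices and prior. The function name and array layout are illustrative, not from the paper.

```python
import numpy as np

def posterior_over_labels(annotations, confusion, prior):
    """Posterior P(y = c | annotations) for one instance, assuming
    conditional independence of annotators given the true label.

    annotations: length-K array of observed labels (one per annotator)
    confusion:   K x C x C array; confusion[k, c, a] = P(annotator k says a | truth c)
    prior:       length-C array of prior class probabilities
    """
    log_post = np.log(prior).copy()
    for k, a in enumerate(annotations):
        log_post += np.log(confusion[k, :, a])
    log_post -= log_post.max()           # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()
```

A full EM procedure would alternate this E-step with re-estimating each annotator's confusion matrix from the posteriors.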

Figure 2 shows an extension of the model in Figure 1 in which we learn a predictor (classifier/regression model) for the ground truth jointly with the annotator parameters. Such a predictor may be used to estimate the ground truth for new unlabeled data instances. This strategy of jointly modeling the annotator functions as well as the ground truth predictor has been shown to perform better than predictors trained independently on the estimated ground truth [34]. The ground truth estimate in this model is again obtained through MAP estimation, with the jointly learned predictor acting as the prior over the ground truth.

Fig. 2: Annotation model proposed by [34] with a jointly learned predictor. $x_i$ is the set of features for the $i$-th data instance; $y_i^d$ is the $d$-th dimension of the latent ground truth, which is modeled as a function of $x_i$; $a_i^{kd}$ is the rating provided by the $k$-th annotator.

Recently, several additional extensions have been proposed to the model in Figure 2. For example, in [1], the authors assume varying regions of annotator expertise in the data feature space and account for this using different label confusion probabilities for each region. The authors show that this leads to better estimation of annotator reliability and ground truth.

The models described so far are designed for tasks in which the goal is to rate some global property of the data. For example, in image-based emotion annotation, the task may be to provide annotations on affective dimensions such as valence and arousal conveyed by each image. However, human interactions often involve variations of these dimensions over time [28], which are captured using time series annotations of audio/video clips. Various tools have been developed to collect such annotations, including Anvil [23], Feeltrace [9], EMuJoy [29], Gtrace [10] and DARMA [18] (for a review of available tools and their properties, see [14] and [18]). In fusing such time series annotations, the previously mentioned models are applicable only if annotations from each frame are treated independently. However, this entails several unrealistic assumptions, such as independence between frames, zero lag in the annotators and synchronized response of the annotators to the underlying stimulus.

Several works have been proposed to capture the underlying reaction lag in the annotators. [30] proposed a generalization of Probabilistic Canonical Correlation Analysis (PCCA) [2] named Dynamic PCCA which captures temporal dependencies of the shared ground truth space in a generative setting, and incorporated a latent time warping process to implicitly handle the reaction lags in annotators. They have further proposed a supervised extension of their model which jointly learns a predictor function for the latent ground truth signal similar to [34]. [26] address the reaction lag by explicitly finding the time shift that maximizes the mutual information between expressive behaviors and their annotations. [19] generalize the work of [26] by using a linear time invariant (LTI) filter which can also handle any bias or scaling the annotators may introduce.

More recent works in annotation fusion include [22] in which the authors propose a variant of the model in Figure 1 with various annotator functions to capture four specific types of annotator behavior. [38] describes a mechanism named approval voting that allows annotators to provide multiple answers instead of one for instances where they are not confident. [40] uses repeated sampling for opinions from annotators over the same data instances to increase reliability in annotations.

Most of the models described above focus on combining annotations on each dimension separately. However, the annotation dimensions are often related. For example, many studies in the emotion literature have reported interrelationships between discrete emotion categories [47, 36]. The circumplex model [37], which attempts to capture these relationships by modeling the emotions as points in a two-dimensional space, has also been noted to exhibit v-shaped patterns in the joint distribution of valence and arousal [25]. In addition, in most practical applications, the annotation tasks themselves are multi-dimensional. For example, while collecting ratings on affective dimensions it is routine to collect annotations on valence, arousal and dominance together. Further, there may be dependencies between the internal definitions the annotators hold for the annotation dimensions; for example, while annotating emotional dimensions, a particular annotator may associate certain valence values with only a certain range of arousal. Hence it may be beneficial to model the different dimensions jointly while performing annotation fusion. However, research in this direction has been limited. [33] proposed a model which assumes joint Gaussian noise between the annotation dimensions, but their model fails to capture the structural dependencies described above between the annotation and ground truth dimensions. The model proposed in [30] can be generalized to combine the different annotation dimensions, but the authors do not evaluate on jointly annotated dimensions from a real dataset, as that is not the focus of their work. [39] jointly model continuous annotations on valence and arousal using personalized basis spline functions, on which functional PCA is applied to identify the dominant spline functions. Using this model, they estimate the ground truth for each data instance using a heuristic algorithm, but their model does not include a jointly trained ground truth predictor. It is therefore of relevance to model multi-dimensional annotation fusion as part of the unified annotator function and predictor modeling paradigm.

In this work, we propose a joint multi-dimensional model to address many of the gaps mentioned above. Our model captures annotator-specific linear relationships between different annotation dimensions, and is an extension of the Factor Analysis model [20]. It incorporates an annotator-specific transformation matrix parameter $F_k$, which explicitly captures the relationship between the annotation dimensions and enables clear interpretation of the estimated relationships; the matrix is jointly estimated with a predictor for the ground truth signal. We further provide generalizations of our model to both the global and time series annotation settings. We begin with a motivation, followed by a detailed description of the model and its parameter estimation in the next sections.

Fig. 3: Correlation heatmaps for annotations from a representative sample of emotion-annotated datasets (including the movie emotions corpus); v - valence, a - arousal, d - dominance, p - power

3 Motivation

To examine the relationships between the annotation dimensions, we created a plot of the absolute values of Pearson's correlation between annotation dimensions from four commonly studied emotional corpora in Figure 3: IEMOCAP [6], SEMAINE [27], RECOLA [35] and the movie emotion corpus from [25]. Each of these corpora includes annotations over affective dimensions such as valence, arousal, dominance and power. For the IEMOCAP corpus, we used global annotations, while the others include time series annotations of the affective dimensions from videos. In each case, the correlations were computed from concatenated annotation values between all the dimensions.

As is evident, in almost all cases, the annotation dimensions exhibit non-zero correlations. We attribute the inconsistent correlations between the dimensions across corpora to varying underlying affective narratives as well as differences in perceptions and biases introduced by individual annotators themselves (see Section A.1). The non-zero correlations highlight the benefit of modeling the annotation dimensions jointly. The model we propose is aimed at addressing this. We explain the model in detail in the next section.

4 Joint Multi-dimensional Annotation Model

4.1 Setup

Fig. 4: Proposed model. $x_i$ is the set of features for the $i$-th data instance, $y_i^d$ is the latent ground truth for the $d$-th dimension and $a_i^{kd}$ is the rating provided by the $k$-th annotator. Vectors $x_i$ and $a_i^k$ (shaded) are observed variables, while $y_i$ is latent. $a_i$ is the set of annotator ratings for the $i$-th instance.

The proposed model is shown in Figure 4. Each data instance $i$ has a feature vector $x_i$ and an associated multi-dimensional ground truth $y_i$, which is defined as follows:

$$y_i = f(x_i) + \epsilon$$

We assume that from a pool of $K$ annotators, a subset operates on each data instance and provides their annotation $a_i^k$:

$$a_i^k = g(F_k\, y_i) + \eta$$

where the index $k$ corresponds to the $k$-th annotator; $F_k$ is an annotator-specific matrix that defines his/her linear weights for each output dimension; $\epsilon$ and $\eta$ are noise terms defined individually in the next sections, along with the functions $f$ and $g$. In the global annotation setting, both $y_i \in \mathbb{R}^D$ and $a_i^k \in \mathbb{R}^D$, where $D$ is the number of dimensions being annotated; in the time series setting $y_i \in \mathbb{R}^{T \times D}$ and $a_i^k \in \mathbb{R}^{T \times D}$, where $T$ is the total duration of the data instance (audio/video signal). In all subsequent definitions, we use uppercase letters to denote various counts and lowercase letters to denote the corresponding index variables.

We make the following assumptions in our model.

  1. Annotations are independent for different data instances.

  2. The annotations for a given data instance are independent of each other given the ground truth.

  3. The ground truths for the different annotation dimensions are assumed to be conditionally independent of each other given the features $x_i$.
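Under these assumptions, the generative process for the global setting can be sketched as follows. This is a hypothetical simulation under illustrative choices: the names `W` and `F` stand in for the predictor weights and annotator matrices, and the Gaussian noise scales are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, D, K = 100, 5, 2, 10   # instances, features, annotation dims, annotators

W = rng.normal(size=(P, D))                        # ground-truth predictor weights
X = rng.normal(size=(N, P))                        # observed features
Y = X @ W + rng.normal(scale=0.1, size=(N, D))     # latent multi-dimensional ground truth

F = rng.normal(size=(K, D, D))                     # annotator-specific mixing matrices
# each annotator distorts the ground truth through their own matrix, plus noise
annotations = np.stack([Y @ F[k].T + rng.normal(scale=0.1, size=(N, D))
                        for k in range(K)])
```

Off-diagonal entries of each `F[k]` are what couple the annotation dimensions; setting them to zero recovers the independent-dimension baseline.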

4.2 Global annotation model

In this setting, the ground truth and annotations are $D$-dimensional vectors for each data instance. We define the ground truth and annotations as follows:

$$y_i = W^{\mathsf{T}} x_i + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$

$$a_i^k = F_k\, y_i + \eta_k, \qquad \eta_k \sim \mathcal{N}(0, \tau_k^2 I)$$

where $W$ is the weight matrix of the ground truth predictor and $F_k$ is the annotator-specific weight matrix. Each annotation dimension value for annotator $k$ is defined as a weighted average of the ground truth vector, with weights given by the corresponding row of $F_k$.

4.2.1 Parameter Estimation

The model parameters are estimated using Maximum Likelihood Estimation (MLE), in which they are chosen to be the values that maximize the marginal likelihood of the observed annotations.


Optimizing Equation 10 directly is intractable because of the integral inside the log term; hence we use the EM algorithm. Note that the model we propose assumes that only a random subset of all available annotators provides annotations on a given data instance, as shown in Figure 4. However, for ease of exposition, we overload the variable $K$ and use it here to denote the number of annotators that judge the given data instance.

4.2.2 EM algorithm

The Expectation Maximization (EM) algorithm to estimate the model parameters is shown below. It is an iterative algorithm in which the E and M-steps are executed repeatedly until an exit condition is encountered. Complete derivation of the model can be found in Appendix B.

Initialization We initialize by assigning the expected values and covariance matrices for the ground truth vectors to their sample estimates (i.e. sample mean and sample covariance) from the corresponding annotations. We then estimate the parameters as described in the maximization step using these estimates.

E-step In this step we take the expectation of the log likelihood function with respect to the conditional distribution of the latent ground truth given the annotations, and the resulting objective is maximized with respect to the model parameters in the M-step. Equations to compute the expected value and covariance matrices for the latent variable $y_i$ in the E-step are listed below.


The $\Sigma$ terms are covariance matrices between the subscripted random variables. $a_i$ and $\mathbb{E}[a_i]$ are $KD$-dimensional vectors obtained by concatenating the annotation vectors and their corresponding expected values.
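Because the latent ground truth and the concatenated annotations are jointly Gaussian under the model, the E-step reduces to the standard conditional Gaussian formulas. The helper below is an illustrative sketch of those formulas, not the paper's exact notation.

```python
import numpy as np

def conditional_gaussian(mu_y, mu_a, S_yy, S_ya, S_aa, a_obs):
    """For jointly Gaussian (y, a), return the conditional moments:
        E[y | a]   = mu_y + S_ya S_aa^{-1} (a - mu_a)
        Cov[y | a] = S_yy - S_ya S_aa^{-1} S_ya^T
    """
    gain = S_ya @ np.linalg.inv(S_aa)   # "regression" of y on a
    mean = mu_y + gain @ (a_obs - mu_a)
    cov = S_yy - gain @ S_ya.T
    return mean, cov
```

In the scalar case with y ~ N(0, 1) and a = y + unit-variance noise, observing a = 2 gives a posterior mean of 1 and posterior variance of 0.5, matching the usual shrinkage intuition.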

M-step In this step, we compute current estimates for the parameters as follows. The expectations shown below are over the conditional distribution of the latent ground truth given the annotations.

Note the similarity of the update equation for $W$ to the familiar normal equations. We use the soft estimate $\mathbb{E}[y_i]$ of the latent ground truth to find the expression for $W$ in each iteration. Here, $X$ is the feature matrix for all data instances; it includes the individual feature vectors in its rows. The remaining parameters on the right-hand side are taken from the previous iteration.
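The $W$ update can be sketched as an ordinary least-squares solve against the soft ground truth estimates from the E-step. This is an illustrative fragment assuming the expected ground truth values are stacked row-wise, mirroring the feature matrix.

```python
import numpy as np

def update_W(X, EY):
    """Normal-equation update for the predictor weights, using the soft
    (expected) ground truth EY from the E-step:
        W = (X^T X)^{-1} X^T E[Y]
    Solved via np.linalg.solve rather than an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ EY)
```

When the expected ground truth is exactly linear in the features, this update recovers the generating weights, which is a useful sanity check during implementation.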

Termination We run the algorithm until convergence, stopping model training when the change in log-likelihood falls below a preset threshold.

4.3 Time series annotation model

In this setting, the ground truth and the annotations are matrices with $T$ rows (time) and $D$ columns (annotation dimensions). The ground truth matrix $Y_i$ is defined as follows:

$$\operatorname{vec}(Y_i) = \operatorname{vec}(X_i W) + \epsilon$$

where $t \in \{1, \dots, T\}$ represents the time dimension and $T$ is the length of the time series. $X_i$ is the feature matrix, in which each row corresponds to features extracted from the data instance for one particular time stamp. $\operatorname{vec}(\cdot)$ is the vectorization operation, which flattens the input matrix in column-first order to a vector. $\epsilon$ is the additive noise vector with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.

In [19], the authors propose a linear model where the annotation function is a causal linear time invariant (LTI) filter of fixed width. The advantage of using an LTI filter is that it can capture scaling and time-delay biases introduced by the annotators.

The filter width $M$ is chosen such that $M \ll T$, where $T$ is the number of time stamps for which we have the annotations. The annotation function for dimension $d$ can be viewed as the left multiplication of a filter matrix $H_k^d$, as shown in Equation 12:

$$a_i^{k,d} = H_k^d\, y_i^d \tag{12}$$
We extend this model in our work to combine information from all of the annotation dimensions. Specifically, the ground truth is left multiplied by horizontally concatenated filter matrices, each corresponding to a different dimension as shown below.


$$a_i^{k,d'} = \left[ H_k^{d',1} \;\cdots\; H_k^{d',D} \right] \operatorname{vec}(Y_i) + \eta$$

with $M D^2$ unique filter parameters per annotator; $\eta$ is additive noise with $\eta \sim \mathcal{N}(0, \tau_k^2 I)$.
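A causal LTI filter matrix of this kind is a lower-triangular banded (Toeplitz) matrix. The sketch below is illustrative, assuming a filter shorter than the signal; it is not taken from the paper's implementation.

```python
import numpy as np

def filter_matrix(h, T):
    """Build the T x T causal LTI filter matrix for impulse response h
    (length M <= T). Left-multiplying a length-T signal by this matrix
    convolves the signal with h, using only past and current samples."""
    M = len(h)
    H = np.zeros((T, T))
    for t in range(T):
        for m in range(min(M, t + 1)):
            H[t, t - m] = h[m]
    return H
```

For example, the impulse response [0, 1] produces a one-sample delay, illustrating how such filters can absorb annotator reaction lag as well as scaling.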

4.3.1 Parameter Estimation

Estimating the model parameters as in the global model would require computing expectations over a vector of size $TD$. Since $T$ is the number of time stamps in the task and can be arbitrarily large, this may not be feasible in all tasks. For example, in the movie emotions corpus [25], annotations are collected at a rate of 25 frames per second, with each file of duration 30 minutes, or about 45k annotation frames. To avoid this, we use a variant of EM named Hard EM in which, instead of taking expectations over the entire conditional distribution of the latent ground truth, we find its mode. This variant has been shown to be comparable in performance to classic (Soft) EM despite being significantly faster and simpler [43]. This approach is similar to the parameter estimation strategy devised by [19] in their time series annotation model.

The likelihood function is similar to the global model in Equation 10 as shown below.

However, the integral here is with respect to the flattened ground truth vector $\operatorname{vec}(Y_i)$.

4.3.2 EM algorithm

The EM algorithm for the time series annotation model is listed below. Complete derivations can be found in Appendix C.

Initialization Unlike the global annotation model, we initialize the ground truth randomly, since we observed better performance compared to initializing it with the annotation means. Given this initial estimate, the model parameters are estimated as described in the maximization step below.

E-step In this step we assign the latent ground truth to the mode of its conditional distribution given the annotations. Since this distribution is normal (see Appendix B), finding the mode is equivalent to minimizing the following expression.

M-step Given the estimate for the ground truth from the E-step, we substitute it into the likelihood function and maximize with respect to the parameters. The estimates for the different parameters are shown below.

$N_k$ is the number of files annotated by user $k$; the reshaped ground truth matrix is obtained from its vectorized form as described in subsection C.1.2.

Termination We run the algorithm until convergence, stopping model training when the change in log-likelihood falls below a preset threshold.

5 Experiments & Results

We evaluate the models described above on three different types of data: synthetic data, an artificial task with human annotations, and finally real data. We describe these below. We compare our joint models with their independent counterparts as baselines, in which each annotation dimension is modeled separately. This allows us to highlight the benefits of moving to a multi-dimensional annotation fusion scheme with everything else kept constant. Update equations for the independent model can be obtained by running the models described above for each dimension separately with $D = 1$. Note that the independent model is similar in the global setting to the regression model proposed in [34] (with the ground truth scaled by the now-scalar annotator weight). In the time series setting it is identical to the model proposed by [19].

The models are evaluated by comparing the estimated ground truth with the actual ground truth. We report model performance using two metrics: the concordance correlation coefficient ($\rho_c$) [24] and Pearson's correlation coefficient ($\rho$). $\rho_c$ measures departures from the concordance line (the line passing through the origin at a 45° angle); hence it is sensitive to rotations or rescaling in the predicted ground truth. Given two samples $x$ and $y$, the sample concordance coefficient is defined as

$$\hat{\rho}_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}$$

where $\bar{x}, \bar{y}$ are the sample means, $s_x^2, s_y^2$ the sample variances and $s_{xy}$ the sample covariance.

We also report results in Pearson’s correlation to highlight the accuracy of the models in the presence of rotations.
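Lin's sample concordance coefficient can be computed directly from its definition; the sketch below uses biased (1/N) variance estimates, as in the original estimator.

```python
import numpy as np

def concordance_cc(x, y):
    """Sample concordance correlation coefficient (Lin, 1989):
        rho_c = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
    """
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()            # biased 1/N variances
    cxy = ((x - mx) * (y - my)).mean()   # biased 1/N covariance
    return 2 * cxy / (vx + vy + (mx - my) ** 2)
```

Note that a constant shift between the two series lowers $\rho_c$ while leaving Pearson's $\rho$ unchanged, which is precisely why both metrics are reported.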

As noted before, the models proposed in this paper are closely related to the Factor Analysis model, which is vulnerable to issues of unidentifiability [17] due to the matrix factorization. Different types of unidentifiability have been studied in the literature, such as factor rotation, scaling and label switching. In our experiments, we handle label switching through manual judgment (by reassigning the estimated ground truth between dimensions if necessary), as is common in psychology [21], but defer the task of choosing an appropriate prior on the rotation matrix to address the other unidentifiabilities to future work.

We report aggregate test set results using $k$-fold cross validation. To address overfitting, within each fold we evaluate the parameters obtained after each iteration of the EM algorithm by estimating the ground truth on a disjoint validation set, and pick those with the highest concordance correlation as the parameter estimates of the model. We then estimate the performance of this parameter set in predicting the ground truth on a separate held-out test set for that fold. Finally, we also report statistically significant differences between the joint and independent models at a fixed false-positive rate ($\alpha$) in all our experiments.

5.1 Global annotation model

The global annotation model uses the EM algorithm described in Section 4.2.2 to estimate the ground truth for discrete annotations. We evaluate the model in the three settings described below. Statistical significance tests were run by computing bootstrap confidence intervals [15] on the differences in model performance across the folds. To establish statistical significance, we ran the joint and independent models to obtain test set predictions from all folds. Given these, we ran bootstrap iterations in which the test set predictions were sampled with replacement, from which $\rho_c$ and $\rho$ were estimated for each dimension. We conclude significance if the evaluation metric being examined was higher in at least $(1 - \alpha)$ of the bootstrap runs.
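The bootstrap comparison can be sketched as follows. This is an illustrative simplification of the procedure described above: the function name, the metric argument and the resampling details are our own.

```python
import numpy as np

def bootstrap_superiority(pred_a, pred_b, truth, metric, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples in which model A beats model B on
    the given metric; values of at least (1 - alpha) suggest significance
    at false-positive rate alpha."""
    rng = np.random.default_rng(seed)
    n = len(truth)
    wins = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample test instances with replacement
        if metric(pred_a[idx], truth[idx]) > metric(pred_b[idx], truth[idx]):
            wins += 1
    return wins / n_boot
```

Resampling predictions jointly with the ground truth keeps each bootstrap replicate a valid paired comparison between the two models.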

Fig. 5: Performance of the global annotation model on the synthetic dataset: (a) concordance ($\rho_c$) and (b) Pearson ($\rho$); * - statistically significant

5.1.1 Synthetic data

We created synthetic data according to the model described in Section 4.2, with random features for 100 data instances, each with 2 dimensions of annotations (i.e., $D = 2$). 10 artificial annotators, each with a unique random $F_k$ matrix, were used to produce annotations for all the data instances. Elements of the feature matrices were sampled from the standard normal distribution, while the elements of the $F_k$ matrices and the ground truth were sampled from fixed distributions, and $W$ was estimated from the sampled ground truth and $X$. Since its off-diagonal elements are non-zero, our choice of $F_k$ represents tasks in which the annotation dimensions are related to each other.

Figure 5 shows the performance of the joint and independent models in predicting the ground truth. For both dimensions, the proposed joint model predicts the ground truth with considerably higher accuracy, as shown by the higher correlations, highlighting the advantage of modeling the annotation dimensions jointly when they are expected to be related to each other.

Fig. 6: Performance of the global annotation model on the artificial dataset: (a) concordance ($\rho_c$) and (b) Pearson ($\rho$); Sat - saturation, Bri - brightness; * - statistically significant

5.1.2 Artificial data

Since crowdsourcing experiments typically involve collecting subjective annotations, they seldom have a well-defined ground truth. As a result, most annotation models are evaluated against expert annotations collected from specially trained users. For example, while collecting annotations on medical data, labels estimated by fusing annotations from naive users may be evaluated against those provided by experts such as doctors. However, this poses a circular problem, since the expert annotations themselves may be subjective and combining them to estimate the ground truth is not straightforward. To address this, we created an artificial task with controlled ground truth, on which we collect annotations from multiple annotators and evaluate the fused annotation values against the known ground truth, similar to [4]. In our task, the annotators were asked to provide their best estimates of the perceived saturation and brightness values of monochromatic images. The relationship between perceived saturation and brightness is well known as the Helmholtz-Kohlrausch effect [8], according to which increasing the saturation of an image leads to an increase in perceived brightness, even if the actual brightness is held constant.

In our experiments, we collected annotations on images from two regimes: one with fixed saturation and varying brightness, and vice versa. This approach allows us to evaluate the impact of a change in either brightness or saturation while the other is held constant. The color of each image was chosen randomly (independently of the image's saturation and brightness) between green and blue. Annotations were collected on MTurk, and the annotators were asked to familiarize themselves with saturation and brightness using an online interactive tool before providing their ratings. In both experiments, a reference image with fixed brightness and saturation was inserted after every ten annotation images to control for any bias in the annotators. The reference images were not identified as such and appeared to the annotators as regular annotation images. For parameter estimation, RGB values were chosen as the features for each image.

We used the joint model to estimate the ground truth for the two regimes separately, since we expect the relationship between saturation and brightness to differ between them. In each experiment, the predicted values of the underlying dimension being varied were compared with the actual values. For example, in the experiment with varying saturation and fixed brightness, the joint model was run on the full annotations, but only the estimated values of saturation were compared with the ground truth saturation. For the independent model, we use annotation values of the underlying dimension being varied from each regime, and compare the estimated values with the ground truth.

Figure 6 shows the performance of the joint and independent models for this experiment. The joint model leads to better estimates of saturation than the independent model by making use of the annotations on brightness. This agrees with the Helmholtz-Kohlrausch phenomenon described above, since the annotators perceive the changing saturation as a change in brightness, leading to correlated annotations for the two dimensions. On the other hand, the independent model leads to better estimates of brightness, which appears to have no effect on the perceived saturation annotations. This experiment highlights the benefits of jointly modeling annotations in cases where the annotation dimensions are correlated or dependent on one another.

5.1.3 Real data

(a) Concordance correlation
(b) Pearson correlation
Fig. 7: Performance of the global annotation model on the text emotions dataset; * indicates statistical significance

Our final experiment with the global model was on the task of annotating news headlines, in which the annotators provide numeric ratings for various emotions. This dataset was first described in the 2007 SemEval task on affective text [44]. The numeric ratings from the original task were provided by trained annotators, and we treat these as expert annotations. We use the Mturk annotations from [42] as the actual input to our model. Sentence level annotations are provided on seven emotions: anger, disgust, fear, joy, sadness, surprise and valence (positive/negative polarity). We use sentence level embeddings computed with the pre-trained sentence embedding model sent2vec [31] as feature vectors for the model.

Figure 7 shows the performance of the joint and independent models on this task. The joint model shows better performance in predicting the reference emotion labels for anger, disgust, fear, joy and sadness, but performs worse than the independent model in predicting surprise and valence.

(a) Concordance correlation coefficients
(b) Pearson correlation coefficients
Fig. 8: Concordance and Pearson correlation coefficients between ground truth/reference and model predictions for the time series annotation model; *-statistically significant

5.2 Time series annotation model

In this setting, the annotations are collected on data with a temporal dimension, such as time series data, video or audio signals. Similar to the global model, we evaluate this model in three settings: synthetic, artificial and real data. The evaluation metrics are computed over the estimated and actual ground truth vectors by concatenating the data instances into a single vector. The time series models have the window size as an additional hyperparameter, which is selected using a validation set. In each fold of the dataset, we train model parameters for different window sizes, and pick the window size and related parameters with the highest concordance correlation on the validation set. These are then evaluated on a disjoint test set, and we repeat the process for each fold. In each experiment, the parameters were initialized randomly, and the process was repeated 20 times with different random initializations, selecting the best starting point using the validation set. To identify significant differences, we compute the test set performance of the two models for each fold, and run the paired t-test between the resulting per-fold samples of the two metrics corresponding to the joint and independent models. We do not bootstrap confidence intervals due to the smaller test set sizes.
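The two evaluation quantities used throughout, Lin's concordance correlation coefficient and the paired t statistic over per-fold scores, can be computed as follows. This is a minimal numpy sketch; the function names are ours, and the p-value lookup against the t distribution is omitted.

```python
import numpy as np

def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                 # population variances
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def paired_t_statistic(a, b):
    """t statistic of the paired t-test between per-fold scores a and b."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

x = np.array([0.1, 0.4, 0.35, 0.8])
print(concordance_correlation(x, x))  # 1.0
```

Unlike Pearson's correlation, the concordance correlation penalizes shifts and scalings: `concordance_correlation(x, x + 0.5)` is strictly below 1 even though the Pearson correlation stays at 1.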

5.2.1 Synthetic data

The synthetic dataset was created using the model described in Section 4.3. Elements of the feature matrix were sampled from the standard normal distribution, while the elements of the transformation matrices and the ground truth were sampled from a normal distribution. In this setting each data instance includes feature vectors, one for each time stamp. The time dependent feature matrices were created using a random walk model without drift but with lag, to mimic a real-world task. In other words, while creating the time series, the feature vectors were held fixed for a period arbitrarily chosen to be between 2 and 4 time stamps; this was done because in most tasks the underlying dimension (such as emotion) is expected to remain constant for at least a few seconds. In addition, the transitions between changes in the feature vectors were linear rather than abrupt. In our experiments, the model dimensions and the number of annotators were held fixed.
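The random walk with lag described above can be sketched as follows. The 2-4 stamp hold lengths come from the text; the ramp length, step scale, and function name are our illustrative choices.

```python
import numpy as np

def random_walk_with_lag(n_steps, ramp=2, rng=None):
    """Random walk without drift whose value is held fixed for 2-4 time
    stamps and then moves linearly (not abruptly) to the next level."""
    rng = rng or np.random.default_rng(0)
    series, level = [], 0.0
    while len(series) < n_steps:
        hold = int(rng.integers(2, 5))        # plateau of 2-4 time stamps
        series.extend([level] * hold)
        nxt = level + rng.standard_normal()   # next random-walk level
        # linear transition between levels instead of an abrupt jump
        series.extend(np.linspace(level, nxt, ramp + 2)[1:-1].tolist())
        level = nxt
    return np.array(series[:n_steps])

x = random_walk_with_lag(50)
print(x.shape)  # (50,)
```

One such series would be generated per feature dimension and stacked to form the time-dependent feature matrix of a data instance.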

Figure 8 shows the aggregate results across cross-validation folds for the joint and independent models in the 3 settings. On the synthetic dataset, the joint model achieves higher Pearson correlation for both dimensions and higher concordance correlation for dimension 1. For dimension 2, however, the independent model achieves better concordance correlation.

5.2.2 Artificial data

We collected annotations on videos with the artificial task of identifying saturation and brightness described in the previous section. The videos consisted of monochromatic images whose underlying saturation and brightness were varied independent of each other. The dimensions were created using the random walk model with lag described in Section 5.2.1. The annotations were collected in house using an annotation system built with the Robot Operating System [32]; 10 graduate students gave their ratings on the two dimensions. Each dimension was annotated independently using a mouse controlled slider. For parameter estimation, the feature vectors for each time stamp were the RGB values.

As seen in Figure 8, both models achieve similar performance in predicting the ground truth for saturation and brightness in terms of Pearson correlation, as well as in predicting saturation in terms of concordance correlation. The independent model achieves slightly better performance in predicting brightness in terms of concordance correlation (though the difference is not statistically significant); however, the Pearson correlations suggest that the joint model's output differs only by a linear scaling. The joint model thus appears to be on par with the independent model for the most part, suggesting that the transformation matrix connecting the two dimensions for each annotator is unable to accurately capture the dependencies between the dimensions. This is likely because, unlike in the global annotation experiment, the underlying brightness and saturation were varied simultaneously and independent of each other (leading to non-linear dependencies between them), while we limit the transformation matrix to capturing only linear relationships.

5.2.3 Real data

We finally evaluate our model on a real world task with time series annotations. We chose the task of predicting the affective dimensions of valence and arousal from movie clips, first described in [25]. The associated corpus includes time series annotations of valence and arousal on contiguous 30 minute video segments from 12 Academy Award winning movies. This task was chosen because the dataset includes both expert annotations and annotations from naive users. We treat the expert annotations as reference and evaluate the estimated dimensions against them; however, we note that the expert labels were provided by just one annotator and may themselves be noisy.

For each movie clip, 6 annotators provided annotations of their perceived valence and arousal using the Feeltrace [9] annotation tool. The features used in parameter estimation combine audio and video features extracted separately. The audio features were extracted using the emotion recognition baseline configuration of Opensmile [16] at 25 fps (the same frame rate as the video clips) and aggregated over a window of 5 seconds using the following statistical functionals: mean, max, min, standard deviation, range, kurtosis, skewness and inter-quartile range. The video features were extracted using OpenCV [5] and included frame level luminance, intensity, Hue-Saturation-Value (HSV) color histograms and optical flow [7], which were also aggregated to 5 seconds by simple averaging. The combined features form a fixed-size vector for each frame.
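The windowed aggregation with statistical functionals can be sketched as below, using numpy only. At 25 fps a 5-second window is 125 frames; the non-overlapping windowing, the functional order, and the function name are our illustrative assumptions.

```python
import numpy as np

def window_functionals(x, win):
    """Aggregate a 1-D signal over non-overlapping windows using the
    functionals from the text: mean, max, min, std, range, excess
    kurtosis, skewness and inter-quartile range."""
    out = []
    for i in range(0, len(x) - win + 1, win):
        w = np.asarray(x[i:i + win], float)
        m, s = w.mean(), w.std()
        z = (w - m) / s                      # assumes a non-constant window
        out.append([m, w.max(), w.min(), s,
                    w.max() - w.min(),
                    (z ** 4).mean() - 3.0,   # excess kurtosis
                    (z ** 3).mean(),         # skewness
                    np.percentile(w, 75) - np.percentile(w, 25)])
    return np.array(out)

feats = window_functionals(np.sin(np.linspace(0, 10, 250)), win=125)
print(feats.shape)  # (2, 8)
```

Each raw feature stream (audio or video) would be passed through this aggregation independently before the per-window vectors are concatenated.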

Figure 8 shows the performance of the two models in estimating the affective dimensions on this dataset. The joint model considerably outperforms the independent model in estimating arousal, while the independent model produces better estimates of valence from the annotations. The independent model performs poorly in arousal prediction, whereas the joint model shows balanced performance across both dimensions, with the joint modeling constraint likely acting as a regularizer.

(a) Concordance correlation
(b) Pearson’s correlation
Fig. 9: Effect of varying dependency between annotation dimensions for the synthetic model

5.3 Effect of dependency among dimensions

To evaluate the impact of the magnitude of dependency between the annotation dimensions on the performance of the models, we created a set of synthetic annotations for the global model, similar to Section 5.1.1. We created 10 synthetic datasets, each with transformation matrices held constant across all annotators. The principal diagonal elements were fixed to 1 while the off-diagonal elements were increased from 0.1 to 1 with a step size of 0.1. As in the previous setting, we created 100 annotators, each operating on 10 files. Note that despite the annotators having identical transformation matrices, their annotations on a given file differ because of the noise term in Equation 7.
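The generation process above can be sketched as follows for the two-dimensional case. The unit diagonal and shared off-diagonal value come from the text; the noise scale, random seed, and function name are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_annotations(eta, c, n_annotators, noise_std=0.1):
    """Distort 2-D ground truth eta (n_files x 2) through a transformation
    matrix with unit diagonal and off-diagonal c, shared by all annotators,
    plus annotator-specific additive noise."""
    F = np.array([[1.0, c],
                  [c, 1.0]])
    return np.stack([
        eta @ F.T + noise_std * rng.standard_normal(eta.shape)
        for _ in range(n_annotators)
    ])

eta = rng.standard_normal((10, 2))            # ground truth for 10 files
ann = make_annotations(eta, c=0.5, n_annotators=100)
print(ann.shape)  # (100, 10, 2)
```

Sweeping `c` from 0.1 to 1 reproduces the increasing-dependency regime studied in this section: at `c = 0` the dimensions are generated independently, while at `c = 1` the transformation matrix becomes singular.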

Figure 9 shows the 5-fold cross validated performance of the joint and independent models on this task. As seen in the figure, the joint model consistently outperforms the independent model on both metrics. The two models start with similar performance when the off-diagonal elements are close to zero, since this implies no dependency between the annotation dimensions, and the performance of both models degrades as the off-diagonal elements increase. However, the joint model makes better predictions of the ground truth by exploiting the dependency between the dimensions, highlighting the benefits of modeling the annotation dimensions jointly. Visualizations of the averaged estimates of the transformation matrices from this experiment can be found in Section A.2.

6 Conclusion

We presented a model to combine multi-dimensional annotations from crowdsourcing platforms such as Mturk. The model treats the ground truth as latent and distorted by the annotators. The latent ground truth and the model parameters are estimated using the EM algorithm, with updates derived for both the global and time series annotation settings. We evaluated the model on synthetic and real data, and also proposed an artificial task with controlled ground truth on which the model was evaluated.

Weaknesses of the model include vulnerability to unidentifiability issues, like most variants of factor analysis [17]. Typical strategies to address this issue involve adopting a suitable prior constraint on the factor matrix. For example, in PCA, the factors are constrained to be orthogonal to each other and arranged in decreasing order of variance. In our experiments, the model was found to be vulnerable to unidentifiability due to label switching, which was addressed through manual judgments. We defer the task of choosing an appropriate prior constraint on the factor matrix to future work.

Future work includes generalizing the model with Bayesian extensions, in which case the parameters can be estimated using variational inference, in addition to adding model constraints to ensure identifiability of all model parameters. Though we limit our analysis here to a linear relationship between the transformation matrix and the ground truth vector, extending the model to capture non-linear relationships is straightforward: for example, the ground truth vector in Equation 7 can be replaced by one with a non-linear dependence on the ground truth. Providing theoretical bounds on the model performance, especially with respect to sample complexity, may also be possible since we have assumed normal distributions throughout the model.

7 Acknowledgements

The authors would like to thank Zisis Skordilis for all the helpful discussions and feedback.


  • [1] K. Audhkhasi and S. Narayanan (2013) A globally-variant locally-constant model for fusion of labels from multiple diverse experts without using reference labels. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (4), pp. 769–783. Cited by: §2.
  • [2] F. R. Bach and M. I. Jordan (2005) A probabilistic interpretation of canonical correlation analysis. Technical report University of California, Berkeley. Cited by: §2.
  • [3] C. M. Bishop (2006) Pattern recognition and machine learning. Springer. Cited by: §B.0.1.
  • [4] B. M. Booth, K. Mundnich, and S. S. Narayanan (2018) A novel method for human bias correction of continuous-time annotations. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3091–3095. Cited by: §5.1.2.
  • [5] G. Bradski (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: §5.2.3.
  • [6] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language resources and evaluation 42 (4), pp. 335. Cited by: §3.
  • [7] L. S. Chen, T. S. Huang, T. Miyasato, and R. Nakatsu (1998) Multimodal human emotion/expression recognition. In Proceedings of Third IEEE International Conference on Automatic Face and Gesture Recognition, pp. 366–371. Cited by: §5.2.3.
  • [8] D. Corney, J. Haynes, G. Rees, and R. B. Lotto (2009) The brightness of colour. PloS one 4 (3), pp. e5091. Cited by: §5.1.2.
  • [9] R. Cowie, E. Douglas-Cowie, S. Savvidou, E. McMahon, M. Sawey, and M. Schröder (2000) FEELtrace: an instrument for recording perceived emotion in real time. In ISCA tutorial and research workshop (ITRW) on speech and emotion, Cited by: §2, §5.2.3.
  • [10] R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Stapleton (2013) Gtrace: general trace program compatible with emotionml. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, ACII 2013, pp. 709–710. Cited by: §2.
  • [11] A. P. Dawid and A. M. Skene (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied statistics, pp. 20–28. Cited by: §2, §2.
  • [12] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), pp. 1–38. Cited by: §1.
  • [13] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §2.
  • [14] D. Dupre, D. Akpan, E. Elias, J. Adam, B. Meillon, N. Bonnefond, M. Dubois, and A. Tcherkassof (2015) Oudjat: a configurable and usable annotation tool for the study of facial expressions of emotion. International Journal of Human-Computer Studies 83, pp. 51–61. Cited by: §2.
  • [15] B. Efron and R. J. Tibshirani (1994) An introduction to the bootstrap. CRC press. Cited by: §5.1.
  • [16] F. Eyben, M. Wöllmer, and B. Schuller (2010) Opensmile: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, MM ’10, pp. 1459–1462. External Links: ISBN 978-1-60558-933-6, Document Cited by: §5.2.3.
  • [17] L. R. Fabrigar, D. T. Wegener, R. C. MacCallum, and E. J. Strahan (1999) Evaluating the use of exploratory factor analysis in psychological research.. Psychological methods 4 (3), pp. 272. Cited by: §5, §6.
  • [18] J. M. Girard and A. G. Wright (2018) DARMA: software for dual axis rating and media annotation. Behavior research methods 50 (3), pp. 902–909. Cited by: §2.
  • [19] R. Gupta, K. Audhkhasi, Z. Jacokes, A. Rozga, and S. S. Narayanan (2018-01) Modeling multiple time series annotations as noisy distortions of the ground truth: an expectation-maximization approach. IEEE Trans. Affect. Comput. 9 (1), pp. 76–89. External Links: ISSN 1949-3045, Document Cited by: §C.1.1, §2, §4.3.1, §4.3, §5.
  • [20] H. H. Harman (1976) Modern factor analysis. University of Chicago press. Cited by: §2.
  • [21] H. F. Kaiser (1958) The varimax criterion for analytic rotation in factor analysis. Psychometrika 23 (3), pp. 187–200. Cited by: §5.
  • [22] Y. E. Kara, G. Genc, O. Aran, and L. Akarun (2015) Modeling annotator behaviors for crowd labeling. Neurocomputing 160, pp. 141–156. Cited by: §2.
  • [23] M. Kipp (2001) Anvil-a generic annotation tool for multimodal dialogue. In Seventh European Conference on Speech Communication and Technology, pp. 1367–1370. Cited by: §2.
  • [24] I. Lawrence and K. Lin (1989) A concordance correlation coefficient to evaluate reproducibility. Biometrics, pp. 255–268. Cited by: §5.
  • [25] N. Malandrakis, A. Potamianos, G. Evangelopoulos, and A. Zlatintsi (2011) A supervised approach to movie emotion tracking. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2376–2379. Cited by: §A.1, §2, §3, §4.3.1, §5.2.3.
  • [26] S. Mariooryad and C. Busso (2015) Correcting time-continuous emotional labels by modeling the reaction lag of evaluators. IEEE Transactions on Affective Computing 6 (2), pp. 97–108. Cited by: §2.
  • [27] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder (2012) The semaine database: annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing 3 (1), pp. 5–17. Cited by: §3.
  • [28] A. Metallinou and S. S. Narayanan (2013-04) Annotation and processing of continuous emotional attributes: challenges and opportunities. In 2nd International Workshop on Emotion Representation, Analysis and Synthesis in Continuous Time and Space (EmoSPACE 2013), External Links: Document Cited by: §2.
  • [29] F. Nagel, R. Kopiez, O. Grewe, and E. Altenmüller (2007) EMuJoy: software for continuous measurement of perceived emotions in music. Behavior Research Methods 39 (2), pp. 283–290. Cited by: §2.
  • [30] M. A. Nicolaou, V. Pavlovic, and M. Pantic (2014) Dynamic probabilistic CCA for analysis of affective behavior and fusion of continuous annotations. IEEE transactions on pattern analysis and machine intelligence 36 (7), pp. 1299–1311. Cited by: §2, §2.
  • [31] M. Pagliardini, P. Gupta, and M. Jaggi (2017) Unsupervised learning of sentence embeddings using compositional n-gram features. CoRR abs/1703.02507. Cited by: §5.1.3.
  • [32] M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger, R. Wheeler, and A. Ng (2009-05) ROS: an open-source robot operating system. In Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA) Workshop on Open Source Robotics, Kobe, Japan. Cited by: §5.2.2.
  • [33] A. Ramakrishna, R. Gupta, R. B. Grossman, and S. S. Narayanan (2016) An expectation maximization approach to joint modeling of multidimensional ratings derived from multiple annotators. In Proceedings of Interspeech, pp. 1555–1559. External Links: Document Cited by: §2.
  • [34] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy (2010) Learning from crowds. The Journal of Machine Learning Research 11, pp. 1297–1322. Cited by: Fig. 2, §2, §2, §5.
  • [35] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne (2013) Introducing the recola multimodal corpus of remote collaborative and affective interactions. In 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–8. Cited by: §A.1, §3.
  • [36] J. A. Russell and J. M. Carroll (1999) On the bipolarity of positive and negative affect.. Psychological bulletin 125 (1), pp. 3. Cited by: §2.
  • [37] J. A. Russell (1980) A circumplex model of affect.. Journal of personality and social psychology 39 (6), pp. 1161. Cited by: §2.
  • [38] N. B. Shah, D. Zhou, and Y. Peres (2015) Approval voting and incentives in crowdsourcing. arXiv preprint arXiv:1502.05696. Cited by: §2.
  • [39] K. Sharma, M. Wagner, C. Castellini, E. L. van den Broek, F. Stulp, and F. Schwenker (2019) A functional data analysis approach for continuous 2-d emotion annotations. In Web Intelligence, Vol. 17, pp. 41–52. Cited by: §2.
  • [40] V. S. Sheng, F. Provost, and P. G. Ipeirotis (2008) Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 614–622. Cited by: §2.
  • [41] P. Smyth, U. M. Fayyad, M. C. Burl, P. Perona, and P. Baldi (1995) Inferring ground truth from subjective labelling of venus images. In Advances in neural information processing systems, pp. 1085–1092. Cited by: §2.
  • [42] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng (2008) Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 254–263. Cited by: §2, §5.1.3.
  • [43] V. I. Spitkovsky, H. Alshawi, D. Jurafsky, and C. D. Manning (2010) Viterbi training improves unsupervised dependency parsing. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pp. 9–17. Cited by: §4.3.1.
  • [44] C. Strapparava and R. Mihalcea (2007) SemEval-2007 task 14: affective text. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval ’07, pp. 70–74. Cited by: §5.1.3.
  • [45] J. Surowiecki (2005) The wisdom of crowds. Anchor. Cited by: §1, §2.
  • [46] C. Vondrick, D. Patterson, and D. Ramanan (2013) Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision 101 (1), pp. 184–204. Cited by: §2.
  • [47] D. Watson and L. A. Clark (1992) On traits and temperament: general and specific factors of emotional experience and their relation to the five-factor model. Journal of personality 60 (2), pp. 441–476. Cited by: §2.

Appendix A Supplementary Analyses

a.1 Annotator specific correlations

Figure 10 highlights the correlations between annotation dimensions for 4 annotators from the movie emotions [25] and RECOLA [35] corpora. As noted earlier, different annotators may exhibit different degrees of association between the annotation dimensions, leading to the observed differences in correlations both within and between the two corpora. This difference in annotator behavior also leads to the different inter-dimension correlations observed among the corpora in Figure 3.

a.2 Effect of dependency among dimensions

The model we present includes an annotator-specific parameter that measures the relationships between the annotation dimensions. To highlight the ability of the model to recover this parameter, in Figure 11 we plot the averages of all predicted transformation matrices for the different step sizes from the synthetic experiment described in Section 5.3. In each case, the predicted matrices closely resemble the actual matrices for the annotators, highlighting the accuracy of the joint model. However, as the off-diagonal step size approaches 1, the estimated matrices appear washed out (despite being accurate up to a scaling term), with all terms of the estimated matrices nearly equal (Figure 11(f)), due to model unidentifiability.

(a) 0
(b) 0.2
(c) 0.4
(d) 0.6
(e) 0.8
(f) 1
Fig. 11: Average plots estimated from the joint model at different step sizes for off diagonal elements of the annotator’s matrices

Appendix B EM derivation for global annotation model

b.0.1 Deriving the joint distribution

To help with the model formulation, we first derive the parameters of the joint distribution of the annotations and the latent ground truth. Since the product of two normal densities is also normal [3], this joint distribution is normal, and the components of its covariance matrix (Equation 15) are derived below.

In the derivation of the cross-covariance term, the first equality is a direct application of the law of total covariance, and the second follows from the conditional independence of the annotation values given the ground truth.

Finally, owing to the jointly normal distributions, the conditional distribution of the ground truth given the annotations is also normal. By the standard result for conditional normal distributions, given a jointly normal vector partitioned as

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right),$$

the conditional distribution has the following form:

$$x_1 \mid x_2 \sim \mathcal{N}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).$$


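As a quick numeric sanity check of this standard identity in the two-dimensional case, the conditional moments can be computed directly (the mean, covariance, and function name below are arbitrary illustrative choices, not values from the paper):

```python
import numpy as np

def conditional_normal(mu, Sigma, x2):
    """Moments of x1 | x2 for a 2-D jointly normal vector (mu, Sigma)."""
    mu1, mu2 = mu
    s11, s12, s22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
    cond_mean = mu1 + s12 / s22 * (x2 - mu2)   # mu1 + S12 S22^-1 (x2 - mu2)
    cond_var = s11 - s12 ** 2 / s22            # S11 - S12 S22^-1 S21
    return cond_mean, cond_var

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
m, v = conditional_normal(mu, Sigma, x2=0.0)
print(m, v)  # 1.8 1.36
```

Observing x2 above its mean pulls the conditional mean of x1 upward (positive covariance) and always shrinks its variance relative to the marginal.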
b.1 EM Formulation

We begin by introducing a new distribution q over the latent ground truth into the likelihood in Equation 10 (we drop the parameters from the likelihood expansion for convenience). Applying Jensen's inequality to the log of the resulting expectation gives a lower bound on the log likelihood.


The bound becomes tight when the expectation is taken over a constant value, i.e., when the ratio inside the expectation equals some constant c. Solving for the constant c shows that q must equal the posterior distribution of the latent ground truth given the annotations.
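In generic notation (the paper's own symbols did not survive extraction, so here q denotes the introduced distribution over the latent ground truth η, and A the annotations), the standard EM lower bound reads:

```latex
\log p(A) \;=\; \log \int q(\eta)\,\frac{p(A,\eta)}{q(\eta)}\,d\eta
\;\ge\; \int q(\eta)\,\log \frac{p(A,\eta)}{q(\eta)}\,d\eta ,
```

with equality when p(A, η)/q(η) = c for all η; normalizing both sides then gives c = p(A), and hence q(η) = p(η | A), the posterior over the latent ground truth.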

b.1.1 E-Step

The E-step involves simply setting q to the conditional distribution of the latent ground truth given the annotations.

To help with later computations, we also compute the following expectations: the first two are a result of Equations 16 and 17, the third follows from the definition of covariance, and the last is a standard result (see The Matrix Cookbook, Eq. 327).