Generating Labels for Regression of Subjective Constructs using Triplet Embeddings

Human annotations serve an important role in computational models where the target constructs under study are hidden, such as dimensions of affect. This is especially relevant in machine learning, where subjective labels derived from related observable signals (e.g., audio, video, text) are needed to support model training and testing. Current research trends focus on correcting artifacts and biases introduced by annotators during the annotation process while fusing them into a single annotation. In this work, we propose a novel annotation approach using triplet embeddings. By lifting the absolute annotation process to relative annotations where the annotator compares individual target constructs in triplets, we leverage the accuracy of comparisons over absolute ratings by human annotators. We then build a 1-dimensional embedding in Euclidean space that is indexed in time and serves as a label for regression. In this setting, the annotation fusion occurs naturally as a union of sets of sampled triplet comparisons among different annotators. We show that by using our proposed sampling method to find an embedding, we are able to accurately represent synthetic hidden constructs in time under noisy sampling conditions. We further validate this approach using human annotations collected from Mechanical Turk and show that we can recover the underlying structure of the hidden construct up to bias and scaling factors.




1 Introduction

Continuous-time annotations are an essential resource for the computational study of hidden constructs such as affect or behavioral traits over time. Indeed, the study of these hidden constructs is commonly tackled using regression techniques under a supervised learning framework, which rely heavily on features that are accurately labeled with respect to the constructs under study. Formally, regression problems deal with finding a mapping $h: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ is the feature space and $\mathcal{Y}$ is the label space. Note that if $\mathcal{Y}$ is indexed by time, it is sometimes called a continuous-time label¹. In this paper, we are interested in finding labels $y \in \mathcal{Y}$ such that $y$ is a good proxy for a hidden construct $\lambda$. As an example, in affective computing, $\lambda$ is often a dimension of affect such as arousal (emotion intensity) or valence (emotion polarity), and it is assumed to be characterizable by data in the observation space $\mathcal{X}$ (e.g., audio, video, or bio-behavioral signals).

¹As opposed to discrete labels without time dependency.

In the current literature, continuous-time labels in $\mathcal{Y}$ are often generated from a set of continuous-time annotations $\{a_r\}_{r=1}^{R}$ acquired from a set of human raters or annotators $\{1, \dots, R\}$. Each annotator uses perceptually interpretable features $P \in \mathbb{R}^{T \times d}$ to generate annotations about the construct [1, 2, 3]. In the sets above, $T$ is the number of samples in time, and $d$ represents the dimension of the set of perceptual features (e.g., audio levels, frames in a video) used for the real-time annotation acquisition. More generally, annotators are requested to perform the mapping:

$f_r: \mathbb{R}^{T \times d} \to \mathbb{R}^{T}, \quad P \mapsto a_r,$

where each $f_r$ is specific to annotator $r$ for a construct $\lambda$. Usually, several of these single annotations are collected from several annotators, processed, and combined to create a single label $y$. This problem is called annotation fusion.

To train accurate statistical models, it is important that the labels used are precise and accurate, and properly reflect the variable under study [4]. Unfortunately, the annotation of hidden cues such as behavioral traits is a challenging problem due to several factors, including diverse interpretations of the construct under study, differences in the perception of scale, improper design of the annotation-capturing tools, and disparate reaction times [2, 5, 6]. All of these affect the fidelity of the individual annotations $a_r$, as shown in Figure 1.


Figure 1: Two real-time human annotation tasks with known ground truth (intensity of green over time, shown by the thick black lines): (a) Task A; (b) Task B. Six annotations are plotted in each task. Different colors represent each annotation done in real-time by a different annotator in a synthetic data experiment. Data retrieved from [6].

To better study these challenges and the efficacy of algorithms to generate $y$, we build upon the perceptual annotation tasks proposed previously in [6], where the ground truth is known, as a way to evaluate annotation fusion and correction algorithms. We proposed these tasks to decouple the problem of the annotations themselves from the interpretation of hidden constructs. Figure 1 shows the outcome of these experiments, where nine human annotators were asked to annotate the intensity of green color in two different tasks (A and B)²; six annotations are plotted per task for clarity. Figure 1 exhibits many of the artifacts that complicate the fusion of continuous-time annotations: variable reaction times, overshooting of fast changes, time-varying biases, disparate interpretations of scale, and difficulties in annotating constant intervals of the variable under study (mainly due to real-time corrections by the annotators themselves).

²More complex real-world scenarios with coupled problems will be the subject of a future communication.

1.1 Related work in annotation fusion

Related recent research has attempted to estimate the underlying construct $\lambda$ by using the continuous-time annotations $\{a_r\}$. Different works have addressed a subset of the aforementioned challenges (time lags, scale interpretations). For example, [5, 7] study and model the reaction lag of annotators by using features from the data, shifting each annotation before performing a simple average to fuse them, thus creating a unique label (EvalDep). Dynamic time warping (DTW), proposed by [8], is another popular time-alignment method that monotonically warps time to maximize alignment, and is usually combined with weighted averaging of signals. [9] propose the use of a Long Short-Term Memory network (LSTM) to fuse asynchronous input annotations by conducting time-alignment and de-biasing the different annotations. [10] present a method for modeling multiple annotations over a continuous variable, and compute the ground truth by modeling annotator-specific distortions as filters whose parameters can be estimated jointly using Expectation-Maximization (EM). However, this work relies on strong modeling assumptions made for mathematical tractability that do not necessarily reflect how annotators behave. All of the aforementioned works involve post-processing the raw continuous-time annotations and performing the annotation fusion by averaging weighted signals in different (non)linear ways.

A different set of approaches is used to learn a warping function so that the fusion better correlates with associated features, such as those presented by [11, 12]. These spatial-warping methods can be combined with time warping, as shown by [13, 14, 15]. All of these approaches rely on using a set of features.

[6] propose a framework based on triplet embeddings to correct a continuous-time label generated by a fusion algorithm. This approach warps the fused label by selecting specific windows of it to collect extra information from human annotators through triplet comparisons. In [16], the authors use triplet embeddings to fuse real-time annotations. However, in these works the question of whether triplet comparisons alone can be used to generate the label is not studied. This is the topic of this paper.

1.2 Contributions

In this paper we study the performance of a new methodology to acquire and create a single label for regression by changing the sampling procedure of the latent construct. We sample this information by asking annotators questions of the form "is the signal in time-frame $i$ more similar to the signal in time-frame $j$ or in time-frame $k$?" to build a 1-dimensional embedding in Euclidean space, where $(i, j, k)$ forms a triplet. Figure 2 shows an example of a query in the proposed sampling method for tasks A and B, where the comparison is based on the shade (intensity) of the color.

Formally, we propose that annotators perform the following mapping:

$g_r: (i, j, k) \mapsto \{0, 1\}, \quad g_r(i, j, k) = \mathbb{1}\{\hat{d}_r(i, j) < \hat{d}_r(i, k)\},$

where $\hat{d}_r$ is a perceived dissimilarity of construct $\lambda$ by annotator $r$. We use a set of queried triplets $\mathcal{S}$ and the corresponding annotations to calculate the embedding $y$.

We motivate this approach using three key observations. First, studies in psychology and in machine learning/signal processing have shown that people are better at comparing items than at rating them [17, 18, 2, 19], so this sampling mechanism is easier for annotators than requesting absolute ratings in real-time. Second, the use of triplet embeddings naturally solves the annotation fusion problem, since fusion is done by taking the union of sets (details in Section 3). Third, triplet embeddings offer a simple way of verifying the agreement of the annotations, given by the number of triplet violations in the computed embedding.

We empirically show that it is possible to reconstruct the hidden green intensity signal of tasks A and B in Figure 1 under different synthetic noise scenarios in the triplet labeling stage. These reconstructions are accurate up to a scaling and bias factor but do not suffer from artifacts such as time-lags present in real-time annotations. Moreover, to test our approach, we gather triplet comparisons for the same experiments from Amazon Mechanical Turk and show that it is possible to reconstruct the hidden green intensity values over time up to scaling and bias factors when humans perform the triplet comparisons.

2 Triplet Embeddings

We first recall the general setting of triplet embeddings from a probabilistic perspective. Let $x_1, \dots, x_n$ be items that we want to represent through points $y_1, \dots, y_n \in \mathbb{R}^k$, respectively, with $k \ll n$. The items $x_i$ do not necessarily lie in a metric space, but we assume there exists a dissimilarity function or pseudo-measure $d$. This dissimilarity may be perceptual, such as comparisons of affect in the context of affective computing. We use this dissimilarity to perform comparisons of the form:

$d(x_i, x_j) \lessgtr d(x_i, x_k)$    (5)

to find the embedding $y_1, \dots, y_n$.

Formally, let $\mathcal{T}$ be the set of all possible unique triplets for $n$ items:

$\mathcal{T} = \{(i, j, k) : i \neq j \neq k \neq i,\ j < k\}.$    (6)

Note that $|\mathcal{T}| = n \binom{n-1}{2} = O(n^3)$, which may be a very large set to query. We observe a set of triplets $\mathcal{S} \subseteq \mathcal{T}$, with $|\mathcal{S}| \ll |\mathcal{T}|$, and corresponding realizations $w_t \in \{-1, +1\}$ of the random variables $W_t$, where $t = (i, j, k) \in \mathcal{S}$, such that:

$\mathbb{P}(W_t = 1) = f\big(d(x_i, x_k) - d(x_i, x_j)\big).$    (7)

Here, $f$ is a function that behaves as a cumulative distribution function [20] (sometimes called a link function), and therefore has the property that $f(-x) = 1 - f(x)$. Hence, the $w_t$'s indicate whether $x_i$ is closer to $x_j$ than to $x_k$, with a probability depending on the difference $d(x_i, x_k) - d(x_i, x_j)$.


Let $G = Y^\top Y$ be the Gram matrix of the embedding $Y = [y_1, \dots, y_n] \in \mathbb{R}^{k \times n}$. We can estimate $G$ (and hence $Y$) by minimizing the empirical risk:

$\hat{G} = \arg\min_{G \succeq 0} \frac{1}{|\mathcal{S}|} \sum_{t \in \mathcal{S}} \ell\big(w_t \langle L_t, G \rangle\big),$    (8)

where $\ell$ is a (margin-based) loss function and $L_t \in \mathbb{R}^{n \times n}$ for $t = (i, j, k)$ is defined by:

$(L_t)_{jj} = -1, \quad (L_t)_{kk} = 1, \quad (L_t)_{ij} = (L_t)_{ji} = 1, \quad (L_t)_{ik} = (L_t)_{ki} = -1,$    (9)

and zeros everywhere else, so that the Frobenius inner product $\langle L_t, G \rangle = \|y_i - y_k\|_2^2 - \|y_i - y_j\|_2^2$ (and therefore $w_t$ contributes only a sign). After minimizing Eq. 8, we can recover $Y$ from $\hat{G}$ up to a rigid transformation using the SVD.
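The recovery step at the end can be sketched in a few lines. The following is an illustrative Python fragment (not the authors' Julia implementation; the function name is ours): given a positive semidefinite Gram matrix, the top-$k$ eigendecomposition, which coincides with the SVD for symmetric PSD matrices, yields a point configuration that is unique up to a rigid transformation.

```python
import numpy as np

def embedding_from_gram(G: np.ndarray, k: int) -> np.ndarray:
    """Recover an n x k point configuration Y from a PSD Gram matrix G = Y Y^T,
    up to a rigid transformation (rotation/reflection)."""
    vals, vecs = np.linalg.eigh(G)            # eigh: ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]          # keep the top-k eigenpairs
    vals_k = np.clip(vals[idx], 0.0, None)    # guard tiny negative eigenvalues
    return vecs[:, idx] * np.sqrt(vals_k)

# Sanity check: pairwise distances are preserved by the recovery.
rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 1))                  # 10 points in 1-D
G = Y @ Y.T
Y_hat = embedding_from_gram(G, k=1)
```

In 1-D the residual rigid transformation is just a possible sign flip, so all pairwise distances of `Y_hat` match those of `Y` exactly.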

In a maximum likelihood framework, $\ell$ is induced by our choice of $f$, assuming that the $w_t$ are independent. For example, if $f$ is the logistic function $f(x) = (1 + e^{-x})^{-1}$, the induced loss is the logistic loss $\ell(x) = \log(1 + e^{-x})$ [21]. This setup is equivalent to Stochastic Triplet Embeddings [22], since the logistic loss and the softmax are equivalent in this setting.

[21] prove that the estimation error $\|\hat{G} - G^*\|_F / n$ (where $G^*$ is the true underlying Gram matrix) is bounded with high probability if $|\mathcal{S}| = \Omega(kn \log n)$. Therefore, for a fixed embedding dimension $k$, the practical number of triplets that need to be queried is $O(n \log n)$ instead of $O(n^3)$.
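The cubic growth of $|\mathcal{T}|$ is easy to verify numerically; as a small sanity check (helper name ours), the count $n \binom{n-1}{2}$ for the 267 frames of task A reproduces the figure quoted later in Section 4:

```python
import math

def n_unique_triplets(n: int) -> int:
    """Number of unique triplets (i, j, k): i is the reference item,
    {j, k} is an unordered pair drawn from the remaining n - 1 items."""
    return n * math.comb(n - 1, 2)

print(n_unique_triplets(267))  # task A at 1 Hz -> 9,410,415
```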

When computing a 1-dimensional embedding (i.e., $k = 1$), each $y_i$ can be interpreted as the value that the embedding takes at time index $i$, so the embedding represents a time series.

3 Labeling triplets with multiple annotators

Eq. 7 shows a way to encode the decision of a single annotator when queried for a comparison as in Eq. 5. However, for multiple annotators we need to extend this model. Let $\{1, \dots, R\}$ be a set of annotators. We define $\mathcal{S}_r \subseteq \mathcal{T}$ as the set of triplets annotated by annotator $r$, so we observe a random variable $W_t^{(r)}$ for each $t \in \mathcal{S}_r$. The labels are defined by:

$\mathbb{P}(W_t^{(r)} = 1) = f_r\big(\hat{d}_r(x_i, x_k) - \hat{d}_r(x_i, x_j)\big),$    (10)

where $\hat{d}_r$ is the annotator's (inner) model for the dissimilarities, and $f_r$ is the annotator-specific link function that drives the probabilities for each annotator.

3.1 Annotation fusion

Due to annotation costs, we choose the sets $\mathcal{S}_r$ to be pairwise disjoint:

$\mathcal{S}_r \cap \mathcal{S}_{r'} = \emptyset, \quad \forall\, r \neq r',$    (11)

so that all queries are unique and any annotated triplet is labeled by at most one annotator.

Note that the fusion process occurs in this step: the annotation fusion in a triplet embedding approach is done by taking the union of all the individually annotated sets to generate a single set of triplets $\mathcal{S} = \bigcup_{r=1}^{R} \mathcal{S}_r$, and using all corresponding labels $w_t^{(r)}$, defined for each annotator $r$ and each corresponding triplet $t \in \mathcal{S}_r$.
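This fusion step is purely set-theoretic, which a toy Python sketch makes plain (names ours): partition the queried triplets among $R$ annotators into disjoint sets $\mathcal{S}_r$, then fuse by taking their union.

```python
import random

# Build the pool of unique triplets for a tiny n = 6 example.
triplets = [(i, j, k) for i in range(6) for j in range(6) for k in range(6)
            if len({i, j, k}) == 3 and j < k]
random.Random(0).shuffle(triplets)

R = 3                                     # number of annotators
S = [triplets[r::R] for r in range(R)]    # disjoint sets S_r, one per annotator
fused = set().union(*S)                   # fusion: S = S_1 U ... U S_R
```

No averaging, alignment, or weighting is involved; the individual labels simply travel along with their triplets.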

One difficulty of this multi-annotator model is the fact that the distribution of $W_t^{(r)}$ depends on the annotators through $f_r$ and $\hat{d}_r$, so that, ultimately, the loss function is annotator-dependent. Fortunately, in our experiments we can assume $f_r = f$ and $\hat{d}_r = d$ for all annotators, as we show experimentally in Figure 4. We will extend this to annotator-dependent distributions in a future communication.

3.2 Triplet violations and annotation agreements

Triplet violations occur when a given triplet $t = (i, j, k) \in \mathcal{S}$ does not agree with the calculated embedding $\hat{y}$, that is:

$w_t = 1 \quad \text{but} \quad |\hat{y}_i - \hat{y}_j| > |\hat{y}_i - \hat{y}_k|,$    (12)

or vice versa. Therefore, we can count the fraction of triplet violations using:

$V = \frac{1}{|\mathcal{S}|} \sum_{t \in \mathcal{S}} \big(1 - \delta_{w_t, \hat{w}_t}\big),$    (13)

where $\hat{w}_t = \operatorname{sign}\big(|\hat{y}_i - \hat{y}_k| - |\hat{y}_i - \hat{y}_j|\big)$ is the label implied by the embedding and $\delta$ is Kronecker's delta.
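Counting violations is a one-liner once the embedding is available; a small Python helper (name ours) for the 1-dimensional case:

```python
import numpy as np

def violation_fraction(y, triplets):
    """Fraction of triplets (i, j, k), each labeled 'i closer to j than to k',
    that a 1-D embedding y fails to satisfy (ties count as satisfied here)."""
    t = np.asarray(triplets)
    i, j, k = t[:, 0], t[:, 1], t[:, 2]
    return float(np.mean(np.abs(y[i] - y[j]) > np.abs(y[i] - y[k])))
```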

To compute the expected number of correctly labeled triplets in $\mathcal{S}$, we can derive another random variable that models the correct annotation of triplet $t$ based on $f$:

$C_t \sim \mathrm{Bernoulli}(p_t),$    (14)

where $p_t = f\big(|d(x_i, x_k) - d(x_i, x_j)|\big)$ is the probability of success.

Using Eq. 14 we can model the number of correctly labeled triplets as a Poisson Binomial random variable $C$:

$C = \sum_{t \in \mathcal{S}} C_t.$    (15)

Its expected value is the sum of the success probabilities:

$\mathbb{E}[C] = \sum_{t \in \mathcal{S}} p_t.$    (16)

After computing $\hat{y}$ from $\mathcal{S}$, and assuming that the optimization routine has found the best possible embedding for $\mathcal{S}$, the fraction of triplet violations $V$ in $\hat{y}$ is linearly related to $\mathbb{E}[C]$ by:

$\mathbb{E}[V] = 1 - \frac{\mathbb{E}[C]}{|\mathcal{S}|}.$    (17)

$V$ is a measure of disagreement between the triplets used to compute the embedding $\hat{y}$: $V = 0$ means that all used triplets agree with the computed embedding, i.e., that all triplet labels agree with each other.

4 Experiments

We conduct two simulated experiments and one human annotation experiment using Mechanical Turk to verify the efficacy of our approach. We use the two synthetic data sets proposed in [6], for which the values of the hidden construct $\lambda$ are known. We use this data because the reconstruction errors can be computed, so we can assess the quality of the resulting labels. The two tasks correspond to videos of green frames with varying intensity of color over time, where the hidden construct is the intensity of green color (shown as thick black lines in Figure 1). The video in task A is 267 seconds long, and the video in task B is 178 seconds long.

To construct our triplet problem we first downsample the videos to 1 Hz, so that the number of frames $n$ equals the length of the video in seconds. This reduces the number of unique triplets, which grows as $O(n^3)$ and may otherwise become prohibitively large. We also set the embedding dimension to $k = 1$, since we want to find a 1-dimensional embedding that represents the intensity of green color over time.

All of our experiments are implemented in Julia v1.0 [23], and available at

4.1 Synthetic triplet annotations

We simulate the annotation procedure by comparing the scalar green intensity values of frames of the video using the absolute difference between values, such that the dissimilarity for Eq. 5 is:

$d(i, j) = |\lambda(i) - \lambda(j)|,$    (18)

where $i$ and $j$ are time indices.

We generate a list of noisy triplets $\mathcal{S}$ by randomly and uniformly selecting each triplet from the pool $\mathcal{T}$ of all possible unique triplets. Each triplet $t$ is correctly labeled with probability $p_t$.

We use eight different fractions of the total possible number of triplets $|\mathcal{T}|$, chosen in logarithmic increments. We use a logarithmic scale to have more resolution at smaller percentages of the total number of possible unique triplets. Note that for 267 frames (task A), the total number of unique triplets is 9,410,415. The queried triplets are randomly and uniformly sampled from all possible unique triplets, since there is no guarantee of better performance for active sampling algorithms in this problem [24].
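Uniform sampling without replacement from such a large pool can be done by rejection rather than by materializing all $O(n^3)$ triplets; a small Python sketch (helper name ours):

```python
import random

def sample_triplets(n, m, seed=0):
    """Uniformly sample m distinct triplets (i, j, k) with i != j != k and
    j < k, without replacement, from the pool of all unique triplets."""
    rng = random.Random(seed)
    seen, out = set(), []
    while len(out) < m:
        i, j, k = rng.sample(range(n), 3)   # three distinct indices
        if j > k:
            j, k = k, j                     # canonical order for the pair
        if (i, j, k) not in seen:
            seen.add((i, j, k))
            out.append((i, j, k))
    return out
```

Rejection is efficient here because the sampled fraction of the pool is small (well below 1% in our experiments).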

We use various algorithms available in the literature to solve the triplet embedding problem: STE and tSTE [22], GNMDS (parameter-free) with hinge loss [25], and CKL [26]. Note that STE and GNMDS pose convex problems, while tSTE and CKL pose non-convex problems. Therefore, we perform 30 different random starts for each set of parameters.

We now describe the three experimental settings we use to validate our approach.

Simulation 1: Constant success probabilities

We choose the success probabilities to be approximately constant, such that $p_t = p + \epsilon_t$, where $\epsilon_t$ is a small random perturbation (to add small variations). We run three different experiments for different values of $p$.

Picking the values of $\epsilon_t$ randomly affects our calculation of $\mathbb{E}[C]$ (Eq. 16), but we assume that these have been fixed a priori, meaning that the annotation process has a fixed probability of correctly labeling any triplet $t$.

Simulation 2: Logistic probabilities

A more realistic simulation is given by labeling the triplets in $\mathcal{S}$ according to the following probabilities:

$p_t = \dfrac{1}{1 + e^{-\sigma\,|d(i,k) - d(i,j)|}},$    (19)

which is the logistic function with scale parameter $\sigma$. We use different values for $\sigma$. Intuitively, triplets with a smaller difference between $d(i,j)$ and $d(i,k)$ should be harder to label, so this is a more realistic noise model than constant errors independent of the difficulty of the comparison. Note that this noise model induces the logistic loss used in STE.
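A Python sketch of this Simulation-2-style labeling process (function name ours; the paper's code is in Julia): each triplet is labeled correctly with the logistic probability above, and incorrectly otherwise.

```python
import numpy as np

def label_triplets(lam, triplets, sigma, seed=0):
    """Label triplets (i, j, k) from a 1-D construct lam with logistic noise.
    Returns booleans: True encodes 'i closer to j than to k'."""
    rng = np.random.default_rng(seed)
    t = np.asarray(triplets)
    d_ij = np.abs(lam[t[:, 0]] - lam[t[:, 1]])
    d_ik = np.abs(lam[t[:, 0]] - lam[t[:, 2]])
    p_correct = 1.0 / (1.0 + np.exp(-sigma * np.abs(d_ik - d_ij)))
    correct = rng.random(len(t)) < p_correct
    true_w = d_ij < d_ik                 # noiseless label
    return np.where(correct, true_w, ~true_w)
```

With a large $\sigma$ almost all labels match the noiseless comparisons; with $\sigma = 0$ every label is a coin flip.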

Mechanical Turk annotations

Using the list of images generated earlier, we randomly and uniformly sample 0.25% and 0.5% of the total number of triplets of images. This corresponds to approximately 23,526 and 47,052 triplets for task A, and 6,931 and 13,862 triplets for task B. To compute the embedding we use STE.

To obtain the list of annotated triplets, we show the annotators options A and B against a reference, with instructions as in Figure 2. We do not provide further instructions for the case where both options appear equally similar to the reference³. For this task, we pay the annotators $0.02 per answered query.

³This introduces noise into our annotations by forcing the annotators to make a decision.


Figure 2: Question design for queries in Mechanical Turk.

4.2 Error measure

We use the error measure proposed in [27], computing the error by first solving the following optimization problem:

$\min_{\alpha, \beta}\ \dfrac{1}{T} \sum_{i=1}^{T} \big(\lambda(i) - (\alpha \hat{y}_i + \beta)\big)^2,$    (20)

where $\alpha$ and $\beta$ are the scaling and bias factors, and $T$ is the length of $\hat{y}$. We use this MSE and not a naive MSE between the ground truth and the reconstructed label because the embedding is optimal only up to scaling and bias factors.
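Eq. 20 is an ordinary least-squares problem in $(\alpha, \beta)$ and therefore has a closed-form solution; a Python sketch (function name ours):

```python
import numpy as np

def scaled_mse(truth, embedding):
    """MSE after the optimal affine alignment of the embedding to the truth:
    min over (a, b) of mean((truth - (a*embedding + b))**2), cf. Eq. 20."""
    A = np.column_stack([embedding, np.ones_like(embedding)])
    (a, b), *_ = np.linalg.lstsq(A, truth, rcond=None)   # closed-form fit
    residual = truth - (a * embedding + b)
    return float(np.mean(residual ** 2)), float(a), float(b)
```

If the embedding differs from the truth only by scaling and bias, this MSE is (numerically) zero, which is exactly the invariance the measure is designed to provide.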

We also report Pearson’s correlation between the ground truth and the estimated embedding, to compare our method with other proposed algorithms in a scale-free manner.

5 Results and analysis

5.1 Synthetic annotations

Figure 3: MSE as a function of the number of observed triplets $|\mathcal{S}|$, with constant and logistic noise in the triplet labels. Each point in the plots represents the mean over 30 random trials, while the shaded areas represent one standard deviation from the average MSE values.

Figure 3 shows the MSEs as a function of $|\mathcal{S}|$ for both synthetic experiments. For both constant and logistic noise in tasks A and B, we generally obtain better performance as the amount of noise in the triplet annotation process is reduced (larger $p$ or $\sigma$). This is not always true for the algorithms that pose non-convex loss functions (tSTE, CKL), where more noise sometimes generates better embeddings. We infer that these algorithms sometimes find better local minima under noisier conditions.

The MSE in Figure 3 typically becomes smaller as $|\mathcal{S}|$ increases. This is generally true for tSTE, STE, and CKL. GNMDS does not always produce a better embedding as the number of triplets employed increases.

We also note that the embedding in task B is easier to compute than that of task A. We observe two possible reasons for this: (1) task A has constant intervals while task B has none (and constant regions may be harder to compute under noisy conditions), and (2) the extreme values in task A seem harder to estimate, since they occur over very short intervals of time that are less likely to be sampled.

Overall, STE is the best-performing algorithm independent of noise or task. We note that tSTE approaches STE in many of the presented scenarios. In fact, the Student-t kernel used by tSTE approaches the Gaussian kernel of STE as the number of degrees of freedom grows, so these results are expected.

5.2 Mechanical Turk annotations

5.2.1 Annotator noise

In the Mechanical Turk experiments, 170 annotators annotated triplets in task A, and 153 in task B. To understand the difficulty of the tasks and the noise distributions of the annotators, we estimate the probabilities of success for both tasks using the top three annotators.

To estimate the success probabilities, we partition the triplets into intervals of $\Delta_t = |d(i,k) - d(i,j)|$ containing equal numbers of triplets. For each interval, we compute the average distance difference of its triplets. For each triplet $t$, we know the outcome (realization) $w_t$ of the random variable $W_t$, since we know the hidden construct $\lambda$. We assume that the success probability is constant within each interval, so we use the maximum likelihood estimator of the success probability for each interval:

$\hat{p} = \dfrac{\text{number of correctly labeled triplets in the interval}}{\text{number of triplets in the interval}}.$    (21)
In Figure 4, we show the estimated $\hat{p}$ as a function of $\Delta_t$ for each of the three annotators with the most answered queries and compare it to a fitted logistic function. The agreement between the estimated probabilities of success and the logistic function shows that this is a very good noise model for this annotation task, and tells us that we should expect the best results from STE when computing the embedding from the crowd-sourced triplet annotations. Notably, our initial assumption of an annotator-independent noise model is verified.
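The binned maximum likelihood estimator just described can be sketched in a few lines of Python (function name ours); on simulated logistic-noise labels, the per-bin success rates trace out the underlying link function:

```python
import numpy as np

def estimate_success_probabilities(deltas, correct, n_bins=10):
    """Estimate p-hat as a function of delta = |d(i,k) - d(i,j)| by splitting
    the triplets into equally populated bins of delta and taking the per-bin
    fraction of correctly labeled triplets (the MLE for a constant p)."""
    order = np.argsort(deltas)
    bins = np.array_split(order, n_bins)             # equal-count partition
    centers = np.array([deltas[b].mean() for b in bins])
    p_hat = np.array([correct[b].mean() for b in bins])
    return centers, p_hat
```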

5.2.2 Mechanical Turk embedding

We present in Figure 5 the results for the reconstructed embeddings using triplets generated by annotators via Mechanical Turk, obtained with 0.25% and 0.5% of the total number of triplets for each task. Although there is some visible error, we are able to capture the trends and overall shape of the underlying construct with only 0.5% or less of all possible triplets for both tasks. We also plot a version of the label fused with EvalDep from the continuous-time annotations of [6] (Figure 1), scaled according to Eq. 20.

Task  Fusion technique   MSE      Pearson's correlation
A     EvalDep            0.00489  0.906
      Proposed (0.25%)   0.00145  0.973
      Proposed (0.50%)   0.00132  0.975
B     EvalDep            0.00304  0.969
      Proposed (0.25%)   0.00305  0.969
      Proposed (0.50%)   0.00285  0.971

Table 1: MSE and Pearson's correlation for the proposed method and the state-of-the-art continuous-time fusion technique (EvalDep) against the ground truth. For our method, percentages are with respect to the total number of triplets $|\mathcal{T}|$.
Triplet violations
Task  MTurk (0.25%)  MTurk (0.5%)  EvalDep
A  0.262
B  0.139

Table 2: Triplet violations for the Mechanical Turk experiment. Percentages correspond to the percentage of total triplets observed. We include the fraction of triplet violations as computed with the labels generated by EvalDep.

We show in Table 1 the MSE for each task, where percentages again represent the number of triplets employed. We observe that the MSE is lower when a larger number of labeled triplets is used. This is expected: we have more information about the embedding as we increase the number of triplets fed into the optimization routine, which produces a higher quality embedding. We also show a scale-free comparison through Pearson's correlation. This is important because Pearson's correlation captures how signals vary over time while neglecting differences in scale and bias. In both tasks A and B, our method improves upon previous work.

5.2.3 Triplet violations and annotator agreement

We display in Table 2 the number of triplet violations for each task. We record the true percentage of triplet violations according to our ground truth (generated using the true distances $d(i,j)$ and $d(i,k)$, as in Eq. 12) and compare it to the annotation responses. We also display the number of triplet violations according to the computed embeddings $\hat{y}$. We see that the percentage of triplet violations according to our ground truth and the triplet violations calculated from the embeddings are not the same, being overestimated in task A and underestimated in task B. We also observe that even though the number of violations increases in task A, the MSE is reduced with a higher number of triplets. This happens because a higher number of triplets defines the embedding more precisely.


Figure 4: Probabilities of success as a function of the distance between frames $j$ and $k$ from the reference frame $i$, for (a) task A and (b) task B. Only the top annotators are included.


Figure 5: Results for the Mechanical Turk annotations. The computed embeddings have been scaled to fit the true labels (Eq. 20). (a) Reconstruction for task A using 0.25% (23,526) and 0.5% (47,052) of the triplet comparisons. (b) Reconstruction for task B using 6,931 (0.25%) and 13,862 (0.5%) triplet comparisons. In both cases, the estimated green intensity takes values below zero due to scaling.

6 Discussion

Section 5 shows that it is possible to use triplet embeddings to find a 1-dimensional embedding that resembles the true underlying construct up to scaling and bias factors. There are several factors to consider for our proposed method.

Annotation costs

One of the challenging aspects of using triplet embeddings is the $O(n^3)$ growth of the number of unique triplets for $n$ objects or frames. As mentioned earlier, however, the results of [21] suggest that the number of triplets needed in theory scales as $O(n \log n)$. In our experiments, we obtain small MSEs and better approximations of the underlying ground truth than the state of the art while querying only 0.25% to 0.5% of the unique triplets for each task.

Embedding quality

The reconstructed embeddings are more accurate than those obtained with the method proposed in [7]. Moreover, no time-alignment is needed, since the annotation process does not suffer from reaction times. It is also important to note that sharp edges (high-frequency regions of the construct) are appropriately represented and do not get smoothed out, as they do with averaging-based annotation fusion techniques (where annotation devices such as mice or joysticks and their user interfaces perform low-pass filtering).

In terms of reconstruction, the scaling factor is an open challenge. We see two possible ways to work with the differences in scaling when the underlying construct is unknown: (1) learn the scaling in a machine learning pipeline that uses these labels to create a statistical model of the hidden construct, or (2) normalize the embedding and train the models using either the normalized labels or their derivatives. However, we note that continuous-time annotations suffer from the same loss of scaling and bias, since both techniques try to solve an inverse problem in which the scale is lost.

Feature sub-sampling for triplet comparisons

In the experiments of this paper, we sub-sample the videos to 1 Hz so that we have a manageable number of frames $n$. Down-sampling is possible due to the nature of the synthetic experiment we have created, but it may not be suitable for other constructs such as affect in real-world data, where annotating single frames might lose important contextual information. In these scenarios, further investigation is needed to understand how to properly sub-sample more complex annotation tasks.

7 Conclusion

In this paper, we present a new sampling methodology based on triplet comparisons to find continuous-time labels for hidden constructs. To study the proposed methodology, we use two experiments from [6] and show that it is possible to recover the structure of the underlying hidden signals both in simulation studies and when human annotators perform the triplet comparisons. These labels for the hidden signals are accurate up to scaling and bias factors.

Our method performs annotator fusion seamlessly as a union of the sets of queried triplets $\mathcal{S}_r$, which greatly simplifies the fusion compared to existing approaches that directly combine real-time signals. Moreover, our approach does not need post-processing such as time-alignment or averaging.

Some challenges for the proposed method include dealing with the annotation costs given the number of triplets that needs to be sampled, and also learning the unknown scaling and bias factors.

As future directions, we are interested in several paths. We believe it is necessary to study the proposed method for labeling constructs where no ground truth exists, as is the case for human emotions, and to examine the effects of comparing frames versus comparing short videos in the triplet annotation tasks.


  • [1] Roddy Cowie and Randolph R Cornelius. Describing the emotional states that are expressed in speech. Speech Communication, 40(1):5–32, 2003.
  • [2] Angeliki Metallinou and Shrikanth Narayanan. Annotation and processing of continuous emotional attributes: Challenges and opportunities. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pages 1–8. IEEE, 2013.
  • [3] Carlos Busso, Murtaza Bulut, SS Narayanan, J Gratch, and S Marsella. Toward effective automatic recognition systems of emotion in speech. Social emotions in nature and artifact: emotions in human and human-computer interaction, J. Gratch and S. Marsella, Eds, pages 110–127, 2013.
  • [4] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
  • [5] Soroosh Mariooryad and Carlos Busso. Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations. In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, pages 85–90. IEEE, 2013.
  • [6] Brandon M Booth, Karel Mundnich, and Shrikanth S Narayanan. A novel method for human bias correction of continuous-time annotations. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3091–3095. IEEE, 2018.
  • [7] Soroosh Mariooryad and Carlos Busso. Correcting time-continuous emotional labels by modeling the reaction lag of evaluators. IEEE Transactions on Affective Computing, 6(2):97–108, 2015.
  • [8] Meinard Müller. Dynamic time warping. Information retrieval for music and motion, pages 69–84, 2007.
  • [9] Fabien Ringeval, Florian Eyben, Eleni Kroupi, Anil Yuce, Jean-Philippe Thiran, Touradj Ebrahimi, Denis Lalanne, and Björn Schuller. Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data. Pattern Recognition Letters, 66:22–30, 2015.
  • [10] Rahul Gupta, Kartik Audhkhasi, Zach Jacokes, Agata Rozga, and Shrikanth Narayanan. Modeling multiple time series annotations based on ground truth inference and distortion. IEEE Transactions on Affective Computing, 2016.
  • [11] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
  • [12] Mihalis A Nicolaou, Stefanos Zafeiriou, and Maja Pantic. Correlated-spaces regression for learning continuous emotion dimensions. In Proceedings of the 21st ACM international conference on Multimedia, pages 773–776. ACM, 2013.
  • [13] Feng Zhou and Fernando De la Torre. Canonical time warping for alignment of human behavior. In Advances in neural information processing systems, pages 2286–2294, 2009.
  • [14] Feng Zhou and Fernando De la Torre. Generalized time warping for multi-modal alignment of human motion. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1282–1289. IEEE, 2012.
  • [15] George Trigeorgis, Mihalis A Nicolaou, Bjorn W Schuller, and Stefanos Zafeiriou. Deep canonical time warping for simultaneous alignment and representation learning of sequences. IEEE Transactions on Pattern Analysis & Machine Intelligence, (5):1128–1138, 2018.
  • [16] Brandon M Booth, Karel Mundnich, and Shrikanth Narayanan. Fusing annotations with majority vote triplet embeddings. In Proceedings of the 2018 Audio/Visual Emotion Challenge and Workshop, pages 83–89. ACM, 2018.
  • [17] Neil Stewart, Gordon D. A. Brown, and Nick Chater. Absolute Identification by Relative Judgement. Psychological Review, 112(4):881–911, 2005.
  • [18] Georgios Yannakakis and John Hallam. Ranking vs. preference: a comparative study of self-reporting. Affective Computing and Intelligent Interaction, pages 437–446, 2011.
  • [19] Georgios N Yannakakis and Héctor P Martínez. Ratings are overrated! Frontiers in ICT, 2:13, 2015.
  • [20] Mark A Davenport, Yaniv Plan, Ewout Van Den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189–223, 2014.
  • [21] Lalit Jain, Kevin G Jamieson, and Rob Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Advances in Neural Information Processing Systems, pages 2711–2719, 2016.
  • [22] Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1–6. IEEE, 2012.
  • [23] J. Bezanson, A. Edelman, S. Karpinski, and V. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
  • [24] Kevin G Jamieson, Lalit Jain, Chris Fernandez, Nicholas J Glattard, and Rob Nowak. NEXT: A system for real-world development, evaluation, and application of active learning. In Advances in Neural Information Processing Systems, pages 2656–2664, 2015.
  • [25] Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David Kriegman, and Serge Belongie. Generalized non-metric multidimensional scaling. In Artificial Intelligence and Statistics, pages 11–18, 2007.
  • [26] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pages 673–680, USA, 2011. Omnipress.
  • [27] Yoshikazu Terada and Ulrike von Luxburg. Local ordinal embedding. In International Conference on Machine Learning, pages 847–855, 2014.