The Ambiguous World of Emotion Representation

09/01/2019 ∙ by Vidhyasaharan Sethu, et al. ∙ University of Michigan ∙ University of Southern California ∙ UNSW ∙ Universität Augsburg ∙ The University of Texas at Dallas

Artificial intelligence and machine learning systems have demonstrated huge improvements and human-level parity in a range of tasks, including speech recognition, face recognition and speaker verification. However, these diverse tasks share a key commonality that does not hold in affective computing: the ground truth information that is inferred can be unambiguously represented. This observation provides some hints as to why affective computing, despite having attracted the attention of researchers for years, may still not be considered a mature field of research. A key reason for this is the lack of a common mathematical framework to describe all the relevant elements of emotion representations. This paper proposes the AMBiguous Emotion Representation (AMBER) framework to address this deficiency. AMBER is a unified framework that explicitly describes categorical, numerical and ordinal representations of emotions, including time-varying representations. In addition to explaining the core elements of AMBER, the paper also discusses how some of the commonly employed emotion representation schemes can be viewed through the AMBER framework, and concludes with a discussion of how the proposed framework can be used to reason about current and future affective computing systems.


1 Introduction

Emotions are an essential part of natural human-computer interaction (e.g. [9, 22, 72]). There is considerable potential to augment technologies by leveraging information about users’ emotions [55], for a range of applications from improving human communication [38] to health care [56]. However, current emotion recognition systems use models of emotions that are restricted, compared with those that we humans use, which fundamentally limits their utility. For example, consider the following scene: a user has had a rough day and upon arriving at home sarcastically interacts with a virtual assistant. The virtual assistant, however, is not equipped for the full range of human sarcasm, and it fails to react “appropriately.” By contrast, a human listener, whose mental model of emotions is much richer, would be capable of recognising sarcasm or, if the expression truly is ambiguous, would respond by eliciting more information to help resolve the ambiguity. The essential challenge is that emotions are complicated and current methods of mathematically describing emotions often do not take the full complexity of the signal into account. This manuscript presents and motivates the AMBiguous Emotion Representation (AMBER) framework, which is explicitly designed to describe and reason about emotion representations, including the ambiguity that is a natural part of emotional displays.

Current affective computing systems generally represent emotions with categorical labels (e.g., angry, happy) or as values on numerical scales (e.g., arousal/activation - calm vs. excited; valence - negative vs. positive) [28, 68]. There has also been interest in ordinal representations that acknowledge the inherently ordinal characteristic of emotion perception [44, 75, 76] (e.g., medium valence on a scale of {low, medium, high}). However, all three types of labels are single-valued point estimates that by their very nature cannot quantify the ambiguity present in emotions, both expressed and perceived.

There are several examples of emotion representations that acknowledge this problem. For example, Vidrascu and Devillers proposed the notion of blended emotions, represented using both a major and a minor categorical label [71], and a similar two-category approach without the major/minor distinction has also been proposed [70]. The emotion profile representation is a more general approach that describes emotion in terms of the numerical intensity of a finite number of emotion primitives [27, 49], and other authors have similarly noted that discrete emotions with intensities should be considered [1]. The entire notion of annotator agreement, which is pervasive in descriptions of affective datasets, implicitly acknowledges the ambiguity in emotion perception among any group of annotators. This ambiguity is a central feature in emotion recognition, from problem formulation to system evaluation.

Research in emotion has, thus far, naturally incorporated the assumptions from these domains: variability among annotators is undesirable and should be treated as noise [11], averaged to produce a point estimate [69], or removed to arrive at a dominant emotion label [50]. However, concepts of annotation noise, averaged annotations or dominant emotions are at best reductionist assumptions that may be very unrealistic in many practical instances. Emotion representations need to be capable of reflecting the diversity of human annotation, due to the inherently subjective nature of affective experiences, both while expressing and perceiving emotions.

There are relatively few examples of emotion representations in affective computing to date that can represent this diversity. Existing efforts include soft labels where a numerical intensity is associated with each emotion category [79, 23, 37, 49, 39], inclusion of emotion distributions [34, 73, 78], and the use of confidence [66]. Despite this, a single-valued average emotional label has almost always been adopted as the ground truth representation, and modelling the ambiguity as expressed by inter-observer agreement levels within automatic affect analysers remains a noted challenge in the field [29]. The essential challenge is that current emotion representations, and indeed machine learning approaches employed in affective computing in general, lack the capability to deal adequately with subjective quantities such as emotions, where the information to be modelled and inferred may be inherently ambiguous. It should be noted at this stage that we restrict the use of the term ambiguity (in this paper) to refer to the challenge arising from subtlety in expression and differences in perception. We also make the distinction between ambiguity and uncertainty, a term we use to indicate impreciseness of machine learning models trained on a finite dataset.

This paper explores explicit and implicit assumptions that are inherent to common emotion representation schemes (Section 2) and the ability of such schemes to represent ambiguity and uncertainty (Section 3). It also identifies desirable characteristics of emotion representation schemes, and proposes a new general framework for generating and comparing representations, termed AMBER, which is explicitly designed to accommodate ambiguity and uncertainty in all contexts (Section 4). The paper also shows examples of how AMBER can unify existing emotion representations, reason about them, and expand them to incorporate ambiguity and uncertainty into affective technologies.

2 Current Uses of Affective Computing

Emotion representations are widely used in a range of Artificial Intelligence (AI) based technologies. Example technologies include: affect-aware Human-Computer Interaction (HCI) and Human-Machine Interface (HMI) frameworks; supportive technologies in remote education and healthcare systems; and intelligent business and customer service systems [31, 32]. These technologies can be broken down into two core applications: (1) emotion perception and (2) emotion synthesis and conversion. The first requires an emotion representation to interpret human emotions. The second requires an emotion representation for the generation of emotion. In the remainder of this section we provide an overview of existing technologies that use affective computing.

Healthcare is a growing area of affective computing. Affective technologies can aid in diagnosis (e.g., providing objective markers for mental health conditions such as depression [17, 53]) and symptom severity tracking (e.g., tracking mania and depression symptoms for populations with bipolar disorder [35, 45]). Affective technologies can also provide the emotional reasoning that enables embedded agents – virtual or robotic – to provide support services or in cybertherapy settings [5, 48, 59].

Humans have a tendency to anthropomorphize computers and related technologies, displaying social attitudes and behaviours towards them. Agents with greater empathic accuracy are, therefore, considered to be more effective at comforting users, even if this improved accuracy is at the cost of restricting user input behaviours [6]. However, in order to achieve this improved empathic response without restricting user inputs, affective reasoning systems must have access to richer models of emotion that take ambiguity into account. Consider affect-aware learning technologies, technologies that recognise and react to a user’s current emotional state to create emotionally supportive learning environments [10, 21]. The creation of such environments presupposes the ability to operate in naturalistic settings, and adapt learning tasks using fail-soft paradigms, which ensure the learning process is not impaired by incorrectly recognising the user’s emotional state [21]. Fundamentally, they require the ability to characterise the emotional richness, the emotional ambiguity, of the learning interaction.

Internet-of-Things (IoT) applications must also operate in naturalistic and ambiguous environments. These applications often track discrete emotion representations in binary or triadic classification paradigms, such as positive affective state recognition, happiness recognition, and boredom recognition [58]. IoT applications generally require computation on mobile platforms, which limits the computational complexity of the algorithms. However, with increases in the computational power of mobile devices, the feasibility of complex emotion recognition has been greatly enhanced.

Affective Gaming (AG) uses affective measurements to enhance the gaming experience [12, 77]. Video games often aim to elicit emotions through gameplay characteristics such as story-lines, characters and their (perceived) personalities, and video and music effects. AG takes this one step further by aiming to detect the emotional reactions of the player and using these reactions to adjust gameplay. This task is challenging as gaming generally does not occur in controlled laboratory environments. Yet, the constraints that are present in gaming offer an opportunity to collect rich affective data that include self-annotation, which can enable semi-supervised and active learning paradigms [12]. In this domain, knowing the exact emotional state is less important than in other domains such as affective tutoring; what matters are emotional changes. Furthermore, players are more willing to be in negative emotional states – frustration, fear, anxiety – if it enhances their overall gaming experience [77].

However, despite the prolific use of affective computing in existing technology, most affective technologies are not designed to handle ambiguity. Consequently, they cannot reason about, and react appropriately to, affect that is ambiguous. This is critically important. Affective computing technologies will not be adopted if they cannot correctly reason over the full span of human emotion expression, or if they seek to make concrete judgements about underlying affect when absolute judgement is neither warranted nor possible.

Figure 1: An illustration of the Brunswik functional lens model depicting the expression and perception of emotions and the distinction between ambiguity and uncertainty, both of which lead to variance in the output of automatic emotion recognition systems.

3 Description and Limitations of Existing Emotion Representations

Affective representations must be grounded in psychological theories of emotion expression and perception in order to circumvent the limitations discussed thus far. This section provides an overview of the common psychological theories of emotion (Section 3.1) and standard methods of implementing these theories (Section 3.2).

3.1 Theories of Emotion

There is no universally accepted theory of emotions. Several different theoretical accounts of emotions exist, although these are often fragmented and focus on distinct aspects (e.g., biological, psychological, philosophical).

The categorical label representation for emotional state, popular in emotion recognition, has some foundation in theory going back to the Darwinian perspective of universality of basic emotions (such as the six – happiness, sadness, anger, fear, surprise, and disgust – identified by Ekman as realised in facial expressions). Various emotion theory researchers have proposed lists of emotion categories (i.e., linguistic labels), typically with fewer than 20 categories, offering plausible representations for computing; a summary of the leading proposals can be found in Cowie and Cornelius [13]. Circumplex models offer a way of capturing conceptual similarity between emotion categories, represented in a circular visual arrangement with proximally situated categories being similar to one another [57, 60].

The dimensional representation of emotional states, which also has a rich history in the psychology literature, is especially suitable for continuous graded representations, especially for human ratings. An early account by Schlosberg [64, 65] considered graded representations of judgments of facial expressions and led to the proposal of a two-dimensional axis (pleasantness and attention). The later proposal of Mehrabian and Russell [47] offers a three-dimensional representation – pleasure, arousal and dominance. These dimensions have been based on factor analysis of experimental data. Other alternative proposals include those from Scherer [62], along the two dimensions of positive/negative evaluation and activity, and the positive affect and negative affect dimensions of Watson and Tellegen [74].

The classic review by Cowie et al. [15], especially geared toward computational approaches, summarises a two-dimensional, circular space. One of its axes corresponds to the evaluation (from negative to positive) dimension, interpreted from within cognitive emotion theory as a simplified description of appraisal [62], with valence of stimuli considered fundamental to all appraisals [52]. The other axis is the activation (from passive to active) dimension, with theoretical inspiration drawn from action tendency theory [24]. It should be noted that the names given to these dimensions vary somewhat across the different theoretical accounts, with the terms activation, arousal, and activity used interchangeably; similarly valence, pleasure, and evaluation; and dominance and control. In this paper we will use the terms arousal and valence. Such multi-dimensional descriptions have been shown to offer a greater level of generality across a range of studies, allowing the intensity of emotions, including changes over time, to be described. These properties are necessary for an analysis of variability in emotion expression for both recognition [27] and synthesis [67].

Another important aspect lies in relating these (internal, latent) emotional state representations to observable signals in human communication and interaction. An adapted version of the Brunswik functional lens model [63] offers a conceptual framework for representing expression and perception in interpersonal interaction. Within this framework, the behaviour (e.g., vocal, visual) of a target speaker forms the basis of perceivers’ judgements about the target’s true latent state (e.g., emotion). This model provides a theoretical framework that describes the expression, communication, and perception process. In this model, there are at least two participants: the speaker (e.g., left-most in Figure 1) and the perceiver(s) (e.g., right-most in Figure 1). The speaker produces and encodes an emotional message as distal indicators. The distal indicators may be either unimodal (e.g., the voice) or multimodal (e.g., the face, voice, body gestures, etc.) and are then transmitted to a perceiver. The transmitted information is referred to as proximal percepts. The proximal percepts are decoded by the perceiver, who then makes a perceptual judgment.

In summary, models of human annotation of perceived emotions can target emotion state representations, and access them through their observable expressions guided by a rich history of established emotion theories.

3.2 Computational Representations of Emotion

From the perspective of automatic emotion recognition, the focus has been largely on seeking representations of emotions (especially in defining constructs that are the targets for computation) that can be derived from observations of emotional expressions. These have been guided by theory but often motivated by the ultimate practical application of the recognition technology (e.g., detecting frustration), especially given that there is no broadly agreed upon set of representations, nor agreement on how they are behaviourally realised across socio-cultural contexts. One of the key challenges in defining a theoretically-grounded construct for computation is the wide range of variation in emotion episodes, from full-blown emotions to varying shades of emotional colouring. The term “emotional state” (and emotion-related state) proposed by Cowie and Cornelius [13] is especially apt for modelling emotions. The question then relates to what constitute plausible emotional representations.

Emotion representations are inherently tied to the time scale that they describe. Common time-scales include conversation-level, speaker turn-level, and sentence-/utterance-level. Studies have used shorter lexical units to increase the temporal resolution (e.g., FAU AIBO corpus was annotated at the word level [4]). Dimensional labels are annotated either over discrete segments or in a time-continuous manner. Categorical labels are often annotated over discrete segments of speech. One exception is the SEMAINE database, which used time-continuous traces for categories such as anger, disgust, happiness and amusement [46]. However, these categories were not consistently annotated for all recordings.

In the remainder of this section, we discuss the collection paradigms for the common emotion representations and their underlying challenges.

3.2.1 Categorical Labels

Categorical labels are commonly collected using multiple choice surveys. This provides a discrete set of labels.

One of the challenges associated with categorical labelling strategies is the process of obtaining the labels. Surveys are deployed with lists of emotion classes. Yet, the inclusion of certain classes in the survey biases the responses, forcing evaluators to select a specific class [61]. Consequently, the class other may be included, providing evaluators the opportunity to list other categories.

A second challenge with categorical labelling is the large number of overlapping emotional states. For example, Cowie and Cornelius [13] listed 38 emotional states. But, these terms are not mutually exclusive [39] (e.g., happiness and excitement). Studies have addressed this challenge by proposing primary and secondary emotions. The primary emotion corresponds to the prominent emotional class, and the secondary emotion(s) include all the other affective states perceived in the stimulus [8, 41, 20, 13]. Lotfian and Busso [40] demonstrated that leveraging secondary emotions can lead to improvements in classification performance in emotion recognition evaluations.

3.2.2 Numerical Labels

Valence and arousal are the most popular dimensions used in affective computing. They are commonly collected using Likert-type scales. A common survey includes self-assessment manikins (SAMs), non-verbal pictorial representations describing different levels of the emotional attributes.

There is an inherent relationship between categorical and numerical descriptions of emotion. Figure 2 visualises this relationship using sentences from the MSP-Podcast corpus [41], labelled with categorical emotions, in the arousal-valence space (the corpus was annotated with both numerical and categorical labels). The figure shows that sentences labelled with the same emotional category often span broad areas of the arousal-valence space. Using emotional attributes provides a powerful representation to quantify the within-class variability of categorical emotions (e.g., different shapes of happiness).

One of the challenges associated with numerical labelling is most salient for time-continuous annotations. In this case, evaluators perceive the emotional content in real time and respond by moving the cursor of a graphical user interface (GUI) to the perceived value. Examples of toolkits for time-continuous evaluation are Feeltrace [14], Gtrace [16], and CARMA [25]. The advantage of this approach is the higher temporal resolution of the emotional annotations, which provides local information to identify emotional segments. A disadvantage of this approach is the reaction lag between the emotional content and the annotations provided by the evaluators, which can be higher than three seconds [51]. Fortunately, there are algorithmic solutions to compensate for this reaction lag using both static delays [42, 43] and delays that vary as a function of the acoustic space [36].
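To make the static-delay compensation concrete, the sketch below simply advances the annotation trace relative to the feature stream by an assumed constant reaction lag; the 2 s lag and 25 Hz frame rate are illustrative values, not figures reported in [42, 43, 51].

```python
import numpy as np

def compensate_static_lag(features, annotations, lag_s=2.0, rate_hz=25.0):
    """Align a time-continuous annotation trace with its feature stream by
    removing an assumed constant annotator reaction lag.

    features:    array of shape (T, D), one feature vector per frame
    annotations: array of shape (T,), one annotation value per frame
    lag_s:       assumed reaction lag in seconds (illustrative value)
    rate_hz:     frame rate shared by both streams
    """
    lag_frames = int(round(lag_s * rate_hz))
    if lag_frames == 0:
        return features, annotations
    # The annotation at frame t is taken to describe the content at frame
    # t - lag_frames, so advance the annotations and trim both streams.
    aligned_annotations = annotations[lag_frames:]
    aligned_features = features[:len(aligned_annotations)]
    return aligned_features, aligned_annotations

# Example: 60 s of 25 Hz frames with synthetic data.
rng = np.random.default_rng(0)
feats, annos = rng.normal(size=(1500, 10)), rng.normal(size=1500)
f, a = compensate_static_lag(feats, annos)
print(f.shape, a.shape)  # (1450, 10) (1450,)
```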

A second challenge is that valence and arousal alone are not enough to fully characterise all emotional behaviours. For example, different emotional categories such as anger and fear present similar attribute values (i.e., low valence, high arousal). Figure 2 also illustrates this observation. However, many affective computing datasets contain very small amounts of these overlapping emotions (e.g., fear), which has lessened the computational focus on this problem.

Figure 2: Categorical labels of sentences from the MSP-Podcast corpus on the arousal-valence space. The categorical labels are placed at the centroids of their sentences. The figure illustrates the relationship between categorical and numerical labels.

3.2.3 Ordinal Labels

Ordinal labels have a well-defined order but no notion of distance between the labels. They result from direct comparisons between two or more stimuli. This approach can be implemented with either attributes (e.g., is video one more positive than video two?) or with categorical classes (e.g., is video one happier than video two?). Several studies have argued that comparing two or more stimuli is more reliable than separately assigning absolute scores or labels to a given stimulus. Yannakakis et al. [76, 75] presented arguments supporting the ordinal nature of emotions, providing clear evidence across domains within affective computing. Their conclusion was that ordinal labels increase the validity and reliability of affective labels, leading to better models.

Ordinal labels are often derived from categorical or numerical labels. However, they can also be directly obtained with perceptual evaluations. AffectRank [75] is an example of a toolkit to directly annotate ordinal labels. Ordinal labels can be directly used to train preference learning models such as RankNet [7] and RankSVM [33].
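As a concrete illustration of how such ordinal labels feed into preference learning, the following sketch uses the standard RankSVM-style reduction: pairwise difference vectors labelled by the direction of preference are fed to a linear classifier. The features and preference pairs are synthetic stand-ins, not data from any corpus cited here.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic per-segment features and ordinal preferences (i preferred over j),
# e.g. "segment i was judged more positive than segment j".
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 8))               # 20 segments, 8 features each
pairs = [(0, 3), (5, 2), (7, 9), (11, 4)]  # illustrative (preferred, other) pairs

# RankSVM-style reduction: classify the sign of feature differences.
diffs, targets = [], []
for i, j in pairs:
    diffs.append(X[i] - X[j]); targets.append(1)
    diffs.append(X[j] - X[i]); targets.append(-1)

ranker = LinearSVC(C=1.0).fit(np.array(diffs), np.array(targets))

# The learned weight vector induces a ranking score for unseen segments.
scores = X @ ranker.coef_.ravel()
print(np.argsort(-scores)[:5])  # indices of the five highest-ranked segments
```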

3.3 Limitations

As outlined in this section, a number of schemes for representing emotions have been proposed, each with a different set of pros and cons. The lack of consensus on a single scheme and the sheer number of schemes that have been employed reflect the lack of a settled theory of emotions, the complexity of emotions, and the emphasis on differing properties of emotions in different studies. A major stumbling block to efforts to converge on fewer but more powerful representations is the lack of a common framework within which current schemes can all be compared and the implicit assumptions in each of them examined.

Additionally, as mentioned in section 1, most current work in affective computing employs emotion representation schemes that circumvent any ambiguity in perceived emotions. Even the few studies of ambiguity-aware emotion representations suffer from the lack of well-motivated methods for measuring their accuracy and comparing them with each other. This is once again due to the lack of a common description language.

4 Proposed AMBER Framework

As outlined in section 3, a number of approaches have been employed to represent emotions. Broadly, they are all descriptions of emotions defined on one of three possible spaces: (a) the space of categorical emotion labels; (b) the space of numerical affect attributes; or (c) the space of ordinal affect labels. In this section we develop the AMBiguous Emotion Representation (AMBER) framework, a unified mathematical description within which the myriad emotion representations defined on any of these three spaces of emotion labels can be described. The AMBER framework can be thought of as broadly comprising two components: (a) a common representation scheme based on a set of suitable descriptors, which we refer to as attribute descriptors, that can describe all three possible spaces of emotion labels (categorical, numerical and ordinal); and (b) a function-based representation of emotion defined on the attribute descriptors to encode ambiguity.

4.1 Representation based on Attribute Descriptors

The benefits of a single framework that can mathematically describe all three kinds of emotion labels include the ability to: (1) reason about known emotion representation schemes; (2) compare them; and (3) identify implicit assumptions made when employing them. We begin with the observation that all three types of emotion labels can be described using a finite number of possible labels (e.g., one of the big six categorical labels) or a finite dimensional vector (e.g., a 2-dimensional vector comprising an arousal and a valence score). Based on this, we build a common representation framework on the notion of attribute descriptors, a finite number of which will be used to describe any space of emotion labels.

Mathematically, each attribute descriptor is simply a set on which we impose additional mathematical constraints to reflect the properties of the emotion label space of interest. Specifically, if we require $N$ attribute descriptors to describe an emotion label space, then each attribute descriptor, $e_i$, is defined as an ordered set of elements such that,

$e_i = \{\, x : \underline{e}_i \preceq x \preceq \overline{e}_i \,\}, \qquad i \in \{1, \ldots, N\}$   (1)

where the subscript $i$ denotes that $e_i$ is the $i^{\text{th}}$ attribute descriptor; $\underline{e}_i$ is the lowest element in the set (precedes all other elements); and $\overline{e}_i$ is the highest element in the set (is preceded by all other elements).

To illustrate how emotion attribute descriptors represent familiar quantities in common emotion representation schemes, consider two examples. First, we consider an example where emotional state is represented as a point on an arousal-valence plane. The first affect attribute, $e_1$, could denote valence and the second one, $e_2$, could denote arousal. If we assume that both valence and arousal vary between -1 and 1, then the lowest element in each set ($\underline{e}_i$) would typically be $-1$ and the highest element ($\overline{e}_i$) would be $+1$. All other elements of the sets are the real numbers between $-1$ and $+1$, i.e., $e_1 = e_2 = [-1, +1]$.

Second, we consider an example where emotional state is represented as one of a set of emotional categorical labels, such as the big six. Here, we could employ six affect attributes (i.e., $N = 6$) and each one, $e_i$, would denote one of the six possible categorical labels: happy, surprised, afraid, sad, angry and disgusted. Each attribute descriptor would then be a two-element set, $e_i = \{\underline{e}_i, \overline{e}_i\}$, such that $\underline{e}_i$ and $\overline{e}_i$ denote the absence and presence of the corresponding categorical label. A typical categorical label would then be a sparse set of attribute descriptors with only one of them indicating the presence of a categorical label (non-binary variations corresponding to blended emotion labels and mixtures are introduced in section 4.2 and examples are discussed in section 5.1).

In general, given a set of $N$ affect attributes, the emotion at time $t$ is represented by a set of $N$ elements, $E(t)$, as follows:

$E(t) = \{\varepsilon_1(t), \varepsilon_2(t), \ldots, \varepsilon_N(t)\}$   (2)

where each $\varepsilon_i(t) \in e_i$.

AMBER permits an explicit statement of the mathematical structure underlying emotion representations. This is important, because these statements, and the associated assumptions, are more often implicit. Consider a common treatment of the arousal-valence space (Figure 2): clustering via Euclidean distance. This treatment makes two assumptions: (1) that the appropriate metric is Euclidean and (2) that the two axes are orthogonal. These assumptions are not inherently incorrect, but they must be stated. The AMBER framework provides a language to do so. For example, the affect attributes $e_1$ and $e_2$, denoting valence and arousal, are both mapped to $[-1, +1]$ on the real number line; further, they are orthogonal to each other and the standard Euclidean distance metric exists in the space $e_1 \times e_2$.
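A minimal sketch of how the two example label spaces above might be encoded in code, using the notation introduced in this section; the data structures and the Euclidean distance are illustrative assumptions rather than an official AMBER implementation.

```python
import numpy as np

# (a) Dimensional scheme: two attribute descriptors (valence, arousal) mapped to [-1, 1].
dimensional_descriptors = {
    "valence": (-1.0, 1.0),   # (lowest element, highest element)
    "arousal": (-1.0, 1.0),
}

def euclidean_distance(label_a, label_b):
    """Distance between two single-valued labels; valid only if we *assume*
    a Euclidean metric and orthogonal valence/arousal axes (see text)."""
    a = np.array([label_a["valence"], label_a["arousal"]])
    b = np.array([label_b["valence"], label_b["arousal"]])
    return float(np.linalg.norm(a - b))

# (b) Categorical scheme: six two-element descriptors {absent, present}.
big_six = ["happy", "surprised", "afraid", "sad", "angry", "disgusted"]
single_label = {cat: (cat == "happy") for cat in big_six}   # sparse: one category present

print(euclidean_distance({"valence": 0.4, "arousal": 0.2},
                         {"valence": -0.3, "arousal": 0.6}))
print(single_label)
```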

4.2 Ambiguity Awareness in the Proposed Framework

The representation given by equation (2) can describe emotion representation schemes employing any of the three types of emotion label spaces. However, it is restricted to single-valued representations only. This restriction can be overcome by removing the constraint that the representation should be given by only one element chosen from each attribute descriptor set, $e_i$. Specifically, we can extend it by allowing the representation to be a set of functions, one per attribute descriptor, each of which denotes a combination of multiple elements of that attribute descriptor set.

Let us consider the emotion profile representation [49] as an example to illustrate the use of a suitable ambiguity-aware representation. This scheme represents the emotion in terms of a finite set of categorical labels (Angry, Happy, Neutral and Sad) but, instead of picking one of them, the emotion is represented as the set of probabilities of each one being the right label. When this scheme is described using the proposed framework, the emotion label is a set of four probabilistic measures (one for each attribute descriptor) that indicate the probability of presence of the corresponding emotion category, where each attribute descriptor, $e_i$, corresponds to one of the four categorical labels. Each of these sets comprises two elements denoting the presence or absence of that emotion category.

Mathematically, this proposed extension, which allows ambiguity in the representations to be quantified, defines the representation of the emotional state, $E(t)$, as

$E(t) = \{\phi_1(\cdot, t), \phi_2(\cdot, t), \ldots, \phi_N(\cdot, t)\}$   (3)

where $\phi_i(\cdot, t) : e_i \to \mathbb{R}^{+}$ denotes a function that associates each element of $e_i$ with a positive real number at time $t$. We can compare two labels, $x$ and $y$, at time $t$ using this function. If $\phi_i(x, t) > \phi_i(y, t)$, for $x, y \in e_i$, we can say that it is more likely that $x$ represents the emotion at time $t$ compared to $y$. This function is herein referred to as the ambiguity function. Examples of emotion representation schemes that take into account ambiguity in labels, and how they may be viewed in terms of the framework introduced here, are discussed in section 5.

This extension in (3) generalises equation (2) by relaxing the assumption that a single label per attribute descriptor adequately represents perceived emotion, instead assigning a level of certainty to every possible label. Specifically, the emotion representation given by equation (2) can be seen as equivalent to a special case of that given by equation (3) where,

$\phi_i(x, t) = \begin{cases} 1, & x = \varepsilon_i(t) \\ 0, & \text{otherwise} \end{cases}$   (4)

i.e., emotion representations of the form indicated by equation (2) make the assumption that there is no ambiguity about the perceived emotional state, which is clearly known to be incorrect as discussed in section 1. An illustration of this distinction in the context of categorical labels is shown in Figure 3.
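The distinction between equations (3) and (4) can be made concrete with a small sketch: an ambiguity-aware categorical label stores an ambiguity value for every category, while the conventional single label is the degenerate case that places all mass on one category. The category names and probabilities below are illustrative.

```python
# Ambiguity-aware categorical label: one ambiguity value per category (cf. eq. (3)).
ambiguous_label = {"angry": 0.55, "happy": 0.05, "neutral": 0.30, "sad": 0.10}

# Conventional single-valued label: the special case of eq. (4),
# with all mass on a single category and none on the others.
single_valued_label = {"angry": 1.0, "happy": 0.0, "neutral": 0.0, "sad": 0.0}

def more_likely(label, cat_a, cat_b):
    """Compare two categories under an ambiguity function (cf. eq. (3))."""
    return cat_a if label[cat_a] > label[cat_b] else cat_b

print(more_likely(ambiguous_label, "angry", "neutral"))      # angry
print(more_likely(single_valued_label, "angry", "neutral"))  # angry
```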

5 Reinterpreting Common Representations through an AMBER lens

As mentioned in section 3, a number of methods have been employed in the literature to represent emotions. This section will illustrate how all of these methods can be reinterpreted in terms of the proposed AMBER framework. This in turn will allow the explicit as well as implicit assumptions behind these methods to be viewed and compared within a common mathematical framework and, consequently, allow us to assess their suitability to various contexts.

5.1 Categorical Labels

We will begin by evaluating the categorical labelling scheme, where a single label is assigned to an interval of interest, through the lens of AMBER. First, each emotion dimension, $e_i$, denotes a different emotion category (e.g., $e_1$ denotes Anger, $e_2$ denotes Happy, etc.), with each emotion dimension given by a two-element set,

$e_i = \{\underline{e}_i, \overline{e}_i\}$   (5)

where $\underline{e}_i$ and $\overline{e}_i$ denote the absence and presence of the emotion category respectively.

Second, the emotion labels are not time-varying within the intervals of interest. For example, when an utterance is labelled as Happy it is generally assumed that this emotion category applies at all times within the entire interval of interest, i.e.,

$\varepsilon_i(t) = \varepsilon_i, \qquad \forall\, t \text{ within the interval of interest}$   (6)

Finally, in many cases there is an implicit assumption that the categorical emotion labels are chosen from a finite set of possible emotion dimensions (categories) and that these are all mutually exclusive (see Box 1 for analysis of a representation scheme where this assumption is relaxed), i.e.,

$\varepsilon_i = \overline{e}_i \;\Rightarrow\; \varepsilon_j = \underline{e}_j, \qquad \forall\, j \neq i$   (7)

It is clear when viewed through the AMBER framework that all three assumptions impose well-defined mathematical restrictions on the emotion representation scheme. But the framework also suggests how these assumptions may be relaxed to obtain more generalised representations. For instance, relaxing the constraint given by equation (7) that the emotion dimensions are mutually exclusive, and allowing more than one dimension to take on the value $\overline{e}_i$ (presence), opens up the path for categorical emotion labelling schemes where labels can comprise multiple emotion categories for a single time interval.

Box 1: Blended Emotions

The notion of ‘blended emotions’ as described by Vidrascu and Devillers is one such method where emotions are represented using both a ‘major’ and a ‘minor’ categorical label [71]. Here, up to two emotion dimensions may take on the value $\overline{e}_i$ (presence) and some ambiguity is allowed. Specifically, it can be seen that the ‘blended emotions’ representation may be given as

$E = \{\phi_1(\cdot), \phi_2(\cdot), \ldots, \phi_N(\cdot)\}$   (8)

with three additional constraints. Firstly,

$\sum_{i=1}^{N} \phi_i(\overline{e}_i) = 1$   (9)

where the assumption allows $\{\phi_i(\overline{e}_i)\}_{i=1}^{N}$ to be treated as a probability distribution without any loss of generality. Secondly,

$\sum_{i=1}^{N} \big[\, \phi_i(\overline{e}_i) > 0 \,\big] \leq 2$   (10)

where $[\cdot]$ denotes the Iverson bracket, i.e., $[P]$ takes the value $1$ if $P$ is true and the value $0$ if $P$ is false. This condition constrains the representation to at most two labels. Finally, each non-zero $\phi_i(\overline{e}_i)$ takes on one of two values, i.e., $\phi_i(\overline{e}_i) \in \{p_{maj}, p_{min}, 0\}$ with $p_{maj} > p_{min}$ and

$\sum_{i=1}^{N} \big[\, \phi_i(\overline{e}_i) = p_{maj} \,\big] = \sum_{i=1}^{N} \big[\, \phi_i(\overline{e}_i) = p_{min} \,\big] = 1$   (11)

This condition ensures that one of the two labels denotes the major emotion and the other the minor emotion.

The emotion profile representation [49] is a more general categorical emotion representation which makes fewer implicit assumptions, as can be seen when viewed within the AMBER framework. Specifically, the emotion profile representation drops the constraint given by equation (10), i.e., it represents emotions as a set of ambiguity measures, each corresponding to a different categorical emotion label.

The most general emotion representation that can be defined within a categorical label space within the AMBER framework is depicted in Figure 3. This is essentially a set of measures of ambiguity of a finite number of emotion categories that varies with time.

Figure 3: A depiction of the differences between ambiguous and non-ambiguous categorical emotion representations within the proposed AMBER framework: (A) shows an ambiguity-aware label at a specific time ($t$) given by the set of probabilities of the presence of each of the four possible emotion category labels; (B) shows that assuming a single categorical label is akin to complete certainty in that category.

5.2 Numerical Attribute Descriptors

Numerical attribute descriptors are probably the most widely used emotion representation schemes today in the field of affective computing. They can be described within the proposed framework by introducing a notion of distance, $d_i(\cdot, \cdot)$, between pairs of elements within each attribute descriptor set, $e_i$, each of which represents one of the numerical attributes of interest. Specifically, the bounded ordered set $e_i$ is mapped to a closed interval on the real number line via a suitable function $f_i : e_i \to \mathbb{R}$. After mapping the emotion into a number, we can adopt the natural metric on real numbers as the distance between emotions ($x, y \in e_i$), i.e.,

$d_i(x, y) = \big|\, f_i(x) - f_i(y) \,\big|$   (12)

This converts the ordered set (denoting arousal, valence, dominance, etc.) to a numerical scale, and commonly the interval is chosen as $[-1, +1]$. Disagreements concerning the representation of emotional state as perceived by different annotators can arise from differences in the perceived emotional state (and/or from differences in their mapping of emotional state to numerical attribute score - see Box 2). This disagreement in turn may reflect important information, such as the level of difficulty in perceiving the emotional state, and condensing the annotations to a single-valued emotion representation (i.e., when equation (4) of the AMBER framework holds) does not capture this information.

Box 2: Annotator Transcription Normalisation

Some of the implications of adopting a numerical attribute emotion representation scheme become much more explicit when viewed within the AMBER framework. Namely, this scheme typically assumes that different annotators transcribe the perceived emotion attributes identically, i.e., that the mapping $f_i$ is identical for all annotators. If this assumption does not hold (and there is no obvious reason why it should), then it might be prudent to normalise for differences between annotators in terms of how they transcribe. Hypothetically, such a normalisation may be viewed as an attempt to learn annotator-specific maps, $f_i^{(k)}$, corresponding to the $k^{\text{th}}$ annotator, and modify them to obtain normalised labels. The authors are not aware of any work to date that explicitly attempts this normalisation. However, an interesting approach to tackle the differences between annotators was reported by Grimm and Kroschel, whereby confidence in each annotator was estimated and the confidence measures were employed as weights when estimating a single-valued label as a weighted sum of the individual labels [26].
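A minimal sketch of the confidence-weighting idea attributed to Grimm and Kroschel [26]: here each annotator's weight is taken as the correlation between their trace and the mean of the other annotators, and the single-valued label is the weighted average. The exact weighting in [26] may differ; this is an assumed, illustrative instantiation.

```python
import numpy as np

def confidence_weighted_label(annotations):
    """annotations: array of shape (K, T), one numerical trace per annotator.
    Returns a single confidence-weighted average trace of shape (T,)."""
    K = annotations.shape[0]
    weights = np.empty(K)
    for k in range(K):
        others_mean = annotations[np.arange(K) != k].mean(axis=0)
        # Correlation with the remaining annotators as a crude confidence proxy.
        weights[k] = max(np.corrcoef(annotations[k], others_mean)[0, 1], 0.0)
    weights /= weights.sum()
    return weights @ annotations

rng = np.random.default_rng(2)
truth = np.sin(np.linspace(0, 6, 200))
annos = truth + rng.normal(scale=[[0.1], [0.3], [0.8]], size=(3, 200))
print(confidence_weighted_label(annos).shape)  # (200,)
```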

The time-varying numerical attribute labels also employ the distance measure given by equation (12), but the emotion representation additionally retains its dependence on time. Figure 4 shows an example of ambiguity-aware time-varying arousal labels, wherein the ambiguity function at each time captures the ambiguity inherent in the set of arousal traces obtained from multiple annotators. Furthermore, the ambiguity function can be any distribution, including multimodal ones. This figure corresponds to the emotion labelling scheme proposed in [19].
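A sketch of the simplest such time-varying ambiguity function: at each frame, the annotator traces are summarised by a Gaussian with time-varying mean and standard deviation. This mirrors the spirit of Figure 4 but is not the exact model of [19].

```python
import numpy as np

def gaussian_ambiguity(traces):
    """traces: array of shape (K, T) with one arousal trace per annotator.
    Returns the per-frame mean and standard deviation describing a Gaussian
    ambiguity function at every frame."""
    mu = traces.mean(axis=0)             # time-varying mean
    sigma = traces.std(axis=0, ddof=1)   # time-varying spread (ambiguity)
    return mu, sigma

def phi(x, t, mu, sigma):
    """Evaluate the per-frame Gaussian ambiguity function at value x, frame t."""
    s = max(sigma[t], 1e-6)              # guard against zero spread
    return np.exp(-0.5 * ((x - mu[t]) / s) ** 2) / (s * np.sqrt(2 * np.pi))

rng = np.random.default_rng(3)
traces = rng.normal(size=(6, 300)).cumsum(axis=1) * 0.01   # six synthetic traces
mu, sigma = gaussian_ambiguity(traces)
print(phi(0.0, 150, mu, sigma))
```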

In addition to the implications arising from adopting a distance metric within each attribute descriptor set, a few additional implications arise from the dependence on time. Namely, the distance metrics are almost universally assumed to be time-invariant, or equivalently, the mapping $f_i$ does not change with time. However, experience tells us that annotators may not give identical numerical attribute scores to identical stimuli on two separate occasions. It is also common not to explicitly take into consideration constraints on the temporal dynamics of the labels, i.e., conditions on the relationship between $E(t_1)$ and $E(t_2)$ for any $t_1 \neq t_2$. Clearly, constraints can be explicitly considered, since emotional state will not change instantaneously, but the specific forms these constraints should take is still an open research question.

Figure 4: (A) A time-varying numerical emotion representation within the proposed AMBER framework based on numerical annotations obtained from multiple annotators (individual traces on the Arousal-Time plane) is given by a time-varying distribution (ambiguity function depicted in 3D as a function of time; each slice is a distribution at a particular point in time); (B) an ambiguity-unaware representation, such as using the mean annotation across all raters, ignores time-varying disagreement between annotators and the corresponding ambiguity functions are Dirac deltas reflecting this.

Box 3: Numerical Attribute Label Distributions

In recent years there has been growing interest in accounting for the ambiguity in emotion labels reflected in the disagreement between annotations. While a number of approaches to handle this have been proposed, they all essentially reduce to the emotion representation given by equation (3) with some additional constraints. For instance, the scheme employed by Han et al. [30] assumes that the ambiguity function, $\phi_i(\cdot, t)$, is a Gaussian function and is represented by its mean and standard deviation. A less restrictive assumption is made by Dang et al. [19], who allow the ambiguity function to be any distribution that can be represented as a Gaussian mixture model. Both approaches were extended to include a model of the temporal dynamics of the ambiguity function, making use of a Long Short-Term Memory Recurrent Neural Network [30] and a Kalman filter [18] respectively. Finally, Atcheson et al. [3, 2] strongly suggest that time-varying annotations should be treated as stochastic processes.
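To allow multimodal label distributions, the per-frame Gaussian can be replaced by a mixture, in the spirit of [19]. The sketch below fits a two-component Gaussian mixture to annotator values pooled over a short window around each frame; the window length, component count and pooling strategy are illustrative choices rather than the published configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_ambiguity(traces, frame, half_window=10, n_components=2):
    """Fit a GMM to annotator values pooled over a window around `frame`.

    traces: array of shape (K, T), one numerical trace per annotator.
    Returns a fitted GaussianMixture acting as the ambiguity function there."""
    lo = max(frame - half_window, 0)
    hi = min(frame + half_window + 1, traces.shape[1])
    samples = traces[:, lo:hi].reshape(-1, 1)   # pool annotators x window frames
    return GaussianMixture(n_components=n_components, random_state=0).fit(samples)

rng = np.random.default_rng(4)
traces = np.concatenate([rng.normal(0.5, 0.1, size=(3, 300)),
                         rng.normal(-0.4, 0.1, size=(3, 300))])  # bimodal annotator pool
gmm = gmm_ambiguity(traces, frame=150)
print(gmm.means_.ravel())   # roughly one mode near 0.5 and one near -0.4
```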

5.3 Ordinal Labels

The defining feature underlying all ordinal labelling schemes is that there is a notion of order between any pair of labels but there is no notion of distance between them. That is, for any pair of elements $x, y$ within an emotion dimension, $e_i$, one can say whether $x \preceq y$ or whether $y \preceq x$, but $d_i(x, y)$ cannot be defined. Ordinal labelling schemes can be broadly categorised into one of two categories. The first typically involves labels drawn from a small finite set of ordered elements, such as a Likert-type scale or self-assessment manikins. The second category typically involves labels that reflect a comparison between two or more intervals of interest. This essentially gives rise to a ranking scheme. We refer to the first category as absolute ordinal labels and the second as relative ordinal labels in this paper.

The absolute ordinal labels can be viewed as probably the most general case within the AMBER framework and are given by equations (1) and (2), with typically the only condition imposed on them being that each $e_i$ is a small, finite, strictly ordered set. The relative ordinal labels, on the other hand, can be viewed as imposing the condition that labels at distinct points in time reflect their rank relative to all other points in time, i.e.,

$\varepsilon_i(T_k) = \operatorname{rank}_i(T_k), \qquad k \in \{1, \ldots, K\}$   (13)

where $T_k$ denotes an interval of interest, $K$ denotes the total number of intervals of interest, and $\operatorname{rank}_i(T_k)$ denotes the relative rank of the label within emotion dimension $e_i$ associated with interval $T_k$, relative to the labels associated with all other intervals. This condition reflects the observation that if every pairwise ordering relationship of emotion labels at different points in time is known, then the set of labels at all points in time can be ranked in order. The labels as a function of time can then reflect this rank order (as indicated by equation (13)).

Box 4: Ordinal Labels for Preference Learning

Recently, Parthasarathy et al. [54] demonstrated the benefits of inferring relative ordinal labels from time-continuous numerical annotations as targets for a machine learning system. Specifically, they build on the qualitative agreement method where the average numerical label values over intervals of interest are compared pairwise amongst themselves to determine pairwise order relationships. For each trace, the approach generates an individual matrix (IM) with the trends in the trace (Figure 5A). Then, the consensus between multiple annotators is ascertained to only retain consistent trends observed across annotators (Figure 5B). The relative labels obtained with this approach lead to a ranked list of all intervals of interest with agreement as given by equation (13).

Figure 5: Deriving relative labels from time-continuous traces using the qualitative agreement analysis [54]. (A) The individual matrices are created by relative comparison between segments of the trace. (B) The consensus matrix is formed by combining individual matrices. A trend is set when the differences are greater than given thresholds. We cross out the entries without agreement

As previously mentioned in section 3.2.3, it has been suggested that ordinal labels exhibit greater validity and reliability compared with numerical and categorical ones. This has been attributed to the observation that people can more reliably compare two stimuli than assign an absolute score to a single stimulus. However, this does not mean ambiguity can be completely eliminated by simply adopting an ordinal scale. For instance, as illustrated in [54], when combining individual pair-wise comparison matrices to obtain a consensus matrix, agreement cannot be observed for all entries. A truly ambiguity-aware ordinal representation scheme would also quantify the disagreement at these entries. However, thus far no such scheme has been proposed. Nevertheless, the AMBER framework can still be brought to bear on the absolute and relative ordinal schemes, suggesting that by defining suitable ambiguity functions, ambiguity-aware ordinal representations may be obtained. An illustration of such schemes is shown in Figure 6. In the case of absolute ordinal labels, ambiguity functions can be defined over attribute descriptor sets in a straightforward manner, keeping in mind that no distance metric can exist in this set. In the case of finite ordered sets, this may simply be a distribution function over the set of elements (as depicted in Figure 6A). Defining ambiguity functions over relative ordinal labels is not as straightforward but, as discussed in Box 4 (and shown in Figure 5), the labels are obtained from pairwise comparisons across time frames and an ambiguity function may be defined over the possible outcomes of each comparison (see Figure 6B), reflecting the relative counts of each outcome accumulated across all the annotators (for example, when comparing the arousal at frame 2 with that at frame 5, if 4 out of 6 annotators state it was higher and the other 2 state it was lower, the ambiguity function could take the value 4/6 corresponding to higher and 2/6 corresponding to lower).
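A sketch of the ambiguity-aware pairwise comparison described above: for each pair of intervals, the ambiguity function over the two outcomes ('higher', 'lower') is simply the fraction of annotators voting for each outcome. Ties are dropped for brevity, and the numbers are illustrative rather than drawn from any published scheme.

```python
import numpy as np

def pairwise_ambiguity(interval_means, a, b):
    """interval_means: array of shape (K, M), each annotator's mean arousal over
    M intervals of interest. Returns the ambiguity function over the two
    outcomes of comparing interval a with interval b."""
    higher = int(np.sum(interval_means[:, a] > interval_means[:, b]))
    lower = int(np.sum(interval_means[:, a] < interval_means[:, b]))
    total = higher + lower   # ties are simply dropped in this sketch
    return {"higher": higher / total, "lower": lower / total}

# Example matching the text: 6 annotators, 4 of whom rate interval 0 higher than interval 1.
interval_means = np.array([[0.3, 0.1], [0.4, 0.2], [0.5, 0.3],
                           [0.6, 0.4], [0.1, 0.2], [0.2, 0.3]])
print(pairwise_ambiguity(interval_means, 0, 1))   # {'higher': 0.666..., 'lower': 0.333...}
```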

Figure 6: (A) Absolute ordinal representations may be made ambiguity-aware by defining a distribution over the finite ordered set of possible labels corresponding to each affect attribute; in this figure, the valence attribute, denoted by $e_1$, is depicted. (B) Relative ordinal labels obtained from pairwise comparisons across time frames may be expanded to be ambiguity-aware by defining the ambiguity function as a Bernoulli distribution over the two possible outcomes of each pairwise comparison.

6 Discussion - What purpose can AMBER serve?

The choice of how emotion is represented plays a key role in affective computing systems. It underpins almost every stage of the design and implementation, ranging from data collection, to the choice of machine learning model, to how the system will be used. In other words, the myriad choices that need to be made when developing affective computing systems are all in one way or another affected by implicit and explicit properties of the chosen emotion representation scheme. Providing a suitably precise and flexible mathematical language for describing and reasoning about emotion representation methods is the primary goal of the proposed AMBER framework. AMBER draws on commonalities across the various possible emotion representation schemes and encodes them within three elements of the framework: (a) the relationship between the different attribute descriptor sets; (b) the structure of each of the attribute descriptor sets; and (c) the properties of the ambiguity function. By enumerating the mathematical properties of these three elements, one can explicitly identify all the assumptions underlying an emotion representation scheme. This in turn can help better inform data collection paradigms, the choice of machine learning models and algorithms, quantitative analyses of emotions, and the interpretation of the outcomes of affective computing systems. The following examples of hypothetical questions that may arise when dealing with affective computing systems illustrate the various ways in which the AMBER framework might be useful.

I am training a continuous-time emotion prediction system and my training data has labels from 6 annotators; what loss function should I minimise?

This is really a two-part question. Presumably the emotion representation scheme (categorical, dimensional, or ordinal) has already been determined, which implies the attribute descriptor sets are known and their mathematical structure is fixed (e.g., if the representation scheme is time-continuous values of arousal and valence, then there are two attribute descriptors, arousal and valence, and both are sets of real numbers that fall in the interval $[-1, +1]$). What then needs to be determined is a suitable ambiguity function based on the properties of the acquired labels (e.g., how many annotators? were they independent? was there any measurement error?). In this case, examples of an ambiguity function that encode the time-varying variability between the 6 annotations include: (a) a Gaussian at each time $t$, with time-varying mean and standard deviation; (b) a Gaussian mixture model at each time $t$, with time-varying parameters, which can also cater for multimodal and non-Gaussian label distributions; (c) a Gaussian process over labels at all times that jointly encodes both the distribution of multiple labels and their temporal dynamics (smoothness over time).

Following this, the loss function can be defined as a metric that measures the separation between two ambiguity functions, one corresponding to the training labels and one obtained from the prediction system (more than one suitable metric is likely to exist, each with different properties, some of which might be more desirable than others). In this case, if the ambiguity function is chosen as a probability density function, as in the examples mentioned above, the Kullback-Leibler divergence might serve as a suitable loss function. Finally, it should be noted that the prediction system may not explicitly predict an ambiguity function, but nevertheless an implicit ambiguity function may be inferred from the form of the prediction and the associated loss function. For example, if the system is designed to predict time-varying mean and variance of labels, this might implicitly correspond to the choice of a unimodal symmetric density function (such as a Gaussian) as the ambiguity function.
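As an illustration of such a loss, the sketch below computes the closed-form Kullback-Leibler divergence between two univariate Gaussian ambiguity functions (one summarising the annotations, one predicted) and averages it over frames. This is a minimal sketch under the Gaussian assumption discussed above, not the only viable loss.

```python
import numpy as np

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ), elementwise."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)

def sequence_loss(target_mu, target_sigma, pred_mu, pred_sigma):
    """Average per-frame KL divergence between target and predicted
    Gaussian ambiguity functions over a time-continuous label sequence."""
    return float(np.mean(gaussian_kl(target_mu, target_sigma, pred_mu, pred_sigma)))

t = np.linspace(0, 6, 200)
target_mu, target_sigma = np.sin(t), np.full_like(t, 0.2)    # e.g. from the 6 annotations
pred_mu, pred_sigma = np.sin(t) + 0.1, np.full_like(t, 0.3)  # model output
print(sequence_loss(target_mu, target_sigma, pred_mu, pred_sigma))
```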

I am collecting data to train an emotion prediction system. I can only practically obtain annotations using a Likert-type scale for arousal and valence, but my application demands continuous arousal and valence predictions in the range $[-1, +1]$. Can I still proceed with my plan? Will something go wrong? How can I reason about this?

The key challenge here is that the emotion representation scheme employed to annotate the training data and the one employed by the system to make its predictions differ in their mathematical structure (which in turn reflects differences in the emotion theory underpinning them). In both cases, there are two attribute descriptors (arousal and valence). However, in the first case they are both finite sets of ordered elements with no distance metric defined on the elements, while in the second case they are both sets of real numbers in the range $[-1, +1]$ with the natural distance metric on real numbers defined on them. Consequently, treating the elements of the attribute descriptor sets in the annotation scheme as equidistant points within the attribute descriptor sets in the prediction scheme is not automatically justified. This does not mean that a mapping from one scheme to another cannot be learned, but based on their properties, it is likely that a scheme that exploits the fact that both are ordered sets, without explicitly making use of distances within them, may be more accurate and will be better justified.

I have some data that has been annotated with both categorical labels as well as numerical labels using arousal, valence and dominance as dimensions. Can I study the relationship between the categorical labels and clusters in the 3-dimensional arousal-valence-dominance space obtained using k-means, to learn about the fundamental relationship between categorical and dimensional labelling schemes?

This question essentially translates to one of identifying the map between the categorical representation scheme given by a set of binary-valued attribute descriptors (one for every possible emotion label, with the two elements of each set indicating presence or absence) and the numerical scheme given by three sets of real numbers lying in the interval $[-1, +1]$. This in itself is a justifiable question that is worth asking. However, the adopted method brings with it a number of implicit assumptions, not all of which can necessarily be justified. For instance, clustering the numerical labels in a 3-dimensional space assumes at the very least a suitable distance metric in that space. Further, typical implementations of the k-means algorithm tend to use a Euclidean distance metric, which additionally assumes that the 3 attributes are orthogonal to each other. Finally, the method adopted to learn the relationship between clustered numerical labels and categorical labels might make further assumptions about the attribute descriptor sets (such as that the 3-dimensional arousal-valence-dominance space is a vector space equipped with notions of scaling and translation). It is worth noting that all of these assumptions may be reasonable ones to make; however, the researcher making them should actively choose to do so and ideally also state them explicitly when reporting studies based on the outcomes of such analyses.
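The assumptions listed above are easy to see in code: a standard k-means run on arousal-valence-dominance labels silently adopts a Euclidean metric over axes it treats as orthogonal, and the cluster/category cross-tabulation is only meaningful under those assumptions. The data and cluster count below are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Synthetic arousal-valence-dominance labels in [-1, 1]^3 and categorical labels.
avd = np.clip(rng.normal(scale=0.5, size=(200, 3)), -1.0, 1.0)
categories = rng.choice(["angry", "happy", "neutral", "sad"], size=200)

# Implicit assumptions: Euclidean metric, orthogonal axes, meaningful centroids.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(avd)

# Cross-tabulate clusters against categorical labels.
for c in range(4):
    labels, counts = np.unique(categories[clusters == c], return_counts=True)
    print(c, dict(zip(labels.tolist(), counts.tolist())))
```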

7 Conclusions

In this paper, we introduce the AMBER framework to describe emotion representations, including the ambiguity inherent in them, thereby serving as a mechanism to reason about them. To the best of the authors' knowledge, every emotion representation scheme employed in affective computing to date can be described within the proposed AMBER framework. Furthermore, it allows assumptions implicit in these emotion representation schemes to be clearly articulated, in a manner that allows them to be compared across different schemes. Consequently, it provides a means to reason about how to extend representation schemes in order to imbue them with desirable properties. For instance, if one desired an ordinal emotion representation scheme that incorporated ambiguity and would be employed by other AI systems to reason about human-computer interactions within a Bayesian framework, then in terms of the components of the AMBER framework, one would consider equipping a known ordinal scheme with an ambiguity function that was also a probability distribution. The proposed AMBER framework also provides a means to reason about analyses carried out using emotion labels and annotations. For example, is it reasonable to run a clustering algorithm on emotion labels? Should the joint probability distribution over arousal and valence be modelled? Is taking the mean of multiple annotations a suitable 'ground truth'? Finally, AMBER also provides a mathematical formalism that can aid in comparing emotion representation schemes and analysing methods for converting between them. For example, if a Likert-type scale is used to gather annotated labels from multiple annotators, how would a numerical label and an ordinal label derived from it differ? What would they have in common? Is it suitable to use a clustering-based approach to convert from a numerical label to a categorical label? These sorts of questions have always been hard to answer due to the inherent complexity and ambiguity in affect. The AMBER framework is an attempt to overcome some of this difficulty by providing a mathematical language that explicitly articulates the core elements of all emotion representations.

References

  • [1] AlZoubi, O., D’Mello, S. K., and Calvo, R. A. (2012). Detecting naturalistic expressions of nonbasic affect using physiological signals. IEEE Transactions on Affective Computing, 3(3):298–310.
  • [2] Atcheson, M., Sethu, V., and Epps, J. (2017). Gaussian process regression for continuous emotion recognition with global temporal invariance. In IJCAI 2017 Workshop on Artificial Intelligence in Affective Computing, pages 34–44.
  • [3] Atcheson, M., Sethu, V., and Epps, J. (2018). Demonstrating and modelling systematic time-varying annotator disagreement in continuous emotion annotation. Proc. Interspeech 2018, pages 3668–3672.
  • [4] Batliner, A., Steidl, S., and Nöth, E. (2008). Releasing a thoroughly annotated and processed spontaneous emotional database: the FAU Aibo emotion corpus. In Second International Workshop on Emotion: Corpora for Research on Emotion and Affect, International conference on Language Resources and Evaluation (LREC 2008), pages 28–31, Philadelphia, PA, USA.
  • [5] Bickmore, T. and Gruber, A. (2010). Relational Agents in Clinical Psychiatry. Harvard Review of Psychiatry, 18(2):119–130.
  • [6] Bickmore, T. and Schulman, D. (2007). Practical Approaches to Comforting Users with Relational Agents. In CHI ’07 Extended Abstracts on Human Factors in Computing Systems, pages 2291–2296, San Jose, CA, USA. ACM.
  • [7] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. (2005). Learning to rank using gradient descent. In International conference on Machine learning (ICML 2005), pages 89–96, Bonn, Germany.
  • [8] Busso, C., Parthasarathy, S., Burmania, A., AbdelWahab, M., Sadoughi, N., and Mower Provost, E. (2017). MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception. IEEE Transactions on Affective Computing, 8(1):67–80.
  • [9] Calvo, R. A. and D’Mello, S. (2010). Affect Detection: An Interdisciplinary Review of Models, Methods, and Their Applications. IEEE Transactions on Affective Computing, 1(1):18–37.
  • [10] Calvo, R. A. and D’Mello, S. (2012). Frontiers of Affect-Aware Learning Technologies. IEEE Intelligent Systems, 27(6):86–89.
  • [11] Chao, L., Tao, J., Yang, M., Li, Y., and Wen, Z. (2015). Long short term memory recurrent neural network based multimodal dimensional emotion recognition. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, pages 65–72. ACM.
  • [12] Christy, T. and Kuncheva, L. I. (2014). Technological advancements in affective gaming: A historical survey. GSTF Journal on Computing, 3(4):34–41.
  • [13] Cowie, R. and Cornelius, R. (2003). Describing the emotional states that are expressed in speech. Speech Communication, 40(1-2):5–32.
  • [14] Cowie, R., Douglas-Cowie, E., Savvidou, S., McMahon, E., Sawey, M., and Schröder, M. (2000). ’FEELTRACE’: An instrument for recording perceived emotion in real time. In ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, pages 19–24, Newcastle, Northern Ireland, UK. ISCA.
  • [15] Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., and Taylor, J. G. (2001). Emotion recognition in human-computer interaction. IEEE Signal processing magazine, 18(1):32–80.
  • [16] Cowie, R., Sawey, M., Doherty, C., Jaimovich, J., Fyans, C., and Stapleton, P. (2013). Gtrace: General trace program compatible with EmotionML. In Affective Computing and Intelligent Interaction (ACII 2013), pages 709–710, Geneva, Switzerland.
  • [17] Cummins, N., Scherer, S., Krajewski, J., Schnieder, S., Epps, J., and Quatieri, T. F. (2015). A review of depression and suicide risk assessment using speech analysis. Speech Communication, 71:10–49.
  • [18] Dang, T., Sethu, V., and Ambikairajah, E. (2018). Dynamic multi-rater gaussian mixture regression incorporating temporal dependencies of emotion uncertainty using kalman filters. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4929–4933. IEEE.
  • [19] Dang, T., Sethu, V., Epps, J., and Ambikairajah, E. (2017). An investigation of emotion prediction uncertainty using gaussian mixture regression. In INTERSPEECH, pages 1248–1252.
  • [20] Devillers, L., Vidrascu, L., and Lamel, L. (2005). Challenges in real-life emotion annotation and machine learning based detection. Neural Networks, 18(4):407–422.
  • [21] D’Mello, S. K. and Graesser, A. C. (2014). Feeling, thinking, and computing with affect-aware learning. In Calvo, R., D’Mello, S., Gratch, J., and Kappas, A., editors, The Oxford Handbook of Affective Computing, chapter 31, pages 419–434. Oxford University Press.
  • [22] Esposito, A., Esposito, A. M., and Vogel, C. (2015). Needs and challenges in human computer interaction for processing social emotional information. Pattern Recognition Letters, 66:41–51.
  • [23] Fayek, H. M., Lech, M., and Cavedon, L. (2016). Modeling subjectiveness in emotion recognition with deep neural networks: Ensembles vs soft labels. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 566–570. IEEE.
  • [24] Frijda, N. H. (1986). The emotions. Cambridge University Press.
  • [25] Girard, J. M. (2014). CARMA: Software for continuous affect rating and media annotation. Journal of Open Research Software, 2(1):1–6.
  • [26] Grimm, M. and Kroschel, K. (2005). Evaluation of natural emotions using self assessment manikins. In IEEE Workshop on Automatic Speech Recognition and Understanding, 2005., pages 381–385. IEEE.
  • [27] Grimm, M., Kroschel, K., Mower, E., and Narayanan, S. (2007). Primitives-based evaluation and estimation of emotions in speech. Speech Communication, 49(10-11):787–800.
  • [28] Gunes, H. and Schuller, B. (2013). Categorical and dimensional affect analysis in continuous input: Current trends and future directions. Image and Vision Computing, 31(2):120–136.
  • [29] Gunes, H., Schuller, B., Pantic, M., and Cowie, R. (2011). Emotion representation, analysis and synthesis in continuous space: A survey. In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 827–834. IEEE.
  • [30] Han, J., Zhang, Z., Schmitt, M., Pantic, M., and Schuller, B. (2017). From hard to soft: Towards more human-like emotion recognition by modelling the perception uncertainty. In Proceedings of the 25th ACM international conference on Multimedia, pages 890–897. ACM.
  • [31] Harley, J. M., Lajoie, S. P., Frasson, C., and Hall, N. C. (2017). Developing Emotion-Aware, Advanced Learning Technologies: A Taxonomy of Approaches and Features. International Journal of Artificial Intelligence in Education, 27(2):268–297.
  • [32] Jaimes, A. and Sebe, N. (2007). Multimodal human-computer interaction: A survey. Computer Vision and Image Understanding, 108(1):116–134.
  • [33] Joachims, T. (2006). Training linear SVMs in linear time. In ACM SIGKDD international conference on Knowledge discovery and data mining, pages 217–226, Philadelphia, USA.
  • [34] Joshi, D., Datta, R., Fedorovskaya, E., Luong, Q.-T., Wang, J. Z., Li, J., and Luo, J. (2011). Aesthetics and emotions in images. IEEE Signal Processing Magazine, 28(5):94–115.
  • [35] Khorram, S., Jaiswal, M., Gideon, J., McInnis, M., and Provost, E.-M. (2018). The priori emotion dataset: Linking mood to emotion detected in-the-wild. Interspeech 2018, pages 1903–1907.
  • [36] Khorram, S., McInnis, M., and Provost, E.-M. (2019). Jointly aligning and predicting continuous emotion annotations. IEEE Transactions on Affective Computing, To appear.
  • [37] Kim, Y. and Kim, J. (2018). Human-like emotion recognition: Multi-label learning from noisy labeled audio-visual expressive speech. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5104–5108. IEEE.
  • [38] Lee, C.-C., Black, M., Katsamanis, A., Lammert, A. C., Baucom, B. R., Christensen, A., Georgiou, P. G., and Narayanan, S. S. (2010). Quantification of prosodic entrainment in affective spontaneous spoken interactions of married couples. In Eleventh Annual Conference of the International Speech Communication Association.
  • [39] Lotfian, R. and Busso, C. (2017). Formulating emotion perception as a probabilistic model with application to categorical emotion classification. In International Conference on Affective Computing and Intelligent Interaction (ACII 2017), pages 415–420, San Antonio, TX, USA.
  • [40] Lotfian, R. and Busso, C. (2018). Predicting categorical emotions by jointly learning primary and secondary emotions through multitask learning. In Interspeech 2018, pages 951–955, Hyderabad, India.
  • [41] Lotfian, R. and Busso, C. (2019). Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, To appear.
  • [42] Mariooryad, S. and Busso, C. (2013). Analysis and compensation of the reaction lag of evaluators in continuous emotional annotations. In Affective Computing and Intelligent Interaction (ACII 2013), pages 85–90, Geneva, Switzerland.
  • [43] Mariooryad, S. and Busso, C. (2015). Correcting time-continuous emotional labels by modeling the reaction lag of evaluators. IEEE Transactions on Affective Computing, 6(2):97–108. Special Issue Best of ACII.
  • [44] Martinez, H., Yannakakis, G., and Hallam, J. (2014). Don't classify ratings of affect; rank them! IEEE Transactions on Affective Computing, (1):1–1.
  • [45] Matton, K., McInnis, M. G., and Mower Provost, E. (2019). Into the wild: Transitioning from recognizing mood in clinical interactions to personal conversations for individuals with bipolar disorder. In Interspeech.
  • [46] McKeown, G., Valstar, M., Cowie, R., Pantic, M., and Schröder, M. (2012). The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing, 3(1):5–17.
  • [47] Mehrabian, A. and Russell, J. A. (1974). An approach to environmental psychology. the MIT Press.
  • [48] Messinger, D. S., Duvivier, L. L., Warren, Z., Mahoor, M., Baker, J., Warlaumont, A. S., and Ruvolo, P. (2014). Affective Computing, Emotional Development, and Autism. In Calvo, R., D’Mello, S., Gratch, J., and Kappas, A., editors, The Oxford Handbook of Affective Computing, chapter 39, pages 516–536. Oxford University Press.
  • [49] Mower, E., Mataric, M. J., and Narayanan, S. (2011). A framework for automatic human emotion classification using emotion profiles. IEEE Transactions on Audio, Speech, and Language Processing, 19(5):1057–1070.
  • [50] Mower, E., Matarić, M. J., and Narayanan, S. S. (2009). Evaluating evaluators: A case study in understanding the benefits and pitfalls of multi-evaluator modeling. In Tenth Annual Conference of the International Speech Communication Association.
  • [51] Nicolle, J., Rapp, V., Bailly, K., Prevost, L., and Chetouani, M. (2012). Robust continuous prediction of human emotions using multiscale dynamic cues. In International conference on Multimodal interaction (ICMI 2012), pages 501–508, Santa Monica, CA, USA.
  • [52] Ortony, A., Clore, G. L., and Collins, A. (1990). The cognitive structure of emotions. Cambridge university press.
  • [53] Pampouchidou, A., Simos, P., Marias, K., Meriaudeau, F., Yang, F., Pediaditis, M., and Tsiknakis, M. (2017). Automatic assessment of depression based on visual cues: A systematic review. IEEE Transactions on Affective Computing.
  • [54] Parthasarathy, S., Cowie, R., and Busso, C. (2016). Using agreement on direction of change to build rank-based emotion classifiers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11):2108–2121.
  • [55] Picard, R. W. (2000). Affective computing. MIT press.
  • [56] Picard, R. W. and Healey, J. (1997). Affective wearables. Personal Technologies, 1(4):231–240.
  • [57] Plutchik, R. (1980). Emotion: A Psychoevolutionary Synthesis. Harper and Row.
  • [58] Politou, E., Alepis, E., and Patsakis, C. (2017). A survey on mobile affective computing. Computer Science Review, 25:79–100.
  • [59] Riva, G., Calvo, R., and Lisetti, C. (2014). Cyberpsychology and Affective Computing. In Calvo, R., D’Mello, S., Gratch, J., and Kappas, A., editors, The Oxford Handbook of Affective Computing, chapter 41, pages 547–558. Oxford University Press.
  • [60] Russell, J. A. (1980). A circumplex model of affect. Journal of personality and social psychology, 39(6):1161.
  • [61] Russell, J. A. (1993). Forced-choice response format in the study of facial expression. Motivation and Emotion, 17(1):41–51.
  • [62] Scherer, K. (1984). On the nature and function of emotion: A component process approach, pages 293–317. Lawrence Erlbaum Associates, Inc., New Jersey.
  • [63] Scherer, K. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40(1-2):227–256.
  • [64] Schlosberg, H. (1941). A scale for the judgment of facial expressions. Journal of experimental psychology, 29(6):497.
  • [65] Schlosberg, H. (1954). Three dimensions of emotion. Psychological review, 61(2):81.
  • [66] Schröder, M., Baggia, P., Burkhardt, F., Pelachaud, C., Peter, C., and Zovato, E. (2015). Emotion markup language. The Oxford Handbook of Affective Computing, page 395.
  • [67] Schröder, M., Cowie, R., Douglas-Cowie, E., Westerdijk, M., and Gielen, S. (2001). Acoustic correlates of emotion dimensions in view of speech synthesis. In Seventh European Conference on Speech Communication and Technology.
  • [68] Schröder, M., Devillers, L., Karpouzis, K., Martin, J.-C., Pelachaud, C., Peter, C., Pirker, H., Schuller, B., Tao, J., and Wilson, I. (2007). What should a generic emotion markup language be able to represent? In International Conference on Affective Computing and Intelligent Interaction, pages 440–451. Springer.
  • [69] Schuller, B., Valster, M., Eyben, F., Cowie, R., and Pantic, M. (2012). AVEC 2012: The continuous audio/visual emotion challenge. In Proceedings of the 14th ACM international conference on Multimodal interaction, pages 449–456. ACM.
  • [70] Sobol-Shikler, T. and Robinson, P. (2010). Classification of complex information: Inference of co-occurring affective states from their expressions in speech. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7):1284–1297.
  • [71] Vidrascu, L. and Devillers, L. (2005). Real-life emotion representation and detection in call centers data. In International Conference on Affective Computing and Intelligent Interaction, pages 739–746. Springer.
  • [72] Vinciarelli, A., Esposito, A., André, E., Bonin, F., Chetouani, M., Cohn, J. F., Cristani, M., Fuhrmann, F., Gilmartin, E., Hammal, Z., Heylen, D., Kaiser, R., Koutsombogera, M., Potamianos, A., Renals, S., Riccardi, G., and Salah, A. A. (2015). Open Challenges in Modelling, Analysis and Synthesis of Human Behaviour in Human–Human and Human–Machine Interactions. Cognitive Computation, 7(4):397–413.
  • [73] Wang, J.-C., Yang, Y.-H., Wang, H.-M., and Jeng, S.-K. (2015). Modeling the affective content of music with a gaussian mixture model. IEEE Transactions on Affective Computing, 6(1):56–68.
  • [74] Watson, D. and Tellegen, A. (1985). Toward a consensual structure of mood. Psychological bulletin, 98(2):219.
  • [75] Yannakakis, G., Cowie, R., and Busso, C. (2017). The ordinal nature of emotions. In International Conference on Affective Computing and Intelligent Interaction (ACII 2017), pages 248–255, San Antonio, TX, USA.
  • [76] Yannakakis, G., Cowie, R., and Busso, C. (2019). The ordinal nature of emotions: An emerging approach. IEEE Transactions on Affective Computing, To appear.
  • [77] Yannakakis, G. N. and Paiva, A. (2014). Emotion in Games. In Calvo, R., D’Mello, S., Gratch, J., and Kappas, A., editors, The Oxford Handbook of Affective Computing, chapter 34, pages 458–471. Oxford University Press.
  • [78] Zhang, B., Essl, G., and Mower Provost, E. (2017). Predicting the distribution of emotion perception: capturing inter-rater variability. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 51–59. ACM.
  • [79] Zhou, Y., Xue, H., and Geng, X. (2015). Emotion distribution recognition from facial expressions. In Proceedings of the 23rd ACM international conference on Multimedia, pages 1247–1250. ACM.