Cross-Cultural and Cultural-Specific Production and Perception of Facial Expressions of Emotion in the Wild

08/13/2018, by Ramprakash Srinivasan, et al.

Automatic recognition of emotion from facial expressions is an intense area of research, with a potentially long list of important applications. Yet, the study of emotion requires knowing which facial expressions are used within and across cultures in the wild, not in controlled lab conditions, but such studies do not exist. Which and how many cross-cultural and cultural-specific facial expressions do people commonly use? And what affect variables does each expression communicate to observers? If we are to design technology that understands the emotion of users, we need answers to these two fundamental questions. In this paper, we present the first large-scale study of the production and visual perception of facial expressions of emotion in the wild. We find that of the 16,384 possible facial configurations that people can theoretically produce, only 35 are successfully used to transmit emotive information across cultures, and only 8 within a smaller number of cultures. Crucially, we find that visual analysis of cross-cultural expressions yields consistent perception of emotion categories and valence, but not arousal. In contrast, visual analysis of cultural-specific expressions yields consistent perception of valence and arousal, but not of emotion categories. Additionally, we find that the number of expressions used to communicate each emotion is also different, e.g., 17 expressions transmit happiness, but only 1 is used to convey disgust.

1 Introduction

Although there is agreement that facial expressions are a primary means of social communication amongst people [1], which facial configurations are commonly produced and successfully visually interpreted within and across cultures is a topic of intense debate [2, 3, 4, 5, 6]. Resolving this impasse is essential to identify the validity of opposing models of emotion [7], and design technology that can interpret them [8].

Crucially, this impasse can only be solved if we study the production and perception of facial expressions in multiple cultures in the wild, i.e., outside the lab. This is the approach we take in this paper. Specifically, we identify commonly used expressions in the wild and assess their consistency of perceptual interpretation within and across cultures. We present a computational approach that allows us to successfully do this on millions of images and billions of video frames.

1.1 Problem definition

Facial expressions are facial movements that convey emotion or other social signals robustly within a single or across multiple cultures [9, 7, 10]. These facial articulations are produced by contracting and relaxing facial muscles, called Action Units (AUs) [11]. Typically, people employ 14 distinct AUs to produce facial configurations [12]. Note we only consider the AUs that define facial expressions of emotion. We do not consider the AUs that specify distinct types of eye blinks (6 AUs), those that specify head position (8 AUs) or eye position (4 AUs), and those that are not observed in the expression of emotion (4 AUs); see [11, 12] for a detailed discussion of the use of these AUs in the expression of emotion.

Assuming each AU can be articulated independently, people are able to produce as many as 16,384 facial configurations. But how many of these facial configurations are expressions that communicate emotion?
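As a simple illustration (a minimal Python sketch, not part of the original analysis), encoding a facial configuration as the subset of its active AUs makes the 16,384 figure immediate: with 14 binary AUs there are 2^14 possible subsets, including the neutral face.

```python
# Minimal sketch: a facial configuration as a subset of the 14 AUs considered here.
AUS = (1, 2, 4, 5, 6, 7, 9, 10, 12, 15, 17, 20, 25, 26)

# Each AU is either active or not, so the number of theoretically possible
# configurations is 2^14 = 16,384 (the empty set being the neutral face).
n_configurations = 2 ** len(AUS)
print(n_configurations)  # 16384

# A convenient canonical encoding is a frozenset of active AUs; for example,
# expression ID 5 in Table II is the configuration with only AU 12 active.
config_id5 = frozenset({12})
config_id3 = frozenset({2, 4})
```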

We provide, to our knowledge, the first comprehensive, large-scale exploratory study of facial configurations in the wild. It is crucial to note that we can only hope to answer the above question successfully if we study facial configurations outside the laboratory, i.e., in the wild. Unfortunately, to date, research has mostly focused on in-lab studies and the analysis of a small number of sample expressions [6]. In contrast, herein, we analyze over 7 million images of facial configurations and 10,000 hours of video filmed in the wild. Using a combination of automatic and manual analyses, we assess the consistency of production and visual perception of these facial configurations within and across cultures.

1.2 Summary of our results

Our study identifies 35 facial configurations that are consistently used across cultures. We refer to these expressions as cross-cultural. Our studies also identify 8 facial configurations that are used in some, but not all, cultures. These are termed cultural-specific expressions. This surprising result suggests that the number of facial expressions of emotion used within and across cultures is very small. The cross-cultural expressions represent <0.22% of all possible facial configurations. Cultural-specific expressions are <0.05% of all possible configurations.

We also investigate whether the visual analysis of these facial expressions yields consistent interpretations within and across cultures. We find that cross-cultural expressions yield consistent visual readings of emotion category and valence but not arousal. In contrast, cultural-specific expressions yield consistent inference of valence and arousal but not of emotion category.

Importantly, the number of expressions used to communicate each emotion category is different. For example, 17 express happiness, 5 express anger, and 1 expresses disgust.

Additionally, the accuracy of the visual interpretation of these expressions varies from a maximum of 89% to a minimum of 22%. This finding is fundamental if we wish to design technology that infers emotion from expressions. Some expressions communicate affect quite reliably, others do not. This also means that computer vision algorithms may be useful in some applications (e.g., searching for pictures in a digital photo album and in the web), but not in others (e.g., in the courts).

Furthermore, the unexpected results reported in the present paper cannot be fully explained by current models of emotion and suggest there are multiple mechanisms involved in the production and perception of facial expressions of emotion. Affective computing systems will need to be designed to make use of all the mechanisms described in this paper.

1.3 Paper outline

Section 2 details the design of the experiments and results. Section 3 discusses the significance and implications of the results. We conclude in Section 4.

2 Experiments and Results

We performed a number of experiments to study the cross-cultural and cultural-specific uses of facial configurations.

2.1 7 million images of facial configurations in the wild

The internet now provides a mechanism to obtain a large number of images of facial configurations observed in multiple cultures. We downloaded images of facial configurations with the help of major web search engines in countries where English, Spanish, Mandarin Chinese, Farsi, and Russian are the primary language, Table I. We selected these languages because they (broadly) correspond to distinct grammatical families [13] and are well represented on the web [14].

Since we wish to find facial expressions associated with the communication of emotion, we used all the words in the dictionary associated with affect as keywords in our search. Specifically, we used WordNet [15], a lexicon of the English language defined as a hierarchy of word relationships, including: synonyms (i.e., words that have the same or nearly the same meaning), hyponyms (i.e., subordinate nouns or nouns of more specific meaning), troponyms (i.e., verbs with increasingly specific meaning), and entailments (i.e., deductions or implications that follow logically from or are implied by another meaning). These hierarchies are given by a graph structure, with words represented as nodes and word relationships as directed edges pointing from more general words to more specific ones. Undirected edges are used between synonyms, since neither word is more general than the other.

Language: Countries
Chinese: China, Taiwan, Singapore
English: United States, Canada, Australia, Great Britain
Farsi: Iran
Russian: Russia
Spanish: Spain, Mexico, Argentina, Chile, Peru, Venezuela, Colombia, Ecuador, Guatemala, Cuba, El Salvador, Bolivia, Honduras, Dominican Republic, Paraguay, Uruguay, Nicaragua, Costa Rica, Puerto Rico (US), Panama, Equatorial Guinea
TABLE I: Our image search was done using web search engines in the countries listed in this table. The behavioral experiment on Amazon Mechanical Turk (AMT) was open to residents of these countries as well. Only people in the countries listed under each language were able to participate in the AMT experiment for that language group. Participants also had to pass a language test to verify proficiency in that language.

This graph structure allows us to readily identify words that describe affect concepts. They are the nodes emanating from the node “feeling” in WordNet’s graph. This yields a total of 821 words. Example words in our set are affect, emotion, anger, choler, ire, fury, madness, irritation, frustration, creeps, love, timidity, adoration, loyalty, happiness, etc. These 821 words were translated into Spanish, Mandarin Chinese, Farsi, and Russian by professional translators.
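The traversal from the "feeling" node can be reproduced, at least approximately, with NLTK's WordNet interface. The sketch below is an illustration under stated assumptions (the synset name feeling.n.01 and the restriction to hyponym edges are ours); the exact word count depends on the WordNet version and on which relations are followed, so it need not reproduce the 821 words used in the study.

```python
# Approximate sketch of collecting affect words from WordNet with NLTK.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# Start from the noun synset for "feeling" and collect every synset reachable
# through hyponym ("more specific") edges.
feeling = wn.synset("feeling.n.01")          # assumed starting synset
affect_synsets = set(feeling.closure(lambda s: s.hyponyms()))
affect_synsets.add(feeling)

# Flatten synsets into lemma strings, e.g., "anger", "fury", "happiness".
affect_words = sorted({lemma.name().replace("_", " ")
                       for s in affect_synsets for lemma in s.lemmas()})
print(len(affect_words), affect_words[:10])
```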

The words in each language were used as keywords to mine the web using search engines in multiple countries. Specifically, we used English words in web search engines typically employed in the United States, Canada, Australia, New Zealand and Great Britain; Spanish words in Spain, Mexico, Argentina, Chile, Peru, Venezuela, Colombia, Ecuador, Guatemala, Cuba, Bolivia, Dominican Republic, Honduras, Paraguay, Uruguay, El Salvador, Nicaragua, Costa Rica, Puerto Rico, Panama and Equatorial Guinea; Mandarin Chinese words in China, Taiwan and Singapore; Russian words in Russia; and Farsi words in Iran, Table I.

The process described above yielded a set of about 7.2 million images of facial configurations representing a large number of cultures. These facial configurations were then AU coded by a computer vision algorithm [16]. To verify that the AU annotations provided by the computer vision system are accurate, we manually verified that at least ten sample images per language group were correctly labeled in each of the identified facial configurations.

ID | AUs || ID | AUs || ID | AUs
1 | 4 || 13 | 4, 7, 9, 10, 17 || 25 | 1, 2, 5, 25, 26
2 | 5 || 14 | 9, 10, 15 || 26 | 4, 7, 9, 25, 26
3 | 2, 4 || 15 | 12, 15 || 27 | 12, 25, 26
4 | 4, 7 || 16 | 6, 12, 25 || 28 | 2, 12, 25, 26
5 | 12 || 17 | 2, 5, 12, 25 || 29 | 5, 12, 25, 26
6 | 2, 12 || 18 | 1, 2, 5, 12, 25 || 30 | 2, 5, 12, 25, 26
7 | 5, 12 || 19 | 6, 12, 25 || 31 | 1, 2, 5, 12, 25, 26
8 | 1, 2, 5, 12 || 20 | 10, 12, 25, 26 || 32 | 6, 12, 25, 26
9 | 6, 12 || 21 | 1, 2, 25, 26 || 33 | 4, 7, 9, 20, 25, 26
10 | 4, 15 || 22 | 1, 4, 25, 26 || 34 | 1, 2, 5, 20, 25, 26
11 | 1, 4, 15 || 23 | 5, 25, 26 || 35 | 1, 4, 5, 20, 26
12 | 4, 7, 17 || 24 | 2, 5, 25, 26
TABLE II: Our study identified 35 unique combinations of AUs typically seen in the cultures where English, Spanish, Mandarin Chinese, Farsi, and Russian are the primary language. ID is the unique identification number given to each expression. AUs is the list of active action units in that expression. Also shown is a sample image of each expression.

2.2 Automatic recognition of facial configurations

Computer vision systems now provide reliable annotations of the most commonly employed AUs. We used the algorithm defined in [16], which provides accurate annotations of AUs in images collected in the wild. Specifically, we automatically annotated the 14 most common AUs in emotional expression (AUs 1, 2, 4, 5, 6, 7, 9, 10, 12, 15, 17, 20, 25 and 26) [12]. To verify that the annotations provided by this computer vision algorithm were accurate, we manually verified that at least 50 images (10 per language group) in each possible facial configuration were correctly labeled.

A facial configuration is defined by a unique set of active AUs. And, a facial expression is defined as a facial configuration that successfully transmits affect within and/or between cultures.

Cross-cultural expressions: For a facial configuration to be considered a possible cross-cultural expression of emotion, we require that at least t sample images, in each of the cultural groups, be present in our database.

The plot in Figure 1 shows the number of facial expressions identified with this approach in our database as a function of t. Note that the y-axis specifies the number of facial expressions, and the x-axis the value of t.

We note that the number of expressions does not significantly change after t = 100. Thus, in our experiments discussed below, we set t = 100.

Cultural-specific expressions: For a facial configuration to be considered a possible cultural-specific expression, we require that at least t sample images in at least one cultural group be present in our database. When more than one cultural group, but not all, have t or more samples of this facial configuration, it is also considered a cultural-specific expression.

Fig. 1: Number of facial configurations in our dataset that have the number of images indicated in the x-axis.
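The selection rule described above can be summarized in a few lines of code. The sketch below is a schematic illustration (the per-configuration image counts shown are hypothetical toy numbers, not our data); it only encodes the thresholding logic, with t = 100 as used in Section 2.3.

```python
# Schematic selection of cross-cultural vs. cultural-specific configurations.
LANGUAGES = ("Chinese", "English", "Farsi", "Russian", "Spanish")
T = 100  # minimum number of sample images per language group (Section 2.3)

def classify_configurations(counts):
    """counts: dict mapping a frozenset of active AUs to a dict
    {language: number of images showing that configuration}."""
    cross_cultural, cultural_specific = [], []
    for config, per_lang in counts.items():
        n_groups = sum(per_lang.get(lang, 0) >= T for lang in LANGUAGES)
        if n_groups == len(LANGUAGES):
            cross_cultural.append(config)      # common in all five groups
        elif n_groups >= 1:
            cultural_specific.append(config)   # common in some, but not all
    return cross_cultural, cultural_specific

# Hypothetical toy counts for two configurations:
counts = {
    frozenset({12}): {"Chinese": 900, "English": 1200, "Farsi": 400,
                      "Russian": 350, "Spanish": 800},
    frozenset({1}): {"English": 150, "Chinese": 120, "Spanish": 130},
}
print(classify_configurations(counts))
```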

2.3 Number of cross-cultural facial expressions

We wish to identify facial configurations that are commonly used across cultures. Thus, we identified the facial configurations with at least 100 image samples in each of the language groups. That is, for a facial configuration to be considered an expression, we required it to have at least 100 sample images in each of the five language groups, for a total of 500 sample images.

Using this criterion, we identified a total of 35 facial expressions. That is, we found 35 unique combinations of AUs with a significant number of sample images in each of the five language groups. No other facial configuration had a number of samples close to the required 500.

Table II shows the list of active AUs and a sample image of each of these facial expressions. Each expression is given a unique ID number for future reference.

It is worth noting that 35 is a very small number; less than 0.22% of the 16,384 possible facial configurations. This suggests that the number of facial configurations used to express emotion in the wild is very small. Nonetheless, the number of identified expressions is much larger than the classical six “basic” emotions typically cited in the literature [9]. In fact, these results are in line with newer computational and cognitive models of the production of facial expressions of emotion [12, 6], which claim people regularly employ several dozen expressions. Our results are also consistent with the identification of at least 27 distinct experiences of emotion by [17].

An alternative explanation for our results is given by theoretical accounts of constructed emotions [18]. In this model, the identified expressions may be variations of some underlying dimensions of affect, rather than independent emotion categories. We assess this possibility in Section 2.6.

ID | AUs | Languages
36 | 1 | English, Mandarin, Spanish
37 | 1, 2 | English, Russian, Spanish
38 | 5, 17 | English, Mandarin
39 | 4, 7, 25, 26 | Farsi, Mandarin
40 | 2, 5, 12 | English, Russian, Farsi, Spanish
41 | 4, 5, 10, 25, 26 | English, Farsi, Russian
42 | 1, 25, 26 | English
43 | 4, 7, 9, 10, 20, 25, 26 | English
TABLE III: We identified only 8 combinations of AUs typically observed in some but not all cultures. ID is the unique identification number given to each expression. The ID numbers start at 36 to distinguish these facial expressions from those listed in Table II. The AUs column provides the list of active action units in that expression. Also shown is a sample image of each expression and a list of the language groups where each expression is commonly observed.

2.4 Number of cultural-specific facial expressions

We ask whether the number of cultural-specific facial expressions is larger or smaller than that of cross-cultural expressions, as advocated by some models [7]. To answer this question, we identify the number of facial configurations that have at least 100 sample images in one or more of the language groups, but not in all of them.

This yielded a total of 8 expressions. Table III shows a sample image and AU combination of each of these facial expressions.

Of the eight cultural-specific expressions, one expression was identified in four of the language groups, three expressions in three of the language groups, two expressions in two of the language groups, and two expressions in a single language group, Table III.

It is worth noting that of the eight cultural-specific expressions, seven include cultures where English is the primary language. Previous research suggests that Americans are more expressive with their faces than people in other cultures [19, 20]. The results of the present paper are consistent with this hypothesis.

Nonetheless, and unexpectedly, the number of cultural-specific expressions is smaller than that of expressions shared across language groups. Eight expressions correspond to less than 0.05% of all possible facial configurations. And, of these, only two expressions were found in a single language. This is less than 0.013% of all possible configurations.

These results thus support the view that if a facial configuration is a facial expression of emotion, then it is much more likely to be used in a large number of cultures than in a small number of them. This result could be interpreted as supporting the hypothesis that facial expressions primarily communicate emotion and other social signals across language groups [21, 22, 9, 10]. But, for this hypothesis to hold, the visual interpretation of these facial expressions would also need to be consistent across language groups. We test this prediction in Section 2.6.

2.5 Spontaneous facial expressions in video

To verify that the 43 facial expressions identified above are also present in video of spontaneous facial expressions, we downloaded 10,000 hours of video from YouTube. This corresponds to over 1 billion frames.

Specifically, we downloaded videos of documentaries, interviews, and debates where expressions are common and spontaneous.

All video frames were automatically annotated by our computer vision system and verified manually, using the same approach described earlier in the paper.

This analysis identified multiple instances of the 43 expressions of emotion listed in Tables II and III, further confirming the existence of these facial expressions of emotion in the wild.

2.6 Visual recognition of cross-cultural facial expressions

2.6.1 Are the facial expressions of emotion identified above visually recognized across cultures?

Some theorists propound that people should be able to infer emotion categories across cultures (e.g., happy, sad, and angrily surprised) [21, 12, 9], while others argue that expressions primarily communicate valence and arousal [7, 23, 24].

To test which of the above affect variables are consistently identified across language groups, we performed an online behavioral experiment. In this experiment, participants evaluated the 35 expressions observed in all language groups. Here, 50 images of each of the 35 expressions were shown to participants, one at a time, Figure 2.

Fig. 2: Typical timeline of the behavioral experiment completed by participants in Amazon Mechanical Turk. Shown here is an example of a possible response given in three languages (English, Spanish and Mandarin Chinese).

Participants were asked to provide an unconstrained label (1 to 3 words) describing the emotion conveyed by the facial expression and to indicate the level of arousal on a 6-point scale. Valence was obtained by identifying whether the word (or words) given by the participant was (were) of positive, neutral or negative affect as given in a dictionary. Details of the methods of this behavioral experiment and its computational analysis are in Section 2.6.3.

To deal with the many words one may use to define the same emotion, responses were aggregated to the most general word in WordNet. That is, if a subject selected a word that is a synonym or a descendant of a node used by another subject, the former was counted as a selection of the latter. For instance, if a subject used the word “anger” and another the word “irritated” to label the same facial expression, then the two were counted as a selection of “anger” because “anger” is an ancestor of “irritated” in WordNet.

A total of 150 subjects participated in this experiment, 30 subjects per language group. Each participant evaluated the 1,750 images in the experiment. Subjects in each language group had to be from a country where that language is spoken, Table I. Participants provided answers in their own language. Non-English responses were translated into English by an online dictionary.

2.6.2 Is there consistency of emotion categorization across language groups?

We find that the most commonly selected word was indeed identical in all language groups, except for expression ID 3 (which is produced with AUs 2 and 5), Table IV.

We also computed statistical significance between word selections. Here too, we find that the differences between the top selection and the others are statistically significant, except (again) for expression ID 3, Figure 3.

As can be appreciated in the figure, however, some labels were selected more consistently than others. For example, the first expression (produced with AU 4) is perceived as expressing sadness (the top selected word) about 22% of the time, but the last expression (produced with AUs 1, 4, 5, 20, 25 and 26) is perceived as expressing anger about 60% of the time.

ID Chinese English Farsi Russian Spanish
1 sadness sadness sadness sadness sadness
2 sadness sadness sadness sadness sadness
3 fear sadness fear surprise surprise
4 anger anger anger anger anger
5 happiness happiness happiness happiness happiness
6 happiness happiness happiness happiness happiness
7 happiness happiness happiness happiness happiness
8 happiness happiness happiness happiness happiness
9 happiness happiness happiness happiness happiness
10 sadness sadness sadness sadness sadness
11 sadness sadness sadness sadness sadness
12 anger anger anger anger anger
13 disgust disgust disgust disgust disgust
14 anger anger anger anger anger
15 happiness happiness happiness happiness happiness
16 happiness happiness happiness happiness happiness
17 happiness happiness happiness happiness happiness
18 happiness happiness happiness happiness happiness
19 happiness happiness happiness happiness happiness
20 surprise surprise surprise surprise surprise
21 sadness sadness sadness sadness sadness
22 surprise surprise surprise surprise surprise
23 surprise surprise surprise surprise surprise
24 surprise surprise surprise surprise surprise
25 anger anger anger anger anger
26 happiness happiness happiness happiness happiness
27 happiness happiness happiness happiness happiness
28 happiness happiness happiness happiness happiness
29 happiness happiness happiness happiness happiness
30 happiness happiness happiness happiness happiness
31 happiness happiness happiness happiness happiness
32 happiness happiness happiness happiness happiness
33 fear fear fear fear fear
34 fear fear fear fear fear
35 anger anger anger anger anger
TABLE IV: Most commonly used emotion category labels by subjects of the five language groups.

Thus, although there is a cross-cultural agreement on the categorical perception of the herein-identified facial expressions, this consensus varies wildly from one expression to the next.

Most current models and affective computing algorithms of the categorical perception of facial expressions of emotion do not account for this large variability. These results are an urgent call to researchers to start coding this fundamental information in their algorithms. Otherwise, the outcome of emotion inference will carry little value, because it is not always a robust signal.

Crucially too, we note that several facial configurations are categorized as expressing the same emotion, Table IV. For example, anger and sadness are expressed using five distinct facial configurations each, whereas disgust is conveyed using a single facial configuration. This variability in the number of facial configurations per emotion category is not accounted for by current models of the production of facial expressions, but it is essential to understand emotion.

Fig. 3: Categorical labeling of the expressions observed across cultures. The y-axis indicates the proportion of times subjects in the five language groups selected the emotion category labels given in the x-axis. The center number in each plot corresponds to the ID numbers in Table II. * p <.01, ** p <.001, *** p <.001.

Next, we ask if the consistency in the perception of valence is stronger than that of emotion categorization. Figure 4 shows the percentage of times subjects selected a negative, neutral and positive word to define each expression. Statistical differences between negative, neutral and positive valence are also specified in the figure. As we can see in the figure, there is overall strong agreement amongst participants of different cultures on the valence of the expression.

These results are in favor of a cross-cultural perception of valence in each of the 35 facial expressions, including expression ID 3, for which emotion categorization was not consistent. Nonetheless, as with emotion categorization, the degree of agreement across expressions varies significantly, from a maximum of 89.05% to a minimum of 46.25%. This variability in effect size is unaccounted for in current models and algorithms. One possible way to address this problem is to include the modeling of personal experiences and context [3, 25], but it is currently unknown how this modeling would yield the results reported in this paper. We urge researchers to address this open problem.

Since we observed cross-cultural consistency in the perception of emotion category and valence, it is reasonable to assume the same will be true for arousal. However, we find that the selection of arousal yields little to no agreement amongst subjects of different or even the same language groups, Figures 5a-d and 6a-b.

Additionally, the effect size of the top selection of arousal is generally very low. Also, as can be appreciated in the figure, there are very few selections that are statistically different from the rest. This suggests a mostly random selection process by human observers. There is a mild tendency toward agreement in the facial expressions with IDs 6 and 24. But, even in these cases, the results are too weak to suggest successful transmission of arousal.

Fig. 4: Perception of valence of the facial expressions used across cultures. The y-axis indicates the proportion of times subjects in the five language groups selected positive, neutral and negative labels to describe each facial configuration. Thus, the x-axis specifies valence. The top-center number in each plot corresponds to the ID of the facial configuration as given in Table II. * p <.01, ** p <.001, *** p <.001.

2.6.3 Methods of the behavioral experiment

Once the set of facial expressions was identified, we performed a behavioral experiment.

Our participants had normal or corrected-to-normal vision. We assessed consistency of perception of the above-identified facial expressions by native subjects living in the countries listed in Table I. This was an online experiment performed on Amazon Mechanical Turk. Country of residence was checked by IP address, and proficiency in the major language of that country was verified using a short grammatical test. A total of 150 subjects participated in this experiment, 30 subjects per language group. Before the experiment started, participants watched a short video which provided instructions on how to complete the task. The instructions and video were in the language of the participant. A small monetary payment was provided for completion of the experiment.

Participants assessed a total of 1,750 images. Images were displayed one at a time, Figure 2. A textbox was provided to the right of the image. Participants were asked to enter a 1- to 3-word description of the emotion expressed in the facial expression shown on the left, Figure 2. Below this textbox, participants had a sliding bar. This bar was used to indicate the level of arousal perceived in the expression on the left. The bar could be moved into six positions, going from disengaged (calm, relaxed, sluggish) to engaged (worked-up, stimulated, frenzied). The experiment was self-paced, i.e., there was no limit on time to complete this task.

Category labels provided by participants in a language other than English were automatically translated into English using the tools given in the Natural Language Toolkit (NLTK) [26]. The labels of all languages were then concatenated. This yielded the number of times each label was used by participants.

Next, we reduced the number of word redundancies. This was done using the hierarchical structure provided in WordNet. Specifically, any label that was a descendant of another label was converted to the latter, e.g., if a subject labeled a facial expression as expressing “outrage” and another subject labeled the same face as expressing “anger”, then “outrage” would be converted into “anger” because “anger” is an ancestor of “outrage” in WordNet.

Also, some participants provided non-emotive labels. For example, “looks angry” was a typical response. To eliminate labels that do not define an emotion concept (e.g.,“looks”), we eliminated words that are not a descendant of the word “feeling” in WordNet.
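Both of these WordNet-based steps (mapping a label to its more general ancestor and discarding labels that do not descend from “feeling”) reduce to one ancestor test. The following sketch is an assumed re-implementation with NLTK, not the authors' code; restricting the test to noun senses is our simplification.

```python
# Assumed re-implementation of the WordNet ancestor test used for label
# aggregation and for filtering out non-emotive words.
from nltk.corpus import wordnet as wn

def is_descendant(word, ancestor):
    """True if any noun sense of `word` lies below any noun sense of `ancestor`
    in WordNet's hypernym hierarchy."""
    ancestor_synsets = set(wn.synsets(ancestor, pos=wn.NOUN))
    for s in wn.synsets(word, pos=wn.NOUN):
        if set(s.closure(lambda x: x.hypernyms())) & ancestor_synsets:
            return True
    return False

# Aggregation: a label that descends from another label is mapped to the latter.
print(is_descendant("fury", "anger"))      # expected True in typical WordNet versions
# Filtering: keep only labels that descend from "feeling".
print(is_descendant("fury", "feeling"))    # expected True
print(is_descendant("table", "feeling"))   # expected False
```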

This process yielded the final word tally for each facial configuration. Frequency of word selection was given as the fraction of times each word was selected over the total number of words chosen by participants. Standard deviations of these frequencies were also computed. A paired-sample, right-tailed t-test was performed to test whether each word is significantly more frequent than the subsequent word (i.e., most selected word versus second most selected word, second most selected word versus third most selected word, etc.).
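This frequency comparison can be reproduced with SciPy. In the sketch below, the per-subject selection frequencies are randomly generated placeholders (the exact pairing unit used in the study is not spelled out here, so treat this only as an illustration of a paired, right-tailed t-test).

```python
# Illustration of a paired, right-tailed t-test on word-selection frequencies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-subject frequencies of the most and second-most selected word
# for one expression (150 subjects); these are toy numbers, not the study's data.
first = rng.normal(0.40, 0.05, size=150)
second = rng.normal(0.20, 0.05, size=150)

# Right-tailed: is the top word selected significantly MORE often than the runner-up?
t_stat, p_value = stats.ttest_rel(first, second, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```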

Next, we used a dictionary to check the valence (positive, neutral, or negative) of each of the words obtained by the above procedure. This was given automatically by the semantic processing tools in NLTK, which identify the valence of a word. This allowed us to compute the number of times participants selected positive, neutral and negative valence in each facial configuration. The same frequency and standard deviation computations described above are reported. Here, we used two-tailed t-tests to determine statistical differences between positive versus neutral valence, positive versus negative valence, and neutral versus negative valence.
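The paper does not name the specific NLTK tool, so the sketch below uses the VADER sentiment analyzer bundled with NLTK as one plausible stand-in for mapping a word to positive, neutral, or negative valence; the neutrality band eps is our choice.

```python
# One plausible realization of word-level valence with NLTK (an assumption,
# not necessarily the authors' exact tool).
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def word_valence(word, eps=0.05):
    """Map a word to 'positive', 'neutral', or 'negative' by its compound score."""
    score = sia.polarity_scores(word)["compound"]
    if score > eps:
        return "positive"
    if score < -eps:
        return "negative"
    return "neutral"

for w in ("happiness", "anger", "surprise"):
    print(w, word_valence(w))
```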

Finally, the same procedure described above was used to determine the frequency of selection of each arousal level, as well as its standard deviation. A two-tailed t-test was used to test statistical significance between neighboring levels of arousal.

2.7 Visual recognition of cultural-specific facial expressions

We ask if the results obtained above replicate when we study facial expressions that are only common in some (but not all) of the five language groups. As in the above, we first assessed the consistency in categorical labeling of the eight cultural-specific expressions. For each expression, only subjects in the language groups where this expression was found to be commonly used were asked to provide categorical labels.

As we can see in Figure 7, there is very little agreement in labeling these cultural-specific expressions, except for expression ID 36. This is in sharp contrast with the labeling results of the cross-cultural expressions shown in Figure 3.

These results suggest that cross-cultural expressions are employed in multiple cultures precisely because they provide a relatively accurate communication of emotion category. But, cultural-specific expressions do not.

What is the purpose of cultural-specific expressions then? Do we see consistency within each individual culture? As can be seen in Figure 8, the answer is again no. There is little agreement in labeling cultural-specific expressions even within the same culture.


Fig. 5: a. Perception of arousal of the facial expressions used across cultures. The plots show the proportion of subjects’ responses (y-axis) for each level of arousal (x-axis). b. Perception of arousal of the cross-cultural expressions by subjects in the Mandarin Chinese language group. The y-axis indicates the proportion of times subjects selected each of the levels of arousal indicated in the x-axis. c. Perception of arousal of the cross-cultural expressions by subjects in the English language group. d. Perception of arousal of the cross-cultural expressions by subjects in the Farsi language group. The top-center number in each plot corresponds to the ID numbers in Table II. * p <.01, ** p <.001, *** p <.001.


Fig. 6: a. Perception of arousal of the cross-cultural expressions by subjects in the Spanish language group. b. Perception of arousal of the cross-cultural expressions by subjects in the Russian language group. The y-axis indicates the proportion of times subjects selected each of the levels of arousal indicated in the x-axis. The top-center number in each plot corresponds to the ID numbers in Table II. * p <.01, ** p <.001, *** p <.001.

Although cultural-specific expressions do not yield a consistent perception of emotion category, maybe they transmit some other affect information.

Fig. 7: Categorical labeling of the cultural-specific expressions. The y-axis indicates the proportion of times subjects in the language groups where that expression is typically observed selected each of the emotion category labels given in the x-axis. The center number in each plot corresponds to the ID numbers in Table III. * p <.01, ** p <.001, *** p <.001.
Fig. 8: Categorical labeling of the cultural-specific expressions by subjects in each of the language groups: Mandarin Chinese (a), English (b), Farsi (c), Russian (d), and Spanish (e). The y-axis indicates the proportion of times subjects selected the emotion category label given in the x-axis. The top-center number in each plot corresponds to the ID numbers in Table III. * p <.01, ** p <.001, *** p <.001.

As in the above, we ask whether valence is consistently perceived in the cultures where these cultural-specific expressions are observed. Figure 9a summarizes the results.

It is clear from these results that valence is robustly and consistently interpreted by subjects of all cultures where the expression is used. In fact, the agreement across participants is generally very high.

This surprising finding, which contrasts with that of emotion categorization, suggests that while cross-cultural expressions may be used to transmit emotion category and valence, cultural-specific expressions may only transmit valence.

Given the above result, we wondered if the perception of valence in cultural-specific expressions is also consistently interpreted in cultures where these expressions are not commonly observed.

Figure 9b shows the perception of valence of the eight cultural-specific expressions by subjects in cultures where the expressions are not commonly seen.

As can be seen in this figure, subjects in these cultures successfully interpret valence as well. These results support the view that the valence of cultural-specific expressions is consistently transmitted across cultures, even if those expressions are not typically seen in them.

Since the role of cultural-specific expressions seems to be that of communicating valence, we ask if they can also successfully transmit arousal.

We find that interpretation of arousal is relatively consistent, even if not highly accurate, Figure 10a. Note the effect size of the most selected arousal value and the statistical significance between this and the neighboring (left and right) levels of arousal. Additionally, note that the distribution in these plots is closer to that of an ideal observer (i.e., a Gaussian distribution with small variance).

Given this latest result, it is worth asking whether arousal is also successfully interpreted by people in cultures where these expressions are not commonly observed. We find this not to be the case, Figure 10b.

In summary, the above results support a revised model of the production and perception of facial expressions of emotion with two types of expressions. The first type includes expressions that communicate emotion category as well as valence. These are used across cultures and, hence, have been referred to herein as cross-cultural expressions. The second type of expressions does not transmit emotion categories but, rather, is best at communicating valence within and across cultures, and arousal within the cultures where those expressions are typically employed. We call these cultural-specific expressions.

Of all affect concepts, valence is the one transmitted most robustly to an observer. Importantly, a linear machine learning classifier can correctly categorize positive versus negative valence (with 95.34% accuracy) using just a few AUs, Section 2.7.1.

Specifically, we find that positive valence is given by the presence of AU 6 and/or 12, whereas negative valence is given by the presence of AU 1, 4, 9 and/or 20.

Hence, only six of the main 14 AUs are major contributors to the communication of valence. However, this result may only apply to expressions seen in day-to-day social interactions, and not where the intensity of valence is extreme since these are generally misinterpreted by observers [27].
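The finding above can be condensed into a simple rule of thumb. The sketch below is only an illustration of that rule (how to break ties, and what to do when an expression contains both positive and negative AUs, is our assumption), not the SDA classifier described in Section 2.7.1.

```python
# Rule-of-thumb valence from the AUs reported above.
POSITIVE_AUS = {6, 12}          # presence signals positive valence
NEGATIVE_AUS = {1, 4, 9, 20}    # presence signals negative valence

def valence_from_aus(active_aus):
    pos = len(POSITIVE_AUS & set(active_aus))
    neg = len(NEGATIVE_AUS & set(active_aus))
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "undetermined"       # tie handling is an assumption

print(valence_from_aus({6, 12, 25}))        # happiness-like expression (ID 16)
print(valence_from_aus({4, 7, 9, 10, 17}))  # disgust-like expression (ID 13)
```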

2.7.1 Automatic classification of valence

We define each facial expression as a 14-dimensional feature vector, with each dimension in this vector defining the presence (+1) or absence (0) of each AU. Sample vectors of the 43 facial expressions of emotion are grouped into two categories – those which yield the perception of positive valence and those yielding a perception of negative valence.

Subclass Discriminant Analysis (SDA) [28] is used to identify the most discriminant AUs between these two categories, positive versus negative valence. This is given by the eigenvector v1 associated with the largest eigenvalue of the metric of SDA. SDA's metric is computed as the product of between-class dissimilarity and within-class similarity. If the i-th AU does not discriminate between the facial expressions of positive versus negative valence, then the corresponding eigenvector entry v1(i) is close to zero. If v1(i) is large and of one sign, this AU is present exclusively (or almost exclusively) in expressions of positive valence. If v1(i) is large and of the opposite sign, the AU is present exclusively (or almost exclusively) in expressions of negative valence. This process identified AUs 6 and 12 as major communicators of positive valence and AUs 1, 4, 9 and 20 as transmitters of negative valence. This result includes cross-cultural and cultural-specific expressions.

We tested the accuracy of the above-defined SDA classifier using a leave-one-expression-out approach. That means the samples of all expressions, but one, are used to train the classifier and the sample defining the left-out expression is used to test it. With 43 expressions, there are 43 possible expressions that can be left out. We computed the mean classification accuracy over these left-out samples, given by the ratio of the number of correctly classified left-out expressions over the total number of left-out expressions. This yielded a classification accuracy of 95.34%. That is, the AUs listed above are sufficient to correctly identify valence in a previously unseen expression 95.34% of the time.
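The leave-one-expression-out protocol is easy to mirror with off-the-shelf tools. The sketch below uses ordinary Linear Discriminant Analysis from scikit-learn as a stand-in for SDA (which scikit-learn does not provide), and a small hypothetical subset of AU vectors and valence labels drawn from Tables II and IV; it illustrates the evaluation loop, not the reported 95.34% figure.

```python
# Leave-one-expression-out evaluation with a linear classifier (LDA as a
# stand-in for SDA); the expression subset and labels below are illustrative.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

AUS = (1, 2, 4, 5, 6, 7, 9, 10, 12, 15, 17, 20, 25, 26)

def au_vector(active):
    """14-dimensional binary vector: 1 if the AU is active, 0 otherwise."""
    return np.array([1.0 if au in active else 0.0 for au in AUS])

# A few expressions from Table II, labeled 1 (positive valence) if their
# perceived category in Table IV is happiness, and 0 (negative) otherwise.
expressions = {
    5: ({12}, 1), 9: ({6, 12}, 1), 16: ({6, 12, 25}, 1),
    1: ({4}, 0), 4: ({4, 7}, 0), 13: ({4, 7, 9, 10, 17}, 0),
    33: ({4, 7, 9, 20, 25, 26}, 0),
}
ids = list(expressions)
X = np.stack([au_vector(expressions[i][0]) for i in ids])
y = np.array([expressions[i][1] for i in ids])

correct = 0
for k in range(len(ids)):                       # leave one expression out
    train = [j for j in range(len(ids)) if j != k]
    clf = LinearDiscriminantAnalysis().fit(X[train], y[train])
    correct += int(clf.predict(X[k:k + 1])[0] == y[k])
print(f"leave-one-expression-out accuracy: {correct / len(ids):.2f}")
```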

Fig. 9: a. Perception of valence of the cultural-specific expressions by subjects in the cultures where each expression is common. The plots show the proportion of times (y-axis) subjects selected positive, neutral and negative labels (x-axis) to describe each facial configuration. b. Perception of valence of the eight cultural-specific expressions by subjects in the cultures where the expressions are not commonly observed. The top-center number in each plot corresponds to the ID of the facial configuration as given in Table III. * p <.01, ** p <.001, *** p <.001.
Fig. 10: a. Perception of arousal of the cultural-specific expressions in cultures where these expressions are commonly used. Plots show the proportion of subjects’ responses (y-axis) for the level of arousal indicated in the x-axis. Arousal is given by a 6-point scale, from less to more engaged. These results suggest a quite accurate perception of arousal for cultural-specific expressions, except for expression 36. Expression ID 37 has a mid to high arousal (values between 1 and 3 in the tested scale); expression ID 38 a value of 1 to 2; expressions IDs 39, 40 and 41 a value of 1; expression ID 42 a value of 2 to 3; and expression ID 43 a value of 1. We note that all these values of arousal are on the positive side (i.e., the expression appears engaged). This may indicate a preference for emotive signals that convey alertness and active social interactions in cultural-specific expressions. b. Perception of arousal of the eight cultural-specific expressions in cultures where the expressions are not commonly observed. Note that although there is a tendency toward the results shown in a, there is little to no statistical significance between the top selection and those immediately right and left of it. The exception is expression ID 43, whose result is statistically significant and whose perception of arousal matches that in a. * p <.01, ** p <.001, *** p <.001. The top-center number in each plot specifies the expression ID.

Data availability: The data used in this paper is available from the authors and will be made accessible online (after acceptance).

3 Discussion

To derive algorithms, researchers in affective computing use existing theoretical models of the production and visual perception of facial expressions of emotion.

However, current theoretical models disagree on how many expressions are regularly used by humans and what affect variables these expressions transmit (emotion category, valence, arousal, a combination of them, or some other unknown, underlying dimensions) [3, 7]. Thousands of experiments have been completed to test these models, with some supporting the transmission of emotion category and others favoring the transmission of valence, arousal and other dimensions [6]. As a consequence, the area has been at an impasse for decades, with strong disagreements on how many expressions exist, what they communicate, and whether they are cultural-specific or cross-cultural [29, 4, 24, 30, 31].

We argue that previous studies were unable to resolve this impasse because they were performed in the laboratory, under highly controlled conditions. What is needed is an understanding of the expressions people of different cultures use in the wild, under unconstrained conditions.

The goal is to discover these expressions, rather than search for pre-determined ones. The present paper constitutes, to our knowledge, the first of these studies.

Specifically, we have studied 7.2 million facial configurations employed by people of 31 countries where either English, Spanish, Mandarin Chinese, Farsi, or Russian is the primary language, Table I.

First, we found that while 16,384 facial configurations are possible, only 35 are typically observed across language groups.

We then showed that these expressions successfully transmit emotion category and valence across cultures.

These results are consistent with models of the production and perception of compound expressions [12, 6, 32] and models that suggest people experience at least 27 distinct emotions [17].

However, and crucially, the robustness of this categorical perception varies widely from expression to expression. This variability is not explained by current algorithms and models of the automatic analysis of the production and perception of facial expressions of emotion and, hence, these will need to be amended.

One possible explanation for this variability in perceptual accuracy is that emotions are affected by constructs of people's experiences [3, 33]. A related thesis is that this variability is explained by the number of neural mechanisms involved in the interpretation of facial expressions [34, 35, 36]. Even if that were the case, though, we would still need to explain why some expressions are more affected than others.

Moreover, it is worth noting that while the sensitivity of these categorizations is high, specificity is not, Table V. This suggests that the successful categorization of the herein identified expressions is important to humans, even if it comes at the cost of making errors elsewhere (i.e., false positives).

Importantly, we note that the number of facial configurations that communicate each emotion category varies significantly though, Table IV. At one extreme, happiness is expressed using seventeen facial configurations. At the other end, disgust only utilizes a single expression. This result could point to individual differences in expressing the same emotion category [12], variations due to context [37], or multiple mechanisms as suggested by predictive coding [3]. But, at present, affective computing algorithms that analyze the production and perception of facial expressions of emotion do not account for this effect.

Specificity Sensitivity
Happiness 0.616 0.961
Sadness 0.487 0.967
Anger 0.534 0.964
Surprise 0.461 0.949
Fear 0.418 0.967
Disgust 0.349 0.983
TABLE V: Specificity and sensitivity of the visual recognition of the six basic emotion categories perceived by participants.
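For reference, the two quantities in Table V are presumably computed per emotion category in a one-versus-rest fashion; the counts in the sketch below are hypothetical and serve only to show the definitions.

```python
# Definitions behind Table V, with hypothetical counts for one category.
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# tp: images of the category labeled with it; fn: images of it labeled otherwise;
# tn: other images not labeled with it; fp: other images labeled with it.
sens, spec = sensitivity_specificity(tp=480, fn=20, tn=800, fp=500)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```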

Second, we discovered that the number of expressions that are used in one or more, but not all, cultures is even smaller than that of cross-cultural expressions. The number of these cultural-specific expressions is 8.

This result suggests that most facial expressions of emotion are used across cultures, as some theories of emotion propound [9, 10]. Crucially though, a subsequent study of the visual perception of cultural-specific expressions showed they successfully convey valence and arousal, but not emotion category. More striking was the finding that valence, but not arousal, is effectively interpreted in cultures where these cultural-specific expressions are not regularly employed. This is another finding that cannot be explained by current models and, consequently, is not coded in current affective computing algorithms.

Some studies in cognitive neuroscience have suggested there are dissociated brain mechanisms for the recognition of emotion categories versus valence versus arousal [38, 39, 5, 40, 41]. These results are consistent with our finding of two sets of expressions, with only one set transmitting emotion category but both successfully communicating valence. Additionally, only the cultural-specific expressions convey arousal, and, even here, only within the cultures where those expressions are commonly observed. This is consistent with the work of [42], who found that distinct neural processes are involved in emotional memory formation for arousing versus valenced, non-arousing stimuli. However, our results seem to contradict those of a recent study showing robust transmission of arousal across cultures and species [43]. Nonetheless, it is worth noting that this latter study involved the recognition of arousal of an aural signal, not a visual one. This could suggest that arousal is most robustly transmitted orally.

Adding to this differentiation between oral and visual emotive signals, [44] found that negative valence is successfully communicated orally across cultures, but that most culture-specific oral signals have positive valence. In contrast to these findings, our results suggest that the visual signal of emotion conveys positive and negative valence across cultures, and mostly (but not exclusively) positive valence in culture-specific expressions. Thus, there appears to be a more extensive communication of valence by the visual signal than by the oral one.

The above results are in support of a dedicated neural mechanism for the recognition of valence. A good candidate for this neural mechanism is the amygdala [45]. For instance, [5] have recently shown that the amygdala differentially responds to positive versus negative valence even when the emotion category is constant; although the amygdala may also determine arousal [41]. Another candidate is the subthalamic nucleus. Using single-cell recordings in humans, [5] identified dissociated cell populations in this brain region for the recognition of arousal and valence. But it is also likely that multiple areas contribute to the recognition of valence as well as other affect signals [34, 36, 35].

Thus, the cognitive neuroscience studies cited in the preceding paragraphs are in support of our results and, hence, affective computing methods must include them if the results are to be meaningful to users. We thus posit that the results presented herein will be fundamental to advance the state of the art in the area. New or amended models of emotion will need to be proposed.

4 Conclusions

Emotions play a major role in human behavior. Affective computing is attempting to infer as much information as possible from the aural and visual signals.

Yet, at present, researchers disagree on how many emotions humans have, whether these are cross-cultural or cultural-specific, whether emotions are best represented categorically or as a set of continuous affect variables (valence, arousal), how context influences production and perception of expressions of emotion, or whether personal constructs modulate the production and perception of emotion.

Laboratory studies and controlled in-lab experiments are insufficient to address this impasse. What is needed is a comprehensive study of which expressions exist in the wild (i.e., in uncontrolled conditions outside the laboratory), not in tailored, controlled experiments.

To address this problem, the present paper presented the first large-scale study (7.2 million images) of the production and visual perception of facial expressions of emotion in the wild. We found that of the 16,384 possible facial configurations that people can produce, only 35 are successfully used to transmit affect information across cultures, and only 8 within a smaller number of cultures.

Crucially, our studies showed that these 35 cross-cultural expressions successfully transmit emotion category and valence but not arousal. The 8 cultural-specific expressions successfully transmit valence and, to some degree, arousal, but not emotion category.

We also find that the degree of successful visual interpretation of these 43 expressions varies significantly. And, the number of expressions used to communicate each emotion is also different, e.g., 17 expressions transmit happiness, but only 1 is used to convey disgust.

These findings are essential to change the direction in the design of algorithms for the coding and reading of emotion.

Acknowledgment

This research was supported by the National Institutes of Health (NIH), grant R01-DC-014498, and the Human Frontier Science Program (HFSP), grant RGP0036/2016.

References

  • [1] R. E. Jack and P. G. Schyns, “Toward a social psychophysics of face communication,” Annual review of psychology, vol. 68, pp. 269–297, 2017.
  • [2] L. F. Barrett, R. Adolphs, S. Marsella, A. Martinez, and S. Pollak, “Emotional expressions reconsidered: Challenges to inferring emotion in human facial movements,” Psychological Science in the Public Interest, under review.
  • [3] L. F. Barrett, How emotions are made: The secret life of the brain.   Houghton Mifflin Harcourt, 2017.
  • [4] P. Ekman, “What scientists who study emotion agree about,” Perspectives on Psychological Science, vol. 11, no. 1, pp. 31–34, 2016.
  • [5] M. J. Kim, A. M. Mattek, R. H. Bennett, K. M. Solomon, J. Shin, and P. J. Whalen, “Human amygdala tracks a feature-based valence signal embedded within the facial expression of surprise,” Journal of Neuroscience, vol. 37, no. 39, pp. 9510–9518, 2017.
  • [6] A. M. Martinez, “Visual perception of facial expressions of emotion,” Current opinion in psychology, vol. 17, pp. 27–33, 2017.
  • [7] J. A. Russell and J.-M. Fernández-Dols, The science of facial expression.   Oxford University Press, 2017.
  • [8] A. M. Martinez, “Computational models of face perception,” Current directions in psychological science, vol. 26, no. 3, pp. 263–269, 2017.
  • [9] P. Ekman, “An argument for basic emotions,” Cognition & emotion, vol. 6, no. 3-4, pp. 169–200, 1992.
  • [10] C. E. Izard, Human emotions.   Springer Science & Business Media, 2013.
  • [11] P. Ekman and E. L. Rosenberg, What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS).   Oxford University Press, USA, 1997.
  • [12] S. Du, Y. Tao, and A. M. Martinez, “Compound facial expressions of emotion,” Proceedings of the National Academy of Sciences, vol. 111, no. 15, pp. E1454–E1462, 2014.
  • [13] B. M. Rowe and D. P. Levine, A concise introduction to linguistics.   Routledge, 2015.
  • [14] W. Techs, “Usage of content languages for websites,” 2014.
  • [15] G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
  • [16] C. F. Benitez-Quiroz, R. Srinivasan, A. M. Martinez et al., “Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild.” in CVPR, 2016, pp. 5562–5570.
  • [17] A. S. Cowen and D. Keltner, “Self-report captures 27 distinct categories of emotion bridged by continuous gradients,” Proceedings of the National Academy of Sciences, vol. 114, no. 38, pp. E7900–E7909, 2017.
  • [18] E. Siegel, M. Sands, W. den Noortgate Van, P. Condon, Y. Chang, J. Dy, K. Quigley, and L. Barrett, “Emotion fingerprints or emotion populations? a meta-analytic investigation of autonomic features of emotion categories.” Psychological bulletin, 2018.
  • [19] C. F. Benitez-Quiroz, R. B. Wilbur, and A. M. Martinez, “The not face: A grammaticalization of facial expressions of emotion,” Cognition, vol. 150, pp. 77–84, 2016.
  • [20] M. Rychlowska, Y. Miyamoto, D. Matsumoto, U. Hess, E. Gilboa-Schechtman, S. Kamble, H. Muluk, T. Masuda, and P. M. Niedenthal, “Heterogeneity of long-history migration explains cultural differences in reports of emotional expressivity and the functions of smiles,” Proceedings of the National Academy of Sciences, vol. 112, no. 19, pp. E2429–E2436, 2015.
  • [21] C. Darwin and P. Prodger, The expression of the emotions in man and animals.   Oxford University Press, USA, 1998.
  • [22] G.-B. Duchenne, The mechanism of human facial expression.   Cambridge university press, 1990.
  • [23] J. Posner, J. A. Russell, and B. S. Peterson, “The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology,” Development and psychopathology, vol. 17, no. 3, pp. 715–734, 2005.
  • [24] J. A. Russell, “Core affect and the psychological construction of emotion.” Psychological review, vol. 110, no. 1, p. 145, 2003.
  • [25] B. T. Leitzke and S. D. Pollak, “Developmental changes in the primacy of facial cues for emotion recognition.” Developmental psychology, vol. 52, no. 4, p. 572, 2016.
  • [26] S. Bird and E. Loper, “Nltk: the natural language toolkit,” in Proceedings of the ACL 2004 on Interactive poster and demonstration sessions.   Association for Computational Linguistics, 2004, p. 31.
  • [27] H. Aviezer, Y. Trope, and A. Todorov, “Body cues, not facial expressions, discriminate between intense positive and negative emotions,” Science, vol. 338, no. 6111, pp. 1225–1229, 2012.
  • [28] M. Zhu and A. M. Martinez, “Subclass discriminant analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1274–1286, 2006.
  • [29] C. Chen and R. E. Jack, “Discovering cultural differences (and similarities) in facial expressions of emotion,” Current opinion in psychology, vol. 17, pp. 61–66, 2017.
  • [30] A. E. Skerry and R. Saxe, “Neural representations of emotion are organized around abstract event features,” Current biology, vol. 25, no. 15, pp. 1945–1954, 2015.
  • [31] K. A. Lindquist, M. Gendron, A. B. Satpute, and K. Lindquist, “Language and emotion,” Handbook of Emotions, 4th ed., The Guilford Press, New York, NY, 2016.
  • [32] L. F. Barrett, B. Mesquita, and E. R. Smith, “The context principle,” The mind in context, vol. 1, 2010.
  • [33] M. Gendron, D. Roberson, J. M. van der Vyver, and L. F. Barrett, “Cultural relativity in perceiving emotion from vocalizations,” Psychological science, vol. 25, no. 4, pp. 911–920, 2014.
  • [34] K. A. Lindquist, T. D. Wager, H. Kober, E. Bliss-Moreau, and L. F. Barrett, “The brain basis of emotion: a meta-analytic review,” Behavioral and brain sciences, vol. 35, no. 3, pp. 121–143, 2012.
  • [35] R. P. Spunt and R. Adolphs, “A new look at domain specificity: insights from social neuroscience,” Nature Reviews Neuroscience, vol. 18, no. 9, p. 559, 2017.
  • [36] M. Mather, “The affective neuroscience of aging,” Annual Review of Psychology, vol. 67, 2016.
  • [37] A. Etkin, C. Büchel, and J. J. Gross, “The neural bases of emotion regulation,” Nature Reviews Neuroscience, vol. 16, no. 11, p. 693, 2015.
  • [38] J. Chikazoe, D. H. Lee, N. Kriegeskorte, and A. K. Anderson, “Population coding of affect across stimuli, modalities and individuals,” Nature neuroscience, vol. 17, no. 8, p. 1114, 2014.
  • [39] R. J. Harris, A. W. Young, and T. J. Andrews, “Morphing between expressions dissociates continuous from categorical representations of facial expression in the human brain,” Proceedings of the National Academy of Sciences, vol. 109, no. 51, pp. 21 164–21 169, 2012.
  • [40] R. Srinivasan, J. D. Golomb, and A. M. Martinez, “A neural basis of facial action recognition in humans,” Journal of Neuroscience, vol. 36, no. 16, pp. 4434–4442, 2016.
  • [41] C. D. Wilson-Mendenhall, L. F. Barrett, and L. W. Barsalou, “Neural evidence that human emotions share core affective properties,” Psychological science, vol. 24, no. 6, pp. 947–956, 2013.
  • [42] E. A. Kensinger and S. Corkin, “Two routes to emotional memory: Distinct neural processes for valence and arousal,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 9, pp. 3310–3315, 2004.
  • [43] P. Filippi, J. V. Congdon, J. Hoang, D. L. Bowling, S. A. Reber, A. Pašukonis, M. Hoeschele, S. Ocklenburg, B. De Boer, C. B. Sturdy et al., “Humans recognize emotional arousal in vocalizations across all classes of terrestrial vertebrates: evidence for acoustic universals,” in Proc. R. Soc. B, vol. 284, no. 1859.   The Royal Society, 2017, p. 20170990.
  • [44] D. A. Sauter, F. Eisner, P. Ekman, and S. K. Scott, “Cross-cultural recognition of basic emotions through nonverbal emotional vocalizations,” Proceedings of the National Academy of Sciences, vol. 107, no. 6, pp. 2408–2412, 2010.
  • [45] S. Wang, R. Yu, J. M. Tyszka, S. Zhen, C. Kovach, S. Sun, Y. Huang, R. Hurlemann, I. B. Ross, J. M. Chung et al., “The human amygdala parametrically encodes the intensity of specific facial emotions and their categorical ambiguity,” Nature communications, vol. 8, p. 14821, 2017.