Deep Emotion: A Computational Model of Emotion Using Deep Neural Networks

08/25/2018 ∙ by Chie Hieida, et al.

Emotions are very important for human intelligence. For example, emotions are closely related to the appraisal of the internal bodily state and external stimuli. This helps us to respond quickly to the environment. Another important perspective in human intelligence is the role of emotions in decision-making. Moreover, the social aspect of emotions is also very important. Therefore, if the mechanism of emotions were elucidated, we could advance toward the essential understanding of our natural intelligence. In this study, a model of emotions is proposed to elucidate the mechanism of emotions through the computational model. Furthermore, from the viewpoint of partner robots, the model of emotions may help us to build robots that can have empathy for humans. To understand and sympathize with people's feelings, the robots need to have their own emotions. This may allow robots to be accepted in human society. The proposed model is implemented using deep neural networks consisting of three modules, which interact with each other. Simulation results reveal that the proposed model exhibits reasonable behavior as the basic mechanism of emotion.






2 Keywords:

Emotion Model, Human–Robot Interaction, Empathic Communication, Machine Learning, Recurrent Attention Model, Convolutional Long Short-Term Memory, Deep Deterministic Policy Gradient


3 Introduction

The development of artificial intelligence (AI) in recent years has been remarkable. In certain tasks such as object recognition, it is said that AI has surpassed human capabilities. However, one might think that emotion separates human intelligence from AI. Is this true? If the human mind is created as a result of the calculations of the brain, then emotions could be simulated by a computer. Would this imply the possibility that a robot could have emotions? To answer this question, we should start thinking of the basic mechanism of emotion. If the mechanism of emotion were elucidated, we could get closer to the essential understanding of what a human being is.

Because emotions are very important to human beings, many studies on emotions have been carried out in the past. Here, the conventional studies on emotions are organized from the viewpoint of their research approach. Many psychological studies, among others, have tried to capture emotional phenomena. Emotional facial expression studies, for instance, are based on the idea of basic emotions theory. It is well known that Ekman insisted that there are six basic emotions, regardless of culture (Ekman and Wallace, 1971). Plutchik and Izard respectively assumed eight and ten basic emotions (Plutchik, 1980, 1982; Izard, 1977). Because the basic emotions theory is based on the evolutionary point of view, emotional expressions are defined at the nerve level, and the categories of expression and recognition of basic facial expressions are universal. The dimensional model of emotions is another well known approach for emotions (Russell, 1980; Schlosberg, 1954). This expresses emotions in approximately two to three cognitive dimensions based on a factor analysis of judgments on emotional stimuli such as expressive photographs or emotional expression words. Although many emotion-related studies are based on dimensional models, these models do not reveal the mechanism behind emotional phenomena.

Various emotional models have been proposed in the literature from the physiological viewpoint (James, 1884; Cannon, 1927). The central idea in the James–Lange theory is represented in the quote “We don’t laugh because we’re happy, we’re happy because we laugh” by James. However, the Cannon–Bard theory contradicts it. The question of which of these theories is correct has long been controversial. Schachter, in contrast, advocated a two-factor theory and developed an emotional theory that included these two competing theories (Schachter and Singer, 1962). Cognitive theory is also famous for incorporating cognitive activities in the form of judgments, evaluations, or thoughts (Arnold, 1960; Lazarus, 1991; Ortony et al., 1988). These models have important implications for emotion; however, because they do not model the entire mechanism of emotions, they do not necessarily clarify what emotions are. Moreover, they are not computational models. In other words, the models cannot be directly implemented on a computer.

Neuroscience has revealed neural circuits, such as the Papez circuit (Papez, 1937) and Yakovlev circuit (Yakovlev, 1948), that are relevant to emotions. LeDoux discussed the function of the brain in emotions in detail based on the anatomical point of view. He proposed the dual pathway theory, which claims that there are two types of emotional processing paths: automatic and rapid processing by the limbic system, and complicated and higher cognitive processing from the neocortex to the amygdala (LeDoux, 1986, 1989, 1998). More recently, the quartet theory of emotions, which claims four important systems for emotions in the brain, was proposed (Koelsch et al., 2015). These are the brainstem-centered, diencephalon-centered, hippocampus-centered, and orbitofrontal-centered systems. As a matter of course, the authors argue that the limbic/paralimbic structures, i.e., the basal ganglia, amygdala, insular cortex, and cingulate cortex, are also of importance for affective processes. As shown in computational neuroscience studies, the cortical-basal ganglia loop can be considered as a reinforcement learning module. In particular, the striatum plays a very important role in sensorimotor, attentional, and emotional processes. These neuroscientific findings are not only important for concretely considering emotion models, but also have direct implications for computational models.

From the viewpoint of human–robot interaction, emotion is one of the most important factors for partner robots. The intuition that the difference between humans and AI (robots) lies in emotion implies that realizing an emotion model may be the key to realizing robots and/or AI with high affinity for human beings. Picard proposed the idea of affective computing, in which emotion recognition in humans has been studied extensively, mainly by examining facial expressions (Picard, 1997). The success of deep learning in recent years has accelerated this line of research. Of course, the classification of a person’s inner state based on facial expressions is very useful for the robot to communicate with us, because it can select its response according to the recognized result. However, it is fair to say that the recognition of facial expressions is different from a “true understanding” of the emotional states of others, even though a highly accurate facial expression recognition method is available thanks to deep learning technologies. Robots need to understand, sympathize, and act according to their partners’ complex emotional states in order to become accepted members of human society. Toward this goal, many efforts on designing emotional expressions for social robots have been made (Breazeal, 2002). However, almost all these emotions have been designed manually. High-level complex social emotions for robots are difficult to preprogram manually. In fact, conventional studies have only been able to accomplish simple basic emotions such as happiness and sadness (Masuyama and Loo, 2015; Woo et al., 2015). An emotion model for robots based on the difference equation was proposed (Miwa et al., 2001); however, the system was too simple to generate complex higher-level emotions. The basic idea underlying this study is that the problem of emotions should be formulated as “understanding by a generative process of emotions” rather than “classification.” If we abandon the manual design of emotions, emotional differentiation (Bridges, 1932; Lewis, 2000) must be the right path to follow in order to achieve this ultimate goal. This idea shares the same goal as the affective developmental robotics proposed by Asada (Asada, 2015).

Accordingly, we propose a computational model of emotions, which is based on certain neurological and psychological findings in the literature. The purpose of this paper is first to present a general meta-level framework for the mechanism behind emotions. The literature on emotions discussed above motivates us to propose a three-layer model to cover emotions. The first layer corresponds to the appraisal module, which is responsible for quick evaluation of the external world and internal body. Interoception in particular, which is sensitivity to stimuli originating inside the body, is a very important factor. The second layer has an emotional memory to adjust the innate appraisal module in the first layer to the surrounding environment that the agent is facing. The third layer includes reinforcement learning and sequence learning modules that correspond to the cortical-basal ganglia loop. This is because the important aspect of emotions is their role in decision-making (Moerland et al., 2017). The dual pathway theory by LeDoux is one of the important theories of emotions, which forms the basis of our proposed three-layer model. Moreover, this three-layer model roughly matches the recent neurobiological emotion model (Koelsch et al., 2015).

We also attempt to implement the three-layer model of emotion using deep neural networks. Our proposed implementation relies on a combination of the recurrent attention model (RAM) (Mnih et al., 2014) for the first layer, as well as the convolutional long short-term memory (LSTM) (Xingjian et al., 2015) and the reinforcement learning module using the deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015) for the third layer. The second layer is realized based on a mechanism of nonlinear smoothing, which makes the whole emotion system adaptable to the surrounding environment. Then, the implemented computational model of emotion is tested by employing certain tasks simulating mother–infant interaction to evaluate the plausibility of the model. Some promising results are obtained in the experiment. For example, we found that the policy network represents emotional states and exhibits emotion differentiation in the proposed three-layer model. We believe this constructive approach toward emotions may yield a clue to the elucidation of human emotions. Moreover, the generative model of emotion is also important for achieving empathic communication between humans and robots.

The contributions of this study are threefold. First, this study investigates a meta-level model of emotion as a whole. Second, an implementation of the emotion model using deep learning modules is provided. Third, we design a simulation task mimicking mother–infant interaction in order to evaluate the model, which reveals that the proposed model is indeed able to show emotion differentiation. It should be noted that our previous studies showed some preliminary ideas and examinations of the proposed emotion model (Hieida and Nagai, 2017; Hieida et al., 2018a, b). Although the basic idea of this work is shared with the previous studies, this paper provides a detailed explanation of the model and a full implementation as an entire emotion network, which were not given in the previous works. Moreover, we repeated all of the experiments, and the results presented in this paper are completely new.

The remainder of this paper is organized as follows. In the next section, the literature on emotions is discussed and then a model of emotions is proposed. Section 3 provides an implementation of the proposed emotion model using deep neural networks. Each module of the network is explained in detail. The experiments are presented in Section 4, which indicate plausibility of the proposed deep emotion model. Finally, this paper is summarized in Section 5.

4 Model of Emotion

Here, we present an overview of the basic idea of our proposed emotion model. Some important findings for our proposal are reviewed first, followed by the proposed model of emotions.

4.1 Emotions in literature

What is emotion? In order to propose a model of emotion, we should start from this important question. Moreover, we need to clarify the definition of emotion. To the best of our knowledge, there is no universal consensus on the definition of emotion; however, recent research reveals the importance of the body in emotion. This is what William James claimed long ago, also called the peripheral origin theory of emotion (James, 1884). Recent studies in cognitive neuroscience have revealed that interoception, which is a perception of the internal bodily state, is a key for the subjective experience of emotion (Terasawa et al., 2013). In the quartet theory of emotions, the brainstem-centered system corresponds to this type of emotion system (Koelsch et al., 2015). The brainstem is the oldest brain structure and the reticular formation plays an important role in this system. Another important aspect of embodiment in emotion is Damasio’s somatic marker hypothesis, which hypothesized that emotions evaluate external stimuli efficiently through our own body (Damasio et al., 1996). This motivates us to consider both internal and external appraisals simultaneously.

In any case, the physical body is an origin of emotions and is indispensable. In this paper, we consider the body and interoception as one subsystem, i.e., the first layer in the proposed model. In fact, despite the differentiating property of emotions, some basic emotions such as anger, joy, disgust, fear, sorrow, and surprise exist regardless of culture (Ekman and Wallace, 1971). This is possible because we as human beings share similar physical bodies and environments. This result supports the view that emotions are based on our physical bodily states. The idea of active inference is also related to this embodied system, e.g., visual attention is relevant to the appraisal of visual stimuli (Friston et al., 2010; Seth and Friston, 2016).

Another important aspect of emotion is related to decision-making (LeDoux, 1998) and inference on causal attribution. For example, a misattribution of arousal, which is also known generally as the suspension-bridge effect, has been found to happen to someone who experiences the effects of fear of physical danger while meeting someone, and who mistakenly believes that the other person is the cause of their physical responses (Dutton, 1974). This higher-level cognitive process seems to be deeply related to the orbitofrontal-centered system. The reinforcement learning module, which originates from the cortical-basal ganglia loop, is also related. The relationship between active inference and reinforcement learning has been discussed (Friston et al., 2009), which implies that this system is also related to active inference. In this paper, we consider decision-making as another subsystem, which we call the third layer in the proposed model.

The discussion so far implies that one’s emotional system is divided into two systems: 1) the hardwired innate system and 2) the learning system, which is related to the decision-making. Now, we define emotions (emotional states) and feelings (emotional feelings). Emotions are defined as a set of physical reactions, state changes of visceral and skeletal muscles, and changes in internal conditions. These changes are evoked by the above systems 1) and 2). In contrast, feelings are defined as perceptions of the emotional states. These definitions are based on Damasio’s definition (Damasio, 2003). It should be noted that the term “emotion” corresponds to “affect” in the area of psychology: this paper uses the term “emotion,” which is generally acceptable.

Figure 1: Illustration of “anger” and “fear,” which highlights the difference: (a) emotional feeling of anger, and (b) emotional feeling of fear.

In order to clarify the definition of emotions/feelings used in this paper, Fig. 1 illustrates concrete examples. In the figure, there are a stimulus A and a bodily state that evoke the “Fight” action, whereas a stimulus B and a bodily state activate the “Flight” action. In this case, the emotional state that stimulus A and the bodily state cause is labeled as “anger,” and the emotional state caused by stimulus B and the bodily state is labeled as “fear.” This definition directly connects emotions to the somatic marker hypothesis, which means that the emotion should be generated by considering internal appraisal, external appraisal, and decision-making mechanisms.

Regarding the learning system, a memory-based system is an important candidate as a building block of the emotion model. In the quartet theory, the hippocampus-centered system corresponds to the memory-based system, in which the hippocampus and amygdala are mainly involved (Koelsch et al., 2015). The activity of the amygdala in emotion is particularly important and has been studied for a long time. Yakovlev’s circuit is one of the well known limbic systems and the amygdala is involved in the circuit (Yakovlev, 1948). Papez’s circuit is another well known limbic circuit, which includes the hippocampus (Papez, 1937). Although these are independent as circuits, they interact and are closely related to each other through the cortex, basal ganglia, and diencephalon (Mendoza and Foundas, 2007). In this paper, we consider the memory-based system as another subsystem, which we call the second layer in the proposed model. This subsystem gives flexibility to the innate-appraisal system, i.e., the first layer, in order to adapt the whole system to the environment.

Ultimately, emotions cannot be viewed locally and need to be thought of as a network. Therefore, the abovementioned subsystems should be connected as a network to generate emotions. Furthermore, an important aspect of the model is its ability to explain various phenomena known in the literature. Among others, emotion differentiation is an important phenomenon, because it is a key to implementing emotions for robots, as mentioned earlier. Bridges claimed that excitement, which is the origin of emotion, can be divided into several emotional categories based on observations of infants (Bridges, 1932). More recently, it has been reported that emotions such as pleasure, interest, surprise, sadness, anger, and fear were recognized one year after birth; pride, shame, guilt, etc. emerged from approximately two and a half years; and most, though not all, emotions appear by the age of three (Lewis and Ramsay, 1995). Our proposed model of emotion is discussed in the next subsection. Then, the model is implemented using deep neural networks and tested to determine whether it develops the emotional categories.

4.2 Proposed model of emotion

The proposed emotion model is illustrated in Fig. 2. The emotion model is divided into three layers: the first layer that reacts bodily to stimuli very fast, the second layer that accesses memories such that stimuli can be evaluated through experiences, and the third layer that makes future predictions and actions. These are derived from the abovementioned implications.

Figure 2: Schematic diagram of our proposed three-layer model of emotion.

The first layer reacts to stimuli very quickly using the body, which is called external appraisal. Moreover, this part reflects the situation of the body itself, i.e., internal appraisal, regardless of external perception. This layer is the reason why emotions depend on the physical body. Because the reactions are preprogrammed innately, they usually contain errors, which cause overreactions to stimuli. To alleviate this problem, the second layer accesses memories such that stimuli can be evaluated through experiences. This second layer makes it possible to suppress unnecessary reactions and, at the same time, react quickly to important problems. Of course, this is a trade-off between processing cost and accuracy of response to stimuli. Hence, the output of the first layer, which is modulated by the second layer to be precise, can be considered as the perception of dimensionally reduced evaluated results of the external and internal worlds, i.e., internal representation. Therefore, the perception of the output of the first layer can be regarded as interoception.

In the third layer, the output of the first layer is used together with the input stimuli for causal inference and prediction, as shown in Fig. 2. Subsequent to the prediction, decision-making is carried out using the input stimuli and the results of the prediction. The most important part of the third layer is reinforcement learning, which is responsible for the learning of optimal decision-making. One of the most important aspects of reinforcement learning is the definition of a reward. In the model of emotion, the idea of “homeostasis,” which is a regulatory mechanism of the agent’s internal state, should be adopted. This is based on the drive reduction theory, which is the basic theory of motivation (Myers, 2010). It is interesting that homeostasis is closely related to the diencephalon, which is one of the emotion systems in the quartet theory of emotions (Koelsch et al., 2015). Hence, a reward is provided when the output of the first layer, i.e., interoception, remains constant. This target is not a fixed value; it is the average of the emotional state over a time window of a certain length. In other words, homeostasis is set not to keep the emotional state completely constant, but to discourage rapid changes. In our model, the average value that gradually changes in time is defined as “mood.” Thereafter, the neural patterns of the policy in the striatum, i.e., emotional states, are consciously recognized as emotional feelings. After the decision-making process, the prediction error is calculated, followed by updating of the model in the third layer. Experiences are stored in the hippocampus as episodic memories, with emotional evaluation in the second layer. It is worth noting that the learning process exists only in the second and third layers.
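The homeostatic reward and the notion of mood described above can be illustrated with a minimal sketch. It assumes that interoception is a (valence, arousal) pair, that mood is the mean over the most recent steps, and that the reward is the negative Euclidean distance between the current interoception and the mood. The class name `MoodTracker`, the window length, and the distance-based reward are illustrative assumptions and are not taken from the implementation in this paper.

```python
from collections import deque
import math


class MoodTracker:
    """Tracks mood as a sliding-window mean of interoception (valence, arousal)."""

    def __init__(self, window=10):
        # The window length is a hypothetical choice made for this sketch.
        self.history = deque(maxlen=window)

    def mood(self):
        """Return the mean of the stored values, or None if nothing is stored."""
        if not self.history:
            return None
        n = len(self.history)
        return (sum(v for v, _ in self.history) / n,
                sum(a for _, a in self.history) / n)

    def reward(self, interoception):
        """Reward is 0 when interoception equals the mood and negative otherwise.

        The mood is read before the new value is stored, so the reward
        measures how far the new value departs from the recent average.
        """
        current_mood = self.mood()
        self.history.append(interoception)
        if current_mood is None:
            return 0.0
        return -math.dist(interoception, current_mood)
```

Under these assumptions, a sudden jump in valence or arousal is penalized, whereas a slow drift moves the mood along with it and is penalized only weakly, which matches the stated aim of discouraging rapid changes rather than fixing the emotional state.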

Figure 3: Our proposed emotion model for implementation, which is a redrawn version of Fig. 2.

Figure 3 is a redrawn version of the proposed model in Fig. 2, in order to make it comprehensible for implementation. This figure directly highlights some important points of our proposal. First, the internal and external appraisals, i.e., embodiment, are the sources of emotions. It is fair to say that without the physical body, there should be no emotions. Second, prediction is indispensable in the proposed model. Third, another key point in our model is the decision-making part, which is relevant to the somatic marker hypothesis. These viewpoints remind us to note the close relationship between our proposed emotion model and the embodied predictive interoception coding (EPIC) model, which was proposed recently (Barrett et al., 2015). The idea of the EPIC model is based on predictive coding and active inference (Friston and Stephan, 2007). Although we developed our proposed model independently of the EPIC model, some important ideas are shared between the two models. The main difference between the EPIC model and the model proposed in this study is that we propose an actual implementation of the proposed model by combining several deep learning modules, which are described in the next section. In contrast, the EPIC model is a conceptual model and adheres strictly to predictive coding.

Another important aspect for the model is the design of artificial emotional systems, which Cañamero contends (Cañamero and Gaussier, 2005). She claimed that emotions must be grounded in an internal value system that is meaningful for the robot’s physical and social niche. The model should establish a link between emotions, motivation, behavior, perception, and various aspects of “cognition,” and the link must be rooted in the body of the agent. As already discussed, our proposed emotion model has the potential to fulfill these requirements.

5 Implementation

This section proposes an implementation of the emotion model described in the previous section. The proposed implementation consists of a combination of deep neural networks, such as RAM, LSTM, and DDPG, except for the second layer. The second layer is realized by a simple smoothing mechanism to make the learning system tractable. In the following, we will look at the implementation of each module in turn.

5.1 Appraisal module (1st layer)

As we discussed earlier, the first layer is responsible for generating interoception based on both internal and external appraisals. The problem here is the generation of a physical response to stimuli. The model is required to generate a “human-like” response in order to replicate human emotions. In this study, we attempt to generate suitable affect values, i.e., a pair of valence and arousal values, from input visual stimuli using a neural network instead of generating physical body reactions (Hieida and Nagai, 2017). To replicate human-like innate reactions, we utilize several databases such as the international affective picture system (IAPS) database (Lang et al., 1999a, b), the open affective standardized image set (OASIS) (Kurdi et al., 2017), the Nencki affective picture system (NAPS) (Marchewka et al., 2014), and the Geneva affective picture database (GAPED) (Dan-Glauser and Scherer, 2011), to train the network in order to generate two-dimensional valence and arousal values for a given visual stimulus. Therefore, this is a regression problem.
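To make the regression setting concrete, the following minimal sketch scores predicted (valence, arousal) pairs against ground-truth pairs using the mean squared error. The mean squared error is a common choice for regression and is assumed here for illustration; the loss actually used to train the network is not stated in this section, and the function name is hypothetical.

```python
def mean_squared_error(predictions, targets):
    """Mean squared error over a batch of (valence, arousal) pairs.

    The error is averaged over both the samples and the two affect
    dimensions. This loss is an assumption made for illustration.
    """
    if not predictions or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and equal in length")
    total = 0.0
    for (pred_v, pred_a), (true_v, true_a) in zip(predictions, targets):
        total += (pred_v - true_v) ** 2 + (pred_a - true_a) ** 2
    return total / (2 * len(predictions))
```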

To take the active interoceptive inference into consideration, we propose the use of the RAM (Mnih et al., 2014). This is because visual attention is a very important factor for estimating arousal and valence values, and the RAM makes it possible to learn the visual attention and affect values simultaneously, as shown in Fig. 4. Please refer to the appendix for details on the RAM. It should be noted that the RAM improved the performance of regression compared with the convolutional neural network (CNN), which is directly trained using pairs of images and ground truth.

Figure 4: Overview of the first layer implemented by the recurrent attention model: (a) block diagram of the first layer and image examples from IAPS (Lang et al., 1999a), and (b) network architecture of the RAM (Hieida and Nagai, 2017).

5.2 Emotional memory module (2nd layer)

Figure 5: Second layer implemented by the smoothing: (a) block diagram of the second layer, and (b) schematic example of the smoothing process. Because are always positive, the compensation term is positive in this example. This means that the input stimuli belonging to the category reinforce high affect values. The white circles for are moved to the black circles by the compensation term .

The second layer shown in Fig. 5 (a) can be regarded as an adaptation using data in the actual environment for the innate and fixed system, i.e., the first layer. The memory-based learning increases the accuracy of the prediction by using the past accumulated information experienced by the agent. Here, we formulate this as a problem of calculating the expected value (), where and represent the target value to be estimated and the stored data, respectively. This is a type of smoothing problem and the second layer is realized by a simple nonlinear smoothing technique, as shown in Fig. 5 (b).

More specifically, a time series including affect values and stimuli during a certain period of time is stored in the memory. The idea here is that the output of the RAM is modified by the compensation term in the second layer as follows:


where is an external appraisal at time , represents the output of the first layer for the input image at time , indicates the category of the input image, is an internal appraisal at time , which will be described later, and represents the output of the second layer (compensation term) for the image . , which modifies the first layer output , is updated using the stored data as follows:


where is the learning rate and is set to 0.1 in the later experiment. is a collection of time indices with the same image category and represents the number of images belonging to .

As shown in Fig. 5 (b), the smoothing process can capture temporal information. For example, when the next affect values for a particular image category increase frequently, the term compensates the affect value of the corresponding image input in the upward direction. However, the next affect values for vary and the sum of cancels out. With this smoothing process, we can expect the effect of lowering the load on the body, and it improves the prediction performance of the next layer.
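The smoothing behavior described above can be sketched as follows. Because the original equations are not reproduced here, the update rule in this sketch is an assumption: for each image category, the compensation term moves by the learning rate times the mean change from the current affect value to the next one over the stored samples of that category, and the external appraisal is the first-layer output plus the compensation term. The function names and the memory layout are hypothetical; only the learning rate of 0.1 is taken from the text.

```python
from collections import defaultdict

LEARNING_RATE = 0.1  # value stated in the text


def update_compensation(compensation, memory, learning_rate=LEARNING_RATE):
    """Return an updated per-category compensation term (assumed update rule).

    compensation: dict mapping category -> (valence, arousal) offset.
    memory: list of (category, affect_now, affect_next) tuples, where each
    affect is a (valence, arousal) pair. This layout is hypothetical.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for category, now, nxt in memory:
        sums[category][0] += nxt[0] - now[0]
        sums[category][1] += nxt[1] - now[1]
        counts[category] += 1
    updated = dict(compensation)
    for category, total in sums.items():
        old = updated.get(category, (0.0, 0.0))
        n = counts[category]
        updated[category] = (old[0] + learning_rate * total[0] / n,
                             old[1] + learning_rate * total[1] / n)
    return updated


def external_appraisal(first_layer_output, category, compensation):
    """Add the category's compensation term to the first-layer output."""
    offset = compensation.get(category, (0.0, 0.0))
    return (first_layer_output[0] + offset[0], first_layer_output[1] + offset[1])
```

In this sketch, a category whose following affect values rise consistently accumulates a positive offset, while a category whose following affect values rise and fall equally often keeps an offset near zero, which mirrors the cancellation described in the text.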

Long-term potentiation (LTP) is a well known mechanism for connecting memory and learning. The hippocampus and amygdala are closely related to the LTP mechanism. The smoothing mechanism in this layer is assumed to mimic LTP at the functional level. Moreover, the amygdala is involved in the classical conditioning based on LTP. Thus, the process of this layer may replicate the classical conditioning. Fig. 5 (b) also explains this mechanism in a simple way. The second layer learns that the image category is the trigger of high valence and arousal values.

The output of the RAM and the second layer represents external appraisal. According to our definition, interoception is a combination of external appraisal, i.e., the output of the RAM modulated by the second layer, and internal appraisal representing internal energy, as shown in Fig. 3. The internal energy increases or decreases according to the selected action. For example, moving the body forcefully consumes energy; consequently, the internal energy decreases (the internal appraisal increases). Because the internal appraisal depends on the definition of the agent to be assumed, we explain certain details on the implemented internal appraisal module in the next subsection.

5.3 Internal appraisal

In this study, the internal appraisal module is implemented in a rule-based manner. Essentially, the internal appraisal increases when the agent acts, because acting decreases the internal energy. When the agent shows sadness or closes its eyelids, the internal appraisal decreases. This is because we assume that showing sadness leads to getting milk, and closing its eyelids corresponds to sleeping, which restores physical strength. The internal appraisal implies a physical strength bias in general, and the interoception is expressed by applying the physical strength bias to the external appraisal, as shown in Fig. 3. These assumptions are made because of a mother–infant interaction scenario in our later experiment. The agent has four facial parts to move according to external stimuli. Each facial part can be continuously controlled by the agent at the cost of corresponding power consumption. Thus, the agent has to learn (in the third layer) suitable facial expressions according to the external and internal worlds. More precisely, the internal appraisal can be rewritten as the following formula:


where is an action cost at time , and represents facial parts. represents constant physical fatigue and is set to in the later experiments. As four facial parts are assumed, has a value in the range from 0 to 4. Eq. (4) denotes a basic curve of physical strength with a time constant ( in the later experiments). Eq. (5) represents the change in the parameter of the basic curve. It is natural that the parameter increases as the agent takes actions. When the agent closes his eyelids, the parameter is set such that the physical strength recovers. Likewise, when the agent shows a sad expression, the parameter is set to restore physical strength. As mentioned earlier, these settings are based on the assumption of mother–infant interaction. In this study, is set to for closing eyelids and 75 for showing sadness. If these two values were the same, the two types of actions would become meaningless. Therefore, they are set to different values, such that each action becomes meaningful in the reinforcement learning module. It should be noted that it is possible to design other rule-based internal appraisal modules according to the physical body of the agent and the scenario of the world in which the agent exists.

5.4 Decision-making module (3rd layer)

Figure 6: Overview of the third layer implemented by convolutional LSTM and DDPG: schematic of the third layer, network architecture, and infant agent used in the later experiments.

As shown in Fig. 6, the decision-making module is implemented using a convolutional LSTM and DDPG. In a previous study, we implemented this module using LSTM–DQN (Hieida et al., 2018a). The LSTM–DQN has the drawback that it cannot deal with continuous actions, which is why the convolutional LSTM–DDPG is employed in this study. Please refer to the appendix for details on the convolutional LSTM and DDPG. To train the network by reinforcement learning, we use combinations of an input image, such as the one in Fig. 4(a), and the interoception, i.e., the result of subtraction between the output of the RAM and the internal appraisal. Because reinforcement learning is employed, the implementation of the proposed emotion model also requires actions. The actions that can be taken vary depending on the body of the robot/agent, so we need to assume the robot/agent to be used. Without loss of generality, we assume the agent that is used in our later experiment. The agent has action commands for its own facial expressions in response to given visual stimuli and the interoception in the first layer, i.e., valence and arousal values. The convolutional LSTM is responsible for predicting the image and interoception values at the next time-step from the input image and current interoception values. The DDPG module generates an action command by taking the input image, the interoception values, and the results predicted by the convolutional LSTM as input. Figure 6 illustrates the overall processing of the decision-making module (third layer).

As discussed in the meta-level model, the idea of homeostasis is used for calculating the reward as follows:


Here, the reward value at each time-step is computed from the vector consisting of the valence and arousal values, i.e., the interoception values, at that time-step, and from the mood of the agent at that moment. The mood is calculated as the mean of two vectors: a vector consisting of intermediate values between the maximum and minimum interoception values, and the average of the interoception values over the past frames. The remaining parameters are the number of averaging frames and a constant value that translates the differential value into a reward value. Eq. (7) is intended to represent a mood, which is less likely to be provoked by a particular stimulus and is determined by the average of the last interoception values.
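The mood and reward computation described here can be illustrated with a minimal Python sketch. The exact form of Eqs. (6) and (7) is not reproduced in this text, so the sketch assumes that the reward is the negative Euclidean distance between the current interoception vector and the mood, scaled by a constant. The names `mood`, `reward`, `i_mid`, and `k`, as well as all numeric values, are hypothetical and chosen only for illustration.

```python
from collections import deque
import math

def mood(history, i_mid):
    """Mean of the midpoint vector and the average of the stored past frames."""
    n = len(history)
    avg = [sum(frame[d] for frame in history) / n for d in range(len(i_mid))]
    return [(m + a) / 2.0 for m, a in zip(i_mid, avg)]

def reward(interoception, mood_vec, k=1.0):
    """Assumed form: a larger deviation from the mood gives a lower reward."""
    return -k * math.dist(interoception, mood_vec)

# Hypothetical (valence, arousal) values with an assumed midpoint of 5.
i_mid = [5.0, 5.0]
history = deque([[6.0, 4.0], [4.0, 6.0], [5.0, 5.0]], maxlen=3)  # last frames
m = mood(history, i_mid)
r = reward([5.0, 5.0], m)
```

Because the mood is an average over several frames, a single unusual stimulus shifts it only slightly, whereas the reward reacts immediately to the current interoception value.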

5.5 Learning of the model

Because our proposed model consists of several learning modules, several patterns can be considered for the timing of their updates. This study adopts the simple approach of updating the LSTM and the second layer at fixed intervals within the DDPG update loop. The entire learning algorithm of the proposed model is shown in Algorithm 1. In the algorithm, we set two parameters empirically as and . Figure 7 shows the whole network architecture of the proposed model. One can see the detailed parameters, such as the number of input/output nodes, in the figure.

  Train the recurrent attention model (offline)
  Initialize the mood of the agent
  Initialize the second layer
  Randomly initialize the weights of the critic network and the actor network
  Initialize the target critic and target actor networks with the same weights
  Initialize the replay buffer
  Initialize a random process for action exploration
  Receive an initial input image
  Calculate interoception using Eq. (2)
  Predict the next image and interoception by the LSTM module
  for each time-step do
     Select an action according to the current policy and exploration noise
     Execute the action and observe the reward
     Receive an input image
     Calculate interoception using Eq. (2)
     Predict the next image and interoception by the LSTM module
     Store the transition in the replay buffer
     Sample a random minibatch of transitions from the replay buffer
     Update the critic by minimizing the loss
     Update the actor policy using the sampled policy gradient
     Update the target networks
     Store the loss of the LSTM
     Store the interoception value and image for the second layer and the mood
     if the time-step is divisible by the LSTM update interval then
        Update the LSTM module
     end if
     if the time-step is divisible by the mood and second-layer update interval then
        Update the mood of the agent
        Update the second layer
     end if
  end for
Algorithm 1 Deep emotion learning algorithm
Figure 7: Whole network architecture of the proposed deep emotion.
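The update timing in Algorithm 1 can be sketched in a few lines of Python. This is only an illustration of the scheduling logic, not the training code: the function name `update_schedule`, the module labels, and the interval values are hypothetical, and the interval values used in the experiments are not reproduced here.

```python
def update_schedule(total_steps, lstm_interval, slow_interval):
    """Return, for each step, the list of modules updated at that step."""
    schedule = []
    for t in range(1, total_steps + 1):
        updated = ["ddpg"]                     # critic, actor, and target networks
        if t % lstm_interval == 0:
            updated.append("lstm")             # prediction module
        if t % slow_interval == 0:
            updated += ["mood", "second_layer"]
        schedule.append(updated)
    return schedule

# Hypothetical intervals chosen only to make the pattern visible.
plan = update_schedule(total_steps=6, lstm_interval=2, slow_interval=3)
for step, modules in enumerate(plan, start=1):
    print(step, modules)
```

With these example intervals, the DDPG networks are updated at every step, the LSTM at every second step, and the mood and second layer at every third step.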

6 Experiments

We explain the experiments in this section. They are roughly divided into three parts. In the first experiment, we verify the performance of the RAM (first layer). Because the first layer is assumed to return innate responses to stimuli, it is qualitatively evaluated against our subjective sense and children’s tendencies.

In the second experiment, we combine the RAM (first layer) and the convolutional LSTM–DDPG (third layer), and observe the agent’s behavior and internal representation of the emotional state. Because the second layer is responsible for the adaptation of the system to the environment, we focus on the implementation of the first and third layers in this experiment. In the third experiment, we combine the first, second, and third layers to implement the whole emotion model and verify its behavior. Then, by comparing with the second experiment, the significance of the second layer is examined.

6.1 Experiments on RAM (1st layer)

In order to test the performance of the RAM, we conducted the following experiment.

6.1.1 Experimental setup

As explained in Section 5.1, the RAM was trained using a set of images from IAPS, OASIS, NAPS, and GAPED. We used 24,270 images in total (4,045 original images × 6 types of deformation, such as rotations, flipping, and affine transformations) for training, and 100 images for testing (randomly selected from IAPS). After the training, the RAM was evaluated using the test data. To qualitatively examine the property of the model, we also input single-color images to the RAM and observed the results. This evaluation is expected to provide insight into the color preference of the trained network. Moreover, we also input face images with a certain facial expression to the RAM. The Japanese female facial expression (JAFFE) database (Dailey et al., 2010) was used in this experiment. The JAFFE database contains 213 images of seven facial expressions (pleasure, sad, angry, fearful, surprised, disgusted, and neutral). We visualized outputs from the RAM, i.e., valence and arousal values, for these input images.

6.1.2 Results

Fig. 8 (a) represents the ground truth, i.e., values from the IAPS database, and the results output by the RAM. The mean absolute errors for 100 test images are 0.48 for arousal and 0.46 for valence. These errors are sufficiently small as compared with the variation of human evaluation (Lang et al., 1999a, b). Fig. 8 (b) represents the results of visual attention for two different test images. One can see that the system successfully paid attention to visually important locations and estimated reasonable arousal and valence values in both cases.
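For reference, the mean absolute error reported above is computed separately for each dimension as the average absolute difference between the predicted and ground-truth ratings. The following Python sketch shows the calculation; the function name and the rating values are hypothetical and are not taken from the IAPS data.

```python
def mean_absolute_error(predicted, target):
    """Average absolute difference between paired predictions and targets."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

# Hypothetical valence ratings for four test images.
pred_valence = [5.5, 3.0, 7.0, 6.0]
true_valence = [5.0, 3.5, 7.5, 6.0]
mae_valence = mean_absolute_error(pred_valence, true_valence)
print(mae_valence)
```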

Fig. 8 (c) shows the results of inputting single-color images; high values are observed around 45 degrees of hue, which corresponds to the color yellow. Additionally, high values are seen in the center, which corresponds to the color white. On the other hand, low values are observed around 270 and 100 degrees of hue, which correspond to purple and green, respectively. According to Yamawaki, for infants six months of age, warm colors, such as yellow, white, and pink, have high preference and cold colors, such as blue, green, and violet, have low preference (Yamawaki, 2010). This result implies that the output of the RAM shows reasonable reactions compared with an infant.

In the results for the facial image input, the output, i.e., valence and arousal values, tends to coincide with the category of the facial expression. The results are given in Fig. 8 (d). For instance, when facial images with a pleasure expression are input to the RAM, the output of the RAM tends to have a high valence value. In contrast, angry facial expressions tend to draw low valence and high arousal values. These results indicate that a response called emotional contagion (Hatfield et al., 1993; Barsade, 2002) is observed in the trained RAM network.

Figure 8: Results of the first layer: (a) comparison between the output of the RAM and ground truth, (b) examples of the locations attended to by the RAM (the red rectangle in each image represents the location of attention, and a part of the facial image was blurred to make it impossible to identify individuals), (c) visualization of the RAM’s output for single-color input images, and (d) heat map of arousal/valence frequency for facial images.

6.2 Experiments on decision-making module (3rd layer)

6.2.1 Experimental setup

This experiment intends to show the performance of the decision-making mechanism in the proposed emotion model. Therefore, we connected the RAM and the third layer, which is implemented by the convolutional LSTM–DDPG. The second layer is not included in this experiment, because the whole emotion model is used in the next experiment and the results are compared to examine the importance of the second layer. A virtual agent is used as the body, and the RAM and the convolutional LSTM–DDPG are implemented in it. We use a free software package called “MakeHuman” to model the three-dimensional agent, which can change its facial expressions by moving its eyelids, eyebrows, mouth, and corners of the mouth. Then, we designed a “facial expression” task based on the mother–infant interaction scenario. In this task, the interaction partner, which is also a computer agent (mother agent), recognizes the facial expression of the virtual agent (infant agent) as one of four categories, and expresses back a facial expression in the same category. The facial expression recognition of the infant agent by the mother agent is based on the following rules: 1) pleasure (when the corner of the mouth is raised), 2) anger (when the corner of the mouth falls, eyebrows are knitted, and eyes are more than half open), 3) sadness (when the corner of the mouth falls, eyebrows are knitted, and eyes are more than half closed), and 4) neutral (otherwise).
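The four recognition rules of the mother agent can be written as a small rule-based classifier. The Python sketch below follows the rules as stated in the text, but the encoding of the facial parts is an assumption made for illustration: `mouth_corner` is positive when raised and negative when lowered, `brows_knitted` is a Boolean, and `eye_openness` ranges from 0.0 (closed) to 1.0 (fully open).

```python
def recognize_expression(mouth_corner, brows_knitted, eye_openness):
    """Classify the infant agent's face into one of four categories.

    The argument encodings are hypothetical; only the rule structure
    is taken from the description of the mother agent.
    """
    if mouth_corner > 0:                       # rule 1: corner of the mouth raised
        return "pleasure"
    if mouth_corner < 0 and brows_knitted:
        if eye_openness > 0.5:                 # rule 2: eyes more than half open
            return "anger"
        if eye_openness < 0.5:                 # rule 3: eyes more than half closed
            return "sadness"
    return "neutral"                           # rule 4: otherwise

print(recognize_expression(0.4, False, 1.0))
print(recognize_expression(-0.3, True, 0.9))
print(recognize_expression(-0.3, True, 0.1))
```

In the task, the category returned by such a classifier determines which facial image the mother agent shows back to the infant agent.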

This experimental design is based on a known phenomenon called “mirroring,” in which the mother intuitively imitates the infant’s expression on a daily basis (Winnicott, 1960). This is said to be important for young infants to learn emotional adjustment and social response (Murray et al., 2016). Especially for smiling, interactive smile games between infants and their caregivers are known as an important milestone in infant social development and build the foundation for later forms of social competence (Kaye and Fogel, 1980). Ruvolo and colleagues revealed that there exists a strategy for the timing when the child smiles and the relationship with his/her mother (Ruvolo et al., 2015). Thus, the purpose of this experiment is to observe the behavior learned by the infant agent and the change in interoception, emotional state, and emotion due to learning of a facial expression strategy.

In this experiment, we have two different conditions: the “face-only” condition and the “face+natural” condition. These conditions are set to compare the ideal case of seeing only the face of the mother with the case where environmental factors exist. In the “face-only” condition, the infant agent always receives a facial image according to the infant agent’s expression (mirroring). The top row of Fig. 9 represents information on the facial images, which are selected from the JAFFE database (Dailey et al., 2010). There are two different facial images for each emotional category, and one of them is selected randomly and presented to the infant agent. Although the actual images used in this experiment cannot be shown, one can check the images by downloading the database; “JAFFE ID” corresponds to the filename of each image. In contrast, in the “face+natural” condition, the infant agent randomly receives one of the facial images in the top row of Fig. 9 or one of the IAPS images in the bottom row of Fig. 9 as a visual stimulus. The natural images from IAPS mimic environmental stimuli. It should be noted that the facial images are stimuli that the infant agent can manipulate, because they are selected according to the infant agent’s action. The IAPS images are, however, stimuli that cannot be manipulated, as they are randomly chosen. In other words, it is expected that the infant agent learns a policy to acquire the intended stimulus according to a given facial image, and learns countermeasures, e.g., closing its eyes, when an undesirable stimulus is presented from the IAPS images. In both cases, the image of the closed eyes portion is displayed as a black image when the agent closes his eyes.

We performed 100000 epochs of this training using the proposed emotion model and the abovementioned scenario. As learning progresses, we visualize the middle layer of the policy network using principal component analysis (PCA) in order to observe the state space constructed by the infant agent through the mother–infant interaction. If our emotions were correctly defined and properly implemented, then this state space could be divided into emotional categories by actions.

Figure 9: Images used in the experiment: (a) facial images selected according to the infant agent’s facial expression (JAFFE IDs are shown instead of actual images because of personality rights), and (b) natural images selected randomly.

6.2.2 Results

Figures 10 (a) and (b) show the learning curves of this experiment for the face-only and face+natural conditions, respectively. On the top row, the learning curves of the LSTM are shown. From these figures, one can see that the LSTM learns to predict the next stimuli and interoception values. The training loss rapidly decreases within 5000 epochs. Comparing the two conditions, the prediction error is smaller in the face-only condition, as expected. For the reward in the bottom row of Figs. 10 (a) and (b), similar properties can be seen: the reward rapidly increases within the first 5000 epochs. The reward does not, however, converge to a constant value. This fluctuation occurs because the reward is based on homeostasis, i.e., the difference between the current interoception value and the past averaged interoception values. If there is a sudden change such as recovery of strength, the reward tends to change suddenly. In spite of this fluctuation of the reward, it can be clearly seen that the face-only condition achieved the higher reward in total. This is because the prediction in the face-only condition works better than in the face+natural case. In other words, the infant agent can better control the external environment as the mother agent always shows the facial expression in response to the infant agent.

Now, we examine the change in internal representation, i.e., emotional state, behind this reinforcement learning. The results are shown in Fig. 11. Figures 11(a) and (b) show plots of the external appraisal and interoception values, respectively. The visualization of the middle layer of the policy network using PCA is shown in Fig. 11 (c). These results correspond to (a), (b), and (c) in Fig. 3. Each color represents a category of facial expression recognized by the mother agent. Specifically, green, yellow, blue, and red represent neutral, pleasure, sadness, and anger, respectively. The top rows of Figs. 11 (a)–(c) show the results of the face-only condition, and the bottom rows show the results of the face+natural condition. As mentioned previously, the face-only condition indicates that only stimuli that can be controlled by the infant agent are provided, whereas in the face+natural condition, half of the stimuli can be controlled by the infant agent and the other half cannot. From the results, one can see that the colors are mixed all over in Figs. 11 (a) and (b). This implies that the external appraisal and interoception do not explicitly provide emotion differentiation functionality. Moreover, it can be seen that the space does not expand in Figs. 11 (a) and (b) as the learning progresses. However, the state space expands and is divided for each color as learning progresses in Fig. 11 (c). We hypothesize that this is the basic mechanism of emotion differentiation, which is observed in the middle layer of the policy network. Because the interoception and external appraisal did not show differentiation, these results indicate the plausibility of the proposed emotion model.

We stop the learning process at certain epochs and run the infant agent using the learned model at each epoch to observe the behavior of the infant agent. From these observations, we found that the agent had the following behavioral changes (a demo video of the running agent using the learned models is available for download).

In the 20000-epoch model, the infant agent often opens his eyes. In the 40000-epoch model, he often closes his eyes. In the 60000-epoch model, he changes facial expressions in response to stimulation. In the 80000-epoch model, he closes his eyes at the times when the internal appraisal increases. Finally, after 100000 epochs, he shows various facial expressions and has succeeded in stabilizing emotions.

Essentially, the agent smiles very often and makes the other person smile. This behavior also seems to be altruistic, such that the agent is trying to make the partner smile. This behavior seems to be consistent with the findings in (Ruvolo et al., 2015). Actually, it is interesting that the infant agent is just smiling for the desired stimulus, that is, the agent learned a selective smile.

Figure 10: Learning curves of the LSTM and the DDPG: (a) face-only condition in the experiment of Section 6.2, (b) face+natural condition in the experiment of Section 6.2, and (c) face+natural condition in the experiment of Section 6.3.
Figure 11: Visualization of the internal representations in the experiment of Section 6.2 (first + third layers): (a) external appraisal during each period of epochs, (b) interoception values during each period of epochs, (c) PCA visualization of the middle layer of the policy network during each period of epochs. It should be noted that the top and bottom rows represent the results of the face-only and face+natural conditions, respectively.

6.3 Experiments on the whole system including the second layer

6.3.1 Experimental setup

In Section 6.2, we conducted an experiment using the first and third layers. This is because the integration of the first and third layers is the core part of the proposed emotion model. We are interested in the core mechanism of the emotion model and evaluated the model without using the second layer in the previous section. The whole system, including the second layer, is the focus of our interest in this section. Moreover, we can determine the importance of the second layer by comparing the results to the previous ones. We use exactly the same experimental protocol as in Section 6.2; however, only the face+natural condition is adopted as it is obvious that the face-only condition gives better performance in terms of prediction.

6.3.2 Results

The learning curve of the whole system is given in Fig. 10 (c). The upper graph represents the LSTM loss versus the number of epochs. This graph shows a similar tendency to the previous experiment, which means that the LSTM learns to predict the next image and interoception values. The lower graph shows the reward with respect to the number of epochs. This also shows the same tendency as the previous experiment.

It is interesting to compare the results between the previous and current experiments in terms of average errors in the LSTM and average reward that the agent obtained. For the LSTM, the averaged losses, which are represented by (face-only condition without second layer), (face+natural condition without second layer), and (face+natural condition with second layer) are expected to be in the order . In fact, the averaged losses are , , and ; and the order of these values is as expected. Exactly the same observations can be made with respect to the reward (larger is better in this case). The averaged reward values are , , and (). These results are obtained because the face-only condition is the easiest setting for the infant agent. Moreover, the second layer improves the prediction of the next situation, which leads to .

In order to show that the second layer actually works, the learned models (both with and without the second layer) were run for 3000 epochs and the interoception values were collected. Then, the mean absolute differences (MAD) of both models were compared. For the valence, the MAD of the previous experiment (face+natural condition without second layer) was . The MAD of the current experiment (face+natural condition with second layer) was . For the arousal, the MAD of the previous experiment and the current experiment were, respectively, and . The t-test was performed on the MAD of both models and revealed that the MAD was significantly smaller in the case with the second layer. This indicates that the second layer works as we expected, and it improves the learning performance.
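The comparison above can be illustrated with a short Python sketch. Two points are assumptions, because the text does not specify them: the absolute differences are taken between consecutive time-steps of one interoception dimension, and the test statistic is that of Welch's unequal-variance t-test. The function names and the example series are hypothetical, and the sketch stops at the t statistic without computing a p-value.

```python
from statistics import mean, variance
import math

def abs_differences(series):
    """Absolute change between consecutive values (assumed MAD definition)."""
    return [abs(b - a) for a, b in zip(series, series[1:])]

def welch_t(x, y):
    """t statistic of Welch's unequal-variance t-test."""
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    return (mean(x) - mean(y)) / math.sqrt(vx + vy)

# Hypothetical valence traces from runs without and with the second layer.
without_layer = [5.0, 7.0, 4.0, 8.0, 3.0]
with_layer = [5.0, 5.5, 5.0, 6.0, 5.5]

d_without = abs_differences(without_layer)
d_with = abs_differences(with_layer)
mad_without = mean(d_without)
mad_with = mean(d_with)
t_stat = welch_t(d_without, d_with)
print(mad_without, mad_with, t_stat)
```

A smaller MAD in the trace recorded with the second layer corresponds to smoother interoception values, which is the effect attributed to the second layer in the text.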

Here, we discuss in detail the representation inside the network to find the basis of emotions. Figure 12 shows plots of the external appraisal, interoception values, and visualization of the middle layer of the policy network using PCA. These results correspond to (a), (b), and (c) in Fig. 3. Each color represents a category of facial expression recognized by the mother agent, as mentioned in the previous section. From these figures, one can see that the representation in the policy network divides the emotional categories very clearly compared with the external appraisal and the interoception. Moreover, it is also clear that the policy network represents the categories far better than in Fig. 11, which does not include the second layer.

In this experiment, we also run the agent using the learned model. Figure 13 shows some typical facial expressions by the infant agent for each model at specific epochs. From the observations of the infant agent’s behavior, we found that the agent had the following behavioral changes (a demo video of the running agent using the learned models is available for download):
–20000 epochs: The agent often closes his eyes.
–40000 epochs: The agent often closes his eyes.
–60000 epochs: The agent often opens his eyes and he changes expressions by stimulation.
–80000 epochs: The agent closes his eyes at the times when the internal appraisal increases.
–100000 epochs: The agent shows various facial expressions (surprise, anger, etc.) and has succeeded in stabilizing affects.

Figure 12: Visualization of the internal representations in the experiment of Section 6.3 (the whole model): (a) external appraisal, (b) interoception, and (c) PCA visualization of the middle layer of the policy network during each period of epochs.
Figure 13: Examples of facial expressions by the infant agent using the learned model with the second layer. Please note that the facial input image on top right is blurred for personality rights.
Figure 14: Frequency ratio of facial expressions for each condition.

6.4 Discussion

In the first experiment, the RAM was evaluated. The results of this experiment show that the RAM has an ability to replicate the innate reactions of a human to specific stimuli. It is interesting that although the network does not learn the reactions directly, it can learn general human reactions. For example, when an image of a pleasure facial expression is input to the RAM, the arousal and valence values corresponding to pleasure are generated. Moreover, responses to color similar to those of infants are learned by the RAM. These facts indicate the existence of an innate and general response of humans to visual stimuli, and that the RAM can extract such visual features. In the implementation of emotional robots, emotions are usually designed manually by the robot designer; the above results may free the robot designer from this difficult design task.

In the second experiment, we evaluated the proposed emotion model. According to the PCA results with the face-only condition, pleasure occupies half of the space, the remaining half seems to consist of a mix of neutral and sadness, and anger can be seen to a lesser degree. In an environment that the agent can always control, anger need not be output; that is, the agent learned to deal with stimuli by pleasure or otherwise. However, according to the result of the face+natural condition, although pleasure is predominant, anger increases, and neutral and sadness are separated as compared with the face-only condition. This is due to the necessity of selecting actions by classifying stimuli in more detail because of uncontrollable stimuli. Therefore, it can be surmised that not only the controllable stimuli but also the uncontrollable stimuli create our human-like rich emotions. The uncontrollable stimuli also give a very important meaning to learning to predict the future; that is, if the world is simple enough to predict perfectly, then the learning does not mean anything.

In the third experiment, the whole emotion model including the second layer was evaluated. By having the second layer in the emotion model, the state space, i.e., the middle layer of the policy network, has the representations of basic emotions such as anger, sadness, pleasure, and neutral. More interestingly, these emotional categories are located as assumed in the dimensional model; that is, neutral is located at the center, and pleasure, sadness, and anger are located around it. Pleasure no longer dominates the PCA space, which seems to be divided relatively evenly. In particular, the frequency of anger increases as shown in Fig. 14. Because the second layer works as a smoothing function, the interoception values of temporally adjacent stimuli are made closer, and sudden changes are reduced. As a result, prediction in the LSTM improves, and categorization of the stimulus is promoted. It is thought that these effects result in relatively uniform and distinct differentiation of the boundary surface of emotional categories.

Now, let us consider the behavioral output of the infant agent with the whole emotion model. In the early stage of learning, the agent often closed his eyes, and in the second half of the early stage he often opened them. This is similar to the development of infants. In general, infants initially almost always have their eyes closed (sleeping), and the time with their eyes open increases gradually. This process may be mainly dependent on the developmental process of the physical bodies of young infants. However, in the course of action selection, infants may have a stage to learn that the best policy is to close the eyes at the beginning, and gradually shift toward the policy of keeping their eyes open. This is only a speculation, which should be verified in the future. Additionally, the 100000-epoch result in Fig. 13 shows that the infant agent looks surprised by the snake. In the PCA space of the middle layer of the policy network, i.e., the internal representation of emotional states, it is not clear whether the surprise category was generated, because the actions were classified with only four emotional categories. However, there is a possibility that a richer emotional space emerged as the internal representation of the proposed emotion model. This point still needs further analysis.

Here, the limitations of our proposed model are discussed. Because the IAPS used adult human subjects to label the arousal and valence values, there must be an issue with training the RAM on the IAPS in the first place. However, we think that the averaging process of the labeled values reduced the individuality of the data and that innate reactions were extracted. The results of the first experiment using the RAM imply that this is in fact true. As another direction of this research, we are currently preparing to use biosignals from a real human body instead of the IAPS database for training the RAM. Another issue to be addressed is the reward for reinforcement learning, which is currently based solely on the idea of “homeostasis.” The idea of intrinsic motivation that appeared as a series of counterarguments to drive reduction theory cannot be ignored (Kage, 1994). More complex tasks should be considered in the future, because the current facial expression task is too simple to examine the full functionality of the emotion model. We also consider using a real robot to examine more complex internal appraisals.

From the viewpoint of empathic communication, “other” should appear in Fig. 3. Moreover, self/other discrimination must be considered in the model for generating higher-level social emotions. Language is another important aspect of the emotion model (Lieberman et al., 2007). We are currently working on the “emotional symbol grounding problem” using the idea of language acquisition by robots (Nishihara et al., 2017). In addition, it is necessary to consider empathy. For example, Lim et al. proposed multimodal emotional intelligence (Lim and Okuno, 2015). Their model was inspired by the mirror neuron system, which is a mechanism underlying human cognition (Iacoboni, 2009). In considering empathy, the work on mirror neurons cannot be ignored.

7 Conclusions

In this study, a computational model of emotion, which consists of three layers, was proposed. As the first layer, we examined a method for generating valence and arousal values from given visual stimuli using the RAM. Some promising results were obtained, which verified that the first layer is plausible for generating human-like quick reactions to specific stimuli. Next, we examined a decision-making mechanism, which is the third layer, by employing a convolutional LSTM and DDPG. As a result, the agent learned a selective smile and emotion differentiation was observed. Finally, the whole model including the second layer was integrated and its performance was studied. The results obtained in this experiment show that the model with the second layer provided far better results than the model without it. For future work, we will evaluate the proposed model using more complex tasks. The implementation on a real physical robot is also left for future work.

Author Contributions

CH and TN conceived of the presented idea. CH developed the theory and implemented the system. CH and TN analyzed the results, and all authors discussed the results. CH wrote the manuscript with support from TH and TN.


This research was subsidized by JSPS Science Research Fund JP 16 J 04930, JST CREST (JPMJCR15E3), and Grant-in-Aid for Scientific Research on Innovative Areas (26118001).


  • Arnold (1960) Arnold, M. B. (1960). Emotion and Personality. Emotion and Personality (Cassell & Company)
  • Asada (2015) Asada, M. (2015). Development of artificial empathy. Neuroscience Research 90, 41–50
  • Barrett et al. (2015) Barrett, Feldman, L., and Simmons, W. K. (2015). Interoceptive predictions in the brain. Nature reviews. Neuroscience 16.7, 419–429
  • Barsade (2002) Barsade, S. G. (2002). The ripple effect: Emotional contagion and its influence on group behavior. Administrative Science Quarterly 47, 644–675. doi:10.2307/3094912
  • Breazeal (2002) Breazeal, C. (2002). Designing sociable robots. The MIT Press
  • Bridges (1932) Bridges, K. M. B. (1932). Emotional development in early infancy. Child development , 324–341
  • Cañamero and Gaussier (2005) Cañamero, L. and Gaussier, P. (2005). Emotion understanding: robots as tools and models. Emotional Development , 235–258
  • Cannon (1927) Cannon, W. B. (1927). The james-lange theory of emotions: A critical examination and an alternative theory 39, 106–124
  • Dailey et al. (2010) Dailey, M. N., Joyce, C., Lyons, M. J., Kamachi, M., Ishi, H., Gyoba, J., et al. (2010). Evidence and a computational explanation of cultural differences in facial expression recognition. Emotion 10 6, 874–93
  • Damashio et al. (1996) Damashio, A. R., Everitt, B. J., and Bishop, D. (1996). The somatic marker hypothesis and the possible functions of the prefrontal cortex [and discussion]. Philosophical Transactions of the Royal Society B, Biological Sciences 351, 1413–1420
  • Damasio (2003) Damasio, A. (2003). Looking for Spinoza: Joy, Sorrow, and the Feeling Brain. Harvest books (Harcourt)
  • Dan-Glauser and Scherer (2011) Dan-Glauser, E. S. and Scherer, K. R. (2011). The geneva affective picture database (gaped): a new 730-picture database focusing on valence and normative significance. Behavior research methods 43, 468
  • Dutton (1974) Dutton, D. G. (1974). Some evidence for heightened sexual attraction under conditions of high anxiety. Journal of Personality and Social Psychology 30, 510–517
  • Ekman and Wallace (1971) Ekman, P. and Wallace, F. V. (1971). Constants across cultures in the face and emotion. Journal of personality and social psychology 17, 124–129
  • Friston et al. (2009) Friston, K. J., Daunizeau, J., and Kiebel, S. J. (2009). Reinforcement learning or active inference? PLOS ONE 4, 1–13. doi:10.1371/journal.pone.0006421
  • Friston et al. (2010) Friston, K. J., Daunizeau, J., Kilner, J., and Kiebel, S. J. (2010). Action and behavior: a free-energy formulation. Biological Cybernetics 102, 227–260. doi:10.1007/s00422-010-0364-z
  • Friston and Stephan (2007) Friston, K. J. and Stephan, K. E. (2007). Free-energy and the brain. Synthese 159, 417–458. doi:10.1007/s11229-007-9237-y
  • Hatfield et al. (1993) Hatfield, E., Cacioppo, J. T., and Rapson, R. L. (1993). Emotional contagion. Current Directions in Psychological Science 2, 96–100. doi:10.1111/1467-8721.ep10770953
  • Hieida et al. (2018a) Hieida, C., Horii, T., and Nagai, T. (2018a). Decision-making in emotion model. In Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction. 127–128
  • Hieida et al. (2018b) Hieida, C., Horii, T., and Nagai, T. (2018b). Emotion differentiation based on decision-making in emotion model. In IEEE International Conference on Robot and Human Interactive Communication. to appear
  • Hieida and Nagai (2017) Hieida, C. and Nagai, T. (2017). A model of emotion for empathic communication. Companion of the 2017 ACM/IEEE International Conference on Human-Robot Interaction , 133–134
  • Iacoboni (2009) Iacoboni, M. (2009). Imitation, empathy, and mirror neurons. Annual review of psychology 60, 653–670
  • Izard. (1977) Izard., C. E. (1977). Human emotions (Springer US)
  • James (1884) James, W. (1884). What is an emotion ? Mind os-IX, 188–205. doi:10.1093/mind/os-IX.34.188
  • Kage (1994) Kage, M. (1994). A critical review of studies on intrinsic motivation. Japanese Journal of Educational Psychology 42, 345–359
  • Kaye and Fogel (1980) Kaye, K. and Fogel, A. (1980). The temporal structure of face-to-face communication between mothers and infants 16, 454–464
  • Koelsch et al. (2015) Koelsch, S., Jacobs, A. M., Menninghaus, W., Liebal, K., Klann-Delius, G., von Scheve, C., et al. (2015). The quartet theory of human emotions: An integrative and neurofunctional model. Physics of Life Reviews 13, 1–27
  • Kurdi et al. (2017) Kurdi, B., Lozano, S., and Banaji, M. R. (2017). Introducing the open affective standardized image set (oasis). Behavior research methods 49, 457–470
  • Lang et al. (1999a) Lang, P. J., Bradley, M. M., and Cuthbert, B. N. (1999a). International affective picture system (iaps): Technical manual and affective ratings. Gainesville, FL: The Center for Research in Psychophysiology, University of Florida
  • Lang et al. (1999b) Lang, P. J., Bradley, M. M., Cuthbert, B. N., et al. (1999b). International affective picture system (iaps): Instruction manual and affective ratings. The center for research in psychophysiology, University of Florida
  • Lazarus (1991) Lazarus, R. S. (1991). Emotion and Adaptation (Oxford University Press USA)
  • LeDoux (1986) LeDoux, J. E. (1986). Neurobiology of emotion (Cambridge University Press)
  • LeDoux (1989) LeDoux, J. E. (1989). Cognitive-emotional interactions in the brain. Cognition and Emotion 3, 267–289. doi:10.1080/02699938908412709
  • LeDoux (1998) LeDoux, J. E. (1998). The Emotional Brain: The Mysterious Underpinnings of Emotional Life. A Touchstone book (Simon & Schuster)
  • Ledoux (1998) Ledoux, J. E. (1998). The emotional brain: The mysterious underpinnings of emotional life. Simon & Schuster
  • Lewis (2000) Lewis, M. (2000). Self-conscious emotions. Emotions , 742
  • Lewis and Ramsay (1995) Lewis, M. and Ramsay, D. S. (1995). Developmental changes in infants’ responses to stress. Child Development 66(3), 657–670
  • Lieberman et al. (2007) Lieberman, M. D., Eisenberger, N. I., Crockett, M. J., Tom, S. M., Pfeifer, J. H., and Way, B. M. (2007). Putting feelings into words: affect labeling disrupts amygdala activity in response to affective stimuli. Psychological science 18 5, 421–8
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971
  • Lim and Okuno (2015) Lim, A. and Okuno, H. G. (2015). A recipe for empathy. International Journal of Social Robotics 7, 35–49
  • Marchewka et al. (2014) Marchewka, A., Żurawski, Ł., Jednoróg, K., and Grabowska, A. (2014). The nencki affective picture system (naps): Introduction to a novel, standardized, wide-range, high-quality, realistic picture database. Behavior research methods 46, 596–610
  • Masuyama and Loo (2015) Masuyama, N. and Loo, C. K. (2015). Robotic emotional model with personality factors based on pleasant-arousal scaling model. In Robot and Human Interactive Communication (RO-MAN), 2015 24th IEEE International Symposium on. IEEE , 19–24
  • Mendoza and Foundas (2007) Mendoza, J. and Foundas, A. (2007). Clinical Neuroanatomy: A Neurobehavioral Approach (Springer New York)
  • Miwa et al. (2001) Miwa, H., Umetsu, T., Takanishi, A., and Takanobu, H. (2001). Robot personality based on the equations of emotion defined in the 3d mental space. In Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation. vol. 3, 2602–2607. doi:10.1109/ROBOT.2001.933015
  • Mnih et al. (2014) Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual attention. In NIPS
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature 518, 529–533
  • Moerland et al. (2017) Moerland, T. M., Broekens, J., and Jonker, C. M. (2017). Emotion in reinforcement learning agents and robots: A survey. arXiv preprint arXiv:1705.05172
  • Murray et al. (2016) Murray, L., De Pascalis, L., Bozicevic, L., Hawkins, L., Sclafani, V., and Ferrari, P. F. (2016). The functional architecture of mother-infant communication, and the development of infant social expressiveness in the first two months. Scientific Reports 6:39019, 1–9
  • Myers (2010) Myers, D. (2010). Psychology (Worth publishers)
  • Nishihara et al. (2017) Nishihara, J., Nakamura, T., and Nagai, T. (2017). Online algorithm for robots to learn object concepts and language model. IEEE Transactions on Cognitive and Developmental Systems 9, 255–268. doi:10.1109/TCDS.2016.2552579
  • Ortony et al. (1988) Ortony, A., Clore, G., and Collins, A. (1988). The Cognitive Structure of Emotion, vol. 18
  • Papez (1937) Papez, J. (1937). A proposed mechanism of emotion. Arch Neurol Psychiatry 79, 217–224
  • Picard (1997) Picard, R. (1997). Affective computing. MIT Press. Cambridge
  • Plutchik (1980) Plutchik, R. (1980). Emotion: A Psychoevolutionary Synthesis (Harper and Row)
  • Plutchik (1982) Plutchik, R. (1982). A psychoevolutionary theory of emotions. Social Science Information , 529–553
  • Russell (1980) Russell, J. (1980). A circumplex model of affect 39, 1161–1178
  • Ruvolo et al. (2015) Ruvolo, P., Messinger, D., and Movellan, J. (2015). Infants time their smiles to make their moms smile. PLOS ONE 10, 1–10. doi:10.1371/journal.pone.0136492
  • Schachter and Singer (1962) Schachter, S. and Singer, J. (1962). Cognitive, social, and physiological determinants of emotional state. Psychological Review 69(5), 379–399
  • Schlosberg (1954) Schlosberg, H. (1954). Three dimensions of emotion. Psychological Review 61(2), 81–88
  • Seth and Friston (2016) Seth, A. K. and Friston, K. J. (2016). Active interoceptive inference and the emotional brain. Philosophical Transactions of the Royal Society of London B: Biological Sciences 371. doi:10.1098/rstb.2016.0007
  • Terasawa et al. (2013) Terasawa, Y., Fukushima, H., and Umeda, S. (2013). How does interoceptive awareness interact with the subjective experience of emotion? an fmri study. Human Brain Mapping 34, 598–612. doi:10.1002/hbm.21458
  • Winnicott (1960) Winnicott, D. (1960). The theory of the parent–infant relationship. International Journal of Psychoanalysis 41, 585–595
  • Woo et al. (2015) Woo, J., Botzheim, J., and Kubota, N. (2015). Verbal conversation system for a socially embedded robot partner using emotional model. In Robot and Human Interactive Communication (RO-MAN), 2015 24th IEEE International Symposium on. IEEE , 37–42
  • Xingjian et al. (2015) Xingjian, S., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., and Woo, W.-c. (2015). Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems. 802–810
  • Yakovlev (1948) Yakovlev, P. (1948). Motility, behavior and the brain. stereodynamic organization and neural co-ordinates of behavior. J. Nerv. Ment. Dis. 107, 313–335
  • Yamawaki (2010) Yamawaki, K. (2010). A book that understands all of color psychology (Natsumesha CO.,LTD.)


Appendix A Recurrent attention model (RAM)

The RAM is a recurrent neural network (RNN) with visual attention proposed by Mnih et al. (Mnih et al., 2014). In general, humans focus attention selectively on parts of the visual space instead of processing the whole scene at once. Human visual perception acquires information when and where it is needed and combines information from different fixations over time. This is how we build up an internal representation of the scene, which we then use for decision making. Based on this idea, the RAM was developed as a novel framework for attention-based, task-driven visual processing with neural networks.

As shown in Fig. 4 (b), images with multiple resolutions are acquired from the original image at the center point $l_{t-1}$. Then, the point and the multi-resolution images are input to the linear layer as $g_t$. $f_h$ is the core network and takes $h_{t-1}$, which is the previous internal representation, as an input. The action network $f_a$ and the location network $f_l$ take $h_t$ to calculate the valence/arousal values and the location of the next step, respectively.
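The multi-resolution acquisition around a center point can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names (`crop`, `downsample`, `glimpse`), the base patch width of 4, the three scales, zero padding at the borders, and average pooling for resizing are all assumptions chosen for illustration.

```python
from typing import List

Image = List[List[float]]


def crop(image: Image, cy: int, cx: int, size: int) -> Image:
    """Return a size x size patch around (cy, cx), zero-padded at the borders."""
    half = size // 2
    h, w = len(image), len(image[0])
    patch = []
    for y in range(cy - half, cy - half + size):
        row = []
        for x in range(cx - half, cx - half + size):
            inside = 0 <= y < h and 0 <= x < w
            row.append(image[y][x] if inside else 0.0)
        patch.append(row)
    return patch


def downsample(patch: Image, factor: int) -> Image:
    """Average-pool the patch by an integer factor (assumed resizing method)."""
    out_size = len(patch) // factor
    out = []
    for oy in range(out_size):
        row = []
        for ox in range(out_size):
            block = [patch[oy * factor + dy][ox * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def glimpse(image: Image, cy: int, cx: int,
            base: int = 4, scales: int = 3) -> List[Image]:
    """Extract `scales` patches of doubling width, each resized to base x base."""
    patches = []
    for k in range(scales):
        factor = 2 ** k
        patches.append(downsample(crop(image, cy, cx, base * factor), factor))
    return patches


# Toy 16 x 16 image whose pixel value encodes its position.
image = [[float(y * 16 + x) for x in range(16)] for y in range(16)]
patches = glimpse(image, cy=8, cx=8)
```

Each returned patch has the same number of pixels, so the finest patch covers a small region in detail while the coarsest one summarizes a wide region at low resolution.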

The parameters of the RAM are defined as $\theta = \{\theta_g, \theta_h, \theta_a\}$, and $\theta$ is optimized such that the total reward the agent can obtain when interacting with the environment is maximized. More specifically, the policy of the agent induces a distribution over possible interaction sequences $s_{1:T}$, and the reward is maximized under this distribution:

$$J(\theta) = \mathbb{E}_{p(s_{1:T};\theta)}\left[\sum_{t=1}^{T} r_t\right] = \mathbb{E}_{p(s_{1:T};\theta)}\left[R\right], \tag{A.1}$$

where $p(s_{1:T};\theta)$ depends on the policy. Although it is difficult to maximize $J$ exactly, we can apply techniques from reinforcement learning by viewing the problem as a partially observable Markov decision process. In this case, the gradient can be approximated as

$$\nabla_\theta J = \sum_{t=1}^{T} \mathbb{E}_{p(s_{1:T};\theta)}\left[\nabla_\theta \log \pi(u_t \mid s_{1:t};\theta)\, R\right] \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \nabla_\theta \log \pi(u_t^i \mid s_{1:t}^i;\theta)\, R^i, \tag{A.2}$$

where $s^i$ are interaction sequences obtained by running the current agent $\pi_\theta$ for $i = 1, \ldots, M$ episodes. This learning rule is also known as the REINFORCE rule. It involves running the agent with its current policy to obtain samples of interaction sequences $s_{1:T}$. Then, the parameters $\theta$ of the agent are adjusted such that the log-probability of the chosen actions that have led to high cumulative reward is increased, while that of actions having produced low reward is decreased. Eq. (A.2) requires us to compute $\nabla_\theta \log \pi(u_t^i \mid s_{1:t}^i;\theta)$; however, this is the gradient of the RNN that defines the agent evaluated at time step $t$ and can be computed by standard backpropagation.
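The REINFORCE rule described above can be illustrated on a toy problem. The following is a minimal sketch, not the RAM training code: it assumes a two-action softmax policy, one-step episodes, and hypothetical rewards of 0 and 1, and it estimates the gradient as the sample average of the log-probability gradient weighted by the obtained reward.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits: List[float], action: int) -> List[float]:
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def reinforce_gradient(logits: List[float], rewards: List[float],
                       episodes: int, rng: random.Random) -> List[float]:
    """Monte Carlo estimate: average of grad log pi(action) * reward."""
    grad = [0.0] * len(logits)
    for _ in range(episodes):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        ret = rewards[action]  # cumulative reward of this one-step episode
        g = grad_log_prob(logits, action)
        grad = [acc + gi * ret / episodes for acc, gi in zip(grad, g)]
    return grad


rng = random.Random(0)
theta = [0.0, 0.0]    # policy parameters (logits), initially uniform
rewards = [0.0, 1.0]  # hypothetical rewards: action 1 is the better one
for _ in range(200):
    grad = reinforce_gradient(theta, rewards, episodes=16, rng=rng)
    theta = [t + 0.5 * g for t, g in zip(theta, grad)]  # gradient ascent step
final_probs = softmax(theta)
```

Because only the rewarded action contributes to the estimate, each update raises the probability of that action, which is the behavior the rule is meant to produce. In practice a baseline is often subtracted from the reward to reduce the variance of the estimate; that refinement is omitted here.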

In our scenario, the RAM must output the arousal/valence values for the input image as the final action. For the training images, these values are known, and the policy that outputs the correct values associated with a training image at the end of an observation sequence can be directly optimized. This can be achieved by maximizing the conditional probability of the true values given the observations from the image, i.e., by maximizing $\log \pi(a_T^* \mid s_{1:T};\theta)$, where $a_T^*$ corresponds to the ground truth associated with the image from which the observations were obtained. The original RAM follows this approach for classification problems, where it optimizes the cross-entropy loss to train the action network, and the gradients are backpropagated through the core and glimpse networks. The location network is always trained with REINFORCE, which provides the parameter $\theta_l$.

Appendix B Convolutional long short-term memory (LSTM)

Convolutional LSTM, proposed by Xingjian et al. (Xingjian et al., 2015), combines a CNN, which captures the features of images, with an LSTM, which can handle long-term time-series information. Specifically, it is a network in which the matrix multiplications by the LSTM weights are replaced with convolutions, and each unit is composed of a memory cell $\mathcal{C}_t$, input gate $i_t$, forget gate $f_t$, and output gate $o_t$.
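As a sketch of the update equations, the formulation given by Xingjian et al. (2015), including its peephole terms on the memory cell, can be written as follows. Whether the network in this study uses exactly this variant is an assumption, and the symbols $\mathcal{X}_t$, $\mathcal{H}_t$, $\mathcal{C}_t$ and the weight subscripts follow that reference.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * \mathcal{X}_t + W_{hi} * \mathcal{H}_{t-1} + W_{ci} \circ \mathcal{C}_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * \mathcal{X}_t + W_{hf} * \mathcal{H}_{t-1} + W_{cf} \circ \mathcal{C}_{t-1} + b_f\right) \\
\mathcal{C}_t &= f_t \circ \mathcal{C}_{t-1} + i_t \circ \tanh\left(W_{xc} * \mathcal{X}_t + W_{hc} * \mathcal{H}_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * \mathcal{X}_t + W_{ho} * \mathcal{H}_{t-1} + W_{co} \circ \mathcal{C}_t + b_o\right) \\
\mathcal{H}_t &= o_t \circ \tanh\left(\mathcal{C}_t\right)
\end{aligned}
```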


where $\mathcal{X}_t$ are inputs, $\mathcal{H}_t$ are hidden states, the $W$ terms denote weight matrices, the $b$ terms denote bias vectors, $*$ denotes the convolution operator, and $\circ$ denotes the Hadamard product.

The memory cell is responsible for storing past states. The input gate adjusts the value added to the memory cell. Owing to this gate, important information held by the memory cell is prevented from being overwritten by recent but irrelevant information. The forget gate adjusts how much of the value of the memory cell is retained at the next time step. The output gate adjusts how much the value of the memory cell affects the next layer. This gate can prevent the network as a whole from being disturbed by short-term memory and prevent long-term memory from being interrupted.

In this study, we use two layers of convolutional LSTM; the filter is and the error is calculated by the mean squared error. The network is trained with adaptive moment estimation (Adam) ().

Appendix C Deep deterministic policy gradient (DDPG)

DDPG is a deep reinforcement learning method proposed by Lillicrap et al. (Lillicrap et al., 2015). The “Deep Q Network” (DQN) algorithm (Mnih et al., 2015) achieves human-level performance on many Atari video games using unprocessed pixels as input. Although DQN solves problems with high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces. Lillicrap et al. therefore presented a model-free, off-policy actor-critic algorithm (DDPG) using deep function approximators that can learn policies in high-dimensional, continuous action spaces. The DDPG algorithm is shown in Algorithm 2. Both networks are trained with Adam (actor network: , critic network: ). The exploration noise $\mathcal{N}$ is an Ornstein–Uhlenbeck process. is 200. The size of the replay buffer $R$ is 500; when new data comes in, the oldest data is discarded. We used batch normalization.
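The two components mentioned above, the Ornstein–Uhlenbeck exploration noise and a fixed-size buffer that discards old data, can be sketched as follows. This is an illustration under assumptions rather than the code used in this study: the class names are hypothetical, and the noise parameters `theta=0.15` and `sigma=0.2` are commonly used values, not settings reported here. Only the capacity of 500 is taken from the text.

```python
import random
from collections import deque


class OrnsteinUhlenbeckNoise:
    """Discretised OU process: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = mu  # start at the long-run mean

    def sample(self) -> float:
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transition is dropped when it is full."""

    def __init__(self, capacity=500):
        self.data = deque(maxlen=capacity)

    def store(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng):
        return rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)


buffer = ReplayBuffer(capacity=500)
for i in range(600):
    # Placeholder transition: (state, action, reward, next_state).
    buffer.store((i, 0.0, 0.0, i + 1))

noise = OrnsteinUhlenbeckNoise()
noise_samples = [noise.sample() for _ in range(5)]
minibatch = buffer.sample(32, random.Random(1))
```

After 600 insertions the buffer holds only the 500 most recent transitions. Successive noise samples are correlated in time, which is the usual reason for choosing this process for exploration in continuous action spaces.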

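  Randomly initialize critic network $Q(s, a \mid \theta^Q)$ and actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$.
  Initialize target network $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
  Initialize replay buffer $R$
  for episode = 1, M do
     Initialize a random process $\mathcal{N}$ for action exploration
     Receive initial observation state $s_1$
     for t = 1, T do
        Select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
        Execute action $a_t$ and observe reward $r_t$ and observe new state $s_{t+1}$
        Store transition ($s_t, a_t, r_t, s_{t+1}$) in $R$
        Sample a random minibatch of $N$ transitions ($s_i, a_i, r_i, s_{i+1}$) from $R$
        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
        Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i (y_i - Q(s_i, a_i \mid \theta^Q))^2$
        Update the actor policy using the sampled policy gradient: $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)|_{s=s_i, a=\mu(s_i)} \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)|_{s_i}$
        Update the target networks: $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau) \theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau) \theta^{\mu'}$
     end for
  end for
Algorithm 2 DDPG algorithm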
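The critic and target-network steps of the algorithm can be sketched in a few lines. This is a minimal illustration under assumptions, not the implementation used in this study: states and actions are scalars, the networks are replaced by hand-written stand-in functions, the helper names are hypothetical, and the discount factor and mixing coefficient are arbitrary example values. Terminal states are not handled.

```python
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float, float]  # (state, action, reward, next_state)


def td_targets(batch: List[Transition], gamma: float,
               target_q: Callable[[float, float], float],
               target_actor: Callable[[float], float]) -> List[float]:
    """Bootstrapped targets: reward + gamma * target critic at the target actor's action."""
    return [r + gamma * target_q(s_next, target_actor(s_next))
            for (_, _, r, s_next) in batch]


def critic_loss(batch: List[Transition], targets: List[float],
                q: Callable[[float, float], float]) -> float:
    """Mean squared error between the targets and the critic's current estimates."""
    errors = [(y - q(s, a)) ** 2 for (s, a, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


def soft_update(target: List[float], source: List[float], tau: float) -> List[float]:
    """Move each target weight a fraction tau toward the corresponding learned weight."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(source, target)]


# Toy example with stand-ins for the target critic, target actor, and critic.
batch = [(0.0, 0.0, 1.0, 1.0)]
targets = td_targets(batch, gamma=0.5,
                     target_q=lambda s, a: s + a,
                     target_actor=lambda s: 2.0 * s)
loss = critic_loss(batch, targets, q=lambda s, a: 0.0)
new_target_weights = soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1)
```

Computing the targets from slowly updated copies of the networks, rather than from the networks being trained, is what keeps the regression target for the critic from shifting abruptly between updates.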