Fast Adaptation with Meta-Reinforcement Learning for Trust Modelling in Human-Robot Interaction

08/12/2019 · by Yuan Gao, et al.

In socially assistive robotics, an important research area is the development of adaptation techniques and their effect on human-robot interaction. We present a meta-learning based policy gradient method for addressing the problem of adaptation in human-robot interaction and also investigate its role as a mechanism for trust modelling. By building an escape room scenario in mixed reality with a robot, we test our hypothesis that bi-directional trust can be influenced by different adaptation algorithms. We found that our proposed model increased the perceived trustworthiness of the robot and influenced the dynamics of gaining the human's trust. Additionally, participants reported that the robot perceived them as more trustworthy during the interactions with the meta-learning based adaptation than with the previously studied statistical adaptation model.




I Introduction

In order to navigate in natural, dynamic environments populated with humans, robots need to learn how to act, collaborate and adapt to different situations. For example, an assistive robot that supports the elderly needs to understand the person's needs, perform basic object manipulation tasks, provide emotional support when needed, and more. Like humans, robots need the ability to adapt their behaviour and learn new behaviours through interaction with humans and other robots [rahwan2019machine]. These abilities are integral to achieving smooth human-robot interaction (HRI) for socially assistive robots.

Despite an extensive body of work on application-driven adaptation in HRI, fast adaptation that is grounded in realistic perception remains a challenge [sheridan2016human]. Recent developments have explored different aspects of human-robot interaction, including physical human-robot interaction [ghadirzadeh2016sensorimotor], automatic reasoning [clark2018deep] and affective human-robot interaction [yuan2018when]. All these methods are successful in their own fields, but view the social HRI process from a narrow perspective. They also lack the flexibility to be extended with more complex, learning-based perceptual models. Such models, however, require more training data, and collecting data from HRI experiments is rather complicated.

In our work, we aim at developing a general approach for fast adaptation in HRI using a neural network-based policy gradient method. Similarly to earlier work [leite2014empathic], we model the interactions as adversarial multi-armed bandit (MAB) problems [auer2002finite]. We address the sample inefficiency of the policy gradient method using a meta-learning algorithm called model-agnostic meta-learning (MAML) [finn2017model]. In Section III, we provide a formal description of our model.
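To make the MAB formulation concrete, the following sketch shows softmax policy-gradient (REINFORCE) updates on a four-armed bandit. This is an illustrative toy in plain Python, not the paper's TRPO-based implementation; all names and hyperparameters are our own assumptions.

```python
import math
import random

def softmax(prefs):
    """Convert action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(prefs, action, reward, baseline, lr=0.1):
    """One REINFORCE update on softmax preferences for the chosen action.

    The gradient of log pi(action) w.r.t. preference p_i is
    (1[i == action] - pi_i)."""
    pi = softmax(prefs)
    advantage = reward - baseline
    return [p + lr * advantage * ((1.0 if i == action else 0.0) - pi[i])
            for i, p in enumerate(prefs)]

# Toy run: action 2 always pays off, the others never do.
random.seed(0)
prefs = [0.0, 0.0, 0.0, 0.0]
baseline = 0.0
for _ in range(500):
    pi = softmax(prefs)
    a = random.choices(range(4), weights=pi)[0]
    r = 1.0 if a == 2 else 0.0
    baseline += 0.05 * (r - baseline)   # running-average reward baseline
    prefs = reinforce_step(prefs, a, r, baseline)
```

After a few hundred feedback samples the policy concentrates on the preferred action, which is the behaviour the bandit formulation above relies on.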

One of the fundamental purposes of meta-learning is to give an agent the ability to draw conclusions about similar but unknown problems based on prior knowledge. This is very similar to how trust shapes people's behaviour when they need to interact with unknown humans for the first time. Psychological studies have shown that trust is a result of dynamic interactions [mayer1995integrative]: successful interactions lead to feelings of security, trust and optimism, whilst failed interactions bring feelings of insecurity or mistrust. We suggest that the pre-training phase of meta-learning can be interpreted as gaining trust towards the agents that the robot is going to interact with.

Figure 1: This figure shows one of the interaction processes from the third-person point of view. The participant wears a Mixed Reality Headset, through which she can observe virtual objects (keys, blue table and orange walls of the escape room maze) augmented into the real world.

In order to examine how our proposed model influences perceived bi-directional trust, we evaluate it in a human-robot interaction scenario in which a participant has to collaborate with a robot to escape a room created in mixed reality (MR). From speech and the human's 3D position in space, the robot learns to adapt to the needs of the participants. Our scenario is discussed in more detail in Section IV. We used MR to create a dynamic environment that reacts to the person's and the robot's behaviour, since MR offers additional benefits in flexibility and speed of prototyping. Moreover, our previous studies [sibirtseva2018comparison] demonstrated that MR does not affect task performance in human-robot interaction compared to other traditional media, even taking into account the novelty effect of the new technology.

We investigate the effect of our adaptation model on perceived bi-directional trust. Our main hypothesis is that, due to faster adaptation, our model increases the participant's perception of the robot's trustworthiness and of how much, in their opinion, the robot trusts them. We present the results of the human study in Section V and discuss future work in Section VI.

The three main contributions of this paper are:

  1. We propose a policy gradient method based on meta-learning, which can be pre-trained in simulated auxiliary environments and adapt fast in a real-world HRI scenario.

  2. We propose that the meta-learning process can be viewed as a part of trust modelling, based on the psychological and sociological trust formalization in human-human interaction.

  3. We performed a human study to investigate the effect of our model on subjective measures and found that it increased the perceived bi-directional trust.

II Related Work

In our work, to achieve fast adaptation, we utilize human feedback within the framework of reinforcement learning (RL). RL has been used since the early days of research in social HRI. One of the early works was conducted by Bozinovski et al. [bozinovski1996emotion], who considered the concept of emotion in their learning and behavioural scheme. Later, several researchers in HRI investigated the effect of RL algorithms such as Exp3 [leite2014empathic, yuan2018when, ahmad2018emotion] or Q-learning [mataric1997learning, tsiakas2018task]. With the development of deep learning [lecun2015deep], several methods were proposed to understand different modalities in HRI, for example, ResNet [he2016deep] for image processing and Transformer-based [vaswani2017attention] solutions for text processing. Naturally, the same trend can be observed for the problem of adaptation in HRI. One of the pioneering works was conducted by Qureshi et al. [qureshi2016robot], where a Deep Q-Network [mnih2015human] was used to learn a mapping from visual input to one of several predefined actions for greeting people.

However, one drawback of deep RL methods is that the algorithms need a lot of training data to converge [mnih2015human]. This limits the applicability of deep learning methods in HRI, even for the most basic adaptation problems. Thus, the dilemma is clear: on the one hand, if statistical methods are used, the robot may not be able to capture the details of the interactive process [yuan2018when]. On the other hand, if deep learning methods are used, a lot of training data is needed to optimize the algorithms [mnih2015human]. One potential solution to this dilemma is meta-learning: using meta-learning for pre-training may address the problem of limited training data [duan2016rl, finn2017model]. Examples of this have been shown for image recognition [zoph2018learning] and imitation learning [duan2017one]. However, to the best of our knowledge, there is no research on applying meta-learning in HRI.

Trust as a social phenomenon has been widely studied in psychology, sociology, and economics, however, the definition of trust is not commonly agreed upon across the disciplines. It was shown in numerous studies that trust is essential for successful human-human interactions [mcallister1995]. In regards to relationships with robots, trust plays a major role in human’s willingness to accept information from a robot [hancock2011meta] and to cooperate [freedy2007measurement]. Modelling trust can help us to build more intelligent socially assistive robots.

According to Marsh’s formalization, trust consists of three components, namely, basic, general and situational trust of an agent in another agent [marsh1994formalising]. The basic trust is the value derived and updated from the previous experience, and it helps the agent to make decisions about the unknown agents in future situations. The general trust is agent-specific, representing a bias towards another particular agent, while the situational trust is dependent on external conditions. In our case, we focus on modelling the basic trust, which can be viewed as a general disposition of the robot to be more trustworthy towards a human during an interaction. As a consequence, this can increase the speed of adaptation in a particular task. From the sociological standpoint, the convergence of the meta-reinforcement learning process can be considered as a part of trust modelling.

II-A Trust evaluation in HRI

There is no commonly used procedure to measure perceived trust in HRI. As stated in Hancock's overview of the field, among the most influential factors on perceived trust are a robot's performance, reliability and understandability [hancock2011meta]. Another study showed that human-related subjective measures, such as personality traits and level of expertise, also have an effect on how people perceive the robot's trustworthiness [salem2015would]. Moreover, in the human-computer interaction literature, cooperation is defined as a “behavioural outcome of trust” [wilson2006all]. We combine all the previously mentioned metrics to measure perceived trust during our human study. In the majority of works, trust was measured after the interaction (e.g. [schaefer2013perception] and [aroyo2018trust]). However, trust is dynamic in nature and can be influenced by many different factors throughout an interaction [BLOMQVIST1997271]. To capture these changing dynamics, it is advised to measure trust multiple times during an interaction [mcallister1995, schaefer2016measuring].

Another thing to consider in trust evaluation is the HRI scenario. A common scenario to evaluate trust involves economic games, where a participant is asked to gamble for a monetary gain [lee2013computationally]. Some works, on the other hand, present a more general task, where a robot also asks participants to perform unusual requests [salem2015would]. In both cases, only one-directional trust towards the robot is measured. However, to our knowledge, there is no study that measures the bi-directional perceived trust in HRI. In this work, we develop an escape room scenario that allows variety in robot’s and human’s behaviours and is suitable for testing bi-directional perceived trust. The escape room scenario has been found to be helpful for investigating human’s behaviour in diverse collaborative tasks [pan2017collaboration]. Recently, a similar scenario has been used to study the effects of robot’s failure in human-robot collaboration [van2019take]. In an escape room scenario, players are locked in a room and they need to solve a series of puzzles in order to escape it under a time constraint. The main advantage of this scenario is its flexibility, which means that specific puzzles can be easily added to study different behaviours.

III Interaction Modelling

In this section, we describe our approach to model the human-robot interaction process. In Section III-A, we first introduce the general RL problem mathematically. Then in Section III-B, we describe the details of our model implementation.

III-A Preliminaries

We consider a general interaction process, modelled using $s_t$ and $a_t$ as the state and action of the robot agent at time $t$ during the interaction. The interaction can be viewed as maximization of the expected cumulative reward over trajectories $\tau$:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[R(\tau)\right], \qquad R(\tau) = \sum_{t=1}^{T} r(s_t, a_t),$$

where $T$ is the final time step of the interaction and $R(\tau)$ is the cumulative reward over $\tau$. The expectation is under the distribution

$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),$$

where $\pi_\theta$ is the policy we would like to train and $p(s_{t+1} \mid s_t, a_t)$ is the forward model determined by the interaction dynamics. We now define a meta-learning technique consisting of a pre-training method $M_{pre}$ and a refinement method $M_{ref}$. Here, $M_{pre}$ takes the policy $\pi_\theta$ and auxiliary environments $E_{aux}$ and returns a meta-policy

$$\pi_{\theta^*} = M_{pre}(\pi_\theta, E_{aux}).$$

Intuitively, the meta-policy is a policy that has learned prior knowledge about the tasks it is going to solve. For a more detailed explanation, interested readers can refer to the survey paper [vanschoren2018meta].

We then consider a task-specific environment $E_{task}$ for policy refinement. The desired final outcome of the system is the set $D$ of trajectories sampled from all possible trajectories generated under the optimized policy $\pi_{\theta'}$ obtained using $M_{ref}$. Mathematically, $\pi_{\theta'}$ is defined as $\pi_{\theta'} = M_{ref}(\pi_{\theta^*}, E_{task})$, where $D = \{\tau_i \sim p_{\theta'}(\tau)\}$.
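The pre-training/refinement split described above can be sketched in code. The example below uses a first-order meta-update (in the style of Reptile) rather than the exact MAML objective, and represents the policy as a softmax preference vector over bandit arms; the function names, task distribution, and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def inner_update(theta, task, steps=20, lr=0.1):
    """Adapt a preference vector theta to one bandit task by gradient
    ascent on the expected reward (exact gradient, for brevity)."""
    for _ in range(steps):
        m = max(theta)
        exps = [math.exp(t - m) for t in theta]
        z = sum(exps)
        pi = [e / z for e in exps]
        mean_r = sum(p * r for p, r in zip(pi, task))
        # d E[r] / d theta_i = pi_i * (r_i - E[r])
        theta = [t + lr * p * (r - mean_r)
                 for t, p, r in zip(theta, pi, task)]
    return theta

def pretrain(theta, aux_tasks, meta_lr=0.5):
    """Pre-training step: a first-order meta-update (Reptile-style),
    standing in for the paper's MAML-based M_pre."""
    for task in aux_tasks:
        adapted = inner_update(theta, task)
        theta = [t + meta_lr * (a - t) for t, a in zip(theta, adapted)]
    return theta

random.seed(1)
# Auxiliary tasks: 4-armed bandits whose best arm varies across tasks.
aux = [[random.random() for _ in range(4)] for _ in range(30)]
meta_theta = pretrain([0.0] * 4, aux)
```

Refinement on the task-specific environment then simply re-runs `inner_update` from `meta_theta` instead of from a random initialization.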

III-B Proposed model

In order to model the interaction in our scenario, we loosely follow the assumption that a human-like robot should have the tripartite mental activities, namely conation, cognition and affection. The assumption is inspired by Hilgard's tripartite classification of mental activities of human personality and intelligence in modern behavioural psychology [hilgard1980trilogy]. Here we define the interactive space as the set of these three mental activities. For our particular escape room scenario, each instance of a mental functionality is then modelled and implemented as an environment of the adversarial MAB problem. All three instances operate independently throughout the interaction process. We define a separate meta-learning process for each functionality, and each meta-learning strategy contains a pre-training and a refinement method.

During the interaction, the robot needs to optimize all of its policies to learn the most preferred action in each MAB environment. In order to keep the generality of the concept of trust, we also assume the observational states to be fixed for each category.

Our method involves two training processes. First, we model the human feedback for each action of the MAB as a Gaussian distribution $\mathcal{N}(\mu_i, \sigma_i)$ for all the auxiliary environments. This modelling is based on the fact that emotion-recognition signals normally follow Gaussian distributions [kragel2016decoding]. Simultaneously, we use the auxiliary environments $E_{aux}$ to train an initially random policy $\pi_\theta$ in order to obtain a meta-policy $\pi_{\theta^*}$. The meta-policy learns the inner structure of the problem, which makes the adaptation during the interactive session much faster and more data-efficient. Finally, we conduct human experiments and study different subjective measures along with the interaction. Mathematically, the training steps can be summarized as

$$\pi_{\theta^*} = M_{pre}(\pi_\theta, E_{aux}), \qquad \pi_{\theta'} = M_{ref}(\pi_{\theta^*}, E_{real}),$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation associated with the distribution of action $i$, $E_{aux}$ are the auxiliary environments for pre-training and $E_{real}$ are the real interactive environments. We use MAML [finn2017model] as the meta-learning algorithm and trust region policy optimization (TRPO) [schulman2015trust] as the optimization algorithm for all policies. Interested readers can refer to Appendix VII-A to see the differences in convergence between the meta-policy and a randomly initialized policy for MAB problems with different numbers of actions using a neural network based policy.
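A minimal sketch of an auxiliary environment with Gaussian per-action feedback, as assumed above, could look like the following. The class name and the particular means and standard deviations are illustrative, not from the paper.

```python
import random

class GaussianFeedbackBandit:
    """Auxiliary MAB environment: each action's feedback is drawn from
    a Gaussian, following the assumption that emotion-derived reward
    signals are approximately normal."""

    def __init__(self, means, sigmas, rng=None):
        assert len(means) == len(sigmas)
        self.means = means
        self.sigmas = sigmas
        self.rng = rng or random.Random()

    @property
    def n_actions(self):
        return len(self.means)

    def step(self, action):
        """Return a noisy scalar reward for the chosen action."""
        return self.rng.gauss(self.means[action], self.sigmas[action])

# Hypothetical 4-armed environment in which action 1 is preferred.
rng = random.Random(42)
env = GaussianFeedbackBandit([0.1, 0.9, 0.2, 0.3], [0.05] * 4, rng)
rewards = [env.step(1) for _ in range(1000)]
```

During pre-training, many such environments with different mean vectors would be sampled to form $E_{aux}$.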

Implementation details:

We used a PyTorch-based MAML implementation and built a customized MAB environment using OpenAI Gym [brockman2016openai]. For each instance of mental functionality, an auxiliary environment is used for pre-training. For all instances, the same policy network structure is used, namely one hidden layer with 100 neurons. The MR engine Unity is used to animate the changes in the HoloLens environment. The communication between the HoloLens and the algorithm in the policy refinement step is carried out via a UDP-based communication method.
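The paper does not specify the UDP message format; the loopback sketch below merely illustrates how a policy-refinement loop might exchange JSON messages with the HoloLens client over UDP. The schema (`action`, `confidence` fields) is a made-up assumption.

```python
import json
import socket

def send_action(sock, addr, action_id, confidence):
    """Send the chosen action and its probability to the MR client.
    The JSON schema here is a guess, not the paper's protocol."""
    msg = json.dumps({"action": action_id, "confidence": confidence})
    sock.sendto(msg.encode("utf-8"), addr)

def receive_event(sock, bufsize=1024):
    """Block until the MR client reports an event (e.g. the participant
    entering a trigger zone) and decode it."""
    data, _ = sock.recvfrom(bufsize)
    return json.loads(data.decode("utf-8"))

# Loopback demo: one socket plays both endpoints.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))          # OS-assigned free port
addr = sock.getsockname()
send_action(sock, addr, 2, 0.8)
echoed = receive_event(sock)
sock.close()
```

In the real setup, the HoloLens side would run the counterpart of `receive_event` inside Unity, while the Python side runs the adaptation algorithm.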

IV Methodology

IV-A Scenario

We built an escape room scenario to conduct a between-subjects study, in which participants interacted with a Pepper robot. The escape room was created in augmented reality, and participants were required to wear a Mixed Reality headset HoloLens to see the walls of the virtual maze, the triggers, the keys, and the exit door (see Fig. 2). The interaction consists of three parts: an instance of conation, an instance of affection, and an instance of cognition. Each instance is triggered by the participant's position in the virtual maze, recorded from the HoloLens. For each instance, the robot chooses one out of four actions, according to a probability distribution provided by the algorithm. Here, an action is implemented as a verbal question (e.g. “Did you come here to bring me something?”, see Table I). After the participant answers, the robot updates the probability associated with the previous question and gives feedback accordingly (e.g., for a low probability, “I do not believe it, but fine.”, see Table II). After completing all the interaction steps, the participant can escape the room and the run is over.
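The interaction step described above, sampling a question from the policy's action distribution, mapping the yes/no answer to a reward, and choosing a reply by confidence, might be sketched as follows. Only the 0.8 threshold appears in Table II; the lower cut-offs and all function names are illustrative guesses.

```python
import random

def confidence_feedback(prob):
    """Map the probability of the asked question's action to a verbal
    reply, in the spirit of Table II. Lower thresholds are guesses."""
    if prob >= 0.8:
        return "Awesome, I knew you would say so."
    if prob >= 0.5:
        return "I understand now."
    if prob >= 0.3:
        return "Is that really so? I am not sure."
    return "I do not believe it, but fine."

def interaction_step(pi, rng, ask, listen):
    """One instance: sample a question per the policy's distribution,
    convert the yes/no answer into a reward, and reply by confidence."""
    action = rng.choices(range(len(pi)), weights=pi)[0]
    answer = listen(ask(action))           # "yes" or "no"
    reward = 1.0 if answer == "yes" else 0.0
    return action, reward, confidence_feedback(pi[action])

# Demo with stubbed speech I/O: the participant always answers "yes".
rng = random.Random(3)
pi = [0.05, 0.85, 0.05, 0.05]
action, reward, reply = interaction_step(
    pi, rng, ask=lambda a: f"question {a}", listen=lambda q: "yes")
```

The returned reward would then feed the policy update described in Section III-B.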

Figure 2: This figure shows the setup of our study. During a run, the participant starts from the dotted-line circle and approaches the red button trigger. They then move to the solid-line circle to exit the room. During the run, the three instances of mental activities are each triggered once.

For the control group, a statistical MAB algorithm, Exp3 (C1), implemented as in the previous studies, was used. The experimental group interacted under our proposed model, a policy gradient based solution to the MAB problem combined with meta-learning (C2).

Mental activity Examples of implementation
“Do you want to escape the room?”
“Do you come here to stay with me?”
“Did you come here to bring me something?”
“Did you come here to look for your friends?”
“Be careful with the walls. You should avoid them.”
“I can help you to do whatever you want to do here.”
“Hey, be relaxed, no matter what you do, you have no fault in this game.”
“Are you worried? Don’t worry! I am here with you.”
“Here are the keys. Do you need the first key?”
“Here are the keys. Do you need the second key?”
“Here are the keys. Do you need the third key?”
“Here are the keys. Do you need the fourth key?”
Table I: Examples of questions the robot asks during an interaction based on the mental activity classification [hilgard1980trilogy].

IV-B Procedure

Upon arrival, participants received a brief description of the experiment. They were then asked to sign a consent form and fill in a pre-study questionnaire about their demographic background, prior experience with robots and technology, a personality assessment [gosling2003very], and negative attitudes towards robots [syrdal2009negative]. In order for the algorithms to converge, the participant's behaviour has to be consistent across several runs. Thus, the participants were asked to act according to the following rules during the interaction:

  • Your goal is to escape the room;

  • You need to get further information regarding the walls of the maze;

  • You need the second key;

  • You can answer only “yes” or “no”.

The first attempt at escaping the room is treated as a test run to familiarize the participant with the setup and is not taken into account by the adaptation algorithm. Each participant goes through four sessions, and each session consists of three full runs, during which the robot gradually adapts to the participant's requests. Overall, each participant goes through twelve iterations of algorithm adaptation. After each session, the participant fills in a short questionnaire evaluating perceived bi-directional trust. By the end of the fourth session, the participant fills in a questionnaire regarding the overall experience and the quality of interaction.

IV-C Hypotheses

By comparing the effects of the two conditions, we aim to test whether the overall perceived bi-directional trust is affected by different adaptation algorithms. We define bi-directional trust as how trustworthy the participant perceives the robot to be and how much, in their opinion, the robot trusts them in return. Moreover, we hypothesize that the dynamics of how bi-directional trust changes throughout the interaction sessions differ between the two conditions. More formally,

  • H1: Perceived trust of a participant towards the robot is higher in the C2.

  • H2: The dynamics of perceived trust of a participant towards the robot differs in two conditions.

  • H3: Perceived trust of a robot towards the participant is higher in C2.

  • H4: The dynamics of perceived trust of a robot towards the participant differs in two conditions.

Based on the simulation results (see Fig. 3), we assume that our proposed model (condition C2), with its faster adaptation and specific adaptation dynamics, will influence the participant's perception of bi-directional trust in a positive way. Additionally, we expect that the slower adaptation (condition C1) will cause fewer changes in the subjective measures across the four sessions, in comparison to the meta-learning based approach.

Confidence level Examples of robot’s replies
Low () “I do not believe it but fine.”
Medium-low () “Is that really so? I am not sure.”
Medium () “I understand now.”
High (0.8) “Awesome, I knew you would say so.”
Table II: Examples of the robot's replies to participants' answers based on the confidence level, i.e. the probability associated with the action.

IV-D Measures

IV-D1 Objective measures

In order to analyze the performance of the algorithms in our setup, we tested our method using simulated data. As described in Section III-B, we assume the user's feedback is a continuous signal following a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Based on this simulated feedback, we compare the adaptation speed between the Exp3 algorithm in the control group and the meta-policy in the experimental group. In this simulation, we choose the number of actions in the MAB setting to be four, similar to previous studies [yuan2018when, leite2014empathic].

IV-D2 Subjective measures

Bi-directional perceived trust is measured from the participant's point of view: how trustworthy they perceive the robot to be, and how much they think the robot trusts them in return. We evaluate bi-directional perceived trust following Salem et al.'s work on trust evaluation in HRI [salem2015would]. Single items were extracted and adjusted to suit our scenario.

Subjective measures were collected in the form of a questionnaire, given to the participants after each session during the interaction. The participants were asked to evaluate their answers on a 5-point Likert scale (1 = “strongly disagree”, 5 = “strongly agree”).

We selected single modified items from [salem2015would]: “I perceive the robot as trustworthy” and “The robot perceives me as trustworthy”. A single modified item was added from the “Propensity to Trust survey” [evans2008survey]: “The robot anticipates my needs”. We further examined the participant's perception of the robot's trust towards them with the counteractive items “The robot believed my answers” and “The robot questioned my answers”.

IV-E Participants

A total of 24 subjects (11 female, 13 male), with ages ranging between 22 and 48 (, ), were recruited for this experiment. On a Likert scale from 1 to 5 (with 1 representing very little and 5 - very much), participants were found to have moderate experience interacting with robots (, ), negligible familiarity with Virtual or Augmented reality (, ), and major skills regarding digital technologies (, ). Participants were randomly assigned to one of the two conditions, resulting in two groups of 12 subjects each.

V Results

V-A Objective Measures

Fig. 3 shows the comparison of the overall adaptation speed between the control group and the experimental group over all instances, using simulated feedback. The x-axis indicates the average probability of all the correct answers over all instances throughout the four sessions. The y-axis shows the number of iterations for which the algorithms have been optimized. The results show that our method has, on average, a higher adaptation speed.

Figure 3: This figure shows the simulated adaptation results of the escape room scenario for the two different algorithms after each interaction. The results of the Exp3 algorithm are shifted to the right in order to show the difference clearly.

V-B Subjective Measures

The bi-directional perceived trust was analyzed using a mixed-design repeated-measures one-way ANOVA at significance level . The two conditions were treated as the between-subjects factor, while session numbers corresponded to the within-subjects factor, with perceived trust towards the robot and perceived robot's trust towards the participant as the dependent variables. The overall measurements were compared based on the results of the tests of between-subjects effects. Neither of the two measures violated the sphericity assumption. The dynamics analysis was carried out on the basis of the statistical significance of the interaction effects.

V-B1 Perceived trust towards the robot

For condition C1 (see Fig. 4), the first session resulted in moderate scores (M = 3.03, SD = 1.24); the perceived trust towards the robot then dropped in the second session (M = 2.42, SD = .996). After that, it stabilized during the third (M = 2.83, SD = 1.267) and fourth (M = 2.67, SD = .888) sessions. In condition C2, participants initially evaluated their perceived trust towards the robot at a similarly moderate score (M = 2.92, SD = .793). In the consecutive sessions, we can observe a gradual increase of the scores: after the second session (M = 3.17, SD = .835), the third (M = 3.75, SD = .754), and finally the fourth (M = 3.92, SD = .515).

We found a statistically significant effect of condition on the participants' perception of the robot's trustworthiness, , such that the average score in C1 (M = 2.75, SE = .222) was lower than in C2 (M = 3.437, SE = .222). Thereby, H1 was supported: the participants rated the robot's trustworthiness higher under the meta-learning based adaptation. A statistically significant effect of the session number was also found, , regardless of condition.

There was a statistically significant effect of the session number and condition interaction, . This means that the dynamics of how the participants gain trust in the robot differ significantly between the two conditions, and H2 was supported.

Figure 4: The average value (5-point Likert scale) of the participant’s perceived trust towards a robot by session number and condition with 95% CI errors. The condition C1 is shifted to the right to show the difference clearly.
Figure 5: The average value (5-point Likert scale) of the robot’s perceived trust towards a participant from a robot by session number and condition with 95% CI errors. The condition C1 is shifted to the right to show the difference clearly.

V-B2 Perceived robot’s trust towards the participant

From Fig. 5, we can observe that in the beginning, the participants in condition C1 gave moderately low evaluations (M = 2.08, SD = .996) of how trustworthy the robot perceived them. In the second session, the average score decreased (M = 1.67, SD = .778) and then increased in the third session (M = 2.5, SD = 1.168). The scores in the final session decreased (M = 2.17, SD = .937). The results of C1 thus fluctuate between sessions, but show a positive trend overall. In comparison, the first session in condition C2 has a higher average score (M = 3.08, SD = 1.379), which increased after the second session (M = 3.58, SD = 1.24) and flattened out afterwards in both the third (M = 3.67, SD = 1.155) and fourth (M = 3.75, SD = .965) sessions.

A statistically significant effect of condition was found on the perceived robot's trust towards the participant, , such that C2 has a higher average score (M = 3.521, SE = .247) than C1 (M = 2.104, SE = .247). This denotes that the participants perceived the robot as trusting them more in C2, and H3 was supported.

Even though the overall perceived robot's trust towards the participant depends on the condition, the differences in the dynamics of how it changes in the two conditions are not statistically significant, , and H4 was rejected. Additionally, no statistically significant effect of session number was found, .

VI Discussion

Despite H3 being supported, with participants perceiving the robot as more trusting towards them in C2, the dynamics of the perceived robot's trust towards the participant were not significantly different between the groups. As a consequence, H4 was not supported. This can be explained by the meta-learning based adaptation algorithm converging on at least one of the actions during the first session. In other words, during the first session the robot might have gone from explicitly saying “I don't believe you” to a more neutral “I understand”. Thus, the results show a significant difference between the first sessions in the two conditions, but for the subsequent sessions the perceived robot's trust towards the participant fluctuates within the 95% confidence interval. We can hypothesize that we observe this due to too fast adaptation with the meta-learning pre-trained policy gradient algorithm and too slow adaptation with Exp3.

In contrast, the convergence rate of the algorithms did not have the same effect on how trustworthy the participants perceived the robot. The estimated marginal means in the second condition were significantly higher, meaning that the participants found the robot more trustworthy in the second group. Thus, we can conclude that H1 was supported. However, we also found a significant interaction effect, which can be interpreted as a difference in trends between the two conditions and supports H2. A noteworthy distinction between the dynamics of the robot's trust towards the participant and the trust towards the robot appears in the first session.

The discrepancies in how the two algorithms influenced the dynamics of perceived bi-directional trust can be explained by the explicit nature of how the robot expressed its trust towards the participants, whereas human trust is a complex social construct. Further investigation is required to examine this phenomenon.

Compared to [yuan2018when], we established an alternative method to model the interactive adaptive process and show that our model is able to learn a prior from modelled auxiliary environments. This helps the algorithm to deal with the real-time requirement in human-robot interaction.

We think this work could be extended in two directions: algorithmic and human-study. From the algorithmic perspective, we want to incorporate deep learning perceptual modules for processing multimodal human input to enrich the interaction process. From the human-study perspective, trust modelling can be expanded to include the implementation of general and situational trust, as well as reactions to trust violation, according to Marsh's formalization [marsh1994formalising]. Finally, the developed system can be tested further for how different approaches to trust modelling influence the participants' perception of trust, the robot's intelligence, and the quality of interaction.

We hypothesize that the proposed models can help us to achieve a richer socially interactive process for real-time HRI.

VII Conclusion and Future Work

In this work, we proposed a meta-learning based policy gradient method for addressing the problem of fast adaptation in HRI. Compared to the statistical model, it can be pre-trained in auxiliary environments and then adapt faster in the real HRI scenario.

We designed an escape room scenario in mixed reality to evaluate the proposed method and investigate its potential effects on perceived bi-directional trust. Our results show that the algorithm not only adapted faster after the meta-learning process but also increased the participants' perception of how trustworthy the robot perceives them.

In the future, we will combine this modelling with different neural network-based perception modules and examine their influences on the interactive process. From the trust modelling side, we will investigate the possible cause of the differences in the dynamics of perceived bi-directional trust. Moreover, we will address trust violation and examine human’s perception of different approaches to modelling it.


This work was supported by the COIN project (RIT15-0133) funded by the Swedish Foundation for Strategic Research and by the Swedish Research Council (grant n. 2015-04378).



VII-A Comparison of meta-policy and randomly initialized policy

In order to assess how the meta-learning pre-training process influences the optimization of a neural network based solution to the multi-armed bandit problem, we compare the meta-policy with a randomly initialized policy. The meta-policy is pre-trained in the simulated auxiliary environments. The feedback distribution in the auxiliary environments is modelled based on previous research on modelling perceived emotion using Gaussian class functions [zhang2017predicting].

Figure 6: This figure shows the relationship between the average number of samples needed to reach 95% confidence for any action and the number of actions in MAB problems. The results of the randomly initialized policy are shifted to the right to show the differences clearly.

Fig. 6 shows a comparison of the two classes of policies. We can observe that as the number of actions of the MAB increases, the number of interactions needed to reach 95% confidence for a particular action increases. However, the meta-learned policy needs fewer iterations than a randomly initialized policy for MAB problems with any number of actions.
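The quantity plotted in Fig. 6, the number of samples needed before some action reaches 95% confidence, can be estimated for a randomly initialized policy with a small simulation. The sketch below uses a plain gradient-bandit learner rather than the paper's neural network policy trained with MAML and TRPO; all hyperparameters are illustrative assumptions.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def iterations_to_confidence(n_actions, rng, lr=0.2, threshold=0.95,
                             max_iters=20000):
    """Train a softmax bandit policy (best arm fixed to 0) and count
    the samples until some action's probability reaches the threshold."""
    prefs = [0.0] * n_actions
    baseline = 0.0
    for t in range(1, max_iters + 1):
        pi = softmax(prefs)
        if max(pi) >= threshold:
            return t
        a = rng.choices(range(n_actions), weights=pi)[0]
        r = 1.0 if a == 0 else 0.0
        baseline += 0.05 * (r - baseline)   # running reward baseline
        adv = r - baseline
        prefs = [p + lr * adv * ((1.0 if i == a else 0.0) - pi[i])
                 for i, p in enumerate(prefs)]
    return max_iters

# Estimate the sample cost for bandits of increasing size.
rng = random.Random(7)
needed = {k: iterations_to_confidence(k, rng) for k in (2, 4, 8)}
```

As in Fig. 6, the sample cost grows with the number of actions; a meta-pre-trained initialization would start the same loop from informed preferences and therefore cross the threshold sooner.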