Multi-modal Active Learning From Human Data: A Deep Reinforcement Learning Approach

by   Ognjen Rudovic, et al.
Imperial College London

Human behavior expression and experience are inherently multi-modal, and characterized by vast individual and contextual heterogeneity. To achieve meaningful human-computer and human-robot interactions, multi-modal models of the users' states (e.g., engagement) are therefore needed. Most existing works that build classifiers for the users' states assume that the data used to train the models are fully labeled. However, data labeling is costly and tedious, and also prone to subjective interpretations by the human coders. This is even more pronounced when the data are multi-modal (e.g., some users are more expressive with their facial expressions, some with their voice). Thus, building models that can accurately estimate the users' states during an interaction is challenging. To tackle this, we propose a novel multi-modal active learning (AL) approach that uses the notion of deep reinforcement learning (RL) to find an optimal policy for active selection of the users' data, needed to train the target (modality-specific) models. We investigate different strategies for multi-modal data fusion, and show that the proposed model-level fusion coupled with RL outperforms the feature-level and modality-specific models, as well as naive AL strategies such as random sampling and standard heuristics such as uncertainty sampling. We show the benefits of this approach on the task of engagement estimation from real-world child-robot interactions during an autism therapy. Importantly, we show that the proposed multi-modal AL approach can be used to efficiently personalize the engagement classifiers to the target user using a small amount of actively selected data from that user.




1. Introduction

Human behavior is inherently multi-modal, and individuals use eye gaze, hand gestures, facial expressions, body posture, and tone of voice along with speech to convey engagement and regulate social interactions (el2006affective, ). A large body of work in human-computer interaction (HCI) and human-robot interaction (HRI) explored the use of various affective and social cues to facilitate and assess the user engagement (goodrich2008human, ; tsiourti2019multimodal, ). Most of these works can be divided into those that detect the presence of a set of the engagement cues or interaction events (chang2018ensemble, ; park2019aaai, ; gordon_affective_2016, ), or use supervised classifiers trained with social, physiological, or task-based interaction features (bohus2014managing, ; sanghvi2011automatic, ; castellano2009detecting, ).

Figure 1. Overview of the proposed multi-modal AL approach. The inputs are the recordings of the child-robot interactions during an autism therapy (we used the camera placed behind the robot). The image frames are first processed using open-source tools (openFace (openface, ) and openPose (openpose, )) to obtain the facial and body cues. Likewise, the audio recordings are processed using the openSMILE toolkit (opensmile, ). We also used the data collected by the E4 wristband (e4, ) on the child's hand (providing autonomic physiology data such as galvanic skin conductance and body temperature, as well as the accelerometer data). These data are fed as input to the modality-specific engagement classifiers: the LSTM models followed by fully-connected layers (fcL). The classification of the engagement levels (low, medium, high) is performed by applying majority voting to the classifiers' outputs. These outputs are also fed into the Q-function of the RL policy for active data-selection (also modeled using an LSTM cell and fcL for the action selection: to ask or not ask for the label). If the label is requested, the input multi-modal data are stored in a data pool for labelling by the human expert. These data are then used to train the data-selection policy and target classifiers. During inference on the data of a new child, the actively-selected data are used to personalize the engagement classifiers.

A detailed overview of related multi-modal approaches can be found in (alameda2019multimodal, ; pantic2003toward, ). The majority of these approaches adopt either feature-level (e.g., by simple concatenation of the multi-modal input features) or model-level (e.g., by combining multiple classifiers trained for each data modality) fusion (baltruvsaitis2019multimodal, ). While this can improve the estimation of target outcomes (e.g., the user engagement) when compared to the single-modality classifiers, these methods usually adopt a "one-size-fits-all" learning approach, where the trained models are applied to new users without any adaptation to the user. Consequently, their performance is usually limited when the users' data are highly heterogeneous (e.g., due to the differences in facial expressions/body gestures as a result of individual engagement styles).

Recently, several works proposed models for personalized estimation of engagement in HRI. For instance, (rudovic2018personalized, ) proposed a multi-modal deep learning approach for engagement estimation that combines body, face, audio and autonomic physiology data of children with autism during therapy sessions with a humanoid robot. However, most multi-modal works in HCI and HRI are fully supervised (baltruvsaitis2019multimodal, ; kahou2016emonets, ; jaimes2007multimodal, ), i.e., they assume that the data used to train the models are fully labeled. Obtaining these labels is expensive, especially when dealing with a large amount of audio-visual recordings. Therefore, there is a need for methods that can automatically select the most informative instances that need to be labeled in order to train the target estimation models. More importantly, to improve the generalization of these models, we need to select the data of a new user that can be used to personalize the target models. Yet, how to select the most informative instances of the multi-modal data from the target user is an open research problem that has received little attention.

To address this, the approach proposed here uses the notion of AL (settles2010active, ). Central to the AL framework is the query strategy used to decide when to request a label for target data. The most commonly used query strategies include uncertainty sampling, entropy, or query-by-committee (settles2010active, ). Furthermore, more advanced query strategies have been proposed to adapt deep network classifiers based on the uncertainty of the network output (e.g., (kading2016active, ; wang2016cost, ; konyushkova2017learning, )). Yet, the candidate query strategies still must be specified by a human. More importantly, there is not one strategy that works the best for all users. Instead of using the heuristic strategies, recent (deep) AL approaches (e.g.,  (liu2018learning, ; fang2017learning, ; woodward2017active, ; wang2016learning, ; duan2016rl, )) have adopted a data-driven approach that learns a model-free AL off-line policy using RL (sutton1998, ). For instance, (woodward2017active, ) proposed a model where an agent makes a decision whether to request a label or make a prediction. The agent receives a reward related to its decision: a positive reward is given for correct predictions, and negative rewards for incorrect predictions or label requests. This can be achieved by the Q-function modeled using the notion of deep RL (mnih2013playing, ; fang2017learning, ). The main goal of these approaches is to adapt the prediction model to new tasks, using a minimum number of queries. Our work is a generalization of the RL framework for AL (fang2017learning, ) to multi-modal data, where instead of dealing with a single agent-environment, the agent deals simultaneously with multiple environments (i.e., data modalities).

Note that RL has previously been applied in the tasks of multi-modal learning. For instance, (qian2018multimodal, ) used RL to enhance the machine translation from different data modalities. Likewise, (jiang2017deep, ) used RL for image question answering, an inherently multi-modal learning problem. Also, in the context of visual dialogues, RL has been used with multi-modal learning (zhang2018multimodal, ). However, none of these works explored RL for AL from multimodal data. The most related approach to ours is the multi-view AL framework (muslea2006active, ). It uses the standard heuristic AL strategies to select data from multiple views. While different views can be seen as different modalities of the same phenomenon, like facial and body gestures of human behaviour, to our knowledge no previous work has attempted multi-modal AL from the real-world human-interaction data and using RL to learn an optimal data selection policy. Moreover, the model personalization using such approach has not been explored before.

To tackle the challenges of learning from multi-modal human data (as typically encountered in HCI and HRI), in this work we formulate a novel multi-modal AL approach. We show that this approach can be used to personalize the data-modality-specific classifiers to the target user using a small amount of labeled data of the user, which are automatically selected using the newly proposed multi-modal AL strategy. The main contributions of this work are: (i) We propose a novel approach for multi-modal AL using RL for training a policy for active data-selection. (ii) We propose a novel personalization strategy based on the actively-selected multi-modal data of the target user. (iii) We show on a highly challenging dataset of child-robot interactions during an autism therapy that the proposed approach leads to large improvements in estimation of engagement (low, medium, high) from the multi-modal data, when compared to non-personalized models, and heuristic AL strategies. The outline of the proposed approach is depicted in Fig. 1. Compared to traditional supervised classifiers for multi-modal data, our approach provides an efficient mechanism for actively selecting the most relevant data for training the target engagement classifier, thus, minimizing the human data-labelling efforts.

2. Preliminaries

2.1. Problem Statement and Notation

In our learning setting, we use multi-modal data recordings of child-robot interactions during an autism therapy (rudovic2018personalized, ), as described in Sec. 4. Formally, we denote our dataset as $\mathcal{D} = \{\mathcal{D}^{(m)}\}_{m=1}^{M}$, where $m = 1, \dots, M$ denotes the data modality (e.g., face, body, etc.). This dataset comprises video recordings of the target interactions of the children (later split into training and test child-independent partitions). We assume a single recording per child, which may vary in duration. Each recording is represented with a set of multi-modal features $X = \{X^{(m)}\}_{m=1}^{M}$, where $X^{(m)} = \{x^{(m)}_1, \dots, x^{(m)}_N\}$, with $x^{(m)}_i$ containing the features of modality $m$ extracted every 60 ms from a sliding window of 1 second duration (30 image frames), resulting in $N$ temporally correlated multi-modal feature representations (the dimension of modality $m$ is $D_m$). The features of each 1 second interval are associated with the target engagement label $y_i$ (see Sec. 4 for details). Note that $N$ can vary per child, i.e., per recording. Given these data, we address a multi-class multi-modal sequence classification problem, where our goal is two-fold: (i) to predict the target label $y_i$ given the input features $x_i$ extracted from the sliding window within the recording, and (ii) to actively select the data of each child so that our engagement estimation model can be personalized to the target child.

2.2. The Base Classification Model

As the base model for the engagement classifiers ($f^{(m)}$), and also to implement the Q-function in the RL component of our approach, we use the Long Short-Term Memory (LSTM) model (hochreiter1997long, ), which enables long-range learning of time-feature dependencies. This has shown great success in tasks such as action recognition (donahue2015long, ; turaga2008machine, ) and speech analysis (graves2014towards, ; eyben2013real, ), among others. Each LSTM cell has hidden states augmented with nonlinear mechanisms that allow the network state to propagate without modification, be updated, or be reset, using simple learned gating functions. More formally, a basic LSTM cell can be described with the following equations:

$$
\begin{aligned}
g^f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
g^i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
g^o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= g^f_t \odot c_{t-1} + g^i_t \odot \tilde{c}_t, \\
h_t &= g^o_t \odot \tanh(c_t),
\end{aligned} \tag{1}
$$

where $g^f_t$, $g^i_t$ and $g^o_t$ are the forget gates, input gates, and output gates respectively, $\tilde{c}_t$ is the candidate cell state, and $c_t$ is a new LSTM cell state. $W$ and $U$ are the weights mapping from the observation ($x_t$) and hidden state ($h_{t-1}$), respectively, to the gates and candidate cell state, and $b$ is the bias vector. $\odot$ represents element-wise multiplication; $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent functions respectively (woodward2017active, ). To model the window of time steps in the data pairs $(x, y)$, we feed the feature vectors from the window to the temporally unrolled LSTM cell. Then, their output-state values are passed through fully connected layers (fcL) with rectified linear unit (ReLU) activations, and averaged across the time steps. Finally, a sigmoid layer followed by the softmax is applied to this output to obtain the sequence label $\hat{y}$.
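A minimal sketch of a single LSTM cell update of the kind described above, written in plain Python for illustration only. The names `lstm_step` and `params`, the gate keys `"f"`, `"i"`, `"o"`, `"c"`, and the use of separate weight matrices per gate are assumptions made here; they are not taken from the implementation used in this work.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    params maps a gate name ("f", "i", "o", "c") to a tuple (W, U, b):
    W acts on the observation x, U on the previous hidden state, b is the bias.
    """
    def pre_activation(name):
        W, U, b = params[name]
        return [wx + uh + bias
                for wx, uh, bias in zip(matvec(W, x), matvec(U, h_prev), b)]

    forget = [sigmoid(z) for z in pre_activation("f")]
    inp = [sigmoid(z) for z in pre_activation("i")]
    out = [sigmoid(z) for z in pre_activation("o")]
    candidate = [math.tanh(z) for z in pre_activation("c")]

    # New cell state: keep part of the old state, add part of the candidate.
    c = [f * cp + i * ct
         for f, cp, i, ct in zip(forget, c_prev, inp, candidate)]
    # New hidden state: gated, squashed cell state.
    h = [o * math.tanh(ck) for o, ck in zip(out, c)]
    return h, c

# Usage with a 1-D toy cell whose weights and biases are all zero:
# every gate then equals 0.5 and the candidate state equals 0.
zero = ([[0.0]], [[0.0]], [0.0])
toy_params = {name: zero for name in ("f", "i", "o", "c")}
h_new, c_new = lstm_step([1.0], [0.0], [1.0], toy_params)
```

With all-zero parameters the forget gate halves the previous cell state, which makes the role of each gate easy to check by hand before moving to trained weights.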


2.3. RL for Data-selection Policy Learning

RL (sutton1998, ) is a framework that can be used to learn an optimal policy $\pi^*$ for actively selecting data samples during model training and adaptation. The problem is formulated as a Markov Decision Process (MDP), which allows the learning of a policy that can dynamically select the most informative instances (fang2017learning, ). More specifically, given an input feature vector ($x_i$), we first compute a state-vector ($s_i$) that is then passed to the trained policy (Q-function), which outputs an action ($a_i$). During training of the Q-function, the goal is to maximize the action-value function $Q^*(s_i, a_i)$. This is at the heart of RL, and it specifies the expected sum of discounted future rewards for taking action $a_i$ in state $s_i$ and acting optimally from then on as:

$$
Q^*(s_i, a_i) = \mathbb{E}\left[ \sum_{k \geq 0} \gamma^{k} r_{i+k} \,\middle|\, s_i, a_i, \pi^* \right]. \tag{2}
$$

The optimal $Q^*$ function is given by the Bellman equation:

$$
Q^*(s_i, a_i) = \mathbb{E}_{s_{i+1}}\left[ r_i + \gamma \max_{a_{i+1}} Q^*(s_{i+1}, a_{i+1}) \right], \tag{3}
$$

where $\mathbb{E}_{s_{i+1}}$ indicates an expected value over the distribution of possible next states $s_{i+1}$, $r_i$ is the reward at the current state $s_i$ (derived from the input features $x_i$), and $\gamma$ is a discount factor, which incentivizes the model to seek reward in fewer time steps. The tuples $(s_i, a_i, r_i, s_{i+1})$ represent an MDP, and they are used to train the Q-function in Eq. (3).
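To make the notion of a discounted sum of future rewards concrete, the following short Python sketch computes it for a fixed reward sequence and checks the recursive (Bellman-style) decomposition on a toy example. The function name `discounted_return` and the reward and discount values are illustrative assumptions, not values used in this work.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Toy reward sequence and discount factor (illustrative values only).
toy_rewards = [1.0, 1.0, 1.0]
toy_gamma = 0.5

full = discounted_return(toy_rewards, toy_gamma)
# Recursive form: immediate reward plus the discounted return of the rest.
recursive = toy_rewards[0] + toy_gamma * discounted_return(toy_rewards[1:], toy_gamma)
```

Both quantities coincide, which is the property the Bellman equation exploits: the value of a state-action pair can be written in terms of the immediate reward and the value of the next state.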

3. Methodology

We propose a novel approach for multi-modal AL that provides an optimal policy for the active data-selection. These data are subsequently used to re-train the classification models for estimation of the target output (in our case, the engagement level) from fixed-sized video segments (1 second long). The proposed approach consists of two sequential processes: (i) the training of the classifiers for the target output, and (ii) the learning of the Q-function of the RL model for active data selection. These two are performed in a loop, where first the target classifiers are trained using the data of each modality ($m = 1, \dots, M$), thus, $M$ classifiers are trained in parallel. Then, based on their outputs, we perform the model-level fusion to obtain the target label for the input. This label is used, along with the input features and the true label, to train the Q-function for active data selection. During the models' training, this process is repeated by re-training the classifiers and the Q-function until both models have converged (or for a fixed number of iterations, i.e., training episodes). During inference on new data (in our case, the recording of the interactions between the new child and the robot), the learned group policy (i.e., the Q-function) is first used to select the data samples that the model is uncertain about. These are then used to personalize the target classifiers by additionally training each classifier using the actively selected data of the target child. The personalized classifiers are then used to obtain the engagement estimates on the remaining data of the target child.

3.1. Engagement Classifiers

For each data modality ($m = 1, \dots, M$) we train a separate LSTM model, resulting in the following ensemble of models: $\{f^{(1)}, \dots, f^{(M)}\}$. Each model is trained using the actively-selected samples from the target modality, where the number of possible samples is defined using a pre-defined budget ($B$) for active learning. The number of hidden states in each LSTM was set to 64 (found on a separate validation set), and the size of the fcL was set to match the 1-hot encoded network output (three engagement levels). Specifically, in the dataset used, we have $M = 4$ data modalities, comprising the FACE ($m = 1$), BODY ($m = 2$), autonomic physiology (A-PHYS) ($m = 3$) and AUDIO ($m = 4$) modalities. Thus, for the target task, we trained four LSTM models. From each data modality, we used the feature representations extracted using the open-source codes for face and body processing from image data (openFace (openface, ) and openPose (openpose, )), as done in (rudovic2018personalized, ). These provide the locations of characteristic facial points, facial action unit activations (0/1) and their intensities (0-5), and locations of 18 body joints and their confidence. As features of A-PHYS, we used the data collected with the E4 (e4, ) wristband on a child's hand, providing the galvanic skin response, heart rate and body temperature, along with the 3D accelerometer data encoding the child's movements. From the audio recordings, we used 24 low-level descriptors provided by openSMILE (opensmile, ), an open-source toolkit for audio-feature extraction. The feature dimension per modality was: 257D (FACE), 70D (BODY), 27D (A-PHYS), and 24D (AUDIO), thus, 378 features in total. For more details about the feature extraction, see (rudovic2018personalized, ).

To obtain the engagement estimate by the proposed multi-modal approach, we perform the model-level fusion by combining the target predictions of the modality-specific LSTMs:

$$
\hat{y} = \operatorname{majorityVote}\left(\hat{y}^{(1)}, \dots, \hat{y}^{(M)}\right),
$$

where $\hat{y}^{(m)}$ is the prediction of the classifier for modality $m$ and, in the case of ties, we select the most confident estimate, based on the soft-max outputs of each classifier. We also tried other model-fusion schemes, such as confidence-based weighting and a majority vote on the three most confident estimates, but the basic majority vote performed the best, and was used in the experiments reported.
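The model-level fusion described above can be sketched in a few lines of Python. This is an illustrative reading of the rule (majority vote, with ties resolved by the most confident classifier among the tied labels), not the code used in the experiments; the function name `fuse_predictions` and the example probabilities are assumptions.

```python
from collections import Counter

def fuse_predictions(per_modality_probs):
    """Majority vote over modality-specific predictions.

    per_modality_probs holds one soft-max output (list of class
    probabilities) per modality. Ties between labels are resolved by taking
    the label of the most confident classifier among the tied ones.
    """
    labels = [max(range(len(p)), key=p.__getitem__) for p in per_modality_probs]
    confidences = [max(p) for p in per_modality_probs]

    votes = Counter(labels)
    top_count = max(votes.values())
    tied = {label for label, count in votes.items() if count == top_count}
    if len(tied) == 1:
        return next(iter(tied))

    # Tie: pick the single most confident classifier voting for a tied label.
    best = max((i for i, label in enumerate(labels) if label in tied),
               key=lambda i: confidences[i])
    return labels[best]

# Clear majority: two of four classifiers vote for class 0.
clear = fuse_predictions([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
                          [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])
# Two-versus-two tie between classes 0 and 1; the most confident vote (0.9) wins.
tie = fuse_predictions([[0.6, 0.3, 0.1], [0.55, 0.35, 0.1],
                        [0.1, 0.9, 0.0], [0.2, 0.7, 0.1]])
```

With four modalities, two-versus-two ties are possible, which is why the tie-breaking rule matters in practice.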

3.2. Group-policy Learning for Active Data-selection

We first use the training data (i.e., recordings of the children in the training set) to learn the (initial) group-policy for deciding whether or not to query the label for the input features ($x_i$). For this, we implement the Q-function using the LSTM model, which receives the states ($s_i$) as input and outputs actions ($a_i$), as described in Sec. 2.3.

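Input: Dataset $\mathcal{D}$, models $\{f^{(m)}\}_{m=1}^{M}$ and $Q$ $\leftarrow$ rand, budget $B$
Output: Optimized models $\{f^{(m)}\}_{m=1}^{M}$, $Q$
for $e = 1$ to $E$ do
       $\mathcal{D}_l \leftarrow \emptyset$, shuffle $\mathcal{D}$;
       $\mathcal{M} \leftarrow \emptyset$;
       for $i = 1$ to $|\mathcal{D}|$ do
             $s_i \leftarrow$ state($x_i$) # construct a new state;
             $a_i \leftarrow \arg\max_a Q(s_i, a)$ # make a decision;
             if $a_i = 1$ then
                   $y_i \leftarrow$ oracle($x_i$) # ask for label;
             end if
             $r_i \leftarrow R(a_i, \hat{y}_i, y_i)$ # compute reward;
             if $a_i = 1$ and $|\mathcal{D}_l| < B$ then
                  store $(x_i, y_i)$ in $\mathcal{D}_l$;
             end if
             $s_{i+1} \leftarrow$ state($x_{i+1}$) # construct a new state;
             store $(s_i, a_i, r_i, s_{i+1})$ in $\mathcal{M}$;
             update $Q$ using a batch from $\mathcal{M}$;
       end for
      update models $\{f^{(m)}\}_{m=1}^{M}$ using $\mathcal{D}_l$;
end for
Return: $\{f^{(m)}\}_{m=1}^{M}$, $Q$
Algorithm 1 Multi-modal Q-learning (MMQL)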

States and Actions. We approximate the Q-function using the LSTM model with 32 hidden units. This is followed by a ReLU layer with the weight and bias parameters $\{W_Q, b_Q\}$, and softmax in the output, i.e., actions. The actions in our model are binary (ask / do not ask for label), and are 1-hot encoded using a 2D output vector for input states $s_i$. For instance, if a label is requested ($a_i = 1$), the true label is provided by an oracle (e.g., the human expert); otherwise ($a_i = 0$), no data label is provided. For training the Q-function, the design of the input states is critical. Typically, the raw input features (in our case, the 378D vectors $x_i$) are considered to be the states of the model (woodward2017active, ). While this is feasible, in the case of multiple modalities it can lead to a large state-space, which in turn can easily lead to the overfitting of the LSTM used to implement the Q-function. Here, we propose a different approach. Instead of using the input content ($x_i$) directly, we use the output of the engagement classifiers to form the states of the Q-function. Namely, the sigmoid layer of each target classifier outputs the probabilities for each class label in $\{0, 1, 2\}$, i.e., $p^{(m)} = [p^{(m)}_0, p^{(m)}_1, p^{(m)}_2]$. To obtain the "content-free" states for our RL model (these can also be termed "meta-weights", as in the previously proposed meta-learning AL frameworks, e.g., (konyushkova2017learning, )), for the target input $x_i$, we concatenate these estimated probabilities from each classifier to form the state vector $s_i$. Furthermore, we augment this state vector by also adding the overall confidence of each classifier. This is computed as $c^{(m)} = 1 - \left(-\sum_{k} p^{(m)}_k \log p^{(m)}_k\right)$, where the sum on the right-hand side of the equation is the entropy of the classifier (we bound it to $[0, 1]$). Finally, in the case of $M = 4$, this results in 16D state-vectors $s_i$ (12 class probabilities and 4 confidence values). As we show in our experiments, such a state representation leads to overall better results and learning of the Q-function than when the raw high-dimensional input features are used to represent the states (in our case, 378D).
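The construction of the state vector can be sketched as follows. This is a hedged illustration: the function names `confidence` and `build_state` are hypothetical, and dividing the entropy by the logarithm of the number of classes is one assumed way of bounding it to the unit interval; the exact normalization used in this work may differ.

```python
import math

def confidence(probs):
    """1 minus the entropy of a class distribution, with the entropy
    normalized to [0, 1] by dividing by log(number of classes)
    (normalization choice assumed for this sketch)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))

def build_state(per_modality_probs):
    """Concatenate all class probabilities, then append one confidence
    value per modality-specific classifier."""
    state = []
    for probs in per_modality_probs:
        state.extend(probs)
    state.extend(confidence(probs) for probs in per_modality_probs)
    return state

# Four modalities with three engagement classes each give a
# 4 * 3 + 4 = 16-dimensional state vector.
example_state = build_state([
    [0.7, 0.2, 0.1],   # FACE
    [0.5, 0.4, 0.1],   # BODY
    [0.3, 0.3, 0.4],   # A-PHYS
    [0.1, 0.1, 0.8],   # AUDIO
])
```

Compared to the 378D raw features, the resulting state is small and independent of the feature content, which is the motivation given above for this design.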

Reward. Another important aspect of RL is the design of the reward function. Given the input multi-modal feature vectors $x_i$, the active learner chooses an action $a_i$ of either requesting the true label $y_i$ or not. If the label is requested, the model receives a negative reward to reflect that obtaining labels is costly. On the other hand, if no label is requested, the model receives a positive reward if the estimation is correct; otherwise, it receives a negative reward. This is encoded by the following RL reward function:

$$
r_i = \begin{cases}
r_{\text{req}}, & \text{if } a_i = 1 \text{ (label requested)}, \\
r_{\text{cor}}, & \text{if } a_i = 0 \text{ and } \hat{y}_i = y_i, \\
r_{\text{inc}}, & \text{if } a_i = 0 \text{ and } \hat{y}_i \neq y_i,
\end{cases}
$$

with $r_{\text{req}} < 0$, $r_{\text{cor}} > 0$ and $r_{\text{inc}} < 0$, where $\hat{y}_i$ is the target label obtained by the majority vote from the modality-specific engagement classifiers (Sec. 3.1). This reward is critical for training the Q-function that we use to learn the data-selection policy. Note that this type of reward has previously been proposed in (woodward2017active, ); however, it has not been used in the context of multi-modal AL.
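The reward logic above amounts to a three-way case distinction, sketched below in Python. The numeric defaults (-0.05, 1.0, -1.0) are placeholder values chosen only to respect the signs described in the text; they are assumptions and should not be read as the values used in this work. The names `reward`, `ASK` and `SKIP` are likewise illustrative.

```python
ASK, SKIP = 1, 0

def reward(action, predicted_label, true_label,
           r_request=-0.05, r_correct=1.0, r_incorrect=-1.0):
    """Reward for one decision of the active learner.

    The default values are placeholders (assumed for illustration):
    a small penalty for asking, a positive reward for a correct
    prediction, and a penalty for an incorrect one.
    """
    if action == ASK:
        return r_request            # labels are costly
    if predicted_label == true_label:
        return r_correct            # correct prediction without asking
    return r_incorrect              # wrong prediction without asking

r_ask = reward(ASK, predicted_label=2, true_label=0)
r_right = reward(SKIP, predicted_label=1, true_label=1)
r_wrong = reward(SKIP, predicted_label=1, true_label=2)
```

The relative size of the request penalty and the error penalty controls how readily the learned policy asks for labels.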

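Input: New data $\mathcal{D}_c$, models $\{f^{(m)}\}_{m=1}^{M}$ and $Q$, budget $B$
Output: labeled data in $\mathcal{D}_l$, adapted models $\{f^{(m)}\}_{m=1}^{M}$
$\mathcal{D}_l \leftarrow \emptyset$, shuffle $\mathcal{D}_c$
for $i = 1$ to $|\mathcal{D}_c|$ do
       $s_i \leftarrow$ state($x_i$) # construct a new state;
       $a_i \leftarrow \arg\max_a Q(s_i, a)$ # make a decision;
       if $a_i = 1$ then
             $y_i \leftarrow$ oracle($x_i$), store $(x_i, y_i)$ in $\mathcal{D}_l$ # ask for label;
       end if
      if $|\mathcal{D}_l| = B$ then
             break;
       end if
end for
update models $\{f^{(m)}\}_{m=1}^{M}$ using $\mathcal{D}_l$;
make estimates $\hat{y}$ on the remaining data $\mathcal{D}_c \setminus \mathcal{D}_l$;
Algorithm 2 Personalized Engagement Estimation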
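The data-selection part of the personalization procedure can be illustrated with the following Python sketch. It covers only the budgeted selection loop; classifier fine-tuning is omitted. The function `select_for_labeling`, the `policy` callable (standing in for the learned Q-function) and the scalar toy states are hypothetical and introduced here for illustration.

```python
import random

def select_for_labeling(states, policy, budget, seed=0):
    """Pick at most `budget` sample indices for expert labeling.

    `policy` is any callable that maps a state to 1 (ask for the label)
    or 0 (do not ask); here it stands in for the learned Q-function.
    Returns the selected indices and the indices left for estimation.
    """
    order = list(range(len(states)))
    random.Random(seed).shuffle(order)       # shuffle the new child's samples
    selected = []
    for idx in order:
        if len(selected) >= budget:          # budget exhausted
            break
        if policy(states[idx]) == 1:         # the policy requests a label
            selected.append(idx)
    chosen = set(selected)
    remaining = [i for i in range(len(states)) if i not in chosen]
    return selected, remaining

# Toy usage: each state is a single confidence value, and the stand-in
# policy asks for a label whenever the confidence is below 0.5.
toy_states = [0.9, 0.1, 0.2, 0.8, 0.3]
toy_policy = lambda s: 1 if s < 0.5 else 0
picked, rest = select_for_labeling(toy_states, toy_policy, budget=2)
```

The selected samples would then be annotated by the expert and used to fine-tune each modality-specific classifier, while the remaining samples are used for evaluation.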

Optimization. Given the state-action pairs, along with the rewards, the parameters $\theta$ of the Q-function are optimized by minimizing the Bellman loss on the training data:

$$
L(\theta) = \sum_{i} \left[ Q_{\theta}(s_i, a_i) - \left( r_i + \gamma \max_{a_{i+1}} Q_{\theta}(s_{i+1}, a_{i+1}) \right) \right]^2,
$$

which encourages the model to improve its estimate of the expected reward at each training iteration (the discount factor $\gamma$ is kept fixed). This is performed sequentially over the multi-modal data samples from the training children, and over a number of training episodes. The loss minimization is performed using the Adam optimizer with a learning rate of 0.001. The training of this new AL approach, named Multi-modal Q-learning (MMQL), is summarized in Alg. 1.
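As a worked illustration of the Bellman loss, the following Python sketch evaluates the mean squared temporal-difference error over a small batch of transitions. It only computes the loss value; gradient-based optimization (e.g., with Adam) is not shown. The function name `bellman_loss`, the representation of the Q-function as a callable returning one value per action, and the use of `None` to mark a terminal next state are assumptions of this sketch.

```python
def bellman_loss(transitions, q_values, gamma):
    """Mean squared difference between Q(s, a) and its Bellman target.

    transitions: list of (state, action, reward, next_state) tuples, where
                 next_state is None at the end of an episode (assumed
                 convention for this sketch).
    q_values:    callable mapping a state to a list of Q-values, one per
                 action (index 0 = do not ask, index 1 = ask).
    """
    total = 0.0
    for state, action, reward, next_state in transitions:
        if next_state is None:
            target = reward
        else:
            target = reward + gamma * max(q_values(next_state))
        total += (target - q_values(state)[action]) ** 2
    return total / len(transitions)

# Toy Q-function that ignores the state and always returns [0.0, 1.0].
constant_q = lambda state: [0.0, 1.0]

# Target 0.5 + 0.5 * 1.0 = 1.0 matches Q(s, a=1) = 1.0, so the loss is zero.
loss_consistent = bellman_loss([("s0", 1, 0.5, "s1")], constant_q, gamma=0.5)
# Terminal transition: target 1.0 versus Q(s, a=0) = 0.0 gives a loss of 1.0.
loss_terminal = bellman_loss([("s0", 0, 1.0, None)], constant_q, gamma=0.5)
```

A loss of zero means the Q-values already satisfy the Bellman equation on the given transitions; training drives the loss towards this fixed point.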

Test child ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14
# of samples in class 0 | 697 | 0 | 72 | 31 | 541 | 2281 | 3903 | 478 | 125 | 0 | 188 | 20 | 0 | 1134
# of samples in class 1 | 683 | 0 | 107 | 72 | 241 | 1026 | 1004 | 772 | 318 | 0 | 101 | 204 | 107 | 44
# of samples in class 2 | 0 | 2539 | 2095 | 1452 | 0 | 0 | 898 | 87 | 549 | 1689 | 1524 | 5686 | 2107 | 0
Table 1. The distribution of the engagement levels for each test child. The data samples are obtained by applying a 1-second sliding window, with a shift of 20 ms, to the original recordings of the target children.

Model | FACE | BODY | A-PHYS | AUDIO | Feature-level fusion | Model-level fusion
MMQL (cont=0) | 48.6 / 77.5 | 61.3 / 80.0 | 36.5 / 77.3 | 49.0 / 76.4 | 60.7 / 80.5 | 56.6 / 82.3
MMQL (cont=1) | 51.0 / 77.4 | 60.1 / 80.5 | 30.6 / 78.9 | 49.1 / 74.4 | 58.0 / 79.6 | 54.6 / 82.3
UNC | 48.4 / 69.1 | 63.5 / 74.1 | 19.4 / 53.5 | 50.2 / 69.7 | 59.0 / 71.9 | 46.1 / 72.9
RND | 48.2 / 70.4 | 62.6 / 73.4 | 21.5 / 56.7 | 45.2 / 70.8 | 59.4 / 73.9 | 50.4 / 75.2
Table 2. ACC [%] of the models before (left) and after (right) adaptation to test children using different budgets (5, 10, 20, 50 and 100) for active data-selection (averaged across the children and budgets).

Model | FACE | BODY | A-PHYS | AUDIO | Feature-level fusion | Model-level fusion
MMQL (cont=0) | 28.9 / 42.2 | 33.7 / 51.8 | 23.2 / 44.1 | 24.2 / 36.2 | 33.2 / 48.5 | 32.0 / 52.6
MMQL (cont=1) | 30.8 / 44.0 | 33.5 / 52.5 | 19.8 / 45.2 | 23.5 / 39.5 | 32.9 / 48.5 | 28.4 / 48.3
UNC | 30.0 / 36.3 | 35.6 / 41.9 | 14.2 / 30.1 | 24.3 / 32.0 | 32.7 / 37.7 | 28.1 / 38.1
RND | 29.9 / 37.0 | 35.0 / 41.6 | 16.0 / 31.1 | 24.2 / 32.7 | 33.4 / 38.9 | 29.6 / 38.9
Table 3. F-1 [%] of the models before (left) and after (right) adaptation to test children using different budgets (5, 10, 20, 50 and 100) for active data-selection (averaged across the 14 children and 5 budgets).

3.3. Personalized Estimation of Engagement

The learned group-policy for active data selection can be applied to multi-modal recordings of previously unseen children. However, the trained multi-modal engagement classifiers may be suboptimal due to the highly diverse styles of engagement expressions across the children with autism, in terms of their facial expressions, head movements, body gestures and positions, among others. These may vary from child to child not only in their appearance but also in their dynamics during the engagement episodes. To account for these individual differences, we adapt the proposed MMQL to each child. We do so by additionally training (fine-tuning) the modality-specific engagement classifiers using the actively selected data samples from the target child. Specifically, we start with the group-level engagement classifiers to obtain the initial engagement estimates, but also use the learned Q-function to select difficult data that need to be expert-labeled. This is performed in an off-line manner: the requested samples of the target child are first annotated by an expert, and then used to personalize the engagement classifiers. The main premise here is that with a small number of human-labeled videos, the engagement classifiers can easily be optimized for the target child as:

$$
\theta^{(m)}_{c} = \arg\min_{\theta^{(m)}} \mathcal{L}\left(\mathcal{D}^{B}_{c}; \theta^{(m)}\right), \quad m = 1, \dots, M,
$$

where $\theta^{(m)}$ are the parameters of the classifier for modality $m$ (initialized with the group-level values), $\mathcal{L}$ is the classification loss, and $\mathcal{D}^{B}_{c}$ are the child data-samples actively selected using the group-level Q-function, and under the budget $B$. This is described in detail in Alg. 2.

4. Experiments

Data and Features. To evaluate the proposed approach, we used the cross-cultural dataset of children with autism spectrum condition (ASC) attending a single session (on average, 25 mins long) of a robot-assisted autism therapy (rudovic2017measuring, ). During the therapy, an experienced educational therapist worked on teaching the children socio-emotional skills, focusing on recognition and imitation of behavioral expressions as shown by the neurotypical population. To this end, the NAO robot was used to demonstrate examples of these expressions. The data comprise highly synchronized audio-visual and autonomic physiological recordings of 17/18 children, ages 3-13, with Japanese/European cultural background, respectively. All the children have a prior medical diagnosis of ASC, varying in its severity. The audio-visual recordings were annotated by the human experts on a continuous scale [-1,1]. We discretized these annotations by binning the average continuous engagement score within 1 sec intervals into: low [-1,0.5], medium (0.5,0.8], and high (0.8,1] engagement. The multi-modal features were obtained as described in Sec. 3.1. We split the children into training (20) and test (14) sets at random (the data of one child were discarded because of severe face occlusions), and used their recordings for the evaluation of the models presented here. Table 1 shows the highly imbalanced nature of the data of the test children.
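The discretization of the continuous annotations can be sketched as follows, using the bin edges stated above. The function names and the mapping of low/medium/high to the class indices 0/1/2 are assumptions made for this illustration.

```python
def engagement_level(score):
    """Map an average engagement score in [-1, 1] to a class index:
    0 = low [-1, 0.5], 1 = medium (0.5, 0.8], 2 = high (0.8, 1]
    (class indices assumed for this sketch)."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("engagement score must lie in [-1, 1]")
    if score <= 0.5:
        return 0
    if score <= 0.8:
        return 1
    return 2

def label_interval(scores):
    """Label a 1-second interval by binning the mean of its continuous scores."""
    return engagement_level(sum(scores) / len(scores))

# Frame-level annotations within one interval (toy values).
interval_label = label_interval([0.7, 0.9, 0.9])
```

Because the low bin spans three quarters of the annotation scale, the resulting class counts depend strongly on how each child's scores are distributed, which is consistent with the imbalance visible in Table 1.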

Figure 2. The relative number of searched data samples when the states of the Q-function in MMQL approach are: the raw input features x (cont=1), and those constructed using the output of the modality-specific classifiers (cont=0).

Performance Metrics and Models. We report the average accuracy (ACC) and F-1 score of the models in the task of 3-class engagement classification. The reported results are obtained by evaluating the group-level models and the models adapted to each test child (Alg. 2). The latter was repeated 10 times by random shuffling of the test child data, and the average results are reported. For training/adaptation of the models, we varied the budget for active data selection as $B \in \{5, 10, 20, 50, 100\}$. The number of episodes during training was set to 100, and LSTMs were trained for 10 epochs in each episode (and during the classifier adaptation to the test child). To evaluate different model settings, we start with the uni-modal models (thus, trained/tested using only a single data modality: FACE, BODY, A-PHYS or AUDIO) and the multi-modal approach with the feature-level fusion (i.e., by concatenating the input features from all $M = 4$ modalities). We compare these models to the proposed model-level fusion in the MMQL approach. As the baselines, we show the performance achieved with alternative data-selection strategies: random sampling (RND) and the most common AL heuristic for data selection, uncertainty sampling (UNC). In the case of multi-modal learning, the UNC scores for each sample were computed as the sum of the classifiers' entropy (Sec. 3.2).
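The multi-modal uncertainty sampling baseline can be illustrated with a short Python sketch: the score of a sample is the sum of the entropies of the modality-specific classifier outputs, and the most uncertain samples are queried first. Selecting the top-scoring samples up to the budget is an assumption about how the scores are used; the function names are hypothetical.

```python
import math

def entropy(probs):
    # Shannon entropy of a class distribution (natural logarithm).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def uncertainty_score(per_modality_probs):
    """UNC score of one sample: sum of the classifiers' entropies."""
    return sum(entropy(probs) for probs in per_modality_probs)

def uncertainty_sampling(samples, budget):
    """Indices of the `budget` samples with the highest UNC score
    (top-k selection assumed for this sketch)."""
    ranked = sorted(range(len(samples)),
                    key=lambda i: uncertainty_score(samples[i]),
                    reverse=True)
    return ranked[:budget]

# Three toy samples, each with the outputs of two classifiers.
toy_samples = [
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],          # both classifiers certain
    [[1/3, 1/3, 1/3], [1/3, 1/3, 1/3]],          # both maximally uncertain
    [[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]],          # one classifier unsure
]
queried = uncertainty_sampling(toy_samples, budget=2)
```

Unlike the learned policy, this heuristic ranks samples by a fixed criterion and does not take the effect of a query on later predictions into account.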

Figure 3. The performance of the feature- and model-level fusion with different active data-selection strategies.

Results. Tables 2 and 3 summarize the performance achieved by the compared models. We first note that the classifiers trained on the training children generalize poorly to the test children. This is expected, especially when using data of children with autism, who exhibit very different engagement patterns during their interactions. As can be seen from the numbers on the right-hand side (obtained after the personalization of the classifiers using actively selected data of the target child), the models' performance increases substantially. This evidences the importance of the model personalization using the data of the target child. Overall, among the uni-modal models, the BODY modality achieves the best performance, followed by the FACE, A-PHYS, and AUDIO modalities, as also noted in (rudovic2018personalized, ). However, both multi-modal versions of the models (feature- and model-level fusion) bring gains in the performance, with the model-level fusion performing the best on average.

Comparing the MMQL models with the states based on the data content (cont=1) and those constructed from the classifiers' outputs (cont=0), we note that there is no large difference in the performance for most models. Yet, with model-level fusion, the F-1 score of the MMQL (cont=0) after adaptation is 4.3% higher than that of the MMQL (cont=1) (52.6 vs. 48.3 in Table 3). Furthermore, by looking at Fig. 2, we note that the MMQL approach with cont=0 requires a lower search time to reach the budget, while achieving a similar performance to when the content is used. Thus, this simpler model is preferable in practice. In the rest of the experiments, we show the performance of the MMQL (cont=0) only. Compared to the baselines, we note that the proposed MMQL largely outperforms these base strategies for active data-selection, under the same budget constraints. This evidences that the proposed MMQL is able to learn a more efficient data-selection policy. By comparing the ACC and F-1 scores, we note that the proposed MMQL is able to improve the classification of each engagement level, while the RND/UNC strategies tend to overfit the majority class.
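The difference between ACC and F-1 under class imbalance can be illustrated with the class counts of test child 12 from Table 1 (20, 204 and 5686 samples in classes 0, 1 and 2). The sketch below evaluates a trivial predictor that always outputs the majority class. Macro-averaging of the F-1 score over the three classes is an assumption of this sketch; the averaging used for the reported results is not specified here.

```python
def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def macro_f1(y_true, y_pred, classes=(0, 1, 2)):
    """Unweighted mean of the per-class F-1 scores (macro-averaging assumed)."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Class counts of test child 12 in Table 1.
labels = [0] * 20 + [1] * 204 + [2] * 5686
# A classifier that has collapsed onto the majority class.
majority_only = [2] * len(labels)

acc_majority = accuracy(labels, majority_only)
f1_majority = macro_f1(labels, majority_only)
```

For this child, the majority-class predictor reaches an accuracy above 0.96 while its macro F-1 stays below 0.33, which shows why a gain in F-1 indicates improved classification of the minority engagement levels rather than of the majority class alone.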

Figure 4. The performance per test child of the MMQL (model-level fusion, cont=0) before (in green) and after the personalization of the engagement classifiers. The results are shown for the budgets 5 and 10.

Similar observations can be made from Fig. 3, which shows the performance of the personalized models after adaptation using different budgets. MMQL consistently outperforms the RND and UNC sampling strategies, which we again attribute to the superior data-selection strategy attained by the RL Q-function. This trend is even more pronounced for larger budgets, evidencing that the proposed Q-function consistently selects the more informative data samples, which are then used to adapt the modality-specific classifiers to the target child. This holds for both the feature- and model-level fusion within the proposed MMQL approach.
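For reference, the two baseline strategies can be sketched in a few lines of Python. The sketch assumes that UNC ranks candidate samples by the entropy of the classifier's predicted class probabilities and that RND draws samples uniformly without replacement, both under a fixed labeling budget. The exact uncertainty measure used in the experiments is not specified here, and the function names and probabilities are hypothetical.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_uncertain(prob_rows, budget):
    """UNC baseline (assumed form): pick the `budget` highest-entropy samples."""
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: entropy(prob_rows[i]),
                    reverse=True)
    return ranked[:budget]


def select_random(n_samples, budget, seed=0):
    """RND baseline: pick `budget` distinct sample indices uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(n_samples), budget)


# Hypothetical classifier outputs for four unlabeled samples of a target child.
probs = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.34, 0.33, 0.33],  # close to uniform, most uncertain
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],
]
print(select_uncertain(probs, budget=2))  # [1, 2]
print(select_random(len(probs), budget=2))
```

Both baselines apply a fixed rule to every target child, whereas a learned Q-function can, in principle, adapt its selection to the state of the classifiers.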

Fig. 4 shows the performance of the MMQL (model-level fusion, cont=0) approach per child before and after the model personalization using data actively selected with the proposed RL approach. Note that even with only 5 samples, the engagement classification improves substantially for almost all test children. Because of the highly imbalanced nature of these data (see Table 3), the F-1 scores are relatively low; nevertheless, the improvements due to the active adaptation of the target classifiers are evident. One of the challenges when working with such imbalanced data is that the classifiers tend to overfit the majority class, as most of the actively selected samples come from that class. While this is much more pronounced in the RND/UNC selection strategies, it is also one of the bottlenecks of the current approach, since the target classifiers are updated offline. We plan to address this in our future work.

5. Conclusions

We proposed a novel active learning approach for multi-modal data of human behaviour. Instead of using heuristic strategies for active data-selection (such as uncertainty sampling), our approach uses the notion of deep RL for active selection of the most informative data samples for the model adaptation to the target user. We showed the effectiveness of this approach on a highly challenging multi-modal dataset of child-robot interactions during an autism therapy. Specifically, we showed that the learned data-selection policy generalizes well to new children: it selects those data samples that allow the pre-trained engagement classifiers to adapt quickly to the target child. We also showed that this multi-modal model personalization can largely improve the performance of the engagement estimation for each test child using only a few expert-labeled samples of the target child.


  • [1] Empatica e4:, 2015.
  • [2] X. Alameda-Pineda, E. Ricci, and N. Sebe. Multimodal behavior analysis in the wild: An introduction. In Multimodal Behavior Analysis in the Wild, pages 1–8. Elsevier, 2019.
  • [3] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In International Workshop on Human Behavior Understanding, pages 29–39. Springer, 2011.
  • [4] T. Baltrušaitis, C. Ahuja, and L.-P. Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443, 2019.
  • [5] T. Baltrušaitis, P. Robinson, and L.-P. Morency. Openface: an open source facial behavior analysis toolkit. In IEEE Conference on Applications of Computer Vision, pages 1–10, 2016.
  • [6] D. Bohus and E. Horvitz. Managing human-robot engagement with forecasts and… um… hesitations. In Proceedings of the 16th international conference on multimodal interaction, pages 2–9. ACM, 2014.
  • [7] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [8] G. Castellano, A. Pereira, I. Leite, A. Paiva, and P. W. McOwan. Detecting user engagement with a robot companion using task and social interaction-based features. In Proceedings of the 2009 international conference on Multimodal interfaces, pages 119–126. ACM, 2009.
  • [9] C. Chang, C. Zhang, L. Chen, and Y. Liu. An ensemble model using face and body tracking for engagement detection. In Proceedings of the International Conference on Multimodal Interaction, pages 616–622. ACM, 2018.
  • [10] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
  • [11] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • [12] R. El Kaliouby, R. Picard, and S. Baron-Cohen. Affective computing and autism. Annals of the New York Academy of Sciences, 1093(1):228–248, 2006.
  • [13] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in opensmile, the munich open-source multimedia feature extractor. In ACM International Conference on Multimedia, pages 835–838, 2013.
  • [14] F. Eyben, F. Weninger, S. Squartini, and B. Schuller. Real-life voice activity detection with lstm recurrent neural networks and an application to hollywood movies. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 483–487, 2013.
  • [15] M. Fang, Y. Li, and T. Cohn. Learning how to active learn: A deep reinforcement learning approach. arXiv preprint arXiv:1708.02383, 2017.
  • [16] M. A. Goodrich, A. C. Schultz, et al. Human–robot interaction: a survey. Foundations and Trends® in Human–Computer Interaction, 1(3):203–275, 2008.
  • [17] G. Gordon, S. Spaulding, J. K. Westlund, J. J. Lee, L. Plummer, M. Martinez, M. Das, and C. Breazeal. Affective Personalization of a Social Robot Tutor for Children’s Second Language Skills. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [18] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772, 2014.
  • [19] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [20] A. Jaimes and N. Sebe. Multimodal human–computer interaction: A survey. Computer vision and image understanding, 108(1-2):116–134, 2007.
  • [21] A.-W. Jiang, B. Liu, and M.-W. Wang. Deep multimodal reinforcement network with contextually guided recurrent attention for image question answering. Journal of Computer Science and Technology, 32(4):738–748, 2017.
  • [22] C. Käding, E. Rodner, A. Freytag, and J. Denzler. Active and continuous exploration with deep neural networks and expected model output changes. arXiv preprint arXiv:1612.06129, 2016.
  • [23] S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michalski, K. Konda, S. Jean, P. Froumenty, Y. Dauphin, N. Boulanger-Lewandowski, et al. Emonets: Multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces, 10(2):99–111, 2016.
  • [24] K. Konyushkova, R. Sznitman, and P. Fua. Learning active learning from data. In Advances in Neural Information Processing Systems, pages 4228–4238, 2017.
  • [25] M. Liu, W. Buntine, and G. Haffari. Learning to actively learn neural machine translation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 334–344, 2018.
  • [26] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • [27] I. Muslea, S. Minton, and C. A. Knoblock. Active learning with multiple views. Journal of Artificial Intelligence Research, 27:203–233, 2006.
  • [28] M. Pantic and L. J. Rothkrantz. Toward an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE, 91(9):1370–1390, 2003.
  • [29] H. W. Park, I. Grover, S. Spaulding, L. Gomez, and C. Breazeal. A model-free affective reinforcement learning approach to personalization of an autonomous social robot companion for early literacy education. In Proceedings of the Thirty Third AAAI Conference on Artificial Intelligence, AAAI’19. AAAI Press, 2019.
  • [30] X. Qian, Z. Zhong, and J. Zhou. Multimodal machine translation with reinforcement learning. arXiv preprint arXiv:1805.02356, 2018.
  • [31] O. Rudovic, J. Lee, M. Dai, B. Schuller, and R. Picard. Personalized machine learning for robot perception of affect and engagement in autism therapy. Science Robotics, 2018.
  • [32] O. Rudovic, J. Lee, L. Mascarell-Maricic, B. W. Schuller, and R. W. Picard. Measuring engagement in robot-assisted autism therapy: A cross-cultural study. Frontiers in Robotics and AI, 4:36, 2017.
  • [33] J. Sanghvi, G. Castellano, I. Leite, A. Pereira, P. W. McOwan, and A. Paiva. Automatic analysis of affective postures and body motion to detect engagement with a game companion. In Proceedings of the 6th international conference on Human-robot interaction, pages 305–312. ACM, 2011.
  • [34] B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.
  • [35] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press Cambridge, 1998.
  • [36] C. Tsiourti, A. Weiss, K. Wac, and M. Vincze. Multimodal integration of emotional signals from voice, body, and context: Effects of (in) congruence on emotion recognition and attitudes towards robots. International Journal of Social Robotics, pages 1–19, 2019.
  • [37] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
  • [38] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 2016.
  • [39] M. Woodward and C. Finn. Active one-shot learning. NIPS, Deep Reinforcement Learning Workshop, 2016.
  • [40] J. Zhang, T. Zhao, and Z. Yu. Multimodal hierarchical reinforcement learning policy for task-oriented visual dialog. arXiv preprint arXiv:1805.03257, 2018.