Socially assistive robotics (SAR) has great potential to provide accessible, affordable, and personalized therapeutic interventions for children with autism spectrum disorders (ASD). However, human-robot interaction (HRI) methods are still limited in their ability to autonomously recognize and respond to behavioral cues, especially in atypical users and everyday settings. This work applies supervised machine learning algorithms to model user engagement in the context of long-term, in-home SAR interventions for children with ASD. Specifically, two types of engagement models are presented for each user: 1) generalized models trained on data from different users; and 2) individualized models trained on an early subset of the user’s data. The models achieved approximately 90% accuracy (AUROC) for post hoc binary classification of engagement, despite the high variance in data observed across users, sessions, and engagement states. Moreover, temporal patterns in model predictions could be used to reliably initiate re-engagement actions at appropriate times. These results validate the feasibility and challenges of recognition and response to user disengagement in long-term, real-world HRI settings. The contributions of this work also inform the design of engaging and personalized HRI, especially for the ASD community.
Socially assistive robotics (SAR) is a promising new subfield of human-robot interaction (HRI), with a focus on developing intelligent robots that provide assistance through social interaction [22, 28]. As overviewed in this journal, researchers have been exploring SAR as a means of providing accessible, affordable, and personalized interventions to complement human care. However, HRI methods are still limited in their ability to autonomously perceive, interpret, and naturally respond to behavioral cues from atypical users in everyday contexts. This hinders the ability of SAR interventions to be tailored toward the specific needs of each user [35, 25].
These HRI challenges are significantly amplified in the context of SAR for individuals with autism spectrum disorders (ASD), but ASD is also the context where SAR is especially promising. ASD is a developmental disability characterized by difficulties in social communication and interaction. About 1 in 160 children worldwide are diagnosed with ASD, with a higher rate of 1 in 59 children in the United States. Therapists offer individualized services for helping children with ASD to develop social skills through games or storytelling, but such services are not universally accessible or affordable. To this end, researchers have been actively exploring SAR for children with ASD. Several short-term studies have already shown SAR to support learning in ASD users. Moreover, in this journal, Scassellati et al. reported on a long-term, in-home SAR intervention that helped children with ASD to improve social skills such as perspective-taking and joint attention with adults.
SAR systems must engage users to be effective. Robot perception of user engagement is a key HRI capability that makes it possible for the robot to achieve the specified goals of the interaction and, in the SAR context, the intervention. Past work has used rule-based methods to approximate the engagement of children with ASD. For instance, Kim et al. indirectly assessed engagement by estimating emotional states from audio data. Esteban et al. gauged engagement by using the frequency of measured variables, such as how many times the child looked at the robot. More recently in this journal, Rudovic et al. used supervised machine learning (ML) to model engagement in a single-session laboratory study. The study focused on developing post hoc personalized models using deep neural networks, and achieved an average agreement of 60% with human annotations of engagement on a continuous scale from -1 to +1.
This article addresses the feasibility and challenges of applying supervised ML methods to model user engagement in long-term, real-world human-robot interactions, with a focus on SAR interventions for children with ASD. The contributions of this work differ from past work in two key aspects.
First, the methods and results are based on data from month-long, in-home SAR interventions with 7 child participants with ASD. While single-session and short-term studies of SAR for ASD are numerous, the work by Scassellati et al. in this journal is the only other long-term, in-home SAR for ASD study conducted to date. Long-term, in-home studies and data collections are important for many reasons: they more realistically represent real-world learning environments, they provide more opportunities for the user to learn and interact with the robot, and they produce more relevant training datasets. Furthermore, long-term, in-home settings present new modeling challenges, given the significantly larger quantity and variance in user data.
Second, this work emphasizes engagement models that are practical for on-line use in human-robot interactions. With a supervised ML approach, models require labeled data for training, which are often expensive or infeasible to obtain. Previous works report on models trained and tested on randomly sampled subsets of each participant’s data. However, that approach is impractical for on-line use if labeled training data for a given user are obtained chronologically after the testing data. In contrast, this work presents, for each user: 1) generalized models trained on data from different users; and 2) individualized models trained on an early subset of the user’s data. As detailed in Materials and Methods, models were developed for different numbers of training users in generalized models and varying sizes of early subsets in individualized models. An early subset of the data is defined as the first % of a user’s data sorted chronologically. Furthermore, this work also analyzes the temporal structure of model predictions to examine the possibility of initiating re-engagement actions at appropriate times.
The presented engagement models are trained on data from month-long, in-home SAR interventions. During interventions, child participants with ASD played space-themed math games on a touchscreen tablet while a Stewart platform robot named Kiwi provided verbal and expressive feedback. The robot’s feedback and instruction challenge levels were personalized to each user’s unique learning patterns with reinforcement learning over the month-long intervention. All participants showed improvements in reasoning skills and long-term retention of intervention content [14, 15]. Over the month-long data collection, an average of three hours of multi-modal data were collected across multiple sessions for each participant, including video, audio, and performance on the games. As Figure 1 shows, a USB camera mounted on top of the game tablet recorded a front view of the user. Visual and audio features were extracted from the camera data and performance features were derived from the answers to game questions recorded on the tablet. As detailed in Materials and Methods, the open-source data processing tools used to extract these features are appropriate for on-line use in HRI contexts [5, 11, 7].
This work frames engagement modeling as a binary classification problem, similar to most previous relevant works. Participants were annotated as engaged or disengaged in each camera frame using standard definitions of engagement as a combination of behavioral, affective, and cognitive constructs. A participant was considered to be engaged when paying full attention to the interaction, immediately responding to the robot’s prompts, or seeking further guidance from others in the room. The binary labels simplify the representation of participants’ behavior, which may vary in degree of engagement and disengagement. However, temporal patterns in binary labels can provide additional context; therefore, this work also analyzes the length of time a participant is continuously engaged and disengaged. Trained annotators labeled engagement, and an inter-rater reliability of (Fleiss’ Kappa) was achieved. The Materials and Methods section provides additional details about the data and the annotation process.
This article focuses on post hoc models of user engagement based on data from month-long, in-home SAR interventions. The presented approaches are suitable for on-line perception of engagement, and are intended to inform the design of more engaging and personalized HRI. The contributions of this work especially aim to improve SAR’s effectiveness in supporting learning by children with ASD.
This work presents two types of supervised ML models of user engagement in long-term, real-world HRI intended for on-line implementation: 1) generalized models trained on data from different users; and 2) individualized models trained on data from early subsets of the users’ interventions. On average, these models achieved area under the receiver operating characteristic (AUROC) values of approximately 90%. AUROC is a commonly used ML metric for binary classification problems; specifically, it measures the probability that the models would rank a randomly chosen engaged instance higher than a randomly chosen disengaged instance. In order to evaluate these two approaches, models trained on random samples of all users’ data were also implemented. Random sampling yielded significantly higher recall for disengagement compared to generalized and individualized models. This is likely because the month-long, in-home setting led to a large variance in both engagement states and recorded data. Variance in data manifested not only across participants but also within each participant, highlighting an important characteristic of real-world HRI in the ASD context. Despite the lower recall and higher variance for disengagement, temporal patterns in model predictions can be used to reliably initiate re-engagement actions at appropriate times.
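The ranking interpretation of AUROC given above can be illustrated with a minimal sketch. The function name `auroc`, the toy scores, and the convention that label 1 means engaged are assumptions made for illustration; this is not the evaluation code used in the study.

```python
from itertools import product


def auroc(scores, labels):
    """Probability that a randomly chosen engaged instance (label 1) receives
    a higher score than a randomly chosen disengaged instance (label 0).
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one instance of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy predicted engagement probabilities (hypothetical values).
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
example_auroc = auroc(scores, labels)  # 8 of 9 engaged/disengaged pairs ranked correctly
```

Because the metric depends only on the ordering of scores, it is insensitive to the classification threshold, which is one reason AUROC and recall can tell different stories for the same model.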
Observed User Engagement
Over the course of the month-long, in-home intervention, participants were engaged an average of 65% of the time during the child-robot interactions. However, engagement varied considerably across participants and for each participant, as shown in Figure 2. Average engagement for participants ranged from 48% to 84%, with a standard deviation of 14%. Analyzing each participant’s engagement chronologically over 10% increments also showed a standard deviation of 15%. Moreover, all participants had a significant () decrease in engagement over the month-long intervention, as determined by a regression t-test and shown by the plotted trend line. For example, Participant 2 was engaged 82% of the time in the first 10% and only 19% of the time in the last 10% of the month-long intervention.
This substantial variance in user engagement over the course of a long-term, real-world study indicates the need for on-line recognition of and response to disengagement. This study observed higher engagement for all participants shortly after the robot had spoken. Specifically, participants were engaged about 70% of the time when the robot had spoken in the previous minute, but less than 50% of the time when the robot had not spoken for over a minute. This validates the use of appropriately-timed robot speech as a tool for eliciting and maintaining user engagement.
Generalized and Individualized Model Results
This work presents generalized and individualized models of user engagement using data from long-term, in-home SAR interventions. As detailed in Materials and Methods, generalized models were developed by training on data from a given subset of users and then testing on different users. Individualized models were developed by sorting a user’s data chronologically and using an early subset for training and later subset for testing the model. These two approaches were designed to be feasible for on-line use in HRI; the labeled data required for supervised ML models are practical to obtain in both cases. In order to evaluate these approaches, models trained on random samples of all users’ data were also implemented, despite random sampling not being feasible for on-line use.
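The two training schemes described above differ only in how the data are partitioned. The following is a minimal sketch of both partitions, assuming each sample is a record with hypothetical `user` and `t` (timestamp) fields; the function names and toy data are illustrative and not taken from the study’s code.

```python
def generalized_split(samples, train_users, test_user):
    """Generalized scheme: train on data from other users, test on a held-out user."""
    if test_user in train_users:
        raise ValueError("test user must not appear among the training users")
    train = [s for s in samples if s["user"] in train_users]
    test = [s for s in samples if s["user"] == test_user]
    return train, test


def individualized_split(samples, user, early_fraction):
    """Individualized scheme: sort one user's data chronologically, train on
    the early fraction, and test on the remainder."""
    own = sorted((s for s in samples if s["user"] == user), key=lambda s: s["t"])
    cut = int(len(own) * early_fraction)
    return own[:cut], own[cut:]


# Toy records for three hypothetical users, deliberately stored out of order.
samples = [
    {"user": u, "t": t, "engaged": t % 3 != 0}
    for u in (1, 2, 3)
    for t in reversed(range(10))
]

gen_train, gen_test = generalized_split(samples, {1, 2}, 3)
ind_train, ind_test = individualized_split(samples, 3, 0.5)
```

In both cases every training sample is available before the test samples are observed, which is what makes the schemes feasible on-line; a random split does not have this property.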
As shown in Figure 3A, generalized and individualized models achieved approximately 90% AUROC. For generalized models, the number of training users had little effect on AUROC; models trained on six users resulted in only a 3% improvement in AUROC over models trained on one user. On the other hand, individualized models performed better with additional data; models trained on the first 10% of a user’s data only achieved 77% AUROC, whereas models trained on the first 50% of a user’s data achieved 87% AUROC. Overall, both generalized and individualized models achieved comparable AUROC to models trained on random samples, which obtained 90% AUROC by training on as little as 30% of data across all users.
However, generalized and individualized models differ from models trained on random samples when considering other ML evaluation metrics such as precision and recall. For a given class (engagement or disengagement), precision measures the proportion of predictions of the class that are correct, and recall measures the proportion of actual instances of the class that are predicted correctly. As Figure 3B shows, there is an especially large difference between models in recall for disengagement. On average, training on random samples resulted in 82% recall for disengagement, whereas training on different users and early subsets resulted in only 61% and 45% recall, respectively. This indicates that generalized and individualized models would produce a high number of false negatives for detecting disengagement if implemented on-line in HRI. Supplementary Tables S4, S5, and S6 contain detailed model results for all approaches and evaluation metrics.
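As a worked illustration of these definitions, the sketch below computes precision and recall for the disengaged class on toy labels. The encoding (0 for disengaged, 1 for engaged), the function name, and the data are assumptions made for illustration.

```python
def precision_recall(y_true, y_pred, target):
    """Precision and recall for one class (`target`) of a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Assumed encoding: 0 = disengaged, 1 = engaged (hypothetical toy data).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

dis_precision, dis_recall = precision_recall(y_true, y_pred, target=0)
false_negative_rate = 1.0 - dis_recall  # share of disengaged instances that were missed
```

Here two of the four disengaged instances are detected, so recall for disengagement is 0.5 and the false negative rate is also 0.5, even though most predictions overall are correct.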
Variance in Data Across Users, Sessions, and Engagement States
This work’s long-term, real-world setting resulted in significantly different means and variances of data across participants, sessions, and engagement states, as shown in Figure 4. The figure compresses recorded data with high face detection confidence to two dimensions using principal component analysis (PCA), a commonly used unsupervised dimensionality reduction technique. Plotting compressed data reveals limited overlap between two participants (Figure 4A) and two sessions from the same participant (Figure 4B). Additionally, Figure 4C shows a higher variance in data when participants are disengaged, which may explain the low recall values for disengagement reported in the previous section. Supplementary Figures S1 and S2 show similar visualizations for all participants and all sessions for the same participant in Figure 4B.
Statistical analysis confirmed that both the means and variances of features differed significantly () across participants, sessions, and engagement states. A one-way analysis of variance (ANOVA) was used to test differences in means, and an F-test was used to test differences in variance. Tests were performed on the principal components of all data.
Detecting Disengagement Sequences Using Temporal Patterns
This study demonstrates the importance and feasibility of detecting longer sequences of disengagement using the temporal structure of model predictions. Engagement sequences (ES) are periods in the interaction when the user is continuously engaged, while disengagement sequences (DS) are periods when the user is continuously disengaged. As Figure 5 shows, the duration of ES had an interquartile range of 5.0 to 27.0 seconds (s) while the duration of DS had an interquartile range of 2.5s to 9.5s. This work defines long DS as having a duration greater than the upper quartile (9.5s) and short DS as having a duration less than the lower quartile (2.5s). Long DS accounted for 75% of the total time users were disengaged, whereas short DS accounted for only 5% of the total time users were disengaged. This suggests that re-engagement strategies should focus on long DS despite the presence of many shorter sequences.
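The sequence analysis above amounts to run-length encoding the binary labels and then summing durations. A minimal sketch follows; the 0.5 s spacing between labels, the 0/1 encoding, and the toy label stream are assumptions for illustration, while the 9.5 s cutoff is the upper quartile reported above.

```python
from itertools import groupby


def sequences(labels, step_s):
    """Run-length encode a label stream into (label, duration in seconds) pairs."""
    return [(label, sum(1 for _ in run) * step_s) for label, run in groupby(labels)]


def share_of_disengaged_time(labels, step_s, min_duration_s):
    """Fraction of all disengaged time that falls in disengagement sequences
    longer than `min_duration_s`."""
    ds = [d for label, d in sequences(labels, step_s) if label == 0]
    total = sum(ds)
    long_time = sum(d for d in ds if d > min_duration_s)
    return long_time / total if total else 0.0


# Assumed encoding: 1 = engaged, 0 = disengaged, one label every 0.5 s.
labels = [1] * 10 + [0] * 4 + [1] * 6 + [0] * 24 + [1] * 4 + [0] * 2
runs = sequences(labels, step_s=0.5)
long_share = share_of_disengaged_time(labels, step_s=0.5, min_duration_s=9.5)
```

In this toy stream a single 12 s sequence accounts for 80% of the disengaged time, mirroring the observation that a few long DS dominate total disengagement.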
The results and insights from the data suggest the following strategy for determining when to initiate re-engagement actions (RA): 1) average a model’s predicted probability of engagement over a given window, and then 2) initiate RA if the average engagement probability is less than a given threshold. This approach should maximize the percentage of long DS with RA and minimize the percentage of ES with RA. Other considerations include the percentage of short DS with RA, the median duration of DS with RA, and the median elapsed time in DS before RA. The window length and threshold affect these evaluation metrics, so the choices for these parameters should depend on the intervention design and the implemented RA.
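The two-step strategy can be sketched as a trailing moving average compared against a threshold. The function name, the 1 s spacing between predictions, and the toy probabilities are assumptions for illustration; a deployed system would likely also add a cooldown so that RA are not repeated at every step while the average stays low.

```python
def reengagement_times(probs, step_s, window_s, threshold):
    """Return the times (in seconds) at which the mean predicted engagement
    probability over the trailing window falls below the threshold."""
    n = max(1, round(window_s / step_s))  # number of predictions per window
    triggers = []
    for i in range(n - 1, len(probs)):
        window = probs[i - n + 1 : i + 1]
        if sum(window) / n < threshold:
            triggers.append((i + 1) * step_s)  # time at the end of the window
    return triggers


# Hypothetical predicted engagement probabilities, one per second.
probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1, 0.2, 0.8, 0.9]
triggers = reengagement_times(probs, step_s=1.0, window_s=3.0, threshold=0.35)
```

With these toy values the average first drops below 0.35 at 6 s, three predictions after the probabilities begin to fall, which shows how a longer window delays RA while filtering out brief dips.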
Figure 6 presents a post hoc analysis of the proposed re-engagement strategy. For example, suppose this study initiated RA if the predicted engagement probability was less than 0.35 on average for a 3s window. This approach would have led to RA in 73% of long DS. The median duration of DS with RA would have been 25s, and RA would have occurred 2.5s into these sequences. However, RA would also have occurred in 5% of short DS and 15% of ES.
Varying the window lengths and thresholds highlights the trade-off between maximizing RA in DS and minimizing RA in ES. The window length was negatively correlated with the percentage of long DS () and ES () with RA for a fixed threshold, as shown in Figure 6A. On the other hand, the threshold was positively correlated with the percentage of long DS () and ES () with RA for a fixed window length, as shown in Figure 6B. The reported results are based on generalized models trained on six users. Supplementary Tables S7 and S8 contain results for both generalized and individualized models with additional window length and threshold combinations.
Different Modalities and Model Types
Over the month-long, in-home SAR interventions, a rich multi-modal dataset was collected from which visual, audio, and game performance features were derived to model engagement. To assess each modality’s importance, separate models were created using each feature group. As Figure 7A shows, all modalities together outperformed each individual modality. However, models created using only visual features outperformed those created using audio or game performance features by about 20% AUROC. These results support related work in this journal that also found visual features to be the most significant but multiple modalities to be complementary.
Moreover, analyzing individual features revealed that the results of this work could largely be replicated using only seven key features. Feature analysis was performed using Pearson’s correlation coefficient (), and key features were determined using a threshold of . The key features are: the elapsed time in a session, the number of people in the environment, the direction of the user’s eye gaze, the distance from the camera to the user, the elapsed time since the robot last talked, the count of incorrect responses to game questions, and the confidence value with which the user’s face is being detected in the camera frame. Models using only these seven key features achieved AUROC values within 5% of the results described above.
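Correlation-based feature screening of this kind can be sketched as follows. The threshold used in the study is not reproduced here; the value 0.3, the feature names, and the toy data are placeholders chosen only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature carries no linear information
    return cov / (sx * sy)


def select_key_features(features, labels, min_abs_r):
    """Keep features whose absolute correlation with the labels meets the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson_r(values, labels)) >= min_abs_r
    ]


# Hypothetical toy data: 1 = engaged, 0 = disengaged.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "elapsed_time_s": [10, 20, 30, 40, 50, 60],
    "audio_pitch": [5, 1, 4, 2, 6, 3],
}
key_features = select_key_features(features, labels, min_abs_r=0.3)  # 0.3 is a placeholder
```

Pearson’s coefficient captures only linear association with the label, so a screen like this is best read as a way to simplify the feature set rather than as a measure of each feature’s full predictive value.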
Additionally, this work explored several supervised ML model types, but found tree-based models to be the most successful. The following conventional model types were considered: Naive Bayes, K-Nearest Neighbors, Support Vector Machines, Neural Networks, Logistic Regression, Random Decision Tree Forests, and Gradient Boosted Decision Trees. Of these, Gradient Boosted Decision Trees had the highest AUROC, as shown in Figure 7B, and are the basis for model evaluation metrics reported in previous sections. This work also explored sequential models such as Hidden Markov Models, Conditional Random Fields, and Recurrent Neural Networks, but found these to be less effective than conventional static models.
A few alternative modeling approaches were considered as well. Interestingly, an ensemble of generalized and individualized models did not lead to better results than those approaches applied separately. Other explored approaches included: 1) rule-based models with the key features, 2) deep neural networks with re-weighting techniques, and 3) synthetic oversampling of the disengaged class. None of these approaches outperformed Gradient Boosted Decision Trees.
Robot perception of user engagement is a crucial HRI capability previously unexplored in the context of long-term, in-home SAR interventions for children with ASD. This study is the first to model engagement in this complex setting, and it also differs from previous work by developing supervised ML models intended for real-world deployments. The discussion below focuses on how this work highlights the feasibility and challenges of on-line recognition and response to disengagement in long-term, in-home SAR contexts. These contributions aim to inform the design of more engaging and personalized HRI, and improve SAR’s effectiveness in augmenting learning by children with ASD.
Feasibility of On-line Closed-Loop Implementation
This study presents supervised ML models that are feasible for use in on-line robot perception of and closed-loop response to disengagement. Two types of models are presented for each user: 1) generalized models trained on data from different users; and 2) individualized models trained on data from early subsets of users’ interventions. The visual, audio, and performance features along with the labeled training data used in these models can be obtained in on-line deployments, as discussed further in Materials and Methods. Generalized and individualized models achieved approximately 90% AUROC in this work’s post hoc analysis. Individualized models performed better with additional data, likely because participants had higher engagement in early subsets used for training. Overall, the similar performance of these approaches indicates the possibility of having one model for multiple users. Generalized and individualized models also attained comparable AUROC to models trained on random samples of all participants’ data, which is an ideal but impractical method.
A shortcoming of generalized and individualized models is revealed through a 50% false negative rate for detecting disengagement. This effect is likely due to the higher variance in features when participants were disengaged compared to when they were engaged. Despite the low recall values, a SAR system that accurately recognizes some instances of disengagement can still considerably enhance the interaction by attempting to re-engage the user at appropriate times. Analyzing disengagement sequences (DS), or periods in the interaction when the user is continuously disengaged, shows that 75% of the total time participants were disengaged occurred during DS that were 10 seconds long or longer. Some examples of participant behavior during these long DS included playing with toys, interacting with siblings, and even abruptly leaving the intervention setting. Shorter DS typically involved brief shifts in participant focus to other aspects of the environment. Moreover, most long DS required caregivers to re-engage participants, whereas participants re-engaged on their own in short DS. This suggests that re-engagement strategies should focus on counteracting longer DS.
This work’s post hoc analysis shows generalized and individualized models could be used to reliably initiate re-engagement actions (RA) during long DS. The presented re-engagement strategy would have initiated RA if the average predicted engagement probability over a window of time fell under a set threshold. Using a 0.35 threshold and a 3-second window would have resulted in RA for about 75% of long DS, with the first RA occurring 2.5 seconds into these sequences on average; however, RA would also have been erroneously initiated in 15% of engagement sequences (ES). An exploration of various window lengths and thresholds reveals the trade-off between maximizing RA in DS and minimizing RA in ES. This balance is important for maintaining RA effectiveness, and choices for these parameters should depend on the implemented RA and overall intervention design.
The presented models are also readily interpretable, an important characteristic for facilitating implementation. Interpretability of ML is especially important in the ASD context, where therapists and caregivers need an understanding of the system’s behavior in order to trust and adopt it. The described models achieved interpretability in two ways: 1) through a simplified feature set and 2) through the selected model types. First, this work replicated the described model accuracies to within 5% using seven key features, as described in Results. This shows the problem of modeling engagement to be more tractable and provides insights for the design of future HRI studies in similar contexts. Second, this work found tree-based ML models to be the most effective. Such methods are easier to train and interpret than more complex ML models. Nevertheless, it is unknown whether the effectiveness of tree-based models in this work would generalize to other contexts. The interpretability of this work further demonstrates the feasibility of applying supervised ML to model user engagement on-line in closed-loop HRI.
Challenges of the Real-World In-Home Context
A long-term, real-world HRI setting raises many modeling challenges due to the significant noise and variance in data. The unconstrained home environment in particular presented several unforeseen problems. First, the camera was attached to the top of the game tablet, but its position was frequently disturbed by both caregivers and child participants. For example, some caregivers temporarily turned the camera toward the ground or toward the robot when child participants were taking a break, instead of using the system’s power switch as instructed. As a result, the camera position varied throughout the study, adding noise to the extracted visual features. The audio data in this study also contained a high level of background noise, including sounds from television, pets, kitchen appliances, and lawn mowers. All participating families chose to place the SAR system in their living rooms, so external parties such as siblings, friends, and neighbors regularly interrupted sessions. The camera sometimes failed to capture all individuals in the environment, as the system was not designed for multi-party interactions. Finally, the variance in data was higher in this study given that participants were children with ASD, who display atypical and highly diverse behaviors.
Substantial variance in data leaves supervised ML models vulnerable to overfitting. To mitigate this risk, standard ML practices such as bagging, boosting, and early stopping were used, as discussed below in Materials and Methods. In spite of the challenges of a real-world setting, generalized and individualized models achieved AUROC values around 90% and could reliably initiate re-engagement actions in long sequences of disengagement. However, these models only had 50% recall for disengagement. Further improving the false negative rate is a key area of future work.
Limitations and Future Work
As discussed in the previous section, a key modeling challenge in real-world HRI is the substantially increased variance in data, especially when users are disengaged. The solution to this problem in traditional ML is to obtain a large sample of labeled training data. This is not always feasible in HRI and is especially complex for atypical user populations. Moreover, this challenge is especially acute in the ASD context, where high variance in behaviors manifests not only between individuals but also within each individual.
Active learning (AL) is a promising approach to this challenge, as it seeks to automatically select the most informative instances that need labeling. Preliminary work has shown AL to successfully train models of user engagement with a small amount of data. However, AL is yet to be validated in long-term, real-world settings, as discussed previously in this journal. A future approach could first use supervised ML to train base models on available labeled data from other users or a user’s beginning sessions, as done in this work. Then, AL could be applied to decide when to request a label for unseen data. A therapist or caregiver could provide the labels off-line after sessions, allowing the model to iteratively improve in a long-term setting.
Ultimately, the most important direction for future work is to deploy ML frameworks on-line in real-world HRI and SAR. Such deployments are critical for understanding how well models recognize disengagement under realistic constraints of noise, uncertainty, and variance in data. When implemented on-line, these models could inform the activation of robot re-engagement actions; specifically, these could entail verbal and non-verbal robot responses such as socially supportive gestures and phrases. Overall, on-line recognition of and response to disengagement will enable the design of more engaging, personalized, and effective HRI, especially in SAR for the ASD community.
Materials and Methods
The presented engagement models are based on data from month-long, in-home SAR interventions with children with ASD. During child-robot interactions, participants played a set of space-themed math games on a touchscreen tablet while a 6-degree-of-freedom Stewart platform robot named Kiwi provided verbal and expressive feedback, as shown in Figure 1. The study was approved by the Institutional Review Board of the University of Southern California (UP-16-00755), and informed consent was obtained from the children’s caregivers. The 7 child participants in this work had a clinical diagnosis of ASD in mild to moderate ranges as described in the Diagnostic and Statistical Manual of Mental Disorders. Supplementary Table S1 reports the ages and genders of the participants: ages were between 3 years, 11 months and 7 years, 2 months; 3 were female and 4 were male. An earlier article provides further details about the SAR system and intervention design, with a focus on how the robot’s feedback and instruction challenge levels were personalized to each user’s unique learning patterns using on-line reinforcement learning.
Over the course of the month-long, in-home study, a large multi-modal dataset was collected from which visual, audio, and game performance features were derived to model engagement. Due to numerous technological challenges commonly encountered in noisy real-world studies, this work only considers approximately 21 hours of interaction from 7 participants who had sufficient multi-modal data. Participant 4 had the maximum interaction time analyzed (3 hours, 48 minutes), and Participant 6 had the minimum interaction time analyzed (1 hour, 52 minutes). Data collected in individual sessions were truncated to only use the content between the first and last game, because session data often included unstructured interactions before and after the games. Each participant was given a tutorial session as an introduction to the SAR system; data from that session were not included in the analysis.
A USB camera mounted at the top of the game tablet recorded a front view of the participants. Visual and audio features were extracted from this camera data using OpenFace, OpenPose, and Praat, open-source data processing tools feasible for on-line use in HRI. Visual features derived from OpenFace included: 1) the face detection confidence value; 2) eye gaze direction; 3) head position; and 4) facial expression features. OpenPose was only used to estimate the number of people in the environment, since the camera’s field of view centered on the user’s face. Audio features derived from Praat included pitch, frequency, intensity, and harmonicity. Game performance features were also derived from system recordings and included the challenge level of the game being played, the count of incorrect responses to game questions, and the elapsed time in a session, in a game, and since the robot last talked. Supplementary Note S1 lists all visual, audio, and game performance features used for modeling engagement.
In this work, a participant was annotated to be engaged or disengaged using standard definitions of engagement as a combination of behavioral, affective, and cognitive constructs . Specifically, a participant was considered to be engaged when paying full attention to the interaction, immediately responding to the robot’s prompts, or seeking further guidance from others in the room. The lead author of this work annotated whether a participant was engaged or disengaged for the seven participants. To verify the absence of bias, two additional annotators independently annotated 20% of the data for each participant; inter-rater reliability was measured using Fleiss’ Kappa, and a reliability of was achieved between the primary and verifying annotators. Supplementary Table S2 contains the specific criteria followed by all annotators.
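For reference, Fleiss’ Kappa compares the observed agreement among annotators with the agreement expected by chance. The following is a minimal Python sketch of that computation; the function name `fleiss_kappa` and the toy counts are illustrative assumptions and do not reproduce the study’s annotation data or analysis code.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j (same number per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement: mean over items of pairwise annotator agreement.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items
    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(s * s for s in shares)
    return (p_observed - p_chance) / (1 - p_chance)


# Toy example (hypothetical): 4 intervals, 3 annotators,
# columns = [engaged, disengaged].
toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
kappa = fleiss_kappa(toy_counts)
```

A value of 1 indicates perfect agreement, while values near 0 indicate agreement no better than chance.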
This work applied and evaluated conventional supervised ML methods to model user engagement in month-long, in-home SAR interventions for children with ASD. First, a few preprocessing techniques were applied to the multi-modal dataset described in the previous section to address missing data and possible errors in the fusion of modalities. While data were obtained for each camera frame at a standard rate of 30 frames per second, this work considered the median value of features and annotations in overlapping one-second intervals (i.e., 0 to 1 second, 0.5 to 1.5 seconds, 1 to 2 seconds, etc.). The following features were added for each interval: 1) the variance of continuous-valued features in the interval; and 2) an indicator for whether discrete-valued features changed in the interval. This also addressed the problem of low OpenFace confidence for detecting the user’s face; low confidence occurred in 38% of camera frames overall, but persisted for only 3 consecutive frames on average. Furthermore, all features were standardized to have zero mean and unit variance since raw values were measured on different scales. The means and variances of each feature needed for standardization were obtained with respect to the training set in order to maintain the feasibility of on-line implementation.
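The preprocessing described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the study’s code: the function names, the use of population variance, and the upward tie-breaking for medians of discrete values are all assumptions, and the toy signals are hypothetical.

```python
from statistics import mean, median, median_high, pstdev, pvariance

FPS = 30  # camera frame rate reported in the text


def window_features(continuous, discrete, labels, fps=FPS, width_s=1.0, step_s=0.5):
    """Summarize per-frame signals over overlapping windows (1 s wide, 0.5 s step)."""
    width, step = int(fps * width_s), int(fps * step_s)
    rows = []
    for start in range(0, len(labels) - width + 1, step):
        cont = continuous[start:start + width]
        disc = discrete[start:start + width]
        rows.append({
            "cont_median": median(cont),
            "cont_var": pvariance(cont),          # assumption: population variance
            "disc_median": median_high(disc),     # assumption: ties resolved upward
            "disc_changed": int(len(set(disc)) > 1),
            "label": median_high(labels[start:start + width]),
        })
    return rows


def fit_standardizer(train_values):
    """Mean and standard deviation estimated from the training set only."""
    mu, sigma = mean(train_values), pstdev(train_values)
    return mu, (sigma if sigma > 0 else 1.0)  # guard against constant features


def standardize(values, mu, sigma):
    """Apply training-set statistics to any (training or later) values."""
    return [(v - mu) / sigma for v in values]


# Toy signals (hypothetical): 3 seconds of frames at 30 frames per second.
continuous = [float(i) for i in range(90)]
discrete = [0] * 45 + [1] * 45
labels = [1] * 60 + [0] * 30
rows = window_features(continuous, discrete, labels)

# Statistics come from the earlier windows and are reused on later ones.
mu, sigma = fit_standardizer([r["cont_median"] for r in rows[:3]])
scaled_later = standardize([r["cont_median"] for r in rows[3:]], mu, sigma)
```

Reusing the training-set mean and standard deviation on later data mirrors the on-line constraint noted above, where statistics of future data are not available at prediction time.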
To model user engagement, this work used two supervised ML approaches that are practical for on-line implementation in closed-loop HRI: 1) generalized models trained on data from different users; and 2) individualized models trained on data from early subsets of the users’ interventions. Generalized models were implemented by training on data from a given subset of participants. The models were then tested on the remaining users not in the training subset. All possible combinations of participants were considered; since there were 7 participants, the sizes of the training and testing subsets each ranged from 1 to 6. Individualized models were developed by sorting a user’s data chronologically and using an early subset for training and a later subset for testing the model. In particular, this evaluation was applied to training sets from the first 10% to the first 90% of a user’s data, in increments of 10%. This approach was used in order to standardize the training sets across differences in participant interaction times; future implementations could use beginning sessions as training data instead. The generalized and individualized approaches are both feasible for on-line use in HRI deployments; the labeled training data required for supervised ML models can be obtained in both cases. In order to evaluate these approaches, models trained and tested on distinct random samples of all users’ data were also implemented despite being impractical for on-line use. The proportions of training data evaluated in the random sampling approach also ranged from 10% to 90%, in increments of 10%.
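The two evaluation schemes can be enumerated as in the short Python sketch below. The function names and the sample count of 200 are hypothetical; only the number of participants and the 10% increments come from the text.

```python
from itertools import combinations


def generalized_splits(participants):
    """Yield every (train, test) partition with 1..len-1 training participants."""
    for k in range(1, len(participants)):
        for train in combinations(participants, k):
            test = tuple(p for p in participants if p not in train)
            yield train, test


def individualized_splits(n_samples):
    """Yield chronological splits: the first 10%..90% of a user's samples
    for training and the remainder for testing."""
    for pct in range(10, 100, 10):
        cut = n_samples * pct // 100
        yield pct, range(cut), range(cut, n_samples)


participants = list(range(1, 8))          # 7 participants, as in the study
gen = list(generalized_splits(participants))
ind = list(individualized_splits(200))    # 200 is an arbitrary sample count
```

With 7 participants, the generalized scheme yields 126 train/test partitions, and the individualized scheme yields 9 chronological splits per user.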
Using the generalized, individualized, and random sampling approaches, this work implemented several supervised ML model types. All considered model types are reported in the Results section; Gradient Boosted Decision Trees were the most successful. Specifically, this work implemented Gradient Boosted Decision Trees with early stopping and bagging. Boosting algorithms train weak learners sequentially, with each learner trying to correct its predecessor. Early stopping partitions a small percentage of the training data for validation, and ends training when performance on the validation set stops improving. Bagging fits several base classifiers on random subsets of the training data, and then aggregates the individual predictions to form a final prediction. These techniques were adopted to mitigate the increased risk of overfitting in high variance datasets, as was the case in this long-term, in-home study.
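To make the three techniques concrete, the following pure-Python sketch combines them on a single toy feature. It is a simplified stand-in and not the XGBoost configuration used in this work: the weak learners are one-split regression stumps, and the learning rate, patience, subset fraction, and toy data are all assumed values chosen for illustration.

```python
import math
import random

LEARNING_RATE = 0.3  # assumed value for this sketch


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, targets):
    """Least-squares regression stump on one feature: (threshold, left, right)."""
    avg = sum(targets) / len(targets)
    best = (float("inf"), float("inf"), avg, avg)  # fallback: constant prediction
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]


def score(stumps, x):
    return sum(LEARNING_RATE * (lv if x <= t else rv) for t, lv, rv in stumps)


def log_loss(stumps, data):
    total = 0.0
    for x, y in data:
        p = min(max(sigmoid(score(stumps, x)), 1e-9), 1 - 1e-9)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def fit_boosted(data, max_rounds=20, patience=5, val_frac=0.2):
    """Boosting with early stopping on a held-out slice of the training data."""
    n_val = max(1, int(len(data) * val_frac))
    val, train = data[:n_val], data[n_val:]
    xs = [x for x, _ in train]
    stumps, best_loss, best_len = [], float("inf"), 0
    for _ in range(max_rounds):
        # Each new weak learner fits the residual errors of its predecessors.
        residuals = [y - sigmoid(score(stumps, x)) for x, y in train]
        stumps.append(fit_stump(xs, residuals))
        loss = log_loss(stumps, val)
        if loss < best_loss:
            best_loss, best_len = loss, len(stumps)
        elif len(stumps) - best_len >= patience:
            break  # validation loss has stopped improving
    return stumps[:best_len]


def fit_bagged(data, n_models=5, subsample=0.8, seed=0):
    """Bagging: fit one boosted model per random subset of the training data."""
    rng = random.Random(seed)
    k = int(len(data) * subsample)
    return [fit_boosted(rng.sample(data, k)) for _ in range(n_models)]


def predict_proba(models, x):
    """Aggregate by averaging the predicted probabilities of all models."""
    return sum(sigmoid(score(m, x)) for m in models) / len(models)


# Toy data (hypothetical): one noisy feature, label 1 when the feature is positive.
rng = random.Random(1)
toy = []
for _ in range(150):
    x = rng.uniform(-1, 1)
    toy.append((x, 1 if x + rng.gauss(0, 0.2) > 0 else 0))
models = fit_bagged(toy)
```

Each boosted model keeps only the stumps up to its best validation round, and the bagged prediction averages the probabilities of the individual models.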
The ML models were implemented in Python using the following libraries: Scikit-learn version 0.21.3, XGBoost version 0.90, Hmmlearn version 0.2.1, CRFsuite version 0.12, TensorFlow version 1.15.0, and Keras version 2.2.4. All models were implemented with default hyperparameters, as specified in Supplementary Table S3. Default hyperparameters were used because the variance in data caused commonly used strategies such as cross validation and grid search to overfit to the training data. All reported model results are from Scikit-learn implementations, except for Gradient Boosted Decision Trees, which were implemented using XGBoost for improved computational performance. Neural Networks were also implemented in TensorFlow and Keras, and had similar performance to the reported results from Scikit-learn. Sequential models were explored using Hmmlearn, CRFsuite, and Keras.
The authors sincerely thank K. Mahajan and L. Mathur for their help with data analysis, K. Peng and J. Keller for their assistance with annotations, and T. Groechel, L. Klein, and R. Pakkar for their advice and support. In addition, the authors are very grateful to G. Ragusa for her key role in the original study design, recruitment, assessments, and more. The entire research team especially thanks the children and families who participated in the study that generated the dataset.
This research was supported by the National Science Foundation Expedition in Computing Grant NSF IIS-1139148.
S.J. led this work’s conceptualization and investigation. S.J., B.T., and Z.S. processed the data, implemented the methods, and analyzed the results. C.C. designed the study and managed the deployments that generated the datasets used in this work. M.M. was the lead faculty advisor for the overarching study on SAR for ASD and for all research, data collection, and analysis conducted. S.J., B.T., Z.S., and M.M. all contributed to the writing of this article.
M.M. is a co-founder of Embodied, Inc. but has not been involved with the company since December 2016. C.C. is now a full-time employee of Embodied, Inc. but was not involved with the company while the reported work was done.
Data and materials availability:
All data needed to evaluate the conclusions that can be released under USC IRB policies are included in the article or the Supplementary Materials. Please contact S.J. and M.M. for questions about other materials.
File S1 (.csv format). Dataset for engagement models. Google Drive Link.
File S2 (.csv format). Descriptions of columns in File S1. Google Drive Link.
Note S1. List of multi-modal features.
Figure S1. Comparing data across users.
Figure S2. Comparing data across sessions.
Table S1. Participant demographic information.
Table S2. Engagement annotation criteria.
Table S3. Model hyperparameters.
Table S4. Generalized model results.
Table S5. Individualized model results.
Table S6. Random sampling model results.
Table S7. Re-engagement strategy evaluation using generalized models.
Table S8. Re-engagement strategy evaluation using individualized models.
[1] (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
[2] (2013) Diagnostic and statistical manual of mental disorders (DSM-5®). American Psychiatric Pub.
[3] (2018) Autism spectrum disorders. Technical report, World Health Organization.
[4] (2015-05) Cross-dataset learning and person-specific normalisation for automatic action unit detection. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 06, pp. 1–6.
[5] (2018-05) OpenFace 2.0: facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face Gesture Recognition (FG 2018), pp. 59–66.
[6] (2017) UE-HRI: a new dataset for the study of user engagement in spontaneous human-robot interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI ’17, New York, NY, USA, pp. 464–472.
[7] (2002) Praat, a system for doing phonetics by computer.
[8] (2011) SMOTE: synthetic minority over-sampling technique. CoRR abs/1106.1813.
[9] (1997-07) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 30 (7), pp. 1145–1159.
[10] (2013-10) Applying behavioral strategies for student engagement using a robotic educational agent. In 2013 IEEE International Conference on Systems, Man, and Cybernetics, pp. 4360–4365.
[11] OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008.
[12] (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 785–794.
[13] (2015) Keras. https://keras.io.
[14] (2018) Attentiveness of children with diverse needs to a socially assistive robot in the home. In 2018 International Symposium on Experimental Robotics (ISER).
[15] (2019) Long-term personalization of an in-home socially assistive robot for children with autism spectrum disorders. Frontiers in Robotics and AI.
[16] (2018) Robots for the people, by the people: personalizing human-machine interaction. Science Robotics 3 (21).
[17] (2019) Escaping Oz: autonomy in socially assistive robotics. Annual Review of Control, Robotics, and Autonomous Systems 2 (1), pp. 33–61.
[18] (2019) Escaping Oz: autonomy in socially assistive robotics. Annual Review of Control, Robotics, and Autonomous Systems 2 (1), pp. 33–61.
[19] (2019) Data and statistics on autism spectrum disorder. Technical report, Centers for Disease Control and Prevention.
[20] (2018-05) Some brief thoughts on the past and future of human-robot interaction. ACM Trans. Hum.-Robot Interact. 7 (1), pp. 4:1–4:3.
[21] (2012) The clinical use of robots for individuals with autism spectrum disorders: a critical review. Research in Autism Spectrum Disorders 6 (1), pp. 249–262.
[22] (2005) Defining socially assistive robotics. In Rehabilitation Robotics, 2005. ICORR 2005. 9th International Conference on, pp. 465–468.
[23] (2017-05) How to build a supervised autonomous system for robot-enhanced therapy for children with autism spectrum disorder. Paladyn, Journal of Behavioral Robotics 8, pp. 18–38.
[24] (2015) Hmmlearn: hidden Markov models in Python.
[25] (2014) Learning-based modeling of multimodal behaviors for humanlike robots. In Proceedings of the 2014 ACM/IEEE International Conference on Human-robot Interaction, HRI ’14, New York, NY, USA, pp. 57–64.
[26] (2017-06) Audio-based emotion estimation for interactive robotic therapy for children with autism spectrum disorder. In 2017 14th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), pp. 39–44.
[27] (2019) Application of interpretable machine learning models for the intelligent decision. Neurocomputing 333, pp. 273–283.
[28] (2016) Socially assistive robotics. In Springer Handbook of Robotics, pp. 1973–1994.
[29] (2017) Socially assistive robotics: human augmentation versus automation. Science Robotics 2 (4).
[30] (2018) Science nation: socially assistive robots for children on the autism spectrum. YouTube.
[31] (2007) CRFsuite: a fast implementation of conditional random fields.
[32] (2008) Behavioural and developmental interventions for autism spectrum disorder: a clinical systematic review. PLoS ONE 3 (11), pp. e3755.
[33] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
[34] Learning to reweight examples for robust deep learning. CoRR abs/1803.09050.
[35] (2018) Personalized machine learning for robot perception of affect and engagement in autism therapy. Science Robotics 3 (19).
[36] (2019) Multi-modal active learning from human data: a deep reinforcement learning approach. CoRR abs/1906.03098.
[37] (2012) Robots for use in autism research. Annual Review of Biomedical Engineering 14, pp. 275–294.
[38] (2018) Improving social skills in children with ASD using a long-term, in-home social robot. Science Robotics 3 (21).
[39] (2016) Towards autonomous moderation of an assembly game. In RO-MAN 2016.
Note S1. List of multi-modal features.
This work is based on a large multi-modal dataset from month-long, in-home SAR interventions for children with ASD. Video and audio features were extracted from camera data using OpenFace, OpenPose, and Praat. Game performance features were derived from the tablet interactions. A list of all features used for modeling engagement is included here.
Visual Features:
Face Detection: 1) confidence value for face detection, 2) binary success value for face detection, and 3) elapsed time since last successful detection;
Eye Gaze: 1) eye gaze direction vector for left eye, right eye, and both eyes (average), and 2) Euclidean distance from camera to endpoint of participant’s gaze (using intersection of gaze with vertical camera plane);
Head Position: 1) orientation of head in terms of pitch, roll, and yaw movements, and 2) Euclidean distance from camera to location of the participant’s head;
Facial Expression: The following OpenFace Action Units were used: 1) inner brow raiser, 2) outer brow raiser, 3) brow lowerer, 4) upper lid raiser, 5) cheek raiser, 6) lid tightener, 7) nose wrinkler, 8) upper lip raiser, 9) lip corner puller, 10) dimpler, 11) lip corner depressor, 12) chin raiser, 13) lip stretcher, 14) lip tightener, 15) lips part, 16) jaw drop, 17) lip suck, and 18) blink;
Environment: 1) estimated number of people in the environment.
Audio Features: 1) harmonicity: measure of voice quality, 2) intensity: power carried by sound waves per unit area, 3) frequency: Mel-frequency cepstral coefficients, and 4) pitch frequency and periodicity.
Game Performance Features: 1) challenge level of game being played, 2) count of incorrect responses to game questions in current game and overall session, 3) number of games played in the session, 4) game type, and 5) elapsed time in a session, game, and since the robot last talked.
Table S1. Participant demographic information.
|Gender|Age|
|---|---|
|Male|4 years, 6 months|
|Female|4 years, 11 months|
|Male|4 years, 5 months|
|Male|7 years, 2 months|
|Female|3 years, 11 months|
|Male|4 years, 6 months|
|Female|7 years, 2 months|
Table S2. Engagement annotation criteria.
|Engaged|Disengaged|
|---|---|
|paying full attention to the interaction|not paying full attention to the interaction|
|present in the camera frame and in front of the system|not present in the camera frame or away from the system|
|attention directed toward the tablet or robot; listening or responding to the robot’s prompts, interacting with the tablet|when prompted, not listening or responding to the robot, not interacting with the tablet|
|looks or turns away from the system for further guidance or feedback from others in the room|looks or turns away from the system aimlessly or because of a distraction, not for help|
|very interested with minimal or no incentive from others in the room|only interested if prompted by others in the room or unresponsive to others in the room|
|talking or arguing about the games|talking or arguing not about the games|
Table S3. Model hyperparameters.
|Model|Hyperparameter|Description|
|---|---|---|
|Naive Bayes|type = Gaussian|assume Gaussian distributed features|
||var_smoothing = 1e-09|variance proportion added for stability|
|K-Nearest Neighbors|n_neighbors = 5|number of neighbors considered|
||weights = uniform|neighbors are weighted equally|
|Support Vector Machines|penalty = l2|L2 norm used for penalization|
||loss = squared_hinge|square of hinge loss function|
|Logistic Regression|penalty = l2|L2 norm used for penalization|
||solver = liblinear|algorithm for optimization problem|
|Random Forest|n_estimators = 100|100 decision trees in random forest|
||max_depth = None|nodes expanded until all leaves pure|
||max_features = None|all features considered for node split|
|Gradient Boosted Decision Trees|n_estimators = 100|100 decision trees in aggregate|
||max_depth = 6|maximum depth of the tree|
||booster = gbtree|algorithm for gradient boosting|
|Neural Networks|hidden_layer_sizes = (100)|one hidden layer with 100 nodes|
||activation = relu|ReLU activation function|
||solver = adam|algorithm for optimization problem|
|Hidden Markov Models|type = Gaussian|assume Gaussian emissions|
||n_components = 2|2 hidden states: engaged, disengaged|
||algorithm = viterbi|algorithm for optimization problem|
|Conditional Random Fields|solver = lbfgs|algorithm for optimization problem|
||regularizer = elastic_net|elastic net (L1 and L2) regularization|
|Recurrent Neural Networks|hidden_layer_sizes = (100)|one LSTM layer with 100 nodes|
||activation = relu|ReLU activation function|
||solver = adam|algorithm for optimization problem|
|Window||Threshold||Long DS||ES||Short DS||DS Length||Re-engage Point|
|Window||Threshold||Long DS||ES||Short DS||DS Length||Re-engage Point|