The development of socially interactive robots has inspired research into various applications for these tools. One application is in therapy and care, where robots can be used to provide daily support to patients, to augment interventions, and to provide quantitative data for clinicians. We specifically consider the use of robots in interventions for children with Autism Spectrum Disorder (ASD). The Diagnostic and Statistical Manual of Mental Disorders (DSM-5) defines ASD as a neuro-developmental disorder characterized by persistent deficits in social communication and interaction, and restricted or repetitive behaviours and interests. Diagnosing ASD involves experts' subjective interpretation of observations of a child’s behaviour made by clinicians and caregivers. This subjectivity, together with the clinical heterogeneity typical of ASD cases, means that the diagnostic process could be improved through the use of more quantitative, objective measures of child behaviour. Such measures can be obtained using behaviour classification systems.
Developing an artificial system to recognize ASD symptoms is not a straightforward task, owing to the social nature of ASD: correct classification of social and interaction behaviour often requires the ability to infer the internal states (e.g. intentions, emotions) of the observed individual. For example, identifying when a child fails to ask for comfort when needed (a symptom of ASD) requires that the observer recognize that the child is experiencing a negative internal state. However, endowing robots with this skill would provide numerous benefits for ASD interventions. For instance, if an intervention involves regular interaction with a social robot, it would be useful to have the robot able to report quantitative diagnostic information. Firstly, clinicians could use this information to track their patient’s progress through the intervention, or to support their initial diagnostic decision. Secondly, the robot itself could use internal-state and diagnostic information to autonomously decide on appropriate behaviours to perform.
In approaching the problem of developing artificial systems able to recognize human internal states, there are two key questions which must be addressed: (1) what internal state information is available in behaviours which can be assessed and quantified by artificial systems, and (2) how can these states be represented by a classification system to provide both detailed assessments and flexible behavioural responses from a social robot. The rest of this paper discusses possible answers to these questions in the context of quantifying the diagnostic behaviours of children with ASD. We present two studies carried out as a proof-of-concept to demonstrate that the internal state of task engagement could be classified based on observable human movement information, and that this classification could be done by a system able to represent internal states as points along a continuous dimension. The logic behind our choice of internal state and its desired representation is described, where relevant, in the introductions to each experiment.
II Experiment 1
Whilst most ASD symptoms cannot be described as wholly overt, many have been linked with directly observable behaviours. For example, motor skills have been shown to be predictive of social communication skills for children with ASD. Additionally, an increased tendency to orient towards non-social contingencies rather than biological motion is indicative of ASD. These and other studies have linked movement and gaze behaviours to several ASD characteristics. Movement and gaze information can be measured or estimated by observing body movements or poses, which can be easily made accessible to artificial systems, e.g. by converting the position of an individual’s joints to coordinates in space. Consequently, we argue that such observable information can be useful for social robots designed to make inferences about human internal states pertaining to ASD symptoms.
Designing a system to recognize this kind of diagnostic information, however, is non-trivial. We would need to have identified how observable behaviours relate to symptomology, and define which symptoms we are best able to recognize and describe in terms of severity based on behavioural data. Given the complexity of obtaining and labelling such data, we chose to perform a proof-of-concept study demonstrating the feasibility of our approach using data from a non-clinical population. We therefore chose to examine whether the internal state of task engagement could be identified and classified into different classes, based on the ‘intensity’ of the experienced state. That is, we aimed to train a classifier to distinguish between ‘high’, ‘intermediate’ and ‘low’ task engagement based on the behaviour of typically developing children. Before a classifier could be implemented, however, we first needed to verify that the internal state of interest (task engagement) was recognizable from the movement information available in our data set.
For this study, the desired data set was defined as one which contained the movement information of humans experiencing, but not intentionally communicating, different levels of a non-emotional internal state. To ensure that the internal state was not being communicated we decided that the subject should not be interacting with another human. With these considerations in mind, the data set for this experiment was taken from the openly available PInSoRo data set (https://freeplay-sandbox.github.io). This data set comprises videos of child-robot pairs interacting with each other and a touch-screen table-top (the sand-tray). We argue that these videos meet the requirements of showing humans experiencing internal states which could be described along a continuum (i.e. engagement with the touch-screen) which were not being actively communicated (i.e. due to the lack of a human interaction partner). The videos have been annotated for a number of behaviours including whether the child was engaged in “goal-oriented”, “aimless” or “no” play. We believe these annotations are analogous to “high”, “intermediate” and “low” levels of task engagement respectively. A preliminary study was designed to validate this assumption.
Five participants (students and employees) were recruited from the University of Plymouth’s School of Computing, Electronics and Mathematics on a volunteer basis. Demographic information was not collected.
A total of forty-five video clips were extracted from the data set for this study. We selected fifteen clips with the annotation “goal-oriented play”, fifteen with the annotation “aimless play” and fifteen with the annotation “no play”. Clip lengths ranged from 12-30 seconds.
After clips were selected we extracted both the full visual scene versions and the movement-alone versions. The movement-alone versions were processed such that they depicted the children’s joint-points, connected by coloured lines, against a black background. These videos act as visual representations of the data used as input for the conceptor-based system in that they depict only movement and pose information by showing the position of the child’s body in each frame.
For each participant the experiment was conducted over two days. Participants watched the full visual scene videos on the first day and were then asked to return the next day when they would watch the movement-alone videos. Participants all received the following instructions before beginning the experiment:
You’re about to watch several videos of children interacting with a touch-screen table-top. The children were able to either play a specific game on the touch-screen, or to do whatever they want. After each clip you will be asked to judge the child’s level of task engagement.
Participants were then given the opportunity to ask any questions they may have had and were instructed about their right to withdraw before beginning the experiment.
This study was created using jsPsych and presented on a desktop computer. Participants were positioned a comfortable distance away from the screen where they could still reach the keyboard and mouse to provide responses. At the beginning of the experiment, the instructions were reiterated. Participants were then presented with a consent form within the experiment script and given two response options. If participants selected the “I consent” option, the experiment proceeded as normal. If participants selected “I do not consent” the experiment was terminated. Participants then viewed nine of each type of clip (a total of twenty-seven clips) presented in a random order. Following each clip, participants were presented with the question “How engaged was the child with their task on the touch screen table-top?”. This question was accompanied by a 7-point Likert scale ranging from 1 = “Not at all Engaged” to 7 = “Highly Engaged”. Participants used this scale to report how engaged they thought the child in the clip had been and then continued on to the next clip.
At the end of the experiment on the first day, participants were given the opportunity to ask any questions they had and were asked to return the next day to complete the second half. On the second day, the experiment proceeded in the same way except participants were shown the movement-alone videos instead of the full visual scene videos. Each participant saw the same twenty-seven clips in both sessions. At the end of the second session participants were fully debriefed on the nature and purpose of the study and were thanked for their participation. Each session took approximately 10-15 minutes to complete.
The following analyses were run using RStudio.
II-B1 Inter-Rater Agreement
The data were analyzed in two main ways. We firstly examined inter-rater agreement by calculating Krippendorff’s alpha for the responses. We initially checked whether participants gave similar responses for each of the three types of videos. To do this, Krippendorff’s alpha was calculated for responses to all of the videos of each type. The alpha scores have been interpreted in terms of the benchmarks outlined by Landis and Koch. Responses showed “fair” agreement for the goal-oriented (high engagement) clips (Krippendorff’s alpha = 0.269) and the no-play (low engagement) clips (Krippendorff’s alpha = 0.267). Responses for aimless (intermediate engagement) clips showed “slight” agreement (Krippendorff’s alpha = 0.171). The low levels of agreement can partially be explained by the fact that there were very few raters (2-4) per clip. As such, we did not expect perfect levels of agreement and argue that the levels obtained suggest a sufficient degree of similarity in participants’ ratings.
We then examined whether participants had higher agreement when viewing the full visual scene clips compared to the movement-alone clips for each clip type. The results of this analysis are reported in Table 1. For the goal-oriented and no-play clips, participants tended to show similar levels of agreement in each condition. However, for the aimless clips, participants demonstrated poor agreement when viewing the movement-alone clips.
|Clip Type||Krippendorff’s Alpha (3 d.p.): Full Visual Scene||Krippendorff’s Alpha (3 d.p.): Movement Alone|
|Goal Oriented||0.382 (fair)||0.368 (fair)|
|Aimless||0.247 (fair)||-0.022 (poor)|
|No Play||0.126 (slight)||0.202 (fair)|
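As a rough illustration of the agreement statistic reported above, the following Python sketch computes Krippendorff's alpha for clips that were each rated by a varying number of raters. It is not the R analysis used in the study: the function name `krippendorff_alpha_interval`, the interval (squared-difference) distance and the example ratings are all assumptions made for illustration, and an ordinal distance would be an equally plausible choice for Likert ratings.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with an interval (squared-difference) distance.

    units: list of lists; each inner list holds the ratings one clip received.
    Clips with fewer than two ratings are ignored, as they contain no pairs.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: squared differences between ratings of the same clip.
    d_o = 0.0
    for u in units:
        pair_sum = sum((a - b) ** 2 for a, b in permutations(u, 2))
        d_o += pair_sum / (len(u) - 1)
    d_o /= n

    # Expected disagreement: squared differences between all pairable ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("no variation in ratings; alpha is undefined")
    return 1.0 - d_o / d_e


# Hypothetical 7-point ratings for four clips, each seen by 2-4 raters.
example_ratings = [[5, 6, 5], [2, 3], [6, 4, 5, 6], [1, 2, 2]]
print(krippendorff_alpha_interval(example_ratings))
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.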
The second set of analyses looked at how participants rated each type of video. The overall mean rating was 4.81 (SD 1.25) for goal-oriented clips, 4.16 (SD 1.52) for aimless clips, and 2.43 (SD 1.54) for no-play clips. An ANOVA revealed a significant main effect of clip type on ratings (F(2,267) = 64.99, p < 0.001). Importantly, a post hoc Tukey test revealed significant differences between all conditions (Tukey’s HSD: all differences > 0.6, all p’s ≤ 0.007; see Table 2).
|Comparison||Difference||Significance (p adj)|
|Goal Oriented vs. Aimless||0.656|
|Goal Oriented vs. No Play||2.348|
|Aimless vs. No Play||1.722|
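To make the structure of the rating analysis concrete, the sketch below computes a one-way ANOVA F statistic and its degrees of freedom by hand in Python. It is only an illustration, not the R code used for the study: the function name `one_way_anova` and the example ratings are hypothetical. Five participants rating nine clips of each type in two sessions gives 90 ratings per clip type and 270 in total, which is consistent with the reported degrees of freedom (2, 267) if ratings from both sessions were pooled; that pooling is an inference rather than something stated in the text.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of ratings."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual ratings around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical ratings for three clip types (not the study data).
goal_oriented = [5, 6, 4, 5, 7]
aimless = [4, 5, 3, 4, 6]
no_play = [2, 3, 1, 2, 4]
print(one_way_anova([goal_oriented, aimless, no_play]))
```

A post hoc test such as Tukey's HSD would then compare each pair of group means while controlling the family-wise error rate; it is omitted here because it requires the studentized range distribution.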
These results demonstrate that participants rated the clips in terms of engagement such that goal-oriented clips showed the highest levels of engagement, no-play clips showed the lowest levels, and aimless clips fell in-between these two extremes. Consequently, we feel our assumption that these annotations reflect different levels of engagement is sufficiently supported for these data to be used to train and test a conceptor-based classifier to recognize engagement based on observable behaviour. The remainder of this paper describes the design and initial tests of such a classifier.
III Experiment 2
In addressing the second question of how to represent internal states, we consider that ASD diagnosis involves ranking behaviours in terms of severity. In this way, behaviours important to ASD diagnosis can be thought of as lying along a continuum of severity. To emulate this we need a classification technique which can identify different ‘levels’ along a continuous dimension. This can be achieved using classical machine learning techniques by training a classifier on examples of each severity level. However, obtaining a large enough training data set for this would be very time-consuming and difficult, owing to the need to have expert commentators provide a severity label for each example. We therefore require a method which can learn several classification categories for each behaviour of interest, using a limited training data set. One approach which is suited to this task is conceptors.
Conceptors are neuro-computational mechanisms that can be used for learning a large number of dynamical patterns based on learned prototypical extremes. This approach assumes that there is a continuum underlying the behaviour. New patterns can be generated by combining and morphing the learned extremes. As such, we argue that conceptors may be appropriate for classifying human internal states. The second study described here tested this hypothesis by designing a conceptor-based system to recognize task engagement from observable human movements.
The data set for this study was again taken from the PInSoRo data set. All of the clips annotated with the labels “goal-oriented play” (high engagement) and “no play” (low engagement) were extracted (total of 354 clips). Clips were preprocessed such that the xyz coordinates of the child’s joints in each frame were taken as the input for the conceptor-based classifier. A subset of “high” (62 clips) and “low” (115 clips) engagement clips made up the training data set. The remaining 177 clips made up the test data set.
III-A2 Conceptor-Based Classifier
The conceptor-based approach is based on a key dynamical phenomenon in Recurrent Neural Networks: “if a ‘reservoir’ is driven by a pattern, the entrained network states are confined to a linear subspace of network state space which is characteristic of the pattern”. In this way the dynamics of a pattern (in our case an overt behaviour for a classifiable activity like engagement) will occupy different regions of the state space, and can be encoded in a conceptor. A conceptor ($C$) acts as a map associated with a pattern ($p$). To build a conceptor-based classifier we computed $M$ conceptors, one for each class in our classifier. To obtain the conceptors an echo state network (ESN) was first created with an input layer of $K$ input units and a hidden layer reservoir of $N$ neurons. For each class $j$ the network was driven, independently, with every training sample $p^j_i$ in that class, according to the ESN state update equation:

$$x(n+1) = \tanh\left(W x(n) + W^{in} p(n+1) + b\right)$$

where $W$ is the reservoir weight matrix, $W^{in}$ the input weight matrix and $b$ a bias vector. This yielded a set of network states $x^j_i(1), \dots, x^j_i(L)$, where $L$ is the number of time-steps in $p^j_i$, from which a state correlation matrix $R^j$ was obtained over all $S^j$ samples, where $S^j$ is the total number of samples for class $j$. Next we computed the conceptor $C^j$ through the equation:

$$C^j = R^j \left(R^j + \alpha^{-2} I\right)^{-1}$$

where $R^j$ is a correlation matrix and $\alpha$ is an “aperture” parameter. For more details, see Jaeger's report on conceptors.
Once we had computed a conceptor matrix $C^j$ for each class, we were able to classify a new sample $p$ from the test set by feeding it into the ESN reservoir to obtain a new state vector $z$. Then, for each conceptor, the “positive evidence” quantity $z^\top C^j z$ was computed. This led to a classification by deciding for $j^* = \arg\max_j z^\top C^j z$ as the class to which the sample belongs. The procedure for the conceptor-based classifier is summarized in Table III.
|Algorithm: Conceptor-based classification.|
|Input: A test sample belonging to one of the classes.|
|Output: The class to which the sample belongs.|
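The classification procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names, the reservoir size, the spectral radius of 0.9, the bias scale, the aperture value of 10 and the synthetic three-channel input are all hypothetical. The positive evidence is also averaged over time-steps here, which is a simplification of using a single state vector per sample.

```python
import numpy as np


def make_reservoir(n_inputs, n_neurons, spectral_radius=0.9, seed=0):
    """Random ESN weights; all scaling choices here are illustrative."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_neurons, n_neurons))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    W_in = rng.standard_normal((n_neurons, n_inputs))
    b = 0.2 * rng.standard_normal(n_neurons)
    return W, W_in, b


def run_reservoir(sample, W, W_in, b):
    """Drive the reservoir with one sample (time-steps x inputs)."""
    x = np.zeros(W.shape[0])
    states = []
    for p in sample:
        x = np.tanh(W @ x + W_in @ p + b)
        states.append(x)
    return np.array(states)  # shape: time-steps x neurons


def compute_conceptor(samples, W, W_in, b, aperture=10.0):
    """Conceptor C = R (R + aperture^-2 I)^-1 from all samples of one class."""
    X = np.vstack([run_reservoir(s, W, W_in, b) for s in samples])
    R = X.T @ X / X.shape[0]  # state correlation matrix
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(R.shape[0]))


def classify(sample, conceptors, W, W_in, b):
    """Pick the class whose conceptor gives the largest mean x^T C x."""
    X = run_reservoir(sample, W, W_in, b)
    evidence = [np.mean(np.einsum("ti,ij,tj->t", X, C, X)) for C in conceptors]
    return int(np.argmax(evidence))


# Synthetic stand-in for joint-coordinate time series (not PInSoRo data):
# each toy class has one "active" input channel plus a little noise.
rng = np.random.default_rng(1)


def toy_sample(active_channel, steps=40, n_inputs=3):
    sample = 0.05 * rng.standard_normal((steps, n_inputs))
    sample[:, active_channel] += 1.0
    return sample


W, W_in, b = make_reservoir(n_inputs=3, n_neurons=30)
conceptors = [
    compute_conceptor([toy_sample(c) for _ in range(5)], W, W_in, b)
    for c in (0, 1)
]
print(classify(toy_sample(1), conceptors, W, W_in, b))
```

In a real setting the input dimension would match the number of joint coordinates per frame, and the aperture and reservoir parameters would need tuning on held-out data.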
The resultant conceptors were tested using previously unseen samples from the high and low engagement categories. The results of this test are shown in Figure 1. Performance is above chance for both classes (high engagement: 60%, low engagement: 75%).
This study demonstrates that it is possible to train a conceptor-based system, on real non-periodic data, to classify between high and low engagement based on observable human behavior. The conceptor-based system successfully learned to recognize high and low engagement from observable human movement. Future work will construct new conceptors by linearly combining these learned conceptors. We will then test whether these new conceptors can be used to recognize intermediate levels of engagement identified in the PInSoRo data set.
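The planned linear combination of learned conceptors could look roughly like the sketch below. It is speculative and only illustrates the idea: the function name `blend_conceptors`, the mixing parameter `mu` and the toy 2x2 matrices are hypothetical, and whether such blended conceptors actually correspond to intermediate engagement is exactly what the future work would need to test.

```python
import numpy as np


def blend_conceptors(c_low, c_high, mu):
    """Linear combination (1 - mu) * c_low + mu * c_high, with mu in [0, 1]."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return (1.0 - mu) * c_low + mu * c_high


# Toy stand-ins for learned "low" and "high" engagement conceptors.
c_low = np.diag([0.9, 0.1])
c_high = np.diag([0.1, 0.9])

# A candidate "intermediate" conceptor halfway between the two extremes.
c_mid = blend_conceptors(c_low, c_high, 0.5)
print(c_mid)
```

Because the blend is a convex combination of two symmetric matrices with eigenvalues between 0 and 1, the result keeps its eigenvalues in that range and can be used in the same evidence computation as the learned conceptors.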
If new conceptors can be generated, this method will show promise for use in providing diagnostic information for clinicians assessing children with ASD. The ability to interpolate between extremes along a continuum means that such a system could be trained on a smaller dataset, whilst still achieving a high level of detail through the generation of multiple intermediate classification categories.
This work is part of the EU FP7 DREAM project (www.dream2020.eu, grant nr. 611391) and the H2020 L2TOR project (www.l2tor.eu, grant nr. 688014).
-  H. M. Van der Loos, D. J. Reinkensmeyer, and E. Guglielmelli, “Rehabilitation and health care robotics.” In Springer handbook of robotics, pp. 1685-1728, 2016.
-  American Psychiatric Association, “Diagnostic and statistical manual of mental disorders: DSM-5.” Author, Washington DC, 5th ed, 2013.
-  C. L. Rogers, L. Goddard, E. L. Hill, L. A. Henry, and L. Crane, “Experiences of diagnosing autism spectrum disorder: a survey of professionals in the United Kingdom.” Autism, vol. 20(7), pp. 820–831, 2016.
-  B. Scassellati, H. Admoni, and M. Matarić, “Robots for use in autism research.” Annual review of biomedical engineering, vol. 14, pp. 275–294, 2012.
-  J. Bradshaw, C. Klaiman, S. Gillespie, N. Brane, M. Lewis, and C. Saulnier, “Walking Ability is Associated with Social Communication Skills in Infants at High Risk for Autism Spectrum Disorder.” Infancy, 2018.
-  A. Klin, D. J. Lin, P. Gorrindo, G. Ramsay and W. Jones, “Two-year-olds with autism orient to non-social contingencies rather than biological motion.” Nature, vol. 459(7244), pp. 257–261, 2009.
-  S. Lemaignan, C. Edmunds, E. Senft, and T. Belpaeme, “The Free-play Sandbox: a Methodology for the Evaluation of Social Robotics and a data set of Social Interactions.” arXiv preprint arXiv:1712.02421, 2017.
-  J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data.” Biometrics, pp. 159–174, 1977.
-  C. Lord, S. Risi, L. Lambrecht, E. H. Cook Jr, B. L. Leventhal, P. C. DiLavore, A. Pickles and M. Rutter, “The Autism Diagnostic Observation Schedule–Generic: A Standard Measure of Social and Communication Deficits Associated with the Spectrum of Autism.” Journal of Autism and Developmental Disorders, vol. 30(3), 2000.
-  H. Jaeger, “Controlling recurrent neural networks by conceptors.” arXiv preprint arXiv:1403.3369, 2014.