
A Deep Learning Approach for Multi-View Engagement Estimation of Children in a Child-Robot Joint Attention task

In this work we tackle the problem of child engagement estimation while children freely interact with a robot in their room. We propose a deep learning-based multi-view solution that takes advantage of recent developments in human pose detection. We extract the child's pose from different RGB-D cameras placed elegantly in the room, fuse the results and feed them to a deep neural network trained for classifying engagement levels. The deep network contains a recurrent layer, in order to exploit the rich temporal information contained in the pose data. The resulting method outperforms a number of baseline classifiers, and provides a promising tool for better automatic understanding of a child's attitude, interest and attention while cooperating with a robot. The goal is to integrate this model in next generation social robots as an attention monitoring tool during various Child-Robot Interaction (CRI) tasks, both for Typically Developed (TD) children and for children affected by Autism Spectrum Disorder (ASD).



I Introduction

As robots become more integrated into modern societies, cases of interaction with humans during daily life activities and tasks are increasing [1]. Human-Robot Interaction (HRI) refers to the communication between robots and humans. This communication can be verbal or non-verbal, remote or proximal. A special case of HRI is Child-Robot Interaction (CRI) [2]. Robots enter children’s lives as companions, entertainers or even educators [3, 4, 5, 6]. Children are very adaptive, quick learners and familiar with new technologies. They have unique communication skills, as they can easily convey or share complex information with little spoken language. In [7, 8, 9], systems that employ multiple sensors and robots were developed and evaluated for children’s speech, gesture and action recognition during CRI scenarios. However, a major challenge is to acquire and maintain the child’s engagement and attention in a CRI task [10].

Robots that assist children are of particular importance in modern research, especially in robot-mediated therapy that targets the development of social skills in children affected by Autism Spectrum Disorder (ASD) [11]. Such children can benefit from interacting with robots, since CRI may help them overcome the impediments posed by face-to-face interaction with humans. Moreover, it is important that the robot’s behaviour can adapt to the special needs of each specific child and maintain an identical behaviour for as long as needed in the intervention process [12].

(a) Class 1
(b) Class 2
(c) Class 3
Fig. 1: Examples of three levels of engagement: (a) Limited attention (class 1), (b) Attention but no cooperation (class 2), (c) Active cooperation (class 3).

One key issue for social robots is the development of their ability to evaluate several aspects of interaction, such as user experience, feelings, perceptions and satisfaction [13]. Human engagement in Human-Robot Interaction (HRI) according to [14] “is a category of user experience characterized by attributes of challenge, positive affect, endurability, aesthetic and sensory appeal, attention, feedback, variety/novelty, interactivity, and perceived user control”. Poggi in [15] refined this by defining engagement as the value that a participant attributes to the goal of being together with other participants within a social interaction, and of continuing this interaction. Given this rich notion of engagement, many studies have explored human-robot engagement [13]. Lemaignan et al. explored the level of “with-me-ness”, measuring to what extent the human is with the robot during an interactive task, in order to assess the engagement level.

Research in HRI has shown a growing interest in modeling human engagement, based on speech and gaze [16], solely on gaze in a human-robot cooperative task [17], or on human pose with respect to the robot from static positions [18, 13].

Engaging children in CRI tasks is of great importance. The social characteristics that robots should have when performing as tutors were examined in [19, 20]. Specific focus is given to estimating the engagement of children with ASD interacting with adults [21] or robots [22]. A study analyzing the engagement of children participating in robot-assisted therapy can be found in [23]. A method for the automatic classification of engagement with a dynamic Bayesian network using visual and acoustic cues and support vector machine classifiers is described in [24]. Another approach considers the facial expressions of children with ASD to evaluate their engagement [25]. A robot-mediated joint attention intervention system using as input the child’s gaze is presented in [26]. A deep learning framework for estimating the child’s affective states and engagement is presented in [27]. These can then be used to optimize the CRI and monitor the therapy progress. In our previous work we have used reinforcement learning for adapting the robot’s behaviors and actions according to the child’s engagement, evaluated in real-time by an expert observing the child, for achieving joint attention on collaborative tasks.


In this paper we explore a deep learning based approach for estimating the engagement of children in a CRI collaborative task that aims to establish joint attention between the child and the robot. The robot tries to elicit behaviors in children while interacting with them. It aims to achieve joint attention with the child through an experiment that tests a primitive social skill. This skill includes attention detection by both agents, attention manipulation from the robot-agent to the child, and social coordination in terms of engaging the child-agent in a handover task, resulting in the child’s understanding of the robot-agent’s intentions and ultimately in a successful collaboration.

Fig. 2: Setup for recording joint attention experiments.

Our method incorporates a multi-view deep-based estimation of the child’s pose while the child is inside a specially arranged room (Fig. 2), interacts with the robot and is able to move freely, i.e. the child is not in a stationary position in front of the robot and the sensor. Thanks to the network of cameras, placed elegantly in the room, the child gets the feeling of being in their own room and is left free to interact with the robot, without being restricted in front of a camera as in other works in the literature. The multi-view fusion helps in handling body part occlusions and provides better pose estimates. An LSTM-based classifier, which takes the child’s pose as input, is trained using observations by experts as targets and classifies the child’s engagement in the task. We experimentally validate our algorithms using the RGB-D data of children who participated in the experimental scenario of the interaction task.

The ultimate goal is to use this framework for estimating the engagement of children in various CRI tasks. We aim to use the engagement information in a robot reinforcement learning framework which uses the engagement monitoring estimates as a reward signal during the non-verbal social interaction, for adapting the robot’s motion combinations and their level of expressivity towards maximizing the child-robot joint attention [29] in various collaborative tasks [30] both for TD and ASD children.

Fig. 3: Multi-view pose estimation overview.

II Method

The main problem addressed in this work is to estimate the engagement level of children from visual cues. The problem is cast as one of multi-class classification, where each class corresponds to a different level of engagement. Specifically, we designate three distinct levels of engagement: the first (class 1) signifies that the child is disengaged, meaning that they are paying limited or no attention to the robot; the second (class 2) refers to a partial degree of engagement, where the child is attentive but not cooperative; the final level (class 3) means that the child is actively cooperating with the robot to complete the handover task. The task details are described in section III-A. Of course, during the course of an interactive session, the engagement level varies. The goal is therefore to perform this classification across a number of fixed-length time segments during the session, rather than producing a single estimate for the entire interaction. In the remainder of this section we describe the proposed method to perform this classification.

II-A Child Pose Estimation

Perhaps the most informative data for recognizing the engagement level in joint attention tasks is the child’s pose. The problem of detecting human pose keypoints in images is a challenging one, due to occlusions and widely varying articulations and background conditions. Only recently has the problem been solved to a satisfactory degree, especially with the introduction of the OpenPose library [31, 32, 33] for 2D keypoint detection. Works on 3D pose estimation are fewer, with most focusing only on color images [34, 35]. One idea to incorporate depth information would be to estimate the 2D keypoints and take the depth values at the corresponding pixels, thus retrieving 3D coordinates. This method, however, ignores potential synergy between the two streams, and is susceptible to errors from noisy depth measurements.

In [36], an end-to-end 3D pose estimator from RGB-D data is proposed, which alleviates these problems and performs better than methods which operate solely on color images. The OpenPose library [31, 32, 33] is used to detect 2D keypoint score maps from the color image. These maps are then fed to a deep neural network, along with a voxel occupancy grid derived from the depth image. The network is trained to produce 18 keypoint estimates in 3D space: two for each wrist, elbow, shoulder, hip, knee and ankle, one for the neck and five facial points, consisting of the ears, the eyes and the nose. We employ this system in our work to estimate the child poses during their interaction with the robot.

Fig. 4: The detected pose is shown along with the detected bounding box surrounding the robot’s head.

II-B Multi-View Fusion

Fig. 5: Extracted features shown from an overhead view of the room. The detected keypoints are shown as black circles.

When using multiple cameras, the keypoints can be extracted for each view and fused to produce the final estimates (Fig. 3). The first step to achieve this is to register the points of each camera reference frame to a single common frame. The registration parameters were found using the ICP algorithm [37], which provides the transformation that best fits the point cloud of one camera to that of another, given an initial transformation that we set manually.
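The registration step can be illustrated with a minimal sketch, assuming the ICP result for each camera is available as a 4x4 homogeneous rigid transform. The function name `to_common_frame` and the matrix values below are hypothetical and only serve the example; they are not taken from the experiments.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def to_common_frame(
    keypoints: Sequence[Optional[Point]],
    transform: Sequence[Sequence[float]],
) -> List[Optional[Point]]:
    """Apply a 4x4 homogeneous rigid transform to each detected 3D keypoint.

    Missing detections (None) are passed through unchanged.
    """
    out: List[Optional[Point]] = []
    for p in keypoints:
        if p is None:
            out.append(None)
            continue
        x, y, z = p
        # Rotation (first three columns) plus translation (last column).
        out.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + row[3]
            for row in transform[:3]
        ))
    return out


# Hypothetical transform: 90-degree rotation about the z-axis plus a 1 m shift along x.
T_CAM2_TO_CAM1 = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

registered = to_common_frame([(1.0, 0.0, 2.0), None], T_CAM2_TO_CAM1)
```

Once every view is expressed in the same frame, the per-joint estimates from different cameras become directly comparable.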

After transforming the keypoint coordinates of all cameras to a common reference frame, the next step is to determine which keypoints are valid from each view. The pose estimation algorithm occasionally fails, either when some of the child’s joints are hidden, or when the pose differs substantially from those used to train the algorithm. In such cases, the system produces noisy estimates or no detections at all for certain keypoints. Another problem is that the algorithm sometimes outputs multiple poses, when another person is in view or occasionally when the network is confused by some background artifact. To tackle such problems, we only average the points that are sufficiently close to those detected in the previous frame. If no such points exist for a certain joint, we mark the joint as missing in the current frame.
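The gating and averaging described above could look roughly like the following sketch for a single joint. The distance threshold `max_jump` and the choice to accept all detections when no previous estimate exists are assumptions made for the example.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def fuse_joint(
    candidates: Sequence[Optional[Point]],
    previous: Optional[Point],
    max_jump: float = 0.3,  # assumed threshold in metres
) -> Optional[Point]:
    """Average the per-view estimates of one joint that lie close to the
    previous fused estimate. Returns None (joint marked as missing) when
    no view passes the gate."""
    detected = [p for p in candidates if p is not None]
    if previous is not None:
        detected = [p for p in detected if math.dist(p, previous) <= max_jump]
    if not detected:
        return None
    n = len(detected)
    return tuple(sum(p[i] for p in detected) / n for i in range(3))


# Two consistent views, one outlier (e.g. a second person) and one missed detection.
views = [(1.0, 1.0, 1.0), (1.25, 1.0, 1.0), (3.0, 0.0, 1.0), None]
fused = fuse_joint(views, previous=(1.0, 1.0, 1.0))
missing = fuse_joint([(3.0, 0.0, 1.0), None], previous=(1.0, 1.0, 1.0))
```

Gating on the previous frame rejects both spurious second poses and isolated noisy detections with a single rule, at the cost of relying on a correct initial estimate.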

Having fused the pose detections of multiple views, we interpolate the missing values using the previous estimates and then smooth the output using a simple low-pass filter. The points then undergo a final rotation, so that the coordinate axes co-align with the edges of the room.
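A minimal sketch of this post-processing for one coordinate of one joint follows. It assumes that gaps are filled by holding the last valid value and that the low-pass filter is an exponential moving average with coefficient `alpha`; the text does not specify either choice.

```python
from typing import List, Optional, Sequence


def fill_and_smooth(
    track: Sequence[Optional[float]],
    alpha: float = 0.5,  # assumed smoothing coefficient
) -> List[Optional[float]]:
    """Fill gaps with the last valid value, then apply an exponential moving average.

    `track` holds one coordinate of one joint over time; None marks a missing frame.
    Frames before the first valid value stay None.
    """
    smoothed: List[Optional[float]] = []
    last_raw: Optional[float] = None
    state: Optional[float] = None
    for value in track:
        if value is None:
            value = last_raw  # hold the previous estimate
        else:
            last_raw = value
        if value is None:  # nothing observed yet
            smoothed.append(None)
            continue
        state = value if state is None else alpha * value + (1 - alpha) * state
        smoothed.append(state)
    return smoothed


result = fill_and_smooth([None, 1.0, None, 3.0], alpha=0.5)
```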

II-C Feature Extraction

Naturally, we aren’t interested in the child’s pose in itself, but rather in relation to the task at hand. Specifically, we want to estimate the 3D keypoints in relation to the robot’s position. Since the Nao robot isn’t equipped with any localization sensors, we must estimate its position with respect to the world coordinates through other means. Therefore, we detect the robot in the color stream of one of the cameras, and infer its 3D position via an inverse camera projection.

We fine-tuned the YOLOv3 detection network [38] to detect the robot’s head on a set of manually annotated images. Using this network, we then detected the robot position in all video frames. An example is shown in Fig. 4. Paired with the depth images, we converted the detections to 3D points. We then subtracted the resulting values from the child pose estimates for each joint, thus making our features invariant to the position of the cameras within the experimental setup. The robot detections also contained noise and missing values, and were subjected to a similar procedure as the keypoints, i.e. missing value interpolation and smoothing. We also rejected erroneous detections if they lay outside a certain expected range, based on the limitations of the robot’s movement.
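The conversion of a 2D detection to a 3D point and the subtraction from the child keypoints can be sketched as follows, assuming a standard pinhole camera model. The intrinsics, the pixel coordinates and the range limits are hypothetical values, and the robot point is assumed to be expressed in the same frame as the child keypoints.

```python
from typing import Optional, Tuple

Point = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Optional[Point]:
    """Inverse pinhole projection of pixel (u, v) with depth in metres."""
    if depth <= 0:  # depth sensors commonly report 0 for invalid pixels
        return None
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def within_range(p: Point, lo: Point, hi: Point) -> bool:
    """Reject detections outside an axis-aligned box of plausible robot positions."""
    return all(l <= c <= h for c, l, h in zip(p, lo, hi))


def relative_to_robot(joint: Point, robot: Point) -> Point:
    """Express a child keypoint relative to the robot position."""
    return tuple(j - r for j, r in zip(joint, robot))


# Hypothetical intrinsics and bounding-box centre of the detected robot head.
robot = backproject(570.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
valid = within_range(robot, lo=(-2.0, -2.0, 0.5), hi=(2.0, 2.0, 4.0))
wrist_rel = relative_to_robot((1.5, 0.5, 3.0), robot)
```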

(a) The neural network comprises the fully connected layers FC1, FC2, FC3, a single LSTM layer and a final fully connected layer FC4 coupled with a softmax function on the output.

| Layer | Output | Includes |
|---|---|---|
| FC1 | (N, L, 2C) | Dropout + ReLU |
| FC2 | (N, L, 2C) | Dropout + ReLU |
| FC3 | (N, L, 2C) | Dropout + ReLU |
| LSTM | (N, L, C) | - |
| FC4 | (N, L, K) | - |

(b) The layer output sizes are shown, where N is the batch size, L is the sequence length, C is the hidden state size and K is the number of classes.
Fig. 6: (a) Architecture of neural network used to classify child engagement. (b) Network layer details.

We produce a number of high-level features that are expected to assist the classification process (Fig. 5). These include the angle between the child’s gaze and the robot, the angle between the child’s body facing and the robot, and the distance of the hands from the respective shoulders. The gaze direction is calculated from the detected facial keypoints, by taking the ear-to-ear vector in the 2D plane and rotating it by 90 degrees. The body facing is calculated in a similar fashion from the shoulder keypoints. From the two resulting angles, we subtract the robot-to-child angle, which is calculated using the keypoint center of mass and the detected robot position. The high-level features are concatenated with the keypoint values relative to the robot, mentioned above, to form the input data to the classifier.
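The angle and distance features can be sketched as below. The sketch assumes a top-down 2D plane with a right-handed coordinate system, in which rotating the left-to-right vector counter-clockwise by 90 degrees points forward; the sign convention and the function names are assumptions for illustration.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]
Point = Tuple[float, float, float]


def facing_direction(left: Vec2, right: Vec2) -> Vec2:
    """Rotate the left-to-right vector (ears or shoulders) by +90 degrees."""
    dx, dy = right[0] - left[0], right[1] - left[1]
    return (-dy, dx)


def angle_to_robot(left: Vec2, right: Vec2, child_center: Vec2, robot: Vec2) -> float:
    """Signed angle between the facing direction and the child-to-robot direction,
    wrapped to [-pi, pi]. Zero means the child faces the robot directly."""
    fx, fy = facing_direction(left, right)
    facing = math.atan2(fy, fx)
    to_robot = math.atan2(robot[1] - child_center[1], robot[0] - child_center[0])
    diff = facing - to_robot
    return math.atan2(math.sin(diff), math.cos(diff))


def hand_shoulder_distance(hand: Point, shoulder: Point) -> float:
    """Euclidean distance of a hand (wrist keypoint) from the respective shoulder."""
    return math.dist(hand, shoulder)


# Child at the origin facing +y; robot straight ahead, then 90 degrees to the side.
facing_robot = angle_to_robot((-1.0, 0.0), (1.0, 0.0), (0.0, 0.0), (0.0, 3.0))
side_on = angle_to_robot((-1.0, 0.0), (1.0, 0.0), (0.0, 0.0), (3.0, 0.0))
```

The same function serves both the gaze feature (ear keypoints) and the body-facing feature (shoulder keypoints).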

II-D Engagement Estimation

A key observation worth noting is that the degree of engagement heavily depends on temporal information. One reason for this is that the child tends to display the same level of interest over the course of a few-second interval. More importantly, however, the child’s movement and actions contain rich information which can be exploited. For example, if the child is constantly shifting their gaze, this is usually an indication of disengagement, whereas a steady focus signifies a higher level of interest. By choosing suitable machine learning algorithms capable of capitalizing on temporal data, we can expect a notable improvement over simply classifying each segment individually.

We use a deep neural network to classify the engagement level over time, the architecture of which can be seen in Fig. 6(a). The network consists of three fully connected (FC) layers, a Long Short-Term Memory (LSTM) network and a final fully connected layer, with a softmax function applied to the output to produce a probability score for each class. LSTM networks are a type of recurrent neural network known to be well-suited to time-varying data.

The network is fed a sequence of inputs. We group the input features described earlier into segments of a fixed number of frames, over which we compute the mean and standard deviation, thus further reducing noise and avoiding training on spurious data points, which can cause over-fitting. This gives us an input vector $x_t$ for each segment $t$, containing the mean and the standard deviation of every feature.

The output $y_t$ is a function of the previous inputs in the sequence and the weights $W$ of the network, with a dimension equal to the number of classes ($K = 3$). The fully connected layers produce linear combinations of their inputs, operating on each sequence point individually. The softmax layer ensures that the elements of $y_t$ are positive and sum to one. The data are fed in batches of size $N$. The output dimensions of each layer are shown in the table in Fig. 6. The variable $C$ is the size of the LSTM’s hidden state. We set the output sizes of the first three fully connected layers to $2C$ after trial and error. The first three fully connected layers are each followed by a dropout layer, with a dropout probability of 0.5, and a Rectified Linear Unit (ReLU).
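A minimal PyTorch sketch of a network with the layer layout of Fig. 6 is given below. The class name `EngagementNet`, the input dimension and the hidden size are hypothetical; only the layer order, the output shapes and the dropout probability follow the description above.

```python
import torch
from torch import nn


class EngagementNet(nn.Module):
    """Three FC layers (each followed by Dropout + ReLU), one LSTM layer, one output FC layer."""

    def __init__(self, in_dim: int, hidden: int, n_classes: int = 3, p_drop: float = 0.5):
        super().__init__()
        layers = []
        d = in_dim
        for _ in range(3):
            layers += [nn.Linear(d, 2 * hidden), nn.Dropout(p_drop), nn.ReLU()]
            d = 2 * hidden
        self.fc = nn.Sequential(*layers)                             # (N, L, 2C)
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)    # (N, L, C)
        self.out = nn.Linear(hidden, n_classes)                      # (N, L, K)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc(x)          # applied to each sequence point individually
        h, _ = self.lstm(h)     # carries information across the sequence
        return self.out(h)      # unnormalised class scores


# Hypothetical sizes: 40 input features, hidden size C = 64, batch N = 16, length L = 30.
model = EngagementNet(in_dim=40, hidden=64)
model.eval()  # disable dropout for inference
with torch.no_grad():
    probs = torch.softmax(model(torch.randn(16, 30, 40)), dim=-1)
```

The softmax is applied outside the module here so that the raw scores can be passed directly to a cross-entropy loss during training.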

III Experimental Analysis & Results

III-A Experimental Setup

1) Setup: The experimental setup can be seen in Fig. 2. As shown, Kinect devices are placed at various angles to capture multiple views of the room. Devices 1 and 2 capture the room from either side, device 3 from above and device 4 from the front. The Nao robot [40] was chosen for the task because it is capable of a wide range of motion, and its human-like features make it suitable for child-robot interaction. One or more bricks were placed either on a table in front of the robot, or on the floor close by. The child is free to move around the room as they please.

2) Session Description: The experiments evolved as follows. The robot approached one of the bricks and displayed the intention of picking up the brick, without being able to actually grasp it. With a series of motions it attempted to capture the child’s attention and prompt the child to hand over the brick. These motions included pointing at the brick, opening and closing its hand, alternating its gaze between the child and the brick and a combination of head turning and hand movement. If after a certain length of time the child failed to understand the robot’s intent, the robot proceeded to ask the child verbally. Once the brick was successfully handed over, the robot thanked the child and in some cases looked for another brick to grasp. Aside from the robot’s ability to communicate its intention, the success of the handover task depends on the child’s visual perceptiveness, willingness to cooperate and more generally their social skills. It is evident why understanding the level of engagement that the child displays is of crucial importance, both in evaluating the child’s abilities and in allowing the robot to choose more effective motions based on the child’s responses.

3) Data Collection: We recorded a total of 25 sessions. The children were aged 6-10 years old, 15 male and 10 female. The videos were then given to experts for annotation, according to the scheme described earlier. Dividing the recordings into 1-second segments, we derived 281 segments belonging to class 1, 2578 to class 2 and 745 to class 3. As we can see, the classes were significantly unbalanced: class 2 contained around 9 times more samples than class 1 and around 3.5 times more than class 3.

III-B Experimental Validation


| Method | Mean F-Score | Accuracy | Balanced Accuracy |
|---|---|---|---|
| Majority class | 27.90 | 71.97 | 33.33 |
| 3FC+LSTM | 62.18 | 77.11 | 61.88 |
| SVM | 54.79 | 68.27 | 58.61 |
| RF | 56.41 | 68.60 | 61.78 |

TABLE I: Performance results of different algorithms on the data. Results are averaged across folds of leave-one-out cross-validation.

We evaluate the method described above, trained on the recordings of the children. Since we only have 25 videos, rather than splitting the set into training and testing subsets, we carry out the evaluation via leave-one-out cross-validation.
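The evaluation protocol can be sketched as a generator of folds, assuming that one fold corresponds to one recorded session. The session identifiers below are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_out(session_ids: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training sessions, held-out session) pairs, one per session."""
    for i, held_out in enumerate(session_ids):
        yield [s for j, s in enumerate(session_ids) if j != i], held_out


# Hypothetical identifiers for the 25 recordings.
sessions = [f"child_{i:02d}" for i in range(1, 26)]
folds = list(leave_one_out(sessions))
```

Holding out a whole session at a time keeps segments of the same child from appearing in both the training and the test data of a fold.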

1) Implementation: We implemented the neural network described in section II using the PyTorch library. The network was trained from scratch, with an initial learning rate of 0.1, a momentum of 0.5 and weight decay. We used early stopping on a random subset of the training data, with a patience level of 10 epochs. When the training converged, we dropped the learning rate by a factor of 10. We chose a training batch size of $N = 16$ and set the sequence length to $L = 30$. These values, together with the hidden state size $C$, were chosen after an extensive hyper-parameter search.

Since LSTM networks generally require a large amount of data to train successfully and avoid over-fitting, we employ a few methods of data augmentation. Namely, we add a small amount of Gaussian noise to the mean value of each segment and randomly shift the starting point of each sequence within a fixed range. We observed a further improvement when training the fully connected layers first, and then freezing their weights, adding the LSTM module and training the remaining network. This forces the initial layers to produce informative outputs with regard to each individual segment, which the LSTM can then utilize to extract meaningful temporal information. Additionally, since we observed occasional spikes in the network’s gradients, we performed gradient clipping on the LSTM layer by capping the gradient norms at the value of 0.1.

The final classification is performed on 1-second segments, by a majority vote over the shorter feature segments that fall within each second. This seemed a logical compromise between over-sampling the data points and segmenting the temporal stream too crudely to be of use.
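The majority vote can be sketched as follows. The number of short segments per voting window (`group`) is a hypothetical value here, and ties are resolved in favour of the label that appears first in the window, which is an assumption of this example.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(labels: Sequence[int], group: int) -> List[int]:
    """Collapse consecutive windows of `group` predictions into one label each.

    A trailing incomplete window is dropped.
    """
    voted = []
    for start in range(0, len(labels) - group + 1, group):
        counts = Counter(labels[start:start + group])
        voted.append(counts.most_common(1)[0][0])
    return voted


# Nine short-segment predictions, three per voting window (assumed).
per_window = majority_vote([2, 2, 3, 1, 1, 1, 3, 2, 3], group=3)
```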

| Net Architecture | Mean F-Score | Accuracy | Balanced Accuracy |
|---|---|---|---|
| 3 FC + LSTM | 62.18 | 77.11 | 61.88 |
| 2 FC + LSTM | 56.23 | 71.86 | 58.46 |
| 3 FC + 2 LSTM | 54.78 | 70.60 | 56.30 |
| 2 FC + 2 LSTM | 54.45 | 69.71 | 56.91 |

TABLE II: Cross-validation results for different network architectures.

| Parameters | Mean F-Score | Accuracy | Balanced Accuracy |
|---|---|---|---|
| N=8, L=30 | 58.75 | 74.41 | 58.40 |
| N=16, L=30 | 62.18 | 77.11 | 61.88 |
| N=32, L=30 | 55.36 | 69.34 | 58.68 |
| N=16, L=10 | 47.21 | 61.68 | 51.58 |
| N=16, L=60 | 48.25 | 57.32 | 56.86 |

TABLE III: Results for different hyper-parameter values.

As mentioned, the classes are highly imbalanced. Though we also experimented with under-sampling and over-sampling, the best results were achieved using a weighted cross-entropy loss during training:

$$\mathcal{L}(W) = -\sum_{i=1}^{N} w_{c_i} \log y_{i, c_i}$$

where $c_i$ denotes the class of the $i$-th sample in the minibatch, $y_{i, c_i}$ is the softmax output of the network for that class, and w is a vector containing the weights for each class. We set the weights based on the appearance frequencies of each class in the dataset.
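As an illustration of the weighted loss, the sketch below computes it for a minibatch of softmax outputs. The inverse-frequency weighting is one common choice and is an assumption here; the text only states that the weights were based on the class frequencies. The class counts are those reported in the data collection paragraph.

```python
import math
from typing import Dict, Sequence


def inverse_frequency_weights(counts: Dict[int, int]) -> Dict[int, float]:
    """Assumed weighting: w_c proportional to 1 / count_c, normalised to sum to 1."""
    inv = {c: 1.0 / n for c, n in counts.items()}
    total = sum(inv.values())
    return {c: v / total for c, v in inv.items()}


def weighted_cross_entropy(
    probs: Sequence[Sequence[float]],
    targets: Sequence[int],
    weights: Dict[int, float],
) -> float:
    """Sum of -w[c_i] * log(p_i[c_i]) over the minibatch (classes are 1-based)."""
    return -sum(weights[c] * math.log(p[c - 1]) for p, c in zip(probs, targets))


counts = {1: 281, 2: 2578, 3: 745}
w = inverse_frequency_weights(counts)

# The same mistake costs more on the rare class 1 than on the frequent class 2.
loss_rare = weighted_cross_entropy([[0.2, 0.7, 0.1]], [1], w)
loss_frequent = weighted_cross_entropy([[0.7, 0.2, 0.1]], [2], w)
```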

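2) Evaluation Metrics: Due to the large class imbalance, the standard accuracy measure is not very informative. Therefore, we use two other measures of performance. The first is the average F-Score across all three classes, which is high only when both the precision and recall of each class are high. The second is the balanced accuracy of [41], given by:

$$\text{Balanced Accuracy} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FN_k}$$

where $TP_k$ and $FN_k$ denote the true positives and false negatives respectively of class $k$, and $K$ is the number of classes. This is the mean per-class recall, so always predicting a single one of the three classes yields 33.33%, as in the majority-class row of Table I.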
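Both metrics can be computed from per-class counts, as in the sketch below. It takes balanced accuracy to be the mean per-class recall, which is the standard definition and reproduces the 33.33% majority-class value of Table I; treating an undefined ratio as zero is an assumption of the example.

```python
from typing import Dict, Sequence, Tuple


def per_class_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     classes: Sequence[int]) -> Dict[int, Tuple[int, int, int]]:
    """Return (true positives, false positives, false negatives) for each class."""
    counts = {}
    for k in classes:
        tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
        fp = sum(t != k and p == k for t, p in zip(y_true, y_pred))
        fn = sum(t == k and p != k for t, p in zip(y_true, y_pred))
        counts[k] = (tp, fp, fn)
    return counts


def balanced_accuracy(y_true, y_pred, classes) -> float:
    """Mean per-class recall; a class with no true samples contributes 0."""
    c = per_class_counts(y_true, y_pred, classes)
    return sum(tp / (tp + fn) if tp + fn else 0.0 for tp, _, fn in c.values()) / len(classes)


def mean_f_score(y_true, y_pred, classes) -> float:
    """Unweighted mean of the per-class F1 scores; undefined scores count as 0."""
    c = per_class_counts(y_true, y_pred, classes)
    scores = [2 * tp / (2 * tp + fp + fn) if tp else 0.0 for tp, fp, fn in c.values()]
    return sum(scores) / len(classes)


# Always predicting the majority class on data with the reported class counts.
y_true = [1] * 281 + [2] * 2578 + [3] * 745
y_majority = [2] * len(y_true)
majority_bal_acc = balanced_accuracy(y_true, y_majority, classes=[1, 2, 3])
```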

3) Results: In Table I we compare the LSTM-based network against other popular classifiers, in particular a Support Vector Machine (SVM) and a Random Forest (RF). The SVM uses an RBF kernel, and the RF consists of 10 trees with a maximum depth of 10. The hyper-parameters of both classifiers, including the regularization weight and the kernel coefficient of the SVM, were tuned via a grid search. As a baseline we also include the results obtained when the majority class (class 2) is always predicted.

Notice that the proposed method outperforms all other classifiers, supporting our belief that exploiting temporal relations in the input data can lead to better results. The SVM and the RF nevertheless achieve a much higher mean F-Score and balanced accuracy than simply predicting the majority class, although their plain accuracy is slightly lower, meaning that even stationary pose information is partially descriptive of the engagement level.

In Table II we evaluate some other network architectures that we also tried. We experimented with the removal of the third fully connected layer (rows 2 and 4) and with the addition of a second LSTM layer (rows 3 and 4). As shown, the chosen architecture provides notably better results across all metrics. The additional LSTM layer causes over-fitting, rather than learning any deeper information in the data. The use of dropout allows a deeper network, with the addition of the third fully connected layer boosting the performance by a large margin.

Finally, we provide a comparison of different hyper-parameter values in Table III. The optimal sequence length is small enough to allow the training set to be divided into enough sequences to avoid over-fitting, but large enough to capture long term dependencies in the data. A batch size of 16 also provides a compromise between finely sampling the training data in each iteration and avoiding local minima while training. It’s worth noting that the algorithm is quite sensitive to these parameter values, possibly due to the relatively small size of the training dataset.

As we see from the ablation study and the comparison with the baseline methods, the proposed deep network architecture can learn to track the child’s engagement based on the variation of their pose during the proposed free interaction task. The developed system can be further improved with more annotated data and can become a useful tool for monitoring children’s behavior while they actively interact with robots. Note that the proposed system is designed to allow children to play and interact with no motion constraints in a whole room, rather than sitting in front of a robotic agent. The whole engagement module can be integrated alongside the child’s speech, action or emotion recognition modules in order to create next generation social robots that can feel and understand children’s behavior.

IV Conclusions & Future Work

In this work we proposed, by taking advantage of recent progress in deep learning, an end-to-end method of child engagement estimation during child-robot collaboration, without restricting the children’s movement or requiring them to be tethered to the robot. The use of child pose data, in conjunction with an LSTM-based neural network, proved to be effective towards this goal. This is especially important considering the difficulty of the problem. Differences in child behavior and personality, a wide range of possible motions and actions and various technical challenges all contribute to this difficulty. The concept of engagement is not rigidly defined, making the task hard even for humans. Despite this, we achieve relatively high evaluation metrics across a dataset of 25 children.

An important direction for future work will be to test the system on children affected by autism spectrum disorder (ASD). Such children exhibit social and communicative difficulties, but research has shown that they can benefit from interacting with robots. An important part of the interaction between ASD children and robots would be to understand their degree of engagement, so that the robots could monitor the children and adapt their behavior to each child individually. Naturally, this poses a further challenge, as ASD children act very differently from children in typical development.


The authors would like to thank the psychologists Asimenia Papoulidi and Christina Papailiou for annotating the engagement levels, and our colleagues Niki Efthymiou and Panagiotis Filntisis for helping carry out the experiments and for their technical assistance.


  • [1] M. A. Goodrich and A. C. Schultz, “Human-robot interaction: A survey,” Found. Trends Hum.-Comput. Interact., vol. 1, no. 3, pp. 203–275, Jan. 2007. [Online]. Available:
  • [2] G. Gordon, C. Breazeal, and S. Engel, “Can children catch curiosity from a social robot?” in Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, ser. HRI ’15.   New York, NY, USA: ACM, 2015, pp. 91–98. [Online]. Available:
  • [3] J. Kennedy, P. Baxter, E. Senft, and T. Belpaeme, “Higher nonverbal immediacy leads to greater learning gains in child-robot tutoring interactions,” in Social Robotics, A. Tapus, E. André, J.-C. Martin, F. Ferland, and M. Ammi, Eds.   Cham: Springer International Publishing, 2015, pp. 327–336.
  • [4] M. Saerbeck, T. Schut, C. Bartneck, and M. D. Janse, “Expressive robots in education: Varying the degree of social supportive behavior of a robotic tutor,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI ’10.   New York, NY, USA: ACM, 2010, pp. 1613–1622. [Online]. Available:
  • [5] F. Kirstein and R. V. Risager, “Social robots in educational institutions they came to stay: Introducing, evaluating, and securing social robots in daily education,” in 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), March 2016, pp. 453–454.
  • [6] D. Davison, “Child, robot and educational material: A triadic interaction,” in 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), March 2016, pp. 607–608.
  • [7] A. Tsiami, P. Koutras, N. Efthymiou, P. P. Filntisis, G. Potamianos, and P. Maragos, “Multi3: Multi-sensory perception system for multi-modal child interaction with multiple robots,” in Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), 2018.
  • [8] A. Tsiami, P. P. Filntisis, N. Efthymiou, P. Koutras, G. Potamianos, and P. Maragos., “Far-field audio-visual scene perception of multi-party human-robot interaction for children and adults,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), 2018.
  • [9] N. Efthymiou, P. Koutras, P. P. Filntisis, G. Potamianos, and P. Maragos., “Multi-view fusion for action recognition in child-robot interaction,” in Proc. IEEE Int. Conf. on Image Processing (ICIP), 2018.
  • [10] J. M. K. Westlund and C. Breazeal, “Transparency, teleoperation, and children’s understanding of social robots,” in 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), March 2016, pp. 625–626.
  • [11] A. Othman and M. Mohsin, “How could robots improve social skills in children with autism?” in 2017 6th International Conference on Information and Communication Technology and Accessibility (ICTA), Dec 2017, pp. 1–5.
  • [12] C. A. G. J. Huijnen, M. A. S. Lexis, R. Jansens, and L. P. de Witte, “How to implement robots in interventions for children with autism? a co-creation study involving people with autism, parents and professionals,” Journal of Autism and Developmental Disorders, vol. 47, no. 10, pp. 3079–3096, Oct 2017. [Online]. Available:
  • [13] S. M. Anzalone, S. Boucenna, S. Ivaldi, and M. Chetouani, “Evaluating the engagement with social robots,” International Journal of Social Robotics, vol. 7, no. 4, pp. 465–478, Aug 2015.
  • [14] H. L. O’Brien and E. G. Toms, “What is user engagement? A conceptual framework for defining user engagement with technology,” Journal of the American Society for Information Science and Technology, vol. 59, no. 6, pp. 938–955.
  • [15] I. Poggi, Mind, Hands, Face and Body: A Goal and Belief View of Multimodal Communication, ser. Körper, Zeichen, Kultur. Weidler, 2007.
  • [16] S. Ivaldi, S. Lefort, J. Peters, M. Chetouani, J. Provasi, and E. Zibetti, “Towards engagement models that consider individual factors in HRI: On the relation of extroversion and negative attitude towards robots to gaze and speech during a human–robot assembly task,” International Journal of Social Robotics, vol. 9, no. 1, pp. 63–86, Jan 2017.
  • [17] J.-D. Boucher, U. Pattacini, A. Lelong, G. Bailly, F. Elisei, S. Fagel, P. Dominey, and J. Ventre-Dominey, “I reach faster when I see you look: Gaze effects in human–human and human–robot face-to-face cooperation,” Frontiers in Neurorobotics, vol. 6, p. 3, 2012.
  • [18] M. E. Foster, A. Gaschler, and M. Giuliani, “Automatically classifying user engagement for dynamic multi-party human–robot interaction,” International Journal of Social Robotics, vol. 9, no. 5, pp. 659–674, Nov 2017.
  • [19] C. Zaga, M. Lohse, K. P. Truong, and V. Evers, “The effect of a robot’s social character on children’s task engagement: Peer versus tutor,” in Social Robotics, A. Tapus, E. André, J.-C. Martin, F. Ferland, and M. Ammi, Eds.   Cham: Springer International Publishing, 2015, pp. 704–713.
  • [20] T. Schodde, L. Hoffmann, and S. Kopp, “How to manage affective state in child-robot tutoring interactions?” in 2017 International Conference on Companion Technology (ICCT), Sept 2017, pp. 1–6.
  • [21] A. Chorianopoulou, E. Tzinis, E. Iosif, A. Papoulidi, C. Papailiou, and A. Potamianos, “Engagement detection for children with autism spectrum disorder,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 5055–5059.
  • [22] A. Tapus, A. Peca, A. Aly, C. Pop, L. Jisa, S. Pintea, A. S. Rusu, and D. O. David, “Children with autism social engagement in interaction with NAO, an imitative robot: A series of single case experiments,” Interaction Studies, vol. 13, no. 3, pp. 315–347, 2012.
  • [23] O. Rudovic, J. Lee, L. Mascarell-Maricic, B. W. Schuller, and R. W. Picard, “Measuring engagement in robot-assisted autism therapy: A cross-cultural study,” Frontiers in Robotics and AI, vol. 4, p. 36, 2017.
  • [24] Y. Feng, Q. Jia, M. Chu, and W. Wei, “Engagement evaluation for autism intervention by robots based on dynamic Bayesian network and expert elicitation,” IEEE Access, vol. 5, pp. 19494–19504, 2017.
  • [25] H. Javed, M. Jeon, and C. H. Park, “Adaptive framework for emotional engagement in child-robot interactions for autism interventions,” in 2018 15th International Conference on Ubiquitous Robots (UR), June 2018, pp. 396–400.
  • [26] Z. Zheng, H. Zhao, A. R. Swanson, A. S. Weitlauf, Z. E. Warren, and N. Sarkar, “Design, development, and evaluation of a noninvasive autonomous robot-mediated joint attention intervention system for young children with asd,” IEEE Transactions on Human-Machine Systems, vol. 48, no. 2, pp. 125–135, April 2018.
  • [27] O. Rudovic, J. Lee, M. Dai, B. Schuller, and R. W. Picard, “Personalized machine learning for robot perception of affect and engagement in autism therapy,” Science Robotics, vol. 3, no. 19, 2018.
  • [28] M. Khamassi, G. Chalvatzaki, T. Tsitsimis, G. Velentzas, and C. S. Tzafestas, “An extended framework for robot learning during child-robot interaction with human engagement as reward signal,” in 3rd Workshop on Behavior Adaptation, Interaction and Learning for Assistive Robotics (BAILAR), in the 27th International Conference on Robot and Human Interactive Communication (ROMAN), Aug 2018.
  • [29] M. Khamassi, G. Velentzas, T. Tsitsimis, and C. Tzafestas, “Robot fast adaptation to changes in human engagement during simulated dynamic social interaction with active exploration in parameterized reinforcement learning,” IEEE Transactions on Cognitive and Developmental Systems, pp. 1–1, 2018.
  • [30] J. Hadfield, P. Koutras, N. Efthymiou, G. Potamianos, C. Tzafestas, and P. Maragos, “Object assembly guidance in child-robot interaction using rgb-d based 3d tracking,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2018.
  • [31] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, 2017.
  • [32] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, “Hand keypoint detection in single images using multiview bootstrapping,” in CVPR, 2017.
  • [33] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in CVPR, 2016.
  • [34] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt, “Vnect: Real-time 3d human pose estimation with a single rgb camera,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 44, 2017.
  • [35] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis, “Learning to estimate 3d human pose and shape from a single color image,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [36] C. Zimmermann, T. Welschehold, C. Dornhege, W. Burgard, and T. Brox, “3d human pose estimation in rgbd images for robotic task learning,” in IEEE International Conference on Robotics and Automation (ICRA), 2018.
  • [37] P. J. Besl and N. D. McKay, “A method for registration of 3-D shapes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 239–256, 1992.
  • [38] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [40] D. Gouaillier, V. Hugel, P. Blazevic, C. Kilner, J. Monceaux, P. Lafourcade, B. Marnier, J. Serre, and B. Maisonnier, “Mechatronic design of NAO humanoid,” in Proc. International Conference on Robotics and Automation (ICRA), 2009, pp. 769–774.
  • [41] K. H. Brodersen, C. S. Ong, K. E. Stephan, and J. M. Buhmann, “The balanced accuracy and its posterior distribution,” in 2010 20th International Conference on Pattern Recognition (ICPR). IEEE, 2010, pp. 3121–3124.