Action similarity judgment based on kinematic primitives

08/30/2020 ∙ by Vipul Nair, et al.

Understanding which features humans rely on when visually recognizing action similarity is a crucial step towards a clearer picture of human action perception from a learning and developmental perspective. In the present work, we investigate to what extent a computational model based on kinematics can determine action similarity and how its performance relates to human similarity judgments of the same actions. To this aim, twelve participants perform an action similarity task, and their performance is compared to that of a computational model solving the same task. The chosen model has its roots in developmental robotics and performs action classification based on learned kinematic primitives. The comparative experiment results show that both the model and human participants can reliably identify whether two actions are the same or not. However, the model produces more false hits and has a greater selection bias than human participants. A possible reason for this is the particular sensitivity of the model towards kinematic primitives of the presented actions. In a second experiment, human participants' performance on an action identification task indicated that they relied solely on kinematic information rather than on action semantics. The results show that both the model and human performance are highly accurate in an action similarity task based on kinematic-level features, which can provide an essential basis for classifying human actions.




I Introduction

Human vision is highly sensitive to the biological motion patterns created by the movement of other individuals (e.g., [25, 34]). From a developmental perspective, this sensitivity in terms of visual preference is present in newborns [23] and significantly increases longitudinally from 3 to 24 months [22]. Learning to distinguish between different action categories and exemplars reflects this sensitivity and visual preference [11].

Judging action similarity, i.e., judging whether two actions are the same or not, is an essential part of learning action categories and a step towards action understanding. Indeed, in most behavioral studies, it has been addressed as a form of measure toward understanding action semantics [32], action prototypes [7], and imitation [5].

From a computational viewpoint, judging action similarity is paramount in social robotics, industrial-robot collaboration, and video surveillance [15]. Action similarity can be complicated in a realistic setting, such as action class ambiguity in multi-class action recognition [19]. To address this ambiguity problem, action similarity labeling (same or different) was first introduced by Kliper-Gross et al. [14] as a critical task in action recognition. According to Kliper-Gross et al. [14], action similarity labeling aims to determine if the actors in two video sequences are performing the same or different actions. The labeling algorithms rely primarily on creating a suitable metric for the differences between the actions from the extracted kinematic features (see [19] for a detailed review of the approaches). Kliper-Gross et al. showed a considerable gap (around 65%) between the state-of-the-art methods and the success rate of humans on action similarity labeling and argued towards a principled understanding of what makes actions similar or different [14].

The work presented in this paper attempts to reduce this gap by using a computational model that derives action primitives based on kinematic features (from the biological motion regularities) [29]. The model is used to perform an action similarity task (AST), i.e., to judge whether actions are the same or different. The model performs AST by learning to classify actions (using dictionary learning) based on a linear combination of kinematic primitives (sparse coding technique). In particular, we assess the extent to which this representation of actions can produce successful action classification by comparing the model’s performance with human visual performance on the same AST.

As a further comparison between the model and human biological motion perception, we conducted a second experiment with an action identification task (AIT) to validate the use of kinematic features in action similarity judgments made by humans. In other words, are humans relying on high-level semantic features for their similarity judgments rather than on low-level kinematics?

Previous studies have shown that humans identify action primitives based on kinematic features in an action segmentation task [10]. Consistent with these results, the model derives and uses combinations of different visual body motion patterns (action primitives) to distinguish between different human actions. In addition, action primitives are an area of focus in modeling the recognition and categorization of human actions by artificial systems [16].

In summary, this paper addresses three questions. What is the extent to which the computational model based on kinematic primitives can determine action similarity among a group of actions? To what extent does the model’s performance relate to human similarity judgments of the same actions? Do the human action similarity judgments rely mainly on the kinematic features of the actions rather than higher-level action semantics?

II Hand action stimuli

The stimuli used in this study are taken from the Multimodal Cooking Actions (MoCA) dataset, which is available for download. The full dataset includes motion capture data and videos (from multiple viewpoints) of upper body actions executed by one actor in a cooking scenario. (For more details about the dataset, see [17].) The actions are hand based and manipulative, i.e., actions intended to modify or displace an object. This dataset was chosen for testing action similarity because hand-based actions cover a wide range of complexity with various movements, and most day-to-day activities involve hand actions.

For this study we chose 19 actions from the set, namely Carrot: Grating a carrot, Cut: Cutting a loaf of bread, Dish: Cleaning a dish, Eat: Eating a slice of bread, Eggs: Beating eggs, Lemon: Squeezing a lemon, Mezzaluna: Using a mezzaluna knife, Mixing: Stirring a mixture, OpenBottle: Opening a bottle, Pan: Pan flip, Pestare: Crushing leaves, Pouring: Pouring water, Reaching: Reaching an object, Rolling: Rolling dough, Salad: Rotating salad chopper, Salt: Using a salt shaker, Spread: Spreading cheese on bread, Table: Cleaning table, and Transport: Transporting an object (all the actions will be referred to by their capitalized term). Most of the actions are carried out by the right hand, whereas some involve both hands (e.g., Mezzaluna or Rolling). To investigate action similarity, selecting a single viewpoint was necessary to avoid an excessively long experiment for the human participants. Therefore we opted for the frontal viewpoint, which is familiar and natural for interaction, especially during the early stages of child development. However, it has been shown that the model can perform action recognition with multiple viewpoints [28], paving the way for future investigation of human perception. See Fig. 1 for an example frame for Eggs and its point-light display (PLD). Since this study focuses just on the low-level kinematic features of actions, the human participants were shown PLDs, limiting their leverage from contextual information. Correspondingly, the model was designed to extract only kinematic information directly from the videos (see Section III for details).

Fig. 1: The left image shows a frame of action Eggs from a frontal point of view, and the right image shows its PLD. The PLDs correspond to the positions of the markers.

III Computational model

The computational model chosen for this study builds upon the model for detecting biological motion described in [29]. The model takes inspiration from the human ability to distinguish between biological and non-biological motion, an ability exhibited by newborns where they orient their attention towards biologically moving stimuli [23]. The model exploits human motor movement regularities resulting from the Two-thirds power law, a well known invariant of human movement [31, 21] and has also been implemented on an iCub humanoid robot as a proof of applicability [29, 30, 27].

The model for recognizing action similarities utilizes visual motion primitives to understand actions [28]. The approach is to identify necessary and sufficient action sub-components and use them as visual primitives to form simple motion representations that can reconstruct a wide range of complex actions. A broad breakdown of how the model is built is as follows:

Firstly, the optical flow is extracted from the videos (of hand action stimuli) for each time instant, and the tangential velocity is computed (see [29]). The averaged velocities over time give a compact representation of each video. The velocity sequences over time are segmented into sub-movements (portions). The sub-movements are derived automatically from set points that correspond to a Start, Stop, or Change in the action dynamics, i.e., the local minima of the velocity profile [20].
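The segmentation step can be illustrated with a minimal sketch. It assumes the velocity profile is already available as a one-dimensional array of speeds per frame; the function name `segment_at_velocity_minima` and the toy profile are hypothetical and are not taken from the model's implementation.

```python
import numpy as np

def segment_at_velocity_minima(speed):
    """Split a 1-D speed profile into sub-movements at its interior local minima."""
    speed = np.asarray(speed, dtype=float)
    # An interior sample counts as a local minimum if it is lower than its
    # left neighbour and not higher than its right neighbour.
    interior = np.arange(1, len(speed) - 1)
    is_min = (speed[interior] < speed[interior - 1]) & (speed[interior] <= speed[interior + 1])
    cuts = interior[is_min]
    # Segment boundaries: first frame, every local minimum, last frame.
    bounds = np.concatenate(([0], cuts, [len(speed) - 1]))
    return [(int(a), int(b)) for a, b in zip(bounds[:-1], bounds[1:])]

# Toy speed profile (assumed data): two bell-shaped strokes separated by a dip.
toy_speed = [0.0, 1.0, 2.0, 1.0, 0.2, 1.5, 2.5, 1.0, 0.0]
segments = segment_at_velocity_minima(toy_speed)
print(segments)  # [(0, 4), (4, 8)]
```

Each tuple is a (start frame, end frame) pair. Real velocity profiles are noisy, so some smoothing before the minimum search would likely be needed in practice.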

Secondly, the obtained sub-movements of all the actions (19 hand actions) are treated together and given as input to a K-means clustering, thereby building a unique dictionary of K atoms. With the dictionary, each sub-movement of the training set is then reconstructed as an approximation of a linear combination of some of the atoms in the dictionary, using the sparse coding technique, and represented as the sequence of weights used for each atom in the reconstruction. At the end of this procedure, given a video representing a given action, the model can describe each sub-movement $u_i$ as the feature vector $[w_{i1}, w_{i2}, \dots, w_{iK}]$, where $w_{ij}$ is the coefficient/weight assigned to the j-th atom (j = 1…K). Since the representation is sparse, some of the coefficients are equal to 0, and K=15 is the number of atoms of the dictionary.
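The dictionary and coding steps can be sketched as follows. This is only an illustration under stated assumptions: sub-movements are assumed to be resampled to a common length, the sparse coding is done with a greedy orthogonal matching pursuit as a stand-in (the source does not specify the solver), and the helper names `kmeans_dictionary` and `sparse_code` and the random toy data are hypothetical.

```python
import numpy as np

def kmeans_dictionary(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; the k centroids are used as dictionary atoms."""
    rng = np.random.default_rng(seed)
    atoms = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every sub-movement to its nearest atom.
        dists = ((X[:, None, :] - atoms[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old atom if a cluster empties
                atoms[j] = members.mean(axis=0)
    return atoms

def sparse_code(u, atoms, n_nonzero=3):
    """Greedy orthogonal matching pursuit: u ~ w @ atoms with few non-zero weights."""
    u = np.asarray(u, dtype=float)
    w = np.zeros(len(atoms))
    support = []
    residual = u.copy()
    norms = np.linalg.norm(atoms, axis=1) + 1e-12
    for _ in range(n_nonzero):
        # Pick the atom most correlated with what is still unexplained.
        corr = np.abs(atoms @ residual) / norms
        if support:
            corr[support] = -np.inf  # never pick the same atom twice
        support.append(int(corr.argmax()))
        # Re-fit the weights of all selected atoms by least squares.
        coef, *_ = np.linalg.lstsq(atoms[support].T, u, rcond=None)
        residual = u - coef @ atoms[support]
    w[support] = coef
    return w

# Toy data (assumed): 200 "sub-movements", each resampled to 20 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
atoms = kmeans_dictionary(X, k=15)   # K = 15 atoms, as in the text
w = sparse_code(X[0], atoms)         # weight vector of length K
print(atoms.shape, np.count_nonzero(w))
```

The vector `w` plays the role of the feature vector for one sub-movement: most entries are zero and the remaining ones are the weights of the atoms used in the reconstruction.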

Thirdly, a classification of the actions (19 hand actions) is performed following a supervised approach. A multi-class classifier is built with a one-vs-all approach, where a binary classifier per class (i.e., per action) is built. So for each action, a binary classifier is trained to discriminate between the representation of that action and all the rest. See Fig. 2 for an example of how Eat contributed to the sub-movement dictionary and how Transport is represented via the dictionary primitives.
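A one-vs-all scheme of this kind can be sketched in a few lines. The example below assumes a linear regularized least-squares scorer as the binary classifier and uses synthetic three-class data; both choices are simplifications made for illustration, and the function names are hypothetical.

```python
import numpy as np

def train_one_vs_all_rls(X, y, n_classes, lam=1e-2):
    """Fit one linear regularized-least-squares scorer per class (closed form)."""
    # Targets: +1 for the class of interest, -1 for all the rest.
    Y = -np.ones((len(y), n_classes))
    Y[np.arange(len(y)), y] = 1.0
    # Ridge solution W = (X^T X + lam * n * I)^-1 X^T Y, one column per class.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * len(X) * np.eye(d), X.T @ Y)

def predict(W, X):
    """Return the class whose binary scorer gives the highest score."""
    return (X @ W).argmax(axis=1)

# Toy data (assumed): three well-separated classes in a 3-D feature space.
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]])
y = np.repeat(np.arange(3), 30)
X = centers[y] + 0.3 * rng.normal(size=(90, 3))

W = train_one_vs_all_rls(X, y, n_classes=3)
accuracy = (predict(W, X) == y).mean()
print(W.shape, accuracy)
```

Each column of `W` is one binary scorer, so the same structure extends to 19 columns for the 19 actions.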

Fig. 2: (a) Eat action video from which optical flow is extracted, (b) identified dynamic instants of the Eat action based on set rules and the extracted sub-movements, (c) dictionary of primitives composed of 15 sub-movements (atoms) extracted from all the 19 actions, (d) sub-movements extracted from the Eat action, and (e) Transport action represented via the dictionary primitives: sub-movement 1 has a large contribution from atom 6, sub-movement 2 has a large contribution from atom 10, and so on. Images modified from [20] and [28].

IV Experiment 1

This experiment addresses the extent to which the computational model based on kinematic primitives can determine action similarity among a group of hand-arm actions. It also addresses the extent to which the performance of the model relates to human similarity judgments of the same actions. A two-alternative forced-choice AST is designed for both the model and the human participants. The model and human participants perform the same task, so the procedures are designed to yield comparable results. The comparison is based on accuracy, false-hits (actions incorrectly picked as the target), and selection-bias (the tendency of an action classifier, or choice action, to get picked).

IV-A Model - Action similarity task

IV-A1 Stimuli

For a given trial, a video of the target action was fed to the model, and the model extracted optical flow from the video and computed the motion descriptor for each frame as described in Section III. The frontal viewpoint was used for both training and testing the model.

IV-A2 Procedure

There are 19 action classifiers, one trained for each of the 19 chosen actions. For the classification, we used a Regularized Least Squares (RLS) classifier, adopting the library GURLS [24] for an efficient implementation of RLS. We employ a Radial Basis Function (RBF) kernel. The model performed an AST where it was presented with the target (T) action video and two action classifiers (A and B). These two classifiers competed to see which one of them (A or B) was the same as T. So for a given trial where Eggs is T, two action classifiers trained on, say, Eggs (H) and Rolling (M) compete, and the classifier with the higher score wins the trial. To simulate the constraints of a viewing period that a human participant would have, random instances of the stimuli were considered, where an instance is one sub-movement of the action (e.g., in the case of Mixing, one half-circular rotation of the palm would be considered one sub-movement). The similarities were computed by averaging the similarities between 10 random instances of the actions.
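The competition between the two classifiers in a single trial can be sketched as below. This is an illustrative reading of the procedure, assuming each classifier has already produced one score per sub-movement instance of the target; the function `run_trial` and the toy scores are hypothetical.

```python
import numpy as np

def run_trial(scores_a, scores_b, n_instances=10, seed=0):
    """Return 'A' or 'B': the classifier with the higher mean score on the target wins."""
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    # Draw the same random sub-movement instances of the target for both classifiers.
    idx = rng.choice(len(scores_a), size=n_instances, replace=False)
    return "A" if scores_a[idx].mean() > scores_b[idx].mean() else "B"

# Toy scores (assumed): classifier A matches the target, classifier B does not.
matching = np.full(25, 0.8)
non_matching = np.full(25, -0.5)
winner = run_trial(matching, non_matching)
print(winner)  # A
```

Averaging over a small random subset of instances, rather than over the whole video, mimics the limited viewing period described in the text.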

See Fig. 3 for a schematic description of the experiment design. Each trial consisted of the triad A, B, and T, with the condition (T=A OR T=B) AND A≠B, i.e., one of the classifiers (A or B) was always trained on the same action as T. The number of unique permutations is therefore 684 (3(r) actions at a time taken from a set of 19(n) actions, with the order and repetition factor). The total number of trials conducted was 684x24 = 16416, in randomized order.
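The trial count can be checked by enumerating the triads directly. The sketch below assumes actions are indexed 0 to 18 and takes the factor of 24 from the text as given.

```python
from itertools import permutations

n_actions = 19
# Ordered (target, distractor) pairs with target != distractor: 19 * 18 = 342.
pairs = list(permutations(range(n_actions), 2))
# The matching classifier can sit in slot A or in slot B, which doubles the count.
triads = [(t, t, d) for t, d in pairs] + [(t, d, t) for t, d in pairs]  # (T, A, B)
n_unique = len(triads)
n_total = n_unique * 24
print(n_unique, n_total)  # 684 16416
```

Every triad satisfies (T=A OR T=B) AND A≠B by construction, since the distractor always differs from the target.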

Fig. 3: Schematic diagram of the experiment design for both the model and human AST.

IV-B Human - Action similarity task

The human participants performed an equivalent version of the AST. PLDs of the actions were used, with no contextual information about the action (the tool used or the setting) provided, limiting perceptual conditions to kinematic features. Additionally, to assess the participants’ implicit semantic access, we tested their performance as a function of orientation: upright (UP) and inverted (INV) PLDs. If participants perform significantly worse for the INV PLDs than for the UP PLDs (inversion effect), that would indicate implicit semantic access for the UP PLDs.

IV-B1 Participants

Twelve subjects (5 males, mean age of 31.4 years, age range 24 to 46 years) with normal (or corrected) vision participated. They were provided information about the task, and gave written informed consent for participation. They were given a movie ticket for their participation time. The experiment was carried out in accordance with the National Ethics Law and the World Medical Association Declaration of Helsinki.

IV-B2 Stimuli

PLDs of the right arm for each of the actions (motion capture data) were generated using Biomotion toolbox-2 [26] in MATLAB. The PLDs consist of six dots positioned at the shoulder, elbow, wrist, and three at the palm region (Fig. 1). Two orientations of PLDs were used: UP and INV (by horizontal flipping of UP PLD). See Fig. 3 for a trial display of 3 PLDs (namely A, B, and T). The stimuli were presented in a frontal point of view (facing the participants) and played at their veridical speed. The experiment was conducted using MATLAB R2014a with Biomotion toolbox-2 [26] and Psychtoolbox-3 [13]. The stimuli were displayed on a 22 inch HP L2245wg LCD monitor, with a native resolution of 1680 x 1050 at 60 Hz, viewable dimensions of 29.5 cm x 47.5 cm (W x H), and a viewing distance of 100 cm.

IV-B3 Procedure

Participants performed the same AST as the model in which they viewed three actions (A, B, and T) in one frame, and they had to indicate (via keypress) which of the two stimuli A or B was the same as the T stimulus. Each trial lasted for 4 seconds only, and the participants had to respond within the same period. Upon failure to respond in 4s, the next trial started. Participants were informed about the PLD’s corresponding physical features, the viewpoint, and the orientations, but no information about the actions themselves was provided – just that they were performing day-to-day actions. The PLDs had random starting frames that played in a continuous loop at 30 FPS. Each response was followed by a fixation cross (0.23) at the center (500-700ms). After providing instructions, the participants performed practice trials (30 trials), followed by the experiment.

The experiment consisted of 3 independent variables in a mixed design: Orientation (UP/INV, within-subjects), Block-order (UP-INV/INV-UP, between-subjects), and Actions (19 actions, random variable). See Fig. 3 for a schematic description. The block-orders (UP-INV and INV-UP) were balanced between the subjects, with 6 participants viewing UP-INV. Individual trial orders within blocks were randomized. The overall number of trials performed was the same as for the model (16416 trials).

IV-C Results

The model’s and the human participants’ performance are presented in the form of confusion matrices for humans (H) and the model (M) in Fig. 4. Each matrix shows the similarity measure (accuracy of matching rate%) along the matrix’s diagonal. False-hit (frequency) for the target actions is shown at the end of their respective rows. Selection-bias (frequency) for the action/classifiers is shown at the end of their respective columns. These measures in the matrices highlight the strengths and weaknesses of the performer (model or human) for each of the actions (or action classifiers).

Fig. 4: Matrices with mean similarity measures, with target actions (y-axis) and matched actions or classifiers (x-axis). Measures for (H) Human (matches in %) and (M) Model (matches in %).

The confusion matrices (H) and (M) show the matching rate%, with target actions on the y-axis and matched actions (classifiers for (M)) on the x-axis. Cell (i, j) holds the match% of the i-th target action matched with the j-th action. The diagonal cells (referred to as the accuracy cells) indicate the percentage of times the target action was correctly identified. For example, in matrix H, the accuracy cell for Dish = 94.1% shows the percentage of times the target action Dish was identified correctly.

The non-diagonal cells report the % of times the target action was incorrectly identified, which is split into false-hits and selection-bias. The similarity measures (%) in the i-th row are averaged together, excluding the respective accuracy cell, giving the false-hit% of the i-th target action. E.g., for the Lemon row in matrix M, the cell (Lemon, Carrot) = 62.5%: Lemon was identified as Carrot 62.5% of the times whenever Carrot was the other classifier/action. The row averages 54.28% (excluding the accuracy cell), indicating Lemon’s mean false-hit%. The similarity measures (%) in the j-th column are averaged together, excluding the accuracy cell, giving the selection-bias% of the j-th action/classifier. E.g., for the Cut column in matrix M, the cell (Carrot, Cut) = 12.5%: the Cut action/classifier was selected 12.5% of the times for the target action Carrot. The column averages 49.88% (excluding the accuracy cell), indicating Cut’s mean selection-bias%.
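The two measures are simply off-diagonal row and column means of the matrix, which the following sketch makes explicit. The 3x3 matrix is invented toy data, not taken from Fig. 4, and the function name is hypothetical.

```python
import numpy as np

def false_hit_and_selection_bias(match):
    """Row and column means of a match-rate matrix with the diagonal excluded."""
    match = np.asarray(match, dtype=float)
    n = match.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    masked = np.where(off_diag, match, 0.0)
    false_hit = masked.sum(axis=1) / (n - 1)       # per target action (rows)
    selection_bias = masked.sum(axis=0) / (n - 1)  # per action/classifier (columns)
    return false_hit, selection_bias

# Toy match-rate matrix in % (assumed values); rows are targets, columns are choices.
toy = np.array([[90.0, 10.0, 20.0],
                [ 5.0, 80.0, 15.0],
                [30.0, 40.0, 70.0]])
fh, sb = false_hit_and_selection_bias(toy)
print(fh.tolist())  # [15.0, 10.0, 35.0]
print(sb.tolist())  # [17.5, 25.0, 17.5]
```

Because each off-diagonal cell is a pairwise match rate (how often the target was matched to that particular alternative when it was offered), the rows need not sum to 100.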

Independent sample t-tests were conducted on the accuracy, false-hit and selection-bias measures for the two matrices. Both the model (M = 85.72%, SD = 16.80) and the human participants (M = 92.67%, SD = 4.82) showed a high level of accuracy with no significant difference between them; t(36) = -1.73, p = 0.092. Although the false-hit results were relatively low for both matrices, the model had significantly higher false-hit rates (M = 14.28%, SD = 16.80) than the human participants (M = 5.81%, SD = 3.16); t(36) = -2.16, p = 0.037. Similarly, the selection-bias results were also relatively low for both matrices, but the model had a significantly higher selection-bias (M = 14.28%, SD = 15.78) than the human participants (M = 5.65%, SD = 2.38); t(36) = -2.16, p = 0.023.
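As a worked check, the accuracy comparison can be recomputed from the reported summary statistics. The sketch assumes a pooled-variance (Student) t-test with n = 19 values per matrix, one per action, which is consistent with the reported 36 degrees of freedom; the function name is hypothetical.

```python
import math

def independent_t_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Student's independent-samples t statistic (pooled variance) and its df."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (m1 - m2) / se, df

# Accuracy: model (M = 85.72, SD = 16.80) vs. humans (M = 92.67, SD = 4.82),
# assuming 19 per-action values in each matrix.
t_acc, df = independent_t_from_summary(85.72, 16.80, 19, 92.67, 4.82, 19)
print(round(t_acc, 2), df)  # -1.73 36
```

The sign of the statistic depends only on which group is entered first.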

TABLE I: RT means (in seconds) and (SD) across conditions

The second question, how the model’s performance relates to human action similarity judgments, depends crucially on the extent to which humans relied on kinematic features to make their similarity judgments. If human action similarity judgments are not affected by the actions’ orientation, this would indicate that kinematic features were used as the basis for the judgments, since a lack of semantic-level access for INV displays would otherwise lead to a difference in performance between UP and INV displays [10].

A 2 orientation (within-subject) x 2 block-order (between-subject) mixed ANOVA was performed on the accuracy and reaction time (RT, correct responses only) measures for the human-AST data (all subjects were included). The actions were treated as random variables. The RT results are presented in Table I. The main effect of orientation was not significant (F(1,11) = 1.197, $\eta_p^2$ = 0.107, p = 0.299), i.e., there was no performance difference between UP and INV action stimuli. The main effect of block-order was also not significant (F(1,11) = 2.175, $\eta_p^2$ = 0.179, p = 0.171). There was, however, a significant interaction effect (F(1,11) = 11.585, $\eta_p^2$ = 0.537, p = 0.007). The significant difference leading to the interaction effect consists of faster responses for UP displays (M = 1.675s, SD = 0.246) when presented after INV displays (M = 1.982s, SD = 0.217); t(10) = 2.31, p = 0.043. Further analyses between UP and INV did not show any significant differences, with no simple main effects for orientation (p > 0.05). Regarding accuracy, human participants performed equally well for both UP (M = 92.9%, SD = 3.29) and INV (M = 92.4%, SD = 3.65) conditions, with no significant main effects or interaction effect.

IV-D Discussion

From the accuracy measure, both the model and human participants performed reliably well with no significant difference. However, the model has certain drawbacks compared to participants in terms of overall performance, i.e., the model has significantly more false-hits and a significantly greater selection bias. That said, the observed differences come from a small set of action classifiers and target actions. Here we examine those cases to look for a possible cause.

For the selection-biases, the action classifiers (matrix M) for OpenBottle (57.41%), Cut (49.88%), and Spread (26.74%) (in order of decreasing measure) tend to get selected instead of the correct one when they are pitted against the correct classifier for another target action. These actions contribute the largest number of kinematic primitives (atoms) to the dictionary, so in a way, they contain most of the primitives that make up the sub-movements of all the 19 actions. These actions thereby also correspond to other actions whose sub-movements populate different atoms. Hence they have more chances of being confused with other actions, leading to a high selection-bias. This also explains why the action classifiers with a high selection-bias also have higher accuracy, as they have sufficient primitives to create a strong representation of their own action.

Concerning the false-hits, target actions (matrix M) such as Lemon (54.28%), Pouring (49.54%), and Pestare (43.75%) (in order of decreasing measure) got the largest number of false-hits along with low accuracy. These actions show a lack of descriptive capability, i.e., poor representation of the action by the dictionary primitives. In addition, their respective classifiers have a low selection-bias, which also points towards a lack of sufficient kinematic primitives. False-hits for these actions may result from the classifiers’ training process, in that the necessary and sufficient primitives were not extracted properly for the dictionary. Nevertheless, further studies will be needed to confirm these considerations, specifically whether a) the model performance can be improved by increasing the number of dictionary atoms (K), and b) the training can be improved with better action videos, i.e., videos with longer temporal sequences or different viewpoints.

Regarding the low selection-bias and low false-hits for humans, a possibility is that they were relying on action semantics to aid their judgment. In AST, we probed for implicit access to action semantics through orientation manipulation, with no difference. These results rule out semantic level access for UP displays. To further affirm that the participants had no idea what the actions were (at least to the point to aid them in AST), we conducted Experiment 2 to test for explicit access to action semantics.

V Experiment 2

This experiment addresses the third question on whether the human judgments in AST were based solely on the kinematic features of the actions. A five-alternative forced-choice AIT is presented to human participants (no participants from Experiment 1), where they had to identify the displayed action from a list of five action labels.

V-A Human - Action identification task

V-A1 Participants

Fifty-four Mechanical Turk workers (33 males, mean age of 37.33 years, age range 26 to 73 years) with normal (or corrected) vision and fluent in English participated. They were informed about the task and provided informed consent for their participation. Participants received monetary compensation of $2.50 for their participation time. The experiment was carried out in accordance with the National Ethics Law and the World Medical Association Declaration of Helsinki.

V-A2 Stimuli

The trial display consisted of one action PLD at a time followed by five action labels. The PLDs (19 actions) were the same as in Experiment 1: frontal viewpoint played at veridical speed with UP and INV orientation. The stimuli were displayed using Amazon Mechanical Turk with extensions from psiTurk [8] and jsPsych [6].

V-A3 Procedure

Participants performed an AIT where they were shown an action (target) for 4 seconds, after which they had to identify (by mouse click) the target action label from 5 action labels (alternatives) within 10 seconds. The alternatives consisted of the correct label and four labels randomly chosen (from the same pool of 19 action labels) with no repetition. Clicking or failing to respond within 10 seconds led to the next trial (preceded by a fixation cross for 700ms). Participants were informed of the display orientation (UP or INV) prior to the start. Participants were also informed about the PLDs (identically to Experiment 1-human AST). The instructions were on-screen with example displays. After the instructions, a video of a trial was shown (no practice session). There were questionnaires about the difficulty of the task at the end of the experiment.
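The construction of the five alternatives can be sketched as follows. The label list comes from Section II; the function name `make_alternatives`, the fixed seed, and the shuffling of the on-screen order are assumptions made for this illustration.

```python
import random

ACTIONS = ["Carrot", "Cut", "Dish", "Eat", "Eggs", "Lemon", "Mezzaluna",
           "Mixing", "OpenBottle", "Pan", "Pestare", "Pouring", "Reaching",
           "Rolling", "Salad", "Salt", "Spread", "Table", "Transport"]

def make_alternatives(target, labels=ACTIONS, n_alternatives=5, rng=None):
    """Return the correct label plus randomly drawn distractors, without repetition."""
    rng = rng or random.Random(0)
    # Draw the distractors from the remaining labels only.
    distractors = rng.sample([l for l in labels if l != target], n_alternatives - 1)
    options = distractors + [target]
    rng.shuffle(options)  # presentation order is an assumption of this sketch
    return options

options = make_alternatives("Eggs")
chance_level = 1 / len(options)
print(options, chance_level)
```

With five alternatives, guessing yields a chance level of 0.2, i.e., 20%.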

The experimental design is identical to Experiment 1-human AST. The block-orders (UP-INV and INV-UP) were balanced between the subjects, with 29 participants viewing INV-UP. Individual trial orders were randomized for each participant. Each block had 19 trials, where each trial presented one of the 19 actions; the total number of trials per participant was 38.

V-B Results

We had a selection criterion where the participant’s mean RT should exceed 2 seconds; this was to ensure that the participants diligently performed the task. Therefore 14 participants were excluded, and data from 40 were taken for the analysis. Fig. 5 shows the accuracy% (for correct identification) and the selection bias%. To confirm humans’ reliance on kinematic features for their similarity judgments – we had to rule out explicit semantic level access for the PLDs. If participants perform poorly in identifying the PLDs, irrespective of the display orientation, this would strongly suggest limited semantic level access.

Fig. 5: Accuracy (%) and selection-bias (%) for Experiment 2

The overall accuracy (M = 37.85%, SD = 14.17) indicates poor performance, with a mean selection-bias of 15.35% (SD = 3.12). Participants performed poorly for both UP displays (M = 38.68%, SD = 15.61) and INV displays (M = 35.92%, SD = 16.25). A 2 orientation (within-subject) x 2 block-order (between-subject) mixed ANOVA was performed on the accuracy to check for an inversion effect. The actions were treated as random variables. There was no significant main effect of orientation (F(1,39) = 0.966, $\eta_p^2$ = 0.025, p = 0.332), indicating no performance difference between UP and INV action stimuli. The main effect of block-order was also not significant (F(1,39) = 1.807, $\eta_p^2$ = 0.045, p = 0.187). There was, however, a significant interaction effect (F(1,39) = 6.152, $\eta_p^2$ = 0.139, p = 0.018). The significant difference leading to the interaction effect consists of lower accuracy for INV displays (M = 29.74%, SD = 14.01) when presented after UP displays (M = 42.11%, SD = 16.29); t(38) = 2.57, p = 0.014.

V-C Discussion

Experiment 2 shows a poor overall accuracy (%), indicating that the participants had difficulty identifying the actions from the displayed PLDs. Although most of the actions were identified above chance level (i.e., 20%, with 5 options), very few actions had a relatively high accuracy, such as Transport = 69%, Reaching = 50%, and Table = 50%. Despite the poor accuracy, there was no particular selection-bias pattern. The kinematic information within the PLDs may not be enough for the participants to recognize the action and choose the correct action labels, which may also explain why they did not show any particular selection preference.

Observing the results of AIT in light of AST, the absence of an inversion effect in both tasks and the poor accuracy in AIT indicate that the participants were not relying on semantics in the AST. Hence we show that humans did rely mainly on the kinematic features of the actions to perform AST, similar to the model.

VI General Discussion

In this work, we compared a computational model’s performance with human judgments on a common task in order to better understand the visual processing of action similarities. To this purpose, we designed a similarity judgment task using the Multimodal Cooking Actions (MoCA) dataset and considered different research questions on the reliability of the computational model and its similarity with human observers’ choices.

Overall, both the model and human participants could reliably identify whether given actions were the same or not, which indicates that the model and humans might be using similar information – an aspect which is the objective of our current, deeper investigation.

In our first experiment, human performance was better than the model in terms of low selection-bias and low false-hits. Given the very simple description adopted in the model, which is entirely based on low-level kinematic features, with no integration of information over time, this dissimilarity in performance is relatively limited. To ensure that this difference was not mainly due to the fact that humans exploit action semantics to aid their judgment, we performed the same experiment with INV stimuli, and we conducted an action identification study. The results of both analyses indicated that it is unlikely that semantics had been used.

An aspect deserving further attention is the difference in the type of information provided to the computational model and to humans, which may partially cause the differences in performance. To address this issue, we are currently performing an investigation in which the same computational method is applied to motion capture data. Additionally, further investigation is needed to understand whether humans utilize kinematic primitives to judge action similarities and, if so, how they do it and whether they do it in the same manner as the model.

The current work provides an insight into the potential mechanisms supporting action similarity detection in humans, providing a pathway towards implementing similar models in machines. The approach has a developmental inspiration, in that it builds upon an existing model of newborns’ ability (biological motion detection [27]) to assess how far such a simple representation allows one to go in terms of a novel, more complex skill such as the detection of action similarity. It is important to note that progressive development could continue from there toward more complex social competences. In fact, for human beings detecting action similarity plays a fundamental role in imitation. In particular, according to the similarity model [9], kinematic similarity increases the predictability of the action. Imitation, in turn, supports the development of action understanding. For instance, several researchers have suggested that the experience of being imitated is crucial in the development of the Mirror Neurons System (e.g., [4, 12]). In this context, the child’s ability to judge the kinematic similarity between her and her caregiver’s actions would support the child’s ability to mimic, a further step towards action understanding.

In a similar vein, imitation has also been widely investigated in robotics (e.g., [33, 18, 2, 3]) and bears important implications for the domain of learning from demonstration [1]. For this application, the ability to detect action similarity and to perform actions that closely resemble those of the human partner could increase the intuitiveness and efficacy of the interaction.


This work has been partially carried out at the Machine Learning Genoa (MaLGa) center, Università di Genova (IT).

Publication Note

This paper is a pre-publication draft of the contribution to appear as part of Proceedings of the 10th Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob 2020), 26-30th October, 2020, Valparaíso, Chile.


  • [1] B. Argall, S. Chernova, M. Veloso, and B. Browning (2009) A survey of robot learning from demonstration. Robotics and Autonomous Systems 57, pp. 469–483. Cited by: §VI.
  • [2] E. A. Billing, T. Hellström, and L. E. Janlert (2011) Predictive learning from demonstration. In Agents and Artificial Intelligence: Second International Conference, ICAART 2010, Valencia, Spain, January 22-24, 2010. Revised Selected Papers, pp. 186–200. Cited by: §VI.
  • [3] E. A. Billing, T. Hellström, and L. Janlert (2010) Behavior recognition for learning from demonstration. In Proceedings of IEEE International Conference on Robotics and Automation, pp. 866–872. Cited by: §VI.
  • [4] C. Catmur, V. Walsh, and C. Heyes (2009) Associative sequence learning: the role of experience in the development of imitation and the mirror system. Philosophical Transactions of the Royal Society B 364 (1528), pp. 2369–2380. Cited by: §VI.
  • [5] C. Catmur and C. Heyes (2013) Is it what you do, or when you do it? the roles of contingency and similarity in pro-social effects of imitation. Cognitive Science 37 (8), pp. 1541–1552. Cited by: §I.
  • [6] J. R. De Leeuw (2015) JsPsych: a javascript library for creating behavioral experiments in a web browser. Behavior research methods 47 (1), pp. 1–12. Cited by: §V-A2.
  • [7] M. Giese and M. Lappe (2002) Measurement of generalization fields for the recognition of biological motion. Vision research 42 (15), pp. 1847–1858. Cited by: §I.
  • [8] T. M. Gureckis, J. Martin, J. McDonnell, A. S. Rich, D. Markant, A. Coenen, D. Halpern, J. B. Hamrick, and P. Chan (2016) PsiTurk: an open-source framework for conducting replicable behavioral experiments online. Behavior research methods 48 (3), pp. 829–842. Cited by: §V-A2.
  • [9] J. Hale and A. F. d. C. Hamilton (2016) Cognitive mechanisms for responding to mimicry from others. Neuroscience & Biobehavioral Reviews 63, pp. 106–123. Cited by: §VI.
  • [10] P. E. Hemeren and S. Thill (2011) Deriving motor primitives through action segmentation. Frontiers in psychology 1, pp. 243. Cited by: §I, §IV-C.
  • [11] P. E. Hemeren (2008) Mind in action. Lund University Cognitive Studies 140. Cited by: §I.
  • [12] S. S. Jones (2006) Infants learn to imitate by being imitated. In Proceedings of the International Conference on Development and Learning: The Tenth International Conference on Development and Learning, Cited by: §VI.
  • [13] M. Kleiner, D. Brainard, and D. Pelli (2007) What’s new in psychtoolbox-3?. Cited by: §IV-B2.
  • [14] O. Kliper-Gross, T. Hassner, and L. Wolf (2011) The action similarity labeling challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (3), pp. 615–621. Cited by: §I.
  • [15] Y. Kong and Y. Fu (2018) Human action recognition and prediction: a survey. arXiv preprint arXiv:1806.11230. Cited by: §I.
  • [16] D. Kulić, D. Kragic, and V. Krüger (2011) Learning action primitives. In Visual analysis of humans, pp. 333–353. Cited by: §I.
  • [17] D. Malafronte, G. Goyal, A. Vignolo, F. Odone, and N. Noceti (2017) Investigating the use of space-time primitives to understand human movements. In International Conference on Image Analysis and Processing, pp. 40–50. Cited by: §II.
  • [18] Y. Nagai, Y. Kawai, and M. Asada (2011) Emergence of mirror neuron system: immature vision leads to self-other correspondence. In 2011 IEEE International Conference on Development and Learning (ICDL), Vol. 2, pp. 1–6. Cited by: §VI.
  • [19] J. Qin, L. Liu, Z. Zhang, Y. Wang, and L. Shao (2015) Compressive sequential learning for action similarity labeling. IEEE Transactions on Image Processing 25 (2), pp. 756–769. Cited by: §I.
  • [20] F. Rea, A. Vignolo, A. Sciutti, and N. Noceti (2019) Human motion understanding for selecting action timing in collaborative human-robot interaction. Frontiers in Robotics and AI. Cited by: Fig. 2, §III.
  • [21] M. J. Richardson and T. Flash (2002) Comparing smooth arm movements with the two-thirds power law and the related segmented-control hypothesis. Journal of neuroscience 22 (18), pp. 8201–8211. Cited by: §III.
  • [22] R. Sifre, L. Olson, S. Gillespie, A. Klin, W. Jones, and S. Shultz (2018) A longitudinal investigation of preferential attention to biological motion in 2- to 24-month-old infants. Scientific reports 8 (1), pp. 1–10. Cited by: §I.
  • [23] F. Simion, L. Regolin, and H. Bulf (2008) A predisposition for biological motion in the newborn baby. Proceedings of the National Academy of Sciences 105 (2), pp. 809–813. Cited by: §I, §III.
  • [24] A. Tacchetti, P. S. Mallapragada, M. Santoro, and L. Rosasco (2012) GURLS: a toolbox for regularized least squares learning. Cited by: §IV-A2.
  • [25] B. Tversky (2019) Mind in motion: how action shapes thought. Hachette UK. Cited by: §I.
  • [26] J. J. van Boxtel and H. Lu (2013) A biological motion toolbox for reading, displaying, and manipulating motion capture data in research settings. Journal of vision 13 (12), pp. 7–7. Cited by: §IV-B2.
  • [27] A. Vignolo, N. Noceti, F. Rea, A. Sciutti, F. Odone, and G. Sandini (2017) Detecting biological motion for human-robot interaction: a link between perception and action. Frontiers in Robotics and AI. Cited by: §III, §VI.
  • [28] A. Vignolo, N. Noceti, A. Sciutti, F. Odone, and G. Sandini (2020) Learning dictionaries of kinematic primitives for action classification. In International Conference on Pattern Recognition (ICPR). Cited by: §II, Fig. 2, §III.
  • [29] A. Vignolo, N. Noceti, A. Sciutti, F. Rea, F. Odone, and G. Sandini (2016) The complexity of biological motion. In 2016 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 66–71. Cited by: §I, §III, §III.
  • [30] A. Vignolo, F. Rea, N. Noceti, A. Sciutti, F. Odone, and G. Sandini (2016) Biological movement detector enhances the attentive skills of humanoid robot icub. In IEEE-RAS International Conference on Humanoid Robots, Cited by: §III.
  • [31] P. Viviani and N. Stucchi (1992) Biological movements look uniform: evidence of motor-perceptual interactions. Journal of experimental psychology: Human perception and performance 18 (3), pp. 603. Cited by: §III.
  • [32] C. E. Watson and L. J. Buxbaum (2014) Uncovering the architecture of action semantics. Journal of Experimental Psychology: Human Perception and Performance 40 (5), pp. 1832. Cited by: §I.
  • [33] K. Yasuo, Y. Yasuaki, I. Masayuki, and I. Hirochika (2003) From visuo-motor self learning to early imitation-a neural architecture for humanoid learning. 2003 IEEE International Conference on Robotics and Automation 3, pp. 3132–3139 vol.3. Cited by: §VI.
  • [34] G. Yovel and A. J. O’Toole (2016) Recognizing people in motion. Trends in cognitive sciences 20 (5), pp. 383–395. Cited by: §I.