Human and Machine Action Prediction Independent of Object Information

Predicting other people's actions is key to successful social interactions, enabling us to adjust our own behavior to the consequences of others' future actions. Studies on action recognition have focused on the importance of individual visual features of the objects involved in an action and its context. Humans, however, recognize actions on unknown objects or even when objects are imagined (pantomime). Other cues must thus compensate for the lack of recognizable visual object features. Here, we focus on the role of inter-object relations that change during an action. We designed a virtual reality setup and tested recognition speed for 10 different manipulation actions on 50 subjects. All objects were abstracted by emulated cubes, so the actions could not be inferred from object information. Instead, subjects had to rely only on the information that comes from the changes in the spatial relations between those cubes. In spite of these constraints, our results show that subjects were able to predict actions after, on average, less than 64% of the action's duration. In addition, we employed a computational model, an enriched Semantic Event Chain (eSEC), incorporating the information of spatial relations, specifically (a) objects' touching/untouching, (b) static spatial relations between objects, and (c) dynamic spatial relations between objects. Trained on the same actions as those observed by the subjects, the model predicted actions even better than humans. Information-theoretical analysis shows that eSECs optimally use individual cues, whereas humans presumably mostly rely on a mixed-cue strategy, which takes longer until recognition. Providing a better cognitive basis of action recognition may, on the one hand, improve our understanding of related human pathologies and, on the other hand, help to build robots for conflict-free human-robot cooperation. Our results open new avenues here.

Action Recognition and Prediction in Humans

Human beings excel at recognizing actions performed by others, and they do so even before the action goal has been effectively achieved [1, 2]. Thus, humans engage in action prediction. During this process, the brain activates a premotor-parietal network [3] that largely overlaps with the networks needed for action execution and action imagery [4]. Although in recent years some progress has been made towards computationally more concrete models of the mechanisms and processes underlying action recognition [5], it remains largely unresolved how the brain accomplishes this complex task. Prediction of actions can rely on different sources of information. The present study focused on the fact that human observers exploit static and dynamic spatial relations between the objects in an action scene. Comparing manipulations of appropriate objects (i.e., normal actions) with manipulations of inappropriate objects (i.e., pantomime), we previously found that brain activity during action observation was largely explained by processing of the actor's movements [6]. As a caveat, this finding may be explained by the particular movement-focused strategy subjects selected in that study, where normal and pantomime actions were presented in intermixed succession. Other studies show that motion features are used by the brain to segment observed actions into meaningful segments and to update internal predictive models of the observed action [7, 8]. Correspondingly, individuals segment actions into consistent, meaningful chunks [9, 10], and intra-individually they do so in a highly consistent manner, despite high inter-individual variability [7]. It has been argued that the objective quality of these chunks is that, within the continuous sequence, breakpoints may convey a higher amount of information than the remainder of the event. Nevertheless, this suggestion remains speculative as long as we do not find a way to objectively quantify the flow of information that the continuous stream of input provides. This objectification is hampered by the fact that time-continuous information is highly variable, with spatial and temporal characteristics differing between action exemplars. Moreover, object information is a confounding factor in natural actions. As exemplars of object classes, individual objects provide information about the possible types of manipulation the observer has learned these objects to be associated with [11, 12, 13]. For instance, knives are mostly used for cutting. Hence, objects can efficiently restrict the number of actions that an action observer expects to occur [11]. Speculatively, humans may use a mixed strategy exploiting object information as well as static and dynamic spatial information, and this strategy may be adapted to current constraints; for instance, static and dynamic spatial information may become more relevant when objects are difficult to recognize.

In the present study, we tested the hypothesis that, in the absence of object and contextual (i.e., room, scene) information, action outcome prediction by human observers can be modeled as exploitation of static and dynamic spatial relations between objects involved in the action.

Action Recognition and Prediction in Machines

Action recognition is a fundamental task in computer vision: recognizing human actions based on the complete action execution in a video. In other words, action recognition is an after-the-fact video analysis task that focuses on the present state [14]. It has been studied for decades and is still a very popular topic due to its wide applications, including human-computer interfaces [15], visual surveillance [16], video indexing [17], intelligent humanoid robots [18], ambient intelligence [19] and more. The actions range from simple human actions in constrained situations [20, 21, 22, 23] up to complex actions in cluttered scenes or in realistic videos [24, 25, 26, 27, 28].

Researchers have made great efforts to create intelligent systems that can recognize complex human actions in cluttered environments. But for a machine, an action in a video is just an array of pixels. Most of the time, these pixel data are mixed with noise, which must be eliminated during a pre-processing stage, for example during pose estimation of moving humans [29]. The machine has no a priori notion of how to convert these pixels into an effective representation and how to infer human actions from that representation. These two problems are referred to as action representation and action classification in action recognition, and many approaches [30, 31, 32, 33] have been proposed to address them [14]. Moreover, psychological studies on human behavior and reasoning [34] have pointed out the consequences of these two problems for both understanding human cognition and intelligent-system research. In this regard, there are methods that attempt to infer the type of action according to its (cognitive) consequences on the scene [35]. The majority of existing methods for human action recognition focus on low-level spatio-temporal features, which can be brittle, for example due to intra-class variability arising from different humans performing the same action [36]. Approaches that use higher-level features [37, 38, 39] seem to be less affected by this. In this context, Ramirez-Amaro and coworkers have considered human movement recognition from a semantic point of view [40].

Different from recognition, action prediction is a before-the-fact video understanding task that focuses on the future state. In many applications (e.g., autonomous navigation, surveillance, health care), intelligent machines do not have the opportunity to wait until an action has finished before reacting. Two examples make this clear: 1) driver action prediction to prevent accidents, or 2) prediction of a handicapped person's looming fall and proactive support by a robot. In these two examples, post-hoc recognition will usually not help, but action prediction may prevent accidents.

For prediction, variability [41] and incompleteness of the action execution [42] amplify the known problems of action recognition. After all, prediction is just "recognition earlier in time". This topic is commonly divided into three groups: 1) early action classification [43, 44], 2) intention prediction [45, 46] and 3) motion trajectory prediction [47, 48]. While the first group has been extensively utilized to forecast short sequences of human motion, the second one remains elusive. Recently, Tanke et al. [49] approximated a person's intention by a symbolic representation and exploited it in conjunction with the observed human poses. These combined features are then used to predict the ongoing action.

Manipulation prediction, which is the topic of this paper, can be understood as a subset of the above-discussed, more general problem of human action prediction and mostly falls into groups 1 and 3. Fermüller et al. have developed a recurrent neural network (RNN) based method for manipulation action prediction [39]. They captured the hand movements before and after contact with the objects during the preparation and execution of actions, using patches around the hand as inputs to the network. Moreover, there are studies on hand motion trajectory recognition that can be extended to prediction as well, because they work in a causal way. For example, [50, 51] use a hidden Markov model based continuous gesture recognition system utilizing hand motion trajectories, and we have extended their methods from recognition to prediction in [52].

A central problem in all of these approaches is that action recognition (and prediction) heavily relies on time-continuous information (e.g. trajectories, movie sequences, etc.). This type of information, however, is highly variable, including spatial and temporal variations between action exemplars.

In an earlier study, we introduced a novel method for before-the-fact action recognition from incomplete action videos. We employed so-called enriched semantic event chains (eSEC) [52], which are a strictly formalized and objective way to describe changes of static and dynamic spatial relations between the objects involved in the scene. This approach allowed us to determine exactly the discrete points in time at which combinations of spatial properties between objects in an action scene undergo significant change. Hence, this way we could disambiguate the ongoing action with regard to the intended goal. While the role of spatio-temporal changes along an action sequence had been considered important for action recognition before [53, 54, 55, 56], here we provide for the first time an analysis showing that humans employ a mixed strategy for accumulating evidence about the observed action. This is different from machine vision, which can make use of the first unambiguous cue for recognition.

General Experimental Protocols and Methods

We designed a set of ten action scenarios in a virtual reality (VR) experiment to compare human and machine predictive power for manipulation actions. Fig. 1 illustrates the steps taken to this end. In the following we briefly describe our experimental protocols; for details see Methods.

Figure 1: Experimental schedule

We employed ten actions: chop, cut, hide, uncover, put on top, take down, lay, push, shake, and stir. All objects, including hand and tools, were represented by cubes of variable size and color to enforce object-agnostic (except for the hand) action recognition. The hand was always shown as a red cube (Fig. 2). Scene arrangements and object trajectories varied in order to generate a wide diversity in the samples of each manipulation action type. For each of the ten action types, 30 sample scenarios were recorded by human demonstration. All action scenes included different arrangements of several cubes (including distractor cubes) to ensure that videos were indistinguishable at the beginning.

Human Data Recording

Forty-nine right-handed participants (20-68 yrs, mean 31.69 yrs, SD = 9.86, 14 female) took part in the experiment. One additional participant completed the experiments but was excluded from further analyses due to an error rate of 14.7%, which classified this participant as an outlier. Prior to testing, written informed consent was obtained from all participants.

Participants were given a detailed explanation regarding the stimuli and the task of the experiment. They were then familiarized with the VR system and shown how to deliver their responses during the experiment. The participants’ task was to indicate as quickly as possible which action was currently presented.

Every experiment started with a short training phase in which one example of each action was presented. During this demo version, the name of the currently presented action was highlighted in green on the background board (see Fig. 2 (a)). During the test stage, a total of 300 action videos (trials) were shown to each participant, in which the red hand-cube entered the scene and performed an action (Fig. 2 (b)). When the action was recognized and the participant pressed the motion controller's button, the moment of this button press was recorded as the response time. Concurrently, all cubes disappeared from the scene so that no post-decision cogitation about the action was possible. At the same time, the controller was marked with a red pointer added to its front. Hovering over the action of choice and pressing the motion controller's button again recorded the actual choice and advanced the experiment to the next trial (Fig. 2 (c)). Participants were allowed to rest during the experiment and continued it afterwards. Since participants mostly proceeded quickly to the next trial, the overall duration of the experimental session usually did not exceed one hour. All experimental data were analysed using different statistical methods (see Methods section).

Figure 2: The VR experiment. (a) Training stage: the "put on top" action. (b) Testing stage: action scene playing. (c) Testing stage: selecting the action type.

Machine Action Prediction

The enriched semantic event chain (eSEC) framework used for machine prediction of actions makes use of object-object relations. We defined three types of spatial relations in our framework: 1) "Touching" (T) and "Non-touching" (N) relations (TNR), 2) "Static Spatial Relations" (SSR) and 3) "Dynamic Spatial Relations" (DSR).

Touching and non-touching relations between two objects were defined according to collision or non-collision between their representative cubes.

Static Spatial Relations (SSR) describe the relative position of two objects in space. For this type of relation, no data from previous frames are needed; they can be determined in each image separately. Examples are "Above" (Ab), "Below" (Be), "Around" (Ar), etc. For the complete list of SSRs see Methods.

Dynamic Spatial Relations (DSR) describe the relations between object movements (when either or both objects move or are fixed). Here, different from SSR, some information from previous frames (e.g., distance-related parameters) is needed for each pair of objects. Examples are "Moving Together" (MT), "Halting Together" (HT), etc. (for the complete list, see Methods).

Importantly, eSECs do not make use of any real object information. Objects remain abstracted (as in the VR experiments). We defined five abstract object types that play an essential role in any manipulation action and call them the fundamental objects (see Table 1). Fundamental objects 1, 2, and 3 obtain their role in the course of an action: they are numbered according to the order in which they encounter transitions between the relations N (non-touching) and T (touching). For example, fundamental object "1" obtains its role by being the first to encounter a change in touching (usually this is the object first touched by the hand).

Note that not all fundamental objects defined in Table 1 are present in every specific action. Only the hand, the ground and fundamental object 1 are necessarily present in all analyzed actions. The action-driven "birth" of objects 1, 2, and 3 automatically ensures that irrelevant (distractor) objects are always ignored by the eSEC analysis (see the sketch following Table 1).

Thus, the maximal number of relations that had to be analyzed for an action was set by the defined relations between fundamental objects: given five object roles, there were $\binom{5}{2} = 10$ possible object pairs, leading to ten relations for each relation type (T/N, SSR, DSR) and resulting in 30 relations in total.


Object | Definition | Remarks
Hand | The object that performs an action. | Not touching anything at the beginning and at the end of the action. It touches at least one object during an action.
Ground | The object that supports all other objects in the scene except the hand. | It is extracted as a ground plane in a visual scene.
1 | The object that is the first to obtain a change in its T/N relations. | Trivially, the first transition will always be a touch by the hand.
2 | The object that is the second to obtain a change in its T/N relations. | Either a T→N or an N→T relational change can happen.
3 | The object that is the third to obtain a change in its T/N relations. | Either a T→N or an N→T relational change can happen.

Table 1: Definition of the fundamental objects during a manipulation action [52].
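The following is a hypothetical sketch (not the authors' implementation) of how the roles of fundamental objects 1, 2 and 3 follow from the order of T/N transitions, and why distractor cubes drop out automatically; the input format and object identifiers are assumptions.

```python
# `tn_frames` is an assumed input: one set per frame containing the currently
# touching object pairs as consistently ordered tuples, e.g. ("hand", "cube_4").

def assign_fundamental_roles(tn_frames):
    """Return a dict mapping object ids to roles '1', '2', '3' in the order
    in which they first undergo a touching/un-touching change."""
    roles = {}
    next_role = 1
    prev = tn_frames[0]
    for current in tn_frames[1:]:
        changed = (current - prev) | (prev - current)   # touch and un-touch events
        for a, b in sorted(changed):
            for obj in (a, b):
                if obj in ("hand", "ground") or obj in roles:
                    continue            # hand and ground are pre-specified roles
                if next_role <= 3:
                    roles[obj] = str(next_role)
                    next_role += 1
        prev = current
    return roles   # distractor cubes never change a T/N relation and are ignored
```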

The eSEC Matrix as an Action Descriptor for Machine Prediction

The enriched Semantic Event Chain (eSEC) framework is inspired by the original Semantic Event Chain (SEC) approach [37]. The original SEC checks only touching (T) and non-touching (N) relations between each pair of fundamental objects in all frames of a manipulation scene and focuses on transitions (changes) in these relations. In the eSEC, the full set of relations (see Methods section, Fig. 7) is embedded into a matrix-form representation, showing how the set of spatial relations changes throughout the action [52]. Fig. 3 shows the eSEC matrix for a "put on top" action and demonstrates how the set of all the different relations changes throughout this action.

Figure 3: Description of a “Put on top” action in the eSEC framework with possible relation graph between all objects. Only hand and ground are pre-specified, object 1 is the one first touched by the hand, object 2 the next where a touching/un-touching (T/N) change happens and object 3 in this case remains undefined (U) in all rows as there are no more T/N changes. This leads to the graph on the top left that shows all possible relations. Abbreviations in the eSEC are: U: undefined, T: touching, N: non-touching, O: very far (static), Q: very far (dynamic), Ab: above, To: top, Ar: around, ArT: around with touch, S: stable, HT: halt together, MT: move together, MA: moving apart, GC: getting close. Note that the two leftmost columns are identical for all actions as they indicate the starting situation before any action. The top, middle and bottom ten rows of the matrix indicate TNR, SSR and DSR between each pair of fundamental objects in a “put on top” action, respectively.

Measuring Predictive Power

Importantly, since action types differed in duration, we assessed predictive performance not in absolute time, but relative to the length of each video. Using a probabilistic approach, we assessed at which time point any action would be predictable from its eSEC. To this end, we divided our data (eSEC tables) into training and test samples and performed a column-by-column comparison. That is, similarity values were derived by comparing each test action's eSEC to every member of the training sample. We defined an action as "predicted" at the column where the average similarity to one class remained high while the similarity to all other classes was low. The similarity measure between two eSEC matrices is explained in the Methods section.

The eSEC column at which prediction happens is called the "prediction column". That way, the event-based predictive power is defined as

$P = \frac{C - c_p}{C},$  (1)

where $c_p$ is the "prediction column" and $C$ is the total number of columns in the action's eSEC table.

The event-based predictive power thus gives the ratio of the number of events spared after prediction to the total number of events. Obviously, the lower the value of this measure, the lower the predictive power.
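As an illustration, here is a minimal sketch of this decision rule together with Eq. (1); the per-class similarity traces and the decision margin are assumptions of the sketch, not values taken from the paper.

```python
def prediction_column(similarities, margin=0.2):
    """similarities: dict mapping action class -> list of similarities per column.
    Returns the first 1-based column at which one class clearly dominates."""
    n_cols = len(next(iter(similarities.values())))
    for c in range(n_cols):
        column = {action: trace[c] for action, trace in similarities.items()}
        best = max(column, key=column.get)
        others = [v for action, v in column.items() if action != best]
        if column[best] - max(others) >= margin:
            return c + 1
    return None  # the action could not be disambiguated before it ended

def predictive_power(pred_col, total_cols):
    """Event-based predictive power, Eq. (1): fraction of columns spared."""
    return (total_cols - pred_col) / total_cols
```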

Results

In the human reaction time experiments, response times that exceeded the length of the action video were treated as time-outs and corresponding trials (13 out of 14700) were excluded from further analyses.

Regarding learning effects, i.e., possible trends in performance over the course of an experiment, correlation analyses showed a very small but significant reduction in error rates and a small but significant deterioration of human predictive power.

Human predictive power was analyzed using a repeated-measures ANOVA with action as within-subjects factor. Mauchly's test indicated that the assumption of sphericity was violated; therefore, degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity. The main effect of action was significant. As shown in Fig. 4, predictive power varied strongly between actions. For instance, put and take actions were not correctly classified before most columns of the video (88% and 72%, respectively) had already been presented, whereas cut, stir and uncover needed only about half (48%, 51%, and 52%, respectively) of the video time.

Separate one-sample t-tests per action for human vs. machine predictive power consistently showed higher predictive power for the algorithm (see details in Fig. 4). Predictive power ranged from 14.3% to 62.5% for the machine, whereas human predictive power ranged from 6.2% to 58.3%. On average, the machine spared observation of the remaining 45.6% of the video columns, humans only of the remaining 37%. In half of the actions (take, uncover, hide, push and put on top), this difference reached a very large effect size. Interestingly, the most pronounced differences, not in terms of effect size but in terms of overall sampling time, emerged for actions that were most quickly classified by the algorithm (take, uncover, cut). For take actions, humans sampled twice as many columns (72%) as the optimally performing algorithm (38%).

Figure 4: Mean predictive power of human and machine. t-values and p-values according to the t-tests per action.
Figure 5: Fitting different models to the actions. Abbreviations are shortened to allow encoding combinations by a short "+" annotation: Touch = T = TNR, Static = S = SSR, Dynamic = D = DSR. This leads to the combinations T+S, T+D, S+D and T+S+D; "Overall" refers to treating all eSEC columns independently of their individual information contents (see Methods).
Figure 6: Comparison of human (red bars) and machine (green bars) predictive performance. Blue bars indicate the relative amount (percentage) of action steps elapsed per action, before the TNR (light blue), DSR (blue) or SSR (dark blue) model provided maximal local informational gain, enabling a secure prediction of the respective action. For instance, the 5th eSEC-column of the overall 13 eSEC-columns describing the cut action provided a unique description in terms of DSR. That is, after around 38% of these action’s columns, the cut action could be predicted on the basis of DSR information, and this is what the algorithm did, as indicated by the green bar of equal length. In contrast, humans correctly predicted the cut action at the 6th (mean 6.26) column, corresponding to 48% of this action, exploiting both dynamic and static spatial information (cf. Figure 5 for this outcome).

Logistic regressions revealed significant results for each of the eight models for each action; all models tested significantly against their null models. Figure 5 shows McFadden's pseudo-R² and the BIC per model and action. Shaded cells indicate which model best fits human action prediction behavior based on the BIC. Deploying the AIC (Akaike Information Criterion) yielded similar results.

As to the type of information exploited for prediction, we found marked differences between human and machine strategies. The machine's behavior was perfectly predicted by the biggest local gain in information, i.e., by the transition into the column where the action code became unique for the respective action (Fig. 6). For instance, when dynamic information was the first to provide perfect disambiguation between competing action models, the algorithm always followed this cue immediately (this was the case for cut and hide). Likewise, static information ruled machine behavior for push and lay, reflecting the earliest possible point of certain prediction in these actions. Human suboptimal behavior was nicely reflected by the fact (see Fig. 5) that for cut and hide, subjects considered a combination of both dynamic and static spatial information (where they should have focused on dynamic information); the same strategy was applied to push and lay, where subjects would have done better to follow static information only.

Notably, when all three types of information were equally beneficial (take, uncover, shake, and put), human performance was best modeled by a combination of all three types of information, with the exception of chop, where subjects followed static spatial information. A post-hoc paired-sample t-test showed a significant effect of informational difference: the z-transformed difference between mean human and machine predictive power was clearly larger for informationally indifferent actions (M = 2.1) than for informationally different actions (M = 1.1). Expressed in non-transformed values, humans showed 12% less predictive power than the algorithm for the informationally indifferent action categories, but only 5% for the informationally different ones.

Discussion

With the comprehensive entry of robots into various aspects of human life, it becomes necessary to pay more attention to establishing appropriate interactions between humans and robots. These interactions must take place in a cognitive context that is "understandable" to both humans and machines. To improve on this, Andriella et al. [57] have recently proposed a cognitive system framework for brain-training exercises based on human-robot interaction. One of the central challenges for proper interaction is the need to identify the actions of the human and make them "understandable" for the machine. This can be achieved by equipping the robot with an action encoding that allows action recognition and prediction.

Humans predict actions based on different sources of information, but we know only little about how flexibly these sources can be exploited when some of them are noisy or unavailable. In the present study, we modeled human action prediction by an algorithm that was solely based on spatial information in terms of touching and untouching events between objects, their static spatial and their dynamic spatial relations. This so-called enriched semantic event chain (eSEC) framework, which has been derived from older "grammatical" approaches towards action encoding [37, 53, 54, 56, 58], has significant predictive power, because each eSEC column captures highly indicative changes in the spatial and temporal relationships between objects and hand in an action. Thus, as soon as columns differ from one case to another, actions can be discriminated and thereby become predictable. In an older study [52], we compared the predictive power of the eSEC framework with a Hidden Markov Model (HMM) algorithm, which represents the current state of the art in computer vision-based action recognition [50, 51]. This was done on two real data sets, and we found that the average predictive power of eSECs is slightly above 60%, while it was clearly lower for the HMM-based approach.

In the present study, we tested how optimal human action prediction is when only spatial information is available. To this end, we used action videos that were highly abstracted dynamic displays containing cubic placeholders for all objects, including hands, so that any information about real-world objects, environment, context, situation or actor was completely eliminated.

For this kind of action videos, we found that the algorithm was significantly faster than humans at predicting actions, i.e., at assigning the ongoing video to one out of ten basic action categories before the video was completed. This difference was significant for each single action category. On average, humans achieved about 91% of the predictive power of the machine. Based on an information-theoretic approach, further analyses revealed that humans did not select the optimal strategy to disambiguate actions as fast as possible: while the machine reliably detected the earliest occurrence of disambiguation between the ongoing action and all other action categories, as indicated by the highest gain in information at the respective action step, human subjects did so in only half of the action categories. Instead, humans consistently applied a mixed strategy, relying concurrently on both dynamic and static spatial information in 8 out of 10 action types.

This strategy was particularly disadvantageous for actions that were equally well predictable based on either static (N/T, SSR) or dynamic (DSR) information. Particularly in these - one may say - informationally indifferent cases, humans were significantly biased towards prolonged decisions: here, they showed 12% less predictive power than the algorithm as compared to 5% for the informationally different actions.

In sum, the human bias towards mixed strategies, combining static and dynamic spatial information, and towards prolonged decisions for informationally indifferent action categories leads to overall poorer human predictive power. In principle, both effects may result from the same general heuristic of human action observers: to always exploit several sources of information rather than relying on the first available source only. In other words, individuals seem to prioritize correct over fast classification of observed actions.

Our study was restricted to ten possible actions, whereas in everyday life the number of potentially observable actions is much higher, resulting in higher uncertainty and stronger competition among these potential actions. Speculatively, the human bias to employ mixed exploitation strategies may be better adapted to disambiguating actions among this broader range of action classes. Future studies will have to enlarge the sample of concurrently investigated actions to test this assumption and to increase overall ecological validity.

Limitations

We controlled for a number of additional sources of information that humans are known to exploit in natural actions. In particular, object information provides an efficient restriction on to-be-expected manipulations [11, 59, 60, 61, 62, 63]. It remains to be tested how non-spatial object information interacts with the exploitation of static and dynamic spatial relations between objects involved in actions. Moreover, actions occur in certain contexts and environments that further restrict the observer's expectations, for instance with regard to certain classes of actions [2, 64, 65, 66, 67].

Furthermore, our approach did not take into account all the dynamic and static spatial information provided by human action. For instance, we restricted dynamic spatial information to between-object change, whereas in natural action we would also register dynamic within-hand change. Thus, actors shape their hands to fit the to-be-grasped object already when starting to reach out for it [68, 69], providing a valuable pointer to potentially upcoming manipulations and goals [70]. Likewise, gaze information plays a role in natural action observation [71], as an actor's gaze towards an object draws the observer's attention to the same object [72] and, hence, to potentially upcoming targets of the action.

Concluding remarks

Describing actions with a grammatical structure [53, 55, 54] such as eSEC [38, 52] yields a simple and fast framework for recognition and prediction in the presence of unknown objects and noise. This robustness lends itself to an intriguing hypothesis, namely to what degree such an event-based framework might help young infants to bootstrap action knowledge in view of the vast number of objects that they have never encountered before. In terms of spatial relations (as implemented in the current eSEC framework), the complexity of an action is far smaller than the complexity of the realm of objects with which an action can be performed, even when considering only a typical baby's environment. Clearly, this approach has proven beneficial for robotic applications [73], and we plan to extend it to complex actions and interactions between several agents to examine the exploitation and exploration of predictive information during cooperation and competition.

Detailed Methods

Virtual Reality System

The main components of our VR system are the computing hardware (for 3D data processing), a head-mounted display (for showing the VR content) and motion controllers (as input devices). A Vive VR headset and motion controllers, released by HTC in April 2016, with a resolution of 1080 x 1200 pixels per eye, were used as our VR system. The main advantage of this headset is its "room-scale" tracking, which provides precise 3D motion tracking between two infrared base stations and creates the opportunity to record and review actions for experiments on a larger scale of up to 5 meters diagonally. The Unreal Engine 4 (UE4), a high-performance game engine developed by Epic Games, was chosen as the game-engine basis of this project. It has built-in support for VR environments and the Vive's motion controllers.

Scenario Recording

In order to make VR movies for the 10 different actions, 30 variants of each action were recorded by two members of the BCCN team (a 23-year-old male undergraduate and a 30-year-old female doctoral student). They implemented the VR platform in C++. The motion controller is the core input component of the VR environment, and a separate function was implemented for each of its buttons. The designed system included three different modes: first, a mode to record new actions for the experiment; second, a mode to review them; and last, the experiment itself. To keep the controls as simple as possible and to avoid a second motion controller without implementing a complex physics system, the recording mode was split into two sub-modes: a single-cube recording mode (for single, mostly static cubes) and a two-cube recording mode (for object manipulation).

Stimuli

Actions were defined as follows:

Chop: The hand-object (hereafter: hand) touches an object (tool), picks up the object from the ground, puts it on another object (target) and starts chopping. When the target object has been divided into two parts, the tool object untouches the pieces of the target object. After that, the hand puts the tool object on the ground, untouches it, and leaves the scene.
Chop scenarios had a mean length of 17.86 s (SD = 3.56, range = 13-27).

Cut: The hand touches an object (tool), picks up the object from the ground, puts it on another object (target) and starts cutting. When the target object was divided into two parts, the tool object untouches the pieces of the target object. After that, the hand puts the tool object on the ground, untouches it, and leaves the scene.
Cut scenarios had a mean length of 19.50 s (SD = 3.13, range = 13-25).

Hide: The hand touches an object (tool), picks up the object from the ground, puts it on another object (target) and starts coming down on the target object until it covers that object thoroughly. Then the hand untouches the tool object and leaves the scene.
Hide scenarios had a mean length of 13.43 s (SD = 2.40, range = 9-20).

Uncover: The hand touches an object (tool), picks up the object from the ground. The second object (target) emerges as the tool object is raised from the ground, because the tool object had hidden the target object. After that, the hand puts the tool object on the ground, untouches it, and leaves the scene.
Uncover scenarios had a mean length of 12.66 s (SD = 3.20, range = 9-21).

Put on top: The hand touches an object, picks up the object from the ground and puts it on another object. After that, the hand untouches the first object and leaves the scene.
Put on top scenarios had a mean length of 10.90 s (SD = 2.006, range = 8-16).

Take down: The hand touches an object that is on another object, picks up the first object from the second object and puts it on the ground. After that, the hand untouches the first object and leaves the scene.
Take down scenarios had a mean length of 10.60 s (SD = 3.04, range = 6-18).

Lay: The hand touches an object on the ground and changes its direction (lays it down) while it remains touching the ground. After that, the hand untouches the object and leaves the scene.
Lay scenarios had a mean length of 11.23 s (SD = 1.79, range = 8-15).

Push: The hand touches an object on the ground and starts pushing it on the ground. After that, the hand untouches the object and leaves the scene.
Push scenarios had a mean length of 12.56 s (SD = 1.73, range = 9-17).

Shake: The hand touches an object, picks up the object from the ground and starts shaking it. Then, the hand puts it back on the ground, untouches it and leaves the scene.
Shake scenarios had a mean length of 12.10 s (SD = 2.05, range = 9-17).

Stir: The hand touches an object (tool), picks up the object from the ground, puts it on another object (target) and starts stirring. After that, the hand puts the tool object on the ground, untouches it, and leaves the scene.
Stir scenarios had a mean length of 20.23 s (SD = 4.67, range = 14-31).

Machine Action Prediction

The methods described in the following are largely identical to those used in [52] and [56].

Spatial Relations

We start with a general description; the details on how to calculate static and dynamic spatial relations are provided below.

1) Touching and non-touching relations (TNR) between two objects were defined according to collision or non-collision between their representative cubes.

2) Static spatial relations (SSR) included: "Above" (Ab), "Below" (Be), "Right" (R), "Left" (L), "Front" (F), "Back" (Ba), "Inside" (In), "Surround" (Sa). Since "Right", "Left", "Front" and "Back" depend on the viewpoint and the directions of the camera axes, we combined them into "Around" (Ar), used when one object was surrounded by another. Moreover, the relations "Above" (Ab), "Below" (Be) and "Around" (Ar) in combination with "Touching" were converted to "Top" (To), "Bottom" (Bo) and "Touching Around" (ArT), respectively, which correspond to the same cases with physical contact. Fig. 7 (a1-a3) shows static spatial relations between two object cubes. If two objects were far from each other or did not have any of the above-mentioned relations, their static relation was considered Null (O). This led to a set of nine static relations in the eSECs: SSR = {Ab, Be, Ar, To, Bo, ArT, In, Sa, O}. The additional relations mentioned above (R, L, F, Ba) are only used to define the relation Ar (around), because these four relations are not viewpoint invariant.

Figure 7: (a) Static Spatial Relations: (a1) Above/Below, (a2) Around, (a3) Inside/Surround. (b) Dynamic Spatial Relations: (b1) Moving Together, (b2) Halting Together, (b3) Fixed-Moving Together, (b4) Getting Close, (b5) Moving Apart, (b6) Stable.

3) Dynamic Spatial Relations (DSR) require the use of frame history from the video. We used a history of 0.5 seconds, which is an estimate of the time a human hand takes to change the relations between objects in manipulation actions. DSRs included the following relations: "Moving Together" (MT), "Halting Together" (HT), "Fixed-Moving Together" (FMT), "Getting Close" (GC), "Moving Apart" (MA) and "Stable" (S). DSRs between two object cubes are shown in Fig. 7 (b1-b6). MT, HT and FMT denote situations in which two objects touch each other while both move in the same direction (MT), both are motionless (HT), or one object is fixed while the other moves on or across it (FMT). Case S denotes that any distance change between the objects remained below a defined threshold (10 cm, cf. below) during the entire action. All these dynamic relation cases are illustrated in Fig. 7 (b). In addition, Q is used as the dynamic relation between two objects when their distance exceeds the defined threshold or when none of the above-defined dynamic relations holds. Therefore, the dynamic relations form a set of seven members: DSR = {MT, HT, FMT, GC, MA, S, Q}.

Finally, whenever an object became "Absent" (hidden) during an action, the symbol A was used to annotate this condition. In addition, the symbol X was used whenever an object was destroyed or lost its primary shape (e.g., in cut or chop actions).

Object Types

An exhaustive description of the five fundamental object types had been given in the main text and shall not be repeated here.

Mathematical Definition of the Spatial Relations

As mentioned above, touching and non-touching relations between two objects are defined according to collision or non-collision between their representative cubes. 3D collision detection is a challenging topic, which has been addressed in [74]. However, because the objects in our study are just cubes, we interpreted the contact of one of the six surfaces of one cube with one of the other cube's surfaces (see Fig. 8) as a touching event, which can be detected easily.

Figure 8: Possible situations in which two cubes touch each other.

For example, in the second situation from the left in Fig. 8, which is shown in more detail in Fig. 9, touching from the side is detected when the facing surfaces of the two cubes coincide while the cubes' extents overlap along the other two axes.

Figure 9: Coordinate details of two cubes touching each other from the side.
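A minimal sketch of such a touching test for two axis-aligned cubes follows; the cube format (min/max coordinates per axis) and the tolerance eps are assumptions of this sketch, since the paper only states that contact between cube surfaces is interpreted as a touching event.

```python
def touching(a, b, eps=1e-3):
    """a, b: dicts with keys 'xmin', 'xmax', 'ymin', 'ymax', 'zmin', 'zmax'."""
    def overlap(lo1, hi1, lo2, hi2):
        return hi1 >= lo2 - eps and hi2 >= lo1 - eps
    # At least one pair of opposing faces must coincide (within eps) ...
    face_contact = any(abs(u - v) <= eps for u, v in [
        (a['xmax'], b['xmin']), (b['xmax'], a['xmin']),
        (a['ymax'], b['ymin']), (b['ymax'], a['ymin']),
        (a['zmax'], b['zmin']), (b['zmax'], a['zmin'])])
    # ... while the cubes overlap (or meet) along all three axes.
    return (face_contact and
            overlap(a['xmin'], a['xmax'], b['xmin'], b['xmax']) and
            overlap(a['ymin'], a['ymax'], b['ymin'], b['ymax']) and
            overlap(a['zmin'], a['zmax'], b['zmin'], b['zmax']))
```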

Moreover, all discussed static and dynamic relations are defined by a set of rules. We start with the rule set for the static spatial relations and then proceed to the dynamic spatial relations. In general, $x^{i}_{min}$, $x^{i}_{max}$, $y^{i}_{min}$, $y^{i}_{max}$, $z^{i}_{min}$ and $z^{i}_{max}$ denote the minimum and maximum coordinates of object cube $i$ along the x, y and z axes, respectively.

Let us define the relation "Left", $L(i,j)$ (object $i$ is to the left of object $j$), to hold if

$x^{i}_{max} \le x^{j}_{min}$  (3)

and the following exception condition holds:

$\big(y^{i}_{max} > y^{j}_{min} \wedge y^{j}_{max} > y^{i}_{min}\big) \wedge \big(z^{i}_{max} > z^{j}_{min} \wedge z^{j}_{max} > z^{i}_{min}\big).$  (4)

The exception condition excludes from the relation "Left" those cases in which the two object cubes do not overlap in altitude (y direction) or in front/back (z direction). Several examples of objects holding the relation $L(i,j)$, with varying size and shift in the y direction, are shown in Fig. 10.

Figure 10: Possible states of the Left relation between two object cubes when size and y positions vary.

The relation "Right", $R(i,j)$, is defined by $x^{i}_{min} \ge x^{j}_{max}$ together with the identical set of exception conditions. The relations Ab, Be, F and Ba are defined in an analogous way: for Ab and Be the emphasis is on the y dimension, while for F and Ba it is on the z dimension.
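The reconstructed conditions of Eqs. (3)-(4) translate directly into code, as in the following sketch; the cube format follows the touching example above and is an assumption.

```python
def left(i, j):
    """True if object cube i is to the left of object cube j."""
    no_y_overlap = i['ymax'] <= j['ymin'] or j['ymax'] <= i['ymin']
    no_z_overlap = i['zmax'] <= j['zmin'] or j['zmax'] <= i['zmin']
    if no_y_overlap or no_z_overlap:      # exception condition, Eq. (4)
        return False
    return i['xmax'] <= j['xmin']         # main condition, Eq. (3)

def right(i, j):
    # "Right" mirrors "Left" and keeps the identical exception conditions.
    return left(j, i)
```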

For the relation "Inside", $In(i,j)$, we use:

$x^{j}_{min} \le x^{i}_{min} \wedge x^{i}_{max} \le x^{j}_{max} \;\wedge\; y^{j}_{min} \le y^{i}_{min} \wedge y^{i}_{max} \le y^{j}_{max} \;\wedge\; z^{j}_{min} \le z^{i}_{min} \wedge z^{i}_{max} \le z^{j}_{max}.$  (5)

The opposite holds for the relation Sa (surround): for example, if $In(i,j)$ holds, then $Sa(j,i)$ holds.

In addition to computing the static spatial relations between two objects based on the above rules, we also check the touching relation (TNR) between those two objects. This is then used to define several further relations. For example, if one object is above the other object while they touch each other, their static relation becomes To (top):

$To(i,j) \Leftrightarrow Ab(i,j) \wedge T(i,j).$  (6)

There can be more than one static spatial relation between two object cubes. For example, one object can be both to the left of and behind the other object. However, to fill the eSEC matrix elements we need only one relation per object pair. This problem is solved by defining a new notion called the shadow.

Each cube has six surfaces, which we label top, bottom, right, left, front and back according to their positions in our scene coordinate system. Whenever object $i$ is to the left of object $j$, one can project the right surface of object $i$ onto the left surface of object $j$ and consider only the area of the intersection of the two rectangles. This area is represented by the newly defined parameter shadow. Suppose several static relations hold simultaneously and we have calculated the shadow for all relations between objects $i$ and $j$. The relation with the biggest shadow is then selected as the main static relation between the two objects (Fig. 11 illustrates this selection):

$shadow_{R}(i,j) = \text{area of the rectangle intersection obtained for relation } R,$  (7)
$R_{main}(i,j) = \arg\max_{R} \; shadow_{R}(i,j).$  (8)
Figure 11: Selection of one static spatial relation from several possible relations.
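The following is a hedged sketch of this shadow-based disambiguation; the mapping of each relation to a projection plane is an assumption consistent with the description above, and the cube format follows the previous sketches.

```python
def interval_overlap(lo1, hi1, lo2, hi2):
    return max(0.0, min(hi1, hi2) - max(lo1, lo2))

def shadow_area(i, j, relation):
    if relation in ('Left', 'Right'):      # project along x onto the y-z plane
        return (interval_overlap(i['ymin'], i['ymax'], j['ymin'], j['ymax']) *
                interval_overlap(i['zmin'], i['zmax'], j['zmin'], j['zmax']))
    if relation in ('Above', 'Below'):     # project along y onto the x-z plane
        return (interval_overlap(i['xmin'], i['xmax'], j['xmin'], j['xmax']) *
                interval_overlap(i['zmin'], i['zmax'], j['zmin'], j['zmax']))
    if relation in ('Front', 'Back'):      # project along z onto the x-y plane
        return (interval_overlap(i['xmin'], i['xmax'], j['xmin'], j['xmax']) *
                interval_overlap(i['ymin'], i['ymax'], j['ymin'], j['ymax']))
    return 0.0

def main_static_relation(i, j, candidates):
    """Pick the candidate relation with the largest shadow, cf. Eq. (8)."""
    return max(candidates, key=lambda r: shadow_area(i, j, r))
```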

Dynamic spatial relations (DSR) are defined as follows. Let $c^{t}_{i}$ denote the central point of object cube $i$ in frame $t$; we define the two-argument function $d_{t}(i,j)$ measuring the Euclidean distance between the cubes $i$ and $j$ in frame $t$:

$d_{t}(i,j) = \lVert c^{t}_{i} - c^{t}_{j} \rVert.$  (9)

For this we use a time window of $\tau$ frames (image snapshots in VR), corresponding to 0.5 s in our experiments; the threshold $\theta$ is kept at 0.1 m.

In the following we define five conditions, P1 to P5, which are then used to characterize the remaining DSRs.

P1: (10)
P2:
P3:
P4:
P5:

The dynamic relations MT, HT, FMT and S are then defined, based on the five conditions above, in the following way:

(11)
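As an illustration, the following sketch classifies the DSRs from the centre-point history of two cubes, using the 0.5 s window and the 0.1 m threshold from the text; the decision rules (standing in for P1-P5 and Eq. (11)) and the "very far" bound far_bound are assumptions of this sketch, not the authors' exact conditions.

```python
import numpy as np

def dsr(centers_i, centers_j, touching_now, theta=0.1, far_bound=1.0):
    """centers_*: arrays of shape (tau, 3) covering the 0.5 s history."""
    d = np.linalg.norm(centers_i - centers_j, axis=1)        # Eq. (9), per frame
    if touching_now:
        moved_i = np.linalg.norm(centers_i[-1] - centers_i[0]) > theta
        moved_j = np.linalg.norm(centers_j[-1] - centers_j[0]) > theta
        if moved_i and moved_j:
            return 'MT'     # moving together
        if not moved_i and not moved_j:
            return 'HT'     # halting together
        return 'FMT'        # one object fixed, the other moving on/across it
    if d[-1] > far_bound:
        return 'Q'          # objects very far apart
    dist_change = d[-1] - d[0]
    if dist_change < -theta:
        return 'GC'         # getting close
    if dist_change > theta:
        return 'MA'         # moving apart
    return 'S'              # distance change stayed below the threshold
```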

Similarity Measure between eSECs

Suppose $A$ and $B$ are two actions whose eSECs have $n_A$ and $n_B$ columns, respectively.

Instead of writing down one 30-row eSEC each, we can concatenate the corresponding TNR, SSR and DSR entries of each fundamental object pair into a triple and obtain, for each of $A$ and $B$, a 10-row matrix with ternary elements (TNR, SSR, DSR) instead.

Using the elements of both matrices, we define, for each pair of corresponding entries, the differences in the three relation categories by

$d_{TNR} = \begin{cases} 0 & \text{if the two TNR entries are equal} \\ 1 & \text{otherwise,} \end{cases}$

and analogously $d_{SSR}$ and $d_{DSR}$ for the SSR and DSR entries. Then we define the compound difference of the three categories as their average:

$\delta = \tfrac{1}{3}\,(d_{TNR} + d_{SSR} + d_{DSR}).$  (12)

In case one matrix had more columns than the other (i.e., $n_A > n_B$ or vice versa), we repeated the last column of the smaller matrix to match the number of columns of the bigger matrix. This leads to a consistent drop in similarity regardless of which two actions are being compared.

Now we define $D$ as the $10 \times n$ matrix that holds all compound differences between the elements of the two eSECs, where the entry $D_{r,c}$ denotes the dissimilarity of object pair $r$ at time stamp (column) $c$. The total dissimilarity $\bar{D}(A,B)$ between the eSECs of $A$ and $B$ is then obtained as the average across all elements of matrix $D$:

$\bar{D}(A,B) = \frac{1}{10\,n} \sum_{r=1}^{10} \sum_{c=1}^{n} D_{r,c}.$  (13)

Accordingly, the similarity between these eSECs is obtained as:

$S(A,B) = \big(1 - \bar{D}(A,B)\big) \cdot 100\%.$  (14)
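A compact sketch of Eqs. (12)-(14) follows, assuming each eSEC is stored as a list of columns, each column being a list of 10 (TNR, SSR, DSR) triples of relation symbols; this storage format is an assumption of the sketch.

```python
import numpy as np

def esec_similarity(A, B):
    """Return the similarity S(A, B) in percent between two eSEC matrices."""
    n = max(len(A), len(B))
    A = A + [A[-1]] * (n - len(A))        # pad the shorter eSEC by repeating
    B = B + [B[-1]] * (n - len(B))        # its last column (see text above)
    D = np.zeros((10, n))
    for c in range(n):
        for r in range(10):
            mismatches = [a != b for a, b in zip(A[c][r], B[c][r])]
            D[r, c] = sum(mismatches) / 3.0     # compound difference, Eq. (12)
    dissimilarity = D.mean()                    # Eq. (13)
    return (1.0 - dissimilarity) * 100.0        # Eq. (14)
```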

Statistical Data Analysis

Data were analyzed using RStudio (Version 1.2.5001, RStudio Inc.) and SPSS 26 (IBM, New York, United States).

To inspect the presence of learning effects in the human sample, correlations (Spearman's rho) were calculated between the trial number (1-30) per action and predictive power as well as error rate.

To compare human and machine predictive power, first, a repeated-measures ANOVA on human predictive power was calculated with action (1-10) as within-subject factor. Then, human and machine performance was compared for each action separately using one-sample t-tests. As the machine data show no variance, the machine's predictive power value was used as the test value against human performance.

To model human action prediction based on eSEC matrices, we calculated the informational gain based on each eSEC column entry. More specifically, based on the eSEC descriptions of all ten actions, we derived a measure of the amount of information presented in each column (or action step) of each action in comparison to all other actions. Each eSEC column, for a given sub-table (Touching = T, Static = S, Dynamic = D), contains ten coded descriptions of the spatial relations between hand, objects and ground. By stringing the eSEC codes of one column together, each column gets a new single code formally describing the action stage of a sub-table the participant observes at that moment. By taking the frequency of each action step or column-code across all 10 actions, we calculated the likelihood of a specific code in reference to the other actions in this column. So, if all eSEC descriptions are the same for one column, this column-code is assigned a likelihood of 1. If only one action differs from the remaining nine actions, its column-code gets a likelihood of 0.1, whereas the shared column-code of the other nine actions receives a likelihood of 0.9, and so forth. Because not every action has the same number of columns, the lack of an eSEC description is also treated as a possible event. That means, if for example seven out of ten actions have already stopped at one point in time, these seven actions would receive a likelihood of 0.7 for this specific column.

We conducted this likelihood assignment procedure for each of the three types of information (TNR, SSR, DSR) separately. Note that the likelihood also gives an estimate of the information about one action that is presented in a column. If the likelihood of an action code is low, only a few actions, or just this single action, carry this particular code. So, if this code appears, it powerfully constrains action prediction.

Based on the likelihood $p$ of an action step $x$, we then calculated bit values to quantify the (self-)information $I$ according to Shannon [75]:

$I(x) = -\log_{2} p(x).$

This transformation into information has two advantages over calculating with likelihoods. Firstly, it is more intuitive because more information is displayed as a higher value; secondly, we were now able to derive cumulated information by adding up the information values associated with successive columns. The transformation and cumulation were done for all three information types separately. Thus, we obtained information values for each action step and each type of information separately. The additivity of the data also made it possible to combine multiple types of information by simply summing up the columns of the sub-tables.
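A sketch (not the authors' code) of this column-wise likelihood assignment and the resulting cumulative self-information is shown below; the input format and the '<ended>' placeholder for actions that have already stopped are assumptions of the sketch.

```python
import itertools
import math
from collections import Counter

def information_profiles(columns_per_action):
    """columns_per_action: dict mapping each action name to its list of
    column codes (strings) for one sub-table (TNR, SSR or DSR)."""
    n_actions = len(columns_per_action)
    max_len = max(len(cols) for cols in columns_per_action.values())
    info = {a: [] for a in columns_per_action}
    for c in range(max_len):
        # Actions that already ended contribute a shared "no column" event.
        codes = {a: (cols[c] if c < len(cols) else '<ended>')
                 for a, cols in columns_per_action.items()}
        counts = Counter(codes.values())
        for a, code in codes.items():
            p = counts[code] / n_actions          # likelihood of this column code
            info[a].append(-math.log2(p))         # self-information I(x) = -log2 p(x)
    # Cumulative information: running sum over successive columns per action.
    cumulative = {a: list(itertools.accumulate(v)) for a, v in info.items()}
    return info, cumulative
```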

Based on these information values, we modeled human performance. We employed the following models: one based only on TNR, one based only on SSR, one based only on DSR, three models adding two of the three types of information (T+S, T+D, S+D), one model adding all three types of information (T+S+D), and finally one model that ignores the three differing types of information and calculates the self-information based on all eSEC entries independent of the information type (Overall). For each model and for each action separately, a logistic regression was calculated using SPSS 26. Each logistic regression included the absolute amount of information per action step according to the respective model, the accumulated information up to each action step, and the interaction term of these absolute and accumulated predictors. The logistic regressions' dependent binary variable was the presence of a response during the respective action step, indicating whether the action was predicted during this step or not. Since predictors were correlated, models were estimated using the stepwise forward method for variable entry. Note that we did not interpret the coefficients and therefore did not need to regularize the regression model despite the correlation of the coefficients. Model fits were compared using the BIC (Bayesian Information Criterion) [76].
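For illustration only, here is a simplified sketch of one such regression using statsmodels rather than SPSS; it omits the stepwise forward variable entry and assumes that the information values per action step are already available as 1-D arrays.

```python
import numpy as np
import statsmodels.api as sm

def fit_information_model(absolute_info, cumulative_info, responded):
    """responded: binary array, 1 if the action was predicted at this step."""
    X = np.column_stack([absolute_info,
                         cumulative_info,
                         absolute_info * cumulative_info])   # interaction term
    X = sm.add_constant(X)
    result = sm.Logit(responded, X).fit(disp=False)
    return result.bic, result.prsquared    # BIC and McFadden's pseudo-R^2
```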

References

  • [1] Isik, L., Tacchetti, A. & Poggio, T. A fast, invariant representation for human action in the visual system. Journal of Neurophysiology 119, 631–640 (2017).
  • [2] Wurm, M. F. & Schubotz, R. I. Squeezing lemons in the bathroom: contextual information modulates action recognition. Neuroimage 59, 1551–1559 (2012).
  • [3] Caspers, S., Zilles, K., Laird, A. R. & Eickhoff, S. B. ALE meta-analysis of action observation and imitation in the human brain. Neuroimage 50, 1148–1167 (2010).
  • [4] Hardwick, R. M., Caspers, S., Eickhoff, S. B. & Swinnen, S. P. Neural correlates of action: Comparing meta-analyses of imagery, observation, and execution. Neuroscience & Biobehavioral Reviews (2018).
  • [5] Giese, M. A. & Rizzolatti, G. Neural and computational mechanisms of action processing: Interaction between visual and motor representations. Neuron 88, 167–180 (2015).
  • [6] Schubotz, R. I. & von Cramon, D. Y. The case of pretense: Observing actions and inferring goals. Journal of Cognitive Neuroscience 21, 642–653 (2009).
  • [7] Schubotz, R. I., Korb, F. M., Schiffer, A.-M., Stadler, W. & von Cramon, D. Y. The fraction of an action is more than a movement: neural signatures of event segmentation in fMRI. NeuroImage 61, 1195–1205 (2012).
  • [8] Kurby, C. A. & Zacks, J. M. Segmentation in the perception and memory of events. Trends in Cognitive Sciences 12, 72–79 (2008).
  • [9] Newtson, D. Attribution and the unit of perception of ongoing behavior. Journal of Personality and Social Psychology 28, 28 (1973).
  • [10] Newtson, D. & Engquist, G. The perceptual organization of ongoing behavior. Journal of Experimental Social Psychology 12, 436–450 (1976).
  • [11] Schubotz, R. I., Wurm, M. F., Wittmann, M. K. & von Cramon, D. Y. Objects tell us what action we can expect: dissociating brain areas for retrieval and exploitation of action knowledge during action observation in fMRI. Frontiers in Psychology 5, 636 (2014).
  • [12] Bach, P., Nicholson, T. & Hudson, M. The affordance-matching hypothesis: how objects guide action understanding and prediction. Frontiers in Human Neuroscience 8, 254 (2014).
  • [13] Nicholson, T., Roser, M. & Bach, P. Understanding the goals of everyday instrumental actions is primarily linked to object, not motor-kinematic, information: evidence from fmri. PLOS One 12, e0169700 (2017).
  • [14] Kong, Y. & Fu, Y. Human action recognition and prediction: A survey. arXiv preprint arXiv:1806.11230 (2018).
  • [15] Rautaray, S. S. & Agrawal, A. Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review 43, 1–54 (2015).
  • [16] Gerónimo, D. & Kjellström, H. Unsupervised surveillance video retrieval based on human action and appearance. In 2014 22nd International Conference on Pattern Recognition, 4630–4635 (IEEE, 2014).
  • [17] Poleg, Y., Ephrat, A., Peleg, S. & Arora, C. Compact CNN for indexing egocentric videos. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 1–9 (IEEE, 2016).
  • [18] Zhang, L., Jiang, M., Farid, D. & Hossain, M. A. Intelligent facial emotion recognition and semantic-based topic detection for a humanoid robot. Expert Systems with Applications 40, 5160–5168 (2013).
  • [19] Ramos, C., Augusto, J. C. & Shapiro, D. Ambient intelligence — the next step for artificial intelligence. IEEE Intelligent Systems 23, 15–18 (2008).
  • [20] Aggarwal, J. K. & Cai, Q. Human motion analysis: A review. Computer Vision and Image Understanding 73, 428–440 (1999).
  • [21] Bobick, A. & Davis, J. The representation and recognition of human movement using temporal templates. In IEEE Conference on Computer Vision and Pattern Recognition, 928–934 (1997).
  • [22] Moeslund, T. B. & Granum, E. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding 81, 231–268 (2001).
  • [23] Wallraven, C., Caputo, B. & Graf, A. Recognition with local features: the kernel recipe. In Proceedings Ninth IEEE International Conference on Computer Vision, vol. 1, 257–264 (2003).
  • [24] Lan, T., Sigal, L. & Mori, G. Social roles in hierarchical models for human activity recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 1354–1361 (IEEE, 2012).
  • [25] Laptev, I., Marszalek, M., Schmid, C. & Rozenfeld, B. Learning realistic human actions from movies. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, 1–8 (IEEE, 2008).
  • [26] Patron-Perez, A., Marszalek, M., Reid, I. & Zisserman, A. Structured learning of human interactions in TV shows. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 2441–2453 (2012).
  • [27] Yao, B. et al. Human action recognition by learning bases of action attributes and parts. In 2011 International Conference on Computer Vision, 1331–1338 (IEEE, 2011).
  • [28] Jhuang, H., Gall, J., Zuffi, S., Schmid, C. & Black, M. J. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, 3192–3199 (2013).
  • [29] Simo-Serra, E., Ramisa, A., Alenyà, G., Torras, C. & Moreno-Noguer, F. Single image 3D human pose estimation from noisy observations. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2673–2680 (IEEE, 2012).
  • [30] Laptev, I. On space-time interest points. International Journal of Computer Vision 64, 107–123 (2005).
  • [31] Raptis, M. & Sigal, L. Poselet key-framing: A model for human activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2650–2657 (2013).
  • [32] Ji, S., Xu, W., Yang, M. & Yu, K. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 221–231 (2012).
  • [33] Carreira, J. & Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308 (2017).
  • [34] Locke, E. A. & Latham, G. P. A theory of goal setting & task performance. (Prentice-Hall, Inc, 1990).
  • [35] Yang, Y., Fermüller, C. & Aloimonos, Y. Detection of manipulation action consequences (MAC). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2563–2570 (2013).
  • [36] Bulling, A., Blanke, U. & Schiele, B. A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR) 46, 33 (2014).
  • [37] Aksoy, E. E. et al. Learning the semantics of object–action relations by observation. The International Journal of Robotics Research 30, 1229–1249 (2011).
  • [38] Ziaeetabar, F., Aksoy, E. E., Wörgötter, F. & Tamosiunaite, M. Semantic analysis of manipulation actions using spatial relations. In 2017 IEEE International Conference on Robotics and Automation (ICRA), 4612–4619 (IEEE, 2017).
  • [39] Fermüller, C. et al. Prediction of manipulation actions. International Journal of Computer Vision 126, 358–374 (2018).
  • [40] Ramirez-Amaro, K., Yang, Y. & Cheng, G. A survey on semantic-based methods for the understanding of human movements. Robotics and Autonomous Systems 119, 31–50 (2019).
  • [41] Dinerstein, J., Ventura, D. & Egbert, P. K. Fast and robust incremental action prediction for interactive agents. Computational Intelligence 21, 90–110 (2005).
  • [42] Kong, Y., Tao, Z. & Fu, Y. Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1473–1481 (2017).
  • [43] Ryoo, M. S. Human activity prediction: Early recognition of ongoing activities from streaming videos. In 2011 International Conference on Computer Vision, 1036–1043 (IEEE, 2011).
  • [44] Cao, Y. et al. Recognize human activities from partially observed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2658–2665 (2013).
  • [45] Pei, M., Jia, Y. & Zhu, S.-C. Parsing video events with goal inference and intent prediction. In 2011 International Conference on Computer Vision, 487–494 (IEEE, 2011).
  • [46] Li, K., Hu, J. & Fu, Y. Modeling complex temporal composition of actionlets for activity prediction. In 2011 International Conference on Computer Vision, 487–494 (IEEE, 2011).
  • [47] Zhou, B., Wang, X. & Tang, X. Random field topic model for semantic region analysis in crowded scenes from tracklets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3441–3448 (IEEE, 2011).
  • [48] Morris, B. T. & Trivedi, M. M. Trajectory learning for activity understanding: Unsupervised, multilevel, and long-term adaptive approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 2287–2301 (2011).
  • [49] Tanke, J. & Gall, J. Human motion anticipation with symbolic label. arXiv preprint arXiv:1912.06079 (2019).
  • [50] Elmezain, M., Al-Hamadi, A. & Michaelis, B. Hand gesture recognition based on combined features extraction. World Academy of Science, Engineering and Technology 60, 395 (2009).
  • [51] Elmezain, M., Al-Hamadi, A. & Michaelis, B. Hand trajectory-based gesture spotting and recognition using HMM. In 2009 16th IEEE International Conference on Image Processing (ICIP), 3577–3580 (IEEE, 2009).
  • [52] Ziaeetabar, F., Kulvicius, T., Tamosiunaite, M. & Wörgötter, F. Recognition and prediction of manipulation actions using enriched semantic event chains. Robotics and Autonomous Systems 110, 173–188 (2018).
  • [53] Pastra, K. & Aloimonos, Y. The minimalist grammar of action. Philosophical Transactions of the Royal Society B: Biological Sciences 367, 103–117 (2012).
  • [54] Yang, Y., Guha, A., Fermüller, C. & Aloimonos, Y. A cognitive system for understanding human manipulation actions. Advances in Cognitive Systems 3, 67–86 (2014).
  • [55] Summers-Stay, D., Teo, C. L., Yang, Y., Fermüller, C. & Aloimonos, Y. Using a minimal action grammar for activity understanding in the real world. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 4104–4111 (IEEE, 2012).
  • [56] Wörgötter, F. et al. Humans predict action using grammar-like structures. Scientific Reports 10, 1–11 (2020).
  • [57] Andriella, A., Torras, C. & Alenyà, G. Cognitive system framework for brain-training exercise based on human-robot interaction. Cognitive Computation 1–18 (2020).
  • [58] Aksoy, E. E., Orhan, A. & Wörgötter, F. Semantic decomposition and recognition of long and complex manipulation action sequences. International Journal of Computer Vision 122, 84–115 (2017).
  • [59] Ruddle, R. A., Savage, J. C. & Jones, D. M. Symmetric and asymmetric action integration during cooperative object manipulation in virtual environments. ACM Transactions on Computer-Human Interaction (TOCHI) 9, 285–308 (2002).
  • [60] Gupta, A. & Davis, L. S. Objects in action: An approach for combining action understanding and object perception. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, 1–8 (IEEE, 2007).
  • [61] Hrkać, M., Wurm, M. F., Kühn, A. B. & Schubotz, R. I. Objects mediate goal integration in ventrolateral prefrontal cortex during action observation. PLOS One 10, e0134316 (2015).
  • [62] El-Sourani, N., Wurm, M. F., Trempler, I., Fink, G. R. & Schubotz, R. I. Making sense of objects lying around: How contextual objects shape brain activity during action observation. NeuroImage 167, 429–437 (2018).
  • [63] El-Sourani, N., Trempler, I., Wurm, M. F., Fink, G. R. & Schubotz, R. I. Predictive impact of contextual objects during action observation: Evidence from fMRI. Journal of Cognitive Neuroscience 32, 326–337 (2019).
  • [64] Shapovalova, N., Gong, W., Pedersoli, M., Roca, F. X. & Gonzalez, J. On importance of interactions and context in human action recognition. In Iberian Conference on Pattern Recognition and Image Analysis, 58–66 (Springer, 2011).
  • [65] Zheng, Y., Zhang, Y.-J., Li, X. & Liu, B.-D. Action recognition in still images using a combination of human pose and context information. In 2012 19th IEEE International Conference on Image Processing, 785–788 (IEEE, 2012).
  • [66] Wurm, M. F., Artemenko, C., Giuliani, D. & Schubotz, R. I. Action at its place: Contextual settings enhance action recognition in 4- to 8-year-old children. Developmental Psychology 53, 662–670 (2017).
  • [67] Wurm, M. F., von Cramon, D. Y. & Schubotz, R. I. The context-object-manipulation triad: Cross talk during action perception revealed by fMRI. Journal of Cognitive Neuroscience 24, 1548–1559 (2012).
  • [68] Ingram, J. N., Howard, I. S., Flanagan, J. R. & Wolpert, D. M. Multiple grasp-specific representations of tool dynamics mediate skillful manipulation. Current Biology 20, 618–623 (2010).
  • [69] Jeannerod, M., Arbib, M., Rizzolatti, G. & Sakata, H. Grasping objects: the cortical mechanisms. Trends in Neurosciences 18, 314–320 (1995).
  • [70] Heumer, G., Amor, H. B. & Jung, B. Grasp recognition for uncalibrated data gloves: A machine learning approach. Presence: Teleoperators and Virtual Environments 17, 121–142 (2008).
  • [71] Land, M. F. Vision, eye movements, and natural behavior. Visual Neuroscience 26, 51–62 (2009).
  • [72] Fathi, A., Li, Y. & Rehg, J. M. Learning to recognize daily actions using gaze. In European Conference on Computer Vision, 314–327 (Springer, 2012).
  • [73] Aein, M. J., Aksoy, E. E. & Wörgötter, F. Library of actions: Implementing a generic robot execution framework by using manipulation action semantics. The International Journal of Robotics Research 38, 910–934 (2019).
  • [74] Jiménez, P., Thomas, F. & Torras, C. 3D collision detection: a survey. Computers & Graphics 25, 269–285 (2001).
  • [75] Shannon, C. E. A mathematical theory of communication. Bell System Technical Journal 27, 379–423 and 623–656 (1948).
  • [76] Schwarz, G. E. Estimating the dimension of a model. Annals of Statistics 6, 461–464 (1978).