Human Action Recognition and Assessment via Deep Neural Network Self-Organization

01/04/2020 ∙ by German I. Parisi, et al. ∙ University of Hamburg

The robust recognition and assessment of human actions are crucial in human-robot interaction (HRI) domains. While state-of-the-art models of action perception show remarkable results in large-scale action datasets, they mostly lack the flexibility, robustness, and scalability needed to operate in natural HRI scenarios which require the continuous acquisition of sensory information as well as the classification or assessment of human body patterns in real time. In this chapter, I introduce a set of hierarchical models for the learning and recognition of actions from depth maps and RGB images through the use of neural network self-organization. A particularity of these models is the use of growing self-organizing networks that quickly adapt to non-stationary distributions and implement dedicated mechanisms for continual learning from temporally correlated input.




1 Introduction

Artificial systems for human action recognition from videos have been extensively studied in the literature, with a large variety of machine learning models and benchmark datasets [Poppe, 2010, Guo et al., 2016]. The robust learning and recognition of human actions are crucial in human-robot interaction (HRI) scenarios where, for instance, robots are required to efficiently process rich streams of visual input with the goal of undertaking assistive actions in a residential context (Fig. 1).

Deep learning architectures such as convolutional neural networks (CNNs) have been shown to recognize actions from videos with high accuracy through the use of hierarchies that functionally resemble the organization of earlier areas of the visual cortex (see [Guo et al., 2016] for a survey). However, the majority of these models are computationally expensive to train and lack the flexibility and robustness to operate in the above-described HRI scenarios. A popular stream of vision research has focused on the use of depth sensing devices such as the Microsoft Kinect and ASUS Xtion Live for human action recognition in HRI applications using depth information instead of, or in combination with, RGB images. Post-processed depth map sequences provide real-time estimations of 3D human motion in cluttered environments with increased robustness to varying illumination conditions and a reduced computational cost for motion segmentation and pose estimation (see Han et al. [2013] for a survey). However, learning models using low-dimensional 3D information (e.g. 3D skeleton joints) have often failed to show robust performance in real-world environments since this type of input can be particularly noisy and susceptible to self-occlusion.

In this chapter, I introduce a set of neural network models for the efficient learning and classification of human actions from depth information and RGB images. These models use different variants of growing self-organizing networks for the learning of action sequences and real-time inference. In Section 2, I summarize the fundamentals of neural network self-organization with focus on a particular type of growing network, the Grow When Required (GWR) model, that can grow and remove neurons in response to a time-varying input distribution, and the Gamma-GWR which extends the GWR with temporal context for the efficient learning of visual representations from temporally correlated input. Hierarchical arrangements of such networks, which I describe in Section 3, can be used for efficiently processing body pose and motion features and learning a set of training actions.

Figure 1: Person tracking and action recognition with a depth sensor on a humanoid robot in a domestic environment [Parisi et al., 2016c].

Understanding people’s emotions plays a central role in human social interaction and behavior Picard [1997]. Perception systems making use of affective information can significantly improve the overall HRI experience, for instance, by triggering pro-active robot behavior as a response to the user’s emotional state. An increasing corpus of research has been conducted in the recognition of affective states, e.g., through the processing of facial expressions Alonso-Martin et al. [2013], speech detection Nwe et al. [2003] and the combination of these multimodal cues Barros and Wermter [2016]. While facial expressions can easily convey emotional states, it is often the case in HRI scenarios that a person is not facing the sensor or is standing far away from the camera, resulting in insufficient spatial resolution to extract facial features. The recognition of emotions from body motion, instead, has received less attention in the literature but has great value in HRI domains. The main reason for this lack of attention is that affective information is seen as harder to extract from complex full-body expressions than from facial expressions and speech. In Section 3.2, I introduce a self-organizing neural architecture for emotion recognition from 3D body motion patterns.

In addition to recognizing short-term behavior such as domestic daily actions and dynamic emotional states, it is of interest to learn the user’s behavior over longer periods of time [Vettier and Garbay, 2014]. The collected data can be used to perform longer-term gait assessment as an important indicator for a variety of health problems, e.g., physical diseases and neurological disorders such as Parkinson’s disease [Aerts et al., 2012]. The analysis and assessment of body motion have recently attracted significant interest in the healthcare community with many application areas such as physical rehabilitation, diagnosis of pathologies, and assessment of sports performance. The correctness of postural transitions is fundamental during the execution of well-defined physical exercises since inaccurate movements may not only significantly reduce the overall efficiency of the movement but also increase the risk of injury [Kachouie et al., 2014]. As an example, in the healthcare domain, the correct execution of physical rehabilitation routines is crucial for patients to improve their health condition [Velloso et al., 2013]. Similarly, in weight-lifting training, correct postures improve the mechanical efficiency of the body and lead the athlete to achieve better results across training sessions. In Section 4, I introduce a self-organizing neural architecture for learning body motion sequences comprising weight-lifting exercises and assessing their correctness in real time.

State-of-the-art models of action recognition have mostly proposed the learning of a static batch of body patterns [Guo et al., 2016]. However, systems and robots operating in real-world settings are required to acquire and fine-tune internal representations and behavior in a continual learning fashion. Continual learning refers to the ability of a system to seamlessly learn from continuous streams of information while preventing catastrophic forgetting, i.e., a condition in which new incoming information strongly interferes with previously learned representations Mermillod et al. [2013], Parisi et al. [2019]. Continual machine learning research has mainly focused on the recognition of static image patterns, whereas the processing of more complex stimuli such as dynamic body motion patterns has been overlooked. In particular, the majority of these models address supervised continual learning on static image datasets such as MNIST LeCun et al. [1998] and CIFAR-10 Krizhevsky [2009] and have not reported results on video sequences. In Section 5, I introduce the use of deep neural network self-organization for the continual learning of human actions from RGB video sequences. Reported results show that deep self-organization can mitigate catastrophic forgetting while showing competitive performance with state-of-the-art batch learning models.

Despite significant advances in artificial vision, learning models are still far from providing the flexibility, robustness, and scalability exhibited by biological systems. In particular, current models of action recognition are designed for and evaluated on highly controlled experimental conditions, whereas systems and robots in HRI scenarios are exposed to continuous streams of (often noisy) sensory information. In Section 6, I discuss a number of challenges and directions for future research.

2 Neural Network Self-Organization

2.1 Background

Input-driven self-organization is a crucial component of cortical processing which shapes topographic maps based on visual experience [Willshaw and von der Malsburg, 1976, Nelson, 2000]. Different artificial models of input-driven self-organization have been proposed to resemble the basic dynamics of Hebbian learning and structural plasticity [Hebb, 1949], with neural map organization resulting from unsupervised statistical learning. The goal of self-organizing learning is to cause different parts of a network to respond similarly to certain input samples starting from an initially unorganized state. Typically, during the training phase these networks build a map through a competitive process, also referred to as vector quantization, so that a set of neurons represent prototype vectors encoding a submanifold in the input space. Throughout this process, the network learns significant topological relations of the input without supervision.

A well-established model is the self-organizing map (SOM) [Kohonen, 1991] in which the number of prototype vectors (or neurons) that can be trained is pre-defined. However, empirically selecting a convenient number of neurons can be tedious, especially when dealing with non-stationary, temporally-correlated input distributions [Strickert and Hammer, 2005]. To alleviate this issue, a number of growing models have been proposed that dynamically allocate or remove neurons in response to sensory experience. An example is the Grow When Required (GWR) network Marsland et al. [2002] which grows or shrinks to better match the input distribution. The GWR adds new neurons whenever the current input is not sufficiently matched by the existing neurons, whereas other popular models, e.g. Growing Neural Gas (GNG) [Fritzke, 1995], add neurons only at fixed, pre-defined intervals. Because of their ability to allocate novel trainable resources, GWR-like models have the advantage of mitigating the disruptive interference of existing internal representations when learning from novel sensory observations.

2.2 Grow When Required (GWR) Networks

The GWR Marsland et al. [2002] is a growing self-organizing network that learns the prototype neural weights from a multi-dimensional input distribution. It consists of a set of neurons with their associated weight vectors, and edges that create links between neurons. For a given input vector $x(t) \in \mathbb{R}^n$, its best-matching neuron or unit (BMU) in the network, $b$, is computed as the index of the neural weight that minimizes the distance to the input:

$$b = \arg\min_{j \in A} \| x(t) - w_j \|, \tag{1}$$

where $A$ is the set of neurons and $\| \cdot \|$ denotes the Euclidean distance.
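The BMU search described above can be sketched in a few lines. This is a minimal illustration and not the author's implementation; the function names and the toy weight vectors are assumptions made for the example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def best_matching_unit(x, weights):
    """Return the index of the weight vector closest to the input x."""
    return min(range(len(weights)), key=lambda j: euclidean(x, weights[j]))

# Hypothetical toy network with three 2-D prototype vectors.
weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
bmu = best_matching_unit([0.9, 1.2], weights)
```

The search is a single pass over all neurons, so its cost grows linearly with the size of the network.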

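The network starts with two randomly initialized neurons. Each neuron $j$ is equipped with a habituation counter $h_j \in [0, 1]$ that considers the number of times that the neuron has fired. Newly created neurons start with $h_j = 1$, which is iteratively decreased towards 0 according to the habituation rule

$$\Delta h_i = \tau_i \cdot \kappa \cdot (1 - h_i) - \tau_i, \tag{2}$$

where $i \in \{b, n\}$ indexes the BMU $b$ and its neighbors $n$, and $\tau_i$ and $\kappa$ are constants that control the monotonically decreasing behavior. Typically, $h_b$ is habituated faster than $h_n$ by setting $\tau_b > \tau_n$.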
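To make the habituation dynamics concrete, the following sketch iterates a counter for a BMU and for a neighbor. It assumes the simplified rule $\Delta h = \tau \cdot \kappa \cdot (1 - h) - \tau$; the values $\tau_b = 0.3$, $\tau_n = 0.1$, and $\kappa = 1.05$ are illustrative assumptions, not settings reported in this chapter.

```python
def habituate(h, tau, kappa=1.05):
    """One habituation step (assumed form): h decays from 1 toward 1 - 1/kappa."""
    return h + tau * kappa * (1.0 - h) - tau

h_b, h_n = 1.0, 1.0  # newly created neurons start unhabituated
for _ in range(10):
    h_b = habituate(h_b, tau=0.3)  # BMU: larger tau, faster decay
    h_n = habituate(h_n, tau=0.1)  # neighbor: smaller tau, slower decay
```

Under these assumed constants both counters decrease monotonically and level off slightly above 0, with the BMU counter dropping faster than the neighbor's.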

A new neuron is added if the activity of the network, computed as $a(t) = \exp(-\| x(t) - w_b \|)$ with $w_b$ the weight vector of the BMU, is smaller than a given activation threshold $a_T$ and if the habituation counter $h_b$ of the BMU is smaller than a given threshold $h_T$. The new neuron is created half-way between the BMU and the input. This mechanism leads to creating neurons only after the existing ones have been sufficiently trained.

At each iteration, the neural weights are updated according to:

$$\Delta w_i = \epsilon_i \cdot h_i \cdot (x(t) - w_i), \tag{3}$$

where $\epsilon_i$ is a constant learning rate ($\epsilon_n < \epsilon_b$) and the index $i$ indicates the BMU $b$ and its topological neighbors $n$. Connections between neurons are updated on the basis of neural co-activation, i.e. when two neurons fire together, a connection between them is created if it does not exist.
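The insertion criterion and the weight update can be combined into one simplified training step. The sketch below is a hedged illustration only: it assumes the activity is the exponential of the negative BMU distance, uses hypothetical thresholds (`a_T`, `h_T`) and learning rate (`eps_b`), and omits edges, neighbor updates, habituation updates, and neuron removal.

```python
import math

def gwr_step(x, W, H, a_T=0.85, h_T=0.1, eps_b=0.1):
    """One simplified GWR step on weights W and habituation counters H.

    Returns the index of the BMU. W and H are modified in place.
    """
    dists = [math.dist(x, w) for w in W]
    b = min(range(len(W)), key=dists.__getitem__)
    activity = math.exp(-dists[b])
    if activity < a_T and H[b] < h_T:
        # Poor match by an already well-trained BMU: insert a neuron half-way.
        W.append([(wi + xi) / 2.0 for wi, xi in zip(W[b], x)])
        H.append(1.0)
    else:
        # Otherwise move the BMU toward the input, scaled by its habituation.
        W[b] = [wi + eps_b * H[b] * (xi - wi) for wi, xi in zip(W[b], x)]
    return b

W = [[0.0, 0.0], [1.0, 1.0]]
H = [0.05, 1.0]  # neuron 0 is already strongly habituated
b = gwr_step([0.0, -2.0], W, H)
```

In this toy run the input is far from a habituated BMU, so a third neuron is inserted midway between the BMU and the input rather than the BMU being dragged toward it.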

While the mechanisms for creating new neurons and connections in the GWR do not resemble biologically plausible mechanisms of neurogenesis (e.g., Eriksson et al. [1998], Ming and Song [2011], Knoblauch [2017]), the GWR learning algorithm represents an efficient model that incrementally adapts to non-stationary input. A comparison between GNG and GWR learning in terms of the number of neurons, quantization error (average discrepancy between the input and its BMU), and parameters modulating network growth (average network activation and habituation rate) is shown in Fig. 2. This learning behavior is particularly convenient for incremental learning scenarios since neurons are created to promptly distribute in the input space, thereby allowing a faster convergence through iterative fine-tuning of the topological map. The neural update rate decreases as the neurons become more habituated, which prevents noisy input from interfering with consolidated neural representations.

Figure 2: Comparison of GNG and GWR training: (a) number of neurons, (b) quantization error, and (c) GWR average activation and habituation counter through 30 training epochs on the Iris dataset (150 four-dimensional samples) [Parisi et al., 2017].

2.3 Gamma-GWR

The GWR model does not account for the learning of latent temporal structure. For this purpose, the Gamma-GWR Parisi et al. [2017] extends the GWR with temporal context. Each neuron consists of a weight vector $w_j$ and a number $K$ of context descriptors $c_{j,k}$ ($w_j, c_{j,k} \in \mathbb{R}^n$).

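Given the input $x(t) \in \mathbb{R}^n$, the index of the BMU, $b$, is computed as:

$$b = \arg\min_{j \in A} (d_j), \tag{4}$$

$$d_j = \alpha_0 \| x(t) - w_j \|^2 + \sum_{k=1}^{K} \alpha_k \| C_k(t) - c_{j,k} \|^2, \tag{5}$$

$$C_k(t) = \beta \cdot w_b^{t-1} + (1 - \beta) \cdot c_{b,k-1}^{t-1}, \tag{6}$$

where $\| \cdot \|$ denotes the Euclidean distance, $\alpha_i$ and $\beta$ are constant values that modulate the influence of the temporal context, $w_b^{t-1}$ is the weight vector of the BMU at $t-1$, and $C_k \in \mathbb{R}^n$ is the global context of the network with $C_k(t_0) = 0$. If $K = 0$, then Eq. 5 resembles the learning dynamics of the standard GWR without temporal context. For a given input $x(t)$, the activity of the network, $a(t)$, is defined in relation to the distance between the input and its BMU (Eq. 4) as follows:

$$a(t) = \exp(-d_b), \tag{7}$$

thus yielding the highest activation value of 1 when the network can perfectly match the input sequence ($d_b = 0$).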
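A context-aware BMU search can be sketched as follows. The code assumes a distance of the form "weighted squared input distance plus weighted squared context distances" and an exponential activity; the data layout (`W`, `Ctx`, `global_ctx`) and the weighting constants are hypothetical choices for illustration.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def gamma_bmu(x, global_ctx, W, Ctx, alphas):
    """Return (index, distance) of the BMU under a context-aware distance.

    x: input vector; global_ctx: list of K global context vectors;
    W[j]: weight of neuron j; Ctx[j][k]: k-th context descriptor of neuron j;
    alphas: K + 1 constants, alphas[0] weights the input term.
    """
    def distance(j):
        context_term = sum(
            alphas[k + 1] * sq_dist(global_ctx[k], Ctx[j][k])
            for k in range(len(global_ctx))
        )
        return alphas[0] * sq_dist(x, W[j]) + context_term

    b = min(range(len(W)), key=distance)
    return b, distance(b)

# Two neurons with identical weights but different temporal context (K = 1).
W = [[0.0, 0.0], [0.0, 0.0]]
Ctx = [[[1.0, 0.0]], [[0.0, 1.0]]]
b, d_b = gamma_bmu([0.0, 0.0], [[0.0, 1.0]], W, Ctx, alphas=[0.5, 0.5])
activity = math.exp(-d_b)
```

The toy case shows why context matters: both neurons match the current input equally well, and only the context term selects the neuron whose stored history agrees with the recent past.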

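The training of the existing neurons is carried out by adapting the BMU $b$ and its neighboring neurons $n$:

$$\Delta w_i = \epsilon_i \cdot h_i \cdot (x(t) - w_i), \tag{8}$$

$$\Delta c_{i,k} = \epsilon_i \cdot h_i \cdot (C_k(t) - c_{i,k}), \tag{9}$$

where $i \in \{b, n\}$ and $\epsilon_i$ is a constant learning rate ($\epsilon_n < \epsilon_b$). The habituation counters $h_i$ are updated according to Eq. 2.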
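The adaptation step can be illustrated with a small helper that moves both the weight and the context descriptors of a neuron toward their targets. This is a sketch under the assumption that each update is proportional to the learning rate times the habituation counter; the names and numeric values are hypothetical.

```python
def adapt(vec, target, eps, h):
    """Move vec toward target by a fraction eps * h."""
    return [v + eps * h * (t - v) for v, t in zip(vec, target)]

def update_neuron(w, ctx, x, global_ctx, eps, h):
    """Adapt a neuron's weight toward x and each context toward the global context."""
    new_w = adapt(w, x, eps, h)
    new_ctx = [adapt(c, g, eps, h) for c, g in zip(ctx, global_ctx)]
    return new_w, new_ctx

x, global_ctx = [1.0, 1.0], [[2.0, 2.0]]
# A fresh neuron (h = 1) adapts strongly; a habituated one (h = 0.1) barely moves.
w_new, ctx_new = update_neuron([0.0, 0.0], [[0.0, 0.0]], x, global_ctx, eps=0.5, h=1.0)
w_hab, ctx_hab = update_neuron([0.0, 0.0], [[0.0, 0.0]], x, global_ctx, eps=0.5, h=0.1)
```

Scaling the step by the habituation counter is what protects well-trained prototypes from being overwritten by noisy samples.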

Empirical studies with large-scale datasets have shown that Gamma-GWR networks with additive neurogenesis show a better performance than a static network with the same number of neurons, thereby providing insights into the design of neural architectures in incremental learning scenarios when the total number of neurons is fixed Parisi et al. [2018a].

3 Human Action Recognition

3.1 Self-Organizing Integration of Pose-Motion Cues

Human action perception in the brain is supported by a highly adaptive system with separate neural pathways for the distinct processing of body pose and motion features at multiple levels and their subsequent integration in higher areas [Ungerleider and Mishkin, 1982, Felleman and Van Essen, 1991]. The ventral pathway recognizes sequences of body form snapshots, while the dorsal pathway recognizes optic-flow patterns. Both pathways comprise hierarchies that extrapolate visual features with increasing complexity of representation [Taylor et al., 2015, Hasson et al., 2008, Lerner et al., 2011]. It has been shown that while early visual areas such as the primary visual cortex (V1) and the motion-sensitive area (MT+) yield higher responses to instantaneous sensory input, high-level areas such as the superior temporal sulcus (STS) are more affected by information accumulated over longer timescales Hasson et al. [2008]. Neurons in higher levels of the hierarchy are also characterized by gradual invariance to the position and the scale of the stimulus [Orban et al., 1982]. Hierarchical aggregation is a crucial organizational principle of cortical processing for dealing with perceptual and cognitive processes that unfold over time [Fonlupt, 2003]. With the use of extended models of neural network self-organization, it is possible to obtain progressively generalized representations of sensory inputs and learn inherent spatiotemporal dependencies of input sequences.

Figure 3: GWR-based architecture for pose-motion integration and action classification: 1) hierarchical processing of pose-motion features in parallel; 2) integration of neuron trajectories in the joint pose-motion feature space [Parisi et al., 2015b].

In Parisi et al. [2015b], we proposed a learning architecture consisting of a two-stream hierarchy of GWR networks that processes extracted pose and motion features in parallel and subsequently integrates neuronal activation trajectories from both streams. This integration network functionally resembles the response of STS model neurons encoding sequence-selective prototypes of action segments in the joint pose-motion domain. An overview of the architecture is depicted in Fig. 3. The hierarchical arrangement of the networks yields progressively specialized neurons encoding latent spatiotemporal dynamics of the input. We process the visual input under the assumption that action recognition is selective for temporal order [Giese and Poggio, 2003, Hasson et al., 2008]. Therefore, the recognition of an action occurs only when neural trajectories are activated in the correct temporal order with respect to the learned action template.

Following the notation in Fig. 3, $G^P_1$ and $G^M_1$ are trained with pose and motion features respectively. After this step, we train $G^P_2$ and $G^M_2$ with concatenated trajectories of neural activations in the previous network layer. The STS stage integrates pose-motion features by training $G^{STS}$ with the concatenation of vectors from $G^P_2$ and $G^M_2$ in the pose-motion feature space. After the training of $G^{STS}$ is completed, each neuron will encode a sequence-selective prototype action segment, thereby integrating changes in the configuration of a person’s body pose over time. For the classification of actions, we extended the standard implementations of unsupervised GNG and GWR learning with two labeling functions for associating symbolic labels to visual representations learned in an unsupervised fashion.

Figure 4: Snapshots of actions from the KT action dataset visualized as raw depth images, segmented body, skeleton, and body centroids.
Figure 5: Full-body action representations: (a) three centroids with two body slopes, and (b) comparison of body centroids (top) and noisy skeletons (bottom).

We evaluated our approach both on our Knowledge Technology (KT) full-body action dataset Parisi et al. [2014c] and the public action benchmark CAD-60 [Sung et al., 2012]. The KT dataset is composed of 10 full-body actions performed by 13 subjects with a normal physical condition. The dataset contains the following actions: standing, walking, jogging, picking up, sitting, jumping, falling down, lying down, crawling, and standing up. Videos were captured in a home-like environment with a Kinect sensor installed meters above the ground. Depth maps were sampled with a VGA resolution of 640×480 and an operation range from to meters at frames per second. From the raw depth map sequences, 3D body joints were estimated on the basis of the tracking skeleton model provided by the OpenNI SDK. Snapshots of full-body actions are shown in Fig. 4 as raw depth images, segmented body silhouettes, skeletons, and body centroids. We proposed a simplified skeleton model consisting of three centroids and two body slopes. The centroids were estimated as the centers of mass that follow the distribution of the main body masses on each posture. As can be seen in Fig. 5, three centroids are sufficient to represent prominent posture characteristics while maintaining a low-dimensional feature space. Such a low-dimensional representation increases tracking robustness for situations of partial occlusion with respect to a skeleton model comprising a larger number of body joints. Our experiments showed that a GWR-based approach outperforms the same type of architecture using GNG networks with an average accuracy rate of 94% (5% higher than GNG-based).

The Cornell activity dataset CAD-60 [Sung et al., 2012] is composed of 60 RGB-D videos of four subjects (two males, two females, one left-handed) performing 12 activities: rinsing mouth, brushing teeth, wearing contact lens, talking on the phone, drinking water, opening pill container, cooking (chopping), cooking (stirring), talking on couch, relaxing on couch, writing on whiteboard, working on computer. The activities were performed in 5 different environments: office, kitchen, bedroom, bathroom, and living room. The videos were collected with a Kinect sensor with distance ranges from 1.2 m to 3.5 m and a depth resolution of 640×480 at 15 frames per second. The dataset provides raw depth maps and RGB images, and skeleton data. For our approach, we used the set of 3D positions without the feet, leading to 13 joints (i.e., 39 input dimensions). Instead of using world coordinates, we encoded the joint positions using the center of the hips as the frame of reference to obtain translation invariance. We computed joint motion as the difference of two consecutive frames for each pose transition.
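The pose and motion encoding described above can be sketched as follows. The example assumes that each frame is a list of 13 joints given as 3D world coordinates and that the hip center is available as a separate 3D point; both the data layout and the synthetic frames are assumptions for illustration.

```python
def hip_centered_pose(joints, hip_center):
    """Flatten joints into one vector expressed relative to the hip center."""
    return [c - h for joint in joints for c, h in zip(joint, hip_center)]

def joint_motion(prev_pose, pose):
    """Motion feature: element-wise difference of two consecutive poses."""
    return [b - a for a, b in zip(prev_pose, pose)]

# Synthetic frame with 13 joints, then the same posture shifted by 1 m along x.
frame_a = [[float(j), 1.0, 2.0] for j in range(13)]
hip_a = [0.5, 1.0, 2.0]
frame_b = [[x + 1.0, y, z] for x, y, z in frame_a]
hip_b = [hip_a[0] + 1.0, hip_a[1], hip_a[2]]

pose_a = hip_centered_pose(frame_a, hip_a)
pose_b = hip_centered_pose(frame_b, hip_b)
motion = joint_motion(pose_a, pose_b)
```

Because both poses are expressed relative to the hips, translating the whole body leaves the pose vector unchanged and yields zero motion, which is the translation invariance the encoding is meant to provide.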

For our evaluation on the CAD-60, we adopted a similar scheme as the one reported by Sung et al. [2012] using all the 12 activities plus a random action with the new person strategy, i.e. the first 3 subjects for training and the remaining one for testing. We obtained 91.9% precision, 90.2% recall, and 91% F-score. The reported best state-of-the-art result is 93.8% precision, 94.5% recall, and 94.1% F-score [Shan and Akella, 2014], where they estimate, prior to learning, a number of key poses to compute spatiotemporal action templates. Consequently, each action must be segmented into atomic action templates composed of a set of $n$ key poses, where $n$ depends on the action’s duration and complexity. Furthermore, experiments with low-latency (close to real-time) classification have not been reported. The second approach that outperforms our model achieves 93.2% precision, 91.9% recall, and 91.5% F-score Faria et al. [2014], in which they used a dynamic Bayesian Mixture Model to classify motion relations between body poses. However, the authors estimated their own skeleton model from raw depth images and did not use the one provided by the CAD-60 benchmark dataset. Therefore, differences in the tracked skeleton exist that hinder a direct quantitative comparison with our approach.

3.2 Emotion Recognition from Body Expressions

The recognition of emotions plays an important role in our daily life, is essential for social communication, and can be particularly useful in HRI scenarios. For instance, a socially-assistive robot may be able to strengthen its relationship with the user if it can understand whether that person is bored, angry, or upset. Body expressions convey an additional social cue to reinforce or complement facial expressions [Pollick et al., 2001][Sawada et al., 2003]. Furthermore, this approach can complement the use of facial expressions when the user is not facing the sensor or is too distant from it for facial features to be computed. Despite its promising applications in HRI domains, emotion recognition from body motion patterns has received significantly less attention than facial expressions and speech analysis.

Movement kinematics such as velocity and acceleration represent significant features when it comes to recognizing emotions from body patterns [Pollick et al., 2001][Sawada et al., 2003]. Similarly, using temporal features in terms of body motion resulted in higher recognition rates than pose features alone Patwardhan and Knapp [2016]. Schindler and Van Gool [2008] presented an image-based classification system for recognizing emotion from images of body postures. The overall recognition accuracy of their system was 80% for six basic emotions. Although these systems show a high recognition rate, they are limited to postural emotions, which are not sufficient for a real-time interactive situation between humans and robots in a domestic environment. Piana et al. [2014] proposed a real-time emotion recognition system using postural, kinematic, and geometrical features which were extracted from sequences of 3D skeleton videos. However, they only considered a reduced set of upper-body joints, i.e., head, shoulders, elbows, hands, and torso.

Figure 6: Proposed learning architecture with a hierarchy of self-organizing networks. The first layer processes pose and motion features from individual frames, whereas in the second layer a Gamma-GWR network learns the spatiotemporal structure of the joint pose-motion representations [Elfaramawy et al.].

In Elfaramawy et al., we proposed a neural network architecture to recognize a set of emotional states from body motion patterns. The focus of our study was on investigating whether full-body expressions from depth map video sequences convey adequate affective information for the task of emotion recognition. The architecture, shown in Fig. 6, consists of a hierarchy of self-organizing networks for learning sequences of 3D body joint features. In the first layer, two GWR networks Marsland et al. [2002] learn a dictionary of prototype samples of pose and motion features respectively. Motion features are obtained by computing the difference between two consecutive frames containing pose features. In the second layer, a Gamma-GWR Parisi et al. [2017] is used to learn prototype sequences and associate symbolic labels to unsupervised visual representations of emotions for the purpose of classification. During the inference phase, unlabeled novel samples are processed by the hierarchical architecture, yielding patterns of neural weight activations. One best-matching neuron in the Gamma-GWR will activate for every 10 processed input frames.

For the evaluation of our system, we collected a dataset named the Body Expressions of Emotion (BEE), with nineteen participants performing six different emotional states: anger, fear, happiness, neutral, sadness, and surprise. The dataset was acquired in an HRI scenario consisting of a humanoid robot Nao extended with a depth sensor to extract 3D body skeleton information in real time. Nineteen participants took part in the data recordings (fourteen male, five female, age ranging from 21 to 33). The participants were students at the University of Hamburg and they declared not to have suffered any physical injury resulting in motor impairments. To compare the performance of our system to human observers, we performed an additional study in which 15 raters that did not take part in the data collection phase had to label depth map sequences as one of the six possible emotions.

For our approach, we used the full 3D skeleton model except for the feet, leading to 13 joints (i.e., 39 input dimensions). To obtain translation invariance, we encoded the joint positions using the center of the hips as the frame of reference. We then computed joint motion as the difference of two consecutive frames for each pose transition. Experimental results showed that our system successfully learned to classify the set of six training emotions and that its performance was very competitive with respect to human observers (see Table 1). The overall accuracy of emotions recognized by human observers was 90.2%, whereas our system showed an overall accuracy of 88.8%.

Metric       System    Human
Accuracy     88.8%     90.2%
Precision    66.3%     70.1%
Recall       68%       70.7%
F-score      66.8%     68.9%
Table 1: A comparison of overall recognition of emotions between our system and human performance.

As additional future work, we could investigate the development of a multimodal emotion recognition scenario, i.e., by taking into account auditory information that complements the use of visual cues Barros and Wermter [2016]. The integration of audio-visual stimuli for emotion recognition has been shown to be very challenging but also strongly promising for a more natural HRI experience.

4 Body Motion Assessment

4.1 Background

The correct execution of well-defined movements plays a key role in physical rehabilitation and sports. While the goal of action recognition approaches is to categorize a set of distinct classes by extrapolating inter-class differences, action assessment requires instead a model to capture intra-class dissimilarities that allow expressing a measurement on how much an action follows its learned template. The quality of actions can be computed in terms of how much a performed movement matches the correct continuation of a learned motion sequence template. Visual representations can then provide useful qualitative feedback to assist the user on the correct performance of the routine and the correction of mistakes (Fig. 7). The task of assessing the quality of actions and providing feedback in real time for correcting inaccurate movements represents a challenging visual task.

Figure 7: Visual feedback for a correct squat sequence (top) and a sequence containing a knees-in mistake (bottom, joints and limbs in red) [Parisi et al., 2015a].

Artificial systems for the visual assessment of body motion have been previously investigated for applications mainly focused on physical rehabilitation and sports training. For instance, Chang et al. [2011] proposed a physical rehabilitation system using a Kinect sensor for young patients with motor disabilities. The idea was to assist the users while performing a set of simple movements necessary to improve their motor proficiency during the rehabilitation period. Although experimental results have shown improved motivation for users using visual hints, only movements involving the arms at constant speed were considered. Furthermore, the system does not provide real-time feedback to enable the user to spot and correct mistakes in a timely manner. Similarly, Su [2013] proposed the estimation of feedback for Kinect-based rehabilitation exercises by comparing current motion with a pre-recorded execution by the same person. The comparison was carried out on sequences using dynamic time warping and fuzzy logic with the Euclidean distance as a similarity measure. The evaluation of the exercises was based on the degree of similarity between the current sequence and a correct sequence. The system provided qualitative feedback on the similarity of body joints and execution speed, but it did not suggest to the user how to correct the movement.

4.2 Motion Prediction and Correction

In Parisi et al. [2016a], we proposed a learning architecture that consists of two hierarchically arranged layers with self-organizing networks for human motion assessment in real time (Fig. 8). The first layer is composed of two GWR networks that learn a dictionary of posture and motion feature vectors respectively. This hierarchical scheme has the advantage of using a fixed set of learned features to compose more complex patterns in the second layer, where the Gamma-GWR is trained with sequences of posture-motion activation patterns from the first layer to learn the spatiotemporal structure of the input.

Figure 8: Learning architecture with growing self-organizing networks. In Layer 1, two GWR networks learn posture and motion features respectively. In Layer 2, a Gamma-GWR learns spatiotemporal dynamics of body motion. This mechanism allows predicting the template continuation of a learned sequence and computing feedback as the difference between its current and its expected execution [Parisi et al., 2016a].

The underlying idea for assessing the quality of a sequence is to measure how much the current input sequence differs from a learned sequence template. Provided that the trained model is able to predict a training sequence with a satisfactory degree of accuracy, it is then possible to quantitatively compute how much a novel sequence differs from this expected pattern. We defined a feedback function that computes the difference between the current input and its expected value, i.e., the prediction of the next element of the sequence given the preceding input. The prediction is selected from the set of neurons of the trained network, and the difference is measured as the Euclidean distance. Since the weight and context vectors of the prototype neurons lie in the same feature space as the input, it is possible to provide joint-wise feedback computations. The recursive prediction function can be applied an arbitrary number of timesteps into the future. Therefore, after the training phase is completed, the feedback can be computed in real time with linear computational complexity.
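The feedback computation can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of Parisi et al. [2016a]: each neuron is reduced to one weight vector and one context vector, the prediction is assumed to be the weight of the neuron whose context vector is closest to the preceding input, and the names predict_next, feedback, jointwise_feedback, and predict_ahead are hypothetical.

```python
import math

def predict_next(neurons, x_prev):
    """Predict the next input as the weight of the neuron whose context
    vector is closest to the preceding input (assumed selection rule).
    One pass over all neurons, hence linear in the number of neurons."""
    best = min(neurons, key=lambda n: math.dist(n["context"], x_prev))
    return best["weight"]

def feedback(neurons, x_prev, x_curr):
    """Euclidean distance between the observed input and its prediction."""
    return math.dist(x_curr, predict_next(neurons, x_prev))

def jointwise_feedback(neurons, x_prev, x_curr, dims_per_joint=3):
    """Same comparison, reported separately for every 3D joint."""
    expected = predict_next(neurons, x_prev)
    return [math.dist(x_curr[i:i + dims_per_joint], expected[i:i + dims_per_joint])
            for i in range(0, len(x_curr), dims_per_joint)]

def predict_ahead(neurons, x, steps):
    """Apply the prediction recursively to look several timesteps ahead."""
    future = []
    for _ in range(steps):
        x = predict_next(neurons, x)
        future.append(x)
    return future

# Toy template with two joints (6 dimensions): joint A moves along x,
# joint B stays at (0, 1, 0). All values are made up for illustration.
template = [[float(t), 0.0, 0.0, 0.0, 1.0, 0.0] for t in range(4)]
neurons = [{"context": template[k - 1], "weight": template[k]} for k in range(1, 4)]

observed = [1.0, 0.0, 0.0, 0.0, 1.0, 0.5]  # joint B drifts by 0.5 along z
print(feedback(neurons, template[0], observed))
print(jointwise_feedback(neurons, template[0], observed))
print(predict_ahead(neurons, template[0], 2))
```

The joint-wise variant is what makes it possible to point at the specific joint that deviates, here the second one, instead of reporting a single global error.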

Figure 9: Visual hints for the correct execution of the Finger to nose routine. Progressively fading violet lines indicate the learned action template [Parisi et al., 2015a].

The visual effect of this prediction mechanism is shown in Fig. 9. For this example, the architecture was trained with the Finger to nose routine, which consists of keeping the arm bent at the elbow and then touching the nose with the tip of the finger. As soon as the person starts performing the routine, progressively fading violet lines show the next time steps, providing visual assistance for successful execution. The number of displayed future steps was empirically determined to provide a substantial reference to future steps while limiting visual clutter.

To compute visual feedback, we used the predictions as hints on how to perform a routine over 100 timesteps into the future, and then used the feedback function to spot mistakes on novel sequences that do not follow the expected pattern for individual joint pairs. An execution mistake is detected if the output of the feedback function exceeds a given threshold over a given number of timesteps. Visual representations of these computations can then provide useful qualitative feedback to assist the user with the correct performance of the routine and the correction of mistakes (Fig. 7). Our approach also learns motion intensity to better detect temporal discrepancies. Therefore, it is possible to provide accurate feedback on posture transitions and the correct execution of lockouts.
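The detection rule can be illustrated with a short sketch. It assumes that "over a number of timesteps" means a run of consecutive timesteps above the threshold; the function name detect_mistakes, the threshold, and the feedback values are hypothetical and chosen only for the example.

```python
def detect_mistakes(feedback_values, threshold, min_steps):
    """Return (start, end) index pairs, end exclusive, of runs in which the
    feedback stays above `threshold` for at least `min_steps` consecutive
    timesteps (assumed reading of the detection rule)."""
    runs = []
    start = None
    for t, value in enumerate(feedback_values):
        if value > threshold:
            if start is None:
                start = t  # a new run above the threshold begins
        else:
            if start is not None and t - start >= min_steps:
                runs.append((start, t))
            start = None
    # Close a run that lasts until the end of the sequence.
    if start is not None and len(feedback_values) - start >= min_steps:
        runs.append((start, len(feedback_values)))
    return runs

# Made-up feedback values for one joint: a single spike and a longer deviation.
values = [0.1, 0.9, 0.2, 0.8, 0.9, 0.7, 0.1]
print(detect_mistakes(values, threshold=0.5, min_steps=3))
```

Requiring several timesteps above the threshold ignores isolated spikes, such as the one at index 1, which would typically come from sensor noise rather than from an execution mistake.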

4.3 Dataset and Evaluation

We evaluated our approach with a data set containing 3 powerlifting exercises performed by 17 athletes: High bar back squat, Deadlift, and Dumbbell lateral raise. The data collection took place at the Kinesiology Institute of the University of Hamburg, Germany, where 17 volunteering participants (9 male, 8 female) performed the 3 exercises. We captured body motion of correct and incorrect executions with a Kinect v2 sensor and estimated body joints using the Kinect SDK 2.0, which provides a set of joint coordinates for every frame. The participants executed the routines frontal to the sensor placed at 1 meter from the ground. We extracted the 3D joints for head, neck, wrists, elbows, shoulders, spine, hips, knees, and ankles, for a total of 13 3D-joints (39 dimensions). We computed motion intensity from posture sequences as the difference between consecutive joint pairs. The Kinect’s skeleton model (Fig. 7), although not faithful to human anatomy, provides reliable estimations of the joints’ position over time when the user is facing the sensor. We manually segmented single repetitions for all exercises. In order to obtain translation invariance, we subtracted the spine_base joint (the center of the hips) from all the joints in absolute coordinates.
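The two preprocessing steps, centering on spine_base and computing motion from consecutive postures, can be sketched as follows. The frame layout (a dictionary from joint name to 3D coordinates), the helper names center_on_joint and motion_between, and the coordinate values are assumptions for illustration; motion is read here as the per-joint difference between two consecutive frames.

```python
def center_on_joint(frame, root="spine_base"):
    """Subtract the root joint from every joint so that the posture no longer
    depends on where the person stands relative to the sensor."""
    rx, ry, rz = frame[root]
    return {name: (x - rx, y - ry, z - rz) for name, (x, y, z) in frame.items()}

def motion_between(prev_frame, frame):
    """Per-joint difference between two consecutive (centered) frames."""
    return {name: tuple(c - p for c, p in zip(frame[name], prev_frame[name]))
            for name in frame}

# Two made-up frames: the whole body shifts 0.5 along x and the head rises by 0.25.
frame_0 = {"spine_base": (1.0, 1.0, 2.0), "head": (1.0, 2.0, 2.0)}
frame_1 = {"spine_base": (1.5, 1.0, 2.0), "head": (1.5, 2.25, 2.0)}

centered_0 = center_on_joint(frame_0)
centered_1 = center_on_joint(frame_1)
print(centered_1["head"])
print(motion_between(centered_0, centered_1)["head"])
```

After centering, the global shift along x disappears and only the actual change in posture, the raised head, remains in the motion vector.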

We evaluated our method for computing feedback with individual and multiple subjects. We divided the correct body motion data with 3-fold cross-validation into training and test sets and trained the models with data containing correct motion sequences only. For the inference phase, both the correct and incorrect movements were used, with a fixed feedback threshold applied over a fixed number of frames. Our expectation was that the output of the feedback function would be higher for sequences containing mistakes. We counted true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP), and computed the true positive rate (TPR or sensitivity), the true negative rate (TNR or specificity), and the positive predictive value (PPV or precision). Results for single- and multiple-subject data on the E1, E2, and E3 routines are displayed in Tables 2 and 3 respectively, along with a comparison with the best-performing feedback function from Parisi et al. [2015a], in which we used only pose frames without explicit motion information.
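For reference, the three measures follow directly from the four counts. The sketch below uses the standard definitions; the function name rates is hypothetical, and the example counts are those of one E3 row of the multi-subject table.

```python
def rates(tp, fn, tn, fp):
    """Sensitivity, specificity, and precision from confusion counts."""
    nan = float("nan")
    return {
        "TPR": tp / (tp + fn) if tp + fn else nan,  # share of actual positives that are detected
        "TNR": tn / (tn + fp) if tn + fp else nan,  # share of actual negatives that are recognized
        "PPV": tp / (tp + fp) if tp + fp else nan,  # share of detections that are actual positives
    }

# Counts of one multi-subject E3 row: TP=126, FN=0, TN=15, FP=31.
result = rates(tp=126, fn=0, tn=15, fp=31)
print({name: round(value, 2) for name, value in result.items()})
```

A low TNR combined with a high TPR, as in this row, is the signature of a detector that rarely misses a positive but produces many false positives.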

The evaluation on single subjects showed that the system successfully provides feedback on posture errors with high accuracy. GWR-like networks allow reducing the temporal quantization error over longer timesteps, so that more accurate feedback can be computed and the number of false negatives and false positives is reduced. Furthermore, since the networks can create new neurons according to the distribution of the input, each network can learn a larger number of possible executions of the same routine, thus being more suitable for training sessions with multiple subjects. Tests with multiple-subject data showed significantly decreased performance, mostly due to a large number of false positives. This is not so much a flaw of the learning mechanism as a consequence of people having different body configurations and, therefore, slightly different ways to perform the same routine. To attenuate this issue, we can set different values for the feedback threshold. For larger values, the system would tolerate more variance in the performance. However, a higher degree of variance may not be desirable in some application domains. For instance, rehabilitation routines may be tailored to a specific subject based on their specific body configuration and health condition.

Routine    TP    FN    TN    FP    TPR    TNR    PPV
E1         35    10    33     0   0.77      1      1
E1         35     2    41     0   0.97      1      1
E2         24     0    20     0      1      1      1
E2         24     0    20     0      1      1      1
E3         63     0    26     0      1      1      1
E3         63     0    26     0      1      1      1
Table 2: Single-subject evaluation.

Routine    TP    FN    TN    FP    TPR    TNR    PPV
E1        326     1     7   151   0.99   0.04   0.68
E1        328     1    13   143   0.99   0.08   0.70
E2        127     2     0   121   0.98      0   0.51
E2        139     0     0   111      1      0   0.56
E3        123     0     8    41      1   0.16   0.75
E3        126     0    15    31      1   0.33   0.80
Table 3: Multi-subject evaluation. Best results in bold.

Our results encourage further work on embedding this type of real-time system into an assistive robot that can interact with the user and motivate the correct performance of physical rehabilitation routines and sports training. The positive effects of having a motivational robot for health-related tasks have been shown in a number of studies [Dautenhahn, 1999, Kidd and Breazeal, 2007, Nalin et al., 2012]. The assessment of body motion plays a role not only in the detection of mistakes on training sequences but also in the timely recognition of gait deterioration, e.g., linked to age-related cognitive decline. Growing learning architectures are particularly suitable for this task since they can adapt to the user over longer periods of time while still detecting significant changes in their motor skills.

5 Continual Learning of Human Actions

5.1 Background

Deep learning models for visual tasks typically comprise a set of convolution and pooling layers trained in a hierarchical fashion to yield action feature representations with an increasing degree of abstraction (see [Guo et al., 2016] for a recent survey). This processing scheme is in agreement with neurophysiological studies supporting the presence of functional hierarchies with increasingly large spatial and temporal receptive fields along cortical pathways [Giese and Poggio, 2003, Hasson et al., 2008]. However, the training of deep learning models for action sequences has proven to be computationally expensive and requires an adequately large number of training samples for the successful learning of spatiotemporal filters. Consequently, the question arises whether traditional deep learning models for action recognition can account for real-world learning scenarios, in which the number of training samples may not be sufficiently high and the system may be required to learn from novel input in a continual learning fashion.

The approaches described in Sections 3 and 4 rely on the extraction of a simplified 3D skeleton model from which low-dimensional pose and motion features can be computed to process actor-independent action dynamics. The use of such models is in line with biological evidence demonstrating that human observers are very proficient in learning and recognizing complex motion underlying a skeleton structure [Jastorff et al., 2006, Hiris, 2007]. These studies show that the presence of a holistic structure improves the learning speed and accuracy of action patterns, also for non-biologically relevant motion such as artificial complex motion patterns. However, skeleton models are susceptible to sensor noise and situations of partial occlusion and self-occlusion (e.g. caused by body rotation). In this section, I describe how self-organizing architectures can be extended to learn and recognize actions in a continual learning fashion from raw RGB image sequences.

5.2 Deep Neural Network Self-Organization

In Parisi et al. [2017], we proposed a self-organizing architecture consisting of a series of hierarchically arranged growing networks for the continual learning of actions from high-dimensional input streams (Fig. 10). Each layer in the hierarchy comprises a Gamma-GWR and a pooling mechanism for learning action features with increasingly large spatiotemporal receptive fields. In the last layer, neural activation patterns from distinct pathways are integrated. The proposed deep architecture is composed of two distinct processing streams for pose and motion features, and their subsequent integration in the STS layer. Neurons in this network are activated by the latest input samples, i.e., by a temporal window of inputs ending at the current time step.

Figure 10: Diagram of our deep neural architecture with Gamma-GWR networks for continual action recognition. Posture and motion action cues are processed separately in the ventral (VP) and the dorsal pathway (DP) respectively. At the STS stage, the recurrent GWR network learns associative connections between prototype action representations and symbolic labels [Parisi et al., 2017].

Deep architectures obtain invariant responses by alternating layers of feature detectors and nonlinear pooling neurons using, e.g., the maximum (MAX) operation, which has been shown to achieve higher feature specificity and more robust invariance than linear summation [Guo et al., 2016]. Robust invariance to translation has been obtained via MAX and average pooling, with the MAX operator showing faster convergence and improved generalization [Scherer et al., 2010]. In our architecture, we implemented MAX-pooling layers after each Gamma-GWR network (see Fig. 10). For each input image patch, a best-matching neuron is computed in the current layer according to Eq. 4, and only its maximum weight value is forwarded to the next layer. The forwarded value is therefore not an actual neural weight of the next layer, but rather a pooled activation value from the current layer that is used as input to the next layer. Since the spatial receptive fields of neurons increase along the hierarchy, this pooling process yields scale and position invariance.
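The pooling step can be illustrated with a small sketch. It is a simplification under stated assumptions: the best-matching neuron is taken to be the one with the smallest Euclidean distance between its weight vector and the patch, ignoring the temporal context that a Gamma-GWR also uses, and the names best_matching_neuron, pool_patch, and pool_layer as well as all values are hypothetical.

```python
import math

def best_matching_neuron(weights, patch):
    """Index of the neuron whose weight vector is closest to the patch
    (simplified stand-in for the best-matching rule of the network)."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], patch))

def pool_patch(weights, patch):
    """Forward only the maximum weight value of the best-matching neuron."""
    winner = best_matching_neuron(weights, patch)
    return max(weights[winner])

def pool_layer(weights, patches):
    """Pooled values for all patches of one input; together they form the
    input vector of the next layer."""
    return [pool_patch(weights, patch) for patch in patches]

# Two made-up prototype neurons and two flattened image patches.
weights = [[0.1, 0.9, 0.2],
           [0.8, 0.3, 0.4]]
patches = [[0.0, 1.0, 0.0],
           [1.0, 0.0, 0.5]]
print(pool_layer(weights, patches))
```

Each patch contributes a single scalar to the next layer, so the dimensionality of the representation shrinks while the area of the image covered by one neuron grows along the hierarchy.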

5.3 Datasets and Evaluation

We conducted experiments with two action benchmarks: the Weizmann [Gorelick et al., 2005] and the KTH [Schuldt et al., 2004] datasets.

The Weizmann dataset contains 90 low-resolution image sequences with 10 actions performed by 9 subjects. The actions are walk, run, jump, gallop sideways, bend, one-hand wave, two-hands wave, jump in place, jumping jack, and skip. Sequences were recorded at a fixed resolution with a static background and are about 3 seconds long. We used the aligned foreground body shapes obtained by background subtraction that are included in the dataset. For compatibility with Schindler and Van Gool [2008], we trimmed all sequences to a total of 28 frames, which is the length of the shortest sequence, and evaluated our approach by performing leave-one-out cross-validation, i.e., 8 subjects were used for training and the remaining one for testing. This procedure was repeated for all 9 permutations and the results were averaged. Our overall accuracy was 98.7%, which is competitive with the best reported result of 99.64% [Gorelick et al., 2005]. In their approach, they extracted action features over a number of frames by concatenating 2D body silhouettes in a space-time volume and used nearest neighbors and Euclidean distance to classify. Notably, our results outperform the overall accuracy reported by Jung et al. [2015] with three different deep learning models: convolutional neural network (CNN, 92.9%), multiple spatiotemporal scales neural network (MSTNN, 95.3%), and 3D CNN (96.2%). However, a direct comparison of the above-described methods with ours is hindered by the fact that they differ in the type of input and number of frames per sequence used during the training and the test phase.
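The evaluation protocol, trimming followed by leave-one-subject-out cross-validation, can be sketched as follows. The helper names are hypothetical, the trimming is assumed to keep the first 28 frames, and evaluate stands for any user-supplied function that trains on the given subjects and returns the accuracy on the held-out one.

```python
def trim(sequence, length=28):
    """Keep the first `length` frames (assumed trimming strategy)."""
    return sequence[:length]

def leave_one_subject_out(subjects):
    """Yield (training subjects, held-out subject) for every subject."""
    for held_out in subjects:
        yield [s for s in subjects if s != held_out], held_out

def cross_validate(subjects, evaluate):
    """Average the accuracy returned by `evaluate(train, test)` over all splits."""
    scores = [evaluate(train, test) for train, test in leave_one_subject_out(subjects)]
    return sum(scores) / len(scores)

subjects = [f"subject_{i}" for i in range(1, 10)]  # 9 subjects
splits = list(leave_one_subject_out(subjects))
print(len(splits), len(splits[0][0]))

# Placeholder evaluation that always reports a perfect score, for illustration only.
print(cross_validate(subjects, lambda train, test: 1.0))
```

Splitting by subject rather than by sequence ensures that the test data always comes from a person the model has never seen during training.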

The KTH action dataset contains 25 subjects performing 6 different actions: walking, jogging, running, boxing, hand-waving, and hand-clapping, for a total of 2391 sequences. Action sequences were performed in 4 different scenarios: indoor, outdoor, variations in scale, and changes in clothing. Videos were recorded over homogeneous backgrounds at a fixed spatial resolution and frame rate. Following the evaluation schemes from the literature, we trained our model with 16 randomly selected subjects and used the other 9 subjects for testing. The overall classification accuracy averaged across 5 trials achieved by our model was 98.7%, which is competitive with the two best reported results, from Ravanbakhsh et al. [2015] and Gao et al. [2010]. In the former approach, a hierarchical CNN model was used to capture sub-actions from complex ones. Key frames were extracted using binary coding of each frame in a video, which helps to improve the performance of the hierarchical model (from 94.1% to 95.6%). In the latter approach, handcrafted interest points with substantial motion were computed, which is computationally demanding. Our model outperforms other hierarchical models that do not rely on handcrafted features, such as 3D CNN [S. et al., 2013] and 3D CNN in combination with long short-term memory [Baccouche et al., 2011].

6 Conclusions and Open Challenges

The underlying neural mechanisms for action perception have been extensively studied, comprising cortical hierarchies for processing body motion cues with increasing complexity of representation [Taylor et al., 2015, Hasson et al., 2008, Lerner et al., 2011], i.e. higher-level areas process information accumulated over larger temporal windows with increasing invariance to the position and the scale of stimuli. Consequently, the study of the biological mechanisms for action perception is fundamental for the development of artificial systems aimed to address the robust recognition of actions in HRI scenarios.

Motivated by the process of input-driven self-organization exhibited by topographic maps in the cortex [Nelson, 2000, Willshaw and von der Malsburg, 1976, Miikkulainen et al., 2005], I introduced learning architectures with hierarchically arranged growing networks that integrate body posture and motion features for action recognition and assessment. The proposed architectures can be considered a further step towards more flexible neural network models for learning robust visual representations on the basis of visual experience. Successful applications of deep neural network self-organization include human action recognition [Parisi et al., 2014c, 2015b, Elfaramawy et al., 2017], gesture recognition [Parisi et al., 2014a, b], body motion assessment [Parisi et al., 2015a, 2016a], human-object interaction [Mici et al., 2017, 2018], continual learning [Parisi et al., 2017, 2018b], and audio-visual integration [Parisi et al., 2016b].

Models of hierarchical action learning are typically feedforward. However, neurophysiological studies have shown that the visual cortex exhibits significant feedback connectivity between different cortical areas [Felleman and Van Essen, 1991, Salin and Bullier, 1995]. In particular, action perception demonstrates strong top-down modulatory influences from attentional mechanisms [Thornton et al., 2002] and higher-level cognitive representations such as biomechanically plausible motion [Shiffrar and Freyd, 1990]. Spatial attention allows animals and humans to process relevant environmental stimuli while suppressing irrelevant information. Therefore, attention as a modulator in action perception is also desirable from a computational perspective, thereby allowing the suppression of uninteresting parts of the visual scene and thus simplifying the detection and segmentation of human motion in cluttered environments.

The integration of multiple sensory modalities such as vision and audio is crucial for enhancing the perception of actions, especially in situations of uncertainty, with the aim of operating reliably in highly dynamic environments. Experiments in HRI scenarios have shown that the integration of audio-visual cues significantly improves performance with respect to unimodal approaches for sensory-driven robot behavior [Parisi et al., 2016c, Cruz et al., 2016, 2018]. The investigation of biological mechanisms of multimodal action perception is an important research direction for the development of learning systems exposed to rich streams of information in real-world scenarios.

The author would like to thank Pablo Barros, Doreen Jirak, Jun Tani, and Stefan Wermter for great discussions and feedback.


  • Aerts et al. [2012] M. Aerts, R. Esselink, B. Post, B. van de Warrenburg, and B. Bloem. Improving the diagnostic accuracy in parkinsonism: a three-pronged approach. Practical Neurology, 12(1):77–87, 2012.
  • Alonso-Martin et al. [2013] F. Alonso-Martin, M. Malfaz, J. Sequeira, J. F. Gorostiza, and M. A. Salichs. A multimodal emotion detection system during human-robot interaction. Sensors, 13(11):15549–15581, 2013.
  • Baccouche et al. [2011] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential deep learning for human action recognition. In Human Behavior Understanding (HBU): Second International Workshop, pages 29–39. Springer Berlin Heidelberg, 2011.
  • Barros and Wermter [2016] P. Barros and S. Wermter. Developing crossmodal expression recognition based on a deep neural model. Adaptive Behavior, 24(5):373–396, 2016.
  • Chang et al. [2011] Y.-J. Chang, S.-F. Chen, and J.-D. Huang. A Kinect-based system for physical rehabilitation: A pilot study for young adults with motor disabilities. Research in Developmental Disabilities, 32(6):2566–2570, 2011. ISSN 08914222. doi: 10.1016/j.ridd.2011.07.002.
  • Cruz et al. [2016] F. Cruz, G. Parisi, J. Twiefel, and S. Wermter. Multi-modal integration of dynamic audiovisual patterns for an interactive reinforcement learning scenario. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 759–766, 2016.
  • Cruz et al. [2018] F. Cruz, G. Parisi, J. Twiefel, and S. Wermter. Multi-modal integration of dynamic audiovisual patterns for an interactive reinforcement learning scenario. Proceedings of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 759–766, 2018.
  • Dautenhahn [1999] K. Dautenhahn. Robots as social actors: Aurora and the case of autism. In Third Cognitive Technology Conference, 1999.
  • Elfaramawy et al. [2017] N. Elfaramawy, P. Barros, G. I. Parisi, and S. Wermter. Emotion recognition from body expressions with a neural network architecture. In Proceedings of the International Conference on Human Agent Interaction (HAI’17), Bielefeld, Germany, pages 143–149, 2017.
  • Eriksson et al. [1998] P. S. Eriksson, E. Perfilieva, T. Bjork-Eriksson, A.-M. Alborn, C. Nordborg, D. A. Peterson, and F. H. Gage. Neurogenesis in the adult human hippocampus. Nature Medicine, 4(11):1313–1317, 1998. ISSN 1078-8956. doi: 10.1038/3305.
  • Faria et al. [2014] D. R. Faria, C. Premebida, and U. Nunes. A probabilistic approach for human everyday activities recognition using body motion from RGB-D images. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 842–849, 2014.
  • Felleman and Van Essen [1991] D. Felleman and D. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1–47, 1991.
  • Fonlupt [2003] P. Fonlupt. Perception and judgement of physical causality involve different brain structures. Cognitive Brain Research, 17(2):248 – 254, 2003. ISSN 0926-6410. doi: 10.1016/S0926-6410(03)00112-5.
  • Fritzke [1995] B. Fritzke. A growing neural gas network learns topologies. In Advances in Neural Information Processing Systems 7, pages 625–632. MIT Press, 1995.
  • Gao et al. [2010] Z. Gao, M.-y. Chen, A. G. Hauptmann, and A. Cai. Comparing Evaluation Protocols on the KTH Dataset, pages 88–100. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
  • Giese and Poggio [2003] M. A. Giese and T. Poggio. Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience, 4(3):179–192, March 2003. doi: 10.1038/nrn1057.
  • Gorelick et al. [2005] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In Proceedings of the International Conference on Computer Vision (ICCV), pages 1395–1402, 2005.
  • Guo et al. [2016] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew. Deep learning for visual understanding: A review. Neurocomputing, 187:27 – 48, 2016.
  • Han et al. [2013] J. Han, L. Shao, D. Xu, and J. Shotton. Enhanced computer vision with Microsoft Kinect sensor. IEEE Transactions on cybernetics, 43(5):1318–1334, 2013.
  • Hasson et al. [2008] U. Hasson, E. Yang, I. Vallines, D. J. Heeger, and N. Rubin. A hierarchy of temporal receptive windows in human cortex. The Journal of Neuroscience, 28(10):2539–2550, 2008. ISSN 1529-2401.
  • Hebb [1949] D. O. Hebb. The organization of behavior: a neuropsychological theory. Wiley, New York, 1949.
  • Hiris [2007] E. Hiris. Detection of biological and nonbiological motion. Journal of Vision, 7(12):1–16, 2007.
  • Jastorff et al. [2006] J. Jastorff, Z. Kourtzi, and M. A. Giese. Learning to discriminate complex movements: biological versus artificial trajectories. Journal of Vision, 6(8):791–804, 2006.
  • Jung et al. [2015] M. Jung, J. Hwang, and J. Tani. Self-organization of spatio-temporal hierarchy via learning of dynamic visual image patterns on action sequences. PLoS ONE, 10(7):e0131214, 07 2015.
  • Kachouie et al. [2014] R. Kachouie, S. Sedighadeli, R. Khosla, and M. Chu. Socially assistive robots in elderly care: A mixed-method systematic literature review. Int. J. Hum. Comput. Interaction, 30(5):369–393, 2014. doi: 10.1080/10447318.2013.873278.
  • Kidd and Breazeal [2007] C. D. Kidd and C. Breazeal. A robotic weight loss coach. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1985–1986, 2007.
  • Knoblauch [2017] A. Knoblauch. Impact of structural plasticity on memory formation and decline. In A. van Ooyen and M. Butz, editors, Rewiring the Brain: A Computational Approach to Structural Plasticity in the Adult Brain. Elsevier, Academic Press, 2017.
  • Kohonen [1991] T. Kohonen. Self-organizing maps: optimization approaches. In Artificial neural networks, II, pages 981–990, 1991.
  • Krizhevsky [2009] A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009.
  • LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
  • Lerner et al. [2011] Y. Lerner, C. J. Honey, L. J. Silbert, and U. Hasson. Topographic mapping of a hierarchy of temporal receptive windows using a narrated story. The Journal of neuroscience, 31(8):2906–2915, 2011. doi: 10.1523/jneurosci.3684-10.2011.
  • Marsland et al. [2002] S. Marsland, J. Shapiro, and U. Nehmzow. A self-organising network that grows when required. Neural Networks, 15(8-9):1041–1058, 2002.
  • Mermillod et al. [2013] M. Mermillod, A. Bugaiska, and P. Bonin. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in Psychology, 4(504), 2013.
  • Mici et al. [2017] L. Mici, G. I. Parisi, and S. Wermter. An incremental self-organizing architecture for sensorimotor learning and prediction. arXiv:1712.08521, 2017.
  • Mici et al. [2018] L. Mici, G. I. Parisi, and S. Wermter. A self-organizing neural network architecture for learning human-object interactions. Neurocomputing, 307:14–24, 2018.
  • Miikkulainen et al. [2005] R. Miikkulainen, J. A. Bednar, Y. Choe, and J. Sirosh. Computational Maps in the Visual Cortex. Springer, 2005. ISBN 978-0-387-22024-6. doi: 10.1007/0-387-28806-6.
  • Ming and Song [2011] G.-l. Ming and H. Song. Adult neurogenesis in the mammalian brain: Significant answers and significant questions. Neuron, 70(4):687–702, 2011. doi: 10.1016/j.neuron.2011.05.001.
  • Nalin et al. [2012] M. Nalin, I. Baroni, A. Sanna, and C. Pozzi. Robotic companion for diabetic children: emotional and educational support to diabetic children, through an interactive robot. In ACM SIGCHI, pages 260–263, 2012.
  • Nelson [2000] C. A. Nelson. Neural plasticity and human development: the role of early experience in sculpting memory systems. Developmental Science, 3(2):115–136, 2000.
  • Nwe et al. [2003] T. L. Nwe, S. W. Foo, and L. C. D. Silva. Speech emotion recognition using hidden Markov models. Speech Communication, 41(4):603–623, 2003.
  • Orban et al. [1982] G. Orban, L. Lagae, A. Verri, S. Raiguel, X. D., H. Maes, and V. Torre. First-order analysis of optical flow in monkey brain. Proceedings of the National Academy of Sciences, 89(7):2595–2599, 1982.
  • Parisi et al. [2014a] G. I. Parisi, P. Barros, and S. Wermter. FINGeR: Framework for interactive neural-based gesture recognition. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, pages 443–447, 2014a.
  • Parisi et al. [2014b] G. I. Parisi, D. Jirak, and S. Wermter. HandSOM - Neural clustering of hand motion for gesture recognition in real time. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Edinburgh, Scotland, UK, pages 981–986, 2014b.
  • Parisi et al. [2014c] G. I. Parisi, C. Weber, and S. Wermter. Human action recognition with hierarchical growing neural gas learning. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), pages 89–96, 2014c.
  • Parisi et al. [2015a] G. I. Parisi, F. von Stosch, S. Magg, and S. Wermter. Learning human motion feedback with neural self-organization. In Proceedings of International Joint Conference on Neural Networks (IJCNN), pages 2973–2978, 2015a.
  • Parisi et al. [2015b] G. I. Parisi, C. Weber, and S. Wermter. Self-organizing neural integration of pose-motion features for human action recognition. Frontiers in Neurorobotics, 9(3), 2015b.
  • Parisi et al. [2016a] G. I. Parisi, S. Magg, and S. Wermter. Human motion assessment in real time using recurrent self-organization. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 71–76, 2016a.
  • Parisi et al. [2016b] G. I. Parisi, J. Tani, C. Weber, and S. Wermter. Emergence of multimodal action representations from neural network self-organization. Cognitive Systems Research, 2016b.
  • Parisi et al. [2016c] G. I. Parisi, C. Weber, and S. Wermter. A neurocognitive robot assistant for robust event detection. Trends in Ambient Intelligent Systems: Role of Computational Intelligence, Series ”Studies in Computational Intelligence”, Springer, pages 1–28, 2016c.
  • Parisi et al. [2017] G. I. Parisi, J. Tani, C. Weber, and S. Wermter. Lifelong learning of humans actions with deep neural network self-organization. Neural Networks, 96:137–149, 2017.
  • Parisi et al. [2018a] G. I. Parisi, X. Ji, and S. Wermter. On the role of neurogenesis in overcoming catastrophic forgetting. NIPS’18, Workshop on Continual Learning, Montreal, Canada, 2018a.
  • Parisi et al. [2018b] G. I. Parisi, J. Tani, C. Weber, and S. Wermter. Lifelong learning of spatiotemporal representations with dual-memory recurrent self-organization. arXiv:1805.10966, 2018b.
  • Parisi et al. [2019] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.
  • Patwardhan and Knapp [2016] A. Patwardhan and G. Knapp. Multimodal affect recognition using kinect. arXiv:1607.02652, 2016.
  • Piana et al. [2014] S. Piana, A. Stagliano, F. Odone, A. Verri, and A. Camurri. Real-time automatic emotion recognition from body gestures. arXiv:1402.5047, 2014.
  • Picard [1997] R. W. Picard. Affective Computing. MIT Press, Cambridge, MA, USA, 1997. ISBN 0-262-16170-2.
  • Pollick et al. [2001] F. E. Pollick, H. M. Paterson, A. Bruderlin, and A. J. Sanford. Perceiving affect from arm movement. Cognition, 82(2):B51–B61, 2001.
  • Poppe [2010] R. Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 28:976–990, 2010.
  • Ravanbakhsh et al. [2015] M. Ravanbakhsh, H. Mousavi, M. Rastegari, V. Murino, and L. S. Davis. Action recognition with image based cnn features. CoRR abs/1512.03980, 2015.
  • S. et al. [2013] J. S., X. W., Y. M., and Y. K. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.
  • Salin and Bullier [1995] P. Salin and J. Bullier. Corticocortical connections in the visual system: structure and function. Physiological Reviews, 75(1):107–154, 1995.
  • Sawada et al. [2003] M. Sawada, K. Suda, and M. Ishii. Expression of emotions in dance: Relation between arm movement characteristics and emotion. Perceptual and Motor Skills, 97(3):697–708, 2003.
  • Scherer et al. [2010] D. Scherer, A. Müller, and S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), pages 92–101, Berlin, Heidelberg, 2010. Springer-Verlag. ISBN 3-642-15824-2, 978-3-642-15824-7.
  • Schindler and Van Gool [2008] K. Schindler and L. J. Van Gool. Action snippets: How many frames does human action recognition require? In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2008.
  • Schuldt et al. [2004] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In Proceedings of the International Conference on the Pattern Recognition (ICPR), pages 32–36, Washington, DC, USA, 2004. IEEE Computer Society.
  • Shan and Akella [2014] J. Shan and S. Akella. 3D human action segmentation and recognition using pose kinetic energy. In Workshop on Advanced Robotics and its Social Impacts (IEEE), pages 69–75, 2014.
  • Shiffrar and Freyd [1990] M. Shiffrar and J. J. Freyd. Apparent motion of the human body. Psychological Science, 1:257–264, 1990.
  • Strickert and Hammer [2005] M. Strickert and B. Hammer. Merge SOM for temporal data. Neurocomputing, 64, 2005. doi: 10.1016/j.neucom.2004.11.014.
  • Su [2013] C.-J. Su. Personal rehabilitation exercise assistant with Kinect and dynamic time warping. International Journal of Information and Education Technology, 3(4):448–454, 2013. doi: 10.7763/IJIET.2013.V3.316.
  • Sung et al. [2012] J. Sung, C. Ponce, B. Selman, and A. Saxena. Unstructured human activity detection from RGBD images. In Proceedings of the International Conference on Robotics and Automation (ICRA), pages 842–849, 2012.
  • Taylor et al. [2015] P. Taylor, J. N. Hobbs, J. Burroni, and H. T. Siegelmann. The global landscape of cognition: hierarchical aggregation as an organizational principle of human cortical networks and functions. Scientific Reports, 5(18112), 2015.
  • Thornton et al. [2002] I. M. Thornton, R. A. Rensink, and M. Shiffrar. Active versus passive processing of biological motion. Perception, 31:837–853, 2002.
  • Ungerleider and Mishkin [1982] L. Ungerleider and M. Mishkin. Two cortical visual systems. Analysis of Visual Behavior. Cambridge: MIT press, pages 549–586, 1982.
  • Velloso et al. [2013] E. Velloso, A. Bulling, G. Gellersen, W. Ugulino, and G. Fuks. Qualitative activity recognition of weight lifting exercises. In Augmented Human International Conference (ACM), pages 116–123, 2013.
  • Vettier and Garbay [2014] B. Vettier and C. Garbay. Abductive agents for human activity monitoring. International Journal on Artificial Intelligence Tools, 23, 2014.
  • Willshaw and von der Malsburg [1976] D. J. Willshaw and C. von der Malsburg. How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London B: Biological Sciences, 194(1117):431–445, 1976.