A Neurorobotics Approach to Behaviour Selection based on Human Activity Recognition

07/27/2021 ∙ by Caetano M. Ranieri, et al. ∙ Universidade de São Paulo ∙ Heriot-Watt University

Behaviour selection has been an active research topic for robotics, in particular in the field of human-robot interaction. For a robot to interact effectively and autonomously with humans, the coupling between techniques for human activity recognition, based on sensing information, and robot behaviour selection, based on decision-making mechanisms, is of paramount importance. However, most approaches to date consist of deterministic associations between the recognised activities and the robot behaviours, neglecting the uncertainty inherent to sequential predictions in real-time applications. In this paper, we address this gap by presenting a neurorobotics approach based on computational models that resemble neurophysiological aspects of living beings. This neurorobotics approach was compared to a non-bioinspired, heuristics-based approach. To evaluate both approaches, a robot simulation was developed, in which a mobile robot has to accomplish tasks according to the activity being performed by the inhabitant of an intelligent home. Each approach was evaluated according to the number of correct outcomes provided by the robot. Results revealed that the neurorobotics approach is advantageous, especially when considering the computational models based on more complex animals.




1 Introduction

Truly autonomous behaviour is still not the norm for robots designed to interact socially with humans [Clabaugh2019EscapingRobotics]. Behaviour selection has been an active research topic for robotics in general, and for human-robot interaction in particular [Ko2018BehaviorThought]. In this context, a real-time understanding of human actions is of paramount importance for the robotic agent to behave proactively and effectively. Such a requirement could be met with techniques for human activity recognition [Mojarad2018HybridRobots].

When dealing with complex modalities (e.g., videos or data from inertial units), activity recognition approaches often rely on machine learning. For instance, video-based activity recognition has been approached with architectures based on convolutional and recurrent neural networks [Herath2017GoingSurvey, Ma2019TS-LSTMRecognition]. For inertial data, similar architectures have been proposed, processing either raw data [Ordonez2016DeepRecognition, Garcia2019TemporalSensors] or descriptors obtained through feature extraction methods [StevenEyobu2018FeatureNetwork, Ashry2020CHARM-Deep:Smartwatch]. To provide a wider range of possibilities, robots may act symbiotically with other pervasive devices, such as wearable technologies or ambient sensors in intelligent environments, which may provide additional capabilities for sensing and acting based on application-specific components [Bacciu2019AnEnvironments]. When synchronised data from different sensors are available, activity recognition techniques may rely on multiple sensor modalities to provide more accurate results, giving rise to techniques for multimodal activity recognition [Lu2019AutonomousSensors, Imran2020EvaluatingRecognition, Ranieri2020UncoveringApproach].

Although human activity recognition has been a fertile field of research, few approaches have been developed to link the outputs of those algorithms to actual response behaviours of a robot. Related works usually consist of direct associations between the recognised activities and the response behaviours [Georgievski2017PlanningBuildings, Li2019Real-TimeMechanism, RodriguezLera2020AScenarios]. One alternative consists of combining computational neuroscience with robotics scenarios, which characterises the field of neurorobotics [VanDerSmagt2016Neurorobotics:Action] and may build upon different biological aspects that influence the behaviour of living beings.

Li et al. [Li2019CombinedSurvey] provided a comprehensive survey on neurorobotics systems (NRS) and the different components that may integrate them. According to the authors, a generalised framework can be depicted for most NRSs in the literature. It is composed of a simulated brain, which is fed with sensory signals from a body and turns them into control signals for a hierarchical controller. The controller is responsible for decoding these signals into control commands for the body, which acts on and senses an external environment. Bioinspired strategies may be introduced to different aspects of the framework, according to the required task of a particular study.

The basal ganglia, a group of subcortical nuclei present in the vertebrate brain, are known to have an important role in action selection mechanisms, especially regarding striatal circuits [Markowitz2018TheSelection]. The so-called direct and indirect pathways are characterised by competitive or complementary functions that mediate the excitation of the motor system based on inputs from the motivational system of an individual, deciding whether to "go" or to "stop" performing a certain behaviour [Bariselli2019ASelection]. The potential roles of such a mechanism in robotic frameworks have also been evaluated, including simulations in which bioinspired networks receiving different stimuli are expected to respond with different behaviours, resulting in cooperative interactions that produce robot behaviours [Bahuguna2018ExploringFramework].

In this paper, we present a neurorobotics model which embeds computational models of the basal ganglia-thalamus-cortex (BG-T-C) circuit [Kumaravelu2016ADisease, ranieri2021towardsPD] to provide a decision-making mechanism for a robot, which in this context we may call a neurorobot. The neurorobotics approach was proposed to enhance the decision-making mechanism, as suggested by related research in neurorobotics [Liang2019APrediction, Mulcahy2020BasalStates]. It consisted of simulating neurophysiological aspects within a cognitive framework, in which different stimuli were introduced to certain brain structures within the circuit, according to real-time outputs of the activity recognition module. The resulting spike trains from the neurorobotics model were then converted to neural firing frequencies across brain regions, which were further decoded using convolutional neural networks in order to infer the most suitable response behaviour for the robot.

The application scenario is a simulated smart home, in which an activity recognition model, presented in [Ranieri2021ActivitySensors] for the HWU-USP activities dataset [Ranieri2021HumanSensors], was employed in human-robot interaction tasks using a mobile robot. In summary, the robot needs to produce response behaviours according to the contextual information inferred about the user (i.e., the recognised activity).

The neurorobotics approach, which is the central contribution of this work, was compared to a heuristics approach, in which a deterministic behaviour selection mechanism was considered using simple heuristics that associate recognised activities to robot behaviours. This neurorobotics approach embedded two computational models, one that resembled neurophysiological data of rodents (i.e., the rat model), and one that resembled data from marmoset monkeys (i.e., the primate model).

The different factors considered for this study were evaluated according to the relative number of correct outcomes of the robot simulation. Considering the activity recognition framework, the results have confirmed that more accurate classifiers for the activity recognition module led to a greater number of robot tasks successfully completed. Although the performances of the heuristic and neurorobotics approaches varied according to the computational model embedded, the study confirmed that the most complex neurorobotics model (i.e., the marmoset-based model of the BG-T-C circuit) led to an increased performance in relation to the heuristic approaches when a more accurate activity recogniser was considered (i.e., the video-based classifier).

The remainder of this paper is organised as follows. The brain structures considered for this neurorobotics approach and the computational modelling adopted are presented in Section 2. The general aspects of the robotic system, and the integration between its modules, are presented in Section 3. In Section 4, the neurorobotics approach is detailed. In Section 5, the methods and implementations are described. The corresponding results are presented in Section 6 and discussed in Section 7. Finally, the concluding remarks and directions for future research are provided in Section 8.

2 The BG-T-C Circuit and Original Computational Models

In this section, we present the basic concepts on the brain structures present in the basal ganglia-thalamus-cortex (BG-T-C) circuit, and the original computational modelling. The BG-T-C circuit, illustrated in Figure 1, is formed by the motor cortex (M1), the thalamus (TH), and the basal ganglia (BG), the latter composed of a subset of structures: the striatum (Str), the globus pallidus, divided into pars interna (GPi) and pars externa (GPe), the subthalamic nucleus (STN), and the substantia nigra, divided into pars compacta (SNc) and pars reticulata (SNr).

A discussion of the mechanisms of this circuit, together with models that describe it, is provided in [McGregor2019CircuitDisease]. The most useful model to explain the connections within this circuit, especially those affected by Parkinson's disease (PD), is the so-called classic model, illustrated in Figure 1(a).

(a) Classic model, as described by [McGregor2019CircuitDisease].
(b) Computational model, as designed by [Kumaravelu2016ADisease] for rodent data, and adapted by [ranieri2021towardsPD] for primate data.
Figure 1: Schematic representations of the classic and computational models of the BG-T-C circuit. In the connections, excitatory synapses are shown as blue arrows, and inhibitory synapses, as red squares.

The pathways start with an excitatory connection from the cortex to the striatum, which projects its output neurons, named medium spiny neurons (MSN), to other structures inside the BG. In the direct pathway, the direct MSN (dMSN) inhibits the GPi, which reduces its inhibition of the TH. The disinhibited TH then excites the motor cortex. In the indirect pathway, the indirect MSN (iMSN) inhibits the GPe, which reduces its inhibition of the STN, which in turn excites the GPi. This results in inhibition of the TH and the absence of excitatory outputs to the motor cortex. In other words, the direct pathway excites the cortex (i.e., positive feedback loop), while the indirect pathway inhibits it (i.e., negative feedback loop).
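The net effect of each pathway can be checked by multiplying the signs of the synapses along it. The following is a toy sketch of the classic model as described above, not part of the computational models used in this work: each synapse is reduced to +1 (excitatory) or -1 (inhibitory), and the dictionary layout and function name are illustrative assumptions.

```python
# Toy sign-propagation check of the classic model described above.
# Each synapse is reduced to +1 (excitatory) or -1 (inhibitory); names and
# data layout are illustrative assumptions, not the original model code.
from math import prod

SYNAPSES = {
    ("Ctx", "dMSN"): +1,
    ("Ctx", "iMSN"): +1,
    ("dMSN", "GPi"): -1,
    ("iMSN", "GPe"): -1,
    ("GPe", "STN"): -1,
    ("STN", "GPi"): +1,
    ("GPi", "TH"): -1,
    ("TH", "Ctx"): +1,
}


def net_effect(path):
    """Multiply the synapse signs along a path of region names."""
    return prod(SYNAPSES[(src, dst)] for src, dst in zip(path, path[1:]))


direct = ["Ctx", "dMSN", "GPi", "TH", "Ctx"]
indirect = ["Ctx", "iMSN", "GPe", "STN", "GPi", "TH", "Ctx"]

direct_sign = net_effect(direct)      # two inhibitions cancel out
indirect_sign = net_effect(indirect)  # three inhibitions leave a net inhibition
```

An even number of inhibitory synapses yields a net excitation of the cortex (direct pathway), while an odd number yields a net inhibition (indirect pathway).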

In [Kumaravelu2016ADisease], a computational model of the BG-T-C circuit, originally developed to study the underlying mechanisms of Parkinson's Disease (PD), was proposed and implemented based on neural data from healthy and PD-induced (i.e., 6-OHDA lesioned) rats [Kita2011CorticalGanglia]. Eight brain structures were modelled and connected based on a simplified version of the classic model (see Figure 1(b)). In particular, the direct and indirect pathways were modelled separately, representing the MSN modulation by D1 and D2 dopamine receptors in the striatum (i.e., StrD1 and StrD2, respectively). The cortex is represented by regular spiking (RS) excitatory neurons and fast spiking (FSI) inhibitory interneurons (i.e., CtxRS and CtxFSI, respectively). A bias current was added in the TH, GPe, and GPi, accounting for the inputs not explicitly modelled. This model was designed with the ability to shift from the simulation of the healthy state to the PD state, which is done by altering certain conductances.

Although all mammals have a similar set of BG structures that are similarly connected with thalamic and cortical structures, subtle differences between species may be found, with primates being more similar to humans than rodents [Lienard2014, Koprich2017AnimalDevelopment, Dawson2018]. A data-driven approach was proposed in [ranieri2021towardsPD] to obtain a primate-based computational model of the BG-T-C circuit and the mechanisms of PD. The resulting marmoset computational model was evaluated based on the differences between healthy and PD individuals, with respect to the spectral signature of the brain activity [Tinkhauser2017BetaMedication], the dynamics of the firing rates of neurons across brain regions [VanAlbada2009], and the coherence between spike trains [Halje2019].

The implementation used in [ranieri2021towardsPD] built on a Python translation of the original computational model of [Kumaravelu2016ADisease], originally made by [Romano2020EvaluationDisease] using the NetPyNE framework and the libraries from the NEURON simulator [Dura2019]. Based on the results of the machine learning framework, a practical setup of either the rat or marmoset computational models was made available. The adaptations performed in this work to the original computational models (see Subsection 4.1) were based on the code made available by the authors. For all neurorobotics model evaluations, we considered both the rat and primate computational models, always with the healthy state set on.

3 Integrated System

The modules of the application scenario, and the interactions between them, are illustrated in Figure 2. In this scenario, the human activities are inferred by a machine learning algorithm, and the supporting behaviours are performed by a mobile robot placed in a simulated environment, composing an ambient assisted living (AAL) application [calvaresi2017exploring].

The general information flow was as follows: given the multimodal data provided by a set of sensors within a sensed environment, an activity recognition module classifies such data into a set of predefined human activities, and corresponding response behaviours are produced for a mobile robot.

The neurorobotics approach was compared to a heuristics approach. The heuristics approach (Figure 2(a)) consisted of associating the predictions of the activity recognition module with response behaviours based on simple heuristics, presented in Subsection 5.2. In the neurorobotics approach (Figure 2(b)), the predictions from the activity recognition module were employed to stimulate a bioinspired computational model, whose outputs (i.e., neural firing frequencies of simulated brain regions) were decoded by a CNN-based decoder, which provided the decisions for the mobile robot.

(a) Heuristics approach: the predictions from the activity recognition module are fed directly to the mobile robot.
(b) Neurorobotics approach: the predictions are used as stimuli for the embedded, bioinspired computational model of the BG-T-C circuit, which simulates neural activity that is further interpreted by a CNN-based decoder, responsible for deciding the behaviour to be performed by the robot. Both the bioinspired computational model and the CNN-based decoder compose the neurorobotics model presented in this research.
Figure 2: Interaction between modules for the application scenario proposed.

More specifically, the sensed environment consisted of a previously collected dataset [Ranieri2021HumanSensors], composed of a set of recording sessions, each associated with one activity out of the classes (i.e., labels) considered for this dataset. The function describing these associations is given by Equation 1.


Each data tuple comprises a segment, with a previously defined length, of a recording session. The segments start at timesteps that are equally spaced along the session. The activity recognition module is a machine learning classifier, which associates a recording session at a given timestep with an activity through a prediction vector (see Equation 2). In other words, considering that the true activity is unknown at inference time, the inference model, learned from labelled samples, provides a prediction vector in which each component is the probability that the given input corresponds to a particular activity.


The application scenario was designed so that each activity was associated with a desired response for the mobile robot. We defined a set of response behaviours, so that each human activity can be, but not necessarily is, associated with a response behaviour of the robot. Activities without an associated response are mapped to a "no action" behaviour. Hence, the function that associates recognised activities with response behaviours is given by Equation 3.


The robot simulation would be considered successfully completed if:

  • For an activity being performed in the environment in a given session, the robot completed the expected response behaviour before the session was finished; or

  • No response behaviour was expected (i.e., the activity was associated with "no action") and the robot did not complete any of the response behaviours.

It is worth noting that, according to this evaluation policy, besides an accuracy requirement (i.e., the correct behaviour must be given in response to a human activity), there was also a time constraint to be satisfied (i.e., if required, the response behaviour must be completed while the human is still performing the given activity).
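The evaluation policy above can be sketched as a small predicate. This is a minimal illustration under stated assumptions: the function name, the argument names, and the use of `None` to mark "no action" are hypothetical and do not come from the original implementation.

```python
from typing import Optional

NO_ACTION = None  # hypothetical marker for "no response behaviour expected"


def simulation_succeeded(expected: Optional[str],
                         completed: Optional[str],
                         completion_time: Optional[float],
                         session_end: float) -> bool:
    """Return True if a robot simulation satisfies the evaluation policy.

    expected: behaviour associated with the ground-truth activity, or NO_ACTION.
    completed: behaviour the robot actually completed, or None if none.
    completion_time: time at which `completed` finished (None if nothing completed).
    session_end: time at which the human activity (recording session) ended.
    """
    if expected is NO_ACTION:
        # Success only if the robot completed no behaviour at all.
        return completed is None
    if completed != expected or completion_time is None:
        # Accuracy requirement: the correct behaviour must have been completed.
        return False
    # Time constraint: the behaviour must finish before the session does.
    return completion_time < session_end
```

For instance, a correct behaviour completed at 30 s in a 45 s session counts as a success, while the same behaviour completed at 50 s does not.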

Since, by definition, the activity being performed is not known at runtime, and can only be inferred by a classifier as successive prediction vectors are provided, a decision-making mechanism was needed to make adaptive decisions based on partial, time-localised predictions. To this aim, we proposed the neurorobotics model presented in Section 4, and compared it to a simple heuristics-based approach as described in Section 5.

4 The Neurorobotics Model

The neurorobotics model embeds the bioinspired computational model and the CNN-based decoder (see Figure 2). It consists of simulating and decoding the neurophysiological mechanisms of the basal ganglia-thalamus-cortex (BG-T-C) circuit in mammals (see Section 2), responsible for abilities such as motor control, decision-making, and learning [Girard2008WhereSelection, Liang2019APrediction, Mulcahy2020BasalStates]. As already stated in Section 2, both the rat-based [Kumaravelu2016ADisease] and the marmoset-based [ranieri2021towardsPD] computational models were evaluated as a decision-making mechanism of the neurorobotics model.

4.1 Bioinspired Computational Model

Motivated by the work of [Mulcahy2020BasalStates], two key modifications were introduced to the computational models of the BG-T-C circuit adopted in this work [Kumaravelu2016ADisease, ranieri2021towardsPD]. First, an additional structure, the prefrontal cortex (PFC), was included as a variable source of excitatory stimuli towards the striatum (see Figure 3(a)). Second, populations of neurons were implemented as independent channels, each associated with exactly one response behaviour (see Figure 3(b)), as defined in Equation 4.

(a) Schematic representation of the computational model as adapted in this work. In the connections, excitatory synapses are shown as blue or green arrows, and inhibitory synapses, as red squares. The blue arrows and red squares correspond to the original synapses as designed by [Kumaravelu2016ADisease] and adapted by [ranieri2021towardsPD], while the green arrows are the adaptations provided in this work to allow the stimulation of the circuit in the context of the application scenario proposed.
(b) Predictions from the activity recognition module are interpreted as stimulation originated on the prefrontal cortex (PFC), which selectively stimulates different populations of the computational model, each associated to one response behaviour of the robot.
Figure 3: Adapted version of the computational model of the BG-T-C circuit.

At each timestep, each channel received a stimulus whose intensity was a linear combination of the components of a prediction vector, weighted by a weight function, as given by Equation 5. The actual values of the weights are given by the function defined in Equation 6.


Because the softmax activation on the classifiers ensures that the components of each prediction vector are non-negative and sum to one, each stimulus stays within a bounded interval, which has been shown to be stable and biologically plausible. For a given recording session, the set of prediction vectors is employed to periodically update each stimulus during the course of the corresponding simulation of the computational model (not to be confused with the robot simulation). For each simulation, successive updates are made for all channels, computed for each timestep of the session.
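As an illustration of how a prediction vector may be turned into per-channel stimuli, the sketch below assumes binary weights: an activity contributes its full probability to the channel of its associated behaviour and nothing to the others. This weighting, the function name, and the activity and behaviour labels are assumptions made for the example, not the weight function actually used in this work.

```python
def channel_stimuli(prediction, activity_to_behaviour, behaviours):
    """Stimulus per channel as a weighted sum of class probabilities.

    Assumed weight: 1.0 if the activity maps to the channel's behaviour,
    0.0 otherwise (a simplification chosen for this sketch).
    """
    stimuli = {}
    for behaviour in behaviours:
        stimuli[behaviour] = sum(
            p for activity, p in prediction.items()
            if activity_to_behaviour.get(activity) == behaviour
        )
    return stimuli


# Hypothetical softmax output over three activities; "sleeping" has no response.
prediction = {"cooking": 0.6, "reading": 0.3, "sleeping": 0.1}
mapping = {"cooking": "bring_A", "reading": "bring_B"}
stimuli = channel_stimuli(prediction, mapping, ["bring_A", "bring_B"])
```

Because the probabilities are non-negative and sum to one, every stimulus computed this way lies between 0 and 1.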

Once finished, a simulation produced a spike train for each of the modelled brain regions, covering its whole duration (i.e., all updates were considered). The neural firing frequencies were computed according to [Lansky2004MeanRate], with the parameters detailed in Subsection 5.3, and summed across each region of each channel, resulting in one output signal per channel and region for each simulation, all with the same length (see Figure 1b for the regions).
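A simple way to turn spike times into a firing-frequency signal is to count spikes in consecutive bins and divide by the bin duration. The sketch below shows this plain binned estimate with NumPy; it is an assumption made for illustration, and the estimator and parameters actually used in this work may differ.

```python
import numpy as np


def firing_rate(spike_times_ms, duration_ms, bin_ms):
    """Spikes per second in consecutive, non-overlapping bins."""
    edges = np.arange(0, duration_ms + bin_ms, bin_ms)
    counts, _ = np.histogram(spike_times_ms, bins=edges)
    return counts / (bin_ms / 1000.0)


# Hypothetical spike train of one neuron over one second, in milliseconds.
spikes = [10, 20, 300, 310, 320, 990]
rates = firing_rate(spikes, duration_ms=1000, bin_ms=250)

# Summing over the neurons of a region gives one signal per region;
# here the same hypothetical neuron is simply counted twice.
region_signal = np.sum([rates, rates], axis=0)
```

With 250 ms bins, the six spikes above fall into bins holding 2, 3, 0 and 1 spikes, which correspond to 8, 12, 0 and 4 Hz.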

Formally, the output of a given simulation is determined by the recording session, the classifier employed for activity recognition, and the computational model. A simulation is defined accordingly in Equation 7, and its output is a multivariate time-series with one variable per channel and region.


After the simulations were completed, the spike trains at the cortex populations were converted into temporal signals (i.e., neural firing frequencies) based on the mean firing rates across brain regions [Lansky2004MeanRate]. The resulting signals were segmented into smaller windows and used to train and evaluate a convolutional neural network (CNN), which was employed to determine the decision of the robot at each timestep of the robot simulation (i.e., the CNN-based decoder). More details on the implementation of the CNN-based decoder are presented in the next section.

4.2 CNN-Based Decoder

Each simulation of the computational model provided the summed neural firing frequencies of each channel and brain region, generating a data structure associated with the whole recording session that generated it. To provide a realistic scenario for the robot simulation, time-localised decisions were required, which had to be taken based only on past events. In other words, at any timestep of the robot simulation, only predictions obtained up to that timestep could be taken into account when providing a response behaviour to the robot.

To fulfil this requirement, each instance, corresponding to a recording session under a given classifier and computational model (see Equation 7), was segmented into windows of a fixed number of timesteps, with partial overlap. Considering all recording sessions in a given set of conditions, the total number of instances generated is defined in Equation 8.
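The segmentation into partially overlapping windows can be sketched as follows. The function name and the toy signal are illustrative assumptions; the window and step are expressed in timesteps.

```python
def sliding_windows(series, window, step):
    """Split a sequence of timesteps into overlapping windows.

    A series of length T yields (T - window) // step + 1 windows,
    each containing only contiguous timesteps.
    """
    return [series[start:start + window]
            for start in range(0, len(series) - window + 1, step)]


# Hypothetical signal with one sample per second over ten seconds.
signal = list(range(10))
segments = sliding_windows(signal, window=4, step=1)
```

Each window only contains samples up to its last timestep, so a decision based on a window never depends on future events.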


The resulting segments were employed to train a machine learning decoder (i.e., the CNN-based decoder). We considered only the cortex regions to compose the input tuples for the decoder, aiming to preserve biological plausibility regarding this aspect. Given that each channel of the computational model has two cortex regions (i.e., cortex RS and FSI), and that each channel is associated with one response behaviour, the resulting instances comprised the cortex signals of all channels (four in total, see Table 1). The decoder was trained to provide a decision vector whose components are the probabilities that a given segment of cortex firing frequencies is associated with each response behaviour. This decoding function may be defined as in Equation 9.


We have adopted a one-dimensional convolutional neural network (CNN) as decoder, an architecture that has provided state-of-the-art results in related work [Ranieri2020UnveilingNetworks] (for the architectural choices and hyperparameter settings, see Subsection 5.3). Classification metrics were computed by taking, as the categorical output, the response behaviour chosen according to Equation 10.
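A common way of turning a decision vector into a categorical output is to pick the behaviour with the highest confidence. The sketch below assumes this arg-max rule; the function name and behaviour labels are hypothetical.

```python
def decide(decision_vector, behaviours):
    """Return the behaviour with the highest confidence (assumed arg-max rule)."""
    best = max(range(len(decision_vector)), key=decision_vector.__getitem__)
    return behaviours[best]


# Hypothetical decision vector over two response behaviours.
chosen = decide([0.2, 0.8], ["bring_A", "bring_B"])
```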


Finally, the decisions decoded would be fed to the robot simulation and turned into commands, as discussed in Subsection 5.4.

5 Methods

In Figure 4, the different factors assessed in this work are illustrated. Both the heuristics and the neurorobotics approaches were evaluated with two different models of the activity recognition module: the IMU + ambient and the video-based (see Subsection 5.1). For the heuristics approach, two heuristics were considered and compared: the window and the exponential (see Subsection 5.2). For the neurorobotics approach, the rat and marmoset computational models were assessed (see Subsections 4.1 and 4.2).

Figure 4: Factors and conditions analysed for the heuristics and neurorobotics approaches. For both approaches, two models for the activity recognition module were considered: the IMU + ambient and the video-based. For the heuristics approach, two policies were analysed for the decision-making mechanism: window or exponential (see Subsection 5.2). For the neurorobotics approach, two computational models of the BG-T-C circuit were considered: the rat-based and the marmoset-based.

All code was developed in Python. The machine learning techniques presented for the activity recognition and the CNN-based decoder were implemented with the TensorFlow/Keras framework. The computational models were implemented using the NetPyNE platform [Dura2019]. The robot simulation was implemented in the Gazebo simulator [koenig2004design] with the Robot Operating System (ROS) [quigley2009ros] as a middleware. The next subsections provide the implementation details of this work.

5.1 Dataset and Classifiers

We have adopted the HWU-USP activities dataset [Ranieri2021HumanSensors], a multimodal and heterogeneous dataset of human activities recorded in the Robotic Assisted Living Testbed (RALT), at Heriot-Watt University (UK). It is composed of readings of ambient sensors (e.g., switches at wardrobes and drawers, presence detectors, power measurements), inertial units attached to the waist and to the dominant wrist of the subjects, and videos. A set of nine well-defined activities of daily living was performed by the 16 participants of the data collection. All recording sessions were provided pre-segmented and labelled. The length of the recording sessions varied from less than 25 to over 100 seconds, with high variance both between classes and between subjects.

As the activity recognition module, we have employed the framework presented and evaluated in [Ranieri2021ActivitySensors]. This was composed of different time-localised classifiers based on artificial neural networks, each focused on a particular modality (i.e., set of similar sensors) or set of modalities. We adopted two pre-trained classifiers from the framework (i.e., the IMU + ambient and the video-based classifiers) to provide the prediction vectors, respecting the between-subjects 8-fold approach for training and evaluating. Although both classifiers were described in [Ranieri2021ActivitySensors], we give a brief presentation of their architectures in the next paragraphs, for the sake of completeness.

The IMU + ambient classifier was fed with two parallel inputs: a two-second time window with the raw signals from the inertial sensors, and the mean values of the ambient sensors in the corresponding timestamps. The inertial data was processed by a one-dimensional Convolutional Neural Network (CNN) [Zeiler2014VisualizingNetworks], composed of two convolutional layers interspersed with pooling layers, followed by a Long Short-Term Memory (LSTM) recurrent layer [Hochreiter1997LongMemory], generating a first feature vector. The ambient data was processed by a single fully-connected layer, generating a second feature vector. Both feature vectors were concatenated and sent to a softmax output layer.

The video-based classifier took, as input, a sequence of optical flow pairs, corresponding to two seconds of video, computed with the TVL1 algorithm [Zach2007AFlow]. The InceptionV3 CNN architecture [Szegedy2016RethinkingVision] was trained to classify each optical flow pair individually. The CNN-LSTM architecture, adopted by the authors, consisted of feeding each optical flow pair within a sequence to this pre-trained InceptionV3 module, and feeding the resulting features as inputs to each timestep of an LSTM layer, whose outputs were connected to a softmax output layer.

Both above-mentioned classifiers were endowed with softmax activation in their outputs, which ensured that each prediction vector is a valid probability distribution. To provide the prediction vectors, we split each recording session into a fixed number of timesteps, regardless of its original length, and used the referred framework to provide the predictions on each of those timesteps. The effect is to assume that all activities have similar length, a simplification that allowed the design of more uniform and comparable experiments related to the bioinspired computational models (Subsection 4.1) and the robot simulation (Subsection 5.4).
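The two ideas in this paragraph can be sketched in a few lines: a softmax that turns raw scores into a valid probability distribution, and the placement of a fixed number of two-second windows along sessions of different lengths. The function names and the example values are assumptions made for illustration; the number of timesteps used in this work is not reproduced here.

```python
import math


def softmax(logits):
    """Map raw scores to non-negative values that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def window_starts(session_seconds, n_timesteps, window_seconds=2.0):
    """Equally spaced start times of n_timesteps windows within a session.

    Assumes the session is at least as long as one window.
    """
    last_start = session_seconds - window_seconds
    if n_timesteps == 1:
        return [0.0]
    return [i * last_start / (n_timesteps - 1) for i in range(n_timesteps)]


probs = softmax([2.0, 1.0, 0.1])
short_session = window_starts(26.0, 5)   # hypothetical 26-second session
long_session = window_starts(102.0, 5)   # hypothetical 102-second session
```

Both sessions yield the same number of prediction vectors, with the windows spaced 6 s apart in the short session and 25 s apart in the long one.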

The output of the activity recognition module is, for a whole recording session processed by a given classifier, a sequence of prediction vectors, one per timestep. Outputs from both classifiers were applied to all simulations, as described in the following subsections.

5.2 Heuristics Model Implementation

Two policies were considered for the heuristics model, named window and exponential. This experimental setup resulted in a total of four conditions for evaluation in the heuristics approach, given by the combinations of the two classifiers and the two policies.

The window policy consisted of deriving a wider prediction vector, corresponding to several timesteps. This was done by averaging the most recent prediction vectors from the activity recognition module, as in Equation 11. The number of averaged vectors was set so as to correspond to windows of four seconds from the recording sessions, because this was the length of the segments considered for the neurorobotics approach (see Subsection 4.2). On the other hand, the exponential policy consisted of deriving a prediction vector that considered the whole sequence of previous prediction vectors, with an exponential decay across iterations, as in Equation 12. For either the window or the exponential policy, the decision of the heuristics approach is then computed from the derived prediction vector, as given by Equation 13. It is important to note that, for the window policy of the heuristics approach, as in the neurorobotics approach, the robot can only begin to move after the first four seconds of each simulation, in which it is gathering the number of prediction vectors necessary to compute the first decision.
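The two policies can be sketched with NumPy as follows. The window policy is a plain average of the most recent prediction vectors, as described above. For the exponential policy, the recursive form and the decay parameter `alpha` are assumptions made for this sketch, since several exponential weightings are possible.

```python
import numpy as np


def window_policy(predictions, n):
    """Average of the n most recent prediction vectors."""
    return np.mean(predictions[-n:], axis=0)


def exponential_policy(predictions, alpha):
    """Exponentially decayed average over the whole history.

    Assumed recursive form: s_t = alpha * p_t + (1 - alpha) * s_{t-1}.
    """
    smoothed = np.asarray(predictions[0], dtype=float)
    for p in predictions[1:]:
        smoothed = alpha * np.asarray(p, dtype=float) + (1.0 - alpha) * smoothed
    return smoothed


# Hypothetical history of prediction vectors over two activities.
history = [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
window_vec = window_policy(history, n=2)
exp_vec = exponential_policy(history, alpha=0.5)
```

Both policies smooth out isolated misclassifications: the window policy forgets everything older than its window, while the exponential policy keeps a fading trace of the whole sequence.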


As a reference, we introduced an additional approach, a control condition in which the ground truth labels are directly fed to the robot simulation, providing a unique decision every timestep, as shown by Equation 14.


5.3 Neurorobotics Model Implementation

Two bioinspired computational models were considered: the rat-based and the marmoset-based. This experimental setup resulted in a total of four conditions for evaluation in the neurorobotics approach, given by the combinations of the two classifiers and the two computational models. An independent simulation of the computational model (not to be confused with the robot simulation) was run for each of the recording sessions under each condition being evaluated, so that the simulations covered all combinations of recording session, classifier, and computational model.

Each of those simulations ran for a fixed duration at a fixed sampling rate. The stimuli set was updated periodically (i.e., update frequency of 2 Hz). This led to an adaptive dynamic that would respond to successive prediction vectors according to the confidence of each response behaviour. The resulting spike trains in each neuron population were converted to neural firing frequencies (for details, see [Lansky2004MeanRate]) using bins of fixed size, and the resulting multivariate time-series had one variable per channel and brain region.

The segments for the decoder were set to timesteps (i.e., four seconds long) with superposition (i.e., a one-second step between the beginnings of consecutive segments), resulting in segments. Considering that each condition was composed of recording sessions, these simulations of the computational models led to a total of instances , for each .
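The overlapping segmentation can be sketched as a sliding window. This is only an illustration: the number of bins per second used below (`bins_per_second = 10`) is a hypothetical value, and the 70-second sequence length is taken from the Discussion section rather than from this subsection.

```python
def segment(series, length, step):
    """Split a sequence of timesteps into overlapping fixed-length segments.

    series: list of timesteps (each may itself be a vector of channels).
    length: segment length, in timesteps.
    step:   offset between the starts of consecutive segments, in timesteps.
    """
    last_start = len(series) - length
    return [series[s:s + length] for s in range(0, last_start + 1, step)]


# Hypothetical binning rate (assumption): 10 firing-rate bins per second.
bins_per_second = 10
series = list(range(70 * bins_per_second))      # one 70-second sequence
segments = segment(series,
                   length=4 * bins_per_second,  # four-second segments
                   step=1 * bins_per_second)    # one-second step
```

Under these assumptions, the number of segments per sequence depends only on the durations, not on the binning rate: (70 - 4) / 1 + 1 segments.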

The CNN architecture for decoding these time-series into response behaviours is depicted in Table 1. It was composed of two convolutional layers, with and filters, respectively, interspersed with max-pooling layers. A global average pooling operation preceded the softmax output layer, which produced the decision vector.


Layer Type Output shape Free parameters
1 Input -
2 Conv1D
3 MaxPool1D -
4 Conv1D
5 MaxPool1D -
6 Global Average Pooling -
7 Softmax -
Table 1: Layers in the CNN-based decoder. The inputs to the neural network are windows of timesteps from the four cortex channels of the output signals (i.e., neural firing frequencies) of the simulations under a given condition. The output is a decision vector with the confidences for each response behaviour.

For each set of conditions, the CNN was trained in a cross-subject 8-fold cross-validation scheme, similar to the one adopted for the activity recognition module [Ranieri2021ActivitySensors]. The input data was linearly normalised to the range , and the classification models were trained for epochs with batch size . The Adam algorithm was employed, with learning rate , to optimise the categorical cross-entropy loss function. The outputs of the evaluations were stored and organised, in order to serve as inputs to the next steps. The resulting sequences were then introduced to the decision-making mechanism.

5.4 Robot Behaviours

The behaviours consisted of transporting an object , from a starting position to a fixed destination . This task was adopted because it comprises a basic and generic functionality for a mobile robot in a home environment. The associations between the behaviours and the objects are given by Equation 15, while the associations between the objects and their starting positions in the map are given by Equation 16.


At each timestep , a decision (i.e., a response for each recording session of the activity recognition module) was sent to the robot simulation, composed of a mobile social robot in a home environment (for details on the platforms and implementations employed, see Subsection 5.5). For the neurorobotics approach, this decision is given by Equation 10, already presented in Subsection 4.2. For the heuristics approach, the two policies mentioned (i.e., window and exponential) were evaluated.

The decisions were turned into commands to the robot following a table of rules, depicted in Table 2. A decision is sent to the robot at each timestep. This decision can be one of the behaviours in or the "no action" behaviour . Let be the object being carried by the robot at a certain timestep. Two types of situations might be considered: or .

Decision Object carried Robot position Output command
Move towards
Move towards
Finish behaviour
Move towards
Move towards
Table 2: Table of rules associating a response behaviour to an output command at each timestep of the robot simulation, considering the object being carried and the current robot position.

The first type of situation is characterised by , in which the robot must return any object that it may be carrying to the corresponding position, and then stand still, waiting for any further commands. Otherwise, , the second type of situation, in which the robot is supposed to carry an object from position to a destination . If the robot is not carrying any object, that is, , then it must move to and take the object. If it is already carrying the correct object, then it must move towards the destination . If it is carrying the wrong object, that is, , then it must return it to .
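The rules described above can be written as a small decision function. This is a sketch under stated assumptions: the labels (`"b1"`, `"o1"`, `"shelf_1"`, and so on) are hypothetical names chosen for illustration, and picking up or dropping an object on arrival is not modelled.

```python
NO_ACTION = "no_action"

# Hypothetical labels standing in for the behaviours, objects and positions.
OBJECT_FOR = {"b1": "o1", "b2": "o2"}            # behaviour -> object
START_OF = {"o1": "shelf_1", "o2": "shelf_2"}    # object -> starting position
DESTINATION = "table"


def command(decision, carried, position):
    """Map (decision, object carried, robot position) to an output command."""
    if decision == NO_ACTION:
        if carried is None:
            return ("stand_still", None)
        # Return whatever is being carried to where it belongs.
        return ("move_towards", START_OF[carried])

    target = OBJECT_FOR[decision]
    if carried is None:
        # Nothing in hand: go and fetch the object for this behaviour.
        return ("move_towards", START_OF[target])
    if carried != target:
        # Wrong object in hand: put it back first.
        return ("move_towards", START_OF[carried])
    if position == DESTINATION:
        return ("finish_behaviour", None)
    return ("move_towards", DESTINATION)
```

Because the function is stateless and re-evaluated at every timestep, a change of decision in the middle of a simulation is handled by the same rules, with no dedicated transition logic.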

5.5 Robot Simulator

The simulator adopted for the robotics experiments was previously made available as part of the LARa framework [Ranieri2018LARa:Environments], which consists of a robot and a software library. The LARa robot was a mobile social robot built on top of a Pioneer P3-DX platform, endowed with a Hokuyo laser, a mini computer, a Microsoft Kinect sensor, a microphone, a screen, and a speaker. The LARa library was a set of functionalities implemented to control the robot based on high-level software interfaces, integrated within the Robot Operating System (ROS) [quigley2009ros]. Besides navigation skills and a framework for human-robot interaction, this included a platform for simulation, under conditions that resembled those of the actual robot, deployed to allow offline experiments. The Gazebo simulator [koenig2004design] was employed, and a map of a typical home environment was designed, as reproduced in Figure 5(a). The simulated robot, a simplified version of the LARa robot, is shown in Figure 5(b), while the pieces of furniture employed in the experiments are shown in Figure 5(c).

(a) Map of the whole home environment employed for the experiments.
(b) Simulated mobile robot.
(c) In a different camera angle, the section of the map in which the robot behaviours were performed, with the indications of the robot and the pieces of furniture involved in the tasks (i.e., shelf 1, shelf 2 and table).
Figure 5: Virtual environment for the robot simulation, using the Gazebo platform.

This setting comprised a realistic environment, which provided several challenging aspects resembling those of a real-world scenario, such as sensor noise, communication delays and mechanical issues. The ROS platform was employed to connect this simulated environment to a navigation stack, which provided a 2D occupancy grid in which each position (i.e., cell) might be considered empty, navigable or obstacle. This representation was generated prior to the robot simulations reported here, via the GMapping algorithm [grisetti2007improved] for Simultaneous Localisation and Mapping (SLAM). The mapping algorithm ran while the robot was teleoperated through the whole environment, with the laser readings and the wheel encoders combined to gradually compose the occupancy grid. Once the grid was created, the Adaptive Monte Carlo Localisation (AMCL) algorithm and the A* algorithm could be employed for localisation and global planning, respectively, to perform autonomous navigation. The navigation package was also endowed with a local planner, responsible for creating adaptable short-term paths for obstacle avoidance and environmental changes.

For this work, a set of two response behaviours was defined as . Table 3 shows the set of daily activities from the dataset (i.e., , ), and the expected response behaviours associated with each of those activities (i.e., ). These were chosen respecting semantic relationships between the activities (i.e., is the desired response when the user is preparing meals, and , when the user is quietly consuming or exchanging information).

Activity Description Response behaviour
making a cup of tea
making a sandwich
making a bowl of cereals
using a laptop
using a phone
reading a newspaper
setting the table
cleaning the dishes
tidying the kitchen
Table 3: List of activities provided by the HWU-USP activities dataset, and expected response behaviours in the application scenario.

These behaviours were based on the assumption that the user is located in the kitchen, and that the human activities are being monitored by sensors that are not affected by the robot actions. The starting position for only the first robot simulation in a battery of experiments is given in Figure 5. However, it had a negligible effect on the overall results, since this position was not reset for each simulation, as we discuss later in this subsection.

As shown in Figure 5(c), three pieces of furniture are considered. These are shelf 1, associated to the robot position , ; shelf 2, associated to the robot position , ; and table, the destination, associated to the robot position . The two specific behaviours considered for the experiments performed, and , consist, respectively, of transporting object from (i.e., shelf 1) to (i.e., the table), and transporting object from (i.e., shelf 2) to (i.e., the table). Since shelf 2 is closer to the table than shelf 1, the distances required for are larger than those for . As a consequence, it was expected that, on average, required more time to be completed than .

The maximum robot simulation time was set to seconds, with each timestep corresponding to one second in the simulation. Consequently, an expected response behaviour had to be finished within seconds to be considered successfully completed. We configured , which, in exploratory experiments, was shown to give a reasonable margin for the robot simulations.

A total of robot simulations was performed for each condition analysed. The first simulation for each approach began with the robot positioned as in Figure 5(c). All subsequent simulations began without resetting the robot position after the end of the previous one, with only the object flag, corresponding to the object being carried by the robot, being cleared. In this scenario, each simulation could be started with the robot in any of the positions in , or in locations belonging to the path between them.

6 Results

Concerning the activity recognition module, its classification results are presented in [Ranieri2021ActivitySensors]. The overall accuracy registered for the classifiers was computed by taking the set of prediction vectors obtained for a recording session and averaging it. A categorical classification was provided by returning the element in the averaged vector. A cross-validation approach, following the same cross-subject partitioning adopted for evaluating the CNN-based decoder in this work, was performed. The accuracy reported for the modalities considered for the experiments reported here was for , and for .

The other modules in this work relied either on important adaptations to frameworks previously implemented in related work, as was the case for the computational models and the robot simulation, or on components developed from scratch, as was the case for the CNN-based decoder. The corresponding results are shown in the following subsections. The classification metrics from the neural firing frequencies synthesised with the bioinspired computational models are presented in Subsection 6.1. The outcomes of the robot simulations, in all conditions analysed, are presented in Subsection 6.2.

6.1 Simulated Neural Firing Frequencies

A sample of the segments , provided in the simulations of the computational models, is shown in Figure 6. This was generated from a rat model, being stimulated according to an IMU + ambient classifier as the activity recognition module. A larger stimulus introduced to the striatum is expected to increase neural firing rates in the BG-T-C circuit, which might be propagated to the cortex.

Figure 6: Sample output from the bioinspired computational model of the BG-T-C circuit. For the motor cortex of each channel, RS and FSI, we considered the overall mean firing rates computed with time bins of size milliseconds, and evaluated on two-seconds-long time windows. This data was used as input for the CNN-based decoder, in the next step of the bioinspired pipeline.

The overall accuracy and F1-score of the decoder, trained and evaluated according to the 8-fold cross-subject approach described, are shown in the bar plot of Figure 7. The classifier used in the activity recognition module and the computational model employed are shown side-by-side.

Figure 7: Accuracy and F1-score for the CNN-based decoder in classifying an MFR signal into a set of three possible decisions: B1, B2 or "no action". In choosing the models for evaluation, two factors were analysed: the modalities and models employed for activity recognition (IMU + ambient sensors or video-based) [Ranieri2021ActivitySensors], and the computational model considered (rat-based or marmoset-based) [Kumaravelu2016ADisease, ranieri2021towardsPD].

The decoder was applied as a part of the decision-making mechanism, responsible for providing decision vectors for the robot simulation. Hence, its results might be correlated to the correct outcomes of the decisions made during the robot simulation. In other words, a good accuracy of the decoder might result in more correct decisions of the robot, which may more often complete the tasks with the correct outcome. The next subsection will present the experiments performed to validate this statement. These are the outcomes of the robot simulation not only for each of those conditions, but also for each policy employed for the heuristics approach.

6.2 Outcomes of the Robot Simulations

As mentioned before, three possible outcomes were considered for the robot simulations, with being the activity associated to a recording session :

  • Correct, if and the activity was completed before the end of the simulation, or if and no behaviour was completed;

  • Incorrect, if the robot completed a behaviour different from , i.e., ;

  • Unfinished, if a response behaviour was expected from the robot, but no behaviour was completed before the end of the simulation.
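The three outcomes listed above can be expressed as a short classification function. This is a minimal sketch: the argument names are hypothetical, and `None` is used here to stand both for the "no action" expected response and for a simulation in which no behaviour was completed.

```python
def outcome(expected, completed):
    """Classify one robot simulation.

    expected:  behaviour associated with the activity of the recording
               session, or None when "no action" is the expected response.
    completed: behaviour the robot finished before the simulation ended,
               or None when no behaviour was completed.
    """
    if completed is None:
        # Nothing was completed: fine only if nothing was expected.
        return "correct" if expected is None else "unfinished"
    # Something was completed: it must match what was expected.
    return "correct" if completed == expected else "incorrect"
```

Note that, under this reading, completing any behaviour when "no action" was expected counts as incorrect, since the completed behaviour differs from the expected response.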

In Subsection 5.5, a control condition was introduced, with ground truth decisions being sent to the robot. For this approach, as expected, all robot simulations led to the correct outcome. In Figure 8(a), the outcomes for the heuristics approach are presented, with each of the policies analysed (i.e., window and exponential) being represented in different plots, each illustrating the outcomes for each classifier considered for the activity recognition module. The outcomes for the neurorobotics approach are shown in Figure 8(b), with the classifiers for activity recognition (IMU + ambient or video) and the computational models (rat or marmoset) being represented.

(a) Outcomes for the robot simulations performed with the heuristics approach. Four batteries of simulations were performed, considering two factors: the classifiers employed for activity recognition (IMU + ambient sensors and video-based) and the policy for the decision-making mechanism (window or exponential) (see Figure 4a).
(b) Outcomes for the robot simulations performed with the neurorobotics approach. Four batteries of simulations were performed, related to two factors analysed: the modalities and models employed for activity recognition (IMU + ambient sensors or video-based) [Ranieri2021ActivitySensors], and the computational model considered (rat-based or marmoset-based) [ranieri2021towardsPD] (see Figure 4b).
Figure 8: Outcomes for the robot simulations. Three possible outcomes were considered: the robot completed the expected (correct) behaviour; the robot concluded the incorrect behaviour; no behaviour was completed (unfinished), although an action was required from the robot.

The times elapsed for providing the correct outcome, when a response behaviour was expected from the robot, were also recorded. The mean and standard deviations, within all simulations performed for each condition, are represented in Figure 9. Two separate plots were provided, separating the classifiers employed for the activity recognition module. The ground truth approach was reproduced in both of them, since it does not depend on prediction vectors, but on the ground truth activities.

This metric considers only the outcomes completed successfully. An approach that provides a fast response with poor accuracy would yield a short response time, though it would not necessarily provide the correct response behaviours very often. Hence, the fact that the heuristics approach with the window policy led to a faster average response than the ground-truth condition is consistent. Since incorrect and unfinished outcomes were not considered for the computation of this mean value, this result only shows that, for this model, the correct outcomes were mostly associated to activities that could be completed in less time (e.g., the behaviour ).
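The selection effect described above can be reproduced with a toy computation. The numbers below are invented purely for illustration and are not results from the experiments; the function name and the data layout are likewise hypothetical.

```python
from statistics import mean


def mean_correct_time(runs):
    """Mean elapsed time over correct outcomes only.

    runs: list of (outcome, elapsed_seconds) pairs, restricted to the
          simulations in which a response behaviour was expected.
    Returns None when no simulation ended with the correct outcome.
    """
    times = [t for result, t in runs if result == "correct"]
    return mean(times) if times else None


# Illustrative values only (not measured): a short task and a long task.
ground_truth = [("correct", 40), ("correct", 60)]
# A less accurate condition that only ever completes the short task.
window_like = [("correct", 38), ("unfinished", None)]

faster = mean_correct_time(window_like) < mean_correct_time(ground_truth)
```

In this toy case the less accurate condition reports the lower mean time, because the long task never enters its average.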

Figure 9: Average times elapsed across the 144 sequences on each simulation in which the correct behaviour was performed. Incorrect and unfinished outcomes, as well as correct outcomes when no action was required from the robot, were not considered in this evaluation. All simulated models, based on either heuristics or neurorobotics, are shown.

7 Discussion

The results from the CNN-based decoder, shown in Figure 7, confirmed some expectations regarding the output signals produced by the simulations of the computational models according to the stimuli provided: it performed better for the video-based classifier than for the IMU + ambient, and for the marmoset-based model, compared to the rat-based. The accuracy and F1-score metrics were very close, which, considering a strictly balanced dataset, indicates that the results were not affected by any serious issues regarding the trade-off between precision and recall.

All evaluations led to an accuracy measure of over for three classes (i.e., response behaviours , or ). It is important to consider that the stimuli came from noisy prediction vectors from activity recognition algorithms, whose accuracy is variable across successive segments [Ranieri2021ActivitySensors], with overall accuracy values of , for , and , for . These results show that the neural activity provided by the computational models could be reliably interpreted by the proposed decoder, even considering segments of limited length (i.e., four-second segments within a 70-second sequence). Hence, this particular technique for brain signals, analysed in previous studies for processing related neuronal data of the BG-T-C circuit [Oh2018ASignals, Ranieri2020UnveilingNetworks], proved suitable for the decision-making approach proposed.

Since the accuracy measure of the classifier for activity recognition was significantly higher for the video-based classifier than for the IMU + ambient, it was expected that it could be more easily decoded by the neural network, which was confirmed by the decoder results (see Figure 7). Also, the marmoset-based model led to better decoding performances than the rat-based model, which also meets the expectations, considering a more sophisticated morphology and dynamics in the underlying brain structures in primates than in rodents [Lienard2014].

Regarding the robot simulations, heuristics approaches were evaluated in parallel to the neurorobotics approaches. In most experiments performed in this work, better performances were found for the models fed by the video-based classifier than those fed by the IMU + ambient classifier, which was expected, since the video classifier is considerably more accurate [Ranieri2021ActivitySensors]. As shown in Figure 8(a), the window policy led to a lower number of successfully completed response behaviours, especially when fed with prediction vectors coming from the IMU + ambient classifier (less accurate). This condition may be the fairest comparison to the neurorobotics approach since it limits its decisions to data from the four-second segment that precedes a given decision, the same constraint applied to the CNN-based decoder.

In this context, the neurorobotics approach provided more accurate outcomes in most conditions, especially for the marmoset model. For the IMU + ambient modality of the activity recogniser, the window policy of the heuristics approach led to of correct outcomes, which was surpassed by the result for either the rat or marmoset models. For the video modality, the window policy of the heuristics approach led to of correct outcomes, which was only slightly above the rat model, which hit , and considerably below the marmoset model, which hit . These results indicate that the proposed neurorobotics approach, in the conditions analysed in this study, may lead to better outcomes than a simple heuristic for a real-time task of an autonomous robot.

For the exponential policy of the heuristics approach, a particularity was found: it led to similar results for both the video and IMU + ambient conditions (i.e., and of correct outcomes, respectively), both with more correct outcomes than those of the window policy. This result is relevant, since it reveals that, by performing a long-term aggregation of prediction vectors obtained subsequently from a single recording session, it may be possible to compensate for the lower accuracy values provided by certain classifiers that work with different sets of sensors. This possibility might be considered in practical applications, in which more informative modalities that usually lead to high accuracy, such as videos, may be either difficult to obtain, due to privacy concerns [FernandesJunior2016DetectionHomes], or unable to provide real-time outputs, due to the high computational cost inherent to the operations required for processing them [Rodriguez-Moreno2019VideoState-of-the-Art].

Regarding the different conditions considered for the neurorobotics approach (i.e., the activity recogniser and the computational model), the expectation was that, when applied to the robot simulation, the number of correct outcomes would be comparatively proportional to the accuracy measures of the decoder (see Figure 7). As shown in Figure 8(b), this expectation was met for most conditions, although some exceptions were found.

Better results for the marmoset model were expected, since the number of neurons and the connectivity are larger [Prescott2006AProcessing, Koprich2017AnimalDevelopment]. The results of the decoder, previously discussed, corroborate this hypothesis. For the robot simulations, considering the video modality, the marmoset model led to the best results found among all of the simulations, with of correct outcomes, against achieved by the rat model. However, for the IMU + ambient modality, the results were similar for both models. A possible explanation for this result is that such an increased capacity could compensate for the mistakes of a more accurate activity recogniser. In other words, the prediction vectors across successive segments could assign higher confidence values (i.e., probabilities) to the expected label (i.e., the ground-truth activity) for the video-based classifier than for the IMU + ambient, and the marmoset-based model, more sophisticated, was able to take more advantage of it than the rat-based.

By measuring the time elapsed in the robot simulations with correct outcomes, we can see only modest variations across conditions. An important observation regarding this metric is that a fast response is not necessarily an indication of a good performance, since this result is affected not only by the assertiveness of the correct outcomes (i.e., few changes of decision within a simulation), but also by the accuracy of the simulations in a given set of conditions. For instance, a given condition may lead to a fast response when it provides the correct outcome, but most simulations may lead to an incorrect or unfinished outcome.

For the neurorobotics approach, the times were approximately similar between both classifiers, except for the marmoset model, which took significantly longer to finish, on average, when fed with the video-based classifier. Considering the heuristics approach, the video-based classifier led to clearly longer times for completing the behaviours, which was probably because some of the changes in decisions (i.e., the robot is performing behaviour , but the decision-making mechanism changes it to after receiving new, updated prediction vectors) within the simulations allowed for completing more simulations with the correct outcome. The same reason explains why the correct outcomes of the window policy for the IMU + ambient classifier led to a faster response, on average, than the ground-truth value.

8 Conclusions and Future Work

In this paper, we employed a neurorobotics approach based on the embodiment of validated computational models of brain structures for creating a decision-making mechanism to provide effective response behaviours to a mobile robot in a simulated environment.

The chosen application scenario was a simulated smart home where data from the sensed environment was processed with a previously designed activity recognition framework. The neurorobotics approach was compared to some heuristics. For this, two simple heuristics were proposed and evaluated to provide real-time decisions based on the outputs from an activity recognition classifier.

The neurorobotics model used computational models (CM) of the basal ganglia-thalamus-cortex (BG-T-C) circuit, originally designed to study the underlying mechanisms of Parkinson’s Disease. The CM were adapted, so that the outputs from the activity recognition module were applied as stimuli to the striatum of the circuit, and spike activity at the cortex was decoded with a convolutional neural network (CNN) to provide decisions to the robot simulation. Different conditions were analysed, including whether the computational models were based on rodent or primate models.

Results were reported with respect to the accuracy obtained for the CNN-based decoder in each condition for the computational model, and to the outcomes of the robot simulations, considering the neurorobotics and the heuristics approaches. The expectations were met for most of the different conditions regarding the neurorobotics approaches. The primate-based computational model led to the best outcomes among the simulations analysed.

Hence, one can conclude that the proposed neurorobotics approach is promising not only as an embedded tool for understanding the neurophysiological aspects of animal behaviour, but also as a practical component to integrate decision-making mechanisms for action selection in mobile robots engaged in human-robot-interaction scenarios.

Future work may consist of providing a real-time simulation of the proposed application scenario, with a robot placed in a physical environment in which human participants may be performing activities. This would require the integration among the different modules shown in the pipelines presented, thus ensuring that all of them can work in real-time. Such an experiment may validate our approach in even more challenging conditions and scenarios, which may foster a wide range of applications.


This work was funded by the Sao Paulo Research Foundation (FAPESP), grants 2017/02377-5, 2017/01687-0 and 2018/25902-0, and the Neuro4PD project - Royal Society and Newton Fund (NAF\R2\180773). Moioli acknowledges the support from the Brazilian institutions: INCT INCEMAQ of the CNPq/MCTI, FAPERN, CAPES, FINEP, and MEC. This research was carried out using the computational resources from the CeMEAI funded by FAPESP, grant 2013/07375-0. Additional resources were provided by the Robotics Lab within the ECR, and by the Nvidia Grants program.