Autonomous Identification and Goal-Directed Invocation of Event-Predictive Behavioral Primitives

02/26/2019 · Christian Gumbsch, et al. · Max Planck Society, Universität Tübingen

Voluntary behavior of humans appears to be composed of small, elementary building blocks or behavioral primitives. While this modular organization seems crucial for the learning of complex motor skills and the flexible adaptation of behavior to new circumstances, the problem of learning meaningful, compositional abstractions from sensorimotor experiences remains an open challenge. Here, we introduce a computational learning architecture, termed surprise-based behavioral modularization into event-predictive structures (SUBMODES), that explores behavior and identifies the underlying behavioral units completely from scratch. The SUBMODES architecture bootstraps sensorimotor exploration using a self-organizing neural controller. While exploring the behavioral capabilities of its own body, the system learns modular structures that predict the sensorimotor dynamics and generate the associated behavior. In line with recent theories of event perception, the system uses unexpected prediction error signals, i.e., surprise, to detect transitions between successive behavioral primitives. We show that, when applied to two robotic systems with completely different body kinematics, the system manages to learn a variety of complex and realistic behavioral primitives. Moreover, after initial self-exploration the system can use its learned predictive models progressively more effectively for invoking model-predictive planning and goal-directed control in different tasks and environments.


1 Introduction

Opening the fridge, grasping the milk and drinking from the bottle – behavioral sequences, composed of multiple, smaller units of behavior, are ubiquitous in our minds [1, 2, 3]. More generally speaking, we humans seem to organize our behavior and the accompanying perception into small, compositional structures in a highly systematic manner [4]. These structures are often referred to as building blocks of behavior or behavioral primitives and can be viewed as elementary units of behavior above the level of single motor commands [5].

A major challenge for the brain as well as for artificial cognitive systems lies in effectively segmenting the continuous perceptual stream of sensorimotor information into such behavioral primitives. When does a particular behavior commence? When does it end? How are individual behavioral primitives encoded compactly? In most cognitive-systems approaches so far, behavioral primitives are segmented by hand, pre-programmed into the system, or learned by demonstration [6, 7, 8, 9]. In all cases, though, the primitives are made explicit to the system, that is, the learning system does not need to identify the primitives autonomously. Our brain, however, seems to identify such primitives on its own, starting with bodily self-exploration.

Here, we introduce a computational architecture, termed SUrprise-based Behavioral MODularization into Event-predictive Structures (SUBMODES), that learns behavioral primitives as well as behavioral transitions completely from scratch. The SUBMODES architecture learns such primitives by exploring the behavioral repertoire of an embodied agent. In this study we demonstrate its effectiveness by examining its performance on complex simulated robots that are acting in physics-based environments. Initial exploration is realized by a closed loop control scheme that adapts quickly to the sensorimotor feedback. In particular, we use differential extrinsic plasticity (DEP) [10], which causes the agent to explore body-motor-environment interaction dynamics. DEP essentially fosters the exploration of coordinated, rhythmical sensorimotor patterns, including a tendency to ‘zoom’ into particular dynamic attractors, stay and explore them for a while, and eventually leave one attractor in favor of another one – particularly in the case of a disruption such as when hitting a wall. Starting with this self-exploration mechanism, the algorithm learns internal models that are trained to predict the motor commands and the resulting sensory consequences of the currently performed behavior.

The SUBMODES system uses an unexpected increase in prediction error to detect the transition from one behavioral primitive to another. If such a ‘surprising’ error signal is perceived, the internal predictive model either switches to a previously learned model or a new model is generated if the behavior was never experienced before. In this way, the agent systematically structures its perceived continuous stream of sensorimotor information on-line into modular, compositional models of behavioral primitives as well as predictive event-transition models. We show that a large variety of realistic behavioral primitives can be learned from scratch even in robotic systems that have many degrees of freedom and interact with complex, noisy environments. Moreover, we show that after initial self-exploration the agent can use its learned predictive models progressively more effectively for invoking goal-directed planning and control. In effect, the system learns predictive behavioral-primitive and event-transition models to invoke hierarchical, model-predictive planning [11, 12], anticipating the sensory consequences of the available behaviors and choosing those behavioral primitives that are believed to bring the system closer to a desired goal state.

2 System Motivation and Related Work

The problem of abstracting our sensorimotor experiences into conceptual, compositionally meaningful, re-combinable units of thought is a long-standing challenge in cognitive science, including cognitive linguistics, cognitive robotics, and neuroscience-inspired models [11, 1, 3, 13, 14, 15, 16]. One important type of such units concerns concrete behavioral interactions with the environment, regardless of whether they lead to transitive motions of the body or of other objects. Depending on the level of abstraction and the field of research, different synonyms can be found in the literature [2], such as ‘behavioral primitives’ [5], ‘movement primitives’ [6], ‘motor primitives’ [7], ‘motor schemas’ [17], or ‘movemes’ [18]. It has been suggested that our ability to serially combine these compositional elements is crucial for quickly learning complex motor skills and for flexibly adjusting our behavior to new tasks [6]. Furthermore, the assumption that there exists a limited repertoire of behaviors has been proposed as a way to deal with the curse of dimensionality and redundancy at different levels of the motor hierarchy, moving from simple behavioral primitives towards an ontology of more sophisticated interaction complexes [2, 19, 8, 9].

Although the acquisition and application of behavioral primitives has been extensively studied in cognitive robotics and related fields, it is still not clear how we discover, encode, and ultimately use these behavioral primitives for the effective invocation of goal-directed behavioral control.

2.1 From Sensorimotor Signals to Goal-Directed Control

According to the Ideo-Motor Principle [20, 21, 22], encodings of behavior are closely linked to their sensory effects. The main idea is that initially purely reflex-like actions are paired with the sensory effects they cause. At a later point in time, when the previously learned effects become desirable, the behavior can be applied again [22, 1]. While the Ideo-Motor Principle was heavily criticized and ridiculed at the beginning of the 20th century and in the era of Behaviorism, it has seen a revival over the last decades in various fields of cognitive science, as, for example, manifested in the propositions of the Anticipatory Behavioral Control (ABC) theory [21] as well as the Theory of Event Coding (TEC) [23].

TEC suggests that perceptual information and action plans are encoded in a common representation. According to TEC, actions and their consequent perceptual effects are encoded in a common predictive network, which allows the anticipation of perceptual action consequences and, inversely, the goal-directed invocation of the associated motor commands. TEC implies that behavior is primarily learned with respect to the effects it produces. The ABC theory focuses even more on the learning of sensorimotor structures. According to ABC, the critical conditions for the application of an action-effect encoding are learned by focusing on (unexpected) perceptual changes, which lead to a further differentiation of conditional structures [24]. For example, it can be learned that an object first needs to be in reach before we are able to grasp it [1, 25]. In sum, both theories emphasize that our brain encodes behavior with respect to the effects it entails, and it does so because the resulting structures enable the selective and highly flexible activation of action-effect complexes depending on the current context and desired goal states.

Along similar lines, learning to control behavior was studied from a more neuroscience-motivated modeling perspective. Wolpert and Kawato have proposed that our brain may learn modular forward-inverse model pairs to acquire progressively more complex motor skills [26]. The proposition was implemented later on in the MOSAIC system [27]. The MOSAIC system learns sets of discrete, internal models, each consisting of a forward model, which predicts the sensory consequence of an action, and a paired inverse model, which generates the required motor commands. Similar to other approaches of modular learning, such as the mixture of experts model [28], the MOSAIC system uses the sensory prediction error of the forward models to gate the learning signal and differentiate the internal models.

In the cognitive robotics literature, the learning and task-dependent optimization of movement primitives has been investigated in an actor-critic framework [6]. It has been shown that complex movement primitives in realistic settings, such as ‘hitting a baseball with a bat’, can achieve nearly optimal performance when policy-gradient-based optimization is applied [29]. Various alternative approaches have been investigated and contrasted [30, 7, 31, 8, 32]. In all cases, the beginning and end of a movement primitive are predefined and not autonomously discovered by the system itself. Furthermore, the systems initially do not learn via self-exploration but typically from demonstrations.

2.2 Towards Hierarchical Structures

While the outlined theories give an account of how behavior can be encoded, they do not explain how the continuous stream of sensorimotor information may be structured systematically to infer the underlying behavioral primitives. Event segmentation theory (EST) [33] offers a concrete formulation of how our brain might segment the perceptual stream into discrete representations. According to EST, humans perceive activity in terms of discrete conceptual events. An event is defined as “a segment of time at a given location that is conceived by an observer to have a beginning and an end” [34, p. 3]. This definition of an event is rather general, encompassing both short sensorimotor events, such as ‘grasping a mug’, and potentially long segments with multiple agents and ongoing activities, e.g., a concert. When considering the learning of behavioral primitives, we can focus solely on the sensorimotor level of events.

According to EST, our perceptual process is guided by a set of internal models, which continuously predict what is perceived next. A specific set of event models is active over the course of one event, i.e., until a transient increase in prediction error occurs. Such a transient error signal may result in a change of the currently active internal models. EST further suggests that such a prediction error-based segmentation mechanism might operate at different levels of abstraction, resulting in a hierarchical, taxonomic organization of events [33, 34]. Hence, according to EST, a cognitively plausible way to conceptualize the continuous sensorimotor stream into compositional behavioral models is based on transient error signals of internal predictive models – essentially a more concrete formalism that dovetails with the ABC theory.

Similar prediction error-based segmentation mechanisms have been studied in various computational models: predicting movements in video sequences of actors performing everyday motions, paired with the dedicated processing of transient prediction error signals, led to the discovery and encoding of simple movement primitives in a recurrent neural network [35]. Similarly, learning predictive models and using an unexpected increase in prediction error has been used to learn forward models of different object interaction events in simple, physics-based simulation environments [36, 37]. When considering different thresholds for the error-based segmentation, event-predictive forward models of different levels of abstraction emerged, resulting in a taxonomic organization of events, highly suitable for hierarchical, goal-directed planning [37]. In both systems, the prediction error-based detection mechanism works on-line. The basic principle can be closely related to a surprise-based perceptual processing mechanism, which has been shown to segment a hierarchically structured environment (four-rooms problem) into its sub-components (individual rooms) even in the case of very high noise [38]. Mechanisms that focus on more graph-based algorithms to detect transitions have been proposed as well [39, 40, 41].

Discovering behavioral primitives and applying them for high-level goal-directed control is closely related to hierarchical reinforcement learning and the options framework [11, 42, 43]. An option is defined as a “generalization of primitive actions to include temporally extended courses of action” ([43], p. 186). In the right setting, i.e., an embodied, robotic agent whose elementary actions correspond to single motor commands, an option can resemble either a behavioral primitive or a series of behavioral primitives, e.g., ‘grasping an object’. In the options framework a particular option is typically defined with respect to a specific subgoal state. For example, the ‘grasping an object’ option might terminate when the object is held by the hand of the agent. An option can then be trained by comparing the outcome of performing the option with the desired subgoal to determine a pseudo-reward and updating the internal structures reward-dependently [42]. While recent implementations of hierarchical deep reinforcement learning have shown remarkable performance in rather challenging video gaming tasks [44], effective, self-motivated subgoal identification remains an open challenge.

Figure 1: Illustration of the SUBMODES architecture during the learning of behavior. An explorative controller generates motor commands based on the current proprioceptive input to explore self-organizing behavior. One of multiple internal behavioral models attempts to predict the motor commands and sensory consequences of the ongoing behavior. The predicted sensorimotor state is compared to the actual state to compute the prediction error and update the active behavioral model. For each behavioral model an error model is trained, estimating the prediction confidence. If surprise is detected, i.e., a strong error signal outside the usual prediction confidence, the system is allowed to exchange the active behavioral model. For each transition between two different behavioral models a transition model is learned. During goal-directed control, the explorative controller is deactivated and the active behavioral model determines the next action (dashed line).

3 Overview of the SUBMODES architecture

We propose a computational architecture, termed SUrprise-based Behavioral MODularization into Event-predictive Structures (SUBMODES), to discover behavioral primitives and learn event-predictive models of the corresponding behavior for an embodied agent completely from scratch. The SUBMODES architecture uses different modular components to explore and learn behavioral primitives and to detect transitions in behavior, as illustrated in Fig. 1. In this section we give an overview of the system; further algorithmic details are provided in Appendices A–D.

The SUBMODES architecture is composed of different modular components, responsible for exploring behavior, learning models for different behavioral modes, and detecting and encoding transitions in behavior. The different behavioral primitives learned by the system are encoded in the behavioral models of our learning architecture. These models receive sensorimotor perceptions about the agent as an input and produce a predicted sensorimotor state, anticipating future sensorimotor perceptions and actions. We assume that the system switches between its behavioral modes in a predictable fashion, whereby the occurrence of such transitions is detected by error models. Upon detecting a transition, transition models are trained to encode the critical conditions that enable such a change in behavior and the sensory consequences thereof. Initially, behavioral exploration is bootstrapped by an explorative controller and the behavioral models are trained on the perceived sensorimotor experiences. At a later phase, the explorative controller is deactivated and the system can use its learned representations of behavior for anticipatory goal-directed control.

The SUBMODES system learns behavioral primitives based on the experienced sensorimotor time series. We bootstrap this learning process by invoking motor commands via a neural network controller that is updated using differential extrinsic plasticity (DEP) [10]. At every discrete time step $t$ the controller transforms proprioceptive sensor values $x_t$ into motor commands $y_t$. Here, we use a one-layered feed-forward neural network,

$y^i_t = \tanh\big( \sum_j c_{ij} \, x^j_t + h_i \big), \qquad (1)$

for a motor neuron $i$, with $c_{ij}$ the weight connecting input $x^j$ with output neuron $i$ and a bias term $h_i$.

With fixed weights the controller would continuously generate motor commands corresponding to one particular behavioral pattern. However, the network weights are constantly changed by applying the DEP learning rule. This learning rule essentially updates the weights based on correlations of sensor velocities over some time lag $\delta t$, i.e.,

$\Delta c_{ij} \propto M(\dot{x}_t)_i \; \dot{x}^j_{t - \delta t}, \qquad (2)$

with an inverse model $M$ describing the relationship between motor actions and proprioceptive sensor values (details in Appendix A). Besides the weight updates, changes in behavior can also arise from a bias dynamics, which after some time of inactivation shifts the bias value $h_i$ of the most inactive motor neurons.
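To make the interplay of Eqs. (1) and (2) concrete, the following Python sketch implements a DEP-like explorative controller under simplifying assumptions: a fixed random matrix stands in for the inverse model M, the update uses a plain learning rate, and the normalization is crude. The actual rule and its parameters are specified in [10] and Appendix A.

```python
import numpy as np

class DEPController:
    """Minimal sketch of a DEP-like explorative controller (simplified, see lead-in)."""

    def __init__(self, n_sensors, n_motors, lr=0.01, delay=5):
        self.C = np.zeros((n_motors, n_sensors))              # controller weights c_ij
        self.h = np.zeros(n_motors)                           # bias terms h_i
        self.M = 0.1 * np.random.randn(n_motors, n_sensors)   # stand-in inverse model
        self.lr, self.delay = lr, delay
        self.x_hist = []                                      # sensor history for velocities

    def step(self, x):
        """Compute motor commands y_t from sensor values x_t, Eq. (1)."""
        self.x_hist.append(np.asarray(x, dtype=float))
        return np.tanh(self.C @ self.x_hist[-1] + self.h)

    def update(self):
        """DEP-like weight update from correlated sensor velocities, Eq. (2)."""
        if len(self.x_hist) < self.delay + 2:
            return
        x_dot_now = self.x_hist[-1] - self.x_hist[-2]                        # current velocity
        x_dot_past = self.x_hist[-self.delay - 1] - self.x_hist[-self.delay - 2]
        y_dot = self.M @ x_dot_now                        # velocity mapped through inverse model
        self.C += self.lr * np.outer(y_dot, x_dot_past)   # correlate with delayed velocity
        self.C /= max(1.0, np.linalg.norm(self.C))        # crude normalization for stability
```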

When applying the explorative controller with the DEP learning rule to an embodied agent, the controller typically discovers different dynamic sensorimotor attractors, which correspond to behavioral dynamics that unfold relatively uniformly over time. These behavioral dynamics can be seen as behavioral primitives, since they typically correspond to simple elementary actions like ‘crawling’, ‘shaking hands’ or ‘wiping a table’ [45]. However, upon strong perturbations of the internal dynamics the controller might leave one sensorimotor attractor and some time later discover a new one, resulting in a change in behavior. Such perturbations can be caused by a sudden change in the interaction of the agent with its environment, e.g., by hitting an obstacle, or by changes within the sensorimotor loop, e.g., the activation of a bias neuron. This property makes the DEP-controller an ideal candidate for behavioral exploration of a complex, embodied agent.

The SUBMODES architecture encodes the explored behavioral primitives through a set of modular, predictive behavioral models. One behavioral model attempts to encode one particular behavioral primitive previously demonstrated by the explorative controller. Each model is a single-layered neural network (no hidden layer) receiving the current sensory state as input and predicting the next motor command and the sensory consequence of this particular action. At any point in time only one model is active. The sensorimotor predictions produced by the active model are compared to the perceived change in sensory values and the executed motor command, and the prediction error is computed as the deviation between prediction and sensation. The error signal is then used to update the active model using delta-rule-based gradient descent.
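A behavioral model of this kind could be sketched as follows; the exact input/output encoding, learning rate, and error measure are illustrative assumptions rather than the paper's specification.

```python
import numpy as np

class BehavioralModel:
    """Sketch of one behavioral primitive model: a single-layer linear network
    mapping the sensory state to the next motor command and sensory change."""

    def __init__(self, n_sensors, n_motors, lr=0.05):
        self.n_motors = n_motors
        self.W = np.zeros((n_motors + n_sensors, n_sensors + 1))  # +1: bias column
        self.lr = lr

    def predict(self, x):
        """Predict (motor command y_t, sensory change delta_x_t) from state x_t."""
        out = self.W @ np.append(x, 1.0)
        return out[:self.n_motors], out[self.n_motors:]

    def update(self, x, y_true, dx_true):
        """Delta-rule update towards the observed motor command and sensory change."""
        x1 = np.append(x, 1.0)
        error = np.concatenate([y_true, dx_true]) - self.W @ x1
        self.W += self.lr * np.outer(error, x1)
        return np.linalg.norm(error)      # scalar prediction error for the error model
```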

To maintain minimal statistics about the accuracy of the sensory predictions, the system contains a set of error models. For each behavioral model an error distribution is learned, which we currently estimate by means of a normal distribution. Each error model maintains a moving average and variance of the sensory prediction error, thus estimating the first two moments of the prediction error for each behavioral model.

During the ongoing execution of one particular behavioral primitive, one behavioral model is active, which predicts the unfolding sensorimotor consequences and which is improved by gradient-descent-based learning. The confidence of the model in predicting changes in sensory values is estimated by a Gaussian distribution.

We assume that changes in behavior result in a strong, unexpected increase in the sensory prediction error of the currently predicting model. The system detects such a surprise at time step $t$ if

$E_t > \mu_b + \theta \, \sigma_b, \qquad (3)$

with $E_t$ the current sensory prediction error (in practice, computed over a short time frame of 25 time steps), $\mu_b$ the moving average and $\sigma_b$ the moving error deviation of the currently active behavioral model $b$, and $\theta$ a surprise threshold. Hence, a prediction error is considered ‘surprising’ if it exceeds a confidence interval whose size is determined by the threshold $\theta$ [38].
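A minimal sketch of such an error model, assuming exponential moving averages for both moments (the decay and threshold values are placeholders):

```python
class ErrorModel:
    """Sketch: running Gaussian statistics of a behavioral model's prediction error."""

    def __init__(self, decay=0.99, theta=3.0):
        self.mu, self.var = 0.0, 1.0       # moving mean and variance of the error
        self.decay, self.theta = decay, theta

    def observe(self, error):
        """Update the first two moments with the latest prediction error."""
        self.mu = self.decay * self.mu + (1 - self.decay) * error
        self.var = self.decay * self.var + (1 - self.decay) * (error - self.mu) ** 2

    def surprised(self, error):
        """Eq. (3): error outside the mu + theta * sigma confidence bound."""
        return error > self.mu + self.theta * self.var ** 0.5
```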

If a surprise signal is detected, the system is allowed to switch its active behavioral model. Thus, the system enters a searching period to determine the next behavioral model. All existing models are activated and the mean prediction error during this searching period is monitored for every behavioral model. If at some point during the search there exists at least one model for which the mean prediction error is not surprising (determined by Equation 3), this model takes over. If after a maximum number of time steps the mean prediction error of every model is still considered surprising, a new model is generated and added to the set of behavioral models. In this way, the system is able to switch between previously learned behavioral models and to generate new models on the fly.
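Reusing the ErrorModel sketch above, the switching logic could look as follows in Python; all function and parameter names are illustrative, not taken from the paper.

```python
def select_behavioral_model(models, error_models, mean_errors,
                            search_timed_out, create_model):
    """Sketch of the searching period after a surprise signal: reactivate the
    first model whose recent mean error is not surprising (Eq. 3); if the
    search times out, generate a fresh model instead."""
    for model, err_model, mean_error in zip(models, error_models, mean_errors):
        if not err_model.surprised(mean_error):
            return model                    # a known primitive explains the behavior
    if search_timed_out:
        new_model = create_model()          # behavior never experienced before
        models.append(new_model)
        error_models.append(ErrorModel())   # fresh error statistics for the new model
        return new_model
    return None                             # keep searching for another time step
```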

While transitions in behavior are initially detected based on strong increases in prediction error, we assume that the system switches between such behavioral primitives in a predictable fashion. For example, some transitions in behavior may only occur in a specific context: a transition from ‘walking’ to ‘swimming’ may only occur in shallow water. To model the critical conditions leading to a transition in behavior and, thus, enable the system to accurately predict such a transition, we train a set of transition models. For each transition from one behavioral model to another, a transition model is trained. A transition model attempts to identify the sensory states that allow this particular transition in behavior to take place and learns to predict how such a transition typically unfolds. Transition models are updated once a transition in behavior occurs (further described in Appendix B). Hence, by learning models of transitions in behavior, the SUBMODES architecture does not only learn how one stable behavioral primitive unfolds – encoded by its behavioral models – but also how different behavioral primitives are connected through transitions in behavior – encoded by its transition models.
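As a rough sketch, a transition model could track running averages of the sensory states observed at transition time; the actual update rule is specified in Appendix B of the paper, so the fields and the applicability heuristic below are assumptions.

```python
import numpy as np

class TransitionModel:
    """Sketch: statistics of one transition between two behavioral models."""

    def __init__(self, n_sensors, decay=0.9):
        self.condition = np.zeros(n_sensors)  # typical sensory state enabling the transition
        self.effect = np.zeros(n_sensors)     # typical sensory change caused by the transition
        self.count, self.decay = 0, decay

    def update(self, x_before, x_after):
        """Called once whenever this particular transition is observed."""
        self.condition = self.decay * self.condition + (1 - self.decay) * x_before
        self.effect = self.decay * self.effect + (1 - self.decay) * (x_after - x_before)
        self.count += 1

    def applicable(self, x, tol=1.0):
        """Heuristic check whether the current state resembles the learned condition."""
        return self.count > 0 and np.linalg.norm(x - self.condition) < tol
```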

After initially exploring and learning its own behavioral abilities, the SUBMODES architecture can use its behavioral encodings for model-predictive planning and goal-directed behavior. For goal-directed behavioral control, the explorative controller is deactivated and the motor command is determined directly by the active behavioral model. To plan behavior, the system receives a desired sensory goal state at every time step. The system first considers which subset of behavioral models is applicable given the current sensory state, using its transition models. Then, the system ‘imagines’ how the sensorimotor time series will unfold for each applicable behavior over a fixed time horizon (details in Appendix C). By comparing the predicted time series with the goal state, the system can activate the behavioral model whose predictions come closest to the goal state.
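Combining the sketches above, this goal-directed selection step might look as follows; the dictionary layout, rollout length, and Euclidean distance measure are assumptions for illustration, with the paper's actual planning procedure given in Appendix C.

```python
import numpy as np

def plan_next_model(active, models, transitions, x, goal, horizon=50):
    """Sketch: mentally simulate every currently applicable behavioral model
    over a fixed horizon and activate the one whose predicted end state lies
    closest to the desired sensory goal state."""

    def reachable(m):
        if m is active:
            return True
        t = transitions.get((active, m))      # transition model for active -> m
        return t is not None and t.applicable(x)

    best_model, best_dist = active, float("inf")
    for m in filter(reachable, models):
        x_sim = np.array(x, dtype=float)
        for _ in range(horizon):              # 'imagine' the sensorimotor rollout
            _, dx = m.predict(x_sim)
            x_sim += dx
        dist = np.linalg.norm(x_sim - goal)   # proximity to the desired state
        if dist < best_dist:
            best_model, best_dist = m, dist
    return best_model
```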

Figure 2: Spherical robot and its axis orientation sensors. (a) shows a screenshot from simulation. (b) shows a schematic illustration of how the axis orientation sensor values are determined (taken from [46])

4 Simulations

The experiments were conducted in the physically realistic rigid body simulator LPZRobots [47]. We tested the SUBMODES system on two robots, the Spherical robot and the Hexapod. The system was updated with a frequency of 50 Hz, each time receiving new sensor readings and setting motor commands.

The Spherical robot, illustrated in Fig. 2, has a ball-shaped body that contains three internal masses. The robot can move by shifting the three masses along three internal, orthogonal axes. The three motor command values define the nominal position of the masses along the axes, with 0 corresponding to a centered position and −1 or 1 corresponding to the outer positions. The ‘proprioceptive’ sensory information for the three internal axes is measured as the projection of each axis’ direction onto the z-component of the world-coordinate system, illustrated in Fig. 2 (b). The robot is equipped with a spherical head atop its body to visualize the current rolling direction of the Spherical robot. The head does not physically interact with the body and always ‘hovers’ above it. When the Spherical robot is in motion, the head rotates around its vertical axis to face the current rolling direction.

The Hexapod is a six-legged robot inspired by a stick insect. It has 18 actuated degrees of freedom, 3 in each leg. As in real stick insects, each leg is partitioned into three parts: femur, tibia, and tarsus. The femur is connected to the body by a two-dimensional coxa joint, which is able to perform forward-backward and upward-downward rotations of the leg with respect to the body. Femur and tibia are connected by a one-dimensional knee joint, which is able to rotate the tibia upward or downward with respect to the femur. The motor values correspond to nominal angles of the joint, where −1 is associated with the minimal joint angle and 1 with the maximal angle. Tarsi and antennae are attached by spring joints and are not actuated.

For both robots, the SUBMODES system receives the current proprioceptive sensory information as input. When using the Hexapod, delayed sensor values of the 12 coxa joints, with a small temporal delay of a few time steps, are additionally provided. Besides the proprioceptive sensory information, the velocity of the robot’s body movement and its current orientation are available as sensory input. Gaussian distributed noise is added to the proprioceptive sensor values and to the motor commands (with different magnitudes for the Spherical robot and the Hexapod).
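For concreteness, assembling this sensory input could look like the following sketch; the argument names and the noise magnitude are placeholders for the values used in the paper.

```python
import numpy as np

def assemble_observation(proprio, body_velocity, orientation,
                         delayed_coxa=None, noise_std=0.01, rng=None):
    """Sketch: build the sensory input vector for the SUBMODES system.
    `delayed_coxa` is only passed for the Hexapod; `noise_std` stands in for
    the paper's unspecified noise magnitudes. All inputs are 1-D arrays."""
    rng = rng or np.random.default_rng()
    noisy = proprio + rng.normal(0.0, noise_std, size=len(proprio))
    parts = [noisy]
    if delayed_coxa is not None:              # 12 delayed coxa joint readings
        parts.append(delayed_coxa)
    parts += [body_velocity, orientation]
    return np.concatenate(parts)
```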

5 Results

Figure 3: Exemplary surprise detection for the Spherical robot and the Hexapod shown through the development of the internal error statistics over time. In the upper rows the Spherical robot is rolling by rotating around its green axis. After hitting a wall the robot changes its behavior and starts rolling by rotating around its red axis. The collision and subsequent change in behavior result in a strong increase in the prediction error beyond the confidence of the currently active model. Upon the detection of surprise, the SUBMODES architecture searches for a new behavioral model. Since no model is found that predicts the new behavior sufficiently well, a new model is created and trained. The lower row shows a behavioral transition for the Hexapod. Here the Hexapod moves in a straight line using the tripod gait until the bias dynamics of the DEP-controller is activated and the robot starts crawling in a left curve. This causes surprise, followed by a searching phase and a subsequent transition in behavioral models.

5.1 Learned behavioral primitives

In a first test, we examined which behaviors are generated by the DEP-controller for the different robots, and how the SUBMODES architecture segments the explored stream of sensorimotor information into different behavioral primitives. For that purpose, we let the SUBMODES system explore different behaviors for 90 minutes of simulation time.

The Spherical robot was tested in a large quadratic arena surrounded by walls. When applied to the Spherical robot, the DEP-controller typically generates different rolling motions, in which one of the internal masses is kept fixed at the center of its axis while the other two masses periodically oscillate between the minimal and maximal positions with a certain phase shift. Thereby, the robot’s body rotates around one of its axes, while this axis is kept approximately parallel to the ground. If the robot hits a wall, the sensorimotor dynamics are strongly perturbed. These strong perturbations of the internal dynamics are amplified by the DEP learning rule, which can result in the generation of a new rolling behavior. If the robot continues one rolling motion long enough, the bias dynamics of the DEP-controller is activated and the previously centered internal mass is shifted to one side. This results in a turning motion where the robot turns either left or right while rotating around the axis with the shifted internal mass.

In 90 minutes of simulation time exploring behavior with the Spherical robot, the SUBMODES system learned on average 15 behavioral models over 10 simulations. Surprise is typically detected once the Spherical robot hits a wall or switches from rolling straight to driving a curve. The upper part of Fig. 3 shows the detection of surprise for one exemplary transition in behavior. In this example the robot first rolls in a straight line by rotating its body around its internal green axis. Upon hitting a wall, the previously demonstrated behavior stops and for a short period of time all internal masses start moving. This results in a strong increase in prediction error outside the confidence of the active model. After some time the motion of one of the internal masses (the red mass) decreases until this mass stops moving and is kept fixed at the center of its axis. Since this behavior was demonstrated for the first time, no fitting model is found during the searching period and a new model is generated. While the system performs the new rolling behavior, the predictions of this new behavioral model improve and its confidence interval narrows. Further transitions in behavior are shown in Video 1 (https://youtu.be/DKblfeM2Jys).

Figure 4: Behavioral space of the Spherical robot discovered by the SUBMODES architecture in one simulation. (a) illustrates the angular velocities around the internal axes. Each point in (b)–(d) shows the behavior of the robot in terms of angular velocities at that time. (b) shows the behavior for rolling in a relatively straight line, i.e., with hardly any change in driving direction. (c) shows the behavior for turning left and (d) shows the behavior for turning right. The color of each point depicts which behavioral model was active and predicting the behavior at that time. For clarity only every 50th time step of the simulation is shown.

The behavior explored by the SUBMODES system for the Spherical robot can be described in terms of the angular velocity around each internal axis, i.e., how fast the body of the robot rotates around that axis, as illustrated in Fig. 4 (a). Fig. 4 (b)–(d) depict rolling behaviors of the Spherical robot from one simulation in terms of these angular velocities. Since the change of orientation is not fully reflected in the angular velocities, we separate the behavior for driving straight (Fig. 4 (b)), driving a right curve (Fig. 4 (c)), and driving a left curve (Fig. 4 (d)). Curved rolling corresponds to rotating around one axis while the internal mass of another axis is shifted to its right or left side. The color of each point shows the clustering of behavior through the behavioral models of the SUBMODES system. In this simulation the system learned 17 models. A clear partition can be observed, where different behavioral models are active depending on the angular velocities and the turning direction (straight/left/right) of the point in behavioral space. Note that neither the angular velocities nor the turning velocity were directly available to the system; instead, the system used its internal predictions of changes in the sensory values to systematically structure the experienced behavior.

Figure 5: Exemplary gaits discovered by the SUBMODES system for the Hexapod. Each gait was encoded by a single behavioral model. (a)–(d) show gaits in an open field. (e)–(f) show gaits in different terrain (see Section 5.3). In (e) a snow layer slows down leg movements within the snow. In (f) a low cave ceiling limits the upward movement range of the legs.

We tested the Hexapod robot in an open field without any obstacles. When applied to the Hexapod, the DEP-controller, with a particular inverse model, generates different gaits with circular or oval forward movements of each leg. The performed gaits vary in the strength of the leg movements and in the phase relationships between them. One of the emerging gaits for the Hexapod is the tripod gait, as previously observed in [10]. The tripod gait, shown in Fig. 5 (a), can be characterized as always having three legs on the ground, with the ipsilateral front and back legs and the contralateral middle leg moving together and in phase [48]. Moreover, a synchronous trot gait could emerge, where two legs at opposing sides of the body move synchronously and hind and front leg movements are synchronized [10], as shown in Fig. 5 (b). Additionally, various hybrid forms of these gaits emerged, for example, front and middle legs moving as during the tripod gait and hind legs moving synchronized and in phase. When the bias dynamics of the DEP-controller is activated, the legs on one side of the body are offset either dorsally or ventrally along the rotational axes of the coxa joints. This causes the legs on one side to rotate with a smaller amplitude, resulting in the robot crawling in a left or right curve, as shown in Fig. 5 (c)–(d).

In 90 minutes of exploring behavior with the Hexapod, the SUBMODES system learned on average 18 behavioral models over 10 simulations. Surprise is typically detected when the amplitude or phase relation of the circular joint movements changes, i.e., when the robot changes its gait, switches from crawling straight to crawling in a curve, or alters the overall velocity of the gait. An example of changing from the tripod gait to curved locomotion, with the respective surprise detection, is shown in the lower row of Fig. 3. Video 2 (https://youtu.be/qeUpOqs9PCo) shows more transitions in behavior for the Hexapod.

5.2 Goal-directed navigation

In a second test we analyzed how the SUBMODES system can use its learned behavioral encodings for goal-directed planning and control. We demonstrate this in a goal-directed locomotion task. Using an agent-centric frame of reference, we define goals with respect to a target orientation and velocity. One simulation of this experiment consisted of 100 training episodes. Each episode was composed of an exploration phase, a training phase, and a testing phase. During the exploration phase the system was allowed to discover and learn new types of behavior for 5 minutes of simulation time. In this phase all motor commands were generated by the DEP-controller. During the training phase the DEP-controller was deactivated and the motor commands were produced by the active behavioral models with the aim of reaching the given goal. In all experiments goals were small, circular areas. After either reaching the goal state or failing to reach it in time, the robot was reset and a new goal area was generated. In each training phase three goals were presented. During the training phase the internal models of the system were updated. Once this phase was concluded, the testing phase was initiated. Each testing phase consisted of five randomly generated goal areas. During testing no model updates took place and the system had to rely on the previously learned representations for goal-directed control. We use the results of the testing phases to measure the performance of the system over the course of the episodes.

The Spherical robot was tested in a large quadratic arena surrounded by walls. Circular goal areas were randomly generated at a fixed distance from the center of the arena. The Spherical robot was given a maximum of 140 seconds to reach a goal area before being reset. Video 3 (https://youtu.be/i0oovLnqF9A) shows some exemplary runs of goal-directed navigation.

Figure 6: Results for the goal-directed navigation task for the Spherical robot over the course of the training episodes. (a) shows the average time spent per goal before the robot was reset. (b) shows the mean percentage of goal areas reached within the maximal time limit (140 s). (c) shows the mean number of behavioral models discovered. The black line depicts the SUBMODES architecture with the shaded area showing the standard deviation. Other line styles and colors show different baselines (see text for further explanations).

Fig. 6 shows the results for the goal-directed navigation task for the Spherical robot, with the SUBMODES system shown in black. Fig. 6 (a) shows the average time spent to reach the goal area. Over the first 50 training episodes the time required for goal-directed navigation continuously decreases. While in the first testing episodes the system required approximately 90 seconds per goal, during the last testing episodes it took less than 60 seconds. The hypothetical optimal performance of approximately 30 seconds is never fully reached. Note that this optimum assumes that no turning is required and that the robot can simply drive towards the goal in a straight line at maximum speed.

Most of the behavioral models for the Spherical robot were discovered during the first 25 exploration phases, i.e., 125 minutes of exploring behavior. The number of behavioral models increased only slightly afterwards (see Fig. 6 (c)). Similarly, the percentage of goal areas reached within the maximal amount of time increased strongly over the first training episodes (see Fig. 6 (b)). Already after the second training episode the SUBMODES system managed to reach over 70% of the goal areas in time. After 25 training episodes the system is able to reach more than 90% of the goal areas.

Figure 7: Results for the goal-directed navigation task for the Hexapod over the course of the training episodes. (a) shows the average time spent per goal before the robot was reset. (b) shows the mean percentage of goal areas reached within the maximal time limit (200s). (c) shows the mean number of behavioral models discovered. The black line shows the performance of the SUBMODES architecture with the shaded area showing the standard deviation. Other line styles and colors show different baselines (see text for further explanations).

We compare the performance of the SUBMODES system to different ablations of the system, also plotted in Fig. 6. To determine the effectiveness of self-organized exploration combined with surprise-based segmentation, we compare the system to model-predictive control (MPC) using random controllers. In this setting, the system is equipped with 30 neural network controllers whose fixed weights are randomly generated following a uniform distribution, and which can be used for planning and goal-directed control. Additionally, we compare the system to a random segmentation baseline. For this baseline the system is given 30 behavioral models and during exploration a randomly selected model is activated every 5 seconds of simulation time. This baseline is used to determine the effect of surprise-based segmentation compared to random, time-based segmentation. Moreover, we tested the SUBMODES system without transition models. In this case, exploration and segmentation are applied normally, but no transition models are learned for the transitions in behavior. This setting is included to test the effect of learning transition models on goal-directed planning.

As shown in Fig. 6, the SUBMODES system clearly outperforms all of its ablations with respect to the number of goals reached and the time required per goal. In the MPC with random controllers setting, the system learns that some of the controllers can be used for locomotion, but it finds no reliable way of changing direction. As a result, using random controllers the robot only managed to reach goal areas if by chance it ended up with the right orientation towards the goal. This is strongly reflected in the percentage of goal areas reached in time, which is on average below 20% for all testing episodes. Similar results can be observed for the random segmentation setting. In this setting most of the learned models do not represent a consistent type of behavior. Hence, the system managed to reach goal areas only by chance and, as a result, on average reached less than 20% of the goals during all episodes. Without transition models the system not only took more time to reach the goal areas, but also reached only approximately 60% of the goal areas in time. We assume that without transition models the system makes errors when predicting changes in behavior during planning, resulting in worse performance.

The Hexapod robot was tested in a large area without any obstacles. Circular goal areas were randomly generated at a fixed distance around the reset point of the robot. The Hexapod was reset if it did not reach a given goal area within 200 seconds of simulation time. In Video 4 (https://youtu.be/1h083TjLDK8) some runs of goal-directed navigation are shown.

Fig. 7 depicts the results of the goal-directed navigation task for the Hexapod robot when using the SUBMODES architecture (black line). Already after the first training episode the system was able to reach 80% of the goal areas within the maximal amount of time. From the 20th training episode onward all presented goal areas were reached in time. The time required to reach the goal areas rapidly decreases over the first training episodes. In the last testing episodes the system needed on average 70 seconds simulation time to reach the goal areas. The hypothetical optimal performance of approximately 25 seconds simulation time is never reached completely. The system continuously discovers new behavioral models over the course of the exploration phases.

As before, we compare the performance of the SUBMODES system to different ablations of the system. When using MPC with random controllers, the Hexapod never managed to reach a goal area. While in some simulations we observed that some of the random controllers could be used for changing the orientation of the robot, not once was a controller generated that could be used for locomotion. Thus, using random controllers the Hexapod never managed to actually move to the goal areas. When applying random segmentation the robot reached approximately 10–15% of the goal areas in time during the first two episodes, but only very rarely reached a goal area afterwards. The cause for this could be that without the surprise-based segmentation one specific behavioral model does not correspond to a particular behavioral primitive; instead, each model is trained on various different types of behavior. Even if by chance one model encodes a consistent behavioral primitive, it might get overwritten very quickly, resulting in a degradation of performance. As for the Spherical robot, the system without transition models performs worse in the goal-directed navigation task in terms of the time required to reach a goal area and the number of goals reached in time.

5.3 Terrain-dependent locomotion

The previous tests showed that the SUBMODES system is able to identify self-explored behavioral primitives and to learn models of these behavioral units, and of the transitions between them, that can be applied for goal-directed navigation. In a third test we wanted to examine whether the system is also able to distinguish between different external events affecting the behavior of the robot. For this purpose, we tested the system applied to the Hexapod in an environment consisting of three different terrain types: a cave, a snow field, and an open field. The cave has a low ceiling above the ground, whose height corresponds to the combined length of the Hexapod’s tibia and tarsus. Thus, the Hexapod is not able to fully lift its legs when positioned in the cave. However, the ceiling and the floor of the cave have low friction, which allows the Hexapod to locomote forward using mostly forward-backward motions of its legs. The second environment is a snow environment, in which a tall snow layer covers the ground. All movements inside the snow layer are severely slowed down by the high friction of the snow. The third environment is an open field without obstacles and a floor with normal friction (as in the previous experiments).

Figure 8: Trajectories of the Hexapod for goal-directed navigation in different terrain. (a) illustrates the obstacle course consisting of three different environments. Textures depict the type of environment and black lines represent walls. White areas show possible goal positions. The first goal is always positioned at the exit of the cave, the second goal is positioned inside the snow. (b)-(d) show exemplary trajectories from the last testing phases of different simulations. The color of the line denotes in which environment the used behavioral model was first discovered.

The SUBMODES system was given 60 minutes of simulation time for behavioral exploration in each of the three environments. Afterwards, the robot was placed in an obstacle course consisting of all three environment types (each of equal size), shown in Fig. 8 (a), and had to use its previously learned models for goal-directed control. The robot started in the center of the cave facing the north wall. The first goal was to crawl out of the cave through an opening at its right side. After reaching the opening, a goal area was randomly positioned in the snow field and the task was to move over the open field and the snow layer to the goal position. Fig. 8 (a) shows the possible positions of the goal areas in white. If the robot reached the goal area or did not reach it within an upper time limit (400 seconds of simulation time), it was reset inside the cave. As in the previous tests, goal positions were defined with respect to the desired orientation and velocity of the robot. We tested the system for 100 training episodes, where each episode was composed of a training phase, during which one goal area was presented and the internal models were updated, and a testing phase, with five goal areas and without any model updates.

The SUBMODES system discovered new behavioral models in each of the three environments. In the cave the system found different crawling motions, which allowed the Hexapod to move using only small upward movements of the legs. One behavior that was discovered in the cave in every simulation is tripod crawling. During this behavior the legs are moved forward and backward as during the tripod gait, but are lifted only slightly, as shown in Fig. 5 (f). In the snow environment, the system discovered interesting gaits for fast movement despite the high friction of the snow layer. In most of the gaits discovered in snow, at least two legs are periodically lifted out of the snow while the other legs move only little and their feet constantly stay within the snow, as shown, for example, in Fig. 5 (e). Some behavioral models were activated in more than one type of environment, but these behaviors mostly resemble standing still or performing little leg movement. Over the 180 minutes of simulation time for exploration, separate sets of behavioral models were discovered in the cave, in the snow environment, and in the open field.

Figure 9: Results for the terrain-dependent navigation task for the Hexapod over the course of the training episodes. (a) shows the average time spent per goal before the robot was reset. (b) shows the mean percentage of goal areas reached within the maximal time limit (400 s). The solid black line depicts the SUBMODES system with the shaded area showing the standard deviation; the dashed blue line shows the performance of the system without transition models; the solid green line shows an estimate of the hypothetical optimal performance.

The results for goal-directed navigation in the obstacle course are shown in Fig. 9, with the black line depicting the SUBMODES architecture. The time spent to reach a goal area and the percentage of goal areas reached by the system rapidly improve over the first couple of training episodes. Already after seven training episodes the system was able to reach more than 80% of the goal areas in time. The percentage of goal areas reached in time further increased, such that the system reached more than 95% of the goal areas during the last couple of episodes. Furthermore, the time spent to reach a goal area is approximately halved over the course of training. Video 5 (https://youtu.be/xhEmmm6VMg8) shows one exemplary run of the Hexapod through the obstacle course.

In Fig. 8 (b)–(d), some trajectories generated by the SUBMODES system for this task are illustrated. The background pattern denotes the type of environment and the color of the lines shows in which environment the active behavioral model was first discovered. One can see that the system mostly applies behavioral models in the specific environment in which they were first discovered. Hence, the system seems to distinguish between different types of behavior based on the three different environments and learns which behaviors are applicable in each environment. Note that the system does not receive direct information about its current environment. The applicability of a behavioral primitive is determined purely by the prediction errors of the internal models and by learning the transition probabilities between different behavioral models. The necessity of transition models for this task is clearly reflected in the performance of the ablated system without transition models (see Fig. 9, blue line). Without learning transition models, the system takes longer to improve its performance for goal-directed navigation and never reaches more than 30% of the goal areas in time.

6 Discussion and Future Work

We have proposed a novel computational architecture, the SUBMODES architecture, for the surprise-based learning of modular, event-predictive behavioral primitives. We showed through different simulations that this system is able to discover and detect a variety of behavioral primitives in highly complex, dynamic systems without the provision of any signal indicating the existence of a behavioral unit or the beginning or end of such a unit. Instead, the system uncovered different behavioral primitives from a continuous, self-explored sensorimotor stream in a self-supervised fashion, purely based on the detection of surprise and principles of event-predictive cognition [33, 4]. This allowed our system to discretize the continuous stream of information experienced by an embodied agent on-line, while simultaneously learning models of the performed behavior and of the transitions in behavior. In this way, the SUBMODES system was able to learn a repertoire of various realistic behaviors for two complex robotic agents completely from scratch.

In this work, the behavioral capabilities were initially explored by means of self-organizing behavior generated by the differential extrinsic plasticity (DEP) controller [10]. This controller was able to produce various complex, highly coordinated behavioral patterns for the two robots with completely different body kinematics. Without specifying a goal, various rolling motions for the Spherical robot and crawling behaviors for the Hexapod emerged, most notably the tripod gait also known from real insects [10]. However, the SUBMODES architecture does not rely on this particular controller. Other forms of behavioral exploration or demonstration could in principle be applied as well, such as predictive information maximization [49], intrinsically motivated goal exploration processes [50], or human demonstration.

Besides the segmentation and learning capabilities, we showed that the SUBMODES system can use its learned behavioral representations progressively more efficiently for different goal-directed navigation tasks. The improvement in performance over time is accomplished by means of three main processes: (1) the system discovers new types of behavior that are more effective for the tested tasks; (2) the system continues to improve the prediction accuracy of the behavioral models, allowing it to anticipate the sensory consequences of each associated behavior more accurately; (3) the system improves its predictive models of behavioral transitions, learning when transitions between different types of behavior can be applied and how a specific transition affects the sensory state.

While the system manages to improve its capabilities to perform goal-directed control both in terms of number of goal states reached and time required to reach these goals, the system currently does not quite achieve optimal performance for the examined tasks. However, the learned representations were not optimized for any of the tested objectives. Instead, the system learned general, abstract representations of behavior that can in principle be applied in various tasks.

If one wishes to further optimize the performance of the system with respect to a specific task, various methods could be applied in addition to the processes already involved: since the learned models are differentiable, goal-directed active inference could be applied [51, 52], adjusting the motor command of each behavioral model on the fly depending on the desired sensory outcome. Furthermore, if a criterion for successful performance in a specific task is known, for example achieving high velocity in a locomotion task, the models could be further optimized toward this criterion by means of reinforcement learning [53] and policy gradient approaches [29].

The SUBMODES system modularizes the experienced behavior by encoding behavioral primitives through discrete, individual models. While this modularization protects the system from catastrophic forgetting [54], the time required to learn different behaviors could be reduced by sharing information among models. Hence, for future work we want to apply the principles investigated here in a more general forward architecture, akin to the network architectures in [55], and explore how behavioral representations can be modularized by selectively activating sub-components within the same network structure, as demonstrated, for example, by the REPRISE architecture [56, 57].

Furthermore, for future work we want to apply the principles described here to more complex tasks, such as object manipulation tasks with multiple intermediate steps that require hierarchical, non-greedy planning. Moreover, we want to expand our system to be able to deal with even more complex settings, by additionally providing visual sensory information and applying the system to real robots.

References

  • [1] Martin V. Butz and Esther F. Kutter. How the Mind Comes Into Being: Introducing Cognitive Science from a Functional and Computational Perspective. Oxford University Press, Oxford, UK, 2017.
  • [2] Tamar Flash and Binyamin Hochner. Motor primitives in vertebrates and invertebrates. Current Opinion in Neurobiology, 15(6):660 – 666, 2005.
  • [3] Peter Gärdenfors. The Geometry of Meaning: Semantics Based on Conceptual Spaces. MIT Press, London, England, 2014.
  • [4] Martin V. Butz. Towards a unified sub-symbolic computational theory of cognition. Frontiers in Psychology, 7(925), 2016.
  • [5] Darrin C Bentivegna, Christopher G Atkeson, and Gordon Cheng. Learning from observation and practice using primitives. In AAAI 2004 Fall Symposium on Real-life Reinforcement Learning. Citeseer, 2004.
  • [6] Stefan Schaal. Dynamic movement primitives-a framework for motor control in humans and humanoid robotics. In Adaptive motion of animals and machines, pages 261–280. Springer, 2006.
  • [7] Auke Jan Ijspeert, Jun Nakanishi, Heiko Hoffmann, Peter Pastor, and Stefan Schaal. Dynamical movement primitives: Learning attractor models for motor behaviors. Neural Computation, 25(2):328–373, 2013.
  • [8] Duy Nguyen-Tuong and Jan Peters. Model learning for robot control: a survey. Cognitive Processing, 12:319–340, 2011.
  • [9] F. Wörgötter, E. E. Aksoy, N. Krüger, J. Piater, A. Ude, and M. Tamosiunaite. A simple ontology of manipulation actions based on hand-object relations. Autonomous Mental Development, IEEE Transactions on, 5(2):117–134, 2013.
  • [10] Ralf Der and Georg Martius. Novel plasticity rule can explain the development of sensorimotor intelligence. Proceedings of the National Academy of Sciences, 112(45):E6224–E6232, 2015.
  • [11] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13:341–379, 2003.
  • [12] Matthew Botvinick and Ari Weinstein. Model-based hierarchical reinforcement learning and human action control. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 369(1655), 2014.
  • [13] Martin A. Giese and Tomaso Poggio. Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience, 4:179–192, 2003.
  • [14] Oliver Herbort, Martin V. Butz, and Joachim Hoffmann. Towards an adaptive hierarchical anticipatory behavioral control system. In From Reactive to Anticipatory Cognitive Embodied Systems: Papers from the AAAI Fall Symposium, pages 83–90, Menlo Park, CA, 2005. AAAI Press.
  • [15] Yuuya Sugita, Jun Tani, and Martin V Butz. Simultaneously emerging braitenberg codes and compositionality. Adaptive Behavior, 19:295–316, 2011.
  • [16] Jun Tani. Learning to perceive the world as articulated: An approach for hierarchical learning in sensory-motor systems. Neural Networks, 12:1131–1141, 1999.
  • [17] Ronald C Arkin. Motor schema—based mobile robot navigation. The International journal of robotics research, 8(4):92–112, 1989.
  • [18] Christoph Bregler. Learning and recognizing human dynamics in video sequences. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 568–574. IEEE, 1997.
  • [19] D. Kraft, N. Pugeault, E. Baseski, M. Popovic, D. Kragic, S. Kalkan, F. Wörgötter, and N. Krüger. Birth of the object: Detection of objectness and extraction of object shape through object action complexes. International Journal of Humanoid Robotics, 5(2):247–265, 2008.
  • [20] William James. The Principles of Psychology, volume I,II. Cambridge, MA: Harvard University Press, 1890.
  • [21] Joachim Hoffmann. Anticipatory behavioral control. In Anticipatory behavior in adaptive learning systems, pages 44–65. Springer, 2003.
  • [22] Armin Stock and Claudia Stock. A short history of ideo-motor action. Psychological research, 68(2-3):176–188, 2004.
  • [23] W. Prinz. A common coding approach to perception and action. In O. Neumann and W. Prinz, editors, Relationships between perception and action, pages 167–201. Springer Verlag, Berlin, 1990.
  • [24] Martin V. Butz and Joachim Hoffmann. Anticipations control behavior: Animal behavior in an anticipatory learning classifier system. Adaptive Behavior, 10:75–96, 2002.
  • [25] Martin V. Butz. Which structures are out there? In Thomas K. Metzinger and Wanja Wiese, editors, Philosophy and Predictive Processing, chapter 8. MIND Group, Frankfurt am Main, 2017.
  • [26] Daniel M Wolpert and Mitsuo Kawato. Multiple paired forward and inverse models for motor control. Neural networks, 11(7-8):1317–1329, 1998.
  • [27] Masahiko Haruno, Daniel M Wolpert, and Mitsuo Kawato. Mosaic model for sensorimotor learning and control. Neural computation, 13(10):2201–2220, 2001.
  • [28] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.
  • [29] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
  • [30] S. Calinon and A. Billard. Statistical learning by imitation of competing constraints in joint space and task space. Advanced Robotics, 23(15):2059–2076, 2009.
  • [31] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Machine Learning, 84:171–203, 2011.
  • [32] O. Sigaud, C. Salaun, and V. Padois. On-line regression algorithms for learning mechanical models of robots: a survey. Robotics and Autonomous Systems, 59(12):1115–1129, December 2011.
  • [33] Jeffrey M Zacks, Nicole K Speer, Khena M Swallow, Todd S Braver, and Jeremy R Reynolds. Event perception: a mind-brain perspective. Psychological bulletin, 133(2):273, 2007.
  • [34] Jeffrey M Zacks and Barbara Tversky. Event structure in perception and conception. Psychological bulletin, 127(1):3–21, 2001.
  • [35] Jeremy R Reynolds, Jeffrey M Zacks, and Todd S Braver. A computational model of event segmentation from perceptual prediction. Cognitive Science, 31(4):613–643, 2007.
  • [36] Christian Gumbsch, Jan Kneissler, and Martin V Butz. Learning behavior-grounded event segmentations. In Proceedings of the 38th Annual Meeting of the Cognitive Science Society, pages 1787–1792, 2016.
  • [37] Christian Gumbsch, Sebastian Otte, and Martin V Butz. A computational model for the dynamical learning of event taxonomies. In Proceedings of the 39th Annual Meeting of the Cognitive Science Society, pages 452–457, 2017.
  • [38] Martin V Butz, Samarth Swarup, and David E Goldberg. Effective online detection of task-independent landmarks. Urbana, 51:61801, 2004.
  • [39] Anna C Schapiro, Timothy T Rogers, Natalia I Cordova, Nicholas B Turk-Browne, and Matthew M Botvinick. Neural representations of events arise from temporal community structure. Nat Neurosci, 16(4):486–492, April 2013.
  • [40] Özgür Şimşek and Andrew G. Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. Proceedings of the Twenty-First International Conference on Machine Learning (ICML-2004), pages 751–758, 2004.
  • [41] Özgür Şimşek and Andrew G. Barto. Skill characterization based on betweenness. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1497–1504. Curran Associates, Inc., Red Hook, NY, 2009.
  • [42] Matthew Botvinick, Yael Niv, and Andrew C. Barto. Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113(3):262 – 280, 2009.
  • [43] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
  • [44] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pages 3675–3683, 2016.
  • [45] Georg Martius, Rafael Hostettler, Alois Knoll, and Ralf Der. Compliant control for soft robots: emergent behavior of a tendon driven anthropomorphic arm. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pages 767–773. IEEE, 2016.
  • [46] Georg Martius. Goal-oriented control of self-organizing behavior in autonomous robots. PhD thesis, Göttingen University, 2010.
  • [47] Georg Martius, Frank Hesse, F Güttler, and Ralf Der. LPZRobots: A free and powerful robot simulator, 2010.
  • [48] JJ Collins and Ian Stewart. Hexapodal gaits and coupled nonlinear oscillator models. Biological cybernetics, 68(4):287–298, 1993.
  • [49] Georg Martius, Ralf Der, and Nihat Ay. Information driven self-organization of complex robotic behaviors. PloS one, 8(5):e63400, 2013.
  • [50] Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190, 2017.
  • [51] Sebastian Otte, Theresa Schmitt, Karl Friston, and Martin V. Butz. Inferring adaptive goal-directed behavior within recurrent neural networks. 26th International Conference on Artificial Neural Networks (ICANN17), pages 227–235, 2017.
  • [52] Karl Friston, Francesco Rigoli, Dimitri Ognibene, Christoph Mathys, Thomas FitzGerald, and Giovanni Pezzulo. Active inference and epistemic value. Cognitive Neuroscience, 6:187–214, 2015.
  • [53] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
  • [54] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
  • [55] Jun Tani. Exploring Robotic Minds. Oxford University Press, Oxford, UK, 2017.
  • [56] Martin V. Butz, David Bilkey, Alistair Knott, and Sebastian Otte. Reprise: A retrospective and prospective inference scheme. Proceedings of the 40th Annual Meeting of the Cognitive Science Society, 2018.
  • [57] Martin V Butz, David Bilkey, Dania Humaidan, Alistair Knott, and Sebastian Otte. Learning, planning, and control in a monolithic neural event inference architecture. arXiv preprint arXiv:1809.07412, 2018.
  • [58] Georg Martius and J Michael Herrmann. Variants of guided self-organization for robot control. Theory in Biosciences, 131(3):129–137, 2012.
  • [59] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, New York, NY, USA, 2001.
  • [60] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Appendix A Behavioral exploration using DEP

Figure 10: Network architecture of the DEP-controller (adapted from [45]). The left side illustrates the neural network controller generating motor commands $y(t)$ based on the proprioceptive sensory input $x(t)$. The right side shows the DEP learning rule, multiplying the derivative of a sensor value $\dot{x}(t)$ with the inferred motor changes $\dot{y}(t)$, generated by the inverse model $M$ from some future input's derivative $\dot{x}(t + \theta)$.

For the parametric setup of the DEP-controller we follow [10]. The complete controller architecture is illustrated in Fig. 10. The DEP-controller receives an $n$-dimensional sensory input $x(t)$ and generates an $m$-dimensional motor command $y(t)$ at every discrete time step $t$. We assume that the system has a basic understanding of the causal relationship between motor actions and proprioceptive sensor values [10]. This 'understanding' is imprinted into an inverse model $M$, which relates sensory values back to motor commands with a certain time lag $\theta$. When focusing on changes in sensory values and motor values, we get

$\dot{y}(t) = M \, \dot{x}(t + \theta)$   (4)

where $M$ is the inverse model, simplified as a linear model in the form of a matrix, and $\theta$ the time lag.

The controller weights $C$ are then updated using the differential extrinsic plasticity rule (DEP):

$\Delta C = \varepsilon \, \dot{y}(t) \, \dot{x}(t)^{\top} - \gamma C$   (5)

where $\varepsilon$ is a learning rate and $\gamma$ is a damping term. Since $\dot{y}(t)$ is a linear transformation of $\dot{x}(t + \theta)$ (Equation 4), the synaptic weights of the controller change based on correlations between changes in sensor values with a time lag $\theta$. Thereby, the inverse model $M$ states how correlations between $\dot{x}(t + \theta)$ and $\dot{x}(t)$ impact the weights $C$.
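
The following minimal sketch illustrates how Equations 4 and 5 can be implemented as a discrete-time update using finite differences over a short sensor history. The variable names, the causal indexing (at time $t$, the most recent derivative plays the role of the 'future' input relative to the lagged one), and the default parameter values are illustrative assumptions, not the original implementation.

    import numpy as np

    def dep_update(C, x_history, M, theta, eps=0.01, gamma=0.001, dt=1.0):
        """One DEP weight update (cf. Equations 4 and 5).

        x_history: list of recent sensor vectors, newest last,
        at least theta + 2 entries long.
        """
        # Finite-difference derivatives of the sensor values.
        dx_now = (x_history[-1] - x_history[-2]) / dt
        dx_lag = (x_history[-1 - theta] - x_history[-2 - theta]) / dt
        # Equation 4: infer the motor change from the more recent sensor
        # derivative, mapped through the linear inverse model M.
        dy_inferred = M @ dx_now
        # Equation 5: Hebbian-like correlation of inferred motor changes
        # with time-lagged sensor changes, plus weight damping.
        C += eps * np.outer(dy_inferred, dx_lag) - gamma * C
        return C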

As in [10] we use an appropriate normalization of the controller weights $C$. There are two options to perform weight normalization: global normalization and individual normalization. For global normalization the entire weight matrix is normalized:

$C \leftarrow \kappa \, \dfrac{C}{\|C\| + \rho}$   (6)

with an empirical gain factor $\kappa$ and a regularization term $\rho$ that becomes effective near the singularities ($\|C\| \approx 0$ or $\|C_i\| \approx 0$). In individual normalization each weight is normalized individually, with

$C_{ij} \leftarrow \kappa \, \dfrac{C_{ij}}{\|C_i\| + \rho}$   (7)

where $\|C_i\|$ is the norm of the $i$-th row of $C$, consisting of all weights that connect to the motor neuron $y_i$. The type of normalization applied has a strong effect on the resulting behavior: While individual normalization leads to behaviors that involve all motors, global normalization restricts the overall activity to a subset of motors. For the Spherical robot we apply global normalization, which results in a behavior in which two internal masses are constantly moved while the third mass is stationary. For the Hexapod robot we apply individual normalization, resulting in all joints being involved in locomotion. The gain factor $\kappa$ regulates the overall feedback strength of the sensorimotor loop and was set to a robot-specific value for the Spherical robot and for the Hexapod.
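
Both normalization schemes amount to a few lines of matrix arithmetic. This is a minimal sketch; the parameter names kappa and rho follow our notation above rather than the original code.

    import numpy as np

    def normalize_global(C, kappa, rho=1e-8):
        # Equation 6: scale the entire weight matrix to a fixed gain.
        return kappa * C / (np.linalg.norm(C) + rho)

    def normalize_individual(C, kappa, rho=1e-8):
        # Equation 7: scale each row (all weights feeding one motor
        # neuron) separately, so every motor stays active.
        row_norms = np.linalg.norm(C, axis=1, keepdims=True)
        return kappa * C / (row_norms + rho)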

The controller additionally uses a bias dynamics that changes the value of one bias neuron every $T$ time steps. The bias to be altered is chosen as the bias neuron connecting to the motor neuron that has had the fewest changes in its incoming controller weights. Based on this heuristic, we introduce activity into motor neurons that did not change their activity much in the recent past. When a bias neuron is activated, its activity is randomly set to either $+b$ or $-b$ for a fixed magnitude $b$. After $T$ time steps all bias neurons are deactivated again and this process is repeated.

For the Spherical robot we set $T = 5000$ time steps (100 seconds). Since for the Hexapod robot the behavior demonstrated by the DEP-controller naturally changes more often, we chose a larger time horizon of $T = 10000$ time steps (200 seconds). The DEP-controller applied to the Spherical robot uses three bias neurons, one for each motor neuron. For the Hexapod we use four bias neurons. Two bias neurons are connected to the forward-backward coxa joints of either the right or the left legs, and two bias neurons are connected to the upward-downward coxa joints of either the right or the left legs. With this wiring, the activation of one bias neuron can offset the coxa joint positions of the legs on one body side in one of four directions (upward, downward, forward, backward).
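
The bias heuristic can be sketched as follows, assuming the system keeps a running sum of the absolute weight changes per motor neuron; this bookkeeping and all names are illustrative assumptions.

    import numpy as np

    def select_bias_neuron(weight_change_per_motor, bias_groups):
        """Pick the bias neuron whose motor neurons changed least.

        weight_change_per_motor: array of accumulated |dC| per motor neuron.
        bias_groups: one index array per bias neuron, listing the motor
        neurons that this bias neuron connects to.
        """
        activity = [weight_change_per_motor[g].sum() for g in bias_groups]
        return int(np.argmin(activity))

    def apply_bias(h, neuron_idx, magnitude):
        # Activate the chosen bias neuron with a random sign; all other
        # bias activities are reset to zero.
        h[:] = 0.0
        h[neuron_idx] = np.random.choice([-magnitude, magnitude])
        return h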

The inverse model $M$ of the DEP-controller states how sensory changes relate back to changes in the motor commands of the system, as defined by Equation 4. If the DEP-controller uses only proprioceptive sensory information as an input and motor commands of the same joints as an output, we can set $M$ to the identity matrix $I$. This design corresponds to the idea that changes in the proprioception of joint $i$ are caused by changes in the motor command of joint $i$. This setting can be considered the standard case of applying the DEP-controller, which we also use for the Spherical robot.

Figure 11: Prestructuring of the inverse model $M$ of the DEP-controller when using the Hexapod. $u$ denotes the dimension of the coxa joint responsible for up-down movements, $f$ denotes the forward-backward dimension. An arrow from joint $j$ to joint $i$ describes the entry $M_{ij}$ of the inverse model matrix. $+$-arrows represent a positive connection ($M_{ij} = 1$), $-$-arrows represent a negative connection ($M_{ij} = -1$).

However, the inverse model can also be prestructured by adding connections between joints within the inverse model where correlations or anticorrelations of the joint velocities are desired. The underlying idea is that we can add connections for joints $i$ and $j$ to increase either positive correlations ($M_{ij} = 1$) or negative correlations ($M_{ij} = -1$) between their velocities over time. We apply this form of guided self-organization of behavior [58] when using the Hexapod, as was previously done in [10]. For the Hexapod, the inverse model assumes a positive correlation between changes of joint angles and changes in motor commands for the same joint, i.e., $M_{ii} = 1$ for a joint $i$. Furthermore, the time-delayed sensor for forward-backward angles is positively linked to the downward-upward angle of the same coxa joint (see Fig. 11 (a)). This connection facilitates circular leg motions over time [10]: once the leg moves forward, it is desired that the leg moves downward some time later. To further facilitate locomotion we additionally want to obtain antiphasic forward-backward motions of subsequent legs on the same side. For this purpose, negative links are included between the forward-backward sensors and motors of subsequent legs of the same side (see Fig. 11 (b)).
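
A minimal sketch of such a prestructured inverse model for one body side of the Hexapod is given below; the joint indexing and the symmetric placement of the negative links are our own assumptions about details the text leaves open.

    import numpy as np

    def build_side_inverse_model(n_joints, u_idx, f_idx):
        """Prestructured inverse model M for one side of the Hexapod.

        u_idx / f_idx: joint indices of the up-down / forward-backward
        coxa dimensions of the front, middle, and hind leg of one side.
        """
        M = np.eye(n_joints)      # M_ii = 1: each joint senses its own motor
        for u, f in zip(u_idx, f_idx):
            M[u, f] = 1.0         # forward motion -> later downward motion
        for f_a, f_b in zip(f_idx[:-1], f_idx[1:]):
            M[f_a, f_b] = -1.0    # antiphasic forward-backward motion of
            M[f_b, f_a] = -1.0    # subsequent legs on the same side
        return M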

Appendix B Learning transitions in behavior

To enable the accurate prediction of a sensorimotor time series consisting of a variety of different behaviors, it is important not only to consider how the sensorimotor information unfolds during each stable behavioral mode, but also to model the transitions between two subsequent behavioral primitives. For this purpose, the SUBMODES system incorporates a set of transition models. If a transition from model $m_i$ to model $m_j$ occurs, the transition model $T_{i \to j}$ is updated. Each transition model $T_{i \to j}$ consists of three subcomponents: a transition probability network $p_{i \to j}$, a transition time estimate $\hat{t}_{i \to j}$, and a transition network $d_{i \to j}$.

Some transitions require a specific context to occur, e.g., a transition from 'walking' to 'swimming' can only occur if the agent is standing in shallow water. To model the critical conditions for a transition in behavior, $T_{i \to j}$ contains a transition probability network $p_{i \to j}$. This network aims to predict the probability of a successful transition from $m_i$ to $m_j$ given the current sensory state $s(t)$. $p_{i \to j}$ is a single-layered feed-forward neural network mapping a sensory state $s(t)$ to a probability $p \in [0, 1]$. If a transition was initiated at time step $t$, then $p_{i \to j}$ receives $s(t)$ as an input to train the network. If after the transition the system activated model $m_j$, then $p_{i \to j}$ is trained on the deviation of its prediction from the target probability $1$. If the system planned to reach $m_j$ when initiating the transition but ended up using a different model, $p_{i \to j}$ is updated using the target probability $0$. Thus, the network estimates the probability of being able to switch from $m_i$ to $m_j$ given the current sensory state.

Transitions in behavior may take different amounts of time to complete, since every transition in behavior is preceded by a searching period. Hence, $T_{i \to j}$ contains the component $\hat{t}_{i \to j}$, an estimate of the time required to perform a transition from $m_i$ to $m_j$. Currently, $\hat{t}_{i \to j}$ is computed as the mean number of time steps that passed between the initiation of a transition from model $m_i$ and the successive activation of model $m_j$.

Transitions in behavior can also entail a strong, sudden sensory change. For example, a transition from 'running' to 'standing still' typically results in a strong decrease in velocity. To predict the sensory changes occurring during a transition between models, an additional single-layered feed-forward neural network $d_{i \to j}$ is trained. $d_{i \to j}$ learns a mapping from a sensory state $s(t)$ to a change in sensory states $\Delta s$. When a transition from model $m_i$ is initiated at time $t$ and model $m_j$ is activated at time step $t'$, then $d_{i \to j}$ is trained on the input $s(t)$ and the nominal output $s(t') - s(t)$. Hence, $d_{i \to j}$ predicts how the sensory state will change from the onset of a transition until the transition is finished.

Overall, one transition model $T_{i \to j}$ can be used to estimate (1.) where in sensory space such a transition is applicable, (2.) how long the transition in behavior will take until the next model is active, and (3.) how the sensory state will change over the course of the transition, by means of $p_{i \to j}$, $\hat{t}_{i \to j}$, and $d_{i \to j}$, respectively. This results in a directed graph representation of behavioral primitives, as illustrated in Fig. 12 (a). Each node of the graph represents a stable behavioral mode with uniformly unfolding sensorimotor dynamics, encoded by a single behavioral model $m_i$. The edges between two nodes are transitions in behavior, represented by a transition model $T_{i \to j}$. The availability of an edge given the current sensory state is encoded by the transition probability network $p_{i \to j}$. This graph representation of behavior and transitions in behavior is crucial to allow hierarchical, goal-directed planning of behavior.
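
To make the three subcomponents concrete, the following condensed sketch implements one transition model on top of a minimal single-layer network. The plain gradient steps, the incremental mean, and all class names are illustrative simplifications of the update rules described above; in particular, the balanced cross-entropy weighting of [60] is omitted for brevity.

    import numpy as np

    class SingleLayerNet:
        """Minimal single-layer network: linear map, optional sigmoid."""
        def __init__(self, n_in, n_out, lr=0.01, sigmoid=False):
            self.W = np.zeros((n_out, n_in))
            self.b = np.zeros(n_out)
            self.lr, self.sigmoid = lr, sigmoid

        def forward(self, x):
            z = self.W @ x + self.b
            return 1.0 / (1.0 + np.exp(-z)) if self.sigmoid else z

        def train_step(self, x, target):
            # (prediction - target) is the output-layer gradient of both
            # the squared loss and the sigmoid cross-entropy loss.
            err = self.forward(x) - target
            self.W -= self.lr * np.outer(err, x)
            self.b -= self.lr * err

    class TransitionModel:
        """One transition model T_{i->j} with its three subcomponents."""
        def __init__(self, sensor_dim):
            self.p_net = SingleLayerNet(sensor_dim, 1, sigmoid=True)  # p_{i->j}
            self.d_net = SingleLayerNet(sensor_dim, sensor_dim)       # d_{i->j}
            self.t_mean, self.t_count = 0.0, 0                        # t-hat_{i->j}

        def update(self, s_start, s_end, duration, reached_target):
            # Probability target is 1 if the intended model became active
            # after the transition, 0 otherwise.
            self.p_net.train_step(s_start, np.array([float(reached_target)]))
            if reached_target:
                # Duration estimate and sensory-change network are only
                # updated on successful transitions.
                self.d_net.train_step(s_start, s_end - s_start)
                self.t_count += 1
                self.t_mean += (duration - self.t_mean) / self.t_count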

Figure 12: Illustration of the representations of behavior learned by the SUBMODES architecture and their use for goal-directed planning. (a) The learned representations form a directed graph with the behavioral models as nodes. Each edge represents one step of sensory prediction, either by staying in the same model or by transitioning to a new behavior. A transition to a behavior $m_j$ from the current behavior $m_i$ is considered in a stochastic fashion according to the probability $p_{i \to j}(s(t))$ given the current sensory state $s(t)$. In this example one model is active and two further models can be reached. (b) shows how the prediction can be used for greedy planning. $s^*$ marks a goal state and the dotted lines show the predicted trajectories of the associated behavioral models. In this example the system chooses the behavior whose predicted trajectory, marked by a grey background, has the lowest mean distance to the goal state. (c) shows how replanning allows the system to concatenate different behavioral primitives for accurate goal-directed control.

Appendix C Goal-directed planning

When switching into goal-directed control, the self-organizing controller is deactivated and behavioral models and model transitions are invoked purposefully to minimize the difference between anticipated and desired perceptions. This process of greedy planning is schematically illustrated in Fig. 12 (b).

During goal-directed control, at every time step $t$ the system considers which subset of behaviors $B$ is applicable given the current sensory state $s(t)$ and the currently active model $m_i$. Whether a behavior $m_j$ is an element of $B$ is determined stochastically using the transition probability network $p_{i \to j}$. The system determines the probability of $m_j$ being an element of $B$ as

$P(m_j \in B) = p_{i \to j}(s(t))$   (8)

with $m_i$ the active model and $s(t)$ the current sensory state.

As a next step the system predicts how its sensory state will change when transitioning from the current behavior $m_i$ to a new behavior $m_j$. The sensory state $\tilde{s}_j(t + \hat{t}_{i \to j})$, describing the sensory state after a transition to $m_j$ from the active model $m_i$, is determined as

$\tilde{s}_j(t + \hat{t}_{i \to j}) = s(t) + d_{i \to j}(s(t))$   (9)

with $\hat{t}_{i \to j}$ the estimated time required for the transition and $d_{i \to j}$ the transition network predicting the sensory change during a transition from $m_i$ to $m_j$.

Then, the system predicts how the sensory information will evolve over a planning horizon when staying in $m_j$. The succeeding sensory states are computed iteratively via

$\tilde{s}_j(t' + 1) = \tilde{s}_j(t') + f_j(\tilde{s}_j(t'))$   (10)

starting with $t' = t + \hat{t}_{i \to j}$ until $t' = t + \hat{t}_{i \to j} + H$, where $f_j$ is the forward model of behavior $m_j$, predicting the sensory change for one time step, and $H$ denotes the planning horizon.

Given a goal state $s^*$, the distance of a predicted sensory state $\tilde{s}_j(t')$ with respect to the goal can be computed as

$\delta_j(t') = d(\tilde{s}_j(t'), s^*)$   (11)

for some metric $d$. In the current experiments $d$ was chosen as the squared distance between the task-relevant sensory information of $\tilde{s}_j(t')$ and $s^*$. In our examples, the task-relevant coordinates are the orientation and the velocity of the robot.

The next behavioral model $m_{\text{next}}$ is chosen as

$m_{\text{next}} = \arg\min_{m_j \in B} \; \frac{1}{H} \sum_{t' = t + \hat{t}_{i \to j}}^{t + \hat{t}_{i \to j} + H} \delta_j(t')$   (12)

Hence, the next model is determined as the applicable behavior that predicts the sensory time series with the lowest mean distance to the goal. The predicted time series has a maximal length of $H$ time steps to ensure an upper limit on the computational complexity; $H$ was set to a constant value in all experiments.

After activating the next model $m_{\text{next}}$, the system initiates a searching period to determine whether the transition to this model was successful, as described in Section 3. The transition model is then updated, depending on the success of the initiated transition. As soon as the system is certain about which behavioral model is currently active, it is allowed to replan. In this way, the system can serially concatenate single behavioral primitives to form a chain of more complex behavior that allows it to accurately reach a given goal state, as illustrated in Fig. 12 (c).
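
The greedy planning step of Equations 8-12 can be sketched as follows, reusing the TransitionModel from Appendix B and assuming each behavior exposes a forward model $f_j$ that predicts the sensory change per time step. The duration estimate $\hat{t}_{i \to j}$ only shifts the time axis of the prediction and is therefore omitted from the roll-out; all names are illustrative.

    import numpy as np

    def plan_next_model(i_active, s, forward_models, transitions, goal_dist, horizon):
        """Greedy model selection (cf. Equations 8-12).

        forward_models: callables f_j(s) predicting the sensory change per step.
        transitions: transitions[i][j] is a TransitionModel (i == j unused).
        goal_dist: callable implementing the metric of Equation 11.
        """
        best_j, best_cost = i_active, np.inf
        for j, f_j in enumerate(forward_models):
            if j != i_active:
                trans = transitions[i_active][j]
                # Equation 8: stochastic applicability test.
                if np.random.rand() > trans.p_net.forward(s)[0]:
                    continue
                # Equation 9: predicted state right after the transition.
                s_pred = s + trans.d_net.forward(s)
            else:
                s_pred = s.copy()
            # Equation 10: roll out the forward model over the horizon,
            # recording the goal distance (Equation 11) of each state.
            distances = []
            for _ in range(horizon):
                s_pred = s_pred + f_j(s_pred)
                distances.append(goal_dist(s_pred))
            cost = np.mean(distances)   # Equation 12: minimize mean distance
            if cost < best_cost:
                best_cost, best_j = cost, j
        return best_j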

Appendix D Parametric setup

All neural network models of our system, i.e., the behavioral models $m_i$ as well as the transition networks $p_{i \to j}$ and $d_{i \to j}$, are single-layered neural networks mapping directly from the sensory input space to their respective output spaces. For networks predicting sensory changes or motor commands, i.e., the behavioral models $m_i$ and the transition networks $d_{i \to j}$, output neurons use a tanh-activation function and a squared error loss is used for backpropagation. To enforce sparsity in the network weights, an $L_1$ weight regularization term is added to the loss functions [59], with a small regularization constant. For networks predicting probabilities, i.e., $p_{i \to j}$, output neurons use a sigmoid activation function and perform backpropagation based on a balanced cross-entropy loss [60]. The different types of networks use different learning rates. To enable fast learning while avoiding local overfitting, each network is equipped with a replay buffer with a large capacity (capacity = 10000) that stores a new input-output pair in each training step. During each network update, additional samples are randomly drawn from the buffer and the networks are additionally trained on the drawn samples. A fixed number of samples is drawn per update for the Spherical robot; seeing that the behaviors change faster for the Hexapod, we used a larger sampling rate in that scenario.
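
The replay-buffer scheme can be sketched as below, reusing the SingleLayerNet interface from Appendix B. The first-in-first-out eviction policy is our own assumption, since the text only specifies the capacity.

    import random

    class ReplayBuffer:
        def __init__(self, capacity=10000):
            self.capacity, self.data = capacity, []

        def add(self, x, target):
            if len(self.data) >= self.capacity:
                self.data.pop(0)      # drop the oldest pair (assumed FIFO)
            self.data.append((x, target))

        def sample(self, n):
            return random.sample(self.data, min(n, len(self.data)))

    def train_with_replay(net, buffer, x, target, n_samples):
        # Train on the new pair, store it, and rehearse old samples to
        # avoid overfitting to the most recent behavior.
        net.train_step(x, target)
        buffer.add(x, target)
        for x_old, t_old in buffer.sample(n_samples):
            net.train_step(x_old, t_old)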

The error models are estimated as normal distributions, initialized with a fixed mean $\mu$ and standard deviation $\sigma$. To allow each error model to quickly keep track of the prediction accuracy of its respective behavioral model, $\mu$ and $\sigma$ are updated by means of an exponential moving average and variance with a timescale of 1000 steps.

To enable the detection of surprise, we compute the prediction error as a simple moving average of the sensory prediction error over a very short time interval (25 time steps or 0.5 seconds). Comparing this smoothed prediction error to the error model of the currently active behavioral model allows the system to detect surprise (as defined in Equation 3). The surprise threshold determines the size of the confidence interval that an error needs to exceed to be considered 'surprising'. Seeing that we face highly noisy scenarios in our experiments, we chose a small threshold to achieve a fine-grained segmentation. However, depending on the general predictability of the scenario and the desired level of abstraction, a larger threshold can be applied as well [37].
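
A minimal sketch of the error model and the surprise check follows. The decay rate alpha is set to 1/1000 to match the 1000-step timescale, and the threshold is expressed in standard deviations; both choices are assumptions where the text elides the exact values.

    import numpy as np

    class ErrorModel:
        """Normal-distribution error model with exponential moving updates."""
        def __init__(self, mu0, var0, alpha=0.001):
            self.mu, self.var, self.alpha = mu0, var0, alpha

        def update(self, error):
            # Exponential moving average and variance of the prediction error.
            delta = error - self.mu
            self.mu += self.alpha * delta
            self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta**2)

        def is_surprising(self, smoothed_error, threshold):
            # Surprise: the short-term mean error (e.g., over 25 steps)
            # exceeds the confidence interval of the error model.
            return smoothed_error > self.mu + threshold * np.sqrt(self.var)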

Upon detecting surprise, the system enters a searching period to determine the next behavioral model. In our simulations the searching period takes at most 700 time steps (14 seconds) before a new model is created. Seeing that the DEP-controller maintains one type of behavior for a relatively long time (typically longer than a minute), we can use such a long searching period. In this way, small irregularities in behavior, such as the Hexapod stumbling, are ignored instead of resulting in the generation of a new behavioral model. However, for other exploration mechanisms with faster changes in behavior, a shorter searching period is recommended.

Appendix E Processing sensory changes

To allow the surprise-based segmentation to take all sensory dimensions into account equally, it is necessary that every sensory dimension changes at a similar rate. This can be achieved in two ways: (1.) choosing an appropriate time frame for determining the change in sensory information $\Delta s$, and (2.) scaling each dimension of $\Delta s$ by a constant factor such that all dimensions vary within the same interval.

For the Spherical robot $\Delta s$ is computed as the change of sensory information over one time step, i.e., $\Delta s(t) = s(t) - s(t-1)$. For the Hexapod $\Delta s$ is computed as the mean change over 10 time steps. By computing $\Delta s$ in this way, changes in proprioception typically lie within the same interval. To ensure that the other changes in sensory information lie within this interval as well, the remaining sensory dimensions, such as orientation and velocity, are multiplied with a robot-specific constant factor (one value for the Spherical robot and one for the Hexapod).
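
This preprocessing can be summarized in a few lines; the per-dimension scale factors are placeholders (1 for proprioceptive dimensions, robot-specific constants elsewhere), since the text does not preserve the exact values.

    import numpy as np

    def sensory_change(s_buffer, window, scale):
        """Scaled sensory change (window = 1 for the Spherical robot,
        10 for the Hexapod; scale: per-dimension factors)."""
        recent = np.asarray(s_buffer[-(window + 1):])
        deltas = np.diff(recent, axis=0)       # per-step changes
        return scale * deltas.mean(axis=0)     # mean change, rescaled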