Towards a self-organizing pre-symbolic neural model representing sensorimotor primitives

by Junpei Zhong, et al.

The acquisition of symbolic and linguistic representations of sensorimotor behavior is a cognitive process performed by an agent when it is executing and/or observing its own and others' actions. According to Piaget's theory of cognitive development, these representations develop during the sensorimotor stage and the pre-operational stage. We propose a model that relates the conceptualization of the higher-level information from visual stimuli to the development of ventral/dorsal visual streams. This model employs a neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model. We exemplify this model through a robot passively observing an object to learn its features and movements. During the learning process of observing sensorimotor primitives, i.e. observing a set of trajectories of arm movements and its oriented object features, the pre-symbolic representation is self-organized in the parametric units. These representational units act as bifurcation parameters, guiding the robot to recognize and predict various learned sensorimotor primitives. The pre-symbolic representation also accounts for the learning of sensorimotor primitives in a latent learning context.




1 Keywords:

Pre-symbolic Communication, Sensorimotor Integration, Recurrent Neural Networks, Parametric Biases, Horizontal Product

2 Introduction

Although infants are not supposed to acquire the symbolic representational system at the sensorimotor stage, based on Piaget’s definition of infant development, the preparation of language development, such as a pre-symbolic representation for conceptualization, has been set at the time when the infant starts babbling (Mandler (1999)). Experiments have shown that infants have established the concept of animate and inanimate objects, even if they have not yet seen the objects before (Gelman and Spelke (1981)). Similar phenomena also include the conceptualization of object affordances such as the conceptualization of containment (Bonniec (1985)). This conceptualization mechanism is developed at the sensorimotor stage to represent sensorimotor primitives and other object-affordance related properties.

During an infant’s development at the sensorimotor stage, one way to learn affordances is to interact with objects using tactile perception, observe the object through visual perception, and thus learn the causal relation between the visual features, affordances and movements, as well as to conceptualize them. This learning starts with the basic ability of new-born infants to move an arm towards visually fixated objects (Von Hofsten (1982)), continues through object-directed reaching at the age of 4 months (Streri et al. (1993); Corbetta and Snapp-Childs (2009)), and can also be found during the object exploration of older infants (cf. Mandler (1992); Ruff (1984)). From these interactions leading to visual and tactile percepts, infants gain experience through the instantiated ‘bottom-up’ knowledge about object affordances and sensorimotor primitives. Building on this, infants at the age of around 8-12 months gradually expand the concept of object features, affordances and the possible causal movements in the sensorimotor context (Gibson (1988); Newman et al. (2001); Rocha et al. (2006)). For instance, they realize that it is possible to pull a string that is tied to a toy car to fetch it instead of crawling towards it. An associative rule has also been built that connects conceptualized visual feature inputs, object affordance and the corresponding frequent auditory inputs of words, across various contexts (Romberg and Saffran (2010)). At this stage, categories of object features are particularly learned in different contexts due to their affordance-invariance (Bloom et al. (1993)).

Therefore the integrated learning process of the object’s features, movements according to the affordances, and other knowledge is a globally conceptualized process through visual and tactile perception. This conceptualized learning is a precursor of a pre-symbolic representation of language development. This learning is the process of forming an abstract and simplified representation for information exchange and sharing (for a comparison of conceptualization between engineering and language perspectives, see Gruber and Olsen (1994); Bowerman and Levinson (2001)). Conceptualizing from visual perception usually includes a planning process: first the speaker receives and segments visual knowledge in the perceptual flow into a number of states on the basis of different criteria, then the speaker selects essential elements, such as the units to be verbalized, and last the speaker constructs certain temporal perspectives when the events have to be anchored and linked (cf. Habel and Tappe (1999); von Stutterheim and Nuse (2003)). Assuming this planning process is distributed between ventral and dorsal streams, the conceptualization process should also emerge from the visual information that is perceived in each stream, associating the distributed information in both streams. As a result, the candidate concepts of visual information are statistically associated with the input stimuli. For instance, they may represent a particular visual feature with a particular class of label (e.g. a particular visual stimulus with the auditory word ‘circle’) (Chemla et al. (2009)). Furthermore, the establishment of such links also strengthens the high-order associations that generate predictions and generalize to novel visual stimuli (Yu (2008)). Once the infants have learned a sufficient number of words, they begin to detect a particular conceptualized cue with a specific kind of wording.
At this stage, infants begin to use their own conceptualized visual ‘database’ of known words to identify a novel meaning class and possibly to extend their wording vocabulary (Smith et al. (2002)). Thus, this associative learning process enables the acquisition and the extension of the concepts of domain-specific information (e.g. features and movements in our experiments) with the visual stimuli.

This conceptualization will further result in a pre-symbolic way for infants to communicate when they encounter a conceptualized object and intend to execute a correspondingly conceptualized, well-practised sensorimotor action towards that object. For example, behavioral studies showed that when 8-to-11-month-old infants are unable to reach and pick up an empty cup, they may point it out to the parents and execute an arm movement intending to bring it to their lips. The conceptualized shape of a cup reminds infants of its affordance and thus they can communicate in a pre-symbolic way. Thus, the emergence from the conceptualized visual stimuli to the pre-symbolic communication also gives further rise to the different periods of learning nouns and verbs in infancy development (cf. Gentner (1982); Tardif (1996); Bassano (2000)). This evidence suggests that the production of verbs and the production of nouns are not tied to the same modality of sensory perception: experiments performed by Kersten (1998) suggest that nouns are more related to the movement orientation caused by the intrinsic properties of an object, while verbs are more related to the trajectories of an object. Thus we argue that such differences in the acquisition of lexical classes also relate to the conceptualized visual ventral and dorsal streams. The finding is consistent with the hypothesis of Damasio and Tranel (1993) that verb generation is modulated by the perception of conceptualization of movement and its spatio-temporal relationship.

For this reason, we propose that the conceptualized visual information, which is a prerequisite for the pre-symbolic communication, is also modulated by perception in two visual streams. There have been studies modeling the functional modularity in the development of ventral and dorsal streams (e.g. Jacobs et al. (1991); Mareschal et al. (1999)), bilinear models of visual routing (e.g. Olshausen et al. (1993); Bergmann and von der Malsburg (2011); Memisevic and Hinton (2007)), in which a set of control neurons dynamically modifies the weights of the ‘what’ pathway on a short time scale, and transform-invariance models (e.g. Földiák (1991); Wiskott and Sejnowski (2002)), which encourage the neurons to fire invariantly while transformations are performed in their input stimuli. However, a model that explains the development of conceptualization from both streams and results in an explicit representation of conceptualization of both streams while the visual stimuli are presented is still missing in the literature. This conceptualization should be able to encode the same category for information flows in both ventral and dorsal streams like ‘object files’ in the visual understanding (Fields (2011)) so that they could be discriminated in different contexts during language development.

On the other hand, this conceptualized representation that is distributed in two visual streams is also able to predict the tendency of appearance of an action-oriented object in the visual field, which causes some sensorimotor phenomena such as object permanence (Tomasello and Farrar (1986)), showing that the infants’ attention is usually driven by the object’s features and movements. For instance, when infants are observing the movement of the object, recordings showed an increase of the looking times when the visual information after occlusion is violated in either surface features or location (Mareschal and Johnson (2003)). Words and sounds also play a top-down role in the early infants’ visual attention (Sloutsky and Robinson (2008)). This could hint at the different development stages of the ventral and dorsal streams and their effect on the conceptualized prediction mechanism in the infant’s consciousness. Accordingly, the model we propose about the conceptualized visual information should also be able to explain the emergence of a predictive function in the sensorimotor system, e.g. the ventral stream attempts to track the object and the dorsal stream processes and predicts the object’s spatial location, when the sensorimotor system is involved in an object interaction. This built-in predictive function in a forward sensorimotor system is essential: neuroimaging research has revealed the existence of internal forward models in the parietal lobe and the cerebellum that predict sensory consequences from efference copies of motor commands (Kawato et al. (2003)) and support fast motor reactions (e.g. Hollerbach (1982)). Since the probable position and the movement pattern of the action should be predicted on a short time scale, sensory feedback produced by a forward model with negligible delay is necessary in this sensorimotor loop.

In particular, the predictive sensorimotor model we propose is suitable to work as one of the building modules that takes into account the predictive object movement in a forward sensorimotor system to deal with object interaction from visual stimuli input, as Fig. 1 shows. This system is similar to the sensorimotor integration of Wolpert et al. (1995), but it includes an additional sensory estimator (the lower brown block) which takes into account the visual stimuli from the object so that it is able to predict the dynamics of both the end-effector (which is accomplished by the upper brown block) and the sensory input of the object. This object-predictive module is essential in a sensorimotor system to generate sensorimotor actions like tracking and avoiding when dealing with fast-moving objects, e.g. in ball sports. We also assert that the additional inclusion of forward models in the visual perception of the objects can explain some predictive developmental sensorimotor phenomena, such as object permanence.

Figure 1: Diagram of sensorimotor integration with the object interaction. The lower forward model predicts the object movement, while the upper forward model extracts the end-effector movement from sensory information in order to accomplish a certain task (e.g. object interaction).

In summary, we propose a model that establishes links between the development of ventral/dorsal visual streams and the emergence of the conceptualization in visual streams, which further leads to the predictive function of a sensorimotor system. To validate this proof-of-concept model, we also conducted experiments in a simplified robotics scenario. Two NAO robots were employed in the experiments: one of them was used as a ‘presenter’ and moved its arm along pre-programmed trajectories as motion primitives. A ball was attached at the end of the arm so that another robot could obtain the movement by tracking the ball. Our neural network was trained and run on the other NAO, which was called the ‘observer’. In this way, the observer robot perceived the object movement from its vision passively, so that its network took the object’s visual features and the movements into account. Though we could also use one robot and a human presenter to run the same tasks, we used two identical robots for the following reasons: (1) the object movement trajectories can be produced by a pre-programmed machine so that their types and parameters can be adjusted; (2) the use of two identical robots makes it easier to interchange the roles of presenter and observer. As in other humanoid robots, a sensorimotor cycle composed of cameras and motors also exists in NAO robots. Although the physical configurations and parameters of its sensory and motor systems differ from those of human beings or other biological systems, our model only handles the pre-processed information extracted from visual stimuli. Therefore it is sufficient to serve as a neural model running on a robot CPU to explain the language development in the cortical areas.

Figure 2: The RNNPB-horizontal network architecture, where layers represent different types of features. Size of indicates the transitional information of the object.

3 Material and Methods

3.1 Network Model

A similar forward model exhibiting sensory prediction for visual object perception has been proposed in our recently published work (Zhong et al. (2012b)), where we suggested an RNN implementation of the sensory forward model. Together with a CACLA-trained multi-layer network as a controller model, the forward model embodied in a robot receiving visual landmark percepts enabled a smooth and robust robot behavior. However, one drawback of this work was its inability to store multiple sets of spatial-temporal input-output mappings, i.e. the learning did not converge if several spatial-temporal mapping sequences appeared in the training. Consequently, a simple RNN was not able to predict different sensory percepts for different reward-driven tasks. Another problem was that it assumed only one visual feature appeared in the robot’s visual field, and that was the only visual cue it could learn during development. To solve the first problem, we further augment the RNN with parametric bias (PB) units. They are connected like ordinary biases, but the internal values are also updated through back-propagation. Compared with the generic RNN, the additional PB units in this network act as bifurcation parameters for the non-linear dynamics. According to Cuijpers et al. (2009), a trained RNNPB can successfully retrieve and recognize different types of pre-learned, non-linear oscillation dynamics. Thus, this bifurcation function can be regarded as an expansion of the storage capability of working memory within the sensory system. Furthermore, it adds the generalization ability of the PB units, in terms of recognizing and generating non-linear dynamics. To tackle the second problem, in order to realize sensorimotor prediction behaviors such as object permanence, the model should be able to learn objects’ features and object movements separately in the ventral and dorsal visual streams, as we have shown in Zhong et al. (2012a).

Merging these two ideas, in the context of sensorimotor integration on object interaction, the PB units can be considered as a small set of high-level conceptualized units that describe various types of non-linear dynamics of visual percepts, such as features and movements. This representation is more related to the ‘natural prototypes’ from visual perception, for instance, than a specific language representation (Rosch (1973)).

The development of PB units can also be seen as the pre-symbolic communication that emerges during sensorimotor learning. The conceptualization, on the other hand, could also result in the prediction of future visual percepts of moving objects in sensorimotor integration.

In this model (Fig. 2), we propose a three-layer, horizontal product Elman network with PB units. Similar to the original RNNPB model, the network is capable of being executed under three running modes, according to the pre-known conditions of inputs and outputs: learning, recognition and prediction. In learning mode, the representation of object features and movements are first encoded in the weights of both streams, while the bifurcation parameters with a smaller number of dimensions are encoded in the PB units. This is consistent with the emergence of the conceptualization at the sensorimotor stage of infant development.

Apart from the PB units, another novelty in the network is that the visual object information is encoded in two neural streams and is further conceptualized in PB units. The two streams share the same set of input neurons, where the coordinates of the object in the visual field are used as identities of the perceived images. The appearance of values in different layers represents different visual features: in our experiment, the color of the object detected by the yellow filter appears in the first layer whereas the color detected by the green filter appears in the second layer; the other layer remains zero. For instance, the input $((0, 0), (x, y))$ represents a green object at coordinates $(x, y)$ in the visual field. The hidden layer contains two independent sets of units representing dorsal-like ‘d’ and ventral-like ‘v’ neurons respectively. These two sets of neurons are inspired by the functional properties of dorsal and ventral streams: (i) fast responding dorsal-like units predict object position and hence encode movements; (ii) slow responding ventral-like units represent object features. The recurrent connection in the hidden layers also helps to predict movements in layer d and to maintain a persistent representation of an object’s feature in layer v. The horizontal product brings both pathways together again in the output layer with one-step ahead predictions. Let us denote the output layer’s input from layer d and layer v as $x^d$ and $x^v$, respectively. The network output $s^o$ is obtained via the horizontal product as

$$s^o = x^d \odot x^v \qquad (1)$$

where $\odot$ indicates element-wise multiplication, so each pixel is defined by the product of two independent parts, i.e. for output unit $k$ it is $s^o_k = x^d_k \cdot x^v_k$.
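As a minimal sketch of the horizontal product described above, the snippet below multiplies the two streams' contributions element by element. It assumes a four-unit output in which the first two units belong to the yellow layer and the last two to the green layer; the function name and the numeric values are illustrative and are not taken from the original implementation.

```python
def horizontal_product(x_d, x_v):
    """Element-wise product of the dorsal-like and ventral-like contributions."""
    if len(x_d) != len(x_v):
        raise ValueError("both streams must feed the same number of output units")
    return [d * v for d, v in zip(x_d, x_v)]


# Hypothetical values: the dorsal-like stream proposes a position for every
# layer, while the ventral-like stream gates which colour layer is active.
x_d = [0.4, 0.2, 0.4, 0.2]   # position-like contribution (x, y, x, y)
x_v = [0.0, 0.0, 1.0, 1.0]   # feature-like gate: only the green layer passes

output = horizontal_product(x_d, x_v)
print(output)  # [0.0, 0.0, 0.4, 0.2]
```

Because each output unit is a product of only two terms, the number of connections grows with the sum of the hidden-layer sizes rather than with their product.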

3.2 Neural Dynamics

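We use $s(t)$ to represent the activation of a unit (with superscripts $b$, $d$, $v$ and $o$ for the input, dorsal-like, ventral-like and output layers) and $PB^{d/v}(t)$ to represent the activation of the dorsal/ventral PB units at the time-step $t$. In some of the following equations, the time-index $t$ is omitted if all activations are from the same time-step. The inputs to the hidden units $y^v_j$ in the ventral stream and $y^d_l$ in the dorsal stream are defined as

$$y^d_l(t) = \sum_i w^d_{li}\, s^b_i(t) + \sum_n \tilde{w}^d_{ln}\, PB^d_n(t) + \sum_{l'} v^d_{ll'}\, s^d_{l'}(t-1)$$

$$y^v_j(t) = \sum_i w^v_{ji}\, s^b_i(t) + \sum_m \tilde{w}^v_{jm}\, PB^v_m(t) + \sum_{j'} v^v_{jj'}\, s^v_{j'}(t-1)$$

where $w^d_{li}$, $w^v_{ji}$ represent the weighting matrices between dorsal/ventral layers and the input layer, $\tilde{w}^d_{ln}$, $\tilde{w}^v_{jm}$ represent the weighting matrices between PB units and the two hidden layers, and $v^d_{ll'}$ and $v^v_{jj'}$ indicate the recurrent weighting matrices within the hidden layers.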

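The transfer functions in both hidden layers and the PB units all employ the sigmoid function recommended by LeCun et al. (1998),

$$s^{d/v}_{l/j} = 1.7159 \tanh\left(\tfrac{2}{3}\, y^{d/v}_{l/j}\right)$$

$$PB^{d/v}_{n/m} = 1.7159 \tanh\left(\tfrac{2}{3}\, \rho^{d/v}_{n/m}\right)$$

where $\rho^{d/v}$ represent the internal values of the PB units.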

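The terms of the horizontal products of both pathways can be presented as follows:

$$x^d_k = \sum_l u^d_{kl}\, s^d_l, \qquad x^v_k = \sum_j u^v_{kj}\, s^v_j$$

where $u^d_{kl}$ and $u^v_{kj}$ denote the weights between the two hidden layers and the output layer. The output of the two streams composes a horizontal product for the network output as we defined in Eq. 1.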
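The forward dynamics described in this section can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' code: the layer sizes, the weight initialisation range and all function and dictionary names are hypothetical, and the scaled tanh uses the constants commonly attributed to LeCun et al. (1998).

```python
import math
import random


def lecun_tanh(x):
    """Scaled tanh, 1.7159 * tanh(2x/3), as recommended by LeCun et al. (1998)."""
    return 1.7159 * math.tanh(2.0 * x / 3.0)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; the range is an arbitrary placeholder."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_stream(n_in, n_hidden, n_pb, n_out, rng):
    """Weights of one stream (dorsal-like or ventral-like)."""
    return {
        "w_in": random_matrix(n_hidden, n_in, rng),       # input -> hidden
        "w_pb": random_matrix(n_hidden, n_pb, rng),       # PB -> hidden
        "w_rec": random_matrix(n_hidden, n_hidden, rng),  # hidden(t-1) -> hidden(t)
        "w_out": random_matrix(n_out, n_hidden, rng),     # hidden -> output term
    }


def stream_step(stream, s_in, h_prev, rho):
    """One time-step of a single stream; returns (hidden state, output term)."""
    pb = [lecun_tanh(r) for r in rho]  # PB activation from its internal value
    pre = [a + b + c for a, b, c in zip(matvec(stream["w_in"], s_in),
                                        matvec(stream["w_pb"], pb),
                                        matvec(stream["w_rec"], h_prev))]
    hidden = [lecun_tanh(p) for p in pre]
    return hidden, matvec(stream["w_out"], hidden)


def network_step(dorsal, ventral, s_in, h_d, h_v, rho_d, rho_v):
    """One-step-ahead prediction via the horizontal product of both streams."""
    h_d, x_d = stream_step(dorsal, s_in, h_d, rho_d)
    h_v, x_v = stream_step(ventral, s_in, h_v, rho_v)
    return [a * b for a, b in zip(x_d, x_v)], h_d, h_v


rng = random.Random(0)
dorsal = init_stream(4, 6, 1, 4, rng)    # layer sizes are placeholders
ventral = init_stream(4, 3, 1, 4, rng)
y, h_d, h_v = network_step(dorsal, ventral, [0.0, 0.0, 0.3, 0.5],
                           [0.0] * 6, [0.0] * 3, [0.1], [-0.1])
```

The two streams differ only in their weights and sizes here; the PB values enter each stream as an extra, slowly changing input.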

3.2.1 Learning mode

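The training progress is basically determined by the cost function:

$$C = \frac{1}{2} \sum_{t=1}^{T} \sum_{k=1}^{N} \left(s^b_k(t+1) - s^o_k(t)\right)^2$$

where $s^b_k(t+1)$ is the one-step ahead input (as well as the desired output), $s^o_k(t)$ is the current output, $T$ is the total number of available time-step samples in a complete sensorimotor sequence and $N$ is the number of output nodes, which is equal to the number of input nodes. Following gradient descent, each weight update in the network is proportional to the negative gradient of the cost with respect to the specific weight that will be updated:

$$\Delta w_{ij} = -\eta_{ij} \frac{\partial C}{\partial w_{ij}}$$

where $\eta_{ij}$ is the adaptive learning rate of the weights between neuron $i$ and $j$, which is adjusted in every epoch (Kleesiek et al. (2013)). To determine whether the learning rate has to be increased or decreased, we compute the changes of the weight in consecutive epochs:

$$\sigma_{ij} = \frac{\partial C}{\partial w_{ij}}(e-1) \cdot \frac{\partial C}{\partial w_{ij}}(e)$$

The update of the learning rate is

$$\eta_{ij}(e) = \begin{cases} \min\left(\eta_{ij}(e-1)\, \xi^{+},\ \eta_{max}\right) & \text{if } \sigma_{ij} > 0 \\ \max\left(\eta_{ij}(e-1)\, \xi^{-},\ \eta_{min}\right) & \text{if } \sigma_{ij} < 0 \\ \eta_{ij}(e-1) & \text{otherwise} \end{cases}$$

where $\xi^{+}$ and $\xi^{-}$ represent the increasing/decreasing rate of the adaptive learning rates, with $\eta_{min}$ and $\eta_{max}$ as lower and upper bounds, respectively. Thus, the learning rate of a particular weight is scaled up by $\xi^{+}$ to speed up the learning when the changes of that weight from two consecutive epochs have the same sign, and scaled down by $\xi^{-}$ otherwise.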
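The sign-based adaptation of a single weight's learning rate can be sketched as follows. The multiplicative form of the update, the function name and the numeric values in the usage lines are assumptions made for illustration; the values actually used in the experiments are those of Table 1.

```python
def adapt_learning_rate(eta, grad_prev, grad_curr, inc, dec, eta_min, eta_max):
    """Sign-based adaptation of one weight's learning rate between two epochs."""
    agreement = grad_prev * grad_curr
    if agreement > 0:      # same direction in both epochs: speed up
        return min(eta * inc, eta_max)
    if agreement < 0:      # direction flipped: slow down
        return max(eta * dec, eta_min)
    return eta             # one gradient is zero: keep the current rate


# Hypothetical settings, not the values from Table 1.
faster = adapt_learning_rate(0.01, 0.5, 0.2, inc=1.2, dec=0.5,
                             eta_min=1e-6, eta_max=0.1)
slower = adapt_learning_rate(0.01, 0.5, -0.2, inc=1.2, dec=0.5,
                             eta_min=1e-6, eta_max=0.1)
```

Clamping to the lower and upper bounds keeps a long run of same-sign gradients from driving the rate to an unstable value.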

Besides the usual weight update according to back-propagation through time, the accumulated error over the whole time-series also contributes to the update of the PB units. The update for the $n$-th unit in the PB vector for a time-series of length $T$ is defined as:

$$\rho_n(e+1) = \rho_n(e) + \gamma_n \sum_{t=1}^{T} \delta^{PB}_n(t)$$

where $\delta^{PB}_n(t)$ is the error back-propagated to the PB units at time-step $t$, $e$ indexes the pass over the whole time-series (i.e. the epoch), and $\gamma_n$ is the PB units’ adaptive updating rate, which is proportional to the absolute mean value of the back-propagation error at the $n$-th PB node over the complete time-series of length $T$:

$$\gamma_n \propto \left| \frac{1}{T} \sum_{t=1}^{T} \delta^{PB}_n(t) \right|$$

The adaptive technique is applied because the PB units converge with difficulty. Usually a smaller learning rate is used in the generic version of RNNPB to ensure the convergence of the network. However, this comes at the cost of convergence speed. The adaptive learning rate is an efficient technique to overcome this trade-off (Kleesiek et al. (2013)).
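A minimal sketch of the PB update in learning mode is given below. It assumes that the back-propagated error already points in the descent direction (so it is added to the internal value) and that the adaptive rate equals a proportionality constant times the absolute mean error; the names `update_pb` and `zeta` are hypothetical.

```python
def update_pb(rho, deltas, zeta):
    """Update one PB unit's internal value from the errors of a whole sequence.

    rho    -- current internal value of the PB unit
    deltas -- error back-propagated to this PB unit at each time-step
              (assumed to point in the descent direction)
    zeta   -- proportionality constant of the adaptive updating rate
    """
    total = sum(deltas)
    gamma = zeta * abs(total / len(deltas))  # rate grows with the mean error
    return rho + gamma * total


rho_next = update_pb(0.0, [0.1, 0.3], zeta=2.0)
```

With this form the step shrinks automatically as the mean error approaches zero, which is what lets a larger proportionality constant be used without losing convergence.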

3.2.2 Recognition mode

The recognition mode is executed with a similar information flow as the learning mode: given a set of the spatio-temporal sequences, the error between the target and the real output is back-propagated through the network to the PB units. However, the synaptic weights remain constant and only the PB units are updated, so that the PB units self-organize towards the pre-trained values after a certain number of epochs. Assuming the length of the observed sequence is $L$, the update rule is defined as:

$$\rho_n(e+1) = \rho_n(e) + \gamma_r \sum_{t=T-L+1}^{T} \delta^{PB}_n(t) \qquad (12)$$

where $\delta^{PB}_n(t)$ is the error back-propagated from a certain sensory information sequence to the PB units and $\gamma_r$ is the updating rate of PB units in recognition mode, which should be larger than the adaptive rate in the learning mode.
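The idea of recognition mode, updating only the PB value while the weights stay fixed, can be illustrated with a deliberately tiny stand-in for the trained network. The one-weight generator below is a toy substitute, not the RNNPB itself, and every name and number in it is hypothetical.

```python
import math


def generate(rho, inputs, weight):
    """Frozen toy generator: the output depends on the PB internal value rho."""
    return [weight * math.tanh(rho) * x for x in inputs]


def recognise(targets, inputs, weight, rate=0.1, epochs=200):
    """Recover rho by gradient descent on the squared error; weight stays fixed."""
    rho = 0.0
    for _ in range(epochs):
        pb = math.tanh(rho)
        # Error signal reaching the PB unit, summed over the whole sequence
        # (negative gradient of the squared-error cost with respect to rho).
        delta = sum((target - weight * pb * x) * weight * (1.0 - pb * pb) * x
                    for target, x in zip(targets, inputs))
        rho += rate * delta  # only the PB value changes
    return rho


inputs = [0.1 * k for k in range(1, 11)]
observed = generate(0.8, inputs, weight=1.5)   # sequence from an "unknown" PB value
recovered = recognise(observed, inputs, weight=1.5)
```

In this toy setting `recovered` converges to the PB value that produced the observed sequence, which mirrors how the PB units settle near their pre-trained values when a known primitive is observed.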

3.2.3 Prediction mode

The values of the PB units can also be manually set or obtained from recognition, so that the network can generate the upcoming sequence with one-step prediction.

Parameters Parameter’s Descriptions Value
Learning Rate in Ventral Stream
Learning Rate in Dorsal Stream
Maximum Value of Learning Rate
Minimum Value of Learning Rate
Proportionality Constant of PB Units Updating Rate
Size of PB Unit 1
Size of PB Unit 2
Size of Ventral-like Layer
Size of Dorsal-like Layer
Decreasing Rate of Learning Rate
Increasing Rate of Learning Rate
Table 1: Network parameters

4 Results

In this experiment, as we introduced, we examined this network by implementing it on two NAO robots. They were placed face-to-face in a rectangle box of as shown in Fig. 3. These distances were carefully adjusted so that the observer was able to keep track of movement trajectories in its visual field during all experiments using the images from the lower camera. The NAO robot has two cameras. We use the lower one to capture the images because its installation angle is more suitable to track the balls when they are held in the other NAO’s hand.

Figure 3: Experimental Scenario: two NAOs are standing face-to-face with in a rectangle box.


-diameter balls with yellow/green color were used for the following experiments. The presenter consecutively held each of the balls to present the object interaction. The original image, received from the lower camera of the observer, was pre-processed with thresholding in HSV color-space and the coordinates of its centroid in the image moment were calculated. Here we only considered two different colors as the only feature to be encoded in the ventral stream, as well as two sets of movement trajectories encoded in the dorsal stream. Although we have only tested a few categories of trajectories and features, we believe the results can be extrapolated to multiple categories in future applications.
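The pre-processing step (colour thresholding in HSV space followed by a centroid computed from image moments) can be sketched without any vision library. The hue range, the saturation and value thresholds and the function names are assumptions for illustration; the thresholds used on the robot are not stated here.

```python
import colorsys


def colour_mask(image_rgb, hue_range, min_sat=0.4, min_val=0.3):
    """Binary mask of pixels whose HSV values fall inside the thresholds."""
    lo, hi = hue_range
    mask = []
    for row in image_rgb:
        mask_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            inside = lo <= h <= hi and s >= min_sat and v >= min_val
            mask_row.append(1 if inside else 0)
        mask.append(mask_row)
    return mask


def centroid(mask):
    """Centroid (x, y) from the zeroth and first image moments; None if empty."""
    m00 = m10 = m01 = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            m00 += value
            m10 += x * value
            m01 += y * value
    if m00 == 0:
        return None
    return (m10 / m00, m01 / m00)


# A 3x3 black image with one pure-green pixel at column 2, row 1.
image = [[(0, 0, 0)] * 3 for _ in range(3)]
image[1][2] = (0, 255, 0)
green_centre = centroid(colour_mask(image, hue_range=(0.25, 0.45)))
print(green_centre)  # (2.0, 1.0)
```

Running one mask per colour filter and writing each centroid into its own pair of input units yields the layered input encoding used by the network.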

4.1 Learning

The two different trajectories are defined as below,

The cosine curve,


and the square curve,


where the 3-dimension tuple are the coordinates (centimetres) of the ball w.r.t the torso frame of the NAO presenter. loops between . In each loop, we calculated data points to construct trajectories with sleeping time between every two data points. Note that although we have defined the optimal desired trajectories, the arm movement was not ideally identical to the optimal trajectories due to the noisy position control of the end-effector of the robot. On the observer side, the coordinates of the color-filtered moment of the ball in the visual field were recorded to form a trajectory with sampling time of . Five trajectories, in the form of tuple w.r.t the torso frame of the NAO observer were recorded with each color and each curve, so total trajectories were available for training.

In each training epoch, these trajectories, in the form of tuples, were fed into the input layer one after another for training, with the tuples of the next time-step serving as a training target. The parameters are listed in Tab. 1. The final PB values were examined after the training was done, and the values were shown in Fig. 4. It can be seen that the first PB unit, along with the dorsal stream, was approximately self-organized with the color information, while the second PB unit, along with the ventral stream, was self-organized with the movement information.

Figure 4: Values of two sets of PB units in the two streams after training. The square markers represent those PB units after the square curves training and the triangle markers represent those of the cosine curves training. The colors of the markers, yellow and green, represent the colors of the balls used for training.
(a) PB value 1
(b) PB value 2
Figure 5: Update of the PB values while executing the recognition mode

4.2 Recognition

Another four trajectories were presented in the recognition experiment, in which the length of the sliding-window is equal to the length of the whole time-series, i.e. $L = T$ in Eq. 12. The update of the PB units is shown in Fig. 5. Although we used the complete time-series sequence for the recognition, it should also be possible to use only part of the sequence, e.g. through the sliding-window approach with a smaller $L$, to fulfil the real-time requirement in the future.

(a) Cosine curve, yellow ball
(b) Cosine curve, green ball
(c) Square curve, yellow ball
(d) Square curve, green ball
Figure 6: Generated values: the dots denote the true values for comparison, curves show the estimated ones. Yellow and red represent the values of the two neurons in the first layer (yellow); green and cyan represent those in the second layer (green).

4.3 Prediction

In this simulation, the PB values obtained from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object. Then, the one-step prediction from the output units was applied again to the input at the next time-step, so that the whole time-series corresponding to the object’s movements and features was obtained. Fig. 6 presents the comparisons between the true values (the same as used in recognition) and the predicted ones.
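The closed-loop generation described here, in which each one-step prediction is fed back as the next input, can be sketched generically. The `step` argument stands for a trained network with fixed PB values; the halving map used in the usage lines is only a placeholder so that the example runs on its own.

```python
def generate_sequence(step, first_input, state, n_steps):
    """Closed-loop generation: each prediction becomes the next input."""
    current, outputs = first_input, []
    for _ in range(n_steps):
        prediction, state = step(current, state)
        outputs.append(prediction)
        current = prediction
    return outputs


def toy_step(inputs, state):
    """Placeholder for a trained network step: halves every input value."""
    return [0.5 * value for value in inputs], state


trajectory = generate_sequence(toy_step, [1.0, 2.0], state=None, n_steps=3)
print(trajectory)  # [[0.5, 1.0], [0.25, 0.5], [0.125, 0.25]]
```

Only the first input comes from observation; any prediction error is fed back into later steps, which is one reason the first few time-steps matter for the quality of the generated curve.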

From Fig. 6, it can be observed that the estimation deviated considerably from the true value within the first few time-steps, as the RNN needs to accumulate enough input values to access its short-term memory. However, the error became smaller and the estimation kept track of the true value in the following time-steps. Considering that the curves are automatically generated given the PB units and the values at the first time-step, the error between the true values and the estimated ones is acceptable. Moreover, this result shows that the conceptualization affects the (predictive) visual perception.

Error of Outputs Unit 1 Unit 2 Unit 3 Unit 4
cosine, yellow
cosine, green
square, yellow
square, green
Table 2: Prediction error

4.4 Generalization in Recognition

To test whether our new computational model has the generalization ability that Cuijpers et al. (2009) proposed, we recorded another set of sequences of a circle trajectory. The trajectory is defined as:


The yellow and green balls were still used. We ran the recognition experiment again with the weights previously trained. The update of the PB units is shown in Fig. 7. Compared with Fig. 4, we can observe that the positive and negative signs of the PB values are similar to those of the square trajectory. This is probably because the visual percepts of circle and square movements have more in common than those of circle and cosine movements.

(a) PB value 1
(b) PB value 2
Figure 7: Update of the PB values while executing the recognition mode with an untrained feature (circle)

4.5 PB representation with different speeds

We further generated trajectories with the same data functions (Eqs. 13 - 18) but with a slower sampling time. In other words, the movement of the balls seemed to be faster with robot’s observation. The final PB values after training were shown in Fig. 8.

It can be seen that the PB values were generally smaller compared with Fig. 4, probably because less error was propagated during training. Moreover, the PB values corresponding to colors (green and yellow) and movements (cosine and square) were interchanged within the same PB unit (i.e. along the same axis) due to the different random initial parameters of the network. But the PB unit attached to the dorsal stream still encoded color information, while the PB unit attached to the ventral stream encoded movement information. The network was still able to show properties of spatio-temporal sequence data in the PB units’ representation.

Figure 8: Values of two sets of PB units in the two streams after training with faster speed. The representation of the markers is the same as Fig. 4

5 Discussion

5.1 Neural Dynamics

An advantage of the HP-RNN model is that it can learn and encode the ‘what’ and ‘where’ information separately in two streams (more specifically, in two hidden layers). Both streams are connected through horizontal products, which require fewer connections than full multiplication (as in the conventional bilinear model) (Zhong et al. (2012a)). In this paper, we further augmented the HP-RNN with PB units. One set of units, connected to one visual stream, reflects the dynamics of the sequences in the other stream. This is an interesting result since it reveals the neural dynamics of the hybrid combination of RNNPB units and the horizontal product model. Taking the dorsal-like hidden layer as an example, the error of the attached PB units is


where and are the derivatives of the linear and sigmoid transfer functions. Since we have the linear output, according to the definition of the horizontal product, the equation becomes,


The update of the internal values of the PB units becomes

where the term refers to the contribution of the weighted summation from the ventral-like layer at time . Note that the term is actually constant within one epoch and it is only updated after each epoch with a relatively small updating rate. Therefore, from the experimental perspective, given the same object movement but different object features, the difference of the PB values mostly reflects the dynamic changes in the hidden layer of the ventral stream. The same holds for the PB units attached to the ventral-like layer. This brief analysis shows that, in RNNPB networks with horizontal product connections, the PB units attached to one module effectively accumulate the non-linear dynamics of the other modules.
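The coupling described above can be checked numerically in a minimal sketch. It assumes a linear output formed as the element-wise (horizontal) product of the two streams' weighted sums and a squared prediction error; the names `W_d`, `W_v`, `h_d`, `h_v`, and `target` are hypothetical and do not reproduce the notation of the equations in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_out = 4, 2

# Assumed toy quantities: output weights and hidden activations of each stream.
W_d = rng.normal(size=(n_out, n_hidden))  # dorsal-like hidden -> output
W_v = rng.normal(size=(n_out, n_hidden))  # ventral-like hidden -> output
h_d = rng.normal(size=n_hidden)
h_v = rng.normal(size=n_hidden)
target = rng.normal(size=n_out)


def horizontal_product(h_d, h_v):
    """Linear output: element-wise product of the two weighted sums."""
    return (W_d @ h_d) * (W_v @ h_v)


def loss(h_d, h_v):
    err = horizontal_product(h_d, h_v) - target
    return 0.5 * float(err @ err)


def grad_dorsal(h_d, h_v):
    """Error reaching the dorsal-like layer, scaled by the ventral weighted sum."""
    err = horizontal_product(h_d, h_v) - target
    return W_d.T @ (err * (W_v @ h_v))


def numeric_grad_dorsal(h_d, h_v, eps=1e-6):
    """Central finite differences, used only to verify the analytic form."""
    grad = np.zeros_like(h_d)
    for i in range(h_d.size):
        step = np.zeros_like(h_d)
        step[i] = eps
        grad[i] = (loss(h_d + step, h_v) - loss(h_d - step, h_v)) / (2 * eps)
    return grad


analytic = grad_dorsal(h_d, h_v)
numeric = numeric_grad_dorsal(h_d, h_v)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

Because the ventral weighted sum multiplies the error before it is passed back, any quantity that accumulates this gradient on the dorsal side, such as the attached PB units, carries information about the ventral-like layer; if the ventral contribution is zero, no error reaches the dorsal side at all.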

5.2 Conceptualization in visual perception

Visual conceptualization and perception are intertwined processes. As the experiments of Schyns and Oliva (1999) show, when the visual observation is not clear, the brain automatically extrapolates the visual percept and updates the categorization labels on various levels according to what has been gained from the visual field. On the other hand, this conceptualization also affects the immediate visual perception in a top-down predictive manner. For instance, the identity conceptualization of a human face predictively spreads to conceptualizations on other levels (e.g. facial emotion). This top-down process propagates from object identity to other local conceptualizations, such as object affordance, motion, edge detection and other processes at the early stages of visual processing. This can be tested with classic illusions, such as ‘the goblet illusion’, where perception depends largely on top-down knowledge derived from past experiences rather than on direct observation. This kind of illusion may be explained by the error in the first few time steps of the prediction experiment of our model. Therefore, our model to some extent also demonstrates the integrated process between conceptualization and spatio-temporal visual perception. This top-down predictive perception may also give rise to other vision-based predictive behaviors such as object permanence.

In particular, the PB units act as a high-level conceptualization representation, which is continuously updated with the partial sensory information perceived on a short time scale. The prediction process of the RNNPB is assisted by the conceptualized PB units of visual perception, which is equivalent to the integration of conceptualization and (predictive) visual perception. This is the reason why the PB units were not processed as a binary representation, as Ogata et al. (2007) did for human-robot interaction; the original values of the PB units are more accurate in generating the prediction of the next time step and in performing generalization tasks. As we mentioned, this model is merely a proof-of-concept model that bridges conceptualized visual streams and sensorimotor prediction. For more complex tasks, besides expanding the network size as we mentioned, more complex networks that are capable of extracting and predicting higher-level spatio-temporal structures (e.g. the predictive recurrent networks with large learning capacity by Tani and colleagues: Yamashita and Tani (2008); Murata et al. (2013)) can also be applied. It would be interesting to further investigate the functional modularity representation of these network models when they are also interconnected with horizontal products.

Furthermore, the neuroscience basis that supports this paper, in the context of the mirror neuron system based on object-oriented actions (grasping), is provided by ‘data-driven’ models such as MNS (Oztop and Arbib (2002)) and MNS2 (Bonaiuto et al. (2007); Bonaiuto and Arbib (2010)), although the main hypothesis of our model is not taken from the mirror neuron system theory. In the MNS review paper by Oztop et al. (2006), the action generation mode of the RNNPB model was considered to be excessive, as there is no evidence yet that the mirror neuron system participates in action generation. In our model, however, the generation mode, driven by the conceptualized PB units, plays a key role in the sensorimotor integration of object interaction. Nevertheless, the similar network architecture (RNNPB) used in modeling mirror neurons (Tani et al. (2004)) and in our pre-symbolic sensorimotor integration model may imply a close relationship between (pre-symbolic) language development, object-oriented actions, and the mirror neuron theory.

6 Conclusion

In this paper, a recurrent network architecture integrating the RNNPB model and the horizontal product model has been presented, which sheds light on the feasibility of linking the conceptualization of ventral/dorsal visual streams, the emergence of pre-symbolic communication, and the predictive sensorimotor system.

Based on the horizontal product model, the information in the dorsal and ventral streams is encoded separately in two network streams, and the predictions of both streams are brought together via the horizontal product, while the PB units act as a conceptualization of both streams. These PB units allow multiple sensory sequences to be stored. After training, the network is able to recognize the pre-learned conceptualized information and to predict the upcoming visual perception. The network also shows robustness and generalization abilities. Therefore, our approach offers preliminary concepts for a similar development of conceptualized language in pre-symbolic communication and, further, in infants’ sensorimotor-stage learning.


The authors thank Sven Magg, Cornelius Weber, Katja Kösters as well as reviewers (Matthew Schlesinger, Stefano Nolfi) for improvement of the paper, Erik Strahl for technical support in Hamburg and Torbjorn Dahl for the generous allowance to use the NAOs in Plymouth.


This research has been partly supported by the EU projects RobotDoC under 235065 ROBOT-DOC from the 7th Framework Programme (FP7), Marie Curie Action ITN, and KSERA funded from FP7 for Research and Technological Development under grant agreement n°2010-248085, POETICON++ under grant agreement 288382 and UK EPSRC project BABEL.


  • Bassano [2000] D. Bassano. Early development of nouns and verbs in French: Exploring the interface between lexicon and grammar. Journal of Child Language, 27(3):521–559, 2000.
  • Bergmann and von der Malsburg [2011] U. Bergmann and C. von der Malsburg. Self-organization of topographic bilinear networks for invariant recognition. Neural Computation, pages 1–28, 2011.
  • Bloom et al. [1993] L. Bloom, E. Tinker, and C. Margulis. The words children learn: Evidence against a noun bias in early vocabularies. Cognitive Development, 8(4):431–450, 1993.
  • Bonaiuto and Arbib [2010] J. Bonaiuto and M.A. Arbib. Extending the mirror neuron system model, II: what did I just do? a new role for mirror neurons. Biological cybernetics, 102(4):341–359, 2010.
  • Bonaiuto et al. [2007] J. Bonaiuto, E. Rosta, and M. Arbib. Extending the mirror neuron system model, I. Biological cybernetics, 96(1):9–38, 2007.
  • Bonniec [1985] P. Bonniec. From visual-motor anticipation to conceptualization: Reaction to solid and hollow objects and knowledge of the function of containment. Infant Behavior and Development, 8(4):413–424, 1985.
  • Bowerman and Levinson [2001] M. Bowerman and S.C. Levinson. Language acquisition and conceptual development, volume 3. Cambridge University Press, 2001.
  • Chemla et al. [2009] E. Chemla, T.H Mintz, S. Bernal, and A. Christophe. Categorizing words using ‘frequent frames’: what cross-linguistic analyses reveal about distributional acquisition strategies. Developmental Science, 12(3):396–406, 2009.
  • Corbetta and Snapp-Childs [2009] D. Corbetta and W. Snapp-Childs. Seeing and touching: the role of sensory-motor experience on the development of infant reaching. Infant Behavior and Development, 32(1):44–58, 2009.
  • Cuijpers et al. [2009] R. Cuijpers, F. Stuijt, and I. Sprinkhuizen-Kuyper. Generalisation of action sequences in RNNPB networks with mirror properties. In Proceedings of the European Symposium on Neural Networks (ESANN), 2009.
  • Damasio and Tranel [1993] A. R. Damasio and D. Tranel. Nouns and verbs are retrieved with differently distributed neural systems. Proceedings of the National Academy of Sciences, 90(11):4957–4960, 1993.
  • Fields [2011] C. Fields. Trajectory recognition as the basis for object individuation: a functional model of object file instantiation and object-token encoding. Frontiers in psychology, 2, 2011.
  • Földiák [1991] P. Földiák. Learning invariance from transformation sequences. Neural Computation, 3:194–200, 1991.
  • Gelman and Spelke [1981] R. Gelman and E.S. Spelke. The development of thoughts about animate and inanimate objects: Implications for research on social cognition. Social cognitive development: Frontiers and possible futures, pages 43–66, 1981.
  • Gentner [1982] D. Gentner. Why nouns are learned before verbs: Linguistic relativity versus natural partitioning. In Lawrence Erlbaum, editor, Language development, volume 2, pages 301–334. Hillsdale, N. J., 1982.
  • Gibson [1988] E.J. Gibson. Exploratory behavior in the development of perceiving, acting, and the acquiring of knowledge. Annual Review of Psychology, 1988.
  • Gruber and Olsen [1994] T.R. Gruber and G.R. Olsen. An ontology for engineering mathematics. KR, 94:258–269, 1994.
  • Habel and Tappe [1999] C. Habel and H. Tappe. Processes of segmentation and linearization in describing events. In Christiane von Stutterheim Ralf Klabunde, editor, Representations and Processes in Language Production. Deutscher Universitatsverlag, 1999.
  • Hollerbach [1982] J. M. Hollerbach. Computers, brains and the control of movement. Trends in Neurosciences, 5:189–192, 1982.
  • Jacobs et al. [1991] R.A. Jacobs, M.I Jordan, and A.G. Barto. Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cognitive Science, 15(2):219–250, 1991.
  • Kawato et al. [2003] M. Kawato, T. Kuroda, H. Imamizu, E. Nakano, S. Miyauchi, and T. Yoshioka. Internal forward models in the cerebellum: fMRI study on grip force and load force coupling. Progress in brain research, 142:171–188, 2003.
  • Kersten [1998] A. W. Kersten. An examination of the distinction between nouns and verbs: Associations with two different kinds of motion. Memory & cognition, 26(6):1214–1232, 1998.
  • Kleesiek et al. [2013] J. Kleesiek, S. Badde, S. Wermter, and A. K. Engel. Action-driven perception for a humanoid. In Agents and Artificial Intelligence, pages 83–99. Springer, 2013.
  • LeCun et al. [1998] Y. LeCun, L. Bottou, G. B. Orr, and K. Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–50. Springer, 1998.
  • Mandler [1992] J. M. Mandler. The foundations of conceptual thought in infancy. Cognitive Development, 7(3):273–285, 1992.
  • Mandler [1999] J. M. Mandler. Preverbal representation and language. Language and space, page 365, 1999.
  • Mareschal and Johnson [2003] D. Mareschal and M. H. Johnson. The what and where of object representations in infancy. Cognition, 88(3):259–276, 2003.
  • Mareschal et al. [1999] D. Mareschal, K. Plunkett, and P. Harris. A computational and neuropsychological account of object-oriented behaviours in infancy. Developmental Science, 2(3):306–317, 1999.
  • Memisevic and Hinton [2007] R. Memisevic and G. Hinton. Unsupervised learning of image transformations. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
  • Murata et al. [2013] S. Murata, J. Namikawa, H. Arie, S. Sugano, and J. Tani. Learning to reproduce fluctuating time series by inferring their time-dependent stochastic properties: Application in robot learning via tutoring. 2013.
  • Newman et al. [2001] C. Newman, J. Atkinson, and O. Braddick. The development of reaching and looking preferences in infants to objects of different sizes. Developmental Psychology, 37(4):561, 2001.
  • Ogata et al. [2007] T. Ogata, S. Matsumoto, J. Tani, K. Komatani, and H.G. Okuno. Human-robot cooperation using quasi-symbols generated by RNNPB model. In 2007 IEEE International Conference on Robotics and Automation, pages 2156–2161, 2007.
  • Olshausen et al. [1993] B. Olshausen, C.H. Anderson, and D.C. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. The Journal of Neuroscience, 13(11):4700–4719, 1993.
  • Oztop and Arbib [2002] E. Oztop and M.A. Arbib. Schema design and implementation of the grasp-related mirror neuron system. Biological cybernetics, 87(2):116–140, 2002.
  • Oztop et al. [2006] E. Oztop, M. Kawato, and M. Arbib. Mirror neurons and imitation: A computationally guided review. Neural Networks, 19(3):254–271, 2006.
  • Rocha et al. [2006] N. Rocha, F. Silva, and E. Tudella. The impact of object size and rigidity on infant reaching. Infant Behavior and Development, 29(2):251–261, 2006.
  • Romberg and Saffran [2010] A.R Romberg and J.R Saffran. Statistical learning and language acquisition. Wiley Interdisciplinary Reviews: Cognitive Science, 1(6):906–914, 2010.
  • Rosch [1973] E.H Rosch. Natural categories. Cognitive psychology, 4(3):328–350, 1973.
  • Ruff [1984] H. A. Ruff. Infants’ manipulative exploration of objects: Effects of age and object characteristics. Developmental Psychology, 20(1):9, 1984.
  • Schyns and Oliva [1999] P. G. Schyns and A. Oliva. Dr. Angry and Mr. Smile: When categorization flexibly modifies the perception of faces in rapid visual presentations. Cognition, 69(3):243–265, 1999.
  • Sloutsky and Robinson [2008] V. M Sloutsky and C. W Robinson. The role of words and sounds in infants’ visual processing: From overshadowing to attentional tuning. Cognitive Science, 32(2):342–365, 2008.
  • Smith et al. [2002] L.B Smith, S.S Jones, B. Landau, L. Gershkoff-Stowe, and L. Samuelson. Object name learning provides on-the-job training for attention. Psychological Science, 13(1):13–19, 2002.
  • Streri et al. [1993] A. Streri, T. Pownall, and S. Kingerlee. Seeing, reaching, touching: The relations between vision and touch in infancy. Harvester Wheatsheaf Oxford, 1993.
  • Tani et al. [2004] J. Tani, M. Ito, and Y. Sugita. Self-organization of distributedly represented multiple behavior schemata in a mirror system: reviews of robot experiments using RNNPB. Neural Networks, 17(8-9):1273–1289, 2004.
  • Tardif [1996] T. Tardif. Nouns are not always learned before verbs: Evidence from Mandarin speakers’ early vocabularies. Developmental Psychology, 32(3):492, 1996.
  • Tomasello and Farrar [1986] M. Tomasello and M. J. Farrar. Object permanence and relational words: A lexical training study. Journal of Child Language, 13(03):495–505, 1986.
  • Von Hofsten [1982] C. Von Hofsten. Eye–hand coordination in the newborn. Developmental psychology, 18(3):450, 1982.
  • von Stutterheim and Nuse [2003] C. von Stutterheim and R Nuse. Processes of conceptualization in language production: language-specific perspectives and event construal. Linguistics, 41(5; ISSU 387):851–882, 2003.
  • Wiskott and Sejnowski [2002] L. Wiskott and T.J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
  • Wolpert et al. [1995] D.M. Wolpert, Z. Ghahramani, and M. I. Jordan. An internal model for sensorimotor integration. Science, pages 1880–1880, 1995.
  • Yamashita and Tani [2008] Y. Yamashita and J. Tani. Emergence of functional hierarchy in a multiple timescale neural network model: a humanoid robot experiment. PLoS computational biology, 4(11):e1000220, 2008.
  • Yu [2008] C. Yu. A statistical associative account of vocabulary growth in early word learning. Language learning and Development, 4(1):32–62, 2008.
  • Zhong et al. [2012a] J. Zhong, C. Weber, and S. Wermter. Learning features and predictive transformation encoding based on a horizontal product model. Artificial Neural Networks and Machine Learning, ICANN, pages 539–546, 2012a.
  • Zhong et al. [2012b] J. Zhong, C. Weber, and S. Wermter. A predictive network architecture for a robust and smooth robot docking behavior. Paladyn, 3(4):172–180, 2012b.