Controlling Recurrent Neural Networks by Conceptors

03/13/2014 ∙ by Herbert Jaeger, et al. ∙ Jacobs University Bremen

The human brain is a dynamical system whose extremely complex sensor-driven neural processes give rise to conceptual, logical cognition. Understanding the interplay between nonlinear neural dynamics and concept-level cognition remains a major scientific challenge. Here I propose a mechanism of neurodynamical organization, called conceptors, which unites nonlinear dynamics with basic principles of conceptual abstraction and logic. It becomes possible to learn, store, abstract, focus, morph, generalize, de-noise and recognize a large number of dynamical patterns within a single neural system; novel patterns can be added without interfering with previously acquired ones; neural noise is automatically filtered. Conceptors help explain how conceptual-level information processing emerges naturally and robustly in neural systems, and remove a number of roadblocks in the theory and applications of recurrent neural networks.


1 Overview

Scientific context.

Research on brains and cognition unfolds in two directions. Top-down oriented research starts from the “higher” levels of cognitive performance, like rational reasoning, conceptual knowledge representation, or command of language. These phenomena are typically described in symbolic formalisms developed in mathematical logic, artificial intelligence (AI), computer science and linguistics. In the bottom-up direction, one departs from “low-level” sensor data processing and motor control, using the analytical tools offered by dynamical systems theory, signal processing and control theory, statistics and information theory. The human brain has obviously found a way to implement high-level logical reasoning on the basis of low-level neuro-dynamical processes. How this is possible, and how the top-down and bottom-up research directions can be united, has largely remained an open question despite long-standing efforts in neural networks research and computational neuroscience [80, 87, 33, 2, 36, 43], machine learning [35, 47], robotics [11, 81], artificial intelligence [83, 104, 8, 10], dynamical systems modeling of cognitive processes [94, 98, 105], cognitive science and linguistics [22, 96], and cognitive neuroscience [5, 26].

Summary of contribution.

Here I establish a fresh view on the neuro-symbolic integration problem. I show how dynamical neural activation patterns can be characterized by certain neural filters which I call conceptors. Conceptors derive naturally from the following key observation. When a recurrent neural network (RNN) is actively generating, or is passively being driven by, different dynamical patterns (say p_1, ..., p_K), its neural states populate different regions of neural state space. These regions are characteristic of the respective patterns. For these regions, neural filters (the conceptors) can be incrementally learnt. A conceptor representing a pattern p_j can then be invoked after learning to constrain the neural dynamics to the state region associated with p_j, and the network will select and re-generate pattern p_j. Learnt conceptors can be blended, combined by Boolean operations, specialized or abstracted in various ways, yielding novel patterns on the fly. Conceptors can be economically represented by single neurons (addressing patterns by neurons, leading to explicit command over pattern generation), or they may self-organize spontaneously and quickly upon the presentation of cue patterns (content-addressing, leading to pattern imitation). The logical operations on conceptors admit a rigorous semantical interpretation; conceptors can be arranged in conceptual hierarchies which are structured like the semantic networks known from artificial intelligence. Conceptors can also be employed to “allocate free memory space” when new patterns are learnt and stored in long-term memory, enabling incremental life-long learning without the danger of freshly learnt patterns disrupting already acquired ones. Conceptors are robust against neural noise and parameter variations. The basic mechanisms are generic and can be realized in any kind of dynamical neural network. All taken together, conceptors offer a principled, transparent, and computationally efficient account of how neural dynamics can self-organize into conceptual structures.

Going bottom-up: from neural dynamics to conceptors.

The neural model system in this report is the standard recurrent neural network (RNN, Figure 1 A), whose dynamics is mathematically described by the state update equations

x(n+1) = tanh(W x(n) + W^in p(n+1)),    y(n) = W^out x(n).

Time here progresses in unit steps n = 1, 2, .... The network consists of N neurons (typically on the order of a hundred in this report), whose activations at time n are collected in an N-dimensional state vector x(n). The neurons are linked by random synaptic connections, whose strengths are collected in a weight matrix W of size N × N. An input signal p(n) is fed to the network through synaptic input connections assembled in the input weight matrix W^in. The “S-shaped” function tanh squashes the neuronal activation values into a range between −1 and 1. The second equation specifies that an output signal y(n) can be read from the network activation state by means of output weights W^out. These weights are pre-computed such that the output signal y(n) just repeats the input signal p(n). The output signal plays no functional role in what follows; it merely serves as a convenient 1-dimensional observer of the high-dimensional network dynamics.
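As a concrete illustration, the update equations can be simulated in a few lines. All numerical choices below (network size, weight scalings, the sinewave input) are illustrative assumptions, not the settings used in this report:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 100                                        # number of reservoir neurons
W = rng.normal(0.0, 1.5 / np.sqrt(N), (N, N))  # random recurrent weights (illustrative scaling)
W_in = rng.normal(0.0, 1.0, N)                 # input weights for a 1-dimensional input

def step(x, p):
    """One state update: x(n+1) = tanh(W x(n) + W_in p(n+1))."""
    return np.tanh(W @ x + W_in * p)

# Drive the reservoir with a sinewave input and collect the states x(n).
x = np.zeros(N)
states = []
for n in range(200):
    x = step(x, np.sin(2 * np.pi * n / 10))
    states.append(x)
states = np.array(states)                      # 200 x N matrix of reservoir states
```

A linear readout W^out would then be computed from `states` by linear regression so that y(n) reproduces p(n).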

The network-internal neuron-to-neuron connections are created at random. This leads to the existence of cyclic (“recurrent”) connection pathways inside the network, along which neural activation can reverberate. The network can therefore autonomously generate complex neurodynamical patterns even when it receives no input. Following the terminology of the reservoir computing field [56, 6], I refer to such randomly connected neural networks as reservoirs.

Figure 1: Deriving conceptors from network dynamics. A. Network layout. Arrows indicate synaptic links. B. Driving the reservoir with four different input patterns. Left panels: 20 timesteps of input pattern p(n) (black thin line) and conceptor-controlled output y(n) (bold light gray). Second column: 20 timesteps of the traces of two randomly picked reservoir neurons. Third column: the singular values σ_i of the reservoir state correlation matrix R in logarithmic scale. Last column: the singular values s_i of the conceptor C in linear plotting scale. C. From pattern to conceptor. Left: plots of value pairs (dots) of the two neurons shown in the first row of B, and the resulting ellipse with axis lengths σ_1, σ_2. Right: from the ellipse (thin light gray) to the conceptor ellipse (bold dark gray) by normalizing the axis lengths σ_i to s_i.

For the sake of introducing conceptors by way of an example, consider a reservoir with N = 100 neurons. I drive this system with a simple sinewave input p(n) (first panel in first row in Fig. 1 B). The reservoir becomes entrained to this input, each neuron showing individual variations thereof (Fig. 1 B second panel). The resulting reservoir state sequence x(1), x(2), ... can be represented as a cloud of points in the 100-dimensional reservoir state space. The dots in the first panel of Fig. 1 C show a 2-dimensional projection of this point cloud. By a statistical method known as principal component analysis, the shape of this point cloud can be captured by an N-dimensional ellipsoid whose main axes point in the main scattering directions of the point cloud. This ellipsoid is a geometrical representation of the correlation matrix R of the state points. The lengths σ_1, ..., σ_N of the ellipsoid axes are known as the singular values of R. The directions and lengths of these axes provide a succinct characterization of the geometry of the state point cloud. The lengths σ_i resulting in this example are log-plotted in Fig. 1 B, third column, revealing an exponential fall-off in this case.

As a next step, these lengths σ_i are normalized to become s_i = σ_i / (σ_i + α^{-2}), where α ≥ 0 is a design parameter that I call aperture. This normalization ensures that all s_i are not larger than 1 (last column in Fig. 1 B). A new ellipsoid is obtained (Fig. 1 C right) which is located inside the unit sphere. The normalized ellipsoid can be described by an N × N matrix C, which I call a conceptor matrix. C can be directly expressed in terms of R by C = R (R + α^{-2} I)^{-1}, where I is the N × N identity matrix.

When a different driving pattern p is used, the shape of the state point cloud, and subsequently the conceptor matrix C, will be characteristically different. In the example, I drove the reservoir with four patterns p_1, ..., p_4 (rows in Fig. 1 B). The first two patterns were sines of slightly different frequencies, the last two patterns were minor variations of a 5-periodic random pattern. The conceptors derived from the two sine patterns differ considerably from the conceptors induced by the two 5-periodic patterns (last column in Fig. 1 B). Within each of these two pairs, the conceptor differences are too small to become visible in the plots.

There is an instructive alternative way to define conceptors. Given a sequence of reservoir states x(1), ..., x(L), the conceptor C which characterizes this state point cloud is the unique matrix which minimizes the cost function E[ ||x(n) − C x(n)||^2 ] + α^{-2} ||C||^2_fro, where ||C||^2_fro is the sum of all squared matrix entries and the expectation is taken over the collected states. The first term in this cost would become minimal if C were the identity map, the second term would become minimal if C were the all-zero map. The aperture α strikes a balance between these two competing cost components. For increasing apertures, C will tend toward the identity matrix I; for shrinking apertures it will come out closer to the zero matrix. In the terminology of machine learning, C is hereby defined as a regularized identity map. The explicit solution to this minimization problem is again given by the formula C = R (R + α^{-2} I)^{-1}.

Summing up: if a reservoir is driven by a pattern p, a conceptor matrix C can be obtained from the driven reservoir states x(n) as the regularized identity map on these states. C can likewise be seen as a normalized ellipsoid characterization of the shape of the state point cloud. I write C(p, α) to denote a conceptor derived from a pattern p using aperture α, or C(R, α) to denote that C was obtained from a state correlation matrix R.
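A minimal sketch of this computation, assuming plain NumPy and a toy state matrix (the diagonal scaling below simply creates a point cloud with an elliptical shape):

```python
import numpy as np

def conceptor(X, aperture):
    """Conceptor C = R (R + aperture^-2 I)^-1 from a state matrix X
    whose rows are reservoir states x(n)."""
    N = X.shape[1]
    R = X.T @ X / X.shape[0]                      # state correlation matrix R
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(N))

rng = np.random.default_rng(0)
# Toy "reservoir states": an elliptical point cloud in 5 dimensions.
X = rng.normal(size=(500, 5)) @ np.diag([3.0, 1.0, 0.3, 0.1, 0.01])
C = conceptor(X, aperture=1.0)

# All singular values of C lie in [0, 1): the ellipsoid sits inside the unit sphere.
s = np.linalg.svd(C, compute_uv=False)
assert np.all((s >= 0) & (s < 1))

def cost(M, aperture=1.0):
    """E[||x - Mx||^2] + aperture^-2 ||M||_fro^2 over the collected states."""
    return np.mean(np.sum((X - X @ M.T) ** 2, axis=1)) + aperture ** -2 * np.sum(M ** 2)

# C should be the minimizer: nudging it in a random direction never lowers the cost.
D = 1e-3 * rng.normal(size=C.shape)
assert cost(C) <= cost(C + D) and cost(C) <= cost(C - D)
```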

Loading a reservoir.

With the aid of conceptors a reservoir can re-generate a number of different patterns that it has previously been driven with. For this to work, these patterns have to be learnt by the reservoir in a special sense, which I call loading a reservoir with patterns. The loading procedure works as follows. First, drive the reservoir with the patterns p_j in turn, collecting reservoir states x_j(n) (where j = 1, ..., K). Then, recompute the reservoir connection weights W into W* such that W* optimally balances between the following two goals. First, W* should be such that W* x_j(n) ≈ W x_j(n) + W^in p_j(n+1) for all times n and patterns j. That is, W* should allow the reservoir to “simulate” the driving input in the absence of the same. Second, W* should be such that the weights collected in this matrix become as small as possible. Technically this compromise-seeking learning task amounts to computing what is known as a regularized linear regression, a standard and simple computational task. This idea of “internalizing” a driven dynamics into a reservoir has been independently (re-)introduced under different names and for a variety of purposes (self-prediction [72], equilibration [55], reservoir regularization [90], self-sensing networks [100], innate training [61]) and appears to be a fundamental RNN adaptation principle.
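The loading step amounts to a ridge regression for the new weight matrix. A sketch, under the assumption that the pre-activation drives W x(n) + W^in p(n+1) have been collected as regression targets (the function and variable names are mine):

```python
import numpy as np

def load_reservoir(states, targets, rho=1e-2):
    """Ridge regression for the loaded weights W*:
    find W* minimizing ||states @ W*.T - targets||^2 + rho ||W*||^2,
    i.e. W* x(n) ~ W x(n) + W_in p(n+1) while keeping W* small.
    states:  L x N matrix of driven reservoir states x(n)
    targets: L x N matrix of pre-activation drives W x(n) + W_in p(n+1)
    rho:     regularization strength (the 'keep weights small' goal)."""
    N = states.shape[1]
    return np.linalg.solve(states.T @ states + rho * np.eye(N),
                           states.T @ targets).T

# Sanity check on synthetic data: with negligible regularization and
# full-rank states, the regression recovers an exact linear relationship.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
A = rng.normal(size=(10, 10))
W_star = load_reservoir(X, X @ A.T, rho=1e-8)
assert np.allclose(W_star, A, atol=1e-4)
```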

Going top-down: from conceptors to neural dynamics.

Assume that conceptors C_j have been derived for patterns p_j (j = 1, ..., K), and that these patterns have been loaded into the reservoir, replacing the original random weights W by W*. Intuitively, the loaded reservoir, when it is run using x(n+1) = tanh(W* x(n)) (no input!), should behave exactly as when it was driven with input earlier, because W* has been trained such that W* x(n) ≈ W x(n) + W^in p(n+1). In fact, if only a single pattern had been loaded, the loaded reservoir would readily re-generate it. But if more than one pattern has been loaded, the autonomous (input-free) update will lead to an entirely unpredictable dynamics: the network can’t “decide” which of the loaded patterns it should re-generate! This is where conceptors come in. The reservoir dynamics is filtered through C_j. This is effected by using the augmented update rule x(n+1) = C_j tanh(W* x(n)). By virtue of inserting C_j into the feedback loop, the reservoir states become clipped to fall within the ellipsoid associated with C_j. As a result, the pattern p_j will be re-generated: when the reservoir is observed through the previously trained output weights, one gets y(n) = W^out x(n) ≈ p_j(n). The first column of panels in Fig. 1 B shows an overlay of the four autonomously re-generated patterns with the original drivers used in that example. The recovery of the originals is quite accurate (mean square errors 3.3e-05, 1.4e-05, 0.0040, 0.0019 for the four loaded patterns). Note that the first two and the last two patterns are rather similar to each other. The filtering afforded by the respective conceptors is “sharp” enough to separate these twin pairs. I will later demonstrate that in this way a remarkably large number of patterns can be faithfully re-generated by a single reservoir.
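The conceptor-controlled autonomous run is a one-line loop. A sketch (the zero-conceptor and identity-conceptor cases below are sanity checks, not experiments from this report):

```python
import numpy as np

def run_with_conceptor(W_star, C, x0, steps):
    """Autonomous conceptor-controlled dynamics:
    x(n+1) = C tanh(W* x(n)), no external input."""
    x = x0
    xs = []
    for _ in range(steps):
        x = C @ np.tanh(W_star @ x)
        xs.append(x)
    return np.array(xs)

rng = np.random.default_rng(3)
N = 20
W_star = rng.normal(0.0, 0.5, (N, N))
x0 = rng.normal(size=N)

# C = 0 clips the state to the origin; C = I leaves the dynamics unconstrained.
assert np.allclose(run_with_conceptor(W_star, np.zeros((N, N)), x0, 5), 0.0)
assert run_with_conceptor(W_star, np.eye(N), x0, 5).shape == (5, N)
```

The re-generated pattern is then read out as y(n) = W^out x(n).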

Morphing and generalization.

Given a reservoir loaded with patterns p_1, ..., p_K, the associated conceptors C_1, ..., C_K can be linearly combined by creating mixture conceptors C = μ_1 C_1 + ... + μ_K C_K, where the mixing coefficients μ_j must sum to 1. When the reservoir is run under the control of such a morphed conceptor C, the resulting generated pattern is a morph between the original “pure” patterns p_j. If all μ_j are non-negative, the morph can be considered an interpolation between the pure patterns; if some μ_j are negative, the morph extrapolates beyond the loaded pure patterns. I demonstrate this with the four patterns used in the example above, sweeping two mixing parameters in regular increments across a range that extends beyond the interpolation interval. Fig. 2 shows plots of the observer signals y(n) obtained when the reservoir is generating patterns under the control of these morphed conceptors. The innermost 5 by 5 panels show interpolations between the four pure patterns, all other panels show extrapolations.
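Morphing itself is just an affine combination of conceptor matrices. A sketch with two hypothetical diagonal conceptors:

```python
import numpy as np

def morph(conceptors, mu):
    """Mixture conceptor C = sum_j mu_j C_j, coefficients summing to 1."""
    mu = np.asarray(mu, dtype=float)
    assert np.isclose(mu.sum(), 1.0), "mixing coefficients must sum to 1"
    return sum(m * C for m, C in zip(mu, conceptors))

C1 = np.diag([0.8, 0.2])     # hypothetical conceptors for two loaded patterns
C2 = np.diag([0.4, 0.6])

# Non-negative coefficients interpolate ...
assert np.allclose(morph([C1, C2], [0.5, 0.5]), np.diag([0.6, 0.4]))
# ... a negative coefficient extrapolates beyond the loaded patterns.
assert np.allclose(morph([C1, C2], [1.5, -0.5]), np.diag([1.0, 0.0]))
```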

In machine learning terms, both interpolation and extrapolation are cases of generalization. A standard opinion in the field states that generalization by interpolation is what one may expect from learning algorithms, while extrapolation beyond the training data is hard to achieve.

Morphing and generalizing dynamical patterns is a common but nontrivial task for training motor patterns in robots. It typically requires training demonstrations of numerous interpolating patterns [88, 17, 68]. Conceptor-based pattern morphing appears promising for flexible robot motor pattern learning from a very small number of demonstrations.

Figure 2: Morphing between, and generalizing beyond, four loaded patterns. Each panel shows a 15-step autonomously generated pattern (same plot range in all panels). Panels with bold frames: the four loaded prototype patterns (same patterns as in Fig. 1 B).
Aperture adaptation.

Choosing the aperture α appropriately is crucial for re-generating patterns in a stable and accurate way. To demonstrate this, I loaded a 500-neuron reservoir with signals derived from four classical chaotic attractors: the Lorenz, Rössler, Mackey-Glass, and Hénon attractors. Note that it used to be a challenging task to make an RNN learn any single one of these attractors [56]; to my knowledge, training a single RNN to generate several different chaotic attractors has not been attempted before. After loading the reservoir, the re-generation was tested using conceptors for which a number of different values of α were tried for each attractor pattern. Fig. 3 A shows the resulting re-generated patterns for five apertures for the Lorenz attractor. When the aperture is too small, the reservoir-conceptor feedback loop becomes too constrained and the produced patterns de-differentiate. When the aperture is too large, the feedback loop becomes over-excited.


Figure 3: Aperture adaptation for re-generating four chaotic attractors. A Lorenz attractor. Five versions re-generated with different apertures (values inserted in panels) and original attractor (gray background). B Best re-generations of the other three attractors (from left to right: Rössler, Mackey-Glass, and Hénon, originals on gray background). C Log10 of the attenuation criterion plotted against the log10 of aperture. Dots mark the apertures used for plots in A and B.

An optimal aperture can be found by experimentation, but this will not be an option in many engineering applications or in biological neural systems. An intrinsic criterion for optimizing α is afforded by a quantity that I call attenuation: the damping ratio which the conceptor imposes on the reservoir signal. Fig. 3 C plots the attenuation against the aperture for the four chaotic signals. The minimum of this curve marks a good aperture value: when the conceptor dampens out a minimal fraction of the reservoir signal, conceptor and reservoir are in good “resonance”. The chaotic attractor re-generations shown in Fig. 3 B were obtained by using this minimum-attenuation criterion.

The aperture range which yields visibly good attractor re-generations in this demonstration spans about one order of magnitude. With further refinements (zeroing small singular values in conceptors is particularly effective), the viable aperture range can be expanded to about three orders of magnitude. While setting the aperture right is generally important, fine-tuning is unnecessary.

Boolean operations and conceptor abstraction.

Assume that a reservoir is driven by a pattern r(n) which consists of randomly alternating epochs of two patterns p(n) and q(n). If one doesn’t know which of the two patterns is active at a given time, all one can say is that the pattern currently is p OR it is q. Let C(R_p, α), C(R_q, α), C(R_r, α) be conceptors derived from the two partial patterns and the “OR” pattern r, respectively. Then it holds that R_r ≈ (R_p + R_q)/2. Dropping the division by 2, this motivates defining an OR (mathematical notation: ∨) operation on conceptors by putting C(R_p, α) ∨ C(R_q, α) := C(R_p + R_q, α). The logical operations NOT (¬) and AND (∧) can be defined along similar lines. Fig. 4 shows two-dimensional examples of applying the three operations.

Figure 4: Boolean operations on conceptors. Red/blue (thin) ellipses represent source conceptors C_1, C_2. Magenta (thick) ellipses show the results of OR, AND, and NOT (from left to right).

Boolean logic is the mathematical theory of ∨, ∧, and ¬. Many laws of Boolean logic also hold for the operations on conceptors: the laws of associativity, commutativity, double negation, de Morgan’s rules, and some absorption rules. Furthermore, numerous simple laws connect aperture adaptation to Boolean operations. Last but not least, by defining C_1 ≤ C_2 if and only if there exists a conceptor B such that C_2 = C_1 ∨ B, an abstraction ordering is created on the set of all conceptors of dimension N.
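For invertible conceptors, the three operations have closed matrix forms. A sketch under that invertibility assumption (the general definitions also cover singular conceptors via pseudo-inverses, which this sketch omits):

```python
import numpy as np

def NOT(C):
    """not-C = I - C."""
    return np.eye(len(C)) - C

def AND(C1, C2):
    """C1 and C2 = (C1^-1 + C2^-1 - I)^-1, assuming C1 and C2 are invertible."""
    I = np.eye(len(C1))
    return np.linalg.inv(np.linalg.inv(C1) + np.linalg.inv(C2) - I)

def OR(C1, C2):
    """C1 or C2 via de Morgan: NOT(NOT(C1) AND NOT(C2))."""
    return NOT(AND(NOT(C1), NOT(C2)))

C1 = np.diag([0.9, 0.3])
C2 = np.diag([0.5, 0.7])
assert np.allclose(NOT(NOT(C1)), C1)          # double negation
assert np.allclose(AND(C1, C2), AND(C2, C1))  # commutativity of AND
assert np.allclose(OR(C1, C2), OR(C2, C1))    # commutativity of OR
```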

Neural memory management.

Boolean conceptor operations afford unprecedented flexibility for organizing and controlling the nonlinear dynamics of recurrent neural networks. Here I demonstrate how a sequence of patterns p_1, p_2, ... can be incrementally loaded into a reservoir, such that (i) loading a new pattern p_{j+1} does not interfere with previously loaded patterns p_1, ..., p_j; (ii) if a new pattern is similar to already loaded ones, the redundancies are automatically detected and exploited, saving memory capacity; (iii) the amount of still “free” memory space can be logged.

Let C_j be the conceptor associated with pattern p_j. Three ideas are combined to implement the memory management scheme. First, keep track of the “already used” memory space by maintaining a conceptor A_j = C_1 ∨ ... ∨ C_j. The sum of all singular values of A_j, divided by the reservoir size N, gives a number that ranges between 0 and 1. It is an indicator of the portion of reservoir “space” which has been used up by loading p_1, ..., p_j, and I call it the quota claimed by A_j. Second, characterize what is “new” about p_{j+1} (not being already represented by previously loaded patterns) by considering the logical difference between C_{j+1} and A_j. The logical difference operator can be re-written as C_{j+1} ∧ ¬A_j. Third, load only that which is new about p_{j+1} into the still unclaimed reservoir space, that is, into ¬A_j. These three ideas can be straightforwardly turned into a modification of the basic pattern loading algorithm.
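The quota bookkeeping is a one-liner over singular values. A sketch with a hypothetical conceptor that has claimed two of four state directions:

```python
import numpy as np

def quota(A):
    """Fraction of reservoir 'space' claimed by conceptor A:
    sum of singular values divided by the reservoir size N."""
    return np.linalg.svd(A, compute_uv=False).sum() / len(A)

A = np.diag([1.0, 1.0, 0.0, 0.0])   # two of four directions already claimed
assert np.isclose(quota(A), 0.5)
free = np.eye(4) - A                # the still unclaimed space, NOT(A)
assert np.isclose(quota(free), 0.5)
```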

For a demonstration, I created a series of periodic patterns whose integer period lengths were picked randomly between 3 and 15, some of these patterns being sines, others random patterns. These patterns were incrementally loaded into a 100-neuron reservoir, one by one. Fig. 5 shows the result. The “used space” panels monitor the successive filling-up of reservoir space. Since some of the later patterns were identical replicas of earlier ones, no additional space was consumed when these duplicates were (re-)loaded. The “driver and y” panels document the accuracy of autonomously re-generating patterns using the conceptors C_j. Accuracy was measured by the normalized root mean square error (NRMSE), a standard criterion for comparing the similarity between two signals. The NRMSE jumps from very small values to a high value when the last pattern is loaded; the quota of 0.98 at this point indicates that the reservoir is “full”. The re-generation testing and NRMSE computation was done after all patterns had been loaded. An attempt to load further patterns would be unsuccessful, but it also would not harm the re-generation quality of the already loaded ones.

Figure 5: Incremental pattern storing in a neural memory. Each panel shows a 20-timestep sample of the correct training pattern (black line) overlaid on its reproduction (green line). The memory fraction used up until the current pattern is indicated by the panel fraction filled in red; the quota value is printed in the left bottom corner of each panel.

This ability to load patterns incrementally solves a notorious problem in neural network training, known as catastrophic forgetting, which manifests itself in a disruption of previously learnt functionality when learning new functionality. Although a number of proposals have been made which partially alleviate the problem in special circumstances [32, 42], catastrophic forgetting was still listed as an open challenge in an expert’s report solicited by the NSF in 2007 [21] which collected the main future challenges in learning theory.

Recognizing dynamical patterns.

Boolean conceptor operations enable the combination of positive and negative evidence in a neural architecture for dynamical pattern recognition. For a demonstration I use a common benchmark, the Japanese vowel recognition task [60]. The data of this benchmark consist of preprocessed audio recordings of nine male native speakers pronouncing the Japanese di-vowel /ae/. The training data consist of 30 recordings per speaker, the test data consist of altogether 370 recordings, and the task is to train a recognizer which has to recognize the speakers of the test recordings. This kind of data differs from the periodic or chaotic patterns that I have been using so far, in that the patterns are non-stationary (changing in their structure from beginning to end), multi-dimensional (each recording consisting of 12 frequency band signals), stochastic, and of finite duration. This example thus also demonstrates that conceptors can be put to work with data other than single-channel stationary patterns.

A small (10 neurons) reservoir was created. It was driven with all training recordings from each speaker j in turn (j = 1, ..., 9), collecting reservoir response signals, from which a conceptor C_j characteristic of speaker j was computed. In addition, for each speaker j, a conceptor N_j = ¬(C_1 ∨ ... ∨ C_{j−1} ∨ C_{j+1} ∨ ... ∨ C_9) was computed. N_j characterizes the condition “this speaker is not any of the other eight speakers”. Patterns need not be loaded into the reservoir for this application, because they need not be re-generated.

In testing, a recording from the test set was fed to the reservoir, collecting a reservoir response signal z(n). For each of the conceptors C_j, a positive evidence E+(j) was computed. E+(j) is a non-negative number indicating how well the signal z(n) fits into the ellipsoid of C_j. Likewise, the negative evidence E−(j) that the sample was not uttered by any of the eight speakers other than speaker j was computed from N_j. Finally, the combined evidence E(j) = E+(j) + E−(j) was computed. This gave nine combined evidences E(1), ..., E(9). The test recording was then classified by choosing the speaker index j whose combined evidence E(j) was the greatest among the nine collected evidences.
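One simple way to realize such evidences, assuming the quadratic form z' C z as the fit measure (a plausible reading of "how well the signal fits into the ellipsoid", not necessarily this report's exact formula):

```python
import numpy as np

def evidence(Z, C):
    """Sum of z(n)' C z(n) over the response states Z (rows are z(n)):
    large when the states fall inside C's ellipsoid."""
    return np.einsum('ni,ij,nj->', Z, C, Z)

def classify(Z, Cs, Ns):
    """Combined evidence E(j) = E+(j) + E-(j); return the argmax speaker index."""
    return int(np.argmax([evidence(Z, C) + evidence(Z, Nj)
                          for C, Nj in zip(Cs, Ns)]))

# Hypothetical 2-dimensional toy: two 'speakers' with complementary conceptors.
C0, C1 = np.diag([0.9, 0.1]), np.diag([0.1, 0.9])
Ns = [np.eye(2) - C1, np.eye(2) - C0]     # N_j = NOT of the other speaker's conceptor
Z = np.array([[1.0, 0.0], [1.0, 0.0]])    # states aligned with speaker 0's ellipsoid
assert classify(Z, [C0, C1], Ns) == 0
```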

In order to check for the impact of the random selection of the underlying reservoir, this whole procedure was repeated 50 times, using a freshly created random reservoir in each trial. Averaged over these 50 trials, the number of test misclassifications was 3.4. Had the classification been based solely on the positive or the negative evidences, the average test misclassification numbers would have been 8.4 and 5.9, respectively. The combination of positive and negative evidence, which was enabled by Boolean operations, was crucial.

State-of-the-art machine learning methods achieve between 4 and 10 misclassifications on the test set (for instance [91, 97, 79, 15]). The Boolean-logic-conceptor-based classifier thus compares favorably with existing methods in terms of classification performance. The method is computationally cheap, with the entire learning procedure taking only a fraction of a second on a standard notebook computer. The most distinctive benefit, however, is incremental extensibility. If new training data become available, or if a new speaker were incorporated into the recognition repertoire, the additional training could be done using only the new data, without having to re-run previous training data. This feature is highly relevant in engineering applications and in cognitive modeling, and is missing from almost all state-of-the-art classification methods.

Autoconceptors and content-addressable memories.

So far I have been describing examples where conceptors associated with patterns were computed at training time, to be later plugged in to re-generate or classify patterns. A conceptor matrix has the same size as the reservoir connection matrix W. Storing conceptor matrices means storing network-sized objects, which is implausible from the perspective of biological modeling. Here I describe how conceptors can be created on the fly, without having to store them, leading to content-addressable neural memories.

If the system has no pre-computed conceptors at its disposal, loaded patterns can still be re-generated in a two-stage process. First, the target pattern p is selected by driving the system with a brief initial “cueing” presentation of the pattern (possibly in a noisy version). During this phase, a preliminary conceptor C_cue is created by an online adaptation process. This preliminary C_cue already enables the system to re-generate an imperfect version of the pattern p. Second, after the cueing phase has ended, the system continues to run in an autonomous mode (no external cue signal), initially using C_cue, to continuously generate a pattern. While this process is running, the conceptor in the loop is continuously adapted by a simple online adaptation rule. This rule can be described in geometrical terms as “adapt the current conceptor such that its ellipsoid matches better the shape of the point cloud of the current reservoir state dynamics”. Under this rule one obtains a reliable convergence of the generated pattern toward a highly accurate replica of the target pattern that was given as a cue.


Figure 6: Content-addressable memory. A First three of five loaded patterns. Left panels show the leading 20 singular values of the conceptor directly after cueing (black) and at the end of auto-adaptation (gray). Right panels show an overlay of the original driver pattern (black, thin) and the reconstruction at the end of auto-adaptation (gray, thick). B Pattern reconstruction errors directly after cueing (black squares) and at the end of auto-adaptation (gray crosses). C Reconstruction error of loaded patterns (black) and novel patterns drawn from the same parametric family (gray) versus the number of loaded patterns, averaged over 5 repetitions of the entire experiment and 10 patterns per plotting point. Error bars indicate standard deviations. D Autoconceptor adaptation dynamics described as evolution toward a plane attractor (schematic).

Results of a demonstration are illustrated in Figure 6. A 200-neuron reservoir was loaded with 5 patterns consisting of a weighted sum of two irrational-period sines, sampled at integer timesteps. The weight ratio and the phaseshift were chosen at random; the patterns thus came from a family of patterns parametrized by two parameters. The cueing time was 30 timesteps, the free-running auto-adaptation time was 10,000 timesteps, leading to an auto-adapted conceptor at the end of this process. On average, the reconstruction error improved from about -0.4 (log10 NRMSE measured directly after the cueing) to -1.1 (at the end of auto-adaptation). It can be shown analytically that the auto-adaptation process pulls many singular values down to zero. This effect renders the combined reservoir-conceptor loop very robust against noise, because all noise components in the directions of the nulled singular values become completely suppressed. In fact, all results shown in Figure 6 were obtained with strong state noise (signal-to-noise ratio equal to 1) inserted into the reservoir during the post-cue auto-adaptation.

The system functions as a content-addressable memory (CAM): loaded items can be recalled by cueing them. The paradigmatic example of a neural CAM are auto-associative neural networks (AANNs), pioneered by Palm [80] and Hopfield [48]. In contrast to conceptor-based CAMs, which store and re-generate dynamical patterns, AANNs store and cue-recall static patterns. Furthermore, AANNs do not admit an incremental storing of new patterns, which is possible in conceptor-based CAMs. The latter thus represent an advance in neural CAMs in two fundamental aspects.

To further elucidate the properties of conceptor CAMs, I ran a suite of simulations where the same reservoir was loaded with increasing numbers k of patterns, chosen at random from the same 2-parametric family (Figure 6 C). After loading with k patterns, the reconstruction accuracy was measured at the end of the auto-adaptation. Not surprisingly, it deteriorated with increasing memory load (black line). In addition, I also cued the loaded reservoir with patterns that were not loaded, but were drawn from the same family. As one would expect, the reconstruction accuracy for these novel patterns was worse than for the loaded patterns – but only for small k. When the number of loaded patterns exceeded a certain threshold, recall accuracy became essentially equal for loaded and novel patterns. These findings can be explained in intuitive terms as follows. When few patterns are loaded, the network memorizes individual patterns by “rote learning”, and subsequently can recall these patterns better than other patterns from the family. When more patterns are loaded, the network learns a representation of the entire parametric class of patterns. I call this the class learning effect.

The class learning effect can be geometrically interpreted in terms of a plane attractor [24] arising in the space of conceptor matrices (Figure 6 D). The learnt parametric class of patterns is represented by a d-dimensional manifold M in this space, where d is the number of defining parameters for the pattern family (in our example, d = 2). The cueing procedure creates an initial conceptor in the vicinity of M, which is then attracted toward M by the auto-adaptation dynamics. While an in-depth analysis of this situation reveals that this picture is not mathematically correct in some detail, the plane attractor metaphor yields a good phenomenal description of conceptor CAM class learning.

Plane attractors have been invoked as an explanation for a number of biological phenomena, most prominently gaze direction control [24]. In such phenomena, points on the plane attractor correspond to static fixed points (for instance, a direction of gaze). In contrast, points on the conceptor manifold correspond to conceptors, which in turn define temporal patterns. Again, the conceptor framework “dynamifies” concepts that have previously been worked out for static patterns only.

Toward biological feasibility: random feature conceptors.

Several computations involved in adapting conceptor matrices are non-local and therefore biologically infeasible. It is however possible to approximate matrix conceptors with another mechanism which requires only local computations. The idea is to project the reservoir state, via random projection weights, into a random feature space populated by a large number of feature neurons; to execute the conceptor operations individually on each of these neurons by multiplying a conception weight into its state; and finally to project back to the reservoir by another set of random projection weights (Figure 7).

The original reservoir-internal random connection weights are replaced by a dyad of two random projections, first from the reservoir into the feature space and then back, and the original reservoir state segregates into a reservoir state proper and a random feature state. The conception weights assume the role of conceptors. They can be learnt and adapted by procedures directly analogous to the matrix conceptor case. What previously had to be non-local matrix computations now turns into local, one-dimensional (scalar) operations. These operations are biologically feasible in the modest sense that any information needed to adapt a synaptic weight is locally available at that synapse. All laws and constructions concerning Boolean operations and aperture carry over.
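To make the scalar nature of these operations concrete, here is a minimal sketch (in Python rather than the report's Matlab; all names are illustrative). It assumes that each feature neuron's conception weight is the one-dimensional analog of the conceptor expression, c = E / (E + aperture^-2), where E is that feature's mean signal energy:

```python
import math

def conception_weights(feature_signals, aperture):
    """Compute one scalar conception weight per feature neuron.

    For a feature with mean signal energy E[z^2], the weight is
    E / (E + aperture^-2): close to 1 for strongly excited features,
    close to 0 for nearly silent ones.
    """
    a2 = aperture ** (-2)
    weights = []
    for z in feature_signals:          # z: list of samples of one feature neuron
        energy = sum(v * v for v in z) / len(z)
        weights.append(energy / (energy + a2))
    return weights

# Two hypothetical feature neurons: one strongly excited, one nearly silent.
strong = [math.sin(0.3 * n) for n in range(200)]
silent = [0.01 * math.sin(0.3 * n) for n in range(200)]
c = conception_weights([strong, silent], aperture=10.0)
# The strongly excited feature gets a weight near 1, the silent one near 0.
```

Adapting each weight only requires the feature's own signal energy, which is what makes the computation local in the sense discussed above.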

Figure 7: Random feature conceptors. This neural architecture has two pools of neurons, the reservoir and the feature space.

A set of conception weights corresponding to a particular pattern can be neurally represented and “stored” in the form of the connections of a single neuron to the feature space. A dynamical pattern can thus be represented by a single neuron, which enables a highly compact neural representation of dynamical patterns. A machine learning application is presented below.

Using such random feature conceptors, I re-ran a selection of the simulations previously done with matrix conceptors, with a number of random features two to five times the reservoir size. The outcome of these simulations: the accuracy of pattern re-generation is essentially the same as with matrix conceptors, but setting the aperture is more sensitive.

A hierarchical classification and de-noising architecture.

Here I present a system which combines in a multi-layer neural architecture many of the items introduced so far. The input to this system is a (very) noisy signal which at a given time is being generated by one out of a number of possible candidate pattern generators. The task is to recognize the current generator, and simultaneously to re-generate a clean version of the noisy input pattern.


Figure 8: Simultaneous signal de-noising and classification. A. Schema of architecture. B. Simulation results. Panels from top: first three panels: hypothesis vectors in the three layers (one color per candidate pattern). Fourth panel: the two trust variables. Fifth panel: signal reconstruction errors (log10 NRMSE) of the three layers versus the clean signal; black line: linear baseline filter. Bottom panels: 20-step samples from the end of the two presentation periods. Red: noisy input; black: clean input; thick gray: cleaned output signal.

I explain the architecture with an example. It uses three processing layers to de-noise an input signal which is one of the four patterns used before in this report (shown for instance in Figure 1 B). The architecture implements the following design principles (Figure 8 A). (i) Each layer is a random feature based conceptor system (as in Figure 7 B). The four patterns are initially loaded into each of the layers, and four prototype conceptor weight vectors corresponding to the patterns are computed and stored. (ii) In a bottom-up processing pathway, the noisy external input signal is stagewise de-noised from layer to layer, such that the top layer produces a highly cleaned-up version of the input. (iii) The top layer auto-adapts a conceptor which is constrained to be a weighted OR combination of the four prototype conceptors. The four weights sum to one and form a hypothesis vector expressing the system’s current belief about the current driver. If one of these weights approaches 1, the system has settled on a firm classification of the current driving pattern. (iv) In a top-down pathway, each layer passes its conceptor down to the layer below. Because higher layers should have a clearer conception of the current noisy driver pattern than lower layers, this passing-down of conceptors “primes” the processing of the receiving layer with valuable contextual information. (v) Between each pair of adjacent layers, a trust variable is adapted by an online procedure. These trust variables range between 0 and 1. A value of 1 indicates maximal confidence that the de-noised signal in the higher layer comes closer to the clean driver than the signal in the layer below, that is, that the stage-wise de-noising actually functions well at this stage. The trust evolves by comparing certain noise ratios that are observable locally in the two layers.
(vi) Within each layer, an internal auto-adaptation process generates a candidate de-noised signal and a candidate local autoconceptor. The local signal estimate is linearly mixed with the signal arriving from the layer below, with the trust setting the mixing rate; the mixture is the effective signal input to the layer. If the trust reaches its maximal value of 1, the layer will ignore the signal from below and work entirely by self-generating a pattern. (vii) In a similar way, the effective conceptor in a layer is a trust-negotiated mixture of the locally adapted autoconceptor and the conceptor passed down from above. Thus, if the trust is maximal, the layer will be governed entirely by the passed-down conceptor.

Summarizing, the higher the trusts inside the hierarchy, the more the system auto-generates conceptor-shaped signals; conversely, at low trust values the system is strongly permeated from below by the outside driver. If the trust variables reach their maximum value of 1, the system runs in a pure “confabulation” mode and generates an entirely noise-free signal – at the risk of doing so under an entirely misguided hypothesis. The key to making this architecture work thus lies in the trust variables. It seems to me that maintaining a measure of trust (or call it confidence, certainty, etc.) is an intrinsically necessary component in any signal processing architecture which hosts a top-down pathway of guiding hypotheses (or call them context, priors, bias, etc.).
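The trust-negotiated mixing of principles (vi) and (vii) is, at its core, a convex combination. A toy sketch (Python, with invented signal values; the report's actual architecture adapts the trust online rather than setting it by hand):

```python
def trust_mix(bottom_up, local_estimate, trust):
    """Blend the signal arriving from the layer below with the layer's
    own self-generated estimate. trust = 0 passes the external signal
    through unchanged; trust = 1 means pure self-generation."""
    assert 0.0 <= trust <= 1.0
    return [(1.0 - trust) * b + trust * l
            for b, l in zip(bottom_up, local_estimate)]

noisy = [0.9, -1.1, 0.2]   # signal arriving from the layer below
clean = [1.0, -1.0, 0.0]   # layer's own conceptor-shaped estimate

assert trust_mix(noisy, clean, 0.0) == noisy   # no trust: input passes through
assert trust_mix(noisy, clean, 1.0) == clean   # full trust: "confabulation" mode
```

The same combination applies to the conceptors themselves, which is what lets a single scalar per layer pair arbitrate between bottom-up evidence and top-down hypothesis.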

Figure 8 B shows an excerpt from a simulation run. The system was driven first by an initial 4000-step period of one pattern, followed by 4000 steps of another. The signal-to-noise ratio was 0.5 (noise twice as strong as signal). The system successfully settles on the right hypothesis (top panel) and generates very clean de-noised signal versions (bottom panel). The crucial item in this figure is the development of the trust variables. At the beginning of each 4000-step period they briefly drop, allowing the external signal to permeate upwards through the layers, thus informing the local auto-adaptation loops about “what is going on outside”. After these initial drops the trust rises to almost 1, indicating that the system firmly “believes” to have detected the right pattern. It then generates pattern versions that have almost no mix-in from the noisy external driver.

As a baseline comparison I also trained a standard linear transversal filter which computed each de-noised input pattern point from a window of preceding input values. The filter length was set equal to the number of trainable parameters in the neural architecture. The performance of this linear de-noising filter (black line in Figure 8) is inferior to the architecture’s performance both in terms of accuracy and response time.

It is widely believed that top-down hypothesis-passing through a processing hierarchy plays a fundamental role in biological cognitive systems [33, 16]. However, the current best artificial pattern recognition systems [39, 59] use purely bottom-up processing – leaving room for further improvement by including top-down guidance. A few hierarchical architectures which exploit top-down hypothesis-passing have been proposed [33, 47, 43, 35]. All of these are designed for recognizing static patterns, especially images. The conceptor-based architecture presented here appears to be the first hierarchical system which targets dynamical patterns and uses top-down hypothesis-passing. Furthermore, in contrast to state-of-the-art pattern recognizers, it admits an incremental extension of the pattern repertoire.

Intrinsic conceptor logic.

In mathematical logic the semantics (“meaning”) of a symbol or operator is formalized as its extension. For instance, the symbol cow in a logic-based knowledge representation system in AI is semantically interpreted by the set of all (physical) cows, and the OR-operator is interpreted as set union: cow OR horse would refer to the set comprising all cows and horses. Similarly, in cognitive science, concepts semantically refer to their extensions, usually called categories in this context [74]. Both in mathematical logic and cognitive science, extensions need not be confined to physical objects; the modeler may also define extensions in terms of mathematical structures, sensory perceptions, hypothetical worlds, ideas or facts. But at any rate, there is an ontological difference between the two ends of the semantic relationship.

This ontological gap dissolves in the case of conceptors. The natural account of the “meaning” of a matrix conceptor is the shape of the neural state cloud it is derived from. This shape is given by the correlation matrix of neural states. Both the conceptor and this correlation matrix have the same mathematical format: positive semi-definite matrices of identical dimension. Figure 9 visualizes the difference between the classical extensional semantics of logic and the system-internal conceptor semantics. The symbol ⊨ is the standard mathematical notation for the semantic meaning relationship.

Figure 9: Contrasting the extensional semantics of classical knowledge representation formalisms (upper half of graphics) with conceptor semantics (lower half).

I have cast these intuitions into a formal specification of an intrinsic conceptor logic (ICL), where the semantic relationship outlined above is formalized within the framework of institutions [37]. This framework has been developed in mathematics and computer science to provide a unified view on the multitude of existing “logics”. By formalizing ICL as an institution, conceptor logic can be rigorously compared to other existing logics. I highlight two findings. First, an ICL cast as an institution is a dynamical system in its own right: the symbols used in this logic evolve over time. This is very different from traditional views on logic, where symbols are static tokens. Second, it turns out that ICL is a decidable logic. Stated in intuitive terms, in a decidable logic it can be calculated whether one concept subsumes another (as in “a cow is an animal”). Deciding concept subsumption is a core task in AI systems and human cognition. In most logic-based AI systems, deciding concept subsumption can become computationally expensive or even impossible. In ICL it boils down to determining whether all components of a certain conception weight vector are less than or equal to the corresponding components of another such vector, which can be done in a single processing step. This may help explain why humans can make classification judgements almost instantaneously.
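If concepts are represented by conception weight vectors, the subsumption test just described reduces to a componentwise comparison. A minimal sketch (Python; the example vectors are invented for illustration):

```python
def subsumes(c_abstract, c_specific):
    """Decide whether the concept encoded by c_abstract subsumes the one
    encoded by c_specific ("a cow is an animal"): this holds iff every
    conception weight of the more specific concept is <= the
    corresponding weight of the more abstract one."""
    return all(s <= a for a, s in zip(c_abstract, c_specific))

animal = [0.9, 0.8, 0.7, 0.9]   # broader concept: larger ellipsoid of states
cow    = [0.5, 0.8, 0.1, 0.0]   # narrower concept, componentwise smaller

assert subsumes(animal, cow)        # "a cow is an animal"
assert not subsumes(cow, animal)    # but not the other way round
```

The test touches each component exactly once, which is why subsumption checking is a single cheap processing step rather than a search problem.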

Discussion.

The human brain is a neurodynamical system which evidently supports logico-rational reasoning [49]. This has long challenged scientists to find computational models which connect neural dynamics with logic. Very different solutions have been suggested. At the dawn of computational neuroscience, McCulloch and Pitts already interpreted networks of binary-state neurons as carrying out Boolean operations [73]. Logical inferences of various kinds have been realized in localist connectionist networks where neurons are labelled by concept names [82, 96]. In neurofuzzy modeling, feedforward neural networks are trained to carry out operations of fuzzy logic on their inputs [12]. In a field known as neuro-symbolic computation, deduction rules of certain formal logic systems are coded or trained into neural networks [8, 65, 10]. The combinatorial/compositional structure of symbolic knowledge has been captured by dedicated neural circuits to enable tree-structured representations [83] or variable-binding functionality [104].

All of these approaches require non-neural interface mechanisms which code symbolic knowledge representations into the numerical activation values of neurons and/or the topological structure of networks. One could say that previous approaches code logic into specialized neural networks, while conceptors instantiate the logic inherent in generic recurrent neural networks. This novel, simple, versatile, computationally efficient, neurally not infeasible, bi-directional connection between logic and neural dynamics opens new perspectives for computational neuroscience and machine learning.

2 Introduction

In this section I expand on the brief characterization of the scientific context given in Section 1, and introduce mathematical notation.

2.1 Motivation

Intelligent behavior is desired for robots, demonstrated by humans, and studied in a wide array of scientific disciplines. This research unfolds in two directions. In “top-down” oriented research, one starts from the “higher” levels of cognitive performance, like rational reasoning, conceptual knowledge representation, planning and decision-making, command of language. These phenomena are described in symbolic formalisms developed in mathematical logic, artificial intelligence (AI), computer science and linguistics. In the “bottom-up” direction, one departs from “low-level” sensor data processing and motor control, using the analytical tools offered by dynamical systems theory, signal processing and control theory, statistics and information theory. For brevity I will refer to these two directions as the conceptual-symbolic and the data-dynamical sets of phenomena, and levels of description. The two interact bi-directionally. Higher-level symbolic concepts arise from low-level sensorimotor data streams in short-term pattern recognition and long-term learning processes. Conversely, low-level processing is modulated, filtered and steered by processes of attention, expectations, and goal-setting in a top-down fashion.

Several schools of thought (and strands of dispute) have evolved in a decades-long quest for a unification of the conceptual-symbolic and the data-dynamical approaches to intelligent behavior. The nature of symbols in cognitive processes has been cast as a philosophical issue [95, 29, 44]. In localist connectionistic models, symbolically labelled abstract processing units interact by nonlinear spreading activation dynamics [22, 96]. A basic tenet of behavior-based AI is that higher cognitive functions emerge from low-level sensori-motor processing loops which couple a behaving agent into its environment [11, 81]. Within cognitive science, a number of cognitive phenomena have been described in terms of self-organization in nonlinear dynamical systems [94, 98, 105]. A pervasive idea in theoretical neuroscience is to interpret attractors in nonlinear neural dynamics as the carriers of conceptual-symbolic representations. This idea can be traced back at least to the notion of cell assemblies formulated by Hebb [45], reached a first culmination in the formal analysis of associative memories [80, 48, 4], and has since then diversified into a range of increasingly complex models of interacting (partial) neural attractors [109, 103, 87, 101]. Another pervasive idea in theoretical neuroscience and machine learning is to consider hierarchical neural architectures, which are driven by external data at the bottom layer and transform this raw signal into increasingly abstract feature representations, arriving at conceptual representations at the top layer of the hierarchy. Such hierarchical architectures mark the state of the art in pattern recognition technology [66, 38]. Many of these systems process their input data in a uni-directional, bottom-to-top fashion.
Two notable exceptions are systems where each processing layer is designed according to statistical principles from Bayes’ rule [33, 47, 16], and models based on the iterative linear maps of map seeking circuits [35, 111], both of which enable top-down guidance of recognition by expectation generation. More generally, leading actors in theoretical neuroscience have characterized large parts of their field as an effort to understand how cognitive phenomena arise from neural dynamics [2, 36]. Finally, I point out two singular scientific efforts to design comprehensive cognitive brain models, the ACT-R architectures developed by Anderson et al. [5] and the Spaun model of Eliasmith et al. [26]. Both systems can simulate a broad selection of cognitive behaviors. They integrate numerous subsystems and processing mechanisms, where ACT-R is inspired by a top-down modeling approach, starting from cognitive operations, and Spaun from a bottom-up strategy, starting from neurodynamical processing principles.

Despite this extensive research, the problem of integrating the conceptual-symbolic with the data-dynamical aspects of cognitive behavior cannot be considered solved. Quite to the contrary, two of the largest current research initiatives worldwide, the Human Brain Project [1] and the NIH BRAIN initiative [51], are ultimately driven by this problem. There are many reasons why this question is hard, ranging from experimental challenges of gathering relevant brain data to fundamental oppositions of philosophical paradigms. An obstinate stumbling block is the different mathematical nature of the fundamental formalisms which appear most natural for describing conceptual-symbolic versus data-dynamical phenomena: symbolic logic versus nonlinear dynamics. Logic-oriented formalisms can easily capture all that is combinatorially constructive and hierarchically organized in cognition: building new concepts by logical definitions, describing nested plans for action, organizing conceptual knowledge in large and easily extensible abstraction hierarchies. But logic is inherently non-temporal, and in order to capture cognitive processes, additional heuristic “scheduling” routines have to be introduced which control the order in which logical rules are executed. This is how ACT-R architectures cope with the integration problem. Conversely, dynamical systems formalisms are predestined for modeling all that is continuously changing in the sensori-motor interface layers of a cognitive system, driven by sensor data streams. But when dynamical processing modules have to be combined into compounds that can solve complex tasks, again additional design elements have to be inserted, usually by manually coupling dynamical modules in ways that are informed by biological or engineering insight on the side of the researcher. This is how the Spaun model has been designed to realize its repertoire of cognitive functions. Two important modeling approaches venture to escape from the logic-dynamics integration problem by resorting to an altogether different mathematical framework which can accommodate both sensor data processing and concept-level representations: the framework of Bayesian statistics and the framework of iterated linear maps mentioned above. Both approaches lead to a unified formal description across processing and representation levels, but at the price of a double weakness in accounting for the embodiment of an agent in a dynamical environment, and for the combinatorial aspects of cognitive operations. It appears that current mathematical methods can instantiate only one of the three: continuous dynamics, combinatorial productivity, or a unified level-crossing description format.

The conceptor mechanisms introduced in this report bi-directionally connect the data-dynamical workings of a recurrent neural network (RNN) with a conceptual-symbolic representation of different functional modes of the RNN. Mathematically, conceptors are linear operators which characterize classes of signals that are being processed in the RNN. Conceptors can be represented as matrices (convenient in machine learning applications) or as neural subnetworks (appropriate from a computational neuroscience viewpoint). In a bottom-up way, starting from an operating RNN, conceptors can be learnt and stored, or quickly generated on-the-fly, by what may be considered the simplest of all adaptation rules: learning a regularized identity map. Conceptors can be combined by elementary logical operations (AND, OR, NOT), and can be ordered by a natural abstraction relationship. These logical operations and relations are defined via a formal semantics. Thus, an RNN engaged in a variety of tasks leads to a learnable representation of these operations in a logic formalism which can be neurally implemented. Conversely, in a top-down direction, conceptors can be inserted into the RNN’s feedback loop, where they robustly steer the RNN’s processing mode. Due to their linear algebra nature, conceptors can be continuously morphed and “sharpened” or “defocussed”, which extends the discrete operations that are customary in logics into the domain of continuous “mental” transformations. I highlight the versatility of conceptors in a series of demonstrations: generating and morphing many different dynamical patterns with a single RNN; managing and monitoring the storing of patterns in a memory RNN; learning a class of dynamical patterns from presentations of a small number of examples (with extrapolation far beyond the training examples); classification of temporal patterns; de-noising of temporal patterns; and content-addressable memory systems. 
The logical conceptor operations enable an incremental extension of a trained system by incorporating new patterns without interfering with already learnt ones. Conceptors also suggest a novel answer to a perennial problem of attractor-based models of concept representations, namely the question of how a cognitive trajectory can leave an attractor (which is at odds with the very nature of an attractor). Finally, I outline a version of conceptors which is biologically plausible in the modest sense that only local computations and no information copying are needed.

2.2 Mathematical Preliminaries

I assume that the reader is familiar with properties of positive semidefinite matrices, the singular value decomposition, and (in some of the analysis of adaptation dynamics) the usage of the Jacobian of a dynamical system for analysing stability properties.

[a, b] (respectively (a, b), [a, b)) denotes the closed (open, half-open) interval between real numbers a and b.

A′ or x′ denotes the transpose of a matrix A or vector x. I is the identity matrix (the size will be clear from the context or be expressed as I_{n×n}). The i-th unit vector is denoted by e_i (dimension will be clear in context). The trace of a square matrix A is denoted by tr A. The singular value decomposition of a matrix A is written as A = U S V′, where U, V are orthonormal and S is the diagonal matrix containing the singular values of A, assumed to be in descending order unless stated otherwise. A† is the pseudoinverse of A. All matrices and vectors will be real and this will not be explicitly mentioned.

I use the Matlab notation to address parts of vectors and matrices: for instance, M(:, 3) is the third column of a matrix M, and M(2:4, :) picks the submatrix consisting of rows 2 to 4. Furthermore, again as in Matlab, I use the operator diag in a “toggling” mode: diag A returns the diagonal vector of a square matrix A, and diag d constructs a diagonal matrix from a vector d of diagonal elements. Another Matlab notation that will be used is “.*” for the element-wise multiplication of vectors and matrices of the same size, and “.^” for element-wise exponentiation of vectors and matrices.

R(A) and N(A) denote the range and null space of a matrix A. For linear subspaces S, T of R^n, S⊥ is the orthogonal complement space of S and S ⊕ T is the direct sum of S and T. P_S is the n × n dimensional projection matrix on a linear subspace S of R^n. For a k-dimensional linear subspace S of R^n, B_S denotes any n × k dimensional matrix whose columns form an orthonormal basis of S. Such matrices will occur only in contexts where the choice of basis can be arbitrary. It holds that P_S = B_S B_S′.

E[z(n)] denotes the expectation (temporal average) of a stationary signal z(n) (assuming it is well-defined, for instance, coming from an ergodic source).

For a matrix A, ‖A‖_fro is the Frobenius norm of A. For real A, it is the square root of the summed squared elements of A. If A is positive semidefinite with SVD A = U S U′, ‖A‖_fro is the same as the 2-norm of the diagonal vector of S, i.e. ‖A‖_fro = ‖diag S‖_2. Since in this report I will exclusively use the Frobenius norm for matrices, I sometimes omit the subscript and write ‖A‖ for simplicity.

In a number of simulation experiments, a network-generated signal y(n) will be matched against a target pattern p(n). The accuracy of the match will be quantified by the normalized root mean square error (NRMSE), the square root of ⟨(y(n) − p(n))²⟩ / ⟨p(n)²⟩, where ⟨·⟩ is the mean operator over data points n.
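A minimal implementation of the NRMSE (Python sketch; this version normalizes the RMSE by the target's root mean square amplitude, so a value of 1 roughly means "as wrong as predicting zero throughout"):

```python
import math

def nrmse(y, p):
    """Normalized root mean square error between a network output y and
    a target pattern p, both given as equal-length sequences."""
    err = sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(p)
    power = sum(pi ** 2 for pi in p) / len(p)
    return math.sqrt(err / power)

target = [math.sin(2 * math.pi * n / 10) for n in range(100)]
assert nrmse(target, target) == 0.0                  # perfect match
assert abs(nrmse([0.0] * 100, target) - 1.0) < 1e-12  # all-zero prediction
```

The normalization makes errors comparable across patterns of different amplitude, which is why the report can quote NRMSE values across its various experiments.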

The symbol N is reserved for the size of a reservoir (= number of neurons) throughout.

3 Theory and Demonstrations

This is the main section of this report. Here I develop in detail the concepts, mathematical analysis, and algorithms, and I illustrate various aspects in computer simulations.

Figure 10 gives a navigation guide through the dependency tree of the components of this section.

Figure 10: Dependency tree of subsections in Section 3.

The program code (Matlab) for all simulations can be retrieved from
http://minds.jacobs-university.de/sites/default/files/uploads/
…SW/ConceptorsTechrepMatlab.zip.

3.1 Networks and Signals

Throughout this report, I will be using discrete-time recurrent neural networks made of simple tanh neurons, which will be driven by an input time series p(n). In the case of 1-dimensional input, these networks consist of (i) a “reservoir” of N recurrently connected neurons whose activations form a state vector x(n), (ii) one external input neuron that serves to drive the reservoir with training or cueing signals p(n), and (iii) another external neuron which serves to read out a scalar target signal y(n) from the reservoir (Fig. 11). The system operates in discrete timesteps according to the update equations

x(n + 1) = tanh(W x(n) + W^in p(n + 1) + b),   (1)
y(n) = W^out x(n),   (2)

where W is the N × N matrix of reservoir-internal connection weights, W^in is the N × 1 sized vector of input connection weights, W^out is the 1 × N vector of readout weights, and b is an N × 1 bias vector. The tanh is a sigmoidal function that is applied to the network state component-wise. Due to the tanh, the reservoir state space or simply state space is (−1, 1)^N.

The input weights W^in and the bias b are fixed at random values and are not subject to modification through training. The output weights W^out are learnt. The reservoir weights W are learnt in some of the case studies below; in others they remain fixed at their initial random values. If they are learnt, they are adapted from a random initialization denoted by W*. Figure 11 A illustrates the basic setup.
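A minimal sketch of the update loop of equations (1) and (2) — in Python rather than the report's Matlab, with small random placeholder weights (the scaling here is illustrative, not tuned, and the readout would normally be learnt):

```python
import math, random

random.seed(1)
N = 5                      # toy reservoir size (the report uses larger reservoirs)

# Random weights, scaled small so the echo state property plausibly holds.
W     = [[0.3 * (random.random() - 0.5) for _ in range(N)] for _ in range(N)]
W_in  = [random.random() - 0.5 for _ in range(N)]
b     = [0.1 * (random.random() - 0.5) for _ in range(N)]
W_out = [random.random() - 0.5 for _ in range(N)]   # placeholder, not trained

def step(x, p_next):
    """One discrete timestep: x(n+1) = tanh(W x(n) + W_in p(n+1) + b)."""
    return [math.tanh(sum(W[i][j] * x[j] for j in range(N))
                      + W_in[i] * p_next + b[i])
            for i in range(N)]

def readout(x):
    """y(n) = W_out x(n)."""
    return sum(wo * xi for wo, xi in zip(W_out, x))

x = [0.0] * N
for n in range(50):                       # drive with a slow sinewave
    x = step(x, math.sin(2 * math.pi * n / 12))
assert all(-1.0 < xi < 1.0 for xi in x)   # tanh confines states to (-1, 1)^N
```

Note how the tanh nonlinearity guarantees that every state component stays strictly inside (−1, 1), which is the state space claim made above.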

I will call the driving signals patterns. In most parts of this report, patterns will be periodic. Periodicity comes in two variants. First, integer-periodic patterns have the property that p(n) = p(n + k) for some positive integer k. Second, irrational-periodic patterns are discretely sampled from continuous-time periodic signals, where the sampling interval and the period length of the continuous-time signal have an irrational ratio. An example is a sinewave sampled at unit timesteps whose period length is an irrational number. These two sorts of drivers will eventually lead to different kinds of attractors trained into reservoirs: integer-periodic signals with period length k yield attractors consisting of k points in reservoir state space, while irrational-periodic signals give rise to attracting sets which can be topologically characterized as one-dimensional cycles that are homeomorphic to the unit circle.
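Both pattern variants are easy to generate. A sketch (Python; the concrete period lengths chosen here are illustrative, except that the 5-periodic case matches the text):

```python
import math, random

def irrational_periodic(period, length):
    """Sample a continuous sinewave at integer times; if `period` is
    irrational, the sampled sequence never exactly repeats."""
    return [math.sin(2 * math.pi * n / period) for n in range(length)]

def integer_periodic(period, length, seed=0):
    """A random pattern that repeats exactly every `period` steps."""
    rng = random.Random(seed)
    one_cycle = [2 * rng.random() - 1 for _ in range(period)]
    return [one_cycle[n % period] for n in range(length)]

p_sine = irrational_periodic(math.sqrt(75), 100)   # irrational period (illustrative)
p_rand = integer_periodic(5, 100)                  # 5-periodic, as in the text

assert p_rand[:5] == p_rand[5:10]                  # exact repetition every 5 steps
```

The integer-periodic driver revisits exactly `period` distinct values, which is why it entrains the reservoir to a finite point set, while the irrational-periodic driver never exactly repeats.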

Figure 11: A. Basic system setup. Through input connections W^in, an input neuron feeds a driving signal p to a “reservoir” of N neurons which are recurrently connected to each other through connections W. From the N-dimensional neuronal activation state x, an output signal y is read out by connections W^out. All broken connections are trainable. B. During the initial driving of the reservoir with driver p, using initial random weights W*, a neuron produces its signal (thick gray line) based on the external driving input and on feeds from other neurons within the reservoir (three shown). C. After training new reservoir weights W, the same neuron should produce the same signal based only on the feeds from other reservoir neurons.

3.2 Driving a Reservoir with Different Patterns

A basic theme in this report is to develop methods by which a collection of different patterns can be loaded in, and retrieved from, a single reservoir. The key for these methods is an elementary dynamical phenomenon: if a reservoir is driven by a pattern, the entrained network states are confined to a linear subspace of network state space which is characteristic of the pattern. In this subsection I illuminate this phenomenon by a concrete example. This example will be re-used and extended on several occasions throughout this report.

I use four patterns. The first two are irrational-periodic and the last two are integer-periodic: (1) a sinewave of irrational period sampled at integer times, (2) a sinewave whose period is that of the first plus 1, (3) a random 5-periodic pattern and (4) a slight variation thereof (Fig. 12 left column).

A reservoir with N = 100 neurons is randomly created. At creation time the input weights W^in and the bias b are fixed at random values; these will never be modified thereafter. The reservoir weights are initialized to random values W*; in this first demonstration they will not be subsequently modified either. The readout weights are initially undefined (details in Section 4.1).

In four successive and independent runs, the network is driven by feeding the respective pattern as input, using the update rule (1).

After an initial washout time, the reservoir dynamics becomes entrained to the driver and the reservoir state exhibits an involved nonlinear response to it. After this washout, the reservoir run is continued for L further steps, and the obtained states are collected into N × L sized state collection matrices for subsequent use.

The second column in Fig. 12 shows traces of three randomly chosen reservoir neurons in the four driving conditions. It is apparent that the reservoir has become entrained to the driving input. Mathematically, this entrainment is captured by the concept of the echo state property: any random initial state of a reservoir is “forgotten”, such that after a washout period the current network state is a function of the driver. The echo state property is a fundamental condition for RNNs to be useful in learning tasks [54, 13, 46, 110, 99, 71]. It can be ensured by an appropriate scaling of the reservoir weight matrix. All networks employed in this report possess the echo state property.

Figure 12: The subspace phenomenon. Each row of panels documents the situation when the reservoir is driven by a particular input pattern. “Driver and y”: the driving pattern (thin black line) and the signals retrieved with conceptors (broad light gray line). Number inset is the NRMSE between original driver and retrieved signal. “Reservoir states”: activations of three randomly picked reservoir neurons. “Log10 PC energy”: log10 of reservoir signal energies in the principal component directions. “Leading PC energy”: close-up on the first ten signal energies in linear scale. Notice that the first two panels in each row show discrete-time signals; points are connected by lines only for better visual appearance.

A principal component analysis (PCA) of the 100 reservoir signals reveals that the driven reservoir signals are concentrated on a few principal directions. Concretely, for each of the four driving conditions, the reservoir state correlation matrix was estimated by R^j = X^j (X^j)' / L, and its SVD R^j = U^j Σ^j (U^j)' was computed, where the columns of U^j are orthonormal eigenvectors of R^j (the principal component (PC) vectors), and the diagonal of Σ^j contains the singular values of R^j, i.e. the energies (mean squared amplitudes) of the principal signal components. Figure 12 (third and last column) shows a plot of these principal component energies. The energy spectra induced by the two irrational-period sines look markedly different from the spectra obtained from the two 5-periodic signals. The latter lead to nonzero energies in exactly 5 principal directions because the driven reservoir dynamics periodically visits 5 states (the small but nonzero remaining values in the plots in Figure 12 are artefacts resulting from rounding errors in the SVD computation). In contrast, the irrational-periodic drivers lead to reservoir states which linearly span all of R^N (Figure 12, upper two plots). All four drivers however share a relevant characteristic (Figure 12, right column): the total reservoir energy is concentrated in a quite small number of leading principal directions.

When one inspects the excited reservoir dynamics in these four driving conditions, there is little surprise that the neuronal activation traces look similar to each other in the first two and in the second two cases (Figure 12, second column). This “similarity” can be quantified in a number of ways. Noting that the geometry of the “reservoir excitation space” in driving condition j is characterized by a hyperellipsoid with main axes U^j and axis lengths given by the diagonal of Σ^j, a natural way to define a similarity between two such ellipsoids is to put

sim^R_{j,k} = ||(Σ^j)^{1/2} (U^j)' U^k (Σ^k)^{1/2}||^2 / (||diag Σ^j|| ||diag Σ^k||).    (3)

The measure sim^R_{j,k} ranges in [0, 1]. It is 0 if and only if the reservoir signals populate orthogonal linear subspaces, and it is 1 if and only if R^j = a R^k for some scaling factor a. The measure can be understood as a generalized squared cosine between R^j and R^k. Figure 13 A shows the similarity matrix obtained from (3). The similarity values contained in this matrix appear somewhat counter-intuitive, inasmuch as the reservoir responses to the sinewave patterns come out as having similarities of about 0.6 with the 5-periodic driven reservoir signals; this does not agree with the strong visual dissimilarity apparent in the state plots in Figure 12. In Section 3.5 I will introduce another similarity measure which agrees better with intuitive judgement.
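A direct implementation of this generalized squared cosine is short (a sketch; the function name is mine):

```python
import numpy as np

def sim_R(R1, R2):
    """Generalized squared cosine between two correlation matrices, Eq. (3)."""
    U1, S1, _ = np.linalg.svd(R1)
    U2, S2, _ = np.linalg.svd(R2)
    # (S1^{1/2} U1' U2 S2^{1/2})_{ij} = sqrt(S1_i) (U1'U2)_{ij} sqrt(S2_j)
    num = np.linalg.norm(
        np.sqrt(S1)[:, None] * (U1.T @ U2) * np.sqrt(S2)[None, :], 'fro') ** 2
    den = np.linalg.norm(S1) * np.linalg.norm(S2)
    return num / den
```

As stated in the text, the measure returns 1 for any pair R and aR, and 0 when the two ellipsoids occupy orthogonal subspaces.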


Figure 13: Matrix plots of pairwise similarity between the subspaces excited in the four driving conditions. Grayscale coding: 0 = black, 1 = white. A: similarity sim^R based on the data correlation matrices R^j. B,C: similarities based on conceptors for two different values of the aperture α. For explanation see text.

3.3 Storing Patterns in a Reservoir, and Training the Readout

One of the objectives of this report is a method for storing several driving patterns in a single reservoir, such that these stored patterns can later be retrieved and otherwise be controlled or manipulated. In this subsection I explain how the initial “raw” reservoir weights W^* are adapted in order to “store” or “memorize” the drivers, leading to a new reservoir weight matrix W. I continue with the four-pattern example used above.

The guiding idea is to enable the reservoir to re-generate the driven responses in the absence of the driving input. Consider any reservoir neuron i (Fig. 11B). During the driven runs j = 1, ..., 4, it was updated per

x_i(n+1) = tanh(W^*_i x(n) + W^in_i p^j(n+1) + b_i),

where W^*_i is the i-th row of W^*, W^in_i is the i-th element of W^in, and b_i is the i-th bias component. The objective for determining new reservoir weights W is that the trained reservoir should be able to oscillate in the same four ways as in the external driving conditions, but without the driving input. That is, the new weights W_i leading to neuron i should approximate

W_i x(n) ≈ W^*_i x(n) + W^in_i p^j(n+1)

as accurately as possible, for j = 1, ..., K. Concretely, we optimize a mean square error criterion and compute

W_i = argmin over candidate rows W̃_i of Σ_{j=1,...,K} Σ_n ( W^*_i x^j(n) + W^in_i p^j(n+1) − W̃_i x^j(n) )^2,    (4)

where K is the number of patterns to be stored (in this example K = 4). This is a linear regression task, for which a number of standard algorithms are available. I employ ridge regression (details in Section 4.1).

The readout neuron y serves as a passive observer of the reservoir dynamics. The objective for determining its connection weights W^out is simply to replicate the driving input, that is, W^out is computed (again by ridge regression) such that it minimizes the squared error (p^j(n) − y^j(n))^2, where y^j(n) = W^out x^j(n), averaged over time and the four driving conditions.

I will refer to this preparatory training as storing patterns in a reservoir, and call a reservoir loaded after patterns have been stored.

3.4 Conceptors: Introduction and Basic Usage in Retrieval

How can these stored patterns be individually retrieved again? After all, the storing process has superimposed impressions of all patterns on all of the re-computed connection weights of the network – very much like the pixel-wise addition of different images would yield a mixture image in which the individual original images are hard to discern. One would need some sort of filter which can disentangle the superimposed components in the connection weights again. In this section I explain how such filters can be obtained.

The guiding idea is that for retrieving pattern j from a loaded reservoir, the reservoir dynamics should be restricted to the linear subspace which is characteristic for that pattern. For didactic reasons I start with a simplifying assumption (to be dropped later). Assume that there exists a (low-dimensional) linear subspace V^j of R^N such that all state vectors contained in the driven state collection X^j lie in V^j. In our example, this is actually the case for the two 5-periodic patterns. Let P^j be the projector matrix which projects R^N on V^j. We may then hope that if we run the loaded reservoir autonomously (no input), constraining its states to V^j using the update rule

x(n+1) = P^j tanh(W x(n) + b),    (5)

it will oscillate in a way that is closely related to the way it oscillated when it was originally driven by p^j.

However, it is typically not the case that the states obtained in the original driving conditions are confined to a proper linear subspace of the reservoir state space. Consider the sine driver p^1 in our example. The linear span of the reservoir response states is all of R^N (compare the PC energy plots in Figure 12). The associated projector would be the identity, which would not help to single out an individual pattern in retrieval. But actually we are not interested in those principal directions of reservoir state space whose excitation energies are negligibly small (inspect again the quick drop of these energies in the third column, top panel in Figure 12 – it is roughly exponential over most of the spectrum, except for an even faster decrease for the very first few singular values). Still considering the sinewave pattern p^1: instead of the identity we would want a projector that projects on the subspace spanned by a “small” number of leading principal components of the “excitation ellipsoid” described by the sine-driver-induced correlation matrix R^1. What qualifies as a “small” number is, however, essentially arbitrary. So we want a method to shape projector-like matrices from reservoir state correlation matrices in a way that lets us adjust, with a control parameter, how many of the leading principal components should become registered in the projector-like matrix.

At this point I give names to the projector-like matrices and the adjustment parameter. I call the latter the aperture parameter, denoted by α. The projector-like matrices will be called conceptors and generally be denoted by the symbol C. Since conceptors are derived from the ellipsoid characterized by a reservoir state correlation matrix R, and parametrized by the aperture parameter, I also sometimes write C(R, α) to make this dependency transparent.

There is a natural and convenient solution to meet all the intuitive objectives for conceptors that I discussed up to this point. Consider a reservoir driven by a pattern p(n), leading to driven states x(n) collected (as columns) in a state collection matrix X, which in turn yields a reservoir state correlation matrix R = X X' / L. We define a conceptor C with the aid of a cost function, whose minimization yields C(R, α). The cost function has two components. The first component reflects the objective that C should behave as a projector matrix for the states that occur in the pattern-driven run of the reservoir. This component is E[||x(n) − C x(n)||^2], the time-averaged deviation of the projections C x(n) from the state vectors x(n). The second component of the cost function adjusts how many of the leading directions of R should become effective for the projection. This component is α^{−2} ||C||^2_fro. This leads to the following definition.

Definition 1

Let R be an N × N correlation matrix and α ∈ (0, ∞). The conceptor matrix C associated with R and α is

C(R, α) = argmin_C E[||x − C x||^2] + α^{−2} ||C||^2_fro.    (6)

The minimization criterion (6) uniquely specifies C(R, α). The conceptor matrix can be effectively computed from R and α. This is spelled out in the following proposition, which also lists elementary algebraic properties of conceptor matrices:

Proposition 1

Let R be a correlation matrix and α ∈ (0, ∞). Then,

  1. C(R, α) can be directly computed from R and α by

    C(R, α) = R (R + α^{−2} I)^{−1} = (R + α^{−2} I)^{−1} R,    (7)

  2. if R = U Σ U' is the SVD of R, then the SVD of C(R, α) can be written as C = U S U', i.e. C has the same principal component vector orientation as R,

  3. the singular values s_i of C relate to the singular values σ_i of R by s_i = σ_i / (σ_i + α^{−2}),

  4. the singular values of C range in [0, 1),

  5. R can be recovered from C and α by

    R = α^{−2} (I − C)^{−1} C.    (8)

The proof is given in Section 5.1. Notice that all inverses appearing in this proposition are well-defined because α < ∞ is assumed, which implies that all singular values of C are properly smaller than 1. I will later generalize conceptors to include the limiting cases α = 0 and α = ∞ (Section 3.8.1).

In practice, the correlation matrix R is estimated from a finite sample, which leads to the approximation R ≈ X X' / L, where X is an N × L matrix containing L reservoir states collected during a learning run.

Figure 14 shows the singular value spectra of C(R, α) for various values of α, for our example cases of the first sinewave driver (irrational period) and the first 5-periodic driver. We find that the nonlinearity inherent in (7) makes the conceptor matrices come out “almost” as projector matrices: the singular values of C are mostly close to 1 or close to 0. In the case of the 5-periodic driver, where the excited network states populate a 5-dimensional subspace of R^N, increasing α lets C converge to a projector onto that subspace.
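Computing a conceptor from a state correlation matrix is a one-liner via (7). The sketch below uses a synthetic R whose energy is concentrated in 5 directions, mimicking a 5-periodic driver; the aperture value is an arbitrary choice of mine.

```python
import numpy as np

def conceptor(R, alpha):
    """C(R, alpha) = R (R + alpha^-2 I)^-1, Eq. (7)."""
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(R.shape[0]))

rng = np.random.default_rng(3)
N, L = 100, 1000
X = rng.normal(size=(N, L))
X[5:, :] *= 0.01          # concentrate signal energy in 5 leading directions
R = X @ X.T / L
C = conceptor(R, alpha=10.0)

s = np.linalg.svd(C, compute_uv=False)       # singular values of C
sigma = np.linalg.svd(R, compute_uv=False)   # singular values of R
# Proposition 1.3: s_i = sigma_i / (sigma_i + alpha^-2); here the five
# leading values come out close to 1 and the rest close to 0.
print(s[:7])
```

The same helper also lets one check property 5 of Proposition 1, since R can be reconstructed as α^{−2} (I − C)^{−1} C.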

Figure 14: How the singular values of a conceptor C depend on α. Singular value spectra are shown for the first sinewave pattern and the first 5-periodic random pattern. For explanation see text.

If one has a conceptor matrix C^j derived from a pattern p^j through the reservoir state correlation matrix R^j associated with that pattern, the conceptor matrix can be used in an autonomous run (no external input) using the update rule

x(n+1) = C^j tanh(W x(n) + b),    (9)

where the weight matrix W has been shaped by storing patterns among which there was p^j. Returning to our example, four conceptors C^1, ..., C^4 were computed with identical apertures and the loaded reservoir was run under rule (9) from a random initial state x(0). After a short washout period, the network settled on stable periodic dynamics which were closely related to the original driving patterns. The network dynamics was observed through the previously trained output neuron. The left column in Figure 12 shows the autonomous network output y(n) as a light bold gray line underneath the original driver. To measure the achieved accuracy, the autonomous output signal was phase-aligned with the driver (details in Section 4.1) and then the NRMSE was computed (insets in Figure panels). The NRMSEs indicate that the conceptor-constrained autonomous runs could successfully separate from each other even the closely related pattern pairs p^1 versus p^2 and p^3 versus p^4.
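The complete store-then-retrieve loop can be condensed into a small script. This is a hypothetical re-implementation under assumed settings (spectral radius, input and bias scaling, ridge regularizers, aperture, sine periods); the report's actual settings are in Section 4.1, and with other random seeds or scalings the retrieval quality will vary.

```python
import numpy as np

rng = np.random.default_rng(4)
N, washout, L = 100, 500, 1000
W_star = rng.normal(size=(N, N))
W_star *= 1.5 / np.max(np.abs(np.linalg.eigvals(W_star)))
W_in, b = 1.5 * rng.normal(size=N), 0.2 * rng.normal(size=N)

def drive(p):
    """Collect pre-update states, post-update states, and driver values."""
    x = np.zeros(N)
    Xo, Xn, P = np.empty((N, L)), np.empty((N, L)), np.empty(L)
    for n in range(washout + L):
        u = p(n)
        x_new = np.tanh(W_star @ x + W_in * u + b)
        if n >= washout:
            k = n - washout
            Xo[:, k], Xn[:, k], P[k] = x, x_new, u
        x = x_new
    return Xo, Xn, P

def ridge(T, S, reg):
    return T @ S.T @ np.linalg.inv(S @ S.T + reg * np.eye(S.shape[0]))

pats = [lambda n: np.sin(2 * np.pi * n / 8.83),
        lambda n: np.sin(2 * np.pi * n / 9.83)]
runs = [drive(p) for p in pats]
Xo = np.hstack([r[0] for r in runs])
Xn = np.hstack([r[1] for r in runs])
P = np.hstack([r[2] for r in runs])

# Load the reservoir (Eq. 4) and train the readout, both by ridge regression
W = ridge(W_star @ Xo + np.outer(W_in, P), Xo, reg=1e-4)
W_out = ridge(P[None, :], Xn, reg=0.01)[0]

# Conceptor for pattern 1, then an autonomous run under Eq. (9)
R1 = runs[0][1] @ runs[0][1].T / L
C1 = R1 @ np.linalg.inv(R1 + 10.0 ** -2 * np.eye(N))  # aperture 10 (assumed)
x = rng.normal(size=N)
y = np.empty(200)
for n in range(500 + 200):
    x = C1 @ np.tanh(W @ x + b)
    if n >= 500:
        y[n - 500] = W_out @ x
```

After the autonomous washout, y should oscillate close to the first stored sine; quantifying the fit would additionally require the phase alignment mentioned in the text.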

A note on terminology.

Equation (9) shows a main usage of conceptor matrices: they are inserted into the reservoir state feedback loop, where they cancel (respectively, dampen) those reservoir state components which correspond to directions in state space associated with zero (or small, respectively) singular values in the conceptor matrix. In most of this report, such a direction-selective damping in the reservoir feedback loop will be effected by inserting conceptor matrices as in Equation (9). However, inserting a matrix is not the only way by which such a direction-selective damping can be achieved. In Section 3.14, which deals with biological plausibility issues, I will propose a neural circuit which achieves a similar functionality of direction-specific damping of reservoir state components by other means and with slightly differing mathematical properties. I understand the concept of a “conceptor” as comprising any mechanism which effects a pattern-specific damping of reservoir signal components. Since in most parts of this report this will be achieved with conceptor matrices, as in (9), I will often refer to these matrices as “conceptors” for simplicity. The reader should however bear in mind that the notion of a conceptor is more comprehensive than the notion of a conceptor matrix. I will not spell out a formal definition of a “conceptor”, deliberately leaving this concept open to become instantiated by a variety of computational mechanisms, of which only two are formally defined in this report (via conceptor matrices, and via the neural circuit given in Section 3.14).

3.5 A Similarity Measure for Excited Network Dynamics

In Figure 13 A a similarity matrix is presented which compares the excitation ellipsoids represented by the correlation matrices R^j by the similarity metric (3). I remarked at that time that this is not a fully satisfactory metric, because it does not agree well with intuition. We obtain a more intuitively adequate similarity metric if conceptor matrices are used as descriptors of “subspace ellipsoid geometry” instead of the raw correlation matrices, i.e. if we employ the metric

sim^α_{j,k} = ||(S^j)^{1/2} (U^j)' U^k (S^k)^{1/2}||^2 / (||diag S^j|| ||diag S^k||),    (10)

where U^j S^j (U^j)' is the SVD of C^j = C(R^j, α). Figure 13 B,C shows the similarity matrices arising in our standard example for two different values of α. The intuitive dissimilarity between the sinewave and the 5-periodic patterns, and the intuitive similarity between the two sines (and the two 5-periodic pattern versions, respectively), is revealed much more clearly than on the basis of sim^R.

When interpreting similarities sim^R or sim^α one should bear in mind that one is not comparing the original driving patterns but the excited reservoir responses.

3.6 Online Learning of Conceptor Matrices

The minimization criterion (6) immediately leads to a stochastic gradient online method for adapting C:

Proposition 2

Assume that a stationary source of N-dimensional reservoir states x(n) is available. Let C(0) be any N × N matrix, and λ > 0 a learning rate. Then the stochastic gradient adaptation

C(n+1) = C(n) + λ ( (x(n) − C(n) x(n)) x'(n) − α^{−2} C(n) )    (11)

will lead to C(n) → C(R, α).

The proof is straightforward if one employs generally known facts about stochastic gradient descent and the fact that the cost function from (6) is positive definite quadratic in the N^2-dimensional space of elements of C (shown in the proof of Proposition 1), and hence provides a Lyapunov function for the gradient descent (11). The gradient of this cost function with respect to C is

∇_C = 2 ( α^{−2} C − (I − C) R ),    (12)

which immediately yields (11).

The stochastic update rule (11) is very elementary. It is driven by two components, (i) an error signal which simply compares the current state x(n) with its C-mapped value C x(n), and (ii) a linear decay term. We will make heavy use of this adaptive mechanism in Sections 3.13.1 ff. This observation also illuminates the intuitions behind the definition of conceptors. The two components strike a compromise (balanced by α) between (i) the objective that C should leave reservoir states from the target pattern unchanged, and (ii) the objective that C should have small weights. In the terminology of machine learning one could say, “a conceptor is a regularized identity map”.
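Proposition 2 is easy to check numerically. The sketch below is mine: dimensions, aperture, learning rate, and iteration count are arbitrary choices, and the "reservoir states" are simulated by a fixed linear mixing of white noise so that their true correlation matrix R is known.

```python
import numpy as np

rng = np.random.default_rng(5)
N, alpha, lam = 20, 3.0, 0.01
A = rng.normal(size=(N, N)) / np.sqrt(N)   # fixed mixing: x = A z is stationary
R = A @ A.T                                # true state correlation E[x x']
C_target = R @ np.linalg.inv(R + alpha ** -2 * np.eye(N))

C = np.zeros((N, N))
for _ in range(20000):
    x = A @ rng.normal(size=N)             # sample a reservoir-like state
    # Stochastic gradient step, Eq. (11)
    C += lam * (np.outer(x - C @ x, x) - alpha ** -2 * C)

err = np.linalg.norm(C - C_target) / np.linalg.norm(C_target)
print(f"relative error after adaptation: {err:.3f}")
```

With a constant learning rate the iterate fluctuates around C(R, α); a decaying learning rate would shrink the residual error further.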

3.7 Morphing Patterns

Conceptor matrices offer a way to morph RNN dynamics. Suppose that a reservoir has been loaded with some patterns, among which there are p^i and p^j with corresponding conceptors C^i and C^j. Patterns that are intermediate between p^i and p^j can be obtained by running the reservoir via (9), using a linear mixture between C^i and C^j:

C(μ) = (1 − μ) C^i + μ C^j.    (13)

Still using our four-pattern example, I demonstrate how this morphing works out for morphing (i) between the two sines, (ii) between the two 5-periodic patterns, (iii) between a sine and a 5-periodic pattern.
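The morphing mechanism itself is a single line; running it amounts to using the mixed conceptor in the autonomous update (9). A sketch (function names are mine; in a morph run C(μ) would be re-computed at every step as μ is ramped):

```python
import numpy as np

def morphed_conceptor(Ci, Cj, mu):
    """Linear conceptor mixture C(mu) = (1 - mu) Ci + mu Cj, Eq. (13).
    mu in [0, 1] interpolates; mu outside [0, 1] extrapolates."""
    return (1.0 - mu) * Ci + mu * Cj

def autonomous_run(W, b, W_out, C, x0, n_steps):
    """Autonomous reservoir run x(n+1) = C tanh(W x(n) + b), observed via W_out."""
    x, y = x0.copy(), np.empty(n_steps)
    for n in range(n_steps):
        x = C @ np.tanh(W @ x + b)
        y[n] = W_out @ x
    return y
```

Here W is assumed to be a loaded reservoir weight matrix and W_out a trained readout vector, as prepared in Section 3.3.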

Frequency Morphing of Sines

In this demonstration, the morphing was done for the two sinewave conceptors C^1 and C^2. The morphing parameter μ was allowed to range from −2 to 3 (!). The four-pattern-loaded reservoir was run from a random initial state for 500 washout steps, using (13) with μ = −2. Then recording was started. First, the run was continued with the initial μ = −2 for 50 steps. Then, μ was linearly ramped up from −2 to 3 during 200 steps. Finally, another 50 steps were run with the final setting μ = 3.

Note that morph values μ = 0 and μ = 1 correspond to situations where the reservoir is constrained by the original conceptors C^1 and C^2, respectively. Values 0 < μ < 1 correspond to interpolation. Values −2 ≤ μ < 0 and 1 < μ ≤ 3 correspond to extrapolation. The extrapolation range on either side is twice as long as the interpolation range.

In addition, for eight equidistant values μ_i in [−2, 3], the reservoir was run with a mixed conceptor C(μ_i) for 500 steps, and the obtained observation signal was plotted in a delay-embedded representation, yielding “snapshots” of the reservoir dynamics at these μ values (a delay-embedding plot of a 1-dimensional signal y(n) creates a 2-dimensional plot by plotting value pairs (y(n), y(n − d)) with a delay d chosen to yield an appealing visual appearance).

Figure 15: Morphing between (and beyond) two sines. The morphing range was [−2, 3]. Black circular dots in the two bottom panels mark the points μ = 0 and μ = 1, corresponding to situations where the two original conceptors were active in unadulterated form. Top: Delay-embedding plots of the network observation signal y (delay = 1 step). Thick dots show 25 plotted points, thin dots show 500 points (appearing as a connected line). All eight panels share the same plot range. Triangles in the center panel mark the morph positions corresponding to the delay-embedding “snapshots”. Center: the network observation signal y(n) of a morph run. Bottom: Thin black line: the period length obtained from linearly morphing between (and extrapolating beyond) the original period lengths. Bold gray line: period lengths measured from the observation signal y.

Figure 15 shows the findings. The reservoir oscillates over the entire inter/extrapolation range with a waveform that is approximately equal to a sampled sine. At the morph values μ = 0 and μ = 1 (indicated by dots in the Figure), the system is in exactly the same modes as were plotted earlier in the first two panels of the left column in Figure 12. Accordingly, the fit between the original drivers’ period lengths and the autonomously re-played oscillations is as good as reported there (i.e. corresponding to a steady-state NRMSE of about 0.01). In the extrapolation range, while the linear morphing of the mixing parameter μ does not lead to an exact linear morphing of the observed period lengths, the obtained period lengths still steadily continue to decrease (going left from μ = 0) and to increase (going right from μ = 1).

In sum, it is possible to use conceptor-morphing to extend sine-oscillatory reservoir dynamics from two learnt oscillation periods to the much wider range between the minimal and maximal period lengths shown in the Figure. The post-training sinewave generation thus extrapolated beyond the period range spanned by the two training samples by a factor of about 4.4. From a perspective of machine learning this extrapolation is remarkable. Generally speaking, when neural pattern generators are trained from demonstration data (often done in robotics, e.g. [52, 89]), interpolation of recallable patterns is what one expects to achieve, while extrapolation is deemed hard.

From a perspective of neurodynamics, it is furthermore remarkable that the dimension of interpolation/extrapolation was the speed of the oscillation. Among the infinity of potential generalization dimensions of patterns, speedup/slowdown of pattern generation has a singular role and is particularly difficult to achieve. The reason is that speed cannot be modulated by postprocessing of some underlying generator’s output – the prime generator itself must be modulated [108]. Frequency adaptation of neural oscillators is an important theme in research on biological central pattern generators (CPGs) (reviews: [40, 50]). Frequency adaptation has been modeled in a number of ways, among which: (i) using a highly abstracted CPG model in the form of an ODE, and regulating speed by changing the ODE’s time constant; (ii) using a CPG model which includes a pacemaker neuron whose pace is adaptive; (iii) using complex, biologically quite detailed, modular neural architectures in which frequency adaptation arises from interactions between modules, sensor-motoric feedback cycles, and tonic top-down input. However, the fact that humans can execute essentially arbitrary motor patterns at different speeds is not explained by these models. Presumably this requires a generic speed control mechanism which takes effect already at higher (cortical, planning) layers in the motor control hierarchy. Conceptor-controlled frequency adaptation might be of interest as a candidate mechanism for such a “cognitive-level” generic speed control mechanism.

Shape Morphing of an Integer-Periodic Pattern

In this demonstration, the conceptors C^3 and C^4 from the 5-periodic patterns p^3 and p^4 were morphed, again with μ ranging in [−2, 3]. Figure 16 depicts the network observer y(n) for a morph run of 95 steps which was started with μ = −2 and ended with μ = 3, with a linear ramping in between. It can be seen that the differences between the two reference patterns (located at the points marked by dots) become increasingly magnified in both extrapolation segments. The “sweep” induced by the morphing is however neither linear nor of the same type across all 5 points of the period (right panel). A simple algebraic rule that would describe the geometric characteristics of such morphings cannot be given. I would like to say, it is “up to the discretion of the network’s nonlinear dynamics” how the morphing command is interpreted; this is especially true for the extrapolation range. If reservoirs with a different initial random W^* are used, different morphing geometries arise, especially at the far ends of the extrapolation range (not shown).

Figure 16: Morphing between, and extrapolating beyond, two versions of a 5-periodic random pattern. The morphing range was [−2, 3]. Bottom: Network observation signal y(n) from a morphing run. Dots mark the points μ = 0 and μ = 1, corresponding to situations where the two original conceptors were active in unadulterated form. Top: Delay-embedding “snapshots”. Figure layout similar to Figure 15.

The snapshots displayed in Figure 16 reveal that the morphing sweep takes the reservoir through two bifurcations (apparent in the transition from snapshot 2 to 3, and from 7 to 8). In the intermediate morphing range (snapshots 3 – 7), we observe a discrete periodic attractor of 5 points. In the ranges beyond, on both sides the attracting set becomes topologically homeomorphic to a continuous cycle. From a visual inspection, it appears that these bifurcations “smoothly” preserve some geometrical characteristics of the observed signal y(n). A mathematical characterisation of this phenomenological continuity across bifurcations remains for future investigations.

Heterogeneous Pattern Morphing

Figure 17 shows a morph from one of the 5-periodic patterns to one of the irrational-periodic sines. This time the morphing range was [0, 1] (no extrapolation). The Figure shows a run with an initial 25 steps of μ = 0, followed by a 50-step ramp to μ = 1, and a tail of 25 steps at μ = 1. One observes a gradual change of signal shape and period along the morph. From a dynamical systems point of view this gradual change is unexpected. The reservoir is, mathematically speaking, an autonomous system under the influence of a slowly changing control parameter μ. On both ends of the morph, the system is in an attractor. The topological nature of the attractors (seen as subsets of state space) is different (5 isolated points vs. a homolog of a 1-dim circle), so there must be at least one bifurcation taking place along the morphing route. Such a bifurcation would usually be accompanied by a sudden change of some qualitative characteristic of the system trajectory. We find however no trace of a dynamic rupture, at least not by visual inspection of the output trajectory. Again, a more in-depth formal characterization of which geometric properties are “smoothly” carried through these bifurcations is left for future work.

Figure 2 in Section 1 is a compound demonstration of the three types of pattern morphing that I here discussed individually.

A possible application for the pattern morphing by conceptors is to effect smooth gait changes in walking robots, a problem that is receiving some attention in that field.

Figure 17: Morphing from a 5-periodic random pattern to an irrational-periodic sine. The morphing range was [0, 1]. Figure layout otherwise as in Figure 15.

3.8 Understanding Aperture

3.8.1 The Semantics of α as “Aperture”

Here I show how the parameter α can be interpreted as a scaling of signal energy, and motivate why I call it “aperture”.

We can rewrite C(R, α) as follows:

C(R, α) = R (R + α^{−2} I)^{−1} = α^2 R (α^2 R + I)^{−1} = C(α^2 R, 1).    (14)
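This energy-scaling identity is easy to verify numerically. The sketch assumes a helper implementing (7); the matrix dimensions and aperture value are arbitrary.

```python
import numpy as np

def conceptor(R, alpha):
    # C(R, alpha) = R (R + alpha^-2 I)^-1, as in Eq. (7)
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(R.shape[0]))

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 200))
R = X @ X.T / 200
alpha = 7.0
# Aperture acts as a scaling of signal energy: C(R, alpha) = C(alpha^2 R, 1)
assert np.allclose(conceptor(R, alpha), conceptor(alpha ** 2 * R, 1.0))
```

Intuitively, applying aperture α to R is the same as first amplifying the driving signal's energy by α^2 and then applying unit aperture.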