1 Introduction to Reinforcement Learning (RL) with Recurrent Neural Networks (RNNs) in Partially Observable Environments
(Footnote 1: Parts of this introduction are similar to parts of a much more extensive recent Deep Learning overview [245] which has many additional references.)
General Reinforcement Learning (RL) agents must discover, without the aid of a teacher, how to interact with a dynamic, initially unknown, partially observable environment in order to maximize their expected cumulative reward signals, e.g., [123, 272, 310]. There may be arbitrary, a priori unknown delays between actions and perceivable consequences. The RL problem is as hard as any problem of computer science, since any task with a computable description can be formulated in the RL framework, e.g., [109].
To become a general problem solver that is able to run arbitrary problem-solving programs, the controller of a robot or an artificial agent must be a general-purpose computer [67, 35, 282, 194]. Artificial recurrent neural networks (RNNs) fit this bill. A typical RNN consists of many simple, connected processors called neurons, each producing a sequence of real-valued activations. Input neurons get activated through sensors perceiving the environment, other neurons get activated through weighted connections or wires from previously active neurons, and some neurons may affect the environment by triggering actions. Learning or credit assignment is about finding real-valued weights that make the NN exhibit desired behavior, such as driving a car. Depending on the problem and how the neurons are connected, such behavior may require long causal chains of computational stages, where each stage transforms the aggregate activation of the network, often in a nonlinear manner.
Unlike feedforward NNs (FNNs; [95, 23]) and Support Vector Machines (SVMs; [287, 253]), RNNs can in principle interact with a dynamic partially observable environment in arbitrary, computable ways, creating and processing memories of sequences of input patterns [258]. The weight matrix of an RNN is its program. Without a teacher, reward-maximizing programs of an RNN must be learned through repeated trial and error.
1.1 RL through Direct and Indirect Search in RNN Program Space
It is possible to train small RNNs with a few hundred or a few thousand weights using evolutionary algorithms [200, 255, 105, 56, 68] to search the space of NN weights [165, 307, 44, 321, 180, 259, 320, 164, 173, 69, 71, 187, 121, 313, 66, 270, 269, 305], or through policy gradients (PGs) [314, 315, 316, 274, 18, 1, 63, 128, 313, 210, 192, 191, 256, 85, 312, 190, 82, 93][245, Sec. 6.6]. For example, our evolutionary algorithms outperformed traditional, Dynamic Programming [20]-based RL methods [272][245, Sec. 6.2] in partially observable environments, e.g., [72]. However, these techniques by themselves are insufficient for solving complex control problems involving high-dimensional sensory inputs such as video, from scratch. The program search space for networks of the size required for these tasks is simply too large.
However, the search space can often be reduced dramatically by evolving compact encodings of neural networks (NNs), e.g., through Lindenmayer Systems [115], graph rewriting [127], Cellular Encoding [83], HyperNEAT [268], and other techniques [245, Sec. 6.7]. In very general early work, we used universal assembler-like languages to encode NNs [235], later coefficients of a Discrete Cosine Transform (DCT) [132]. The latter method, Compressed RNN Search [132], was used to successfully evolve RNN controllers with over a million weights (the largest ever evolved) to drive a simulated car in a video game, based solely on a high-dimensional video stream [132], learning both control and visual processing from scratch, without unsupervised pretraining of a vision system. This was the first published Deep Learner to learn control policies directly from high-dimensional sensory input using RL.
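To make the idea of a compact encoding concrete, the following is a minimal sketch of how a handful of low-frequency DCT coefficients could be expanded into a much larger weight matrix, so that search only has to explore the coefficients. The function name `decode_weights`, the unnormalised DCT-II basis, and the tiny `genome` are illustrative assumptions; the exact encoding used in [132] may differ.

```python
import math

def decode_weights(coeffs, rows, cols):
    """Expand a small block of low-frequency DCT coefficients into a
    rows x cols weight matrix (unnormalised inverse 2-D DCT, assumed here)."""
    weights = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for u, coeff_row in enumerate(coeffs):
                for v, c in enumerate(coeff_row):
                    total += (c
                              * math.cos(math.pi * (i + 0.5) * u / rows)
                              * math.cos(math.pi * (j + 0.5) * v / cols))
            weights[i][j] = total
    return weights

# Hypothetical genome: 4 searchable numbers stand in for 64 weights.
genome = [[0.5, -0.2],
          [0.1, 0.0]]
W = decode_weights(genome, rows=8, cols=8)
```

An evolutionary search would then mutate `genome` rather than `W`, which is what shrinks the search space in this kind of scheme.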
One can further facilitate the learning task of controllers through certain types of supervised learning (SL) and unsupervised learning (UL) based on gradient descent techniques. In particular, UL/SL can be used to compress the search space, and to build predictive world models to accelerate RL, as will be discussed later. But first let us review the relevant NN algorithms for SL and UL.
1.2 Deep Learning in NNs: Supervised & Unsupervised Learning (SL & UL)
The term Deep Learning was first introduced to Machine Learning in 1986 [49] and to NNs in 2000 [3, 244]. The first deep learning NNs, however, date back to the 1960s [113, 245] (certain more recent developments are covered in a survey [139]).
To maximize differentiable objective functions of SL and UL, NN researchers almost invariably use backpropagation (BP) [125, 30, 52] in discrete graphs of nodes with differentiable activation functions [151, 265][245, Sec. 5.5]. Typical applications include BP in FNNs [297], or BP through time (BPTT) and similar methods in RNNs, e.g., [299, 317, 208][245]. BP and BPTT suffer from the Fundamental Deep Learning Problem first discovered and analyzed in my lab in 1991: with standard activation functions, cumulative backpropagated error signals decay exponentially in the number of layers, or they explode [98, 99]. Hence most early FNNs [297, 211] had few layers. Similarly, early RNNs [245, Sec. 5.6.1] could not generalize well under both short and long time lags between relevant events. Over the years, several ways of overcoming the Fundamental Deep Learning Problem have been explored. For example, deep stacks of unsupervised RNNs [228] or FNNs [13, 96, 139] help to accelerate subsequent supervised learning through BPTT [228, 230] or BP [96]. One can also “distill” or compress the knowledge of a teacher RNN into a student RNN by forcing the student to predict the hidden units of the teacher [228, 230].
Long Short-Term Memory (LSTM; [101, 61, 77]) alleviates the Fundamental Deep Learning Problem, and was the first RNN architecture to win international contests (in connected handwriting), e.g., [79, 247][245]. Connectionist Temporal Classification (CTC) [76] is a widely used gradient-based method for finding RNN weights that maximize the probability of teacher-provided label sequences, given (typically much longer and more high-dimensional) streams of real-valued input vectors. For example, CTC was used by Baidu to break an important speech recognition record [88]. Many recent state-of-the-art results in sequence processing are based on LSTM, which learned to control robots [159], and was used to set benchmark records in prosody contour prediction [55] (IBM), text-to-speech synthesis [54] (Microsoft), large vocabulary speech recognition [213] (Google), and machine translation [271] (Google). CTC-trained LSTM greatly improved Google Voice [214] and is now available to over a billion smartphone users. Nevertheless, at least in some applications, other RNNs may sometimes yield better results than gradient-based LSTM [158, 217, 323, 116, 250, 186, 133]. Alternative NNs with differentiable memory have been proposed [229, 47, 175, 232, 231, 103, 80, 303].
Today’s faster computers, such as GPUs, mitigate the Fundamental Deep Learning Problem for FNNs [181, 34, 198, 38, 40]. In particular, many recent computer vision contests were won by fully supervised Max-Pooling Convolutional NNs (MPCNNs), which consist of alternating convolutional [58, 19] and max-pooling [296] layers topped off by standard fully connected output layers. All weights are trained by backpropagation [140, 199, 220, 245]. Ensembles [218, 28] of GPU-based MPCNNs [40, 41] achieved dramatic improvements of long-standing benchmark records, e.g., MNIST (2011), won numerous competitions [247, 38, 41, 39, 161, 42, 36, 134, 322, 37, 245], and achieved the first human-competitive or even superhuman results on well-known benchmarks, e.g., [247, 42, 245]. There are many recent variations and improvements [64, 74, 124, 75, 277, 266, 245]. Supervised Transfer Learning from one dataset to another [32, 43] can speed up learning. A combination of Convolutional NNs (CNNs) and LSTM led to best results in automatic image caption generation [288].
1.3 Gradient Descent-Based NNs for RL
Perhaps the most well-known RL application is Tesauro’s backgammon player [280] from 1994 which learned to achieve the level of human world champions, by playing against itself. It uses a reactive (memory-free) policy based on the simplifying assumption of Markov Decision Processes: the current input of the RL agent conveys all information necessary to compute an optimal next output event or decision. The policy is implemented as a gradient-based FNN trained by the method of temporal differences [272][245, Sec. 6.2]. During play, the FNN learns to map board states to predictions of expected cumulative reward, and selects actions leading to states with maximal predicted reward. A very similar approach (also based on over 20-year-old methods) employed a CNN (see Sec. 1.2) to play several Atari video games directly from 84×84 pixel 60 Hz video input [167], using Neural Fitted Q-Learning (NFQ) [201] based on experience replay (1991) [149]. Even better results were achieved by using (slow) Monte Carlo tree planning to train comparatively fast deep NNs [86].
Such FNN approaches cannot work in realistic partially observable environments where memories of previous inputs have to be stored for a priori unknown time intervals. This triggered work on partially observable Markov decision problems (POMDPs) [223, 222, 227, 204, 205, 206, 316, 148, 278, 122, 152, 25, 114, 160, 126, 308, 309, 183]. Traditional RL techniques [272][245, Sec. 6.2] based on Dynamic Programming [20] can be combined with gradient descent methods to train an RNN as a value-function approximator that maps entire event histories to predictions of expected cumulative reward [227, 148]. LSTM [101, 61, 189, 78, 77] (see Sec. 1.2) was used in this way for RL robots [12].
Gradient-based UL may be used to reduce an RL controller’s search space by feeding it only compact codes of high-dimensional inputs [118, 142, 46][245, Sec. 6.4]. For example, NFQ [201] was applied to real-world control tasks [138, 202] where purely visual inputs were compactly encoded in hidden layers of deep autoencoders [245, Sec. 5.7 and 5.15]. RL combined with unsupervised learning based on Slow Feature Analysis [318, 131] enabled a humanoid robot to learn skills from raw video streams [154]. A RAAM RNN [193] was employed as a deep unsupervised sequence encoder for RL [65].
1.3.1 Early RNN Controllers with Predictive RNN World Models
One important application of gradient-based UL is to obtain a predictive world model, $M$, that a controller, $C$, may use to achieve its goals more efficiently, e.g., through cheap, “mental” $M$-based trials, as opposed to expensive trials in the real world [301, 273]. The first combination of an RL RNN and a UL RNN was ours and dates back to 1990 [223, 222, 226, 227], generalizing earlier similar controller/model systems ($CM$ systems) based on FNNs [298, 179]; compare related work [177, 119, 301, 300, 209, 120, 178, 302, 73, 45, 144, 166, 153, 196, 60][245, Sec. 6.1]. $M$ tries to learn to predict $C$’s inputs (including reward signals) from previous inputs and actions. $M$ is also temporarily used as a surrogate for the environment: $M$ and $C$ form a coupled RNN where $M$’s outputs become inputs of $C$, whose outputs (actions) in turn become inputs of $M$. Now a gradient descent technique [299, 317, 208] (see Sec. 1.2) can be used to learn and plan ahead by training $C$ in a series of $M$-simulated trials to produce output action sequences achieving desired input events, such as high real-valued reward signals (while the weights of $M$ remain fixed). An RL active vision system from 1991 [249] used this basic principle to learn sequential shifts (saccades) of a fovea to detect targets in a visual scene, thus learning a rudimentary version of selective attention.
Those early systems, however, did not yet use powerful RNNs such as LSTM. A more fundamental problem is that if the environment is too noisy, $M$ will usually only learn to approximate the conditional expectations of predicted values, given parts of the history. In certain noisy environments, Monte Carlo Tree Search (MCTS; [29]) and similar techniques may be applied to $M$ to plan successful future action sequences for $C$. All such methods, however, are about simulating possible futures time step by time step, without profiting from human-like hierarchical planning or abstract reasoning, which often ignores irrelevant details.
1.3.2 Early Predictive RNN World Models Combined with Traditional RL
In the early 1990s, an RNN $M$ as in Sec. 1.3.1 was also combined [227, 150] with traditional temporal difference methods [122, 272][245, Sec. 6.2] based on the Markov assumption (Sec. 1.3). While $M$ is processing the history of actions and observations to predict future inputs and rewards, the internal states of $M$ are used as inputs to a temporal difference-based predictor of cumulative predicted reward, to be maximized through appropriate action sequences. One of our systems described in 1991 [227] actually collapsed the cumulative reward predictor into the predictive world model, $M$.
1.4 Hierarchical & Multitask RL and Algorithmic Transfer Learning
Work on NN-based Hierarchical RL (HRL) without predictive world models has been published since the early 1990s. In particular, gradient-based subgoal discovery with RNNs decomposes RL tasks into subtasks for submodules [225]. Numerous alternative HRL techniques have been proposed [204, 206, 117, 279, 295, 171, 195, 50, 162, 51, 15, 215, 11, 260]. While HRL frameworks such as Feudal RL [48] and options [275, 16, 261] do not directly address the problem of automatic subgoal discovery, HQ-Learning [309] automatically decomposes problems in partially observable environments into sequences of simpler subtasks that can be solved by memoryless policies learnable by reactive subagents. Related methods include incremental NN evolution [70], hierarchical evolution of NNs [306, 285], and hierarchical Policy Gradient algorithms [63]. Recent HRL organizes potentially deep NN-based RL submodules into self-organizing, 2-dimensional motor control maps [203] inspired by neurophysiological findings [81]. The methods above, however, assign credit in hierarchical fashion by limited fixed schemes that are not themselves improved or adapted in problem-specific ways. The next sections will describe novel systems that overcome such drawbacks of the above-mentioned methods.
General methods for incremental multitask RL and algorithmic transfer learning that are not NN-specific include the evolutionary ADATE system [182], the Success-Story Algorithm for Self-Modifying Policies running on general-purpose computers [233, 252, 251], and the Optimal Ordered Problem Solver [238], which learns algorithmic solutions to new problems by inspecting and exploiting (in arbitrary computable fashion) solutions to old problems, in a way that is asymptotically time-optimal. PowerPlay [243, 267] incrementally learns to become a more and more general algorithmic problem solver, by continually searching the space of possible pairs of new tasks and modifications of the current solver, until it finds a more powerful solver that, unlike the unmodified solver, solves all previously learned tasks plus the new one, or at least simplifies/compresses/speeds up previous solutions, without forgetting any.
2 Algorithmic Information Theory (AIT) for RNN-based AIs
Our early RNN-based systems (1990) mentioned in Sec. 1.3.1 learn a predictive model of their initially unknown environment. Real brains seem to do so too, but are still far superior to present artificial systems in many ways. They seem to exploit the model in smarter ways, e.g., to plan action sequences in hierarchical fashion, or through other types of abstract reasoning, continually building on earlier acquired skills, becoming increasingly general problem solvers able to deal with a large number of diverse and complex tasks. Here we describe RNN-based Artificial Intelligences (RNNAIs) designed to do the same by “learning to think.”
(Footnote 3: The terminology is partially inspired by our RNNAISSANCE workshop at NIPS 2003 [246].)
While FNNs are traditionally linked [23] to concepts of statistical mechanics and information theory [24, 257, 136], the programs of general computers such as RNNs call for the framework of Algorithmic Information Theory (AIT) [263, 130, 33, 145, 264, 147] (own AIT work: [234, 235, 236, 237, 238]). Given some universal programming language [67, 35, 282, 194] for a universal computer, the algorithmic information content or Kolmogorov complexity of some computable object is the length of the shortest program that computes it. Since any program for one computer can be translated into a functionally equivalent program for a different computer by a compiler program of constant size, the Kolmogorov complexity of most objects hardly depends on the particular computer used. Most computable objects of a given size, however, are hardly compressible, since there are only relatively few programs that are much shorter. Similar observations hold for practical variants of Kolmogorov complexity that explicitly take into account program runtime [146, 6, 291, 147, 235, 237]. Our RNNAIs are inspired by the following argument.
2.1 Basic AIT Argument
According to AIT, given some universal computer, $U$, whose programs are encoded as bit strings, the mutual information between two programs $p$ and $q$ is expressed as $K(q \mid p)$, the length of the shortest program $\bar{w}$ that computes $q$, given $p$, ignoring an additive constant of $O(1)$ depending on $U$ (in practical applications the computation will be time-bounded [147]). That is, if $p$ is a solution to problem $P$, and $q$ is a fast (say, linear time) solution to problem $Q$, and if $K(q \mid p)$ is small, and $\bar{w}$ is both fast and much shorter than $q$, then asymptotically optimal universal search [146, 238] for a solution to $Q$, given $p$, will generally find $\bar{w}$ first (to compute $q$ and solve $Q$), and thus solve $Q$ much faster than search for $q$ from scratch [238].
2.2 One RNN-Like System Actively Learns to Exploit Algorithmic Information of Another
The AIT argument of Sec. 2.1 has broad applicability. Let both $C$ and $M$ be RNNs or similar general parallel-sequential computers [229, 47, 175, 232, 231, 103, 80, 303]. $M$’s vector of learnable real-valued parameters $w_M$ is trained by any SL or UL or RL algorithm to perform a certain well-defined task in some environment. Then $w_M$ is frozen. Now the goal is to train $C$’s parameters $w_C$ by some learning algorithm to perform another well-defined task whose solution may share mutual algorithmic information with the solution to $M$’s task. To facilitate this, we simply allow $C$ to learn to actively inspect and reuse (in essentially arbitrary computable fashion) the algorithmic information conveyed by $M$ and $w_M$.
Let us consider a trial during which $C$ makes an attempt to solve its given task within a series of discrete time steps $t = t_a, t_a + 1, \ldots, t_b$. $C$’s learning algorithm may use the experience gathered during the trial to modify $w_C$ in order to improve $C$’s performance in later trials. During the trial, we give $C$ an opportunity to explore and exploit or ignore $M$ by interacting with it. In what follows, $C(t)$, $M(t)$, $sense(t)$, $act(t)$, $query(t)$, $answer(t)$, $w_M$, $w_C$ denote vectors of real values; $f_C$, $f_M$ denote computable [67, 35, 282, 194] functions.
At any time $t$, $C(t)$ and $M(t)$ denote $C$’s and $M$’s current states, respectively. They may represent current neural activations or fast weights [229, 232, 231] or other dynamic variables that may change during information processing. $sense(t)$ is the current input from the environment (including reward signals if any); a part of $C(t)$ encodes the current output $act(t)$ to the environment, another a memory of previous events (if any). Parts of $C(t)$ and $M(t)$ intersect in the sense that both $C(t)$ and $M(t)$ also encode $C$’s current $query(t)$ to $M$, and $M$’s current $answer(t)$ to $C$ (in response to previous queries), thus representing an interface between $C$ and $M$.
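$M(t_a)$ and $C(t_a)$ are initialized by default values. For $t_a \le t < t_b$,
$$C(t+1) = f_C(w_C, C(t), M(t), sense(t), w_M)$$
with learnable parameters $w_C$; $act(t)$ is a computable function of $C(t)$ and may influence $sense(t+1)$, and
$$M(t+1) = f_M(C(t), M(t), w_M)$$
with fixed parameters $w_M$. So both $M(t+1)$ and $C(t+1)$ are computable functions of previous events including queries and answers transmitted through the learnable $f_C$.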
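The coupled update just described can be sketched as a small simulation. This is only an illustration under strong simplifying assumptions: both update functions are single `tanh` layers, the "query" is simply whatever part of the controller state the frozen model reads, the "answer" is whatever part of the model state the controller reads, and `toy_env`, the dimensions and all function names are hypothetical.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def f_M(C_state, M_state, w_M):
    """Model update with fixed parameters w_M; reads C's state (the query)."""
    return [math.tanh(a) for a in matvec(w_M, C_state + M_state)]

def f_C(w_C, C_state, M_state, sense, w_M):
    """Controller update with learnable parameters w_C; reads M's state
    (the answer). In this toy version w_M is passed along but not read."""
    return [math.tanh(a) for a in matvec(w_C, C_state + M_state + sense)]

def run_trial(w_C, w_M, n_c, n_m, steps, env):
    C_state, M_state = [0.0] * n_c, [0.0] * n_m   # default initialisation
    sense = env(None)                              # first observation
    actions = []
    for _ in range(steps):
        act = C_state[:1]                          # action: a function of C's state
        C_next = f_C(w_C, C_state, M_state, sense, w_M)
        M_next = f_M(C_state, M_state, w_M)
        sense = env(act)                           # action may influence next input
        C_state, M_state = C_next, M_next
        actions.append(act[0])
    return actions, C_state, M_state

def toy_env(act):
    # Hypothetical environment: the observation is 1 minus the last action.
    return [1.0] if act is None else [1.0 - act[0]]

rng = random.Random(0)
n_c, n_m, n_s = 3, 4, 1
w_M = [[rng.uniform(-1, 1) for _ in range(n_c + n_m)] for _ in range(n_m)]
w_C = [[rng.uniform(-1, 1) for _ in range(n_c + n_m + n_s)] for _ in range(n_c)]
actions, C_final, M_final = run_trial(w_C, w_M, n_c, n_m, steps=5, env=toy_env)
```

A learning algorithm for the controller would repeat `run_trial` with modified `w_C` while leaving `w_M` untouched.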
According to the AIT argument, provided that $M$ conveys substantial algorithmic information about $C$’s task, and the trainable interface $f_C$ between $C$ and $M$ allows $C$ to address and extract and exploit this information quickly, and $w_C$ is small compared to the fixed $w_M$, the search space of $C$’s learning algorithm (trying to find a good $w_C$ through a series of trials) should be much smaller than the one of a similar competing system that has no opportunity to query $M$ but has to learn the task from scratch.
For example, suppose that $M$ has learned to represent (e.g., through predictive coding [228, 248]) videos of people placing toys in boxes, or to summarize such videos through textual outputs. Now suppose $C$’s task is to learn to control a robot that places toys in boxes. Although the robot’s actuators may be quite different from human arms and hands, and although videos and video-describing texts are quite different from desirable trajectories of robot movements, $M$ is expected to convey algorithmic information about $C$’s task, perhaps in the form of connected high-level spatio-temporal feature detectors representing typical movements of hands and elbows independent of arm size. Learning a $w_C$ that addresses and extracts this information from $M$ and partially reuses it to solve the robot’s task may be much faster than learning to solve the task from scratch without access to $M$.
2.3 Consequences of the AIT Argument for Model-Building Controllers
The simple AIT insight above suggests that in many partially observable environments it should be possible to greatly speed up the program search of an RL RNN, $C$, by letting it learn to access, query, and exploit in arbitrary computable ways the program of a typically much bigger gradient-based UL RNN, $M$, used to model and compress the RL agent’s entire growing interaction history of all failed and successful trials.
3 The RNNAI and its Holy Data
In what follows, let $m, n, o$ denote positive integer constants, and $h, i, k, t, \tau$ positive integer variables assuming ranges implicit in the given contexts. The $i$-th component of any real-valued vector, $v$, is denoted by $v_i$. Let the RNNAI’s life span a discrete sequence of time steps, $t = 1, 2, \ldots, t_{death}$.
At the beginning of a given time step, $t$, there is a “normal” sensory input vector, $in(t) \in \mathbb{R}^m$, and a reward input vector, $r(t) \in \mathbb{R}^n$. For example, parts of $in(t)$ may represent the pixel intensities of an incoming video frame, while components of $r(t)$ may reflect external positive rewards, or negative values produced by pain sensors whenever they measure excessive temperature or pressure. Let $sense(t) \in \mathbb{R}^{m+n}$ denote the concatenation of the vectors $in(t)$ and $r(t)$. The total reward at time $t$ is $R(t) = \sum_{i=1}^{n} r_i(t)$. The total cumulative reward up to time $t$ is $CR(t) = \sum_{\tau=1}^{t} R(\tau)$. During time step $t$, the RNNAI produces an output action vector, $out(t) \in \mathbb{R}^o$, which may influence the environment and thus future $sense(\tau)$ for $\tau > t$. At any given time, the RNNAI’s goal is to maximize $CR(t_{death})$.
Let $all(t) \in \mathbb{R}^{m+n+o}$ denote the concatenation of $sense(t)$ and $out(t)$. Let $H(t)$ denote the sequence $(all(1), all(2), \ldots, all(t))$ up to time $t$.
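As a small worked illustration of this bookkeeping, the sketch below stores each time step as the concatenation of sensory input, reward vector and action, and recomputes the cumulative reward from the stored history. The helper names and the toy numbers are assumptions made for the example only.

```python
def total_reward(r):
    """Total reward of one time step: the sum of the reward vector's components."""
    return sum(r)

def record_step(history, inp, r, out):
    """Append one step (input, reward and action concatenated) to the history."""
    history.append(list(inp) + list(r) + list(out))

def cumulative_reward(history, m, n):
    """Sum of per-step total rewards; the reward components occupy
    positions m .. m+n-1 of every stored step."""
    return sum(sum(step[m:m + n]) for step in history)

m, n, o = 2, 2, 1                      # toy dimensions
H = []                                 # the growing, never-discarded history
record_step(H, inp=[0.1, 0.2], r=[1.0, -0.5], out=[0.0])
record_step(H, inp=[0.3, 0.4], r=[0.0, 2.0], out=[1.0])
print(cumulative_reward(H, m, n))      # 2.5
```

Because every step is kept, quantities such as the cumulative reward can always be recomputed from the history rather than tracked separately.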
To be able to retrain its components on all observations ever made, the RNNAI stores its entire, growing, lifelong sensory-motor interaction history including all inputs and actions and reward signals observed during all successful and failed trials [239, 240], including what initially looks like noise but later may turn out to be regular. This is normally not done, but is feasible today. That is, all data is “holy”, and never discarded, in line with what mathematically optimal general problem solvers should do [109, 237]. Remarkably, even human brains may have enough storage capacity to store 100 years of sensory input at a reasonable resolution [240].
3.1 Standard Activation Spreading in Typical RNNs
Many RNN-like models can be used to build general computers, e.g., neural pushdown automata [47, 175], NNs with quickly modifiable, differentiable external memory based on fast weights [229], or closely related RNN-based meta-learners [232, 231, 103, 219]. Using sloppy but convenient terminology, we refer to all of them as RNNs. A typical implementation of $M$ uses an LSTM network (see Sec. 1.2). If there are large 2-dimensional inputs such as video images, then they can be first filtered through a CNN (compare Sec. 1.2 and 4.3) before being fed into the LSTM. Such a CNN-LSTM combination is still an RNN.
Here we briefly summarize information processing in standard RNNs. Using notation similar to the one of a previous survey [245, Sec. 2], let $i, k, s$ denote positive integer variables assuming ranges implicit in the given contexts. Let $n_u, n_w, T$ also denote positive integers.
At any given moment, an RNN (such as the $M$ of Sec. 4) can be described as a connected graph with $n_u$ units (or nodes or neurons) in a set $N = \{u_1, u_2, \ldots, u_{n_u}\}$ and a set $H \subseteq N \times N$ of directed edges or connections between nodes. The input layer is the set of input units, a subset of $N$. In fully connected RNNs, all units have connections to all non-input units.
The RNN’s behavior or program is determined by $n_w$ real-valued, possibly modifiable, parameters or weights, $w_i$ $(i = 1, \ldots, n_w)$. During an episode of information processing (e.g., during a trial of Sec. 3.2), there is a partially causal sequence $x_s$ $(s = 1, \ldots, T)$ of real values called events. Here the index $s$ is used in a way that is much more fine-grained than the one of the index $t$ in Sec. 3, 4, 5: a single time step may involve numerous events. Each $x_s$ is either an input set by the environment, or the activation of a unit that may directly depend on other $x_k$ $(k < s)$ through a current NN topology-dependent set, $in_s$, of indices $k$ representing incoming causal connections or links. Let the function $v$ encode topology information, and map such event index pairs, $(k, s)$, to weight indices. For example, in the non-input case we may have $x_s = f_s(net_s)$ with real-valued $net_s = \sum_{k \in in_s} x_k w_{v(k,s)}$ (additive case) or $net_s = \prod_{k \in in_s} x_k w_{v(k,s)}$ (multiplicative case), where $f_s$ is a typically nonlinear real-valued activation function such as $\tanh$. Other functions combine additions and multiplications [113, 112]; many other activation functions are possible. The sequence, $x_s$, may directly affect certain $x_k$ $(k > s)$ through outgoing connections or links represented through a current set, $out_s$, of indices $k$ with $s \in in_k$. Some of the non-input events are called output events.
Many of the $x_s$ may refer to different, time-varying activations of the same unit, e.g., in RNNs. During the episode, the same weight may get reused over and over again in topology-dependent ways. Such weight sharing across space and/or time may greatly reduce the NN’s descriptive complexity, which is the number of bits of information required to describe the NN (Sec. 4). Training algorithms for the RNNs of our RNNAIs will be discussed later.
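The event-based view of activation spreading (additive case) can be illustrated with a few lines of code. In this sketch, events are indexed from 1, `incoming[s]` lists the earlier events that feed event `s`, and the dictionary `v` maps an event index pair to a weight index, so that one weight can be shared by several connections. All names and the tiny example network are assumptions for illustration.

```python
import math

def spread_activations(T, env_inputs, incoming, v, w, f=math.tanh):
    """Compute events x_1 .. x_T in causal order (additive case)."""
    x = {}
    for s in range(1, T + 1):
        if s in env_inputs:                    # input event set by the environment
            x[s] = env_inputs[s]
        else:                                  # activation depending on earlier events
            net = sum(x[k] * w[v[(k, s)]] for k in incoming[s])
            x[s] = f(net)
    return x

w = [0.5, -1.0]                                # two weights in total
env_inputs = {1: 1.0, 2: 0.0}                  # events 1 and 2 are inputs
incoming = {3: [1, 2], 4: [3]}                 # event 3 reads 1 and 2; event 4 reads 3
v = {(1, 3): 0, (2, 3): 1, (3, 4): 0}          # weight 0 is shared by two links
x = spread_activations(4, env_inputs, incoming, v, w)
```

Here three connections are described by only two weights, a toy instance of the weight sharing that lowers descriptive complexity.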
3.2 Alternating Training Phases for Controller and World Model
Several novel implementations of $C$ are described in Sec. 5. All of them make use of a variable size RNN called the world model, $M$, which learns to compactly encode the growing history, for example, through predictive coding, trying to predict (the expected value of) each input component, given the history of actions and observations. $M$’s goal is to discover algorithmic regularities in the data so far by learning a program that compresses the data better in a lossless manner. Example details will be specified in Sec. 4.
Both $C$ and $M$ have real-valued parameters or weights that can be modified to improve performance. To avoid instabilities, $C$ and $M$ are trained in alternating fashion, as in Algorithm 1.
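The alternation can be pictured as a simple loop: hold the world model fixed while the controller acts and learns, then release it and retrain it on the prolonged history in a "sleep phase". The sketch below is one possible reading of that scheme, not a transcription of Algorithm 1; `ToyModel`, `ToyController`, `train_alternating` and the toy environment are hypothetical stand-ins, and the learning steps are placeholders.

```python
import random

class ToyModel:
    """Stand-in for the world model: remembers the mean observation so far."""
    def __init__(self):
        self.mean = 0.0
        self.frozen = False
        self.retrain_calls = 0

    def retrain(self, history):
        if self.frozen:
            raise RuntimeError("model must be unfrozen before retraining")
        observations = [obs for obs, _ in history]
        self.mean = sum(observations) / len(observations)
        self.retrain_calls += 1

class ToyController:
    """Stand-in for the controller: acts at random; `improve` is a placeholder for RL."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.improve_calls = 0

    def act(self, observation, model):
        return self.rng.choice([-1.0, 1.0])

    def improve(self, history, model):
        self.improve_calls += 1

def train_alternating(controller, model, env_step, rounds, steps_per_trial):
    history = []                               # grows forever, never discarded
    observation = 0.0
    for _ in range(rounds):
        model.frozen = True                    # model fixed while controller learns
        for _ in range(steps_per_trial):       # one trial prolongs the history
            action = controller.act(observation, model)
            history.append((observation, action))
            observation = env_step(observation, action)
        controller.improve(history, model)
        model.frozen = False                   # "sleep phase": retrain the model
        model.retrain(history)
    return history

controller, model = ToyController(), ToyModel()
history = train_alternating(controller, model,
                            env_step=lambda obs, a: 0.9 * obs + 0.1 * a,
                            rounds=3, steps_per_trial=5)
```

Keeping the two learners from changing at the same time is the point of the `frozen` flag; a stopping criterion would replace the fixed number of rounds in practice.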
4 The Gradient-Based World Model $M$
A central objective of unsupervised learning is to compress the observed data [14, 228]. $M$’s goal is to compress the RL agent’s entire growing interaction history of all failed and successful trials [239, 241], e.g., through predictive coding [228, 248]. $M$ has $m+n+o$ input units to receive $all(t)$ at time $t < t_{death}$, and $m+n$ output units to produce a prediction $pred(t+1) \in \mathbb{R}^{m+n}$ of $sense(t+1)$ [223, 226, 222, 227].
4.1 $M$’s Compression Performance on the History so far
Let us address details of training $M$ in a “sleep phase” of step 4 in Algorithm 1. (The training of $C$ will be discussed in Sec. 5.) Consider some $M$ with given (typically suboptimal) weights and a default initialization of all unit activations. One example way of making $M$ compress the history (but not the only one) is the following. Given $H(t)$, we can train $M$ by replaying [149] $H(t)$ in semi-offline training, sequentially feeding $all(1), all(2), \ldots, all(t)$ into $M$’s input units in standard RNN fashion (Sec. 1.2, 3.1). Given $H(\tau)$ ($\tau < t$), $M$ calculates $pred(\tau+1)$, a prediction of $sense(\tau+1)$. A standard error function to be minimized by gradient descent in $M$’s weights (Sec. 1.2) would be $E(t) = \sum_{\tau=1}^{t-1} \| pred(\tau+1) - sense(\tau+1) \|^2$, the sum of the deviations of the predictions from the observations so far.
However, $M$’s goal is not only to minimize the total prediction error, $E$. Instead, to avoid the erroneous “discovery” of “regular patterns” in irregular noise, we use AIT’s sound way of dealing with overfitting [263, 130, 289, 207, 147, 84], and measure $M$’s compression performance by the number of bits required to specify $M$, plus the bits needed to encode the observed deviations from $M$’s predictions [239, 241]. For example, whenever $M$ incorrectly predicts certain input pixels of a perceived video frame, those pixel values will have to be encoded separately, which will cost storage space. (In typical applications, $M$ can only execute a fixed number of elementary computations per time step to compress and decompress data, which usually has to be done online. That is, in general $M$ will not reflect the data’s true Kolmogorov complexity [263, 130], but at best a time-bounded variant thereof [147].)
Let integer variables, $bits_M$ and $bits_H$, denote estimates of the number of bits required to encode (by a fixed algorithmic scheme) the current $M$, and the deviations of $M$’s predictions from the observations on the current history, respectively. For example, to obtain $bits_H$, we may naively assume some simple, bell-shaped, zero-centered probability distribution $P_e$ on the finite number of possible real-valued prediction errors $e_{i,\tau} = (pred_i(\tau) - sense_i(\tau))^2$ (in practical applications the errors will be given with limited precision), and encode each $e_{i,\tau}$ by $-\log P_e(e_{i,\tau})$ bits [108, 257]. That is, large errors are considered unlikely and cost more bits than small ones. To obtain $bits_M$, we may naively multiply the current number of $M$’s non-zero modifiable weights by a small integer constant reflecting the weight precision. Alternatively, we may assume some simple, bell-shaped, zero-centered probability distribution, $P_w$, on the finite number of possible weight values (given with limited precision), and encode each $w_i$ by $-\log P_w(w_i)$ bits. That is, large absolute weight values are considered unlikely and cost more bits than small ones [91, 294, 135, 97]. Both alternatives ignore the possibility that $M$’s entire weight matrix might be computable by a short computer program [235, 132], but have the advantage of being easy to calculate. Moreover, since $M$ is a general computer itself, at least in principle it has a chance of learning equivalents of such short programs.
4.2 $M$’s Training
To decrease $bits_M + bits_H$, we add a regularizing term to $E$, to punish excessive complexity [4, 5, 91, 294, 155, 135, 97, 170, 169, 104, 290, 7, 290, 87, 286, 319, 100, 102].
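To make the two-part cost of Sec. 4.1 concrete, here is a minimal sketch of the naive bit estimates: a per-weight constant (or a bell-shaped prior) for the model, plus a bell-shaped prior over squared prediction errors for the history. The Gaussian form, the values of `sigma` and `precision`, the 16 bits per weight, and all function names are assumptions chosen for illustration; the returned values are real-valued estimates that could be rounded up to integers.

```python
import math

def bell_bits(value, sigma, precision):
    """Approximate bits needed for `value` under a zero-centred Gaussian with
    standard deviation `sigma`, discretised into bins of width `precision`."""
    log_p = (-0.5 * (value / sigma) ** 2
             - math.log(sigma * math.sqrt(2.0 * math.pi))
             + math.log(precision))
    return max(0.0, -log_p / math.log(2.0))

def bits_model_naive(weights, bits_per_weight=16):
    """Non-zero weights times a small constant reflecting weight precision."""
    return bits_per_weight * sum(1 for w in weights if w != 0.0)

def bits_model_prior(weights, sigma=1.0, precision=1e-3):
    """Alternative: large absolute weights cost more bits than small ones."""
    return sum(bell_bits(w, sigma, precision) for w in weights)

def bits_history(predictions, observations, sigma=0.1, precision=1e-3):
    """Bits for the squared deviations of predictions from observations."""
    total = 0.0
    for pred, obs in zip(predictions, observations):
        for p, s in zip(pred, obs):
            total += bell_bits((p - s) ** 2, sigma, precision)
    return total

def description_length(weights, predictions, observations):
    return bits_model_naive(weights) + bits_history(predictions, observations)

weights = [0.0, 0.5, -1.0]
observations = [[0.2, 0.4], [0.1, 0.3]]
good = description_length(weights, [[0.2, 0.4], [0.1, 0.3]], observations)
bad = description_length(weights, [[0.9, -0.5], [0.8, -0.6]], observations)
```

A regularised training objective would trade the prediction error against the weight cost, so that a change to the model is only worthwhile if the summed estimate goes down.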
Step 1 of Algorithm 1 starts with a small $M$. As the history grows, to find an $M$ with small $bits_M + bits_H$, step 4 uses sequential network construction: it regularly changes $M$’s size by adding or pruning units and connections [111, 112, 8, 168, 59, 107, 304, 176, 141, 92, 143, 204, 53, 296, 106, 31, 57, 185, 283]. Whenever this helps (after additional training of $M$ with BPTT; see Sec. 1.2) to improve $bits_M + bits_H$ on the history so far, the changes are kept, otherwise they are discarded. (Note that even animal brains grow and prune neurons.)
Given history $H(t)$, instead of retraining $M$ in a sleep phase (step 4 of Algorithm 1) on all of $H(t)$, we may retrain it on parts thereof, by selecting trials randomly or otherwise from $H(t)$, and replay them to retrain $M$ in standard fashion (Sec. 1.2). To do this, however, all of $M$’s unit activations need to be stored at the beginning of each trial. ($M$’s hidden unit activations, however, do not have to be stored if they are reset to zero at the beginning of each trial.)
4.3 $M$ may have a Built-In FNN Preprocessor
To facilitate $M$’s task in certain environments, each frame of the sensory input stream (video, etc.) can first be separately compressed through autoencoders [211] or autoencoder hierarchies [13, 21] based on CNNs or other FNNs (see Sec. 1.2) [42] used as sensory preprocessors to create less redundant sensory codes [118, 138, 142, 46]. The compressed codes are then fed into an RNN trained to predict not the raw inputs, but their compressed codes. Those predictions have to be decompressed again by the FNN, to evaluate the total compression performance, $bits_M + bits_H$, of the FNN-RNN combination representing $M$.
5 The Controller C Learning to Exploit RNN World Model M
Here we describe ways of using the world model, M, of Sec. 4 to facilitate the task of the RL controller, C. In particular, the systems of Sec. 5.3 overcome drawbacks of early systems mentioned in Sec. 1.3.1, 1.3.2. Some of the setups of the present Sec. 5 can be viewed as special cases of the general scheme in Sec. 2.2.
5.1 C as a Standard RL Machine whose States are M’s Activations
We start with details of an approach whose principles date back to the early 1990s [227, 150] (Sec. 1.3.2). Given an RNN or RNN-like M as in Sec. 4, we implement C as a traditional RL machine [272][245, Sec. 6.2] based on the Markov assumption (Sec. 1.3). While M is processing the history of actions and observations to predict future inputs, the internal states of M are used as inputs to a predictor of cumulative expected future reward.
More specifically, in step 3 of algorithm 1, consider a single trial. M is used as a preprocessor for C as follows. At the beginning of a given time step, t, of the trial, consider the vector of M’s current hidden unit activations (those units that are neither input nor output units). C’s input at time t is the concatenation of the current input from the environment, M’s hidden unit activations, and M’s current predictions. (In cases where M’s activations are reset after each trial, the latter two are initialized by default values, e.g., zero vectors.)
C is an RL machine whose input dimension matches this concatenated vector and whose output dimension matches the action vector. At time t, the concatenated vector is fed into C, which then computes an action. Then M computes, from the current input, its hidden unit activations, and the action, the next values of its hidden unit activations and predictions. Then the action is executed in the environment, to obtain the next input.
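The per-step data flow just described can be sketched as follows. This is only an illustration: the toy `WorldModel` (one tanh layer plus a linear prediction layer), the linear controller matrix `controller_W`, the callback `env_step`, and all dimensions are hypothetical stand-ins, and no learning takes place here.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class WorldModel:
    """Toy stand-in for M: a tanh recurrent layer and a linear prediction layer."""

    def __init__(self, n_in, n_act, n_hid, rng):
        self.n_in, self.n_hid = n_in, n_hid
        self.W_h = rand_matrix(n_hid, n_in + n_hid + n_act, rng)
        self.W_p = rand_matrix(n_in, n_hid, rng)

    def step(self, obs, hid, act):
        new_hid = [math.tanh(v) for v in matvec(self.W_h, obs + hid + act)]
        pred = matvec(self.W_p, new_hid)  # prediction of the next input
        return new_hid, pred


def run_trial(model, controller_W, env_step, first_obs, steps):
    hid = [0.0] * model.n_hid   # M's activations are reset at the start of the trial
    pred = [0.0] * model.n_in
    obs = first_obs
    trace = []
    for _ in range(steps):
        c_input = obs + hid + pred            # concatenated input to C
        act = matvec(controller_W, c_input)   # C computes the action
        hid, pred = model.step(obs, hid, act) # M updates its state and predictions
        obs = env_step(act)                   # the action is executed
        trace.append((c_input, act))
    return trace
```

The returned trace pairs each controller input with the chosen action, which is the kind of data a standard RL method would evaluate.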
The parameters or weights of C are trained to maximize reward by a standard RL method such as Q-learning or similar methods [17, 292, 293, 172, 254, 212, 262, 10, 122, 188, 157, 281, 26, 216, 197, 272, 311, 9, 163, 174, 22, 27, 2, 137, 276, 156, 284]. Note that most of these methods evaluate not only input events but pairs of input and output (action) events.
In one of the simplest cases, C is just a linear perceptron FNN (instead of an RNN as in the early system [227]). The fact that C has no built-in memory in this case is not a fundamental restriction since M is recurrent, and has been trained to predict not only normal sensory inputs, but also reward signals. That is, the state of M must contain all the historic information relevant to maximize future expected reward, provided the data history so far already contains the relevant experience, and M has learned to compactly extract and represent its regular aspects.
This approach is different from other, previous combinations of traditional RL [272][245, Sec. 6.2] and RNNs [227, 148, 12] which use RNNs only as value function approximators that directly predict cumulative expected reward, instead of trying to predict all sensations time step by time step. The system in the present section separates the hard task of prediction in partially observable environments from the comparatively simple task of RL under the Markovian assumption that the current input to C (which is M’s state) contains all information relevant for achieving the goal.
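For the simplest case of a linear controller, a single Q-learning update on features taken from the model's state might look as follows. This is a textbook-style sketch, not the authors' implementation: the function name, the discrete action set, and the values of `alpha` and `gamma` are assumptions.

```python
def q_learning_update(weights, features, action, reward, next_features,
                      alpha=0.1, gamma=0.9):
    """One Q-learning step for a linear controller.

    weights[a] is the weight vector for discrete action a;
    Q(features, a) is the dot product of weights[a] and features.
    """
    def q(feats, a):
        return sum(w * f for w, f in zip(weights[a], feats))

    best_next = max(q(next_features, a) for a in range(len(weights)))
    td_error = reward + gamma * best_next - q(features, action)
    # Gradient of a linear Q with respect to weights[action] is `features`.
    for i, f in enumerate(features):
        weights[action][i] += alpha * td_error * f
    return td_error
```

Here `features` would be the concatenated vector fed into the controller, so the update evaluates a pair of input and action, as noted above.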
5.2 C as an Evolutionary RL (R)NN whose Inputs are M’s Activations
This approach is essentially the same as the one of Sec. 5.1, except that C is now an FNN or RNN trained by evolutionary algorithms [200, 255, 105, 56, 68] applied to NNs [165, 321, 180, 259, 72, 90, 89, 110, 94], or by policy gradient methods [314, 315, 316, 274, 18, 1, 63, 128, 313, 210, 192, 191, 256, 85, 312, 190, 82, 93][245, Sec. 6.6], or by Compressed NN Search; see Sec. 1. C’s input units receive the same concatenated vector as in Sec. 5.1, and its output units encode the action. At time t, this vector is fed into C, which computes an action; then M computes its next hidden unit activations and predictions; then the action is executed to obtain the next input.
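To make the evolutionary variant concrete, here is a deliberately simple mutation-and-selection hill climber over a controller weight vector. It is far cruder than the cited evolutionary methods; the name `evolve`, the Gaussian mutation, and the hypothetical `fitness` callback (which would run trials and return cumulative reward) are assumptions of this sketch.

```python
import random


def evolve(fitness, dim, generations, offspring, sigma, rng):
    """Greedy evolutionary search: mutate the parent, keep any strictly better child."""
    parent = [0.0] * dim
    best = fitness(parent)
    for _ in range(generations):
        for _ in range(offspring):
            # Gaussian mutation of every weight of the current parent.
            child = [w + rng.gauss(0.0, sigma) for w in parent]
            f = fitness(child)
            if f > best:
                parent, best = child, f
    return parent, best
```

No gradients of the controller are needed; only the scalar fitness of each candidate weight vector is used.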
5.3 C Learns to Think with M: High-Level Plans and Abstractions
Our RNN-based systems of the early 1990s [223, 226] (Sec. 1.3.1) could in principle plan ahead by performing numerous fast mental experiments on a predictive RNN world model, M, instead of time-consuming real experiments, extending earlier work on reactive systems without memory [301, 273]. However, this can work well only in (near-)deterministic environments, and, even there, M would have to simulate many entire alternative futures, time step by time step, to find an action sequence for C that maximizes reward. This method seems very different from the much smarter hierarchical planning methods of humans, who apparently can learn to identify and exploit a few relevant problem-specific abstractions of possible future events; reasoning abstractly, and efficiently ignoring irrelevant spatio-temporal details.
We now describe a system that can in principle learn to plan and reason like this as well, according to the AIT argument (Sec. 2.1). This should be viewed as a main contribution of the present paper. See Figure 1.
Consider an RNN C (with typically rather small feasible search space) as in Sec. 5.2. We add standard and/or multiplicative learnable connections (Sec. 3.1) from some of the units of C to some of the units of the typically huge unsupervised M, and from some of the units of M to some of the units of C. The new connections are said to belong to C. C and M now collectively form a new, combined RNN, with standard activation spreading as in Sec. 3.1. The activations of this combined RNN are initialized to default values at the beginning of each trial. Now the combined RNN is trained on RL tasks in line with step 3 of algorithm 1, using search methods such as those of Sec. 5.2 (compare Sec. 1). The (typically many) connections of M, however, do not change; only the (typically relatively few) connections of C do.
What does that mean? It means that now C’s relatively small candidate programs are given time to “think” by feeding sequences of activations into M, and reading activations out of M, before and while interacting with the environment. Since C and M are general computers, C’s programs may query, edit or invoke subprograms of M in arbitrary, computable ways through the new connections. Given some RL problem, according to the AIT argument (Sec. 2.1), this can greatly accelerate C’s search for a problem-solving weight vector, provided the (time-bounded [147]) mutual algorithmic information between this weight vector and M’s program is high, as is to be expected in many cases since M’s environment-modeling program should reflect many regularities useful not only for prediction and coding, but also for decision making.^{4}
^{4} An alternative way of letting C learn to access the program of M is to add C-owned connections from the weights of M to units of C, treating the current weights of M as additional real-valued inputs to C. This, however, will typically result in a much larger search space for C. There are many other variants of the general scheme described in Sec. 2.2.
This simple but novel approach is much more general than previous computable, but restricted, ways of letting a feedforward C use a model M (Sec. 1.3.1)[301, 273][245, Sec. 6.1], by simulating entire possible futures step by step, then propagating error signals or temporal difference errors backwards. Instead, we give C’s program search an opportunity to discover sophisticated computable ways of exploiting M’s code, such as abstract hierarchical planning and analogy-based reasoning. For example, to represent previous observations, an M implemented as an LSTM network (Sec. 1.2) will develop high-level, abstract, spatio-temporal feature detectors that may be active for thousands of time steps, as long as those memories are useful to predict (and thus compress) future observations [62, 61, 189, 79]. However, C may learn to directly invoke the corresponding “abstract” units in M by inserting appropriate pattern sequences into M. C might then shortcut from there to typical subsequent abstract representations, ignoring the long input sequences normally required to invoke them in M, thus quickly anticipating a few possible positive outcomes to be pursued (plus computable ways of achieving them), or negative outcomes to be avoided.
Note that M (and by extension the combined RNN formed by C and M) does not at all have to be a perfect predictor. For example, it won’t be able to predict noise. Instead M will have learned to approximate conditional expectations of future inputs, given the history so far. A naive way of exploiting M’s probabilistic knowledge would be to plan ahead through naive step-by-step Monte-Carlo simulations of possible predicted futures, to find and execute action sequences that maximize expected reward predicted by those simulations. However, we won’t limit the system to this naive approach. Instead it will be the task of C to learn to address useful problem-specific parts of the current M, and reuse them for problem solving. Of course, C will have to intelligently exploit M, which will cost bits of information (and thus search time for appropriate weight changes of C), but this is often still much cheaper in the AIT sense than learning a good C program from scratch, as in our previous non-RNN AIT-based work on algorithmic transfer learning [238], where self-invented recursive code for previous solutions sped up the search for code for more complex tasks by a factor of 1000.
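For contrast, the naive step-by-step Monte-Carlo planning mentioned above can be sketched as random search over action sequences. The hypothetical callback `model_sample(state, action, rng)` stands in for sampling a predicted next state and reward from the world model; each sequence is scored by a single simulated rollout, which is only a crude estimate of expected reward.

```python
import random


def plan_by_rollouts(model_sample, state, actions, horizon, n_rollouts, rng):
    """Sample random action sequences, simulate each once, return the best one."""
    best_seq, best_return = None, float("-inf")
    for _ in range(n_rollouts):
        seq = [rng.choice(actions) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            s, r = model_sample(s, a, rng)  # one simulated time step
            total += r
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq, best_return
```

The cost grows with the number of rollouts times the horizon, which illustrates why simulating entire futures time step by time step scales poorly compared with learned abstractions.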
Numerous topologies are possible for the adaptive connections from C to M, and back. Although in some applications C may find it hard to exploit M, and might prefer to ignore M (by setting connections to and from M to zero), in some environments under certain topologies, C can greatly profit from M.
While M’s weights are frozen in step 3 of algorithm 1, the weights of C can learn when to make C attend to history information represented by M’s state, and when to ignore such information, and instead use M’s innards in other computable ways. This can be further facilitated by introducing a special unit to C, whose activation multiplies the environmental input before it is fed into M at time t, such that C can easily (by setting this unit’s activation to zero) force M to completely ignore environmental inputs, to use M for “thinking” in other ways.
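The effect of such a gating unit can be illustrated with a toy frozen model. Everything here is a simplifying assumption: `model_step` is a single tanh layer, `injected` stands for activations that the controller writes into the model through its own connections, and `gate` is the activation of the special unit.

```python
import math


def model_step(W, hid, env_input, injected, gate):
    """One update of a frozen toy model; `gate` scales the environmental input."""
    gated = [gate * x for x in env_input]
    x = gated + injected + hid
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def think_then_act(W, hid, env_input, queries, n_inject):
    """Run 'thinking' steps with the gate closed, then one step with it open."""
    for q in queries:
        # gate = 0: the model ignores the environment and sees only the query.
        hid = model_step(W, hid, env_input, q, gate=0.0)
    # gate = 1: the model attends to the environment again.
    return model_step(W, hid, env_input, [0.0] * n_inject, gate=1.0)
```

With the gate at zero, the model's next state does not depend on the environmental input at all, so the controller can use the model's dynamics for internal computation before acting.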
5.4 Incremental / Hierarchical / Multitask Learning of C with M
A variant of the approach in Sec. 5.3 incrementally trains C on a never-ending series of tasks, continually building on solutions to previous problems, instead of learning each new problem from scratch. In principle, this can be done through incremental NN evolution [70], hierarchical NN evolution [306, 285], hierarchical Policy Gradient algorithms [63], or asymptotically optimal ways of algorithmic transfer learning [238].
Given a new task and a C trained on several previous tasks, such hierarchical/incremental methods may freeze the current weights of C, then enlarge C by adding new units and connections which are trained on the new task. This process reduces the size of the search space for the new task, giving the new weights the opportunity to learn to use the frozen parts of C as subprograms.
Incremental variants of Compressed RNN Search [132] (Sec. 1) do not directly search in C’s potentially large weight space, but in the frequency domain by representing the weight matrix as a small set of Fourier-type coefficients. By searching for new coefficients to be added to the already learned set responsible for solving previous problems, C’s weight matrix is fine-tuned incrementally and indirectly (through superpositions). Given a current problem, in AIT-based OOPS style [238], we may impose growing run time limits on programs tested on C, until a solution is found.
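The idea of describing a weight matrix by a few frequency-domain coefficients can be sketched with an unnormalized cosine basis. The specific basis, the dictionary format mapping a frequency index `(u, v)` to an amplitude, and the function name are assumptions of this sketch and need not match the cited method.

```python
import math


def weights_from_coefficients(coeffs, rows, cols):
    """Build a rows x cols weight matrix as a superposition of cosine patterns.

    coeffs maps a frequency index (u, v) to an amplitude.
    """
    W = [[0.0] * cols for _ in range(rows)]
    for (u, v), amp in coeffs.items():
        for i in range(rows):
            for j in range(cols):
                W[i][j] += (amp
                            * math.cos(math.pi * (i + 0.5) * u / rows)
                            * math.cos(math.pi * (j + 0.5) * v / cols))
    return W
```

Because the mapping is linear, adding a new coefficient to an already learned set changes the whole matrix by superposition, so a search over a handful of coefficients indirectly adjusts many weights at once.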
6 Exploration: Rewarding C for Experiments that Improve M
Humans, even as infants, invent their own tasks in a curious and creative fashion, continually increasing their problem solving repertoire even without an external reward or teacher. They seem to get intrinsic reward for creating experiments leading to observations that obey a previously unknown law that allows for better compression of the observations—corresponding to the discovery of a temporarily interesting, subjectively novel regularity [224, 239, 241] (compare also [261, 184]).
For example, a video of 100 falling apples can be greatly compressed via predictive coding once the law of gravity is discovered. Likewise, the video-like image sequence perceived while moving through an office can be greatly compressed by constructing an internal 3D model of the office space [243]. The 3D model allows for recomputing the entire high-resolution video from a compact sequence of very low-dimensional eye coordinates and eye directions. The model itself can be specified by far fewer bits of information than needed to store the raw pixel data of a long video. Even if the 3D model is not precise, only relatively few extra bits will be required to encode the observed deviations from the predictions of the model.
Even mirror neurons [129] are easily explained as by-products of history compression as in Sec. 3 and 4. They fire both when an animal acts and when the animal observes the same action performed by another. Due to mutual algorithmic information shared by perceptions of similar actions performed by various animals, efficient RNN-based predictive coding (Sec. 3, 4) profits from using the same feature detectors (neurons) to encode the shared information, thus saving storage space.
Given the C-M combinations of Sec. 5, we motivate C to become an efficient explorer and an artificial scientist, by adding to its standard external reward (or fitness) for solving user-given tasks another intrinsic reward for generating novel action sequences (experiments) that allow M to improve its compression performance on the resulting data [239, 241].
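In its simplest form, such an intrinsic reward is the number of bits saved by retraining the model on the new data. The sketch below assumes that `bits_before` and `bits_after` are the model's coding costs on the same data before and after retraining, and that the two rewards are combined with a hypothetical mixing weight `weight`.

```python
def intrinsic_reward(bits_before, bits_after):
    """Compression progress: bits saved on the same data by retraining the model."""
    return bits_before - bits_after


def total_reward(external, bits_before, bits_after, weight=1.0):
    """External task reward plus weighted intrinsic reward for compression progress."""
    return external + weight * intrinsic_reward(bits_before, bits_after)
```

Data that the model already compresses well, and data that it cannot learn to compress (such as noise), both yield little or no progress and hence little intrinsic reward under this measure.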
At first glance, repeatedly evaluating M’s compression performance on the entire history seems impractical. A heuristic to overcome this is to focus on M’s improvements on the most recent trial, while regularly retraining M on randomly selected previous trials, to avoid catastrophic forgetting.
A related problem is that C’s incremental program search may find it difficult to identify (and assign credit to) those parts of C responsible for improvements of a huge, black box-like, monolithic M. But we can implement M as a self-modularizing, computation cost-minimizing, winner-take-all RNN [221, 242, 267]. Then it is possible to keep track of which parts of M are used to encode which parts of the history. That is, to evaluate weight changes of M, only the affected parts of the stored history have to be retested [243]. Then C’s search can be facilitated by tracking which parts of C affected those parts of M. By penalizing C’s programs for the time consumed by such tests, the search for C is biased to prefer programs that conduct experiments causing data yielding quickly verifiable compression progress of M. That is, the program search will prefer to change weights of M that are not used to compress large parts of the history that are expensive to verify [242, 243]. The first implementations of this simple principle were described in our work on the PowerPlay framework [243, 267], which incrementally searches the space of possible pairs of new tasks and modifications of the current program, until it finds a more powerful program that, unlike the unmodified program, solves all previously learned tasks plus the new one, or simplifies/compresses/speeds up previous solutions, without forgetting any. Under certain conditions this can accelerate the acquisition of external reward specified by user-defined tasks.
7 Conclusion
We introduced novel combinations of a reinforcement learning (RL) controller, C, and an RNN-based predictive world model, M. The most general systems implement principles of algorithmic [263, 130, 147] as opposed to traditional [24, 257] information theory. Here both C and M are RNNs or RNN-like systems. M is actively exploited in arbitrary computable ways by C, whose program search space is typically much smaller, and which may learn to selectively probe and reuse M’s internal programs to plan and reason. The basic principles are not limited to RL, but apply to all kinds of active algorithmic transfer learning from one RNN to another. By combining gradient-based RNNs and RL RNNs, we create a qualitatively new type of self-improving, general-purpose, connectionist control architecture. This RNNAI may continually build upon previously acquired problem-solving procedures, some of them self-invented in a way that resembles a scientist’s search for novel data with unknown regularities, preferring still-unsolved but quickly learnable tasks over others.
References
 [1] D. Aberdeen. Policy-Gradient Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Australian National University, 2003.
 [2] J. Abounadi, D. Bertsekas, and V. S. Borkar. Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization, 40(3):681–698, 2002.
 [3] I. Aizenberg, N. N. Aizenberg, and J. Vandewalle. Multi-Valued and Universal Binary Neurons: Theory, Learning and Applications. Springer Science & Business Media, 2000.
 [4] H. Akaike. Statistical predictor identification. Ann. Inst. Statist. Math., 22:203–217, 1970.
 [5] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
 [6] A. Allender. Application of time-bounded Kolmogorov complexity in complexity theory. In O. Watanabe, editor, Kolmogorov complexity and computational complexity, pages 6–22. EATCS Monographs on Theoretical Computer Science, Springer, 1992.
 [7] S. Amari and N. Murata. Statistical theory of learning curves under entropic loss criterion. Neural Computation, 5(1):140–153, 1993.
 [8] T. Ash. Dynamic node creation in backpropagation neural networks. Connection Science, 1(4):365–375, 1989.
 [9] L. Baird and A. W. Moore. Gradient descent for general reinforcement learning. In Advances in neural information processing systems 12 (NIPS), pages 968–974. MIT Press, 1999.
 [10] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, pages 30–37, 1995.
 [11] B. Bakker and J. Schmidhuber. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In F. G. et al., editor, Proc. 8th Conference on Intelligent Autonomous Systems IAS-8, pages 438–445, Amsterdam, NL, 2004. IOS Press.
 [12] B. Bakker, V. Zhumatiy, G. Gruener, and J. Schmidhuber. A robot that reinforcement-learns to identify and memorize important previous observations. In Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2003, pages 430–435, 2003.
 [13] D. H. Ballard. Modular learning in neural networks. In Proc. AAAI, pages 279–284, 1987.
 [14] H. B. Barlow, T. P. Kaushal, and G. J. Mitchison. Finding minimum entropy codes. Neural Computation, 1(3):412–423, 1989.
 [15] A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
 [16] A. G. Barto, S. Singh, and N. Chentanez. Intrinsically motivated learning of hierarchical collections of skills. In Proceedings of International Conference on Developmental Learning (ICDL), pages 112–119. MIT Press, Cambridge, MA, 2004.
 [17] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834–846, 1983.
 [18] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. J. Artif. Int. Res., 15(1):319–350, 2001.
 [19] S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of Lecture Notes in Computer Science. Springer, 2003.
 [20] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1st edition, 1957.
 [21] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828, 2013.
 [22] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2001.
 [23] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 [24] L. Boltzmann. In F. Hasenöhrl, editor, Wissenschaftliche Abhandlungen (collection of Boltzmann’s articles in scientific journals). Barth, Leipzig, 1909.
 [25] C. Boutilier and D. Poole. Computing optimal policies for partially observable Markov decision processes using compact representations. In Proceedings of the AAAI, Portland, OR, 1996.
 [26] S. J. Bradtke, A. G. Barto, and L. P. Kaelbling. Linear least-squares algorithms for temporal difference learning. In Machine Learning, pages 22–33, 1996.
 [27] R. I. Brafman and M. Tennenholtz. R-MAX – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.
 [28] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
 [29] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
 [30] A. E. Bryson. A gradient method for optimizing multistage allocation processes. In Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.
 [31] N. Burgess. A constructive algorithm that converges for realvalued input patterns. International Journal of Neural Systems, 5(1):59–66, 1994.
 [32] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
 [33] G. J. Chaitin. On the length of programs for computing finite binary sequences. Journal of the ACM, 13:547–569, 1966.
 [34] K. Chellapilla, S. Puri, and P. Simard. High performance convolutional neural networks for document processing. In International Workshop on Frontiers in Handwriting Recognition, 2006.
 [35] A. Church. An unsolvable problem of elementary number theory. American Journal of Mathematics, 58:345–363, 1936.
 [36] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in Neural Information Processing Systems (NIPS), pages 2852–2860, 2012.
 [37] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Mitosis detection in breast cancer histology images with deep neural networks. In Proc. MICCAI, volume 2, pages 411–418, 2013.
 [38] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.
 [39] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Convolutional neural network committees for handwritten character classification. In 11th International Conference on Document Analysis and Recognition (ICDAR), pages 1250–1254, 2011.
 [40] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Intl. Joint Conference on Artificial Intelligence IJCAI, pages 1237–1242, 2011.
 [41] D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. A committee of neural networks for traffic sign classification. In International Joint Conference on Neural Networks (IJCNN), pages 1918–1921, 2011.
 [42] D. C. Ciresan, U. Meier, and J. Schmidhuber. Multicolumn deep neural networks for image classification. In IEEE Conference on Computer Vision and Pattern Recognition CVPR 2012, 2012. Long preprint arXiv:1202.2745v1 [cs.CV].
 [43] D. C. Ciresan, U. Meier, and J. Schmidhuber. Transfer learning for Latin and Chinese characters with deep neural networks. In International Joint Conference on Neural Networks (IJCNN), pages 1301–1306, 2012.
 [44] D. T. Cliff, P. Husbands, and I. Harvey. Evolving recurrent dynamical networks for robot control. In Artificial Neural Nets and Genetic Algorithms, pages 428–435. Springer, 1993.
 [45] A. Cichocki and R. Unbehauen. Neural networks for optimization and signal processing. John Wiley & Sons, Inc., 1993.
 [46] G. Cuccu, M. Luciw, J. Schmidhuber, and F. Gomez. Intrinsically motivated evolutionary search for vision-based reinforcement learning. In Proceedings of the 2011 IEEE Conference on Development and Learning and Epigenetic Robotics IEEE-ICDL-EPIROB, volume 2, pages 1–7. IEEE, 2011.
 [47] S. Das, C. Giles, and G. Sun. Learning context-free grammars: Capabilities and limitations of a neural network with an external stack memory. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, Bloomington, 1992.
 [48] P. Dayan and G. Hinton. Feudal reinforcement learning. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems (NIPS) 5, pages 271–278. Morgan Kaufmann, 1993.
 [49] R. Dechter. Learning while searching in constraint-satisfaction problems. In Proceedings of AAAI-86, 1986.
 [50] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res. (JAIR), 13:227–303, 2000.
 [51] K. Doya, K. Samejima, K.-i. Katagiri, and M. Kawato. Multiple model-based reinforcement learning. Neural Computation, 14(6):1347–1369, 2002.
 [52] S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1):30–45, 1962.
 [53] S. E. Fahlman. The recurrent cascade-correlation learning algorithm. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems (NIPS) 3, pages 190–196. Morgan Kaufmann, 1991.
 [54] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014.
 [55] R. Fernandez, A. Rendel, B. Ramabhadran, and R. Hoory. Prosody contour prediction with Long Short-Term Memory, bidirectional, deep recurrent neural networks. In Proc. Interspeech, 2014.
 [56] L. Fogel, A. Owens, and M. Walsh. Artificial Intelligence through Simulated Evolution. Wiley, New York, 1966.
 [57] B. Fritzke. A growing neural gas network learns topologies. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, NIPS, pages 625–632. MIT Press, 1994.
 [58] K. Fukushima. Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron. Trans. IECE, J62-A(10):658–665, 1979.
 [59] S. I. Gallant. Connectionist expert systems. Communications of the ACM, 31(2):152–169, 1988.
 [60] S. Ge, C. C. Hang, T. H. Lee, and T. Zhang. Stable adaptive neural network control. Springer, 2010.
 [61] F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
 [62] F. A. Gers, N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115–143, 2002.
 [63] M. Ghavamzadeh and S. Mahadevan. Hierarchical policy gradient algorithms. In Proceedings of the Twentieth Conference on Machine Learning (ICML-2003), pages 226–233, 2003.
 [64] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Technical Report arxiv.org/abs/1311.2524, UC Berkeley and ICSI, 2013.
 [65] L. Gisslen, M. Luciw, V. Graziano, and J. Schmidhuber. Sequential constant size compressor for reinforcement learning. In Proc. Fourth Conference on Artificial General Intelligence (AGI), Google, Mountain View, CA, pages 31–40. Springer, 2011.
 [66] T. Glasmachers, T. Schaul, Y. Sun, D. Wierstra, and J. Schmidhuber. Exponential natural evolution strategies. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 393–400. ACM, 2010.
 [67] K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38:173–198, 1931.
 [68] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, 1989.
 [69] F. J. Gomez. Robust Nonlinear Control through Neuroevolution. PhD thesis, Department of Computer Sciences, University of Texas at Austin, 2003.
 [70] F. J. Gomez and R. Miikkulainen. Incremental evolution of complex general behavior. Adaptive Behavior, 5:317–342, 1997.
 [71] F. J. Gomez and R. Miikkulainen. Active guidance for a finless rocket using neuroevolution. In Proc. GECCO 2003, Chicago, 2003.
 [72] F. J. Gomez, J. Schmidhuber, and R. Miikkulainen. Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, 9(May):937–965, 2008.
 [73] H. Gomi and M. Kawato. Neural network control for a closed-loop system using feedback-error-learning. Neural Networks, 6(7):933–946, 1993.
 [74] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082 v4, 2014.
 [75] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In International Conference on Machine Learning (ICML), 2013.
 [76] A. Graves, S. Fernandez, F. J. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. In ICML’06: Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, 2006.
 [77] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for improved unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 2009.
 [78] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
 [79] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems (NIPS) 21, pages 545–552. MIT Press, Cambridge, MA, 2009.
 [80] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. Preprint arXiv:1410.5401, 2014.
 [81] M. Graziano. The Intelligent Movement Machine: An Ethological Perspective on the Primate Motor System. Oxford University Press, USA, 2009.
 [82] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska. A survey of actor-critic reinforcement learning: Standard and natural policy gradients. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 42(6):1291–1307, Nov 2012.
 [83] F. Gruau, D. Whitley, and L. Pyeatt. A comparison between cellular encoding and direct encoding for genetic neural networks. NeuroCOLT Technical Report NC-TR-96-048, ESPRIT Working Group in Neural and Computational Learning, NeuroCOLT 8556, 1996.
 [84] P. D. Grünwald, I. J. Myung, and M. A. Pitt. Advances in minimum description length: Theory and applications. MIT Press, 2005.
 [85] M. Grüttner, F. Sehnke, T. Schaul, and J. Schmidhuber. Multi-Dimensional Deep Memory Atari-Go Players for Parameter Exploring Policy Gradients. In Proceedings of the International Conference on Artificial Neural Networks ICANN, pages 114–123. Springer, 2010.
 [86] X. Guo, S. Singh, H. Lee, R. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems 27 (NIPS). 2014.
 [87] I. Guyon, V. Vapnik, B. Boser, L. Bottou, and S. A. Solla. Structural risk minimization for character recognition. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems (NIPS) 4, pages 471–479. Morgan Kaufmann, 1992.
 [88] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng. DeepSpeech: Scaling up end-to-end speech recognition. Preprint arXiv:1412.5567, 2014.
 [89] N. Hansen, S. D. Müller, and P. Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1):1–18, 2003.
 [90] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
 [91] S. J. Hanson and L. Y. Pratt. Comparing biases for minimal network construction with backpropagation. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems (NIPS) 1, pages 177–185. San Mateo, CA: Morgan Kaufmann, 1989.
 [92] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 5, pages 164–171. Morgan Kaufmann, 1993.
 [93] N. Heess, D. Silver, and Y. W. Teh. Actor-critic reinforcement learning with energy-based policies. In Proc. European Workshop on Reinforcement Learning, pages 43–57, 2012.
 [94] V. Heidrich-Meisner and C. Igel. Neuroevolution strategies for episodic reinforcement learning. Journal of Algorithms, 64(4):152–168, 2009.
 [95] J. Hertz, A. Krogh, and R. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
 [96] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 [97] G. E. Hinton and D. van Camp. Keeping neural networks simple. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 11–18. Springer, 1993.
 [98] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. Advisor: J. Schmidhuber.
 [99] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
 [100] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
 [101] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997. Based on TR FKI-207-95, TUM (1995).
 [102] S. Hochreiter and J. Schmidhuber. Feature extraction through LOCOCODE. Neural Computation, 11(3):679–714, 1999.
 [103] S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN2001), pages 87–94. Springer: Berlin, Heidelberg, 2001.
 [104] S. B. Holden. On the Theory of Generalization and SelfStructuring in Linearly Weighted Connectionist Networks. PhD thesis, Cambridge University, Engineering Department, 1994.
 [105] J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975.
 [106] V. Honavar and L. Uhr. Generative learning structures and processes for generalized connectionist networks. Information Sciences, 70(1):75–108, 1993.
 [107] V. Honavar and L. M. Uhr. A network of neuron-like units that learns to perceive by generation as well as reweighting of its links. In D. Touretzky, G. E. Hinton, and T. Sejnowski, editors, Proc. of the 1988 Connectionist Models Summer School, pages 472–484, San Mateo, 1988. Morgan Kaufmann.
 [108] D. A. Huffman. A method for construction of minimum-redundancy codes. Proceedings IRE, 40:1098–1101, 1952.
 [109] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005. (On J. Schmidhuber’s SNF grant 2061847).
 [110] C. Igel. Neuroevolution for reinforcement learning using evolution strategies. In R. Reynolds, H. Abbass, K. C. Tan, B. Mckay, D. Essam, and T. Gedeon, editors, Congress on Evolutionary Computation (CEC 2003), volume 4, pages 2588–2595. IEEE, 2003.
 [111] A. G. Ivakhnenko. The group method of data handling – a rival of the method of stochastic approximation. Soviet Automatic Control, 13(3):43–55, 1968.
 [112] A. G. Ivakhnenko. Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364–378, 1971.
 [113] A. G. Ivakhnenko and V. G. Lapa. Cybernetic Predicting Devices. CCM Information Corporation, 1965.
 [114] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems (NIPS) 7, pages 345–352. MIT Press, 1995.
 [115] C. Jacob, A. Lindenmayer, and G. Rozenberg. Genetic L-System Programming. In Parallel Problem Solving from Nature III, Lecture Notes in Computer Science. Springer-Verlag, 1994.
 [116] H. Jaeger. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304:78–80, 2004.
 [117] J. Jameson. Delayed reinforcement learning with multiple time scale hierarchical backpropagated adaptive critics. In Neural Networks for Control. 1991.
 [118] S. R. Jodogne and J. H. Piater. Closed-loop learning of visual control policies. J. Artificial Intelligence Research, 28:349–391, 2007.

 [119] M. I. Jordan. Supervised learning and systems with excess degrees of freedom. Technical Report COINS TR 8827, Massachusetts Institute of Technology, 1988.
 [120] M. I. Jordan and D. E. Rumelhart. Supervised learning with a distal teacher. Technical Report Occasional Paper #40, Center for Cog. Sci., Massachusetts Institute of Technology, 1990.

 [121] C.-F. Juang. A hybrid of genetic algorithm and particle swarm optimization for recurrent network design. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 34(2):997–1006, 2004.
 [122] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Technical report, Brown University, Providence RI, 1995.
 [123] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: a survey. Journal of AI research, 4:237–285, 1996.
 [124] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
 [125] H. J. Kelley. Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954, 1960.
 [126] H. Kimura, K. Miyazaki, and S. Kobayashi. Reinforcement learning in POMDPs with function approximation. In ICML, volume 97, pages 152–160, 1997.
 [127] H. Kitano. Designing neural networks using genetic algorithms with graph generation system. Complex Systems, 4:461–476, 1990.
 [128] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, volume 3, pages 2619–2624. IEEE, 2004.
 [129] E. Kohler, C. Keysers, M. A. Umilta, L. Fogassi, V. Gallese, and G. Rizzolatti. Hearing sounds, understanding actions: action representation in mirror neurons. Science, 297(5582):846–848, 2002.
 [130] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1:1–11, 1965.
 [131] V. R. Kompella, M. D. Luciw, and J. Schmidhuber. Incremental slow feature analysis: Adaptive low-complexity slow feature updating from high-dimensional input streams. Neural Computation, 24(11):2994–3024, 2012.
 [132] J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez. Evolving large-scale neural networks for vision-based reinforcement learning. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pages 1061–1068, Amsterdam, July 2013. ACM.
 [133] J. Koutník, K. Greff, F. Gomez, and J. Schmidhuber. A Clockwork RNN. In Proceedings of the 31st International Conference on Machine Learning (ICML), volume 32, pages 1845–1853, 2014. arXiv:1402.3511 [cs.NE].
 [134] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS 2012), page 4, 2012.
 [135] A. Krogh and J. A. Hertz. A simple weight decay can improve generalization. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 4, pages 950–957. Morgan Kaufmann, 1992.
 [136] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, pages 79–86, 1951.
 [137] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. JMLR, 4:1107–1149, Dec 2003.
 [138] S. Lange and M. Riedmiller. Deep autoencoder neural networks in reinforcement learning. In Neural Networks (IJCNN), The 2010 International Joint Conference on, pages 1–8, July 2010.
 [139] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015. Critique by JS under http://www.idsia.ch/~juergen/deeplearningconspiracy.html.
 [140] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
 [141] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan Kaufmann, 1990.
 [142] R. Legenstein, N. Wilbert, and L. Wiskott. Reinforcement learning on slow features of high-dimensional input streams. PLoS Computational Biology, 6(8), 2010.
 [143] A. U. Levin, T. K. Leen, and J. E. Moody. Fast pruning using principal components. In Advances in Neural Information Processing Systems 6, page 35. Morgan Kaufmann, 1994.
 [144] A. U. Levin and K. S. Narendra. Control of nonlinear dynamical systems using neural networks. ii. observability, identification, and control. IEEE Transactions on Neural Networks, 7(1):30–42, 1995.
 [145] L. A. Levin. On the notion of a random sequence. Soviet Math. Dokl., 14(5):1413–1416, 1973.
 [146] L. A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265–266, 1973.
 [147] M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and its Applications (2nd edition). Springer, 1997.
 [148] L. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, Pittsburgh, January 1993.
 [149] L.-J. Lin. Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence - Volume 2, AAAI’91, pages 781–786. AAAI Press, 1991.
 [150] L.-J. Lin and T. M. Mitchell. Memory approaches to reinforcement learning in non-Markovian domains. Technical Report CMU-CS-92-138, School of Computer Science, Carnegie Mellon University, 1992.
 [151] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki, 1970.
 [152] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: Scaling up. In A. Prieditis and S. Russell, editors, Machine Learning: Proceedings of the Twelfth International Conference, pages 362–370. Morgan Kaufmann Publishers, San Francisco, CA, 1995.
 [153] L. Ljung. System identification. Springer, 1998.
 [154] M. Luciw, V. R. Kompella, S. Kazerounian, and J. Schmidhuber. An intrinsic value system for developing multiple invariant representations with incremental slowness learning. Frontiers in Neurorobotics, 7(9), 2013.
 [155] D. J. C. MacKay. A practical Bayesian framework for backprop networks. Neural Computation, 4:448–472, 1992.
 [156] H. R. Maei and R. S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, volume 1, pages 91–96, 2010.
 [157] S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22:159, 1996.
 [158] J. Martens and I. Sutskever. Learning recurrent neural networks with Hessianfree optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1033–1040, 2011.
 [159] H. Mayer, F. Gomez, D. Wierstra, I. Nagy, A. Knoll, and J. Schmidhuber. A system for robotic heart surgery that learns to tie knots using recurrent neural networks. Advanced Robotics, 22(13-14):1521–1537, 2008.
 [160] R. A. McCallum. Learning to use selective attention and shortterm memory in sequential tasks. In P. Maes, M. Mataric, J.A. Meyer, J. Pollack, and S. W. Wilson, editors, From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, Cambridge, MA, pages 315–324. MIT Press, Bradford Books, 1996.
 [161] U. Meier, D. C. Ciresan, L. M. Gambardella, and J. Schmidhuber. Better digit recognition with a committee of simple neural nets. In 11th International Conference on Document Analysis and Recognition (ICDAR), pages 1135–1139, 2011.
 [162] I. Menache, S. Mannor, and N. Shimkin. Qcut – dynamic discovery of subgoals in reinforcement learning. In Proc. ECML’02, pages 295–306, 2002.
 [163] N. Meuleau, L. Peshkin, K. E. Kim, and L. P. Kaelbling. Learning finite state controllers for partially observable environments. In 15th International Conference of Uncertainty in AI, pages 427–436, 1999.
 [164] O. Miglino, H. Lund, and S. Nolfi. Evolving mobile robots in simulated and real environments. Artificial Life, 2(4):417–434, 1995.
 [165] G. Miller, P. Todd, and S. Hegde. Designing neural networks using genetic algorithms. In Proceedings of the 3rd International Conference on Genetic Algorithms, pages 379–384. Morgan Kaufmann, 1989.
 [166] W. T. Miller, P. J. Werbos, and R. S. Sutton. Neural networks for control. MIT Press, 1995.
 [167] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. Based on TR arXiv:1312.5602 (2013); critique by JS under http://www.idsia.ch/~juergen/naturedeepmind.html.
 [168] J. E. Moody. Fast learning in multiresolution hierarchies. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems (NIPS) 1, pages 29–39. Morgan Kaufmann, 1989.
 [169] J. E. Moody. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems (NIPS) 4, pages 847–854. Morgan Kaufmann, 1992.
 [170] J. E. Moody and J. Utans. Architecture selection strategies for neural networks: Application to corporate bond rating prediction. In A. N. Refenes, editor, Neural Networks in the Capital Markets. John Wiley & Sons, 1994.
 [171] A. Moore and C. Atkeson. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21(3):199–233, 1995.
 [172] A. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103–130, 1993.
 [173] D. E. Moriarty. Symbiotic Evolution of Neural Networks in Sequential Decision Tasks. PhD thesis, Department of Computer Sciences, The University of Texas at Austin, 1997.
 [174] J. Morimoto and K. Doya. Robust reinforcement learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems (NIPS) 13, pages 1061–1067. MIT Press, 2000.
 [175] M. C. Mozer and S. Das. A connectionist symbol manipulator that discovers the structure of contextfree languages. Advances in Neural Information Processing Systems (NIPS), pages 863–863, 1993.
 [176] M. C. Mozer and P. Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems (NIPS) 1, pages 107–115. Morgan Kaufmann, 1989.
 [177] P. W. Munro. A dual backpropagation scheme for scalar reinforcement learning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165–176, 1987.
 [178] K. S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. Neural Networks, IEEE Transactions on, 1(1):4–27, 1990.
 [179] N. Nguyen and B. Widrow. The truck backerupper: An example of self learning in neural networks. In Proceedings of the International Joint Conference on Neural Networks, pages 357–363. IEEE Press, 1989.
 [180] S. Nolfi, D. Floreano, O. Miglino, and F. Mondada. How to evolve autonomous robots: Different approaches in evolutionary robotics. In R. A. Brooks and P. Maes, editors, Fourth International Workshop on the Synthesis and Simulation of Living Systems (Artificial Life IV), pages 190–197. MIT, 1994.
 [181] K.S. Oh and K. Jung. GPU implementation of neural networks. Pattern Recognition, 37(6):1311–1314, 2004.
 [182] J. R. Olsson. Inductive functional programming using incremental program transformation. Artificial Intelligence, 74(1):55–83, 1995.
 [183] M. Otsuka, J. Yoshimoto, and K. Doya. Free-energy-based reinforcement learning in a partially observable environment. In Proc. ESANN, 2010.
 [184] P.Y. Oudeyer, A. Baranes, and F. Kaplan. Intrinsically motivated learning of real world sensorimotor skills with developmental constraints. In G. Baldassarre and M. Mirolli, editors, Intrinsically Motivated Learning in Natural and Artificial Systems. Springer, 2013.
 [185] R. Parekh, J. Yang, and V. Honavar. Constructive neural network learning algorithms for multicategory pattern classification. IEEE Transactions on Neural Networks, 11(2):436–451, 2000.
 [186] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML’13: JMLR: W&CP volume 28, 2013.
 [187] F. Pasemann, U. Steinmetz, and U. Dieckmann. Evolving structure and function of neurocontrollers. In P. J. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, and A. Zalzala, editors, Proceedings of the Congress on Evolutionary Computation, volume 3, pages 1973–1978, Mayflower Hotel, Washington D.C., USA, 6-9 July 1999. IEEE Press.
 [188] J. Peng and R. J. Williams. Incremental multi-step Q-learning. Machine Learning, 22:283–290, 1996.
 [189] J. A. Pérez-Ortiz, F. A. Gers, D. Eck, and J. Schmidhuber. Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets. Neural Networks, 16:241–250, 2003.
 [190] J. Peters. Policy gradient methods. Scholarpedia, 5(11):3698, 2010.
 [191] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71:1180–1190, March 2008.
 [192] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

 [193] J. B. Pollack. Implications of recursive distributed representations. In Proc. NIPS, pages 527–536, 1988.
 [194] E. L. Post. Finite combinatory processes - formulation 1. The Journal of Symbolic Logic, 1(3):103–105, 1936.
 [195] D. Precup, R. S. Sutton, and S. Singh. Multitime models for temporally abstract planning. In Advances in Neural Information Processing Systems (NIPS), pages 1050–1056. Morgan Kaufmann, 1998.
 [196] D. Prokhorov, G. Puskorius, and L. Feldkamp. Dynamical neural networks for control. In J. Kolen and S. Kremer, editors, A field guide to dynamical recurrent networks, pages 23–78. IEEE Press, 2001.
 [197] D. Prokhorov and D. Wunsch. Adaptive critic design. IEEE Transactions on Neural Networks, 8(5):997–1007, 1997.
 [198] R. Raina, A. Madhavan, and A. Ng. Largescale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 873–880. ACM, 2009.
 [199] M. A. Ranzato, F. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern Recognition Conference (CVPR’07), pages 1–8. IEEE Press, 2007.
 [200] I. Rechenberg. Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Dissertation, 1971. Published 1973 by Frommann-Holzboog.
 [201] M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Proc. ECML-2005, pages 317–328. Springer-Verlag Berlin Heidelberg, 2005.
 [202] M. Riedmiller, S. Lange, and A. Voigtlaender. Autonomous reinforcement learning on raw visual input data in a real world application. In International Joint Conference on Neural Networks (IJCNN), pages 1–8, Brisbane, Australia, 2012.
 [203] M. Ring, T. Schaul, and J. Schmidhuber. The two-dimensional organization of behavior. In Proceedings of the First Joint Conference on Development Learning and on Epigenetic Robotics ICDL-EPIROB, Frankfurt, August 2011.
 [204] M. B. Ring. Incremental development of complex behaviors through automatic construction of sensory-motor hierarchies. In L. Birnbaum and G. Collins, editors, Machine Learning: Proceedings of the Eighth International Workshop, pages 343–347. Morgan Kaufmann, 1991.
 [205] M. B. Ring. Learning sequential tasks by incrementally adding higher orders. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 115–122. Morgan Kaufmann, 1993.
 [206] M. B. Ring. Continual Learning in Reinforcement Environments. PhD thesis, University of Texas at Austin, Austin, Texas 78712, August 1994.
 [207] J. Rissanen. Stochastic complexity and modeling. The Annals of Statistics, 14(3):1080–1100, 1986.
 [208] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.
 [209] T. Robinson and F. Fallside. Dynamic reinforcement driven error propagation networks with application to game playing. In Proceedings of the 11th Conference of the Cognitive Science Society, Ann Arbor, pages 836–843, 1989.
 [210] T. Rückstieß, M. Felder, and J. Schmidhuber. State-Dependent Exploration for policy gradient methods. In W. D. et al., editor, European Conference on Machine Learning (ECML) and Principles and Practice of Knowledge Discovery in Databases 2008, Part II, LNAI 5212, pages 234–249, 2008.
 [211] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986.
 [212] G. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University, UK, 1994.
 [213] H. Sak, A. Senior, and F. Beaufays. Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling. In Proc. Interspeech, 2014.
 [214] H. Sak, A. Senior, K. Rao, F. Beaufays, and J. Schalkwyk. Google Voice search: faster and more accurate. In Google Research Blog http://googleresearch.blogspot.ch/2015/09/googlevoicesearchfasterandmore.html, 2015.
 [215] K. Samejima, K. Doya, and M. Kawato. Intermodule credit assignment in modular reinforcement learning. Neural Networks, 16(7):985–994, 2003.
 [216] J. C. Santamaría, R. S. Sutton, and A. Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2):163–217, 1997.
 [217] A. M. Schäfer, S. Udluft, and H.G. Zimmermann. Learning long term dependencies with recurrent neural networks. In S. D. Kollias, A. Stafylopatis, W. Duch, and E. Oja, editors, ICANN (1), volume 4131 of Lecture Notes in Computer Science, pages 71–80. Springer, 2006.
 [218] R. E. Schapire. The strength of weak learnability. Machine Learning, 5:197–227, 1990.
 [219] T. Schaul and J. Schmidhuber. Metalearning. Scholarpedia, 6(5):4650, 2010.
 [220] D. Scherer, A. Müller, and S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. International Conference on Artificial Neural Networks (ICANN), pages 92–101, 2010.
 [221] J. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403–412, 1989.
 [222] J. Schmidhuber. Learning algorithms for networks with internal and external feedback. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, editors, Proc. of the 1990 Connectionist Models Summer School, pages 52–61. Morgan Kaufmann, 1990.
 [223] J. Schmidhuber. An online algorithm for dynamic reinforcement learning and planning in reactive environments. In Proc. IEEE/INNS International Joint Conference on Neural Networks, San Diego, volume 2, pages 253–258, 1990.
 [224] J. Schmidhuber. Curious model-building control systems. In Proceedings of the International Joint Conference on Neural Networks, Singapore, volume 2, pages 1458–1463. IEEE press, 1991.
 [225] J. Schmidhuber. Learning to generate subgoals for action sequences. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 967–972. Elsevier Science Publishers B.V., NorthHolland, 1991.
 [226] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222–227. MIT Press/Bradford Books, 1991.
 [227] J. Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3 (NIPS 3), pages 500–506. Morgan Kaufmann, 1991.
 [228] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992. (Based on TR FKI-148-91, TUM, 1991).
 [229] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131–139, 1992.

 [230] J. Schmidhuber. Netzwerkarchitekturen, Zielfunktionen und Kettenregel. (Network architectures, objective functions, and chain rule.) Habilitation Thesis, Inst. f. Inf., Tech. Univ. Munich, 1993.
 [231] J. Schmidhuber. On decreasing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460–463. Springer, 1993.
 [232] J. Schmidhuber. A self-referential weight matrix. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 446–451. Springer, 1993.
 [233] J. Schmidhuber. On learning how to learn learning strategies. Technical Report FKI-198-94, Fakultät für Informatik, Technische Universität München, 1994. See [252, 251].
 [234] J. Schmidhuber. Discovering solutions with low Kolmogorov complexity and high generalization capability. In A. Prieditis and S. Russell, editors, Machine Learning: Proceedings of the Twelfth International Conference, pages 488–496. Morgan Kaufmann Publishers, San Francisco, CA, 1995.
 [235] J. Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.
 [236] J. Schmidhuber. Hierarchies of generalized Kolmogorov complexities and non-enumerable universal measures computable in the limit. International Journal of Foundations of Computer Science, 13(4):587–612, 2002.

 [237] J. Schmidhuber. The Speed Prior: a new simplicity measure yielding near-optimal computable predictions. In J. Kivinen and R. H. Sloan, editors, Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Lecture Notes in Artificial Intelligence, pages 216–228. Springer, Sydney, Australia, 2002.
 [238] J. Schmidhuber. Optimal ordered problem solver. Machine Learning, 54:211–254, 2004.
 [239] J. Schmidhuber. Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2):173–187, 2006.
 [240] J. Schmidhuber. Simple algorithmic theory of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. SICE Journal of the Society of Instrument and Control Engineers, 48(1):21–32, 2009.
 [241] J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
 [242] J. Schmidhuber. Self-delimiting neural networks. Technical Report IDSIA-08-12, arXiv:1210.0118v1 [cs.NE], The Swiss AI Lab IDSIA, 2012.
 [243] J. Schmidhuber. PowerPlay: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem. Frontiers in Psychology, 2013.
 [244] J. Schmidhuber. Deep Learning. Scholarpedia, 10(11):32832, 2015.
 [245] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. Published online 2014; 888 references; based on TR arXiv:1404.7828 [cs.NE].
 [246] J. Schmidhuber and B. Bakker. NIPS 2003 RNNaissance workshop on recurrent neural networks, Whistler, CA, 2003. http://www.idsia.ch/~juergen/rnnaissance.html.
 [247] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, and A. Graves. On fast deep nets for AGI vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI), Google, Mountain View, CA, pages 243–246, 2011.
 [248] J. Schmidhuber and S. Heil. Sequential neural text compression. IEEE Transactions on Neural Networks, 7(1):142–146, 1996.
 [249] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135–141, 1991.
 [250] J. Schmidhuber, D. Wierstra, M. Gagliolo, and F. J. Gomez. Training recurrent networks by EVOLINO. Neural Computation, 19(3):757–779, 2007.
 [251] J. Schmidhuber, J. Zhao, and N. Schraudolph. Reinforcement learning with self-modifying policies. In S. Thrun and L. Pratt, editors, Learning to learn, pages 293–309. Kluwer, 1997.
 [252] J. Schmidhuber, J. Zhao, and M. Wiering. Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28:105–130, 1997.
 [253] B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1998.
 [254] A. Schwartz. A reinforcement learning method for maximizing undiscounted rewards. In Proc. ICML, pages 298–305, 1993.
 [255] H. P. Schwefel. Numerische Optimierung von Computer-Modellen. Dissertation, 1974. Published 1977 by Birkhäuser, Basel.
 [256] F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, J. Peters, and J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.
 [257] C. E. Shannon. A mathematical theory of communication (parts I and II). Bell System Technical Journal, XXVII:379–423, 1948.
 [258] H. T. Siegelmann and E. D. Sontag. Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80, 1991.
 [259] K. Sims. Evolving virtual creatures. In A. Glassner, editor, Proceedings of SIGGRAPH ’94 (Orlando, Florida, July 1994), Computer Graphics Proceedings, Annual Conference, pages 15–22. ACM SIGGRAPH, ACM Press, July 1994. ISBN 0897916670.
 [260] Ö. Simsek and A. G. Barto. Skill characterization based on betweenness. In NIPS’08, pages 1497–1504, 2008.
 [261] S. Singh, A. G. Barto, and N. Chentanez. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 17 (NIPS). MIT Press, Cambridge, MA, 2005.
 [262] S. P. Singh. Reinforcement learning algorithms for average-payoff Markovian decision processes. In National Conference on Artificial Intelligence, pages 700–705, 1994.
 [263] R. J. Solomonoff. A formal theory of inductive inference. Part I. Information and Control, 7:1–22, 1964.
 [264] R. J. Solomonoff. Complexity-based induction systems. IEEE Transactions on Information Theory, IT-24(5):422–432, 1978.
 [265] B. Speelpenning. Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD thesis, Department of Computer Science, University of Illinois, Urbana-Champaign, Jan. 1980.
 [266] R. K. Srivastava, J. Masci, S. Kazerounian, F. Gomez, and J. Schmidhuber. Compete to compute. In Advances in Neural Information Processing Systems (NIPS), pages 2310–2318, 2013.
 [267] R. K. Srivastava, B. R. Steunebrink, and J. Schmidhuber. First experiments with PowerPlay. Neural Networks, 41:130–136, 2013. Special Issue on Autonomous Learning.
 [268] K. O. Stanley, D. B. D’Ambrosio, and J. Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212, 2009.
 [269] Y. Sun, F. Gomez, T. Schaul, and J. Schmidhuber. A Linear Time Natural Evolution Strategy for Non-Separable Functions. In Proceedings of the Genetic and Evolutionary Computation Conference, page 61, Amsterdam, NL, July 2013. ACM.
 [270] Y. Sun, D. Wierstra, T. Schaul, and J. Schmidhuber. Efficient natural evolution strategies. In Proc. 11th Genetic and Evolutionary Computation Conference (GECCO), pages 539–546, 2009.
 [271] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. Technical Report arXiv:1409.3215 [cs.CL], Google, 2014. NIPS’2014.
 [272] R. Sutton and A. Barto. Reinforcement learning: An introduction. Cambridge, MA, MIT Press, 1998.
 [273] R. S. Sutton. Integrated architectures for learning, planning and reacting based on dynamic programming. In Machine Learning: Proceedings of the Seventh International Workshop, 1990.
 [274] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS) 12, pages 1057–1063, 1999.
 [275] R. S. Sutton, D. Precup, and S. P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 112(1-2):181–211, 1999.
 [276] R. S. Sutton, C. Szepesvári, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems (NIPS’08), volume 21, pages 1609–1616, 2008.
 [277] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. Technical Report arXiv:1409.4842 [cs.CV], Google, 2014.

 [278] A. Teller. The evolution of mental models. In K. E. Kinnear, Jr., editor, Advances in Genetic Programming, pages 199–219. MIT Press, 1994.
 [279] J. Tenenberg, J. Karlsson, and S. Whitehead. Learning via task decomposition. In J.-A. Meyer, H. Roitblat, and S. Wilson, editors, From Animals to Animats 2: Proceedings of the Second International Conference on Simulation of Adaptive Behavior, pages 337–343. MIT Press, 1993.
 [280] G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219, 1994.
 [281] J. N. Tsitsiklis and B. van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94, 1996.
 [282] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 42:230–265, 1936.
 [283] P. E. Utgoff and D. J. Stracuzzi. Many-layered learning. Neural Computation, 14(10):2497–2529, 2002.
 [284] H. van Hasselt. Reinforcement learning in continuous state and action spaces. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning, pages 207–251. Springer, 2012.
 [285] N. van Hoorn, J. Togelius, and J. Schmidhuber. Hierarchical controller learning in a first-person shooter. In Proceedings of the IEEE Symposium on Computational Intelligence and Games, 2009.
 [286] V. Vapnik. Principles of risk minimization for learning theory. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems (NIPS) 4, pages 831–838. Morgan Kaufmann, 1992.

 [287] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
 [288] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. Preprint arXiv:1411.4555, 2014.
 [289] C. S. Wallace and D. M. Boulton. An information theoretic measure for classification. Computer Journal, 11(2):185–194, 1968.
 [290] C. Wang, S. S. Venkatesh, and J. S. Judd. Optimal stopping and effective machine complexity in learning. In Advances in Neural Information Processing Systems (NIPS’6), pages 303–310. Morgan Kaufmann, 1994.
 [291] O. Watanabe. Kolmogorov complexity and computational complexity. EATCS Monographs on Theoretical Computer Science, Springer, 1992.
 [292] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, 1989.
 [293] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
 [294] A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems (NIPS) 3, pages 875–882. San Mateo, CA: Morgan Kaufmann, 1991.

 [295] G. Weiss. Hierarchical chunking in classifier systems. In Proceedings of the 12th National Conference on Artificial Intelligence, volume 2, pages 1335–1340. AAAI Press/The MIT Press, 1994.
 [296] J. Weng, N. Ahuja, and T. S. Huang. Cresceptron: a self-organizing neural network which grows adaptively. In International Joint Conference on Neural Networks (IJCNN), volume 1, pages 576–581. IEEE, 1992.
 [297] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8-4.9, NYC, pages 762–770, 1981.
 [298] P. J. Werbos. Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17, 1987.
 [299] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 1988.
 [300] P. J. Werbos. Backpropagation and neurocontrol: A review and prospectus. In IEEE/INNS International Joint Conference on Neural Networks, Washington, D.C., volume 1, pages 209–216, 1989.
 [301] P. J. Werbos. Neural networks for control and system identification. In Proceedings of IEEE/CDC Tampa, Florida, 1989.
 [302] P. J. Werbos. Neural networks, system identification, and control in the chemical industries. In D. A. White and D. A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pages 283–356. Thomson Learning, 1992.
 [303] J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
 [304] H. White. Learning in artificial neural networks: A statistical perspective. Neural Computation, 1(4):425–464, 1989.
 [305] S. Whiteson. Evolutionary computation for reinforcement learning. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning, pages 325–355. Springer, Berlin, Germany, 2012.
 [306] S. Whiteson, N. Kohl, R. Miikkulainen, and P. Stone. Evolving keepaway soccer players through task decomposition. Machine Learning, 59(1):5–30, May 2005.
 [307] A. P. Wieland. Evolving neural network controllers for unstable systems. In International Joint Conference on Neural Networks (IJCNN), volume 2, pages 667–673. IEEE, 1991.
 [308] M. Wiering and J. Schmidhuber. Solving POMDPs with Levin search and EIRA. In L. Saitta, editor, Machine Learning: Proceedings of the Thirteenth International Conference, pages 534–542. Morgan Kaufmann Publishers, San Francisco, CA, 1996.
 [309] M. Wiering and J. Schmidhuber. HQ-learning. Adaptive Behavior, 6(2):219–246, 1998.
 [310] M. Wiering and M. van Otterlo. Reinforcement Learning. Springer, 2012.
 [311] M. A. Wiering and J. Schmidhuber. Fast online Q(λ). Machine Learning, 33(1):105–116, 1998.
 [312] D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber. Recurrent policy gradients. Logic Journal of IGPL, 18(2):620–634, 2010.
 [313] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. Natural evolution strategies. In Congress of Evolutionary Computation (CEC 2008), 2008.
 [314] R. J. Williams. Reinforcement-learning in connectionist networks: A mathematical analysis. Technical Report 8605, Institute for Cognitive Science, University of California, San Diego, 1986.
 [315] R. J. Williams. Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, College of Comp. Sci., Northeastern University, Boston, MA, 1988.
 [316] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
 [317] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Backpropagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1994.
 [318] L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
 [319] D. H. Wolpert. Bayesian backpropagation over i-o functions rather than weights. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems (NIPS) 6, pages 200–207. Morgan Kaufmann, 1994.
 [320] B. M. Yamauchi and R. D. Beer. Sequential behavior and learning in evolved dynamical neural networks. Adaptive Behavior, 2(3):219–246, 1994.
 [321] X. Yao. A review of evolutionary artificial neural networks. International Journal of Intelligent Systems, 4:203–222, 1993.
 [322] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. Technical Report arXiv:1311.2901 [cs.CV], NYU, 2013.
 [323] H.-G. Zimmermann, C. Tietz, and R. Grothmann. Forecasting with recurrent neural networks: 12 tricks. In G. Montavon, G. B. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade (2nd ed.), volume 7700 of Lecture Notes in Computer Science, pages 687–707. Springer, 2012.