Thinking Required

12/07/2015 ∙ by Kamil Rocki, et al. ∙ 0

There exists a theory of a single general-purpose learning algorithm which could explain the principles of its operation. It assumes an initial rough architecture and a small library of simple innate circuits which are prewired at birth, and proposes that all significant mental algorithms are learned. Given current understanding and observations, this paper reviews and lists the ingredients of such an algorithm from architectural and functional perspectives.







I Background

In a very simplistic way, current research efforts in the field of Artificial Intelligence (AI) can be divided into

(Footnote 1: this might be the author’s opinion only.)

Recently, much progress has been made in the area of supervised learning (as mentioned in the paragraph above). However, one of the greatest remaining challenges in artificial intelligence research is to advance the field of unsupervised learning algorithms(DBLP:journals/corr/abs-1206-5538, ; Bengio-2009, ; lecuncvpr, ; Goodfellow-et-al-2015-Book, ), especially the autonomous learning of complex spatiotemporal patterns (another way of thinking about predicting spatiotemporal patterns is structured prediction(Bakir:2007:PSD:1296180, )). The main motivation of the work presented in this paper is the observation that all human experiences are inherently spatiotemporal and that all predictions are context-dependent.

This paper postulates a need for another intensified research effort spanning the fields of neuroscience, machine learning, AGI and neuromorphic computing in order to design algorithms and build machines that think, that is, machine intelligence. It focuses on the neocortex, that is, the intelligence part, and does not address other aspects related to the brain such as consciousness, emotions, reinforcement or long-term memory.

II Ingredients

A. General Purpose

The neocortex, which is attributed only to mammals, is deemed to be the place where intelligent behavior resides. It has been studied extensively over the past decades, but to date there is still no consensus on how it works. There exists a theory of a single learning algorithm which explains intelligence(citeulike:13329708, ; Hawkins:2004:INT:993636, ; kurzweil2012create, ; HintonSejnowski:86, ; domingos2015master, ). It has been considered ever since Mountcastle’s discovery of the simple uniform architecture of the cortex(Mountcastle:1978, ) (six horizontal layers organized into vertical structures called cortical columns; the columns can be thought of as the basic repeating functional units of the neocortex), which might suggest that all brain regions perform similar operations and that there are no region-specific algorithms. Another famous experiment, which reignited the same idea, showed that after rewiring, the auditory part of the brain in ferrets was able to learn to interpret visual inputs(roe1992visual, ). However, alternative theories of such a structural arrangement exist, which propose that the brain might be a collection of microcircuits that look basically the same but perform very different functions(citeulike:12886889, ). This paper does not consider the latter view. Despite the fact that there is still much to be discovered, there are many facts that are either already known or are very likely to be true. Assuming that a general purpose learning procedure does indeed exist, this paper lists its key aspects that could agree with the pieces of evidence which have been gathered so far. Our knowledge about the necessary ingredients of such an algorithm is shaped by neuroscientific discoveries, empirical evaluation of the effectiveness of algorithms, metacognition and observations. Some of the points below may be considered as very general assumptions for reverse-engineering the learning algorithm.

B. Unsupervised

In the real world, almost all data is unlabeled. Although, to the best of our knowledge, nobody has discovered the precise rules used by the human brain for learning, one can assume that we learn mostly in an unsupervised way. Specifically, when a newborn learns about the world and how different objects interact, there might not even be a way to provide a supervised signal to him/her, because the appropriate sensory representations (e.g. visual, auditory) need to be developed first. Another piece of evidence against supervised learning may be obtained by a simple calculation: assuming that there are approximately 10^14 synapses and 10^9 seconds in a human lifetime, there is enough capacity to store all memories at the rate of 10^5 bits/second(Schmidhuber:07alt, ), so there would simply not be enough information in the labels alone. This motivates the hypothesis of the predominance of unsupervised learning, since the only way of acquiring so much information is by absorbing data from perceptual inputs(hintonlecture, ). Even when a teacher is present, most learning must be done by learning associations between events without supervision. That is to say, concepts are first learned by forming internal representations of the experiences, and those internal representations are associated with names only afterwards, so that the two steps are separated (which also allows labeling all previously seen and all yet to be seen objects which fall into the same cluster). Unsupervised learning has been researched extensively and was found to be closely connected to the process of entropy-maximization, regularization and compression(DBLP:journals/corr/abs-1206-5538, ; MacKay:itp, ; hinton1999unsupervised, ; 888, ), which suggests that through evolution, our brains have adapted to act as data compactors.
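The capacity argument above can be reproduced as a short back-of-the-envelope calculation. This is only a sketch: the synapse count and lifetime are rough orders of magnitude assumed here, and the labelling rate (one label per second out of 1000 classes) is a hypothetical figure chosen for illustration.

```python
import math

# Orders of magnitude only; both constants are assumptions of this sketch.
SYNAPSES = 10**14          # assumed number of synapses, one bit of capacity each
LIFETIME_SECONDS = 10**9   # assumed length of a human lifetime (about 30 years)

def required_rate(capacity_bits, seconds):
    """Bits per second needed to fill `capacity_bits` within `seconds`."""
    return capacity_bits / seconds

rate = required_rate(SYNAPSES, LIFETIME_SECONDS)

# Hypothetical supervised signal: one label per second drawn from 1000 classes,
# which carries at most log2(1000) bits per second.
label_rate = math.log2(1000)

print(f"rate needed to fill capacity: {rate:.0e} bits/s")
print(f"rate supplied by labels:      {label_rate:.1f} bits/s")
```

Under these assumptions the labels supply several orders of magnitude less information than the capacity could absorb, which is the gap the argument attributes to perceptual input.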
In particular, the goal of unsupervised learning might be to find codes which disentangle input sources and describe the original information in a less redundant or more interpretable way(888, ) by throwing out as much data as possible without losing information. An example of this operation has been observed in the visual cortex(wiesel:1959, ) (but might even happen as early as in the retina), which learns patterns appearing in the natural environment and assigns high probability to those(hyvarinen2009natural, ), and low probability to random combinations. Real-world data is said to lie near a non-linear manifold(bengio2004discovering, ) within the higher-dimensional space, where the manifold shape is defined by the data probability distribution. Clustering is then equivalent to learning those manifolds and being able to separate them well enough for a given task.

C. Hierarchical

Humans learn concepts in a sequential order, first making sense of simple patterns and then representing more complex ones in terms of those previously learned abstractions. The ability to read might serve as an example. First we learn to see, then we recognize pen strokes, then letters and words, and only then are we able to understand complex sentences, whereas the non-hierarchical approach would be to attempt to read straight from ink patterns on a piece of paper. The brain might have adapted this way to reflect the fact that the world is inherently hierarchical. This observation also inspired the deep learning movement, which used the hierarchical approach to model real world data, achieving unprecedented performance on many tasks. The way deep learning algorithms automatically compose multiple layers of representations of data gives rise to models which yield increasingly abstract associations between concepts (hence the other names used - representation learning(Barlow:89review, ; Bengio:2013:RLR:2498740.2498889, ; bengio2013deep, ; deng2014deep, ; erhan2010does, ) and feature learning(DBLP:journals/corr/abs-1206-5538, ) among others). The main distinction between the deep approach and the previous generation of machine learning is that the structure in the data should be discovered automatically by a general-purpose learning procedure, without the need to hand-engineer feature detectors(DBLP:journals/corr/abs-1206-5538, ; LeCun2015, ). This scheme agrees very well with the idea of unsupervised learning mentioned above. In a way, abstract hierarchical representations might be a natural by-product of data compression(888, ). Given the theoretical and empirical evidence in favor of deep representation learning, one could formulate a requirement for any type of brain-like architecture to be deep. The question, however, is the nature of the learning procedure.

D. Sparse distributed representations

The existence of cortical columns in the neocortex has been linked to the functional importance of such an arrangement. Each column typically responds to a sensory stimulus representing a certain body part or region of sound or vision, so that all cells belonging to that column are excited simultaneously, therefore acting as a feature detector. At the same time, a column which is active (receives a strong input signal and spikes) will prohibit other nearby columns from becoming active. This lateral inhibition mechanism leads to sparse activity patterns. The fact that only strongly active columns will not be inhibited forces the learned patterns to be as invariant as possible, giving rise to independent feature detectors in the cortex(Bell97the`independent, ). As might have been expected, these sparse distributed representations (SDRs) in the brain are not coincidental, since they possess important properties from the information-theoretical perspective. The distributed aspect makes factored codes possible, which is important in order to disentangle the underlying causes(Bengio:2013:RLR:2498740.2498889, ) (e.g. melody, instrument, pitch, loudness; the equivalent task is called blind source separation in the cocktail party problem), while sparsity affects other elements of learning good features (see Fig. 1). It has been proven that, given certain sparsity, a signal may be correctly reconstructed even with fewer samples than the sampling theorem requires(citeulike:2688127, ; Donoho:2006:CS:2263438.2272089, ).
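The way lateral inhibition produces sparse activity can be illustrated with a k-winners-take-all rule. This is a common simplification chosen here for illustration, not a mechanism specified in the text; the function name, the input values and k are all hypothetical.

```python
def k_winners_take_all(activations, k):
    """Return a binary code in which only the k strongest units stay active."""
    # Indices of the k largest activations; ties are broken by position.
    winners = sorted(range(len(activations)),
                     key=lambda i: activations[i], reverse=True)[:k]
    code = [0] * len(activations)
    for i in winners:
        code[i] = 1
    return code

# Hypothetical input drive to eight neighbouring columns.
column_input = [0.1, 0.9, 0.3, 0.05, 0.7, 0.2, 0.0, 0.4]
sdr = k_winners_take_all(column_input, k=2)
print(sdr)  # [0, 1, 0, 0, 1, 0, 0, 0]
```

Only the two most strongly driven columns remain active; all others are suppressed regardless of their absolute input level.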

Figure 1: An example of a factored representation - where a composite object/scene can be represented as a sum of objects comprising the scene (square, circle and triangle in this case are the disentangled factors of variation). In addition to that, each of those shapes is described by reusable properties such as color or shape.

As a comparison, a dense binary representation is very space-efficient, in the sense that given n bits, it is capable of storing 2^n different values. In the domain of learning complex patterns in a real environment, however, there is a fundamental problem with it. Almost any random flip of bits (noise) will produce a value that is very different from the original one, so that given a noisy version of a pattern, it would not be possible to recover the correct version. Sparse distributed representations (SDRs), on the other hand, assume that every bit has some meaning, that only a small number of bits should be active at any moment in time (sparse), and that an object is a collection of simple patterns (distributed). These properties make SDRs significantly more noise-resistant(1503.07469, ). Another important property of distributed representations is that the number of distinguishable regions scales exponentially with the number of parameters used to describe it. This is not true for non-distributed representations. That is, sparse distributed representations are combinatorially much more expressive. Given this observation, it is simple to see that from the discriminative point of view, or at higher levels of abstraction, SDRs will be a preferred way of representing inputs, since the learning procedure produces a form which preserves as much information as possible while making the code as short/simple as possible (this also corresponds to finding minimum-entropy codes(Barlow:89, ; hyvarinen2001, )). This is in line with the Occam’s Razor or Minimum Description Length (MDL) rules, which postulate that the simple solution should be chosen over the complex ones(Solomonoff:64, ; Rissanen:78, ). This allows for manipulating sparse representations throughout the large network and simplifies learning higher level concepts (see dimensionality reduction(HinSal06, ; saul2003, ), redundancy reduction(Li:2008:IKC:1478784, ; doi:10.1080/net., )).
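The difference in noise sensitivity between a dense binary code and an SDR can be made concrete with a toy comparison. The sketch below is illustrative only: the code sizes, the stored patterns (`cat`, `dog`) and the overlap measure are assumptions, not taken from the cited work.

```python
def flip(bits, positions):
    """Return a copy of `bits` with the given positions inverted."""
    return [b ^ 1 if i in positions else b for i, b in enumerate(bits)]

def to_int(bits):
    """Interpret a bit list as an unsigned integer, most significant bit first."""
    return int("".join(map(str, bits)), 2)

def overlap(a, b):
    """Number of positions where both codes are active."""
    return sum(x & y for x, y in zip(a, b))

def make_sdr(active, n=32):
    """Binary code of length n with ones at the indices in `active`."""
    return [1 if i in active else 0 for i in range(n)]

# Dense code: 8 bits store one of 2**8 values; a single flip changes the value a lot.
dense = [0, 1, 0, 0, 0, 0, 0, 1]
noisy_dense = flip(dense, {0})
print(to_int(dense), to_int(noisy_dense))   # 65 193

# Sparse code: 32 bits, 4 active. Two hypothetical stored patterns.
cat = make_sdr({1, 9, 17, 25})
dog = make_sdr({3, 11, 19, 27})
noisy_cat = flip(cat, {9, 30})   # one active bit lost, one spurious bit added
print(overlap(noisy_cat, cat), overlap(noisy_cat, dog))   # 3 0
```

After two bit errors the noisy sparse pattern still shares three of four active bits with its source and none with the other stored pattern, so the original can be identified by overlap, whereas the dense value has no such notion of closeness.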

Ever since the discovery of selective feature detectors such as edge detectors and center-surround receptive fields in V1 by Hubel and Wiesel in 1959(wiesel:1959, ), learning biologically plausible sparse distributed representations of input patterns has been a hot research topic(Barlow:89, ; Foldiak:95, ; Kanerva:1988:SDM:534853, ; Olshausen97sparsecoding, ). It may be seen as an instantiation of unsupervised learning, and many algorithms have been invented so far(Hinton:1986:DR:104279.104287, ; ackley1985learning, ; Poultney06efficientlearning, ; 1503.07469, ; sparse2007ng, ; lee2007sparse, ; Hinton:97, ; AISTATS2011_GlorotBB11, ; hinton2007learning, ; hyvarinen2009natural, ; olshausen2004sparse, ; Donoho:2006:CS:2263438.2272089, ; Boureau07y.:sparse, ; lee:2009, ) (they include Factor models, PCA, RBM, ICA, Sparse coding, AE, among others). Convolutional Neural Networks (CNNs or convnets(Fukushima:1979neocognitron, ; lecun-89e, )), on the other hand, are supervised learning architectures based on the principle of learning a hierarchy of SDRs. They currently provide state-of-the-art image recognition, demonstrating the discriminative value of learning such representations. Sparsity has also been linked to quantifiably better performance on discriminative tasks(AISTATS2011_GlorotBB11, ), which may be explained by the fact that sparse representations simplify optimization of an objective.

Figure 2: Efficient learning of SDRs; Sparse Distributed Representations (SDRs) simplify learning temporal dependencies; provide a mechanism for generalization and out-of-domain prediction

Another desirable property resulting from a factored representation is the generalization capability, meaning that similar input patterns will produce similar bit outputs. It might imply that SDRs are a plausible candidate for the alphabet used by the neocortex and a means to machine intelligence. One example of an advantage of an SDR compared to a dense representation becomes obvious when considering learning temporal dependencies between spatial patterns (Fig. 2). Assuming that the learning procedure has disentangled the underlying sources of variation, learning complex sequences may be decomposed into finding relationships between those sources.

E. Objectiveless

The backpropagation algorithm(opac-b1081822, ; Rumelhart:1988:LRB:65669.104451, ) lies at the heart of most deep architectures. It specifies how the internal parameters of a model at all levels should be changed in order to improve it (it specifies the direction of movement in the state-space). Given certain problems (footnote 2: this, again, is the author’s view; backpropagation is not evil; it might be a different path leading to the same goal) and following the Occam’s Razor rule, this paper questions whether one really needs backpropagation to learn non-trivial concepts. Usually, the algorithm computes the derivatives of the outputs and propagates them backward, which in turn relies on having an objective function, which depends on the task definition, performance criterion and other assumptions. This is definitely a problem, since it causes a generality/performance tradeoff and requires some a priori knowledge about a task. Another issue is related to the scalability of the procedure of propagating the error derivatives backwards from a single place in a network. In addition, standard backpropagation assumes that the objective function and intermediate activations are differentiable.
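To make the dependence on an objective function and on differentiability concrete, here is a minimal sketch of backpropagation on a two-parameter model. The model (one tanh hidden unit), the squared-error objective, the learning rate and the starting values are all assumptions chosen for illustration.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)   # hidden activation (must be differentiable)
    y = w2 * h              # output
    return h, y

def loss(w1, w2, x, t):
    """Squared-error objective; the whole procedure depends on this choice."""
    _, y = forward(w1, w2, x)
    return 0.5 * (y - t) ** 2

def backward(w1, w2, x, t):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y = forward(w1, w2, x)
    dy = y - t                      # dL/dy
    dw2 = dy * h                    # dL/dw2
    dh = dy * w2                    # error propagated back to the hidden unit
    dw1 = dh * (1 - h ** 2) * x     # tanh'(a) = 1 - tanh(a)^2
    return dw1, dw2

w1, w2, x, t = 0.5, -0.3, 1.0, 1.0
initial_loss = loss(w1, w2, x, t)
for _ in range(200):
    g1, g2 = backward(w1, w2, x, t)
    w1 -= 0.1 * g1   # move against the gradient of the objective
    w2 -= 0.1 * g2
final_loss = loss(w1, w2, x, t)
print(initial_loss, final_loss)
```

Every update is derived from the single scalar objective at the output, which is exactly the property the text questions: without a target `t` and a differentiable path from parameters to loss, the procedure has nothing to propagate.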

Figure 3: Images that are unrecognizable to humans, but classified by a NN trained on ImageNet with over 99.6 percent certainty. This result highlights differences between how statistical methods and humans recognize objects. Figure borrowed from [DBLP:journals/corr/NguyenYC14 ]

Then, in addition to the difficulty of having to hand-engineer the task definition and the objective function, there exists a problem that has been discovered recently(DBLP:journals/corr/SzegedyZSBEGF13, ; goodfellow2014explaining, ) - the fact that there exist images which can be classified as almost any object with great confidence(DBLP:journals/corr/NguyenYC14, ) (Figure 3), despite the fact that they might not resemble any known object to humans. The existence of those images suggests that the models fail to really understand the concepts; instead, the situation bears resemblance to the Chinese Room argument(searle1984minds, ). The adversarial examples lie in the neighborhood of the data manifold and possess similar statistics, so the algorithm treats them as the same, yet they look drastically different to humans. One hypothesis explaining this phenomenon is that having an objective(DBLP:books/sp/StanleyL15, ) is the problem itself. It could be hypothesized that by following the gradient of the objective function, one may prohibit the learning procedure from discovering the unknown state-space, or that progress in learning is not equivalent to being close to the objective (Figure 5). The fooling images problem is not limited to a particular NN algorithm, architecture or dataset. In fact, it has been shown in many areas, and the same undesirable properties can even be transferred from one net to another. A striking example of the contrast between how accurate a NN can be at generating image captions and the type of mistakes it makes is shown in Figure 4. The same problem of lack of understanding of grammar and complex concepts applies to machine translation.

Figure 4: Captions generated by the same network, taken from [DBLP:journals/corr/XuBKCCSZB15 ]. Image (b) shows that the network actually does not have a deeper understanding of what is in an image

Figure 5: The final product does not need to resemble intermediate steps, some things might not be discovered when following the gradient of an objective function, figure by K. Stanley [DBLP:books/sp/StanleyL15 ], see [Secretan:2011:PCS:2078014.2078016 ] for other examples

Finally, one more problem which has been attributed to gradient-based learning is called catastrophic forgetting(Goodfellow2014, ), which means that a model can forget previously learned knowledge upon a presentation of new data by re-adjusting the parameters according to the gradients.
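Catastrophic forgetting can be reproduced in its most extreme form with a single shared parameter. The sketch below is a deliberately minimal illustration under assumed data: a one-weight linear model trained by gradient descent first on a hypothetical task A (slope 2) and then on a conflicting task B (slope -1).

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, lr=0.05, steps=200):
    """Plain gradient descent on the mean squared error."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

inputs = (-1.0, 0.5, 1.0)
task_a = [(x, 2.0 * x) for x in inputs]    # hypothetical task A
task_b = [(x, -1.0 * x) for x in inputs]   # hypothetical task B

w = train(0.0, task_a)
error_a_before = mse(w, task_a)   # close to zero: task A has been learned
w = train(w, task_b)
error_a_after = mse(w, task_a)    # large: the gradients of task B overwrote task A
print(error_a_before, error_a_after)
```

Larger networks have more parameters to spread knowledge over, so the effect is usually less total than here, but the mechanism is the same: parameters are re-adjusted along the gradients of the new data only.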

F. Scalable

The brain comprises approximately 10^11 neurons and 10^14 synapses(Chklovskii2004, ; hawkins2015neurons, ). In such a large network, having a single learning objective and propagating error derivatives backwards might not be the best choice (one of the reasons might be that modification of all synapses every time is wasteful, another is the hardness of parallelizing global optimization problems, although there are some approaches to solving this problem(dean2012, ; coates:2013icml, )). Instead, it might be more reasonable to separate local learning (gray matter) from adjusting higher level connections between layers/regions (white matter). This functional distinction would reflect the structural hierarchy that is so predominant in the deep learning methods described before and in the real world. Biological, technological, social networks (most crucially transportation) and other types of real-world networks are neither completely random nor definitely regular. Instead, their topology lies somewhere in between. The so-called small world networks(watts1998cds, ) may be nature’s solution to a hierarchical structure which allows for separate parallel local and global updates of synapses, scalability, and unsupervised learning at the lower levels with more goal-oriented fine-tuning in higher regions.

Figure 6: An example of a small world network: each edge encodes the presence of long-distance connection between corresponding regions in a macaque brain. Figure borrowed from [modha2010network ]

Studying the neocortex indeed reveals that this is the case (Fig. 6), where columnar organization reflects the local connectivity of the cerebral cortex. Another piece of evidence comes from the success of convolutional networks, where sharing connections imposes such a local/global structure of learning. The use of sparse distributed representations is also very important from the scalability point of view, since the representations are inherently fault tolerant. Moreover, sparse activations may be stored in a more compact form, and only the non-zero elements need to be processed instead of all of them. Finally, it has been shown recently that neural networks are able to learn even with very limited computational precision(DBLP:journals/corr/GuptaAGN15, ; DBLP:journals/corr/CourbariauxBD14, ), stochastic approximations(lin2015neural, ) and noise(Srivastava:2014:DSW:2627435.2670313, ; icml2013_wan13, ; courbariaux2015binaryconnect, ). In fact, such networks may even have better generalization capability.
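The storage and processing advantage of sparse activations can be sketched as follows. The vector length, the active indices and the weights are hypothetical; the point is only that the index-list form touches four elements where the dense form touches all 2048.

```python
def to_sparse(code):
    """Keep only the indices of the active (non-zero) bits."""
    return [i for i, bit in enumerate(code) if bit]

def sparse_dot(active, weights):
    """Weighted sum over active inputs only; inactive inputs are never touched."""
    return sum(weights[i] for i in active)

n = 2048
code = [0] * n
for i in (5, 300, 1024, 2000):   # hypothetical active bits
    code[i] = 1
weights = [0.001 * i for i in range(n)]

active = to_sparse(code)                                   # 4 stored integers
dense_result = sum(c * w for c, w in zip(code, weights))   # n multiply-adds
sparse_result = sparse_dot(active, weights)                # 4 additions
print(active, dense_result, sparse_result)
```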

The brain is an inherently parallel machine, without separate instruction-issuing and memory storage areas. Instead, all parts of the neocortex participate in both. This is a very big difference when compared to the von Neumann architecture, which describes how the majority of computing systems are organized. The main bottleneck of current systems concerns data movement, which implies additional bandwidth, power and latency requirements. CPUs are typically optimized for serial tasks, mitigating the negative effects of such an architecture by a deep cache hierarchy, but losing when parallelism is involved. GPUs have a more brain-like layout, with more equal processing units, each having some private memory, so that they can actually operate in parallel without colliding. However, the problem of moving the data still exists, either between the CPU and the GPU or inside the GPU. In fact, it is quite easy to show that it is virtually impossible to achieve the peak performance of those processors, because the data cannot be fed fast enough. Moreover, data transfers are the major energy consumption factor on parallel GPU-like devices(Villa:2014:SPW:2683593.2683684, ). Therefore, a more radical approach may be needed in order to improve the performance significantly. The von Neumann architecture needs to be changed into one where memory itself can compute. Some hardware which allows such functionality has already appeared(dlugosch2014efficient, ). The concept of in-place processing assumes, however, that a different approach is also needed when thinking about algorithms. This process of communication-aware algorithm design has already started with the advent of multi-core CPUs, GPUs and FPGAs. The next step is to design communication-less algorithms(Baboulin201217, ). This is an ongoing effort in the supercomputing community, where it has been noticed that no significant progress can be made without reducing information transfer overhead.
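The claim that peak performance cannot be reached when data cannot be fed fast enough can be illustrated with a simple bandwidth bound (often called a roofline estimate; this model is brought in here for illustration and is not part of the original text). The device figures below are hypothetical round numbers, not measurements of any real processor.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Upper bound on throughput: limited by compute or by memory traffic."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)

PEAK = 5e12         # hypothetical device: 5 TFLOP/s peak
BANDWIDTH = 250e9   # hypothetical memory bandwidth: 250 GB/s

# Elementwise update y[i] += a * x[i] on 4-byte floats:
# 2 operations per element, 12 bytes moved (read x, read y, write y).
intensity = 2 / 12

achieved = attainable_flops(PEAK, BANDWIDTH, intensity)
fraction = achieved / PEAK
print(f"attainable: {achieved:.2e} FLOP/s ({fraction:.1%} of peak)")
```

Under these assumptions a simple streaming operation is capped at under one percent of peak, purely because of data movement.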

G. Biologically plausible

Given the views expressed above, how could one facilitate learning? The most successful learning algorithms at the moment are based on the global gradient descent algorithm; however, for several reasons (mentioned before), the preferred solution to this problem should avoid specifying a global objective. One idea is to look once more for inspiration in nature, and to study what our brains evolved to do in order to build rich internal representations of the environment. This is not just because it provides a shortcut; it might also be the safest bet on the way to general AI, at the same time potentially solving many other problems that we have not even thought about yet. This is a major goal in current AI research. Although the task of decoding the algorithm(s) used by the brain is very difficult, there are some clues as to what might be happening inside our heads.

To start with, very roughly the brain can be divided into separate subsystems (the exact functions are yet to be discovered), where the neocortex is the main information processing workhorse and therefore considered as being separated from lower-level actions (cerebellum), reward/value-like inputs (amygdala, limbic system) or long-term memory access/formation (hippocampal complex). This may serve as a reason why this paper skips those parts, and may justify why it is not unreasonable to think about a single learning algorithm as not being reinforcement-based.

It has been shown that two of the tasks being performed in the visual cortex are spatial pattern detection(wiesel:1959, ) and forming sparse representations of the inputs(Olshausen97sparsecoding, ) (introduced earlier), which have been found to work in the same way when modeled algorithmically too(lee2007sparse, ; Le_buildinghigh-level, ). It has been shown that biologically plausible features can be learned using very simple Hebbian-like learning rules(Linsker1986, ). Then, it has been observed that there exist so-called simple cells, which act as detectors of a specific pattern, such as an oriented edge, and complex cells, which are to some extent invariant to transformations of the inputs and react to a more general group of stimuli (e.g. a shape detector). Those discoveries served as an inspiration for the groundbreaking performance of deep convolutional neural networks in image recognition(Fukushima:1979neocognitron, ; lecun-89e, ; lee:2009, ; NIPS2012_4824, ). At the same time, those successes served as a feedback loop to create even more accurate models of processing in the brain, such as the feedforward HMAX(Riesenhuber:99, ) model, whose neurophysiologically plausible topology is very similar to that of the 1992 Cresceptron(weng1992, ) (and thus to the 1979 Neocognitron(Fukushima:1979neocognitron, )). Similarities have been found in the auditory cortex(4218213, ), where individual phonemes activated different subsets of auditory neurons.

When it comes to learning, there is still much to be discovered. One thing has already been pointed out: the algorithm/learning process has to be mostly unsupervised in nature. More specifically, it has been shown and hypothesized that the main function of the brain has to be unsupervised learning of temporal sequences. At a very general level it means that it constantly anticipates an outcome, acts, observes the world, compares observations with previous expectations and adjusts (or forms new) synapses so that the internal model of the world makes more accurate predictions(Hawkins:2004:INT:993636, ). More formal approaches expressing the same idea have been formulated by G. Hinton (Boltzmann machines(HintonSejnowski:86, )), K. Friston (free-energy principle(Friston2010, )), and Wiskott and Sejnowski (slow feature analysis(WisSej2002, )). Local contrastive divergence-like (CD) learning or target propagation(DBLP:journals/corr/LeeZBB14, ) may be a much more plausible method than backpropagation(hinton2007backpropagation, ). On the implementation side of things, it has been discovered that there are at least 3 types of connections being integrated in a pyramidal cell(hawkins2015neurons, ; Lamme1998529, ) (proximal, distal/basal, apical), most likely serving feed-forward, sequence and feedback roles respectively, using simple, local Hebbian-like(Hebb:49, ; MacKay:itp, ) learning rules, which might resemble a local backpropagation algorithm at a very tiny scale. In fact, that is what has been observed in the brain(Cichon2015, ; Fu2011, ).
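As an illustration of what a simple, local Hebbian-like rule can look like, the sketch below uses Oja's rule, a normalized Hebbian update. It is chosen here as one well-known example and is not claimed to be the rule used in the cited work; the input samples, learning rate and starting weights are hypothetical.

```python
import math

def oja_update(w, x, lr=0.01):
    """One Hebbian step with Oja's normalization; uses only local quantities."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # postsynaptic activity
    # Hebbian term y * x, minus a decay term y^2 * w that keeps |w| bounded.
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]

# Hypothetical 2-D inputs that vary mostly along the direction (1, 1).
samples = [(1.0, 0.9), (-1.0, -1.1), (0.9, 1.0), (-1.1, -1.0)]

w = [0.5, 0.1]
for _ in range(500):
    for x in samples:
        w = oja_update(w, x)

norm = math.sqrt(sum(wi * wi for wi in w))
print(w, norm)
```

With no objective function and no error signal, the weight vector settles near unit length and aligns with the dominant direction of variation in the inputs, so the unit behaves as a detector for the most common input pattern.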

Figure 7: A neocortical pyramidal neuron has thousands of excitatory synapses located on dendrites (inset). There are three sources of input to the cell. The feedforward inputs (shown in green) which form synapses proximal to the soma, directly lead to action potentials. NMDA spikes generated in the more distal basal and apical dendrites depolarize the soma but typically not sufficiently to generate a somatic action potential. Figure borrowed from [hawkins2015neurons ]

It is hypothesized that pyramidal cells in the neocortex might not integrate incoming bottom-up and top-down signals in a simple way(Larkum2013141, ; Larkum07082009, ); instead, their operation may be more gate-like, with proximal dendrites responsible for feed-forward stimuli, distal dendrites for immediate prediction, and apical dendrites acting as a filter/gate and disambiguation mechanism(Hawkins:2004:INT:993636, ) (logical AND-like, so creating an OR-AND or SUM-MULTIPLY cascade(HäusserMel2003, ; representingrelations, )). In this sense, the approach of including gating as a means of memory access control(DBLP:journals/corr/GravesWD14, ; DBLP:journals/corr/WestonCB14, ) or attention(DBLP:journals/corr/ChorowskiBSCB15, ; DBLP:journals/corr/XuBKCCSZB15, ; mnih2014recurrent, ), which has recently become very popular in ML, might be biologically plausible(oreilly:2003, ; 888, ).

Probably the most anatomically accurate models so far are networks of spiking neurons(Maas:1997:NSN:281543.281637, ). Learning in such networks can also be done using local plasticity rules, e.g. Spike Timing Dependent Plasticity(markram1997regulation, ) (STDP), a temporal form of Hebbian-like learning based on temporal correlations between the spikes of pre- and postsynaptic neurons. It is believed to underlie learning and information storage in biological neural networks. It might suggest that local learning rules do in fact exist in the brain, although in such an asynchronous, asymmetric version. The core operation which can be defined in this kind of network is detecting the occurrence of temporally close but spatially distributed input signals(ruf1997learning, ), that is, coincidence detection. For the sake of simplicity (footnote 3: and to stay within a reasonable number of pages), this paper will not focus on this aspect (however, some approaches to AI are based entirely on asynchronously propagated spikes(Merolla08082014, ); see also Section I), but will assume a simpler, symmetric local learning approach as described by G. Hinton(hinton2007backpropagation, ).
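The asymmetric, timing-dependent character of STDP can be sketched with the commonly used pair-based exponential window. The amplitudes and the time constant below are hypothetical values picked for illustration, not figures from the cited study.

```python
import math

def stdp_dw(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Weight change for one pre/post spike pair (spike times in ms)."""
    dt = t_post - t_pre
    if dt > 0:      # presynaptic spike precedes postsynaptic spike: strengthen
        return a_plus * math.exp(-dt / tau)
    if dt < 0:      # postsynaptic spike precedes presynaptic spike: weaken
        return -a_minus * math.exp(dt / tau)
    return 0.0

print(stdp_dw(t_pre=0.0, t_post=5.0))    # positive, causal pairing
print(stdp_dw(t_pre=5.0, t_post=0.0))    # negative, anti-causal pairing
print(stdp_dw(t_pre=0.0, t_post=50.0))   # positive but much smaller
```

The update depends only on the two spike times at a single synapse, and its magnitude falls off with their separation, which is why nearly coincident inputs dominate learning.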

An important fact about the neocortex is that connections between neurons can be created and removed; in other words, the number of parameters of the model is not fixed. However, the majority of current ML approaches (deep or not) focus only on changes in the strengths of connections between neurons (weights/parameters), meaning that the topology of the network is fixed and reflects some prior knowledge. Given sparse connectivity (on the order of thousands of synapses per neuron), an approach which is closer to the biological reality (wiring plasticity, where connections can be formed and removed(Chklovskii2004, )) would offer a great advantage in terms of an increase in the number of combinations of neurons/synapses available to encode learned information.

III Towards Machine Intelligence

The most interesting aspect of this research is connecting the mechanisms described above with the theoretical concepts of what machine intelligence should be. Given some low-level properties of the learning algorithm, what would be the overall goal of learning and what should the learning path look like? What kind of behavior would be considered a stepping stone towards machine intelligence, and is there a way to describe it precisely? The very basic question of what it means for a machine or an algorithm to be intelligent needs clarification. According to some, goal-directed behavior is considered the essence of intelligence(Russell:2003:AIM:773294, ). However, this implies that the necessary and sufficient condition of intelligent behavior is rationality, and this paper questions this statement. Humans are often very far from being rational. Creativity does not fall into this definition, and risk-taking might not be rational, yet it is essential for innovation. Therefore, far more appealing theories of universal intelligence are those with broader priors, such as the theory of curiosity, creativity and beauty described by J. Schmidhuber(Schmidhuber:07alt, ). The previous section introduced problems which may arise from objective-based learning, that is, the Chinese Room argument, where all the algorithm is interested in is mapping inputs to outputs, without any motivation to learn anything beyond the given task. An intelligent algorithm (strong AI(searle1984minds, ), among other names) should be able to reveal hidden knowledge which might not even be discoverable by humans. Despite not having a specific task, this section will point out functional ingredients of any learning procedure which would not violate the generality assumption.

A. Compression

Learning may be likened to the formal, information-theoretic concept of information compression. Assuming that the goal is to build more compact and more useful representations of the environment (such as finding minimum entropy codes(journals/neco/BarlowKM89, )), this interpretation relates to representation learning and the analogy building compression scheme(DBLP:journals/corr/abs-1108-1169, ) of the neocortex. One way of looking at this task is to consider a general artificial intelligence as a general-purpose compressor, one which is able to discover the probability distribution of any source(MacKay:itp, ). However, the No Free Lunch Theorem(Wolpert:1997:NFL:2221336.2221408, ) states that no completely general-purpose learning algorithm can exist or, in other words, that for every compressor there exists a data distribution on which it will perform poorly. This implies that there must exist some restrictions on the class of problems on which it will work well. The previous section already mentioned a few of them, which are fortunately very general and plausible, such as the smoothness prior or the depth prior (also see [Bengio+chapter2007 ] for a more complete list of sensible assumptions).
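
As a rough illustration of the compression view, the sketch below computes the expected code length (in bits per symbol) when a source is encoded with an ideal code built for some model distribution. The function names and the two toy distributions are assumptions made for this example and are not taken from the cited works. When the model matches the source, the cost equals the source entropy; a code tuned to one source pays a penalty on a differently distributed source, which is the compressor-level reading of the No Free Lunch statement.

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution p, in bits per symbol."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def expected_code_length(source, model):
    """Average bits per symbol when symbols drawn from `source` are encoded
    with an ideal code built for `model` (this is the cross-entropy)."""
    return -sum(s * math.log2(m) for s, m in zip(source, model) if s > 0)

# Two hypothetical four-symbol sources (illustrative values only).
skewed = [0.5, 0.25, 0.125, 0.125]
uniform = [0.25, 0.25, 0.25, 0.25]

matched = expected_code_length(skewed, skewed)      # code fits the source
mismatched = expected_code_length(uniform, skewed)  # same code, other source

print(f"entropy of skewed source : {entropy(skewed):.3f} bits")
print(f"skewed code on skewed    : {matched:.3f} bits")
print(f"skewed code on uniform   : {mismatched:.3f} bits")
print(f"entropy of uniform source: {entropy(uniform):.3f} bits")
```

With these values the code built for the skewed source costs 1.75 bits per symbol on that source, but 2.25 bits per symbol on the uniform source, where the best achievable cost is 2 bits.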

B. Prediction

Whereas the smoothness prior may be considered a form of spatial coherence, the assumption that the world is mostly predictable corresponds to temporal or, more generally, spatiotemporal coherence. This is probably the most important ingredient of a general-purpose learning procedure. In other words, it states that things which are close in time are close in space and vice versa. A purely spatial analogy is the huge space of possible images, of which real images occupy only a tiny fraction(hyvarinen2009natural, ). The same is true for spatiotemporal patterns; the assumption that a sequence of spatial patterns is coherent restricts the spectrum of future spatial states which are likely (if looking at a giraffe, the next thing you see is most likely a giraffe too; images of trains could probably be excluded from predictions).

Occam’s Razor and the MDL principle(Solomonoff:64, ; Rissanen:78, ) state that simple solutions should be favored over more complex ones; therefore, learning better representations should be a goal in itself, even without any other objective. If it is assumed that no task is given a priori, the best we can do is just to observe and learn to predict. One of the first working examples (and a proof of concept) is the principle of history compression employed in the recurrent architecture proposed by J. Schmidhuber(schmidhuber1992, ).
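
The idea of observing and learning to predict, and of compressing a history down to what was not predicted, can be sketched in a few lines. The example below is a deliberately simplified stand-in and not the recurrent architecture of the cited work: a table of first-order successor counts (an assumption made here) replaces the recurrent network, and the function name `compress_history` is hypothetical. Only the symbols that the predictor failed to anticipate are kept, since those are the ones a higher level would still need to see.

```python
from collections import Counter, defaultdict

def compress_history(sequence):
    """Return the (index, symbol) pairs that a simple online predictor
    failed to predict; only these would be passed to a higher level."""
    successors = defaultdict(Counter)  # previous symbol -> counts of next symbols
    unexpected = []
    prev = None
    for t, symbol in enumerate(sequence):
        counts = successors[prev]
        # Predict the most frequent successor seen so far (None if unseen).
        predicted = counts.most_common(1)[0][0] if counts else None
        if predicted != symbol:
            unexpected.append((t, symbol))
        counts[symbol] += 1  # learn from the observation
        prev = symbol
    return unexpected

seq = "abcabcabcabcxabcabc"
surprises = compress_history(seq)
print(surprises)
print(f"{len(surprises)} of {len(seq)} symbols were unexpected")
```

On this toy sequence only 6 of the 19 symbols are unexpected: the first pass through the pattern, the inserted "x", and the symbol right after it.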

C. Understanding

The ability to predict is equivalent to understanding, since at any given moment a cause and a prediction could be inferred from the given state context. Therefore, learning to predict may be a more general requirement of intelligent behavior. In fact, it has been postulated(Hawkins:2004:INT:993636, ) that all the brain does is constantly predict future states, compare those predictions with sensory inputs and readjust accordingly. This might seem to be equivalent to backpropagating the error through the entire network; however, from the biological perspective, the prediction/expectation readjustment of neurons most likely operates locally.
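
A minimal sketch of the predict, compare and readjust cycle with a purely local update is given below. It assumes a single linear unit with one weight and a delta-rule (least-mean-squares) update; both are illustrative choices made here rather than a claim about the mechanism used by the brain, and the name `learn_to_predict` is hypothetical.

```python
def learn_to_predict(signal, learning_rate=0.1, epochs=200):
    """Fit a single weight w so that w * x[t] predicts x[t + 1].

    Each update uses only the unit's own input, prediction, and error,
    so no error signal is propagated through other units.
    """
    w = 0.0
    for _ in range(epochs):
        for x_now, x_next in zip(signal, signal[1:]):
            prediction = w * x_now              # predict the next input
            error = x_next - prediction         # compare with what arrived
            w += learning_rate * error * x_now  # readjust locally
    return w

# Hypothetical sensory stream: each value is 0.9 times the previous one.
signal = [0.9 ** t for t in range(20)]
w = learn_to_predict(signal)
print(round(w, 3))
```

For this stream the weight approaches 0.9, the factor that generates the signal, using nothing but the unit's own prediction error.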

D. Sensorimotor

Scientists have demonstrated that the brain predicts the consequences of our eye movements based on what we see next. The findings have implications for understanding human attention and applications to robotics. Despite the fact that, in practice, no experience is perceived twice, human brains are able to form a stable representation of an abstract concept and make accurate predictions despite changes in context. An example of such mental representations being present may be observed in rapid eye movements known as saccades (see Fig. 8). Our eyes move rapidly approximately three times a second in order to capture new visual information. With each jump a new image falls onto the retina. However, we do not experience this quickly changing sequence of images; instead, we see a stable image. The brain uses such a mechanism in order to redirect attention, since only approximately 1 of the retina provides a sharp image (the fovea). This operation has been extensively researched from the neuroscientific perspective, as it provides one of the few visible brain activities(Rolfs2011, ; Kowler20111457, ), and it has inspired algorithms mimicking this behavior(Gaskett:04, ; DBLP:journals/corr/Ranzato14, ; SchmidhuberHuber:91, ; NIPS2010_4089, ). Sensorimotor connections are needed in order to know which changes in the image do not result from internal eye movement. It is assumed(yuwei2015maintaining, ) that motor commands are subtracted from the inputs in order to provide an invariant representation of a concept. This implies that every part of the neocortex must be performing this function, given its uniformity. One hypothesis is that the basic repeating functional unit of the neocortex is a sensorimotor model(hawkins2015neurons, ), that is, every part of the brain performs both sensory and motor processing to some extent.
Complex cells in the V2 visual cortex are invariant to small changes of input patterns(lee2007sparse, ); those invariant activations might be mapped purely spatially or may represent spatiotemporal patterns (i.e. an invariant representation given an action). Other experiments support the claim, showing a similar mechanism operating on different types of sensory inputs(Krieger:2015:SIW:2838985, ; Diamond2008, ). From an implementation perspective, sensorimotor integration may be understood in the same way as the top-down connections mentioned in the previous section (see Fig. 9).
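
The assumption that a known motor command can be removed from the input to obtain an invariant representation can be sketched with a toy one-dimensional "retina". In the example below the eye movement is modeled as a circular shift of the scene and the compensation as the inverse shift; this translation model and the names `shift`, `observe` and `stabilize` are assumptions made for illustration only.

```python
def shift(pattern, offset):
    """Circularly shift a 1-D pattern by `offset` positions (positive = right)."""
    n = len(pattern)
    offset %= n
    return pattern[-offset:] + pattern[:-offset] if offset else list(pattern)

def observe(scene, eye_position):
    """Retinal input: the scene as seen after the eye moved to `eye_position`.
    Moving the eye to the right shifts the image to the left on the retina."""
    return shift(scene, -eye_position)

def stabilize(retinal_input, eye_position):
    """Undo the self-generated shift using the known motor command."""
    return shift(retinal_input, eye_position)

scene = [0, 0, 1, 1, 0, 1, 0, 0]
for eye_position in (0, 1, 3, -2):
    retina = observe(scene, eye_position)
    print(eye_position, retina, stabilize(retina, eye_position))
```

The retinal input differs for every eye position, while the stabilized pattern stays identical to the scene, because the change caused by the movement itself has been accounted for.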

Figure 8: Face as an example of a spatiotemporal concept; micro-saccades are sequences of low-level spatial patterns in the fovea, which can be pooled temporally into a mid-level concept of an eye or a nose; macro-saccades are more task-oriented movements - moving between nose, eyes, mouth

Figure 9: Feedforward and feedback connections’ roles in concept understanding. Despite rapidly changing sensory inputs and different order of observations, the brain is capable of maintaining stable representations at higher levels; low-level predictions depend on context (eye, mouth, face)

E. Spatiotemporal concept

This section proposes that sensorimotor integration may indeed be a more general way in which the brain operates.

Thinking about a motor command in a more abstract way, it is possible to show that, in order to disambiguate multiple predictions, one needs to inject additional context as in Fig. 9. This paper assumes that predictions are associated with some uncertainty(ref1, ; 10.1371/journal.pcbi.1004305, ), as in the Bayesian approach, and that instead of a single point prediction, the distribution is highly multimodal. Additional context is equivalent to integrating evidence, which makes predictions more specific.
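
How additional context sharpens a multimodal prediction can be shown with a small Bayesian update. The outcome labels and all probabilities below are hypothetical values chosen for illustration; the context is treated as a likelihood term that is multiplied into the prior prediction and renormalized.

```python
import math

def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def update(prior, likelihood):
    """Bayes' rule: the posterior is proportional to prior times likelihood."""
    return normalize({k: prior[k] * likelihood[k] for k in prior})

def entropy_bits(dist):
    """Uncertainty of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical multimodal prediction of the next low-level pattern.
prior = {"eye": 0.25, "nose": 0.25, "mouth": 0.25, "ear": 0.25}
# Hypothetical context: how compatible each outcome is with the
# abstract motor command "move the gaze upward from the nose".
context = {"eye": 0.8, "nose": 0.1, "mouth": 0.05, "ear": 0.05}

posterior = update(prior, context)
print(posterior)
print(f"uncertainty before: {entropy_bits(prior):.2f} bits")
print(f"uncertainty after : {entropy_bits(posterior):.2f} bits")
```

Before the context is injected all four outcomes are equally likely (2 bits of uncertainty); afterwards most of the probability mass sits on a single outcome and the uncertainty drops to roughly 1 bit.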

Figure 10: An example of a spatiotemporal concept

The necessity for a means of manipulating a spatiotemporal concept can be illustrated with a simple example. Given the 2 images in Fig. 10, it is obvious how unnatural classification based on the purely spatial aspect of a pattern is. A much more natural way of putting these 2 objects in the same category is by their function, which requires an ability to imagine whether a particular object can be used in a certain way (in this case, to open a door). The same applies to other objects, such as chairs. It is much more natural to learn these concepts as spatiotemporal ideas than with the predominant, purely spatial machine-learning methods (CNNs). When considering the ability to imagine/dream/hallucinate, the commonness of sensorimotor functionality in the brain is not very surprising. The concept of manipulating a compact spatiotemporal thought might be necessary from the perspective of reasoning(bottou-mlj-2013, ) or transfer learning, as the majority of the analogies we make are temporal in nature. The importance of learning transformations in the real world has been recognized in the research community(memisevic2010, ; Boulanger-et-al-ICML2012, ; Sutskever, ; sutskever2008, ; WisSej2002, ; Elman90findingstructure, ; 888, ), but still needs more attention.

F. Context update

The last functional component postulated by this paper is that there exists an infinite (in theory) loop between bottom-up predictions and top-down context. The hypothesis is that such interconnectedness enables perceptual filling in, where higher layers make hypotheses about the inferences coming from the lower layers and the predictions are iteratively refined based on those hypotheses. It may be likened to working memory theory, where non-episodic memories are being held (not involving the hippocampus). An analogy of this is Expectation Maximization or the learning procedure commonly used in Boltzmann Machines, where samples are obtained iteratively by alternating between unit activations on 2 connected layers(series/lncs/Hinton12, ; resnik2010gibbs, ) (see Fig. 11). A real-world analogy of this process is solving a crossword or a sudoku puzzle, or filling in missing words in a sentence. Those problems may require an iterative procedure of refining the solution using intermediate hypotheses. In addition, it links the ideas of objective-less learning, imagination and novel pattern discovery.
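
The alternation between two connected layers can be sketched as block Gibbs sampling in a tiny two-layer binary network, with the observed units clamped so that only the missing ones are filled in. Everything specific in the example is an assumption made for illustration: the weights are set by hand rather than learned, the layer sizes are arbitrary, and the function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_layer(inputs, weights, bias, rng):
    """Sample binary units given the states of the other layer.
    weights[j][i] connects unit j of this layer to unit i of `inputs`."""
    states = []
    for w_row, b in zip(weights, bias):
        p_on = sigmoid(b + sum(w * s for w, s in zip(w_row, inputs)))
        states.append(1 if rng.random() < p_on else 0)
    return states

def fill_in(visible, known, w_hv, b_h, b_v, steps=20, seed=0):
    """Alternate between the hidden and visible layers, keeping the
    observed (known) visible units clamped to their values."""
    rng = random.Random(seed)
    w_vh = [list(col) for col in zip(*w_hv)]  # transpose for the top-down pass
    v = list(visible)
    for _ in range(steps):
        h = sample_layer(v, w_hv, b_h, rng)         # bottom-up: hypothesis
        proposal = sample_layer(h, w_vh, b_v, rng)  # top-down: prediction
        v = [v[i] if known[i] else proposal[i] for i in range(len(v))]
    return v

# Hand-set weights (illustrative): hidden unit 0 supports visible units 0-2,
# hidden unit 1 supports visible units 3-5.
w_hv = [[4, 4, 4, -4, -4, -4],
        [-4, -4, -4, 4, 4, 4]]
b_h = [-2, -2]
b_v = [-2] * 6

visible = [1, 1, 0, 0, 0, 0]                    # unit 2 is unobserved
known = [True, True, False, True, True, True]
print(fill_in(visible, known, w_hv, b_h, b_v))
```

The procedure is stochastic, so the value proposed for the missing unit can vary between runs with different seeds; with these weights the top-down pass favors switching it on whenever the first hidden unit is active, which is the "filling in" behavior described above.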

Figure 11: Illustration of iterative context update, every prediction changes the context slightly and vice-versa


Partial support for this work was provided by the Defense Advanced Research Projects Agency (DARPA). I would like to thank members of the machine intelligence group at IBM Research and Numenta for their suggestions and many interesting discussions.


This paper is not fixed. Its content is subject to iterative local optimization. Any comments or suggestions are welcome.


  • (1) John Haugeland. Artificial Intelligence: The Very Idea. Massachusetts Institute of Technology, Cambridge, MA, USA, 1985.
  • (2) N. J. Nilsson. Principles of artificial intelligence. Morgan Kaufmann, San Francisco, CA, USA, 1980.
  • (3) Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 2 edition, 2003.
  • (4) John Henry Holland. Adaptation in natural and artificial systems : an introductory analysis with applications to biology, control, and artificial intelligence. Complex adaptive systems. Cambridge, Mass. MIT Press, 1992. A Bradford book.
  • (5) D. Simon. Evolutionary Optimization Algorithms. Wiley, 2013.
  • (6) Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
  • (7) Tom M. Mitchell. Machine Learning. WCB McGraw-Hill, 1997.
  • (8) J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. Published online 2014; based on TR arXiv:1404.7828 [cs.NE].
  • (9) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015. Insight.
  • (10) Ian Goodfellow, Aaron Courville, and Yoshua Bengio. Deep learning. Book in preparation for MIT Press, 2015.
  • (11) Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK, May 1989.
  • (12) Gerald Tesauro. Temporal difference learning and TD-Gammon. Commun. ACM, 38(3):58–68, March 1995.
  • (13) Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
  • (14) Richard S. Sutton and Andrew G. Barto. Reinforcement learning i: Introduction, 1998.
  • (15) Masashi Sugiyama. Statistical Reinforcement Learning - Modern Machine Learning Approaches. Chapman and Hall / CRC machine learning and pattern recognition series. CRC Press, 2015.
  • (16) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb 2015. Letter.
  • (17) V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
  • (18) Aleksey Grigorievitch Ivakhnenko and Valentin Grigorievitch Lapa. Cybernetic Predicting Devices. CCM Information Corporation, 1965.
  • (19) Aleksey Grigorievitch Ivakhnenko. Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364–378, 1971.
  • (20) K. Fukushima. Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron. Trans. IECE, J62-A(10):658–665, 1979.
  • (21) Geoffrey Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • (22) M. Liwicki, A. Graves, H. Bunke, and J. Schmidhuber. A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks. In Proceedings of the 9th International Conference on Document Analysis and Recognition, September 2007.
  • (23) A. Mohamed, G. E. Dahl, and G. E. Hinton. Deep belief networks for phone recognition. In NIPS’22 workshop on deep learning for speech recognition, 2009.
  • (24) A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for improved unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 2009.
  • (25) Dan Claudiu Ciresan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, abs/1003.0358, 2010.
  • (26) Grégoire Mesnil, Yann Dauphin, Xavier Glorot, Salah Rifai, Yoshua Bengio, Ian Goodfellow, Erick Lavoie, Xavier Muller, Guillaume Desjardins, David Warde-Farley, Pascal Vincent, Aaron Courville, and James Bergstra. Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7, 2011.
  • (27) Dan Claudiu Ciresan, Ueli Meier, and Jürgen Schmidhuber. Transfer learning for Latin and Chinese characters with deep neural networks. In International Joint Conference on Neural Networks (IJCNN), pages 1301–1306, 2012.
  • (28) M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. Technical Report arXiv:1311.2901 [cs.CV], NYU, 2013.
  • (29) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • (30) Quoc V. Le, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, and Andrew Y. Ng. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning, 2012. 103.
  • (31) A. Coates, B. Huval, T. Wang, D. J. Wu, Andrew Y. Ng, and B. Catanzaro. Deep learning with COTS HPC systems. In Proc. International Conference on Machine learning (ICML’13), 2013.
  • (32) Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks for object detection. pages 2553–2561, 2013.
  • (33) Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proc. 31st International Conference on Machine Learning (ICML), pages 1764–1772, 2014.
  • (34) Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082 v4, 2014.
  • (35) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions, 2014.
  • (36) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. CoRR, abs/1502.03044, 2015.
  • (37) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, pages 1–42, 2014.
  • (38) Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • (39) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • (40) Ian Buck. The Evolution of GPUs for General Purpose Computing, 2010. [Online; accessed 21-November-2015].
  • (41) Kyoung-Su Oh and Keechul Jung. GPU implementation of neural networks. Pattern Recognition, 37(6):1311–1314, 2004.
  • (42) Mark Harris. Fast fluid dynamics simulation on the gpu. In ACM SIGGRAPH 2005 Courses, SIGGRAPH ’05, New York, NY, USA, 2005. ACM.
  • (43) Vasily Volkov and James W. Demmel. Benchmarking gpus to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, pages 31:1–31:11, Piscataway, NJ, USA, 2008. IEEE Press.
  • (44) Kumar Chellapilla, Sidd Puri, and Patrice Simard. High performance convolutional neural networks for document processing. In International Workshop on Frontiers in Handwriting Recognition, 2006.
  • (45) R. Raina, A. Madhavan, and A.Y. Ng. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 873–880. ACM, 2009.
  • (46) R. Uetz and S. Behnke. Large-scale object recognition with CUDA-accelerated hierarchical neural networks. In IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS), 2009.
  • (47) D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Intl. Joint Conference on Artificial Intelligence IJCAI, pages 1237–1242, 2011.
  • (48) D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333–338, 2012.
  • (49) Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
  • (50) Clément Farabet, Yann LeCun, Koray Kavukcuoglu, Eugenio Culurciello, Berin Martini, Polina Akselrod, and Selcuk Talay. Large-scale fpga-based convolutional networks. Machine Learning on Very Large Data Sets, 1, 2011.
  • (51) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey J. Gordon and David B. Dunson, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), volume 15, pages 315–323. Journal of Machine Learning Research - Workshop and Conference Proceedings, 2011.
  • (52) George E. Dahl, Tara N. Sainath, and Geoffrey E. Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8609–8613. IEEE, 2013.
  • (53) Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines.
  • (54) Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.
  • (55) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.
  • (56) James Martens. Deep learning via hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 735–742, 2010.
  • (57) Quoc V. Le, Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, and Andrew Y. Ng. On optimization methods for deep learning. In Lise Getoor and Tobias Scheffer, editors, ICML, pages 265–272. Omnipress, 2011.
  • (58) James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1033–1040, 2011.
  • (59) Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N.d. Lawrence, and K.q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2933–2941. Curran Associates, Inc., 2014.
  • (60) Marcus Hutter. The fastest and shortest algorithm for all well-defined problems. International Journal of Foundations of Computer Science, 13(3):431–443, 2002.
  • (61) Marcus Hutter. On the existence and convergence of computable universal priors. In Proc. 14th International Conf. on Algorithmic Learning Theory (ALT’03), volume 2842 of LNAI, pages 298–312, Sapporo, Japan, 2003. Springer.
  • (62) Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005. 300 pages.
  • (63) J. Schmidhuber. Algorithmic theories of everything, 2000.
  • (64) J. Schmidhuber. The Speed Prior: a new simplicity measure yielding near-optimal computable predictions. In J. Kivinen and R. H. Sloan, editors, Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Lecture Notes in Artificial Intelligence, pages 216–228. Springer, Sydney, Australia, 2002.
  • (65) J. Schmidhuber. Hierarchies of generalized Kolmogorov complexities and nonenumerable universal measures computable in the limit. International Journal of Foundations of Computer Science, 13(4):587–612, 2002.
  • (66) Jeff Hawkins and Sandra Blakeslee. On Intelligence. Times Books, 2004.
  • (67) J. Schmidhuber. A formal theory of creativity to model the creation of art. In Jon McCormack and Mark d’Inverno, editors, Computers and Creativity, pages 323–337. Springer Berlin Heidelberg, 2012.
  • (68) R. Kurzweil. How to Create a Mind: The Secret of Human Thought Revealed. Penguin Publishing Group, 2012.
  • (69) Christof Koch and Idan Segev. Methods in neuronal modeling: from ions to networks. MIT press, 1998.
  • (70) Jeff Hawkins and Subutai Ahmad. Why neurons have thousands of synapses, a theory of sequence memory in neocortex. arXiv preprint arXiv:1511.00083, 2015.
  • (71) Christof Koch. Project mindscope. Frontiers in Computational Neuroscience, (33).
  • (72) S. Seung. Connectome: How the Brain’s Wiring Makes Us who We are. A Mariner Book. Houghton Mifflin Harcourt, 2012.
  • (73) Henry Markram. The human brain project. Scientific American, 306(6):50–55, 2012.
  • (74) Paul A. Merolla, John V. Arthur, Rodrigo Alvarez-Icaza, Andrew S. Cassidy, Jun Sawada, Filipp Akopyan, Bryan L. Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, Bernard Brezzo, Ivan Vo, Steven K. Esser, Rathinakumar Appuswamy, Brian Taba, Arnon Amir, Myron D. Flickner, William P. Risk, Rajit Manohar, and Dharmendra S. Modha. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.
  • (75) Steve Furber, David R. Lester, Luis A. Plana, Jim D. Garside, Eustace Painkras, Steve Temple, and Andrew D. Brown. Overview of the spinnaker system architecture. IEEE Trans. Computers, 62(12):2454–2467, 2013.
  • (76) J. Schemmel, D. Brüderle, A. Grübl, M. Hock, K. Meier, and S. Millner. A wafer-scale neuromorphic hardware system for large-scale neural modeling. Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS’10), pages 1947–1950, 2010.
  • (77) Giacomo Indiveri, Bernabé Linares-Barranco, Tara Julia Hamilton, André Van Schaik, Ralph Etienne-Cummings, Tobi Delbruck, Shih-Chii Liu, Piotr Dudek, Philipp Häfliger, Sylvie Renaud, et al. Neuromorphic silicon neuron circuits. Frontiers in neuroscience, 5, 2011.
  • (78) Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. Solid-State Circuits, IEEE Journal of, 43(2):566–576, 2008.
  • (79) Guy Rachmuth, Harel Z Shouval, Mark F Bear, and Chi-Sang Poon. A biophysically-based neuromorphic model of spike rate-and timing-dependent plasticity. Proceedings of the National Academy of Sciences, 108(49):E1266–E1274, 2011.
  • (80) Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. CoRR, abs/1206.5538, 2012.
  • (81) Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Also published as a book. Now Publishers, 2009.
  • (82) Yann LeCun. What’s Wrong With Deep Learning?, 2015. [Online; accessed 20-November-2015].
  • (83) Gökhan H. Bakir, Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Ben Taskar, and S. V. N. Vishwanathan. Predicting Structured Data (Neural Information Processing). The MIT Press, 2007.
  • (84) A. Ng. The Man Behind the Google Brain: Andrew Ng and the Quest for the New AI, July 2013.
  • (85) G. E. Hinton and T. E. Sejnowski. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing, volume 1, pages 282–317. MIT Press, 1986.
  • (86) P. Domingos. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Penguin Books Limited, 2015.
  • (87) Vernon B. Mountcastle. An organizing principle for cerebral function: The unit model and the distributed system. In Gerald M. Edelman and Vernon V. Mountcastle, editors, The Mindful Brain, pages 7–50. MIT Press, Cambridge, MA, 1978.
  • (88) Anna W Roe, Sarah L Pallas, Young H Kwon, and Mriganka Sur. Visual projections routed to the auditory pathway in ferrets: receptive fields of visual neurons in primary auditory cortex. The Journal of neuroscience, 12(9):3651–3664, 1992.
  • (89) G. Marcus. How does the mind work? insights from biology. 1(1):145–172, 2009.
  • (90) J. Schmidhuber. Simple algorithmic principles of discovery, subjective beauty, selective attention, curiosity & creativity. In Proc. 18th Intl. Conf. on Algorithmic Learning Theory (ALT 2007), LNAI 4754, pages 32–33. Springer, 2007. Joint invited lecture for ALT 2007 and DS 2007, Sendai, Japan, 2007.
  • (91) Geoffrey E. Hinton. Learning Representations by Unlearning Beliefs, 2003. [Online; accessed 23-November-2015].
  • (92) David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
  • (93) G.E. Hinton and T.J. Sejnowski. Unsupervised Learning: Foundations of Neural Computation. A Bradford Book. MCGRAW HILL BOOK Company, 1999.
  • (94) D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat’s striate cortex. J. Physiol., 148:574–591, 1959.
  • (95) Aapo Hyvärinen, Jarmo Hurri, and Patrick O Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, volume 39. Springer Science & Business Media, 2009.
  • (96) Yoshua Bengio and Martin Monperrus. Discovering shared structure in manifold learning. 2004.
  • (97) H. B. Barlow. Unsupervised learning. Neural Computation, 1(3):295–311, 1989.
  • (98) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, August 2013.
  • (99) Yoshua Bengio. Deep learning of representations: Looking forward. In Statistical Language and Speech Processing, pages 1–37. Springer, 2013.
  • (100) Li Deng and Dong Yu. Deep learning: methods and applications. Foundations and Trends in Signal Processing, 7(3–4):197–387, 2014.
  • (101) Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625–660, 2010.
  • (102) Anthony J. Bell and Terrence J. Sejnowski. The ‘independent components’ of natural scenes are edge filters. VISION RESEARCH, 37:3327–3338, 1997.
  • (103) Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59(8):1207–1223, August 2006.
  • (104) D. L. Donoho. Compressed sensing. IEEE Trans. Inf. Theor., 52(4):1289–1306, April 2006.
  • (105) Subutai Ahmad and Jeff Hawkins. Properties of sparse distributed representations and their application to hierarchical temporal memory, 2015.
  • (106) H. B. Barlow, T. P. Kaushal, and G. J. Mitchison. Finding minimum entropy codes. Neural Computation, 1(3):412–423, 1989.
  • (107) Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent component analysis. John Wiley & Sons, 2001.
  • (108) R. J. Solomonoff. A formal theory of inductive inference. Part I. Information and Control, 7:1–22, 1964.
  • (109) J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
  • (110) Lawrence K Saul and Sam T Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. The Journal of Machine Learning Research, 4:119–155, 2003.
  • (111) Ming Li and Paul M.B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer Publishing Company, Incorporated, 3 edition, 2008.
  • (112) H. Barlow. Redundancy reduction revisited. Network: Computation in Neural Systems, 12(3):241–253, 2001. PMID: 11563528.
  • (113) P. Földiák and M. P. Young. Sparse coding in the primate cortex. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 895–898. The MIT Press, 1995.
  • (114) Pentti Kanerva. Sparse Distributed Memory. MIT Press, Cambridge, MA, USA, 1988.
  • (115) Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: a strategy employed by v1. Vision Research, 37:3311–3325, 1997.
  • (116) G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chapter Distributed Representations, pages 77–109. MIT Press, Cambridge, MA, USA, 1986.
  • (117) David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
  • (118) Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann LeCun. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems (NIPS 2006). MIT Press, 2006.
  • (119) Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems (NIPS) 19, pages 801–808, 2007.
  • (120) Honglak Lee, Chaitanya Ekanadham, and Andrew Y Ng. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems (NIPS), volume 7, pages 873–880, 2007.
  • (121) G. E. Hinton and Z. Ghahramani. Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B, 352:1177–1190, 1997.
  • (122) Geoffrey E Hinton. Learning multiple layers of representation. Trends in cognitive sciences, 11(10):428–434, 2007.
  • (123) Bruno A Olshausen and David J Field. Sparse coding of sensory inputs. Current opinion in neurobiology, 14(4):481–487, 2004.
  • (124) Marc’Aurelio Ranzato, Y-Lan Boureau, and Yann LeCun. Sparse feature learning for deep belief networks. In Advances in Neural Information Processing Systems (NIPS 2007), 2007.
  • (125) Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 609–616, 2009.
  • (126) Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Winter 1989.
  • (127) Arthur Earl Bryson and Yu-Chi Ho. Applied optimal control: optimization, estimation, and control, 1969.
  • (128) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Neurocomputing: Foundations of research. chapter Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, Cambridge, MA, USA, 1988.
  • (129) Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. CoRR, abs/1412.1897, 2014.
  • (130) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
  • (131) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • (132) J.R. Searle. Minds, Brains, and Science. Reith lectures. Harvard University Press, 1984.
  • (133) Kenneth O. Stanley and Joel Lehman. Why Greatness Cannot Be Planned - The Myth of the Objective. Springer, 2015.
  • (134) Jimmy Secretan, Nicholas Beato, David B. D’Ambrosio, Adelein Rodriguez, Adam Campbell, Jeremiah T. Folsom-Kovarik, and Kenneth O. Stanley. Picbreeder: A case study in collaborative evolutionary exploration of design space. Evol. Comput., 19(3):373–403, September 2011.
  • (135) IJ Goodfellow, M Mirza, D Xiao, Aaron Courville, and Yoshua Bengio. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. TR arXiv:1312.6211v2, 2014.
  • (136) D. B. Chklovskii, B. W. Mel, and K. Svoboda. Cortical rewiring and information storage. Nature, 431(7010):782–788, Oct 2004.
  • (137) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In NIPS’12, pages 1232–1240, 2012.
  • (138) D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):409–10, 1998.
  • (139) Dharmendra S Modha and Raghavendra Singh. Network architecture of the long-distance pathways in the macaque brain. Proceedings of the National Academy of Sciences, 107(30):13485–13490, 2010.
  • (140) Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. CoRR, abs/1502.02551, 2015.
  • (141) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Low precision arithmetic for deep learning. CoRR, abs/1412.7024, 2014.
  • (142) Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
  • (143) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014.
  • (144) Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1058–1066. JMLR Workshop and Conference Proceedings, May 2013.
  • (145) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. arXiv preprint arXiv:1511.00363, 2015.
  • (146) Oreste Villa, Daniel R. Johnson, Mike O’Connor, Evgeny Bolotin, David Nellans, Justin Luitjens, Nikolai Sakharnykh, Peng Wang, Paulius Micikevicius, Anthony Scudiero, Stephen W. Keckler, and William J. Dally. Scaling the power wall: A path to exascale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pages 830–841, Piscataway, NJ, USA, 2014. IEEE Press.
  • (147) Paul Dlugosch, Dean Brown, Paul Glendenning, Michael Leventhal, and Harold Noyes. An efficient and scalable semiconductor architecture for parallel automata processing. Parallel and Distributed Systems, IEEE Transactions on, 25(12):3088–3098, 2014.
  • (148) Marc Baboulin, Simplice Donfack, Jack Dongarra, Laura Grigori, Adrien Rémy, and Stanimire Tomov. A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines. Procedia Computer Science, 9:17–26, 2012. Proceedings of the International Conference on Computational Science, ICCS 2012.
  • (149) R. Linsker. From basic network principles to neural architecture: emergence of orientation-selective cells. Proc Natl Acad Sci U S A, 83(21):8390–8394, Nov 1986. PMID: 3464958.
  • (150) M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1999.
  • (151) Juyang Weng, Narendra Ahuja, and Thomas S Huang. Cresceptron: a self-organizing neural network which grows adaptively. In International Joint Conference on Neural Networks (IJCNN), volume 1, pages 576–581. IEEE, 1992.
  • (152) N. Mesgarani, S. David, and S. Shamma. Representation of phonemes in primary auditory cortex: How the brain analyzes speech. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV–765–IV–768, April 2007.
  • (153) Karl Friston. The free-energy principle: a unified brain theory? Nat Rev Neurosci, 11(2):127–138, Feb 2010.
  • (154) L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
  • (155) Dong-Hyun Lee, Saizheng Zhang, Antoine Biard, and Yoshua Bengio. Target propagation. CoRR, abs/1412.7525, 2014.
  • (156) Geoffrey Hinton. How to do backpropagation in a brain.
  • (157) Victor AF Lamme, Hans Supèr, and Henk Spekreijse. Feedforward, horizontal, and feedback processing in the visual cortex. Current Opinion in Neurobiology, 8(4):529–535, 1998.
  • (158) D. O. Hebb. The Organization of Behavior. Wiley, New York, 1949.
  • (159) Joseph Cichon and Wen-Biao Gan. Branch-specific dendritic Ca2+ spikes cause persistent synaptic plasticity. Nature, 520(7546):180–185, Apr 2015.
  • (160) Min Fu and Yi Zuo. Experience-dependent structural plasticity in the cortex. Trends Neurosci, 34(4):177–187, Apr 2011. PMID: 21397343.
  • (161) Matthew Larkum. A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex. Trends in Neurosciences, 36(3):141–151, 2013.
  • (162) Matthew E. Larkum, Thomas Nevian, Maya Sandler, Alon Polsky, and Jackie Schiller. Synaptic integration in tuft dendrites of layer 5 pyramidal neurons: A new unifying principle. Science, 325(5941):756–760, 2009.
  • (163) Michael Häusser and Bartlett Mel. Dendrites: bug or feature? Curr Opin Neurobiol, 13(3):372–383, June 2003.
  • (164) Roland Memisevic. Representing relations, 2014. [Online; accessed 27-November-2015].
  • (165) Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. CoRR, abs/1410.5401, 2014.
  • (166) Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916, 2014.
  • (167) Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, KyungHyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. CoRR, abs/1506.07503, 2015.
  • (168) Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
  • (169) Randall O’Reilly. Making working memory work: A computational model of learning in the prefrontal cortex and basal ganglia. Technical Report ICS-03-03, ICS, June 2003.
  • (170) Wolfgang Maass. Networks of spiking neurons: The third generation of neural network models. Trans. Soc. Comput. Simul. Int., 14(4):1659–1671, December 1997.
  • (171) Henry Markram, Joachim Lübke, Michael Frotscher, and Bert Sakmann. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297):213–215, 1997.
  • (172) Berthold Ruf and Michael Schmitt. Learning temporally encoded patterns in networks of spiking neurons. Neural Processing Letters, 5(1):9–18, 1997.
  • (173) H. B. Barlow, T. P. Kaushal, and G. J. Mitchison. Finding minimum entropy codes. Neural Computation, 1(3):412–423, 1989.
  • (174) Karol Gregor and Yann LeCun. Learning representations by maximizing compression. CoRR, abs/1108.1169, 2011.
  • (175) D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. Trans. Evol. Comp, 1(1):67–82, April 1997.
  • (176) Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Léon Bottou, Olivier Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, 2007.
  • (177) J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234–242, 1992.
  • (178) Martin Rolfs, Donatas Jonikaitis, Heiner Deubel, and Patrick Cavanagh. Predictive remapping of attention across eye movements. Nat Neurosci, 14(2):252–256, Feb 2011.
  • (179) Eileen Kowler. Eye movements: The past 25 years. Vision Research, 51(13):1457–1483, 2011. Vision Research 50th Anniversary Issue: Part 2.
  • (180) A. Ude, C. Gaskett, and G. Cheng. Support vector machines and Gabor kernels for object recognition on a humanoid with active foveated vision. In Proceedings of the IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS 2004), Sendai, Japan, 2004.
  • (181) Marc’Aurelio Ranzato. On learning where to look. CoRR, abs/1405.5488, 2014.
  • (182) J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135–141, 1991.
  • (183) Hugo Larochelle and Geoffrey E. Hinton. Learning to combine foveal glimpses with a third-order boltzmann machine. In J.D. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1243–1251. Curran Associates, Inc., 2010.
  • (184) Yuwei Cui, Subutai Ahmad, Chetan Surpur, and Jeff Hawkins. Maintaining stable perception during active exploration. In Computational and Systems Neuroscience, 2015.
  • (185) Patrik Krieger and Alexander Groh. Sensorimotor Integration in the Whisker System. Springer Publishing Company, Incorporated, 1st edition, 2015.
  • (186) Mathew E. Diamond, Moritz von Heimendahl, Per Magne Knutsen, David Kleinfeld, and Ehud Ahissar. ‘Where’ and ‘what’ in the whisker sensorimotor system. Nat Rev Neurosci, 9(8):601–612, Aug 2008.
  • (187) Florent Meyniel, Mariano Sigman, and Zachary F. Mainen. Confidence as Bayesian probability: From neural origins to behavior. Neuron, 88(1):78–92, 2015.
  • (188) Florent Meyniel, Daniel Schlunegger, and Stanislas Dehaene. The sense of confidence during probabilistic learning: A normative account. PLoS Comput Biol, 11(6):e1004305, 06 2015.
  • (189) Léon Bottou. From machine learning to machine reasoning: an essay. Machine Learning, 94:133–149, January 2014.
  • (190) Roland Memisevic and Geoffrey E. Hinton. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492, 2010.
  • (191) Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the Twenty-ninth International Conference on Machine Learning (ICML’12). ACM, 2012.
  • (192) Ilya Sutskever and Geoffrey Hinton. Learning multilevel distributed representations for high-dimensional sequences. AISTATS, 2007.
  • (193) Ilya Sutskever, Geoffrey E Hinton, and Graham W Taylor. The recurrent temporal restricted Boltzmann machine. In NIPS, volume 21, page 2008, 2008.
  • (194) Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
  • (195) Geoffrey E. Hinton. A practical guide to training restricted Boltzmann machines. In Grégoire Montavon, Genevieve B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade (2nd ed.), volume 7700 of Lecture Notes in Computer Science, pages 599–619. Springer, 2012.
  • (196) Philip Resnik and Eric Hardisty. Gibbs sampling for the uninitiated. Technical report, DTIC Document, 2010.