1 Introduction
Learning the fundamental features that generate sensory signals and having the ability to infer these features once given the sensory input is a crucial aspect of information processing. It would not be far fetched to hypothesize that the organization of cortical circuitry is largely evolved for performing such tasks, or to put it slightly differently, that our evolutionary history has stored memories of such features in the connections that form a large part of our brains. But how can a neuronal network learn and extract features that cause the experiences of our sensory organs, in a supervised or unsupervised manner? Answering this question has been, and still is, a fundamental task of theoretical and experimental neuroscientists and those interested in artificial intelligence. And as one can imagine, it is a question that can lead, and in fact already has contributed, to deep insights about the organization of cortical microcircuitry and the way organisms understand their surrounding environment.
Over the past decades, it has become clear that the processing of sensory stimuli in the nervous system is performed in a hierarchical manner. Although hierarchical models of information processing leading to behaviour had already been suggested in the 1950s mackay1956towards, the experimental work of Hubel and Wiesel hubel1963shape and the discovery of simple and complex cells inspired much activity in studying hierarchical models of cortical processing at various degrees of abstraction, from high-level algorithmic models to neural network models. On the algorithmic side, there was the work pioneered by David Marr discussing the processing stages required for building a 3D perception of objects from the stimulus on the retina marr1978representation. On the implementation side, mechanistic neural network models of visual pattern recognition were proposed fukushima1980neocognitron. All these studies modeled sensory processing with a hierarchy of stages, each capturing the basic features in the input that are necessary for further analysis by higher levels. Initially, this body of work did not include any probabilistic concepts: both the inputs and the computations performed at various stages were deterministic in nature, although some even earlier work did consider the probabilistic aspects of the sensory stimulus to be an important feature governing the design of sensory information processing attneave1954some ; barlow1961possible. Over the years, more models emerged that put greater emphasis on the probabilistic nature of stimuli. These models take the view that learning and inference of features of sensory input in a neuronal network are best understood by treating each stage of the processing hierarchy as performing probabilistic inference, as depicted in the cartoon in Fig. 1A. In this setting, one assumes that the sensory stimuli received by the sensory organs are generated through a probabilistic process as a combination of usually hidden causes, which are then transmitted to the brain and need to be interpreted. By interpretation here, we mean the ability to use the sensory input to assign likelihoods to various combinations of the probabilistic causes that could have generated the input. The job of cortical processing layers and stages is to infer these probabilistic hidden causes, essentially by learning a generative model of the observed data.
Building a successful model of cortical processing within this framework requires us to understand how, in a hierarchy of cortical networks, the synaptic weights can be adjusted so that the probabilistic mapping from sensory input to the underlying hidden causes that generated the input can be accomplished.
Although the initial ideas of hierarchical processing of sensory stimuli arose in the artificial intelligence community with the goal of ultimately understanding the operations of the nervous system, most of the recent work and breakthroughs on the subject have come from machine learning and, more recently, the deep learning community, without much interest in understanding the brain. It is probably now time to reexamine these recent findings to see what their implications are for the nervous system. As we discuss below, some of the findings in the deep learning literature offer new directions of research and enquiry in neuroscience, both at an algorithmic level and for biological implementation. In what follows, after a brief review of older relevant work on learning in neural networks, we describe a number of studies that have been responsible for the recent advances in learning in hierarchical stochastic models in the deep learning community. We then emphasize three issues which we think are of particular interest to neuroscience, both for the insight they already offer and for the interesting research questions they raise: the biological plausibility of recent learning algorithms, the role of the single neuron input-output function in the success of learning, and learning from dynamic data.
2 Learning hidden structure with hidden neurons
A large body of early work in the neuroscience of learning hidden structures focused on networks with less emphasis on the probabilistic and unsupervised nature of the processing than is common today and than in the work we discuss in the next section. In this earlier work, one considers neuronal networks with potentially multiple hidden layers that progressively extract information about the input to perform an action, usually rewarded by a supervisory signal. The job of the learning rule was to adjust the synaptic weights such that the last stages of processing can, e.g., identify different input patterns as belonging to different categories, as exemplified by a set of training data. In the 1980s, the backpropagation algorithm rummelhart1986learning generated significant excitement in the field as a potential mechanism for performing this task: the error between the present output and the desired output of a neuron is calculated starting from the final layer, propagated backwards through the network and used to adjust the weights. The algorithm, however, proved to be time consuming and easily stuck in unfavorable configurations of weights, quite apart from the fact that its biological implementation was questionable. Consequently, there was a shift to other machine learning frameworks, such as support vector machines, with even more questionable biological implementations but ultimately useful for machine learning tasks.
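The backward propagation of errors described above can be sketched for a tiny network with one hidden layer. This is a minimal illustration using only the Python standard library, not the implementation of any cited work; the network size, the squared-error loss, the XOR toy data and all function names are assumptions made for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, W1, W2):
    """Return the hidden activations and the scalar output for input x."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sigmoid(sum(w * hj for w, hj in zip(W2, h)))
    return h, y


def loss(data, W1, W2):
    """Summed squared error between network outputs and targets."""
    return 0.5 * sum((forward(x, W1, W2)[1] - t) ** 2 for x, t in data)


def gradients(data, W1, W2):
    """Propagate the output error backwards to get dLoss/dW1 and dLoss/dW2."""
    g1 = [[0.0] * len(row) for row in W1]
    g2 = [0.0] * len(W2)
    for x, t in data:
        h, y = forward(x, W1, W2)
        delta_out = (y - t) * y * (1.0 - y)  # error at the output unit
        for j, hj in enumerate(h):
            g2[j] += delta_out * hj
            # error passed back through weight W2[j] to hidden unit j
            delta_hid = delta_out * W2[j] * hj * (1.0 - hj)
            for i, xi in enumerate(x):
                g1[j][i] += delta_hid * xi
    return g1, g2


def init_weights(n_in, n_hidden, seed=0):
    """Draw small random initial weights (hypothetical initialization)."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    return W1, W2


def gradient_step(data, W1, W2, lr):
    """Return new weights after one full-batch gradient descent step."""
    g1, g2 = gradients(data, W1, W2)
    new_W1 = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, g1)]
    new_W2 = [w - lr * g for w, g in zip(W2, g2)]
    return new_W1, new_W2


# Toy data (hypothetical): XOR, with a constant 1 appended as a bias input.
XOR = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 0)]
```

Repeatedly calling gradient_step on XOR lowers the loss, but depending on the initial weights the procedure may settle in a poor configuration, which is the kind of difficulty mentioned in the text.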
In addition to backpropagation and other supervised learning algorithms, unsupervised learning, in particular in probabilistic settings, also emerged in the 1980s and 1990s. The Boltzmann Machine was proposed by Ackley, Hinton and Sejnowski in 1985 Ackley85 as a model for learning to extract features from probabilistic sensory input with a particularly simple learning rule. Yet this network also proved to be very time consuming to train, although faster approximations were proposed. For instance, Kappen and Rodriguez Kappen98 and Tanaka Tanaka98 employed the susceptibility-response theorem from statistical physics combined with the mean-field solutions of the Ising model to devise algorithms for learning the weights in a fully observed Boltzmann Machine. Although the results were very promising, this line of work was not extended to the case with hidden nodes, and the use of mean-field methods (without any linear response considerations) has not been as successful galland1993limitations.
Probably the first influential network architecture proposed for unsupervised learning of structure from data that used a computationally and biologically relevant learning algorithm is the Helmholtz Machine dayan1995helmholtz ; hinton1995wake. In the Helmholtz Machine, there is no supervisory signal, and one takes the view that the things observed in the world are samples drawn from a probability distribution, the generative model, and that the job of the nervous system is to learn this distribution by learning features that represent the data. This was carried out by a simplified version of the model in Fig. 1A, shown in Fig. 1B, which is composed of a layer of observed neurons coupled to one or more layers of hidden neurons but without intralayer connections. The interlayer connections are divided into two sets: a set going in the top-down direction, called the generative weights, and a set going the opposite way, called the recognition weights. As the name suggests, when the system is run with the generative weights, the samples generated in the bottom visible layer are intended to be samples from a distribution in the real world. To learn the generative weights, one uses the recognition mode to generate samples given data from the real world, and adjusts the generative weights according to these samples. The opposite is done for learning the recognition weights. Although the Helmholtz Machine is a powerful model both practically and conceptually, training it with many hidden layers is not easy. It can, however, be considered a predecessor of the Deep Belief Network, which proved to be a very successful and easily learnable model of complex data by employing a smart pretraining regime followed by fine-tuning of the weights using the wake-sleep algorithm. We elaborate on this model below.
3 Modern ways of learning in deep architectures
Although the previous attempts at building learning rules for networks with hidden layers all faced some difficulties, more recent work on the subject, starting from the seminal work in 2006 hinton2006fast, employed a better understanding of the dynamics of networks with hidden nodes, combined with faster computers and larger data sets, to successfully train networks with hidden layers to extract hidden features in data. New learning algorithms have been proposed for using Deep Belief Networks and Deep Boltzmann Machines to extract information about the sensory signals in unsupervised, supervised, and semi-supervised learning tasks, achieving unprecedented performance levels on industrially relevant applications. After briefly reviewing these architectures and the learning rules used for training them, we review some of the success stories most relevant to neuroscience in recent years.
The introduction of Deep Belief Networks (DBNs) (Fig. 1C) by Hinton et al. hinton2006fast is widely viewed as the breakthrough that stimulated the modern view of Deep Learning. Up to this point, classical multilayer architectures such as deep neural networks were known to the machine learning community but were universally accepted as difficult to train in practice for all but the simplest of problems. Hinton and colleagues showed that a deep neural network could be grown incrementally, by composing single-layer models called Restricted Boltzmann Machines (RBMs). This simple symmetric neural network had been introduced many years earlier smolensky1986 but did not receive much attention until the early 2000s, when the contrastive divergence learning algorithm was introduced Hinton2002, making approximate learning tractable.
To train a DBN, the individual RBMs were trained in sequence, each model learning to represent the one before it. Once this greedy initialization strategy was completed, the model could be fine-tuned end-to-end. Labels could be introduced in this second stage of learning. By first discovering representations, learning became effective and efficient. This pretraining strategy has been empirically shown to lead to models that generalize better and that seem to be more robust to random initializations erhan2010does. While Hinton's group at the University of Toronto focused on RBMs as a basic building block, other groups proposed different unsupervised architectures. Bengio's group at the University of Montreal considered types of autoencoders, explained in more detail below Larochelle2007 ; Vincent2008. LeCun's group at New York University (NYU) focused on sparsity as an alternative feature learning strategy Ranzato2007 ; Ranzato2008 ; Kavukcuoglu2008.
Once a Deep Belief Network has been composed by stacking, it is what is called a hybrid directed-undirected model. That is because its topmost two layers, which form an autoassociative memory, are symmetrically connected like an RBM, but the lower layers are connected via directed links. Together, they form the generative model. The DBN also contains a set of separate, bottom-up connections for fast approximate inference. They are not part of the generative model but, similar to the Helmholtz Machine, form a recognition model. It is this fast approximate inference which made DBNs very popular for several years, but it is also what limits them. It is well known that the brain's wiring consists of feedforward, lateral and feedback connections. However, inference and generation in a DBN involve no feedback at all.
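As an illustration of the RBM building block and the contrastive divergence learning mentioned above, the following is a minimal sketch of one-step contrastive divergence (CD-1) for a binary RBM, written with only the Python standard library. The layer sizes, learning rate, initialization and function names are assumptions chosen for the example, and practical implementations use mini-batches and vectorized arithmetic.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c):
    """P(h_j = 1 | v); W[i][j] couples visible unit i to hidden unit j."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    """P(v_i = 1 | h), using the same symmetric weights W."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def sample(probs, rng):
    """Draw independent binary units with the given firing probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_update(v0, W, b, c, lr, rng):
    """One CD-1 step on a single binary data vector (updates in place)."""
    ph0 = hidden_probs(v0, W, c)                 # positive (data) phase
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, b), rng)    # one-step reconstruction
    ph1 = hidden_probs(v1, W, c)                 # negative phase
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])


def train_rbm(data, n_hidden, lr=0.1, epochs=200, seed=0):
    """Train a binary RBM with CD-1; returns weights and both bias vectors."""
    rng = random.Random(seed)
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    b = [0.0] * n_visible
    c = [0.0] * n_hidden
    for _ in range(epochs):
        for v in data:
            cd1_update(v, W, b, c, lr, rng)
    return W, b, c
```

In a greedy stacking scheme of the kind described in the text, the hidden probabilities returned by hidden_probs for each training vector would serve as the training data for the next RBM in the stack.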
Salakhutdinov and Hinton salakhutdinov2009deep showed how to train a related but different model called a Deep Boltzmann Machine (DBM), in which feedback connections are important. In contrast to the hybrid DBN, the DBM is completely undirected. Like the DBN, its posterior distribution is intractable and must be approximated; however, the approximation is more powerful and more complicated than the feedforward recognition model of the DBN. Training the DBM involves a multistep approach. It is first pretrained, initialized from RBMs, much like the DBN. It is then generatively fine-tuned using a variational approximation to the log-likelihood. Though the DBM has been shown to be useful in several applications, such as multimodal learning srivastava2012multimodal and one-shot learning with Bayesian priors salakhutdinov2013learning, the standard approach to training a DBM is cumbersome. It requires training multiple models using different objective functions, and without pretraining, learning fails to produce a good model of the data. In response, current research is focused on the discovery of effective methods for the joint training of all layers of a deep unsupervised model goodfellow2013multi.
The last few years have seen a shift away from unsupervised learning back towards supervised learning, with an emphasis on an architecture called a convolutional network (convnet) lecun1998gradient: a multilayer neural architecture, inspired by Hubel and Wiesel's model of the visual cortex and early computational models such as Fukushima's Neocognitron fukushima1980neocognitron. The architecture mainly consists of repeated stages of filtering (efficiently implemented by convolutions) and pooling, akin to the operation of simple and complex cells, respectively. Though proposed in the 1980s, the model has only reached widespread adoption in the last few years, a result of larger datasets and faster computers krizhevsky2012imagenet. Architectural improvements, such as units that do not saturate and the use of normalization layers, have also been important. Convnets have been applied to video Karpathy2014 ; Simonyan14b, image captioning mao2014deep ; xu2015show ; vinyals2014show and the labeling of entire scenes farabet2013learning. The ImageNet dataset is commonly used for supervised pretraining of filters to be used on tasks for which large labeled datasets are not available razavian2014cnn.
Despite the massive popularity of convnets, and although the motivation for unsupervised pretraining has been tempered by the success of purely supervised learning, unsupervised learning has not been abandoned. Following the dominance of RBMs in the mid-2000s, an unsupervised learning architecture known as the autoencoder has received increasing attention. An autoencoder is simply a feedforward neural network that performs unsupervised learning by being trained to reconstruct its input from a learned representation. Classical autoencoders use a bottleneck hidden layer, in which the number of hidden units is smaller than the number of inputs, in order to avoid learning a trivial representation: simply a copy of the input, which can be perfectly reconstructed. Recent work has explored alternative types of autoencoders which regularize or restrict the hidden representation, for example by injecting noise (Denoising Autoencoders Vincent2008), making the hidden units sparse kavukcuoglu2010fast, or making the hidden units less sensitive to their input (Contractive Autoencoders Rifai2011). Another body of work has generalized autoencoders to generative models bengio2014deep. Generative Stochastic Networks define a generative model through a Markov chain that injects noise at every stage of an iterated reconstruction. Like autoencoders, the model is easily trained with backpropagation, yet it is more capable of modeling multimodal data distributions. A more recent line of work has revisited the Helmholtz Machine architecture but improves on the inference procedure by correcting the gradient of the approximating posterior distribution. Several such deep variational networks Rezendeetalarxiv2014 ; Kingma+WellingICLR2014 ; Mnih+GregorICML2014 were proposed concurrently by different groups and are considered the state of the art in deep generative models.
4 Implications for cortical organization and plasticity
How can the work in deep learning affect the work by theoretical and experimental neuroscientists aiming at understanding of the nervous system? Most of the work in deep learning so far has not had a direct dialogue with neuroscience and already some of the recent directions taken in deep learning are moving towards architectures and findings obviously away from our current understanding of the nervous system. In our opinion, however, there are already several findings and research questions which can and should inspire more work in neuroscience. We discuss three issues in this direction below.
Role of single neurons. Due to the growing popularity of deep neural networks, many improvements have been proposed, both at an architectural level and in training methodology. Two such discoveries, described in more detail in Box 1, stand out in terms of their widespread adoption in modern deep neural networks, and both give insight into potential biological implementations: (1) the effect of the nonlinearities used by individual neurons, and (2) a regularization technique called dropout.
The observation that single neuron nonlinearity plays an important role in the success of learning in deep architectures offers a potentially fundamental insight into the mechanisms of learning in the cortical hierarchy: what the optimal synaptic weight changes are, or whether such changes exist at all, is fundamentally tied to single neuron properties. In retrospect this might sound like a trivial statement, but it is one that has been largely ignored. In the 1980s, following the work of John Hopfield hopfield1982neural, Daniel Amit amit1985spin and others, statistical physicists showed interest in the study of neural networks with binary neurons. The presence of a spin glass phase MezardParisiVirasoro, in which the network exhibited a large number of local minima, none of which was related to the stored memories, was noticed and was probably an attractive ground for work by statistical physicists. Already then, however, it was noted that this phase is reduced in size in neuronal networks with sigmoid or piecewise-linear transfer functions kuhn1991statistical and largely suppressed in networks of threshold-linear rectifiers treves1991spin, which were also deemed more biologically plausible. This, intuitively, arises because the maximum slope of the transfer function acts as an effective inverse temperature, meaning that for e.g. rectifier neurons the network can settle into a spin glass phase (usually a low temperature phase) at possibly higher temperatures. Since the energy landscape of the system in the spin glass phase is presumed to be heavily rugged, this in turn implies that the energy landscape of networks with graded response is much less rugged than that of their binary counterparts. We note that since these studies were concerned with models of the retrieval of already stored memories, the implications of these findings for learning in neuronal networks were not explored further.
It is probably now time to take this issue up again and use the techniques of statistical physics to understand what single neuron properties and their relationship to the frustrated phases mean for learning in deep cortical models.
Learning kinetics with hidden nodes. Sensory signals are rarely static in real life, and the learning of relevant features has to take these dynamics into account. Although the overwhelming majority of the work discussed above, both by neuroscientists and by machine learning researchers, is focused on static data, more recent work, as we describe below, has focused on learning models in which these kinetic aspects are taken into account, in some cases resulting in relatively simple learning algorithms. Given the dynamic nature of inputs to the brain, deep dynamic architectures should offer another interesting area of research for understanding learning in cortical circuits.
Shortly after the discovery of DBNs, Restricted Boltzmann Machines were extended to capture temporal dependencies. The key insight was to condition both the observed and latent variables on fixed-length windows of the past instances of these variables. However, exact inference in such models is extremely hard, and it is therefore computationally intractable to learn such models exactly. Sutskever and Hinton Sutskever2007a, focusing on modeling synthetic video, proposed an approximate training strategy. Taylor et al. taylor2006modeling, focusing on modeling human motion capture, proposed an architecture that avoided explicit temporal connections among latent variables. Sutskever et al. later proposed a third option, a type of recurrent RBM architecture for which exact inference is straightforward while full connections among latent variables are maintained sutskever2009recurrent. Temporal variants of RBMs have been applied to 3D tracking of people in video taylor2010dynamical, autotagging music mandel2011contextual, and facial expression transfer zeiler2011facial.
Dynamical generative models are the same mathematical objects as the kinetic models used in statistical physics to describe the dynamics of complex systems, e.g. spin glasses, particles in random media, etc. In particular, the kinetic Ising model studied in spin glass physics is essentially a generalized linear model with a logit link function. It can also be thought of as a system whose dynamics samples the state space of a belief network. In recent years, the problem of learning the weights in a kinetic Ising model given samples, as well as predicting the statistics of the samples given the weights, has been a very active area of research in statistical physics Roudi11 ; Zeng11 ; Mezard11, leading to several new approximate techniques for solving both the learning and inference problems, either analytically or through very efficient computational methods. Most recently, several studies have focused on the problem of learning and inference in kinetic Ising models with hidden nodes dunn2013learning ; HertzandTyrcha2014 ; BachschmidRomanoandOpper2014 ; Battistin2015. In particular, Battistin et al. Battistin2015 have studied a model in which hidden factors are conditionally independent of each other given the observed ones, and have shown that both learning the weights and inferring the hidden states can be done efficiently using a message passing algorithm. For the kinetic Ising model, the Bayes optimal performance of the inference of the hidden spin configurations can be calculated using methods from statistical mechanics BachschmidRomanoandOpper2014, something that would be interesting to do also for the equilibrium models.
The last three years have seen an explosion of activity studying recurrent neural networks (RNNs), a generalization of feedforward neural networks which can map sequences to sequences. Training RNNs using backpropagation through time can be difficult, and was thought until recently to be hopeless due to vanishing and exploding gradients hochreiter2001gradient. Recent advances in optimization martens2011learning, gated (multiplicative) units in the architecture hochreiter1997long ; chung2014empirical, careful initialization sutskever2013importance ; le2015simple and tricks such as gradient clipping pascanu2012difficulty have led to impressive results in modeling speech, handwriting and language graves2013speech ; graves2013generating ; sutskever2014sequence ; mao2014deep ; vinyals2014show ; xu2015show ; mikolov2014learning ; hannun2014deepspeech. O'Reilly and Frank o2006making proposed a biologically based algorithm that is similar to the popular Long Short-Term Memory (LSTM) hochreiter1997long both in motivation and in computation. The model, PBWM, implements a flexible working memory with an adaptive gating mechanism, inspired by the interaction between the prefrontal cortex and basal ganglia. Trained by reinforcement learning mechanisms rather than backpropagation, it performs comparably to LSTMs and other recurrent architectures on benchmark working memory tasks.
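To make concrete the earlier remark that the kinetic Ising model is essentially a generalized linear model with a logit link, the sketch below simulates a fully observed kinetic Ising model with parallel updates and fits its couplings by gradient ascent on the exact log-likelihood. It uses only the Python standard library; the absence of external fields, the learning rate and all function names are assumptions for illustration, and it implements plain maximum likelihood rather than any of the approximate or hidden-node methods cited above.

```python
import math
import random


def fields(J, s):
    """Local fields H_i = sum_j J[i][j] * s[j] for spins s_j in {-1, +1}."""
    return [sum(Jij * sj for Jij, sj in zip(row, s)) for row in J]


def step(J, s, rng):
    """Parallel update with P(s_i(t+1) = +1 | s(t)) = 1 / (1 + exp(-2 H_i))."""
    return [1 if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * H)) else -1
            for H in fields(J, s)]


def simulate(J, T, rng):
    """Return a trajectory of T + 1 spin configurations."""
    s = [rng.choice((-1, 1)) for _ in J]
    traj = [s]
    for _ in range(T):
        s = step(J, s, rng)
        traj.append(s)
    return traj


def log_likelihood(J, traj):
    """Sum over time and spins of s_i(t+1) H_i(t) - log(2 cosh H_i(t))."""
    total = 0.0
    for s, s_next in zip(traj[:-1], traj[1:]):
        for H, sn in zip(fields(J, s), s_next):
            total += sn * H - math.log(2.0 * math.cosh(H))
    return total


def log_likelihood_gradient(J, traj):
    """Gradient: sum_t [s_i(t+1) - tanh(H_i(t))] * s_j(t)."""
    n = len(J)
    grad = [[0.0] * n for _ in range(n)]
    for s, s_next in zip(traj[:-1], traj[1:]):
        H = fields(J, s)
        for i in range(n):
            err = s_next[i] - math.tanh(H[i])
            for j in range(n):
                grad[i][j] += err * s[j]
    return grad


def fit(traj, lr=0.1, iters=200):
    """Gradient ascent on the log-likelihood per time step, from J = 0."""
    n = len(traj[0])
    T = len(traj) - 1
    J = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        g = log_likelihood_gradient(J, traj)
        J = [[J[i][j] + lr * g[i][j] / T for j in range(n)] for i in range(n)]
    return J
```

Each row of J is fitted by what amounts to a separate logistic regression of one spin's next state on the current state of all spins, which is why the fully observed problem is concave. With hidden spins this structure is lost, and approximate methods of the kind cited above become necessary.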
Biological Implications. Learning rules currently used in deep learning are not biologically plausible. But is it possible, maybe by sacrificing a degree of performance, to derive learning rules that can be implemented in biological neuronal networks? This question should offer an interesting avenue of research and some ideas have already been suggested.
In a recent study, Bengio et al. bengio2015towards suggest that deep learning is possible using biologically plausible synaptic learning rules (e.g. STDP) by performing approximate expectation-maximization (EM)-like algorithms. What sets this work apart from other papers from the deep learning community is that, instead of starting from an algorithmic view and arguing for biological plausibility, it starts from the main learning rule observed in biological synapses (STDP) and reverse-engineers an objective function that the phenomenon could improve. The main finding is that the objective comes from a variational bound on the log-likelihood of the data, which leads to a variational EM learning algorithm. A further contribution of this paper is to connect variational EM with noise injection to the training of a denoising autoencoder over both visible and latent variables. The architecture considered is a deep denoising autoencoder trained with Target Propagation lee2014target, an alternative to backpropagation that computes targets rather than gradients at each layer. Compared to Boltzmann machines, which are perhaps the best-known biologically plausible method for learning deep architectures, this technique completely avoids the need to obtain representative samples from the stationary distribution using (expensive) MCMC, which is both a practical advantage and more satisfactory from the point of view of biological plausibility.
In another recent study, Sountsov and Miller Sountsov2015 implemented the learning rules for a Helmholtz Machine in a network of spiking neurons. The Helmholtz Machine is particularly suitable for neural implementation because its learning rules are local. To implement the synaptic learning rule for the Helmholtz Machine, the delta rule, Sountsov and Miller design a neural microcircuit of spiking neurons which acts as a node of a Helmholtz Machine. Upon receiving inputs from other units or from the external world, the different pools in each microcircuit follow dynamics that lead to the synaptic changes between the nodes prescribed by the delta rule. This idea of considering local microcircuits for the implementation of learning rules is one that can potentially be exploited further, maybe even for the implementation of backpropagation-like algorithms: one criticism against backpropagation is that the error signals would have to travel backwards along the axons. However, if individual nodes are not single neurons but microcircuits with both afferent and efferent connections to and from the neurons of the next layer, and with an internal mechanism to combine these signals for changing the effective connectivity between nodes, such a problem may be avoided.
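The locality of the Helmholtz Machine learning rules can be seen in a minimal sketch of the wake-sleep algorithm for a single hidden layer of binary units. The code below uses only the Python standard library and is an abstract, rate-free illustration, not the spiking microcircuit implementation discussed above; the parameter names (G and g for the generative weights and biases, R and r for the recognition weights and biases, a for the hidden prior biases), the layer sizes and the learning rate are assumptions for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_probs(W, bias, x):
    """Firing probability of each unit given presynaptic activity x."""
    return [sigmoid(b + sum(w * xi for w, xi in zip(row, x)))
            for row, b in zip(W, bias)]


def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def delta_rule(W, bias, pre, target, predicted, lr):
    """Local update: (target - predicted) at the postsynaptic unit times pre."""
    for j, row in enumerate(W):
        err = target[j] - predicted[j]
        for i in range(len(row)):
            row[i] += lr * err * pre[i]
        bias[j] += lr * err


def init_params(n_visible, n_hidden, seed=0):
    rng = random.Random(seed)
    small = lambda rows, cols: [[rng.gauss(0.0, 0.01) for _ in range(cols)]
                                for _ in range(rows)]
    return {"G": small(n_visible, n_hidden), "g": [0.0] * n_visible,
            "R": small(n_hidden, n_visible), "r": [0.0] * n_hidden,
            "a": [0.0] * n_hidden}


def wake_sleep_step(v, p, lr, rng):
    """One wake phase on data vector v followed by one sleep phase."""
    # Wake: recognition weights infer hidden states; generative weights learn.
    h = sample(layer_probs(p["R"], p["r"], v), rng)
    delta_rule(p["G"], p["g"], h, v, layer_probs(p["G"], p["g"], h), lr)
    prior = [sigmoid(aj) for aj in p["a"]]
    for j in range(len(h)):
        p["a"][j] += lr * (h[j] - prior[j])
    # Sleep: generative model dreams a sample; recognition weights learn.
    h_dream = sample([sigmoid(aj) for aj in p["a"]], rng)
    v_dream = sample(layer_probs(p["G"], p["g"], h_dream), rng)
    delta_rule(p["R"], p["r"], v_dream, h_dream,
               layer_probs(p["R"], p["r"], v_dream), lr)
```

Every weight change in delta_rule depends only on the activity of the presynaptic unit and on the difference between the sampled and predicted activity of the postsynaptic unit, which is the sense in which the rule is local.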
5 Conclusion
The history of algorithm and architecture development for neuronal networks has drawn much inspiration from theoretical and experimental neuroscience. Although the recent success of the field is remarkable, it has drifted away from biological plausibility towards application-driven abstract computational models. In this article, we have reviewed some of the most salient contributions to deep learning from the perspective of neurobiology. We argue that there are aspects of this line of work that should inspire our understanding of cortical circuits, and that more dialogue between machine learning practitioners and neuroscientists should prove fruitful.
Box 1: Important modification with biological relevance
As mentioned in the text, following the explosion of interest in deep neuronal networks, various new methods have been proposed to improve learning. Among these, the following have been of particular practical significance and appear to have clear implications for understanding biological neuronal networks.
1. Single neuron transfer function: The first important observation that has led to significant improvement in learning is the use of non-saturating nonlinearities in the individual neurons. In almost all cortical areas, single neurons operate far from their saturation regime, and this has led to the use of non-saturating units (e.g. threshold-linear units) in biological neuronal network models and to the study of the effect of saturation, or the lack of it, on the function of such models treves1990graded ; shriki2003rate ; roudi2006localized. On the deep learning side, it has been shown that non-saturating nonlinearities, in particular the rectified linear unit (ReLU) glorot2011deep, significantly improve convergence, both in unsupervised and in supervised variants of deep neural networks. A generalized parametric form of the ReLU has led to the first result which surpasses human-level performance on the ImageNet classification challenge he2015delving.
2. Dropout: The other notable discovery is the "dropout" regularization technique hinton2012improving, which aims to prevent co-adaptation of hidden units by randomly shutting off a collection of units at each presentation of an input. This method can also be interpreted as training multiple networks, one per example, that share all of their parameters. Obtaining a prediction amounts to taking the geometric mean of the probability distributions over all labels specified by the individual models. Dropout is partially motivated by a theory of the role of sex in evolution: the ability of a set of genes to work well with another random set of genes makes them more robust. Similarly, each hidden unit in a dropout-trained network must learn to work with a randomly chosen set of other units. This encourages the hidden units to model useful features on their own rather than cooperating with other units.
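The two modifications in this box can be combined in a short sketch of a single hidden layer of rectified linear units with dropout, written with only the Python standard library. The drop probability, the function names and the choice to rescale activations at test time are assumptions for the example; the rescaled "mean network" is a commonly used approximation to averaging over the individual thinned networks, and the sketch does not claim to reproduce the exact procedure of the cited papers.

```python
import random


def relu(x):
    """Rectified linear (threshold-linear) transfer function."""
    return x if x > 0.0 else 0.0


def hidden_activations(x, W, b):
    """Non-saturating hidden layer: relu(W x + b), one weight row per unit."""
    return [relu(bj + sum(w * xi for w, xi in zip(row, x)))
            for row, bj in zip(W, b)]


def dropout_layer(x, W, b, drop_prob=0.5, training=True, rng=random):
    """Hidden layer with dropout.

    During training each hidden unit is silenced independently with
    probability drop_prob. At test time all units are kept and their
    activations are scaled by the keep probability (1 - drop_prob).
    """
    h = hidden_activations(x, W, b)
    if training:
        return [0.0 if rng.random() < drop_prob else hj for hj in h]
    return [(1.0 - drop_prob) * hj for hj in h]
```

Because a different random subset of units is silenced at each presentation, every call with training=True effectively evaluates a different thinned network that shares its weights with all the others.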
References
 (1) D. M. MacKay, Towards an information-flow model of human behaviour, British Journal of Psychology 47 (1) (1956) 30–43.
 (2) D. Hubel, T. Wiesel, Shape and arrangement of columns in cat’s striate cortex, The Journal of physiology 165 (3) (1963) 559–568.
 (3) D. Marr, H. K. Nishihara, Representation and recognition of the spatial organization of three-dimensional shapes, Proceedings of the Royal Society of London B: Biological Sciences 200 (1140) (1978) 269–294.
 (4) K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics 36 (4) (1980) 193–202.
 (5) F. Attneave, Some informational aspects of visual perception., Psychological review 61 (3) (1954) 183.
 (6) H. B. Barlow, Possible principles underlying the transformations of sensory messages.
 (7) D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating errors, Nature 323 (1986) 533–536.
 (8) D. H. Ackley, G. E. Hinton, T. J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science 9 (1985) 147–169.
 (9) H. J. Kappen, F. B. Rodriguez, Efficient learning in Boltzmann machines using linear response theory, Neur. Comp. 10 (1998) 1137–1156.
 (10) T. Tanaka, Mean-field theory of Boltzmann machine learning, Phys. Rev. E 58 (1998) 2302–2310.
 (11) C. C. Galland, The limitations of deterministic Boltzmann machine learning, Network: Computation in Neural Systems 4 (3) (1993) 355–379.
 (12) P. Dayan, G. E. Hinton, R. M. Neal, R. S. Zemel, The Helmholtz machine, Neural Computation 7 (5) (1995) 889–904.
 (13) G. E. Hinton, P. Dayan, B. J. Frey, R. M. Neal, The “wake-sleep” algorithm for unsupervised neural networks, Science 268 (5214) (1995) 1158–1161.
 (14) G. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (7) (2006) 1527–1554.
 (15) P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, in: Parallel Distributed Processing: Volume 1: Foundations, MIT Press, 1986, pp. 194–281.
 (16) G. E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14 (2002) 1771–1800.
 (17) D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, S. Bengio, Why does unsupervised pre-training help deep learning?, The Journal of Machine Learning Research 11 (2010) 625–660.
 (18) H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio, An empirical evaluation of deep architectures on problems with many factors of variation, in: Proc. of the 24th Annual International Conference on Machine Learning (ICML), 2007, pp. 473–480.
 (19) P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proc. of the 25th Annual International Conference on Machine Learning (ICML), 2008, pp. 1096–1103.
 (20) M. Ranzato, C. S. Poultney, S. Chopra, Y. LeCun, Efficient learning of sparse representations with an energy-based model, in: Advances in Neural Information Processing Systems (NIPS) 19, 2007, pp. 1137–1144.
 (21) M. Ranzato, Y. Boureau, Y. LeCun, Sparse feature learning for deep belief networks, in: Advances in Neural Information Processing Systems (NIPS) 20, 2008.
 (22) K. Kavukcuoglu, M. Ranzato, Y. LeCun, Fast inference in sparse coding algorithms with applications to object recognition, Tech. rep., Computational and Biological Learning Lab, Courant Institute, NYU, CBLL-TR-2008-12-01 (2008).
 (23) R. Salakhutdinov, G. E. Hinton, Deep Boltzmann machines, in: International Conference on Artificial Intelligence and Statistics, 2009, pp. 448–455.
 (24) N. Srivastava, R. R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, in: Advances in Neural Information Processing Systems, 2012, pp. 2222–2230.
 (25) R. Salakhutdinov, J. B. Tenenbaum, A. Torralba, Learning with hierarchical-deep models, Pattern Analysis and Machine Intelligence, IEEE Transactions on 35 (8) (2013) 1958–1971.
 (26) I. Goodfellow, M. Mirza, A. Courville, Y. Bengio, Multi-prediction deep Boltzmann machines, in: Advances in Neural Information Processing Systems, 2013, pp. 548–556.
 (27) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
 (28) A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems 25, 2012, pp. 1106–1114.
 (29) A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, F.-F. Li, Large-scale video classification with convolutional neural networks, in: CVPR, 2014.
 (30) K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Advances in Neural Information Processing Systems, 2014.
 (31) J. Mao, W. Xu, Y. Yang, J. Wang, A. Yuille, Deep captioning with multimodal recurrent neural networks (m-RNN), arXiv preprint arXiv:1412.6632.
 (32) K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, arXiv preprint arXiv:1502.03044.
 (33) O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, arXiv preprint arXiv:1411.4555.
 (34) C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling, Pattern Analysis and Machine Intelligence, IEEE Transactions on 35 (8) (2013) 1915–1929.
 (35) A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, IEEE, 2014, pp. 512–519.
 (36) K. Kavukcuoglu, M. Ranzato, Y. LeCun, Fast inference in sparse coding algorithms with applications to object recognition, arXiv preprint arXiv:1010.3467.
 (37) S. Rifai, Contractive auto-encoders: Explicit invariance during feature extraction, in: ICML, 2011.
 (38) Y. Bengio, E. Laufer, G. Alain, J. Yosinski, Deep generative stochastic networks trainable by backprop, in: Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 226–234.
 (39) D. J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, Tech. rep., arXiv:1401.4082 (2014).
 (40) D. P. Kingma, M. Welling, Auto-encoding variational Bayes, in: Proceedings of the International Conference on Learning Representations (ICLR), 2014.
 (41) A. Mnih, K. Gregor, Neural variational inference and learning in belief networks, in: ICML’2014, 2014.
 (42) J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences 79 (8) (1982) 2554–2558.
 (43) D. J. Amit, H. Gutfreund, H. Sompolinsky, Spin-glass models of neural networks, Physical Review A 32 (2) (1985) 1007.
 (44) M. Mezard, G. Parisi, M. Virasoro, Spin Glass Theory and Beyond, World Scientific, Singapore, 1987.
 (45) R. Kühn, S. Bös, J. L. van Hemmen, Statistical mechanics for networks of graded-response neurons, Physical Review A 43 (4) (1991) 2084.
 (46) A. Treves, Are spin-glass effects relevant to understanding realistic auto-associative networks?, Journal of Physics A: Mathematical and General 24 (11) (1991) 2645.
 (47) I. Sutskever, G. E. Hinton, Learning multilevel distributed representations for high-dimensional sequences, in: Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS), 2007, pp. 548–555.
 (48) G. W. Taylor, G. E. Hinton, S. T. Roweis, Modeling human motion using binary latent variables, in: Advances in neural information processing systems, 2006, pp. 1345–1352.
 (49) I. Sutskever, G. E. Hinton, G. W. Taylor, The recurrent temporal restricted Boltzmann machine, in: Advances in Neural Information Processing Systems, 2009, pp. 1601–1608.
 (50) G. W. Taylor, L. Sigal, D. J. Fleet, G. E. Hinton, Dynamical binary latent variable models for 3D human pose tracking, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 631–638.
 (51) M. I. Mandel, R. Pascanu, D. Eck, Y. Bengio, L. M. Aiello, R. Schifanella, F. Menczer, Contextual tag inference, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 7 (1) (2011) 32.
 (52) M. D. Zeiler, G. W. Taylor, L. Sigal, I. Matthews, R. Fergus, Facial expression transfer with input-output temporal restricted Boltzmann machines, in: Advances in Neural Information Processing Systems, 2011, pp. 1629–1637.
 (53) Y. Roudi, J. Hertz, Mean field theory for nonequilibrium network reconstruction, Phys. Rev. Lett. 106 (2011) 048702.
 (54) H.-L. Zeng, M. Alava, H. Mahmoudi, E. Aurell, Network inference using asynchronously updated kinetic Ising model, Phys. Rev. E 83 (2011) 041135.
 (55) M. Mezard, J. Sakellariou, Exact mean-field inference in asymmetric kinetic Ising systems, J Stat Mech: Theory and Exp.
 (56) B. Dunn, Y. Roudi, Learning and inference in a nonequilibrium Ising model with hidden nodes, Physical Review E 87 (2) (2013) 022127.
 (57) J. Hertz, J. Tyrcha, Network inference with hidden nodes, Math. Biosci. and Eng. 11 (1) (2014) 149–156.
 (58) L. Bachschmid-Romano, M. Opper, Inferring hidden states in a random kinetic Ising model: replica analysis, Journal of Statistical Mechanics: Theory and Experiment 2014 (6) (2014) P06013.
 (59) C. Battistin, J. Hertz, J. Tyrcha, Y. Roudi, Belief propagation and replicas for inference and learning in a kinetic Ising model with hidden spins, J. Stat. Mech P05021.
 (60) S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, Gradient flow in recurrent nets: the difficulty of learning long-term dependencies (2001).
 (61) J. Martens, I. Sutskever, Learning recurrent neural networks with Hessian-free optimization, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1033–1040.
 (62) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
 (63) J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555.
 (64) I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in: Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1139–1147.
 (65) Q. V. Le, N. Jaitly, G. E. Hinton, A simple way to initialize recurrent networks of rectified linear units, arXiv preprint arXiv:1504.00941.
 (66) R. Pascanu, T. Mikolov, Y. Bengio, On the difficulty of training recurrent neural networks, arXiv preprint arXiv:1211.5063.
 (67) A. Graves, A.-R. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, IEEE, 2013, pp. 6645–6649.
 (68) A. Graves, Generating sequences with recurrent neural networks, arXiv preprint arXiv:1308.0850.
 (69) I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
 (70) T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, M. Ranzato, Learning longer memory in recurrent neural networks, arXiv preprint arXiv:1412.7753.
 (71) A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al., Deep Speech: Scaling up end-to-end speech recognition, arXiv preprint arXiv:1412.5567.
 (72) R. C. O’Reilly, M. J. Frank, Making working memory work: a computational model of learning in the prefrontal cortex and basal ganglia, Neural computation 18 (2) (2006) 283–328.
 (73) Y. Bengio, D.-H. Lee, J. Bornschein, Z. Lin, Towards biologically plausible deep learning, arXiv preprint arXiv:1502.04156.
 (74) D.-H. Lee, S. Zhang, A. Biard, Y. Bengio, Target propagation, arXiv preprint arXiv:1412.7525.
 (75) P. Sountsov, P. Miller, Spiking neuron network Helmholtz machine, Front. Comput. Neurosci 9 (2015) 46.
 (76) A. Treves, Graded-response neurons and information encodings in autoassociative memories, Physical Review A 42 (4) (1990) 2418.
 (77) O. Shriki, D. Hansel, H. Sompolinsky, Rate models for conductance-based cortical neuronal networks, Neural Computation 15 (8) (2003) 1809–1841.
 (78) Y. Roudi, A. Treves, Localized activity profiles and storage capacity of rate-based autoassociative networks, Physical Review E 73 (6) (2006) 061904.
 (79) X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier networks, in: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. JMLR W&CP Volume, Vol. 15, 2011, pp. 315–323.
 (80) K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, arXiv preprint arXiv:1502.01852.
 (81) G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580.
Highlights

Work in machine learning has made learning in deep neuronal architectures possible.

Single neuron nonlinearities make a strong impact on the success of learning.

Biological implementations of these learning rules are being suggested.

Dynamic nets with hidden nodes capture long-time correlations, which are relevant in biology.
Highlighted Papers

Hinton et al. hinton2006fast : This paper is widely thought to have launched the field of Deep Learning. It demonstrates how a generative model called a Deep Belief Network can be constructed by incrementally training and stacking Restricted Boltzmann Machines. It also introduces the idea of greedy layer-wise pre-training, an effective technique for initializing deep neural networks.

Salakhutdinov and Hinton salakhutdinov2009deep : This paper demonstrates a tractable method for training Deep Boltzmann Machines (DBMs). Feedback connections play an important role in DBM inference, making them more consistent with biological architectures.

Sountsov and Miller Sountsov2015 : This is the first paper to propose an implementation of the Helmholtz Machine in a spiking neuronal network. Taking an intricate neuronal microcircuit as an individual node in a simple two-layer Helmholtz Machine, the authors show that the Helmholtz Machine learning rule can be implemented in the network.

Glorot et al. glorot2011deep : This paper shows that the use of non-saturating nonlinear units has a very positive effect on the training of deep neuronal networks. Given that cortical networks operate far from their saturating regime, this observation can be of great importance for understanding the success of learning from complex data in hierarchical cortical networks.

Hinton et al. hinton2012improving : This paper introduces the regularization technique known as Dropout. Dropout and related methods are currently the most effective means of regularizing large neural networks. The technique amounts to efficiently visiting a large number of related models at training time, while aggregating them to a single predictor at test time.

Taylor et al. taylor2006modeling : This is the first paper to extend the Restricted Boltzmann Machine to unsupervised learning of sequences. It also demonstrates the advantages of using depth in sequence modeling.

Treves 1991 J Phys A treves1991spin and Kühn et al. 1991 Phys Rev A kuhn1991statistical : Using spin glass techniques, these papers studied the effect of single neuron transfer functions on the retrieval properties of attractor networks. They found that the spin glass phase, characterized by many local minima and metastable states, is reduced in networks with graded-response neurons. Although this work is concerned with retrieval and not learning, the conclusion tallies well with recent findings in deep learning emphasizing the role of single neurons in making learning easier. Studying the relationship between these two results would thus be very interesting.