Going in circles is the way forward: the role of recurrence in visual inference

03/26/2020 ∙ by Ruben S. van Bergen, et al.

Biological visual systems exhibit abundant recurrent connectivity. State-of-the-art neural network models for visual recognition, by contrast, rely heavily or exclusively on feedforward computation. Any finite-time recurrent neural network (RNN) can be unrolled along time to yield an equivalent feedforward neural network (FNN). This important insight suggests that computational neuroscientists may not need to engage recurrent computation, and that computer-vision engineers may be limiting themselves to a special case of FNN if they build recurrent models. Here we argue, to the contrary, that FNNs are a special case of RNNs and that computational neuroscientists and engineers should engage recurrence to understand how brains and machines can (1) achieve greater and more flexible computational depth, (2) compress complex computations into limited hardware, (3) integrate priors and priorities into visual inference through expectation and attention, (4) exploit sequential dependencies in their data for better inference and prediction, and (5) leverage the power of iterative computation.





I Introduction

The primate visual cortex uses a recurrent algorithm to process sensory input Lamme and Roelfsema (2000); Kreiman and Serre (2020); Angelucci and Bressloff (2006). Anatomically, connectivity is cyclic. Neurons are connected in cycles within their local cortical neighborhood Anderson et al. (1994); Martin (2002); Douglas and Martin (2007). Between cortical areas, as well, connections are generally reciprocal Felleman and Van Essen (1991); Salin and Bullier (1995). Physiologically, the dynamics of neural responses bear temporal signatures indicative of recurrent processing Douglas et al. (1995); Lamme and Roelfsema (2000); Supèr et al. (2001). Behaviorally, visual perception can be disturbed by carefully timed interventions that coincide with the arrival of re-entrant information to a visual area Di Lollo et al. (2000); Lamme et al. (2001); Heinen et al. (2005); Fahrenfort et al. (2007). The evidence for recurrent computation in the primate brain, thus, is unequivocal. What is less obvious, however, is why the brain uses a recurrent algorithm.

Figure 1: Unrolling recurrent neural networks. (a) A simple feedforward neural network. (b) The same network with lateral (blue) and feedback (red) connections added, to make it recurrent. (c) “Unrolling” the network in time clarifies the order of its computations. Here, the network is unrolled for three time steps before its output is read out, but we could choose to run the network for arbitrarily more or fewer steps. Areas are staggered from left to right to show the order in which their neural activities are updated. (d) Alternatively, we can unroll the recurrent network’s time steps in space, by arranging the areas and connections from different time steps in a linear spatial sequence. Note how all arrows now once again point in the same (forward) direction, from input to output. Throughout panels (a-d), connections that are identical (sharing the same synaptic weights) are indicated by corresponding symbols. (e) If we lift the weight-sharing constraints from the previous network, this induces a deep feedforward “super-model”, which can implement the spatially-unrolled recurrent network as a special case. This more general architecture may include additional connections (examples shown as light gray arrows) not present in the spatially-unrolled recurrent net.

This question has recently been brought into sharper focus by the successes of deep feedforward neural network models (FNNs) Lecun et al. (2015). These models now match or exceed human performance on certain visual tasks He et al. (2015, 2016); Kemelmacher-Shlizerman et al. (2016), and better predict primate recognition behavior Kubilius et al. (2016); Majaj and Pelli (2018); Spoerer et al. (2019) and neural activity Cadieu et al. (2014); Khaligh-Razavi and Kriegeskorte (2014); Güçlü and van Gerven (2015); Kriegeskorte (2015); Kheradpisheh et al. (2016); Schrimpf et al. (2018) than current alternative models.

Although computer vision and computational neuroscience both have a long history of recurrent models Rao and Ballard (1999); Yuille and Kersten (2006); Friston and Kiebel (2009); Prince (2012), feedforward models have earned a dominant status in both fields. How should we account for this discrepancy between brains and models?

One answer is that the discrepancy reflects the fact that brains and computer-vision systems operate on different hardware and under different constraints on space, time, and energy. Perhaps we have come to a point at which the two fields must go their separate ways. However, this answer is unsatisfying. Computational neuroscience must still find out how visual inference works in brains. And although engineers face quantitatively different constraints when building computer-vision systems, they, too, must care about the spatial, temporal, and energetic limitations their models must operate under when deployed in, for example, a smartphone. Moreover, as long as neural network models continue to dominate computer vision, more efficient hardware implementations are likely to be more similar to biological neural networks than current implementations using conventional processors and graphics processing units (GPUs).

A second explanation for the discrepancy is that the abundant recurrent connectivity in cortex plays only a superficial role in neural computation. Perhaps the core computations can be performed by a feedforward network DiCarlo et al. (2012), while recurrent processing serves more auxiliary and modulatory functions, such as divisive normalization Carandini and Heeger (2012) and attention Desimone and Duncan (1995); Kastner and Ungerleider (2000); Maunsell and Treue (2006). This perspective is convenient because it enables us to hold on to the feedforward model in our minds. The auxiliary and modulatory functions let us acknowledge recurrence without fundamentally changing the way we envision the algorithm of recognition.

However, there is a third and more exciting explanation for the discrepancy between recurrent brains and feedforward models: Although feedforward computation is powerful, a recurrent algorithm provides a fundamentally superior solution to the problem of visual inference, and this algorithm is implemented in primate visual cortex. This recurrent algorithm explains how primate vision can be so efficient in terms of space, time, energy, and data, while being so rich and robust in terms of the inferences and their generalization to novel environments.

In this review, we argue for the latter possibility, discussing a range of potential computational functions of recurrence and citing the evidence suggesting that the primate brain employs them. We aim to distinguish established from more speculative, and superficial from more profound forms of recurrence, so as to clarify the most exciting directions for future research that will close the gap between models and brains.

II Unrolling a recurrent network

What exactly do we mean when we say that a neural network – whether biological or artificial – is recurrent rather than feedforward? This may seem obvious, but it turns out that the distinction can easily be blurred. Consider the simple network in Fig. 1a. It consists of three processing stages, arranged hierarchically, which we will refer to as areas, by analogy to cortex. Each area contains a number of neurons (real or artificial) that apply fixed operations to their input. Visual input enters in the first area, where it undergoes some transformation, the result of which is passed as input to the second area, and so forth. Information travels exclusively in one direction – the “forward” direction, from input to output – and so this is an example of a feedforward architecture. Notably, the number of transformations between input and output is fixed, and equal to the number of areas in the network.
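To make the fixed-depth property concrete, here is a minimal sketch in Python with NumPy (the names and layer sizes are our own illustrative choices, not from the original) of a three-area feedforward network: information flows strictly forward, and the number of transformations equals the number of areas.

```python
import numpy as np

rng = np.random.default_rng(0)

# One "area" = a fixed transformation: affine map followed by a nonlinearity.
def make_area(n_in, n_out):
    W = rng.standard_normal((n_out, n_in)) * 0.1
    b = np.zeros(n_out)
    return lambda x: np.tanh(W @ x + b)

# Three areas arranged hierarchically: input -> area 1 -> area 2 -> area 3.
areas = [make_area(8, 16), make_area(16, 16), make_area(16, 4)]

def feedforward(x):
    # Information travels exclusively in the forward direction.
    for area in areas:
        x = area(x)
    return x

y = feedforward(rng.standard_normal(8))
# The number of transformations between input and output is fixed:
# exactly len(areas) == 3, regardless of the input.
```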

Figure 2: Relationships between recurrent and feedforward networks. (a) The set of all recurrent neural networks (RNNs) is infinitely large, as is the set of all feedforward networks. But how do these sets relate to each other? Note that any RNN can be reduced to a feedforward architecture by removing all its feedback and lateral connections (e.g. going from Fig. 1b back to Fig. 1a), or equivalently, setting the weights of these connections to zero. Vice versa, any one feedforward network can be expanded to an infinite variety of recurrent networks, by adding lateral or feedback connections. Feedforward networks, then, form an architectural subset of RNNs. In this illustration, we specifically consider RNNs that accomplish their task in a finite number of time steps. These finite-time RNNs (ftRNNs) have the special property that they can be unrolled into equivalent feedforward architectures (a concept we expound in the text). White points connected by black lines illustrate these mutually equivalent architectures. Thus, the feedforward NNs contain a subset of architectures that can be obtained by unrolling a ftRNN. (b) These sets of networks can be further subdivided into subsets that are, or are not, realistic to implement with available computational resources (areas below and above the dotted line, respectively). Very deep networks, or more generally networks with many neurons and connections, require more memory to store and more computational time to execute and train, and are therefore more demanding to implement. Some realistic ftRNNs remain realistic when unrolled to a feedforward architecture – these are indicated in blue. Others, however, become too complex, when unrolled, to be feasible (exemplified in the figure by an “unrolling connector” that crosses the realism line). This is because the unrolling operation induces a much deeper architecture with many more neural connections to be stored and learned. These not-realistically-unrollable ftRNNs are especially interesting, since they correspond to recurrent solutions that cannot be replaced by feedforward architectures.

Now compare this to the architecture in Fig. 1b. Here, we have added lateral and feedback connections to the network. Lateral connections allow the output of an area to be fed back into the same area, to influence its computations in the next processing step. Feedback connections allow the output of an area to influence information processing in a lower area. There is some freedom in the order in which computations may occur in such a network. The order we illustrate here starts with a full feedforward pass through the network. In subsequent time steps, neural activations are updated in ascending order through the hierarchy, based on the activations that were computed in the previous time step.

This order of operations can be seen more clearly if we “unroll” the network in time, as shown in Fig. 1c. In this illustration, the network is unrolled for a fixed number of time steps (3). In fact, recurrent processing can be run for arbitrary durations before its output is read out – a notion we will return to later. Notice how this temporally unrolled, small network resembles a larger feedforward neural network with more connections and areas between its input and output. We can emphasize this recurrent-feedforward equivalence by interpreting the computational graph over time as a spatial architecture, and visually arranging the induced areas and connections in a linear spatial sequence – an operation we call unrolling in space (Fig. 1d). This results in a deep feedforward architecture with many skip connections between areas that are separated by more than one level in this new hierarchy, and with many connections that are exact copies of one another (sharing identical connection weights).

Thus, any finite-time RNN can be transformed into an equivalent FNN. But this should not be taken to mean that RNNs are a special case of FNNs. In fact, FNNs are a special case of finite-time RNNs, comprising those which happen to have no cycles. Moreover, we can always expand an FNN into a range of different RNNs by adding connections that form cycles. More practically, although the set of all FNNs contains a subset that are equivalent to unrolled RNNs in terms of their computational graph (Fig. 2a), not all of these are realistic (Fig. 2b). Realistic networks, here, are networks that conform to the real-world constraints the system operates under. For computational neuroscience, a realistic network is one that fits in the brain of the animal and does not require a deeper network architecture or more processing steps than the animal can accommodate. For computer vision, a realistic network is one that can be trained and deployed on available hardware at the training and deployment stages. For example, there may be limits on the storage and energy available, which would limit the complexity of the architecture and computational graph. A realistic finite-time RNN, when unrolled, can yield an unworkably deep FNN. Although the most widely used method for training RNNs (backpropagation through time) currently requires unrolling, an RNN is not equivalent to its unrolled FNN twin at the stage of real-world deployment.
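The unrolling equivalence can be verified directly in code. The sketch below (illustrative names and sizes, our own construction) runs a small finite-time RNN for T steps and compares it with the same computation written as a depth-T feedforward stack whose layers all share one set of weights; the two produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16
W_ff = 0.1 * rng.standard_normal((n, n))   # feedforward (input) weights
W_rec = 0.1 * rng.standard_normal((n, n))  # lateral (recurrent) weights

def rnn(x, T):
    """Run the recurrent network for T time steps on a static input x."""
    h = np.zeros(n)
    for _ in range(T):
        h = np.tanh(W_ff @ x + W_rec @ h)
    return h

def unrolled_fnn(x, T):
    """The same computation written as a depth-T feedforward network
    whose layers all share one set of weights (the unrolled twin)."""
    h = np.zeros(n)
    layers = [(W_ff, W_rec)] * T   # weight sharing across 'areas'
    for W_f, W_r in layers:
        h = np.tanh(W_f @ x + W_r @ h)
    return h

x = rng.standard_normal(n)
# The finite-time RNN and its unrolled feedforward twin are equivalent:
assert np.allclose(rnn(x, 3), unrolled_fnn(x, 3))
```

Lifting the weight-sharing constraint (giving each layer its own weights) yields the more general feedforward super-model of Fig. 1e.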

An important recent observation is that the architecture that results from spatially unrolling a recurrent network resembles a class of FNNs called Residual Networks (ResNets) He et al. (2016); Liao and Poggio (2016); Jastrzȩbski et al. (2017); Greff et al. (2016). These networks similarly have skip connections and can be very deep. ResNets may form a super-class of models (Fig. 1e), which reduce to “recurrent-equivalent” architectures when certain subsets of weights are constrained to be identical. Interestingly, when ResNets were trained with such recurrent-equivalent weight-sharing constraints, their performance on computer vision benchmarks was similar to that of unconstrained ResNets (even though the weight sharing drastically reduces the parameter count and limits the component computations that the network can perform) Liao and Poggio (2016). This is especially noteworthy given that ResNets, and the architecturally related DenseNets, are currently among the top-ranking DNNs on prominent computer vision benchmarks He et al. (2016); Huang et al. (2017), as well as on measures of brain-similarity Schrimpf et al. (2018). Today’s best artificial vision models, thus, actually implement computational graphs closely related to those of recurrent networks, even though these models are strictly feedforward architectures.

III Reasons to recur

We have described how a recurrent network can be unrolled into a deep feedforward architecture. The resulting feedforward super-model offers greater computational flexibility, since weight-sharing constraints can be omitted and additional skip connections added to the network (Fig. 1e). So what would be the benefit of restricting ourselves to recurrent architectures? We will first discuss the benefits of recurrence in terms of overarching principles, before considering more specific implementations of these principles.

III.1 Recurrence provides greater and more flexible computational depth

III.1.1 Recurrence enables arbitrary computational depth

One important advantage of recurrent algorithms is that they can be run for arbitrary lengths of time before their output is collected. We can define computational depth as the maximum path length (i.e. the number of successive connections and nonlinear transformations) between input and output. A recurrent neural network (RNN) can achieve arbitrary computational depth despite having a finite number of parameters and a finite complement of neurons and connections. In other words, it can multiply its limited spatial resources along time.

III.1.2 Recurrence enables more flexible expenditure of energy and time in exchange for inferential accuracy

In addition to enabling an arbitrarily deep computation given enough time, an RNN can adjust its computational depth to the task at hand. The computational depth of a feedforward net, by contrast, is a fixed number determined by the architecture. In particular, by adjusting their computational depth, RNNs can gracefully trade off speed and accuracy. This was recently demonstrated by Spoerer et al., who implemented recurrent models that terminate computations when they reach a flexible confidence threshold (defined by the entropy of the posterior, a measure of the model’s uncertainty). An RNN could flexibly emulate the performance of different FNNs, with the RNN’s accuracy at a given confidence threshold matching the accuracy of an FNN that requires a similar number of floating point operations Spoerer et al. (2019) (Fig. 3). This presents a clear advantage of recurrence for animals, who may need to respond rapidly in some situations, must limit metabolic expenditures in general, and may benefit from slower and more energetically costly inferences when great accuracy is required. In fact, computer vision faces similar requirements in certain applications. For example, a vision algorithm in a smartphone should respond rapidly and conserve energy in general, but it should also be capable of high-accuracy inference when needed.
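A minimal sketch of such entropy-based early termination (this is our own toy illustration in the spirit of the mechanism described, not the actual model of Spoerer et al.):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector (low = confident)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def run_until_confident(step, x, max_steps=10, threshold=0.5):
    """Run a recurrent model step by step, terminating as soon as the
    entropy of its class posterior falls below `threshold`.
    `step` maps (input, state) -> (class_probs, new_state)."""
    state = None
    for t in range(1, max_steps + 1):
        probs, state = step(x, state)
        if entropy(probs) < threshold:
            break                      # confident enough: stop early
    return probs, t                    # t = computational depth actually used

# Toy recurrent readout whose evidence for class 0 grows with each step.
def toy_step(x, state):
    t = 0 if state is None else state
    logits = np.array([2.0, 0.0, 0.0]) * (t + 1)
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, t + 1

probs, t_used = run_until_confident(toy_step, x=None)
```

A stricter (lower) threshold would make the model run for more steps, trading time and energy for accuracy; a looser one stops it sooner.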

III.2 Recurrent architectures can compress complex computations in limited hardware

Another major benefit of recurrent solutions is that they require fewer components in space when physically implemented in recurrent circuits, such as brains. Compare Figs. 1b and 1e: the recurrent network is anatomically more compact than the feedforward network and has fewer connections. It is easy to see why evolution might have favored recurrent implementations for many brain functions: space, neural projections, and the energy to develop and maintain them are all costly for the organism. In addition, synaptic efficacies must be either learned from limited experience or encoded in a limited-capacity genome. Beyond saving space, material, and energy, thus, smaller descriptive complexity (or parameter count) might ease development and learning.

Figure 3: Recurrence enables a lossless speed-accuracy trade-off in an image classification task. Circles denote the performance of a recurrent neural network that was run for different numbers of time steps, until it achieved a desired threshold of classification confidence (quantified by the entropy of the class probabilities in the final network layer). Squares correspond to three architecturally similar feedforward networks with different computational depths. On the x-axis is the computational cost of running these models, measured by the number of floating point operations. For the feedforward models, this cost is fixed by the architecture. For the recurrent models, it is the average number of operations that was required to meet the given entropy threshold. The y-axis shows the classification accuracy achieved by each model. The performance of the recurrent model for different certainty thresholds follows a smooth curve, trading off computational cost (and thus computational speed) and accuracy. Note that this curve passes almost exactly through the speed/accuracy levels achieved by the feedforward models. Thus, a single recurrent model can emulate the performance of multiple feedforward models as it trades off speed and accuracy. This flexibility does not appear to come at a cost in terms of either parameters or computation: the recurrent model had a similar number of parameters as the feedforward models. For any desired accuracy, the recurrent model required a similar number of floating-point operations on average as the feedforward model achieving this level of accuracy. (Figure reproduced with permission from Spoerer et al. (2019).)

Engineered devices face the same set of costs, although their relative weighting changes from application to application. In particular, a larger number of units and weights must either be represented in the memory of a conventional computer or implemented in specialized (e.g., neuromorphic) hardware. The connection weights in an NN model need to be learned from limited data. This requires extensive training, e.g., in a supervised setting with millions of hand-labeled examples that show the network the desired output for a given input. The larger number of parameters associated with a feedforward solution might lead it to overfit the training data, such that the learned parameters do not generalize well to new examples of the same task. DNNs for image recognition typically have many more parameters than they have training examples.

In practice, such DNNs often turn out to generalize well even when they have very large numbers of parameters Advani and Saxe (2017); Belkin et al. (2019); Nakkiran et al. (2019). This phenomenon is thought to reflect a regularizing effect of the learning algorithm, stochastic gradient descent. Indeed, the trend is towards ever deeper networks with more connections to be optimized, and this trend is associated with continuing gains in performance on computer vision benchmarks Rawat and Wang (2017). Nevertheless, it could turn out that recurrent architectures that achieve high computational depth with few parameters may bring benefits not only in terms of their storage, but also in terms of learnability. At the same time, computational resources are not infinite, even outside of biological constraints. Increasingly complex DNNs take increasingly longer to train on increasingly larger computing clusters, while drawing increasingly large amounts of power – a trend that is not sustainable. In the long run, therefore, computer vision too may benefit from the anatomical compression that can be achieved through clever use of recurrence.

Importantly, however, not every deep feedforward model can be compressed into an equivalent recurrent implementation. This anatomical compression can only be achieved when the same function may be applied iteratively or recursively within the network. The crucial question, therefore, is: what are these functions? What operations can be applied repeatedly in a productive manner? The remainder of this review will reflect on the various roles that have been proposed for recurrent processing for visual inference, from superficial to increasingly more profound forms of recurrence.

III.3 Feedback connections are required to integrate information from outside the visual hierarchy

A key, established role of recurrent connections in biological vision is to propagate information from outside the visual cortex, so that it can aid visual inference Gilbert and Li (2013). Here, we will briefly discuss two such outside influences: attention and expectations.

Figure 4: Increasingly profound modes of recurrent processing, unrolled in time. Visual cortex likely combines all three modes of recurrence illustrated here. The left side of each panel shows the computational graph induced by each form of recurrence, while the right side illustrates a (simplified) example of how this recurrence can be used. In these examples, circles correspond to neurons (or neural assemblies) encoding the feature illustrated within the circle, and lines that connect to circles indicate neural connections with significant activity. (a) Top-down influences from outside the visual processing hierarchy may be incorporated through two computational sweeps: a feedback sweep priming the network with top-down information and a feedforward sweep to interpret visual input and combine this interpretation with the top-down signal. Note that the lateral connections here merely copy neural activities in each area to the next time point; this identity transformation could also be implemented in other ways, such as slow membrane time constants or other forms of local memory. In the example on the right, a top-down signal communicates the expectation that the upcoming input will be horizontal motion. This primes neurons encoding this direction of motion to be more easily or strongly activated, and sharpens the interpretation of the subsequent (ambiguous) visual input. (b) To efficiently perform inference on time-varying visual input, recurrent connections may implement a fixed temporal prediction function akin to the transition kernel in a Kalman filter, extrapolating the ongoing dynamics of the world one time step into the future. For instance, in the example on the right, a downward moving square was perceived at the previous time step. This motion is predicted to continue, and this prediction constrains the interpretation of the (ambiguous) visual input at the next time point. For simplicity, only lateral recurrence is shown in this example. Note that each input is mapped onto its corresponding output in a single recurrent time step. (c) Static input may also benefit from recurrent processing that iteratively refines an initial, coarse feedforward interpretation. In this mode of recurrence, there are several processing time steps between input and output, whereas in (b) there was one input and output for each time step. Illustrated on the right is an iterative hierarchical inference algorithm. Here, a higher-level hypothesis, generated in the first time step, refines the underlying lower-level representation in the next time step, which in turn improves the higher-level hypothesis, and so forth, until the network converges to an optimal interpretation of the input across the entire hierarchy. For simplicity, lateral recurrent interactions are not shown in this example.

III.3.1 Attentional prioritization requires feedback connections

Animals have needs and goals that change from moment to moment. Perception is attuned to an animal’s current objectives. For instance, a primate foraging for red berries may be more successful if its visual perception apparatus prioritizes or enhances the processing of red items. Since current goals are represented outside the visual cortex (e.g. in frontal regions), top-down connections are clearly required for this information to influence visual processing. Such top-down effects have been grouped under the label “attention”, and they have been the subject of an entire sub-field of study. For our purposes, it is sufficient to note that the effects and mechanisms of top-down attention are well-documented and pervasive in visual cortex (for review, see Desimone and Duncan (1995); Kastner and Ungerleider (2000); Maunsell and Treue (2006)), and thus there is no question that this is one important function of recurrent connections.

III.3.2 Integrating prior expectations into visual inference requires feedback connections

Organisms may constrain their visual inferences by expectations Summerfield and Egner (2009). Visual input can be ambiguous and unreliable, and thus open to multiple interpretations. To constrain the inference, an observer can make use of prior knowledge von Helmholtz (1867); Weiss et al. (2002); Stocker and Simoncelli (2006). One form of prior knowledge is environmental constants (e.g. “light tends to come from above” Mamassian and Goutcher (2001)). Such unvarying knowledge may be stored within visual cortex, especially when it pertains to the overall prevalence of basic visual features (e.g. local edge orientations Girshick et al. (2011)). Another form of prior knowledge is contextual information specific to the current situation. Such time-varying knowledge may require a flexible representation outside visual cortex (e.g. “I rang the doorbell at my mother’s house, so I expect to see her open the door”). Such expectations, represented in higher cortical regions, require feedback connections to affect processing in visual cortex Summerfield and Egner (2009).

The top-down imposition of attention and expectation must be mediated by feedback connections. However, it is unclear whether these influences fundamentally change the nature of visual representations or merely modulate these representations, adjusting the gain depending on the current relevance of different features of the visual input. As illustrated in Fig. 4a, for a given input this would require only two “sweeps” of computation through the visual processing hierarchy: a feedback sweep that primes visual areas with top-down information, and a bottom-up sweep to interpret the visual input and integrate this interpretation with the top-down signal (not necessarily in that order). Importantly, if the feedback signal merely enhances or suppresses some visual features, then the core inference algorithm need not be fundamentally recurrent – one can imagine that the bottom-up part of such a network is modeled perfectly by an FNN, while an optional recurrent module could be added in order to implement top-down contextual influences.

III.4 Recurrent networks can exploit temporal dependency structure

Contextual constraints on visual inference are not limited to information from outside the visual hierarchy, such as information from other sensory modalities and memory, as discussed in the previous section. The recent stimulus history within the visual modality also provides context, which is likely represented within the visual system itself.

III.4.1 Recurrent networks can dynamically compress the stimulus history

The primate visual system is thought to contain a hierarchy, not only of processing stages and spatial scales, but also of temporal scales Hasson et al. (2008); Murray et al. (2014). Visual representations track the environment moment by moment. However, the duration of a visual moment – the temporal grain – may depend on the level of representation. These principles apply to all sensory modalities and have been empirically explored, in particular, for audition and speech perception. At the simplest level, a neural network could use delay lines to detect spatiotemporal, rather than purely spatial, patterns. Recurrent neural networks have internal states and can represent temporal context. An RNN could represent a fixed temporal window by replicating units tuned to different patterns at multiple latencies. However, RNNs trained on sequence-processing tasks, such as language translation, learn more sophisticated representations of temporal context Sutskever et al. (2014). They can represent context at multiple time scales, learning a latent representation that enables them to dynamically compress whatever information from the past is needed for the task. In contrast to a feedforward network, a recurrent network’s retrospective time horizon is not limited by its spatial resources: it can maintain task-relevant information indefinitely, integrating long-term memory into its inferences.
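As a toy illustration of dynamic compression (our own minimal sketch, not a model from the literature), an RNN’s hidden state summarizes a stimulus history of any length into a fixed-size vector:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hidden = 4, 32
W_in = 0.3 * rng.standard_normal((n_hidden, n_in))
W_rec = 0.3 * rng.standard_normal((n_hidden, n_hidden))

def encode_history(frames):
    """Fold an arbitrarily long sequence of input frames into a single
    fixed-size hidden state; the state at each step depends (indirectly)
    on every frame that came before it."""
    h = np.zeros(n_hidden)
    for x in frames:
        h = np.tanh(W_in @ x + W_rec @ h)
    return h

short_history = [rng.standard_normal(n_in) for _ in range(5)]
long_history = [rng.standard_normal(n_in) for _ in range(500)]

# The size of the representation is constant, however long the history:
assert encode_history(short_history).shape == encode_history(long_history).shape
```

A feedforward network with delay lines, by contrast, would need its input layer to grow with the temporal window it covers.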

III.4.2 Recurrent dynamics can simulate and predict the dynamics of the world

Dynamic compression of the past exploits the temporal dependency structure of the sensory data. The purpose of representing the past is to act well in the future. This suggests that a neural network should exploit temporal dependencies not just to compress the past, but also to predict the future. In fact, an optimal representation of even just the present requires prediction, because the sensory data is delayed and noisy.

Changes in the world are governed by laws of dynamics, which by definition are temporally invariant. An ideal observer will exploit these laws in visual inference and optimally combine previous with present observations to estimate the current state. This implies an extrapolation of the past to generate predictions that improve the interpretation of the present sensory input. When the dynamics are linear and noise is Gaussian, the optimal way to infer the present state by combining past and present evidence is the Kalman filter Kalman (1960) – an algorithm widely used in engineering applications. A number of authors Wolpert et al. (1995); Rao and Ballard (1997); Rao (2004); Denève et al. (2007) have proposed that the visual cortex may implement an algorithm similar to a Kalman filter. This theory is consistent with temporal biases that are evident in human perceptual judgments Orban de Xivry et al. (2013); Kwon et al. (2015); van Bergen and Jehee (2019).

Kalman filters employ a fixed temporal transition kernel. This kernel takes a representation of the world (e.g., variables encoding the present state of a physical system, such as positions and velocities) at time t, and transforms it into a predicted representation for time t+1, to be integrated with new sensory evidence that arrives at that time. While the resulting prediction varies as a function of the kernel’s input, the kernel itself is constant, reflecting the temporal shift-invariance of the laws governing the dynamics. Recurrent neural networks provide a generalization of the Kalman filter and can represent nonlinear dynamical systems with non-Gaussian noise.
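A minimal one-dimensional tracking example makes the fixed transition kernel concrete. The specific matrices below are illustrative choices (position-velocity dynamics, noisy position measurements), not parameters taken from the cited work:

```python
import numpy as np

# One-dimensional tracking: latent state s = [position, velocity].
A = np.array([[1.0, 1.0],      # fixed transition kernel: pos += vel
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])     # we only observe (noisy) position
Q = 0.01 * np.eye(2)           # process noise covariance
R = np.array([[1.0]])          # observation noise covariance

def kalman_step(s, P, y):
    """One predict-update cycle; the same kernel A is applied at every step."""
    # Predict: extrapolate the past using the (time-invariant) dynamics.
    s_pred = A @ s
    P_pred = A @ P @ A.T + Q
    # Update: combine the prediction with the new observation y.
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)  # Kalman gain
    s_new = s_pred + K @ (y - H @ s_pred)
    P_new = (np.eye(2) - K @ H) @ P_pred
    return s_new, P_new

rng = np.random.default_rng(1)
s_true = np.array([0.0, 1.0])               # true initial state
s_est, P = np.zeros(2), np.eye(2)
for _ in range(50):
    s_true = A @ s_true                      # world evolves
    y = H @ s_true + rng.normal(scale=1.0, size=1)
    s_est, P = kalman_step(s_est, P, y)      # filter tracks it recurrently
```

Note that only the current estimate and its covariance are carried forward, not the raw observation history: the recurrent state is a sufficient compression of the past.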

Note that this type of recurrent processing is more profound than the two-sweep algorithm (Fig. 4a) that incorporated top-down influences on visual inference. The two-sweep algorithm is trivial to unroll into a feedforward architecture. In contrast, unrolling a Kalman filter-like recurrent algorithm would induce an infinitely deep feedforward network, with a separate set of areas and connections for each time point to be processed. A finite-depth feedforward architecture can only approximate the recurrent algorithm. While the feedforward approximation will have a finite temporal window of memory to constrain its present inferences, the recurrent network can in principle integrate information over arbitrarily long periods.

Due to their advantages for dealing with time-varying (or otherwise ordered) inputs, recurrent neural networks are in fact widely employed in the broader field of machine learning for tasks involving sequential data. Speech recognition and machine translation are prominent applications that RNNs excel at Graves et al. (2013); Sak et al. (2014); Sutskever et al. (2014); Bahdanau et al. (2014); Cho et al. (2014). Computer vision, too, has embraced RNNs for recognition and prediction of video input Ranzato et al. (2014); Srivastava et al. (2015); Lotter et al. (2016). Note that these applications all exploit the dynamics in RNNs to model the dynamics in the data.

What if we trained a Kalman filter or sequence-to-sequence RNN (Fig. 4b) on a train of independently sampled static inputs to be classified? The memory of the preceding inputs would not be useful then, so we expect the recurrent model to revert to using essentially only its feedforward weights. The type of recurrent processing described in this section thus uses memory to improve visual inference. In the next section, we consider how recurrent processing can help with the inferential computations themselves, even for static inputs.

iii.5 Recurrence enables iterative inference

Recurrent processing can contribute even to inference on static inputs, and regardless of the agent’s goals and expectations, by means of an iterative algorithm. An iterative algorithm is one that employs a computation that improves an initial guess. Applying the computation again to the improved guess yields a further improvement. This process can be repeated until a good solution has been achieved or until we run out of time or energy. Recurrent networks can implement iterative algorithms, with the same neural network functions applied successively to some internal pattern of activity.

In many fields, iterative algorithms are used to solve estimation and optimization problems. In each iteration, a small adjustment is made to the problem’s proposed solution, to improve a mathematically formulated objective. A locally optimal solution is found by making small improvements until further progress is not required or not possible. The algorithm navigates a path in the space of the values to be estimated or the optimization parameters that leads to a good solution (albeit not necessarily the global optimum).

Much of machine learning involves iterative methods. Gradient descent is an iterative optimization method, whose stochastic variant is the most widely used method for training DNNs. Many discrete optimization techniques are iterative. Iterative algorithms are also central to inference in machine learning, for example in variational inference (where inference is achieved by optimization), sampling methods (where steps are chosen stochastically such that the distribution of samples converges on the posterior distribution), and message passing algorithms (such as loopy belief propagation). In particular, such iterative inference algorithms are used in probabilistic approaches to computer vision Yuille and Kersten (2006); Prince (2012). It is somewhat surprising, then, that iterative computation is not widely exploited to perform visual inference in DNNs.
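As a minimal concrete instance of such iterative optimization, plain gradient descent on a one-dimensional quadratic objective improves a proposed solution by a small adjustment at every iteration:

```python
# Gradient descent on f(x) = (x - 3)^2: each iteration makes a small
# adjustment to the proposed solution, improving the objective.
def grad(x):
    return 2.0 * (x - 3.0)

x, lr = 0.0, 0.1
for _ in range(100):
    x -= lr * grad(x)   # small step against the gradient
# x has now converged (locally) to the minimizer, x = 3.
```

Each step reuses the same update rule on the current estimate, which is exactly the kind of computation a recurrent network can implement by applying fixed weights to an evolving activity pattern.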

Visual inference is naturally understood as an optimization problem, where the goal is to find hypotheses that can explain the current visual input von Helmholtz (1867). A hypothesis, in this case, is a proposed set of latent (i.e. unobserved) causes that can jointly explain the image. The hypothesized latent causes could be the identities and positions of objects in the scene. Visual hypotheses are hierarchical, being subdivided into smaller hypotheses about lower or intermediate-level features, such as the local edges that make up a larger contour. An iterative visual inference algorithm starts with an initial hypothesis, and refines it by incremental improvements. These improvements may include eliminating hypotheses that are mutually exclusive, strengthening compatible causes, or adjusting a hypothesis based on its ability to predict the data (the visual input). In a probabilistic framework, the optimization objective would be the likelihood (probability of the image given the latent representation) or the posterior probability (probability of the latent representation given the image).

iii.5.1 Incompatible hypotheses can compete in the representation

There are often multiple plausible explanations for a given sensory input that are mutually exclusive. The distributed, parallel nature of neural networks enables them to initially activate and represent all of these possible hypotheses simultaneously. Recurrent connectivity between neurons can then implement competitive interactions among hypotheses, so as to converge on the best overall explanation.

There is some evidence that sensory representations are probabilistic Pouget et al. (2013); Ma and Jazayeri (2014); Orbán et al. (2016) – in this case, the probabilities assigned to a set of mutually exclusive hypotheses must sum to 1. A strengthening of belief in one hypothesis, thus, should entail a reduction of the probability of other hypotheses in the representation. If neurons encode point estimates rather than probability distributions, then only one hypothesis can win (although that hypothesis may be encoded by a population response involving multiple neurons). The winning hypothesis could be the maximum a posteriori (MAP) hypothesis or the maximum likelihood hypothesis. Influential models of visual inference involving competitive recurrent interactions include divisive normalization Carandini and Heeger (2012), biased competition Desimone and Duncan (1995), and predictive coding Rao and Ballard (1999); Friston and Kiebel (2009); Boerlin et al. (2013).

Recent theoretical work has demonstrated that lateral competition can give rise to a robust neural code, and can explain certain puzzling neural response properties Boerlin et al. (2013); Barrett et al. (2016). This theory considers a spiking neural network setting, in which different neurons encode highly overlapping or even identical features in their input. This degeneracy means that the same signal can be encoded equally well by a range of different response patterns. When a particular neuron spikes, lateral inhibition ensures that other competing neurons do not encode the same part of the input again. Which neuron gets to do the encoding thus depends on which neuron fires first, because its membrane potential happened to be closest to a spiking threshold. This leads to trial-to-trial variability in neural responses that reflects subtle differences in initial conditions – conditions that may not be known to an experimenter, who may thus mistake this variability for random noise. This could explain the puzzling observation that individual neurons reliably reproduce the same output given the same electrical stimulation, but populations of neurons, wired together, display apparently random variability under sensory stimulation Schiller et al. (1976); Dean (1981); Mainen and Sejnowski (1995). Since multiple neurons can encode the same feature, the resulting code is also robust to neurons being lost or temporarily inactivated.
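A toy sketch conveys the intuition (this is our illustration of the degeneracy-plus-inhibition idea, not the full spike-coding model of Boerlin et al.): two neurons receive identical drive, and whichever membrane potential happens to start closer to threshold ends up doing all the encoding.

```python
import numpy as np

rng = np.random.default_rng(5)
threshold, drive = 1.0, 0.05
v = rng.uniform(0.0, 0.1, size=2)    # subtly different initial conditions
spikes = np.zeros(2, dtype=int)

for t in range(100):
    v += drive                       # identical input to both neurons
    if v.max() >= threshold:
        winner = int(v.argmax())     # first to reach threshold fires...
        spikes[winner] += 1
        v -= threshold               # ...and inhibits its competitor too

# The signal is encoded either way, but *which* neuron encodes it depends
# on initial conditions an experimenter typically does not know.
```

Because the inhibition preserves the small initial difference, the same neuron wins every cycle here; across trials with different initial conditions, the division of labor would look like random response variability.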

FNNs do not incorporate lateral connections for competitive interactions, although they very often include computations that serve a similar purpose. Chief among these are operations known as max-pooling and local response normalization (LRN) Krizhevsky et al. (2012); Lecun et al. (2015). In max-pooling, only the strongest response within a pool of competing neurons is forwarded to the next processing stage. In LRN, each neuron has its response divided by a term that is computed from the sum of activity in its normalization pool. While neither of these mechanisms is mediated by explicit lateral connections in a DNN, a strictly connectionist implementation of these mechanisms (e.g. in biological neurons or neuromorphic hardware) would have to include lateral recurrence. This, then, is another way in which apparently feedforward DNNs can exhibit a (limited) form of recurrent processing "under the hood". Note, though, that each of these operations is carried out only once, rather than allowing competitive dynamics to converge over multiple iterations. Furthermore, in contrast to the lateral interactions in predictive coding or other normative models, LRN and max-pooling are not derived from normative principles, and do not necessarily select (or enhance) the best hypothesis (however "best" is defined).
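Both one-shot competitive operations are easy to state explicitly. In the sketch below, the LRN parameter values (k, alpha, beta) are common defaults in the style of Krizhevsky et al., chosen here for illustration:

```python
import numpy as np

def max_pool(responses):
    """Only the strongest response in the pool is forwarded."""
    return responses.max()

def local_response_norm(responses, k=1.0, alpha=1e-3, beta=0.75):
    """Each response is divided by a term computed from its pool's
    summed (squared) activity."""
    denom = (k + alpha * np.sum(responses ** 2)) ** beta
    return responses / denom

pool = np.array([0.2, 1.5, 0.7])
winner = max_pool(pool)             # one-shot competition: 1.5 wins
normed = local_response_norm(pool)  # suppression grows with pool activity
```

Both functions run once per forward pass; a recurrent implementation would instead let the competition unfold over multiple iterations until the population settles.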

iii.5.2 Compatible hypotheses can strengthen each other in the representation

In feedforward models of hierarchical visual inference, neurons at higher stages selectively respond to combinations of simpler features encoded by lower-level neurons. Higher-level neurons thus are sensitive to larger-scale patterns of correlation between subsets of lower-level features. But such larger-scale statistical regularities may not be most efficiently captured by a set of larger-scale building blocks. Instead, they may be more compactly captured by local association rules. Consider, for instance, the problem of contour detection. Many combinations of local edges in an image can form a continuous contour. The resulting space of contours may be too complex to be efficiently represented with larger-scale templates. What all these contours have in common, however, is that they consist of pairs of edges that are locally contiguous, with sharper angles occurring with lower probability. Thus, the criteria for 'contour-ness' may be compactly expressed by a set of local association rules: these edges go together; those do not Field et al. (1993); Geisler et al. (2001). Contours may then be pieced together by repeatedly applying the same local association rules. Those edge pairs which are most clearly connected would be identified in early iterations. Later inferences can benefit from the context provided by earlier inferences, enabling the process to recognize continuity even where it is less locally apparent.
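A toy association rule of this kind (our construction for illustration, not a model from the cited papers) might score the compatibility of two edge elements by their distance and relative orientation, penalizing sharp angles:

```python
import numpy as np

def association_strength(pos_a, ori_a, pos_b, ori_b,
                         sigma_d=2.0, sigma_o=0.5):
    """Local association rule: two edges 'go together' when they are
    nearby and roughly collinear; sharper relative angles score lower.
    The Gaussian falloffs and widths are arbitrary illustrative choices."""
    d = np.linalg.norm(np.subtract(pos_b, pos_a))
    dtheta = abs(ori_b - ori_a)
    return np.exp(-d**2 / (2 * sigma_d**2)) * np.exp(-dtheta**2 / (2 * sigma_o**2))

# A nearly collinear neighbour associates strongly; a sharp turn does not.
s_smooth = association_strength((0, 0), 0.0, (1, 0), 0.1)
s_sharp = association_strength((0, 0), 0.0, (1, 0), 1.2)
```

Applying such a local rule repeatedly, with each pass conditioned on the groupings found so far, is precisely the kind of computation that lateral recurrent connections can implement.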

This insight has inspired network models of visual inference that implement local association rules through lateral connections, to aid contour integration and other perceptual grouping operations Roelfsema (2006). Recent examples include Linsley et al., who developed horizontal gated-recurrent units (hGRUs) that learn local spatial dependencies Linsley et al. (2018). A network equipped with this particular recurrent connectivity was competitive with state-of-the-art feedforward models on a contour integration task, while using far fewer free parameters. George et al. George et al. (2017) similarly leveraged lateral interactions to recognize contiguous contours and surfaces, by modeling these with a conditional random field (CRF), using a message-passing algorithm for inference. This approach made their Recursive Cortical Network (RCN) the first computer vision algorithm to reliably beat CAPTCHAs – images of letter sequences under a variety of distortions, noise and clutter, that are widely used to verify that queries to a user interface are made by a person, and not an algorithm. CRFs were also used by Zheng et al. Zheng et al. (2015), who incorporated them as a recurrent extension of a convolutional neural network for image segmentation. The model surpassed state-of-the-art performance at the time. Association rules enforced through lateral connections may also help to fill in missing information, such as when objects are partially hidden from view by occluders. Lateral connectivity has been shown to improve recognition performance in such settings Spoerer et al. (2017, 2019); Montobbio et al. (2019). Montobbio et al. showed that lateral diffusion of activity between features with correlated feedforward filter weights improves robustness to image perturbations including occlusions Montobbio et al. (2019).

Enhancement of mutually compatible hypotheses (this section) and competition between mutually exclusive hypotheses (previous section) can both contribute to inference. A more general perspective is provided by the insight that prior knowledge about what features in a scene are mutually compatible or exclusive may be part of an overarching generative model, which iterative algorithms can exploit for inference.

iii.5.3 Iterative algorithms can leverage generative models for inference

Perceptual inference aims to converge on a set of hypotheses that best explain the sensory data. Typically, a hypothesis is considered to be a good explanation if it is consistent with both our prior knowledge and the sensory data. A generative model is a model of the joint distribution of latent causes and sensory data. Generative models can powerfully constrain perceptual inference because they capture prior knowledge about the world. In machine learning, defining generative models enables us to express and exploit what we know about the domain. A wide range of inference algorithms can be used to compute posterior distributions over variables of interest, given observed variables. The algorithms include variational inference, message passing, and Markov Chain Monte Carlo sampling, all of which require iterative computation.

In this section, we focus on a particular approach to leveraging generative models in visual inference, in which the joint distribution of the image x and the latents z is factorized as p(x, z) = p(z) p(x|z), which we refer to as the top-down factorization. The architecture contains components that model and predict the image from the latents (or more generally lower-level latent representations from higher-level latent representations). Compared to the alternative factorization p(x, z) = p(x) p(z|x), the top-down factorization has the potential advantage that the model operates in the causal direction, matching the causal process in the world that generated the image. The top-down model predicts what visual input is likely to result from a scene that has the hypothesized properties. This is somewhat similar to the graphics engine of a video game or image rendering software. This top-down model can be implemented via feedback connections that translate higher-level hypotheses in the network to representations at a lower level of abstraction.

Using generative models implemented with top-down predictions for inference is known as analysis-by-synthesis – an approach that has a long history in theories of perception von Helmholtz (1867); Rao and Ballard (1999); Friston and Kiebel (2009). Arguably, the goal of perceptual inference, by definition, is to reason back from effects (sensory data) to their causes (unobserved variables of interest), and thus invert the process that generated the effects. The crucial question, however, is whether the causal process is explicitly represented in the inference algorithm. The alternative, which can be achieved with feedforward inference, is to directly approximate the inverse, without ever making predictions in the causal direction. The success of the feedforward approach then depends on how well the inverse can be approximated by a fixed mapping of inputs to hypotheses. To iteratively invert the causal process, a neural network can evaluate the causal model for a current hypothesis, update the hypothesis in a beneficial direction, and repeat until convergence. This analysis by repeated synthesis may be preferable to directly approximating the inverse mapping if the causal process that generates the sensory data is easier to model than its inverse. In particular, the causal process may be more compactly represented, more easily learned, more efficient to compute, and more generalizable beyond the training distribution than its inverse.
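For a purely linear generative model, this evaluate-and-update loop can be written out in a few lines. The "renderer" G below is a hypothetical linear stand-in for the graphics-engine analogy, and the observation is noise-free for simplicity:

```python
import numpy as np

rng = np.random.default_rng(4)
n_latent, n_pix = 4, 16
G = rng.normal(size=(n_pix, n_latent))     # hypothetical linear "renderer"

z_true = rng.normal(size=n_latent)
x = G @ z_true                             # observed image (noise-free here)

# Analysis by synthesis: evaluate the causal (top-down) model for the
# current hypothesis, then update the hypothesis to reduce prediction error.
z = np.zeros(n_latent)
lr = 0.9 / np.linalg.norm(G, ord=2) ** 2   # safe step size for this model
for _ in range(5000):
    x_pred = G @ z                         # synthesis: predict the image
    z -= lr * (G.T @ (x_pred - x))         # analysis: refine the hypothesis
```

Note that only the forward (causal) model G is ever evaluated; the inverse mapping from images to latents is never represented explicitly, but emerges from the iteration.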

Another potential advantage of generative inference lies in robustness to variations in the input. While FNNs can accurately categorize images drawn from the same distribution that the training images were drawn from, it does not take much to fool them. A slight alteration imperceptible to humans can cause a DNN to misclassify an image entirely, with high confidence Szegedy et al. (2014). State-of-the-art DNNs rely more strongly on texture than humans, who rely more on shape Geirhos et al. (2019). More generally, FNNs seem to ignore many image features that are relevant to human perception Jacobsen et al. (2019). One hypothesized reason for this is that these networks are trained to discriminate images, but not to generate them. Thus, any visual feature that reliably discriminates categories in the training data will be weighted heavily in the network’s classification decisions. Importantly, this weight is unrelated to how much variance the feature explains in the image, and to the likelihood, i.e. the probability of the image given either of the categories. An ideal observer should evaluate the likelihood for each hypothesis and adjudicate according to their ratio Neyman and Pearson (1933). A feedforward network may instead latch on to a few highly discriminative, but subtle image features that don’t explain much and may not generalize to images from a different data set Jacobsen et al. (2019); Ilyas et al. (2019). In contrast, visual features that are important for generating or reconstructing images of a given class may be more likely to generalize to other examples of the same category. In support of this intuition, two novel RNN architectures that employ generative models for inference were found to be more robust to adversarial perturbations Li et al. (2019); Schott et al. (2018). Generative inference networks were also shown to better align with human perception, compared to discriminative models, when presented with controversial stimuli – images synthesized to evoke strongly conflicting classifications from different models Golan et al. (2019).

Despite these promising developments, generative inference remains rare in visual DNN models. The exceptions mentioned above are rather simple networks trained on easy classification problems, and are not (yet) competitive with state-of-the-art performance on more challenging computer vision benchmarks. Within computational neuroscience, by contrast, generative feedback connections appear in many network models of visual inference. Prominent examples are predictive coding Rao and Ballard (1999); Friston and Kiebel (2009) and hierarchical Bayesian inference Lee and Mumford (2003). However, these models have not had much success in explaining visual inference beyond its earliest stages. A notable exception is work by Wen et al. Wen et al. (2018), which shows that extending supervised convolutional DNNs with the recurrent dynamics of predictive coding can improve classification performance. The fields of computer vision and computational neuroscience both stand to benefit from the development of more powerful generative inference models.

iii.5.4 Iteration is necessary to close the amortization gap

Iterative inference has many advantages. A drawback of iteration, however, is that it takes time for the algorithm to converge during inference. This is unattractive for animals who need to perform visual inference under time pressure. It is also a challenge when training a DNN, which already requires many iterations of optimization. If each update of the network’s connections additionally includes an iterative inner loop to perform inference on each training example, this lengthens the time required for training.

A complementary inference mechanism is amortized inference Srikumar et al. (2012); Stuhlmüller et al. (2013), where a feedforward model approximates the mapping from images to their latent causes. DNNs are eminently suited for learning complicated input-output mappings. A single transformation then replaces the trajectories that would be navigated by an iterative inference algorithm. In some cases, the iterative solution and the best amortized mapping may be exactly equivalent. A linear model, for instance, can be estimated iteratively, by performing gradient descent on the sum of squared prediction errors. However, if a unique solution exists, it can equivalently be found by a linear transformation that directly maps from the data to the optimal coefficients.

In general, however, amortized inference incurs some error, compared to the optimal solution that might be found through iterative optimization. This error has been called the amortization gap Cremer et al. (2018); Marino et al. (2018). It is analogous to the poor fit that may result from buying clothes "off the rack", compared to a tailored version of the same garment. The amortization gap is defined in the context of variational inference, when the iterative optimization of the variational approximation to the posterior is replaced by a neural network that maps from the image to the parameters of the variational distribution. The resulting model suffers from two types of error: (1) error caused by the choice of the variational approximation (variational approximation gap) and (2) error caused by the model mapping from images to variational parameters (amortization gap). One recent study has argued that the amortization gap is often the main source of error in amortized inference models Cremer et al. (2018).

Amortized and iterative inference define a continuum. At one extreme, iterative inference until convergence reaches a solution through a trajectory of small improvements, explicitly evaluating the quality of the current solution at every iteration. At the other extreme, fully amortized inference takes a single leap from input to output. In between these extremes lies a space for algorithms that use intermediate numbers of steps, to approximate the optimal solution through a computational path that is more refined than a leap, but more efficient than full-fledged iterative optimization. Models that occupy this space include explicit hybrids of iterative and amortized inference Hjelm et al. (2016); Krishnan et al. (2018); Marino et al. (2018), as well as RNNs with arbitrary dynamics that are trained to converge to a desired objective in a limited number of time steps (e.g. Liang and Hu (2015); Spoerer et al. (2019); Kar et al. (2019); Nayebi et al. (2018)).

iii.6 Recurrence is required for active vision

Vision is an active exploratory process. Our eye movements scan the scene through a sequence of well-chosen fixations that bring objects of interest into foveal vision. Moving our heads and our bodies enables us to bring entirely new parts of the scene into view, and closer for inspection at high resolution. Active control of our eyes, heads, and bodies can also help disambiguate 3D structure as fixation on points at different depths changes binocular disparity, and head and body movements create motion parallax. Active vision involves a recurrent cycle of sensory processing and muscle control, a cycle that runs through the environment.

Our focus here has been on the internal computational functions of recurrent processing, and active vision has been reviewed elsewhere Ballard (1991); Findlay and Gilchrist (2003); Bajcsy et al. (2018). However, it is important to note that the internal recurrent processes of visual inference from a single glimpse are embedded within the larger recurrent process of active visual exploration. Active vision provides not just the larger behavioral context of visual inference. It also provides a powerful illustration of the fundamental advantages that recurrent algorithms offer in general. It illustrates how limited resources (the fovea) can be dynamically allocated (eye movements) to different portions of the evidence (the visual scene) in temporal sequence. A sensory system limited to a finite number of neurons, thus, can multiply its resources along time to achieve a detailed analysis. The cycle may start with an initial rough analysis of the entire visual field, followed by fixations on locations likely to yield valuable information. This is an example of an essentially recurrent process whose efficiency cannot be emulated with a feedforward system. The internal mechanisms of visual inference are faced with qualitatively similar challenges: Just like our retinae cannot afford foveal resolution throughout the visual field, the ventral stream cannot afford to perform all potentially relevant inferences on the evidence streaming in through the optic nerve in a single feedforward sweep. Internal shifts of attention, like eye movements, can sequentialize a complex computation and avoid wasting energy on portions of the evidence that are uninformative or irrelevant to the current goals of the animal.

Whereas the outer loop of active vision is largely about positioning our eyes relative to the scene and bringing important content into foveal vision, the inner loop of visual inference on each glimpse is far more flexible. Beyond covert attentional shifts that select locations, features, or objects for scrutiny, a recurrent network can decide what computations to perform so as to most efficiently reduce uncertainty about the important parts of the scene. In a game of twenty questions, we choose a question that most reduces our remaining uncertainty at each step. The budget of twenty would not suffice if we had to decide all the questions before seeing any answers. The visual system similarly has limited computational resources for processing a massive stream of evidence. It must choose what inferences to pursue on the basis of their computational cost and uncertainty-reducing benefit as it forages for insight Russell (1997); Gershman et al. (2015); Griffiths et al. (2015).
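The arithmetic behind the twenty-questions intuition is simple (a toy calculation of our own, not from the cited work): each maximally informative yes/no answer halves the set of remaining hypotheses, so an adaptive questioner needs only about log2(N) questions, but only because each question can depend on previous answers.

```python
import numpy as np

def questions_needed(n_hypotheses):
    """Yes/no questions needed when each answer halves the hypothesis set."""
    return int(np.ceil(np.log2(n_hypotheses)))

# 16 equally likely hypotheses need log2(16) = 4 adaptive questions;
# a budget of twenty covers roughly a million distinct items.
q16 = questions_needed(16)
q1m = questions_needed(1_000_000)
```

If all questions had to be fixed in advance, the questioner could not condition later questions on earlier answers, and the same budget would cover far fewer hypotheses; this is the advantage the recurrent, sequential strategy buys.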

Iv Closing the gap between biological and artificial vision

We have reviewed a number of advantages that recurrence can bring to neural networks for visual inference. Going forward, neural network models of vision should incorporate recurrence; not just to better understand visual inference in the brain, but also to improve its implementation in machines.

iv.1 Recurrence already improves performance on challenging visual tasks

Efforts in this direction are already underway, and turning up promising results. Some of this work has been described in previous sections, such as the use of lateral connections to impose local association rules Linsley et al. (2018); George et al. (2017); Zheng et al. (2015) and generative inference for more robust performance outside the training distribution Li et al. (2019); Schott et al. (2018). Several other recent findings are worth highlighting here, as they have shown improved performance on visual tasks, better approximations to biological vision, or both, through recurrent computations.

In particular, several studies have found that recurrence is required in order to explain or improve visual inference in challenging settings. Kar and colleagues Kar et al. (2019) identified a set of 'challenge images' that required recurrent processing in order to be accurately recognized. A feedforward DNN struggled to interpret these images, whereas macaque monkeys recognized them as accurately as a set of control images. Challenge images were associated with longer processing times in the macaque inferior temporal (IT) cortex, consistent with recurrent computations. Neural responses in IT for images that took longer were well accounted for by a brain-inspired RNN model. In a different study Kubilius et al. (2019), this same recurrent architecture was found to account for behavior and neural responses in object recognition tasks, while also achieving good performance on an important computer vision benchmark (ImageNet Deng et al. (2009)).

One prominent challenge to visual inference is posed by partial occlusions, which hide part of a target object from view. In two recent studies, recurrent architectures were shown to be more robust to occlusions than their feedforward counterparts Spoerer et al. (2017); Tang et al. (2018). Interestingly, in both human observers and in an RNN model, object recognition under occlusion was impaired by backward masking Tang et al. (2018) (the presentation of a meaningless noise image, shortly after a target stimulus, to disrupt recurrent processing Enns and Di Lollo (2000); Fahrenfort et al. (2007); Lamme et al. (2001)). Another challenge for human perception is crowding, which occurs when the detailed perception of a target stimulus is disrupted by nearby flanker stimuli Levi (2008). In certain instances, the target stimulus can be released from crowding if further flankers are added that form a larger, coherent structure with the original flankers. This uncrowding effect may be due to the flankers being 'explained away', thus reducing their interference with the target representation Manassi and Herzog (2012); Manassi et al. (2016). Recent work Doerig et al. (2020) has shown that both effects can be explained by architectures known as Capsule Nets Sabour et al. (2017, 2018), which include recurrent information routing mechanisms that may be similar to perceptual grouping and segmentation processes in the visual cortex.

Note that, in all of these cases, it may be possible to develop a feedforward architecture that performs the task equally well or better. Trivially, and as we discussed previously, a successful recurrent architecture can always be unrolled (for a finite number of time steps) into a deep feedforward network with many more learnable connections. However, a realistic recurrent model, when unrolled, may map onto an unrealistic feedforward model (Fig. 2), where realism could refer to the real-world constraints faced by either biological or artificial visual systems. Future studies should compare RNN and FNN implementations for the same visual inference task, while matching the complexity of the models in a meaningful way. Setting a realistic budget of units, connections, and computational operations is one important approach. To understand the computational differences between RNN and FNN solutions, it is also interesting to (1) match the parameter count (number of connection weights that must be learned and stored), which requires granting the FNN larger feature kernels, more feature maps per layer, or more layers, or (2) match the computational graph, which equates the distribution of path lengths from input to output and all other statistics of the graph, but grants the FNN a much larger number of parameters Spoerer et al. (2019).

IV.2 Freeing ourselves from the feedforward framework

Deep feedforward neural networks constitute an essential building block for visual inference, but they are not the whole story. The missing element, recurrent dynamics, is central to a range of alternative conceptions of visual inference that have been proposed Ballard (1991); O’Regan and Noë (2001); Yuille and Kersten (2006); Findlay and Gilchrist (2003); Bajcsy et al. (2018); Buzsáki (2019). These ideas have a long history, they are essential to understanding biological vision, and they have great potential for engineering, especially in the context of modern hardware and software. The promise of active vision and recurrent visual inference is, in fact, boosted by the power of feedforward networks.

However, the beauty, power, and simplicity of feedforward neural networks also make it difficult to engage and develop the space of recurrent neural network algorithms for vision. The feedforward framework, embellished by recurrent processes that serve auxiliary and modulatory functions like normalization and attention, enables computational neuroscientists to hold on to the idea of a hierarchy of feature detectors. This idea might not be entirely mistaken. However, it is likely to be severely incomplete and ultimately limiting.

The insight that any finite-time recurrent network can be unrolled compounds the problem by suggesting that the feedforward framework is essentially complete. More practically, the fact that we train RNNs by unrolling them for finite time steps might in some ways impede our progress. DNNs are usually trained by stochastic gradient descent using the backpropagation algorithm. This method retraces in reverse the computational steps that led to the response in the output layer, so as to estimate the influence that each connection in the network had on the response. Each connection weight is then adjusted, to bring the network output closer to a desired output. The deeper the network, the longer the computational path that needs to be retraced. RNNs for visual inference typically are trained through a variation on this method, known as backpropagation through time (BPTT). To retrace computations in reverse through cycles, the RNN is unrolled along time, so as to convert it into a feedforward network whose depth depends on the number of time steps as shown in Fig. 1b-d. This enables the RNN to be trained like an FNN.
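To make the mechanics of BPTT concrete, here is a deliberately tiny example (a one-unit linear RNN, chosen for illustration rather than realism): unrolling the recurrence h[t] = w·h[t-1] + u·x[t] yields a feedforward chain whose layers all share the weight w, and the backward pass retraces that chain in reverse, accumulating one gradient contribution per time step:

```python
# Toy example (hypothetical one-unit linear RNN, for illustration only):
# h[t] = w * h[t-1] + u * x[t], loss = (h[T] - y)^2 evaluated at the final
# step. BPTT retraces the unrolled computation in reverse, accumulating one
# gradient contribution for the shared weight w at every time step.

def forward(w, u, x, h0=0.0):
    hs = [h0]
    for xt in x:
        hs.append(w * hs[-1] + u * xt)
    return hs                       # states h[0] .. h[T]

def bptt_grad_w(w, u, x, y):
    hs = forward(w, u, x)
    dL_dh = 2.0 * (hs[-1] - y)      # gradient of the loss at the output, t = T
    grad = 0.0
    for t in range(len(x), 0, -1):  # retrace the unrolled graph in reverse
        grad += dL_dh * hs[t - 1]   # local contribution of w at this step
        dL_dh *= w                  # carry the error one time step further back
    return grad

w, u, y = 0.8, 1.2, 1.0
x = [1.0, -0.5, 2.0]

# sanity check against a numerical (finite-difference) gradient
eps = 1e-6
numeric = ((forward(w + eps, u, x)[-1] - y) ** 2 -
           (forward(w - eps, u, x)[-1] - y) ** 2) / (2 * eps)
print(bptt_grad_w(w, u, x, y), numeric)   # the two agree
```

The finite-difference check confirms that retracing the unrolled graph recovers the true gradient; note that the backward pass visits every time step, so its cost grows with the unrolling depth, exactly as described above.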

BPTT is attractive for enabling us to train RNNs like FNNs on arbitrary objectives. When it comes to learning recurrent dynamics, however, BPTT strictly optimizes the output at the specific time points evaluated by the objective (e.g., the output after exactly T time steps). Outside of this time window, there is no guarantee that the network’s response will be well-behaved. The RNN might reach the desired objective at the desired time, but diverge immediately after. Ideally, we would like a visual RNN presented with a stable image to converge to an attractor that represents the image and behave stably for arbitrary lengths of time. This would be consistent with iterative optimization, in which each step improves the network’s approximation to its objective. While it is not impossible for BPTT to give rise to such dynamics, it does not specifically favor them.
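The point about behavior outside the training window can be illustrated with a toy linear recurrence (hypothetical dynamics, not a trained network): both systems below satisfy the objective at the supervised time step, but only the contractive one treats the target as an attractor:

```python
# Toy illustration (hypothetical linear dynamics): two one-unit recurrences,
# h[t+1] = a * h[t] + b, both tuned to hit the target y = 1 at the supervised
# step T = 5. The expansive one (a > 1) satisfies the objective at t = T and
# then diverges; the contractive one (a < 1) has y as an attractor and stays
# there for arbitrary lengths of time.

def rollout(a, b, steps, h0=0.0):
    h, traj = h0, []
    for _ in range(steps):
        h = a * h + b
        traj.append(h)
    return traj

T, y = 5, 1.0

# expansive: choose b so that h[T] == y exactly (geometric-series solution)
a_bad = 1.5
b_bad = y * (a_bad - 1.0) / (a_bad ** T - 1.0)

# contractive: fixed point b / (1 - a) placed at y
a_good = 0.5
b_good = y * (1.0 - a_good)

bad = rollout(a_bad, b_bad, 2 * T)
good = rollout(a_good, b_good, 2 * T)

print(bad[T - 1], bad[-1])    # hits 1.0 at t = T, then runs away
print(good[T - 1], good[-1])  # near 1.0 at t = T, and still there at t = 2T
```

Nothing in a loss evaluated only at t = T distinguishes these two solutions, which is precisely why BPTT does not specifically favor convergent dynamics.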

In effect, BPTT shackles RNNs to the feedforward framework, in which the goal is still to map inputs to outputs, rather than to discover useful dynamics. BPTT is also computationally cumbersome, as every additional recurrent time step extends the computational path that must be retraced in order to update the connections. This complication also renders BPTT biologically implausible. Although the case for backpropagation as potentially biologically plausible has recently been strengthened Guerguiev et al. (2017); Sacramento et al. (2018); Whittington and Bogacz (2019), its extension through time is difficult to reconcile with biology or implement efficiently in a finite engineered system for online learning – precisely because it requires unrolling and keeping track of separate copies of each weight as computational cycles are retraced in reverse.

Given these drawbacks, we speculate that a true breakthrough in recurrent vision models will require a training regime that does not rely on BPTT. Rather than optimizing an RNN’s state in a finite time window, future RNN training methods might directly target the network’s dynamics, or the states that those dynamics are encouraged to converge to. This approach has some history in RNN models of vision. Predictive coding models, for instance, are designed with dynamics that explicitly implement iterative optimization. Such models can update their connections through learning rules that require only the converged network state as input Rao and Ballard (1999), rather than the entire computational path to this state. Marino et al. (2018) recently proposed iterative amortized inference, training inference networks to have recurrent dynamics that improve the network’s hypotheses in each iteration, without constraining these dynamics to a particular form (such as predictive coding).
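As a concrete illustration of such a learning rule, the following sketch (toy dimensions; a linear generative model in the spirit of Rao and Ballard (1999), not a full predictive coding network) settles its latent state by iterative error minimization and then updates its weights using only that converged state:

```python
import numpy as np

# Minimal sketch (toy dimensions, illustrative only): inference is an
# iterative settling of the latent state r under a linear generative model
# x ≈ W r, and learning uses only the converged state, via a local
# Hebbian-like rule on the prediction error. No backpropagation through the
# settling trajectory is needed.

rng = np.random.default_rng(0)
n_x, n_r = 8, 3
W = rng.normal(scale=0.5, size=(n_x, n_r))   # generative weights
x = rng.normal(size=n_x)                     # the input (here: a vector)

def settle(W, x, steps=500, lr=0.05):
    """Recurrent inference: gradient descent on ||x - W r||^2 over r."""
    r = np.zeros(W.shape[1])
    for _ in range(steps):
        r += lr * W.T @ (x - W @ r)          # recurrent error-correction step
    return r

r = settle(W, x)                             # converged network state
err = x - W @ r                              # prediction error at convergence

# Local learning rule: Delta W ∝ err r^T, computed from the converged state
# alone (step size kept conservative so the update cannot overshoot).
eta = 0.1 / (1.0 + r @ r)
W_new = W + eta * np.outer(err, r)

print(np.linalg.norm(err), np.linalg.norm(x - W_new @ r))
```

The weight update is local (an outer product of prediction error and converged activity), so no trajectory through the settling dynamics needs to be stored or retraced, and the prediction error for this input shrinks after the update.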

IV.3 Going forward, in circles

We started this review with the puzzling observation that, whereas biological vision is implemented in a profoundly recurrent neural architecture, the most successful neural network models of vision to date are feedforward. We have argued, theoretically and empirically, that vision models will eventually converge to their biological roots and implement more powerful recurrent solutions. One appeal of this view is that it suggests that neuroscientists and engineers may work synergistically, to make progress on common challenges. After all, visual inference, and intelligence more generally, were solved once before.

V Acknowledgements

We thank Samuel Lippl, Heiko Schütt, Andrew Zaharia, Tal Golan and Benjamin Peters for detailed comments on a draft of this paper.


  • M. S. Advani and A. M. Saxe (2017) High-dimensional dynamics of generalization error in neural networks. pp. 1–32. External Links: 1710.03667, Link Cited by: §III.2.
  • J. C. Anderson, R. J. Douglas, K. A. C. Martin, and J. C. Nelson (1994) Synaptic output of physiologically identified spiny stellate neurons in cat visual cortex. The Journal of Comparative Neurology 341 (1), pp. 16–24. External Links: Document, ISSN 0021-9967, Link Cited by: §I.
  • A. Angelucci and P. C. Bressloff (2006) Contribution of feedforward, lateral and feedback connections to the classical receptive field center and extra-classical receptive field surround of primate V1 neurons. In Progress in Brain Research, Vol. 154, pp. 93–120. External Links: Document, ISSN 00796123, Link Cited by: §I.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural Machine Translation by Jointly Learning to Align and Translate. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, pp. 1–15. External Links: 1409.0473, Link Cited by: §III.4.2.
  • R. Bajcsy, Y. Aloimonos, and J. K. Tsotsos (2018) Revisiting active perception. Autonomous Robots 42 (2), pp. 177–196. External Links: Document, ISSN 0929-5593, Link Cited by: §III.6, §IV.2.
  • D. H. Ballard (1991) Animate vision. Artificial Intelligence 48 (1), pp. 57–86. External Links: Document, ISSN 00043702, Link Cited by: §III.6, §IV.2.
  • D. G.T. Barrett, S. Denève, and C. K. Machens (2016) Optimal compensation for neuron loss. eLife 5 (DECEMBER2016), pp. 1–36. External Links: Document, ISSN 2050084X Cited by: §III.5.1.
  • M. Belkin, D. Hsu, S. Ma, and S. Mandal (2019) Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences of the United States of America 116 (32), pp. 15849–15854. External Links: Document, 1812.11118, ISSN 10916490 Cited by: §III.2.
  • M. Boerlin, C. K. Machens, and S. Denève (2013) Predictive Coding of Dynamical Variables in Balanced Spiking Networks. PLoS Computational Biology 9 (11). External Links: Document, ISSN 1553734X Cited by: §III.5.1, §III.5.1.
  • G. Buzsáki (2019) The Brain from Inside Out. Oxford University Press. External Links: Document, ISBN 9780190905385, Link Cited by: §IV.2.
  • C. F. Cadieu, H. Hong, D. L.K. Yamins, N. Pinto, D. Ardila, E. A. Solomon, N. J. Majaj, and J. J. DiCarlo (2014) Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition. PLoS Computational Biology 10 (12). External Links: Document, 1406.3284, ISSN 15537358 Cited by: §I.
  • M. Carandini and D. J. Heeger (2012) Normalization as a canonical neural computation. Nature Reviews Neuroscience 13 (1), pp. 51–62. External Links: Document, ISSN 1471003X Cited by: §I, §III.5.1.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. External Links: 1406.1078, Link Cited by: §III.4.2.
  • C. Cremer, X. Li, and D. Duvenaud (2018) Inference suboptimality in variational autoencoders. 35th International Conference on Machine Learning, ICML 2018 3, pp. 1749–1760. External Links: 1801.03558, ISBN 9781510867963 Cited by: §III.5.4.
  • A.F. Dean (1981) The variability of discharge of simple cells in the cat striate cortex. Experimental Brain Research 44 (4). External Links: Document, ISSN 0014-4819, Link Cited by: §III.5.1.
  • S. Denève, J.-R. Duhamel, and A. Pouget (2007) Optimal Sensorimotor Integration in Recurrent Cortical Networks: A Neural Implementation of Kalman Filters. Journal of Neuroscience 27 (21), pp. 5744–5756. External Links: Document, ISBN 1529-2401 (Electronic), ISSN 0270-6474, Link Cited by: §III.4.2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. External Links: Document, ISBN 978-1-4244-3992-8, Link Cited by: §IV.1.
  • R. Desimone and J. Duncan (1995) Neural Mechanisms of Selective Visual Attention. Annual Review of Neuroscience 18 (1), pp. 193–222. External Links: Document, ISSN 0147006X, Link Cited by: §I, §III.3.1, §III.5.1.
  • V. Di Lollo, J. T. Enns, and R. A. Rensink (2000) Competition for consciousness among visual events: The psychophysics of reentrant visual processes. Journal of Experimental Psychology: General 129 (4), pp. 481–507. External Links: Document, ISSN 00963445 Cited by: §I.
  • J. J. DiCarlo, D. Zoccolan, and N. C. Rust (2012) How does the brain solve visual object recognition?. Neuron 73 (3), pp. 415–434. External Links: Document, ISSN 08966273, Link Cited by: §I.
  • A. Doerig, A. Bornet, O.H. Choung, and M.H. Herzog (2020) Crowding reveals fundamental differences in local vs. global processing in humans and machines. Vision Research 167 (August 2019), pp. 39–45. External Links: Document, ISSN 00426989, Link Cited by: §IV.1.
  • R. J. Douglas, C. Koch, M. Mahowald, K. A.C. Martin, and H. H. Suarez (1995) Recurrent excitation in neocortical circuits. Science 269 (5226), pp. 981–985. External Links: Document, ISSN 00368075 Cited by: §I.
  • R. J. Douglas and K. A.C. Martin (2007) Recurrent neuronal circuits in the neocortex. Current Biology 17 (13), pp. 496–500. External Links: Document, ISSN 09609822 Cited by: §I.
  • J. T. Enns and V. Di Lollo (2000) What’s new in visual masking?. Trends in Cognitive Sciences 4 (9), pp. 345–352. External Links: Document, ISSN 13646613, Link Cited by: §IV.1.
  • J. J. Fahrenfort, H. S. Scholte, and V. A.F. Lamme (2007) Masking disrupts reentrant processing in human visual cortex. Journal of Cognitive Neuroscience 19 (9), pp. 1488–1497. External Links: Document, ISSN 0898929X Cited by: §I, §IV.1.
  • D. J. Felleman and D. C. Van Essen (1991) Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex 1 (1), pp. 1–47. External Links: Document, ISSN 14602199 Cited by: §I.
  • D. J. Field, A. Hayes, and R. F. Hess (1993) Contour integration by the human visual system: evidence for a local "association field". Vision Research 33 (2), pp. 173–193. External Links: Document, ISSN 0042-6989, Link Cited by: §III.5.2.
  • J. M. Findlay and I. D. Gilchrist (2003) Active Vision. Oxford University Press. External Links: Document, ISBN 9780198524793, Link Cited by: §III.6, §IV.2.
  • K. Friston and S. Kiebel (2009) Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society B: Biological Sciences 364 (1521), pp. 1211–1221. External Links: Document, ISSN 14712970 Cited by: §I, §III.5.1, §III.5.3, §III.5.3.
  • R. Geirhos, C. Michaelis, F. A. Wichmann, P. Rubisch, M. Bethge, and W. Brendel (2019) Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. 7th International Conference on Learning Representations, ICLR 2019 (c), pp. 1–22. External Links: 1811.12231 Cited by: §III.5.3.
  • W. S. Geisler, J. S. Perry, B. J. Super, and D. P. Gallogly (2001) Edge co-occurrence in natural images predicts contour grouping performance. Vision Research 41 (6), pp. 711–724. External Links: Document, ISBN 0042-6989, ISSN 00426989 Cited by: §III.5.2.
  • D. George, W. Lehrach, K. Kansky, M. Lázaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang, A. Lavin, and D. S. Phoenix (2017) A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs. Science 358 (6368). External Links: Document, ISSN 10959203 Cited by: §III.5.2, §IV.1.
  • S. J. Gershman, E. J. Horvitz, and J. B. Tenenbaum (2015) Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science 349 (6245), pp. 273–278. External Links: Document, ISSN 0036-8075, Link Cited by: §III.6.
  • C. D. Gilbert and W. Li (2013) Top-down influences on visual processing. Nature Reviews Neuroscience 14 (5), pp. 350–363. External Links: Document, ISSN 1471003X Cited by: §III.3.
  • A. R. Girshick, M. S. Landy, and E. P. Simoncelli (2011) Cardinal rules: visual orientation perception reflects knowledge of environmental statistics.. Nature neuroscience 14 (7), pp. 926–32. External Links: Document, ISSN 1546-1726, Link Cited by: §III.3.2.
  • T. Golan, P. C. Raju, and N. Kriegeskorte (2019) Controversial stimuli: pitting neural networks against each other as models of human recognition. External Links: 1911.09288, Link Cited by: §III.5.3.
  • A. Graves, A. Mohamed, and G. Hinton (2013) Speech Recognition with Deep Recurrent Neural Networks. (3). External Links: 1303.5778, Link Cited by: §III.4.2.
  • K. Greff, R. K. Srivastava, and J. Schmidhuber (2016) Highway and Residual Networks learn Unrolled Iterative Estimation. 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings (2015), pp. 1–14. External Links: 1612.07771, Link Cited by: §II.
  • T. L. Griffiths, F. Lieder, and N. D. Goodman (2015) Rational Use of Cognitive Resources: Levels of Analysis Between the Computational and the Algorithmic. Topics in Cognitive Science 7 (2), pp. 217–229. External Links: Document, ISSN 17568757, Link Cited by: §III.6.
  • U. Güçlü and M. A. J. van Gerven (2015) Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream. The Journal of Neuroscience 35 (27), pp. 10005 LP – 10014. External Links: Document, Link Cited by: §I.
  • J. Guerguiev, T. P. Lillicrap, and B. A. Richards (2017) Towards deep learning with segregated dendrites. pp. 1–37. Cited by: §IV.2.
  • U. Hasson, E. Yang, I. Vallines, D. J. Heeger, and N. Rubin (2008) A Hierarchy of Temporal Receptive Windows in Human Cortex. Journal of Neuroscience 28 (10), pp. 2539–2550. External Links: Document, ISSN 0270-6474, Link Cited by: §III.4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In 2015 IEEE International Conference on Computer Vision (ICCV), Vol. 2015 Inter, pp. 1026–1034. External Links: Document, 1502.01852, ISBN 978-1-4673-8391-2, ISSN 15505499, Link Cited by: §I.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2016-December, pp. 770–778. External Links: Document, 1512.03385, ISBN 9781467388504, ISSN 10636919 Cited by: §I, §II.
  • K. Heinen, J. Jolij, and V. A.F. Lamme (2005) Figure-ground segregation requires two distinct periods of activity in V1: A transcranial magnetic stimulation study. NeuroReport 16 (13), pp. 1483–1487. External Links: Document, ISSN 09594965 Cited by: §I.
  • R. D. Hjelm, K. Cho, J. Chung, R. Salakhutdinov, V. Calhoun, and N. Jojic (2016) Iterative refinement of the approximate posterior for directed belief networks. Advances in Neural Information Processing Systems (Nips 2016), pp. 4698–4706. External Links: 1511.06382, ISSN 10495258 Cited by: §III.5.4.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 2017-January, pp. 2261–2269. External Links: Document, 1608.06993, ISBN 9781538604571 Cited by: §II.
  • A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial Examples Are Not Bugs, They Are Features. Distill 4 (8). External Links: Document, 1905.02175, Link Cited by: §III.5.3.
  • J. H. Jacobsen, J. Behrmann, R. Zemel, and M. Bethge (2019) Excessive invariance causes adversarial vulnerability. 7th International Conference on Learning Representations, ICLR 2019, pp. 1–17. External Links: 1811.00401 Cited by: §III.5.3.
  • S. Jastrzȩbski, D. Arpit, N. Ballas, V. Verma, T. Che, and Y. Bengio (2017) Residual Connections Encourage Iterative Inference. (2016), pp. 1–14. External Links: 1710.04773, Link Cited by: §II.
  • R. E. Kalman (1960) A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering 82 (1), pp. 35–45. External Links: Document, ISSN 0021-9223, Link Cited by: §III.4.2.
  • K. Kar, J. Kubilius, K. Schmidt, E. B. Issa, and J. J. DiCarlo (2019) Evidence that recurrent circuits are critical to the ventral stream’s execution of core object recognition behavior. Nature Neuroscience 22 (6), pp. 974–983. External Links: Document, ISSN 15461726, Link Cited by: §III.5.4, §IV.1.
  • S. Kastner and L. G. Ungerleider (2000) Mechanisms of Visual Attention in the Human Cortex. Annual Review of Neuroscience 23 (1), pp. 315–341. External Links: Document, ISBN 9781880653531, ISSN 0147-006X, Link Cited by: §I, §III.3.1.
  • I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard (2016) The MegaFace benchmark: 1 million faces for recognition at scale. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2016-December, pp. 4873–4882. External Links: Document, 1512.00596, ISBN 9781467388504, ISSN 10636919 Cited by: §I.
  • S. M. Khaligh-Razavi and N. Kriegeskorte (2014) Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation. PLoS Computational Biology 10 (11). External Links: Document, ISSN 15537358 Cited by: §I.
  • S. R. Kheradpisheh, M. Ghodrati, M. Ganjtabesh, and T. Masquelier (2016) Deep Networks Can Resemble Human Feed-forward Vision in Invariant Object Recognition. Scientific Reports 6, pp. 1–24. External Links: Document, 1508.03929, ISSN 20452322 Cited by: §I.
  • G. Kreiman and T. Serre (2020) Beyond the feedforward sweep: feedback computations in the visual cortex. Annals of the New York Academy of Sciences, pp. nyas.14320. External Links: Document, ISSN 0077-8923, Link Cited by: §I.
  • N. Kriegeskorte (2015) Deep Neural Networks: A New Framework for Modeling Biological Vision and Brain Information Processing. Annual Review of Vision Science 1 (1), pp. 417–446. External Links: Document, ISSN 2374-4642 Cited by: §I.
  • R. G. Krishnan, D. Liang, and M. D. Hoffman (2018) On the challenges of learning with inference networks on sparse, high-dimensional data. International Conference on Artificial Intelligence and Statistics, AISTATS 2018 84, pp. 143–151. External Links: 1710.06085 Cited by: §III.5.4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks. Advances In Neural Information Processing Systems, pp. 1–9. External Links: Document, 1102.0183, ISBN 9781627480031, ISSN 10495258 Cited by: §III.5.1.
  • J. Kubilius, S. Bracci, and H. P. Op de Beeck (2016) Deep Neural Networks as a Computational Model for Human Shape Sensitivity. PLOS Computational Biology 12 (4), pp. e1004896. External Links: Document, ISSN 1553-7358, Link Cited by: §I.
  • J. Kubilius, M. Schrimpf, H. Hong, N. J. Majaj, R. Rajalingham, E. B. Issa, K. Kar, P. Bashivan, J. Prescott-Roy, K. Schmidt, A. Nayebi, D. Bear, D. L. K. Yamins, and J. J. DiCarlo (2019) Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs. pp. 1–19. External Links: 1909.06161, Link Cited by: §IV.1.
  • O. Kwon, D. Tadin, and D. C. Knill (2015) Unifying account of visual motion and position perception. Proceedings of the National Academy of Sciences 112 (26), pp. 8142–8147. External Links: Document, ISSN 0027-8424, Link Cited by: §III.4.2.
  • V. A.F. Lamme, K. Zipser, and H. Spekreijse (2001) Masking interrupts figure-ground signals in V1. Journal of Vision 1 (3), pp. 1044–1053. External Links: Document, ISSN 15347362 Cited by: §I, §IV.1.
  • V. A.F. Lamme and P. R. Roelfsema (2000) The distinct modes of vision offered by feedforward and recurrent processing. Trends in Neurosciences 23 (11), pp. 571–579. External Links: Document, ISSN 01662236 Cited by: §I.
  • Y. Lecun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. External Links: Document, ISSN 14764687 Cited by: §I, §III.5.1.
  • T. S. Lee and D. Mumford (2003) Hierarchical Bayesian inference in the visual cortex.. Journal of the Optical Society of America. A, Optics, image science, and vision 20 (7), pp. 1434–48. External Links: ISSN 1084-7529, Link Cited by: §III.5.3.
  • D. M. Levi (2008) Crowding—An essential bottleneck for object recognition: A mini-review. Vision Research 48 (5), pp. 635–654. External Links: Document, ISSN 00426989, Link Cited by: §IV.1.
  • Y. Li, J. Bradshaw, and Y. Sharma (2019) Are generative classifiers more robust to adversarial attacks?. 36th International Conference on Machine Learning, ICML 2019 2019-June, pp. 6754–6783. External Links: 1802.06552, ISBN 9781510886988 Cited by: §III.5.3, §IV.1.
  • M. Liang and X. Hu (2015) Recurrent convolutional neural network for object recognition. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 07-12-June, pp. 3367–3375. External Links: Document, ISBN 9781467369640, ISSN 10636919 Cited by: §III.5.4.
  • Q. Liao and T. Poggio (2016) Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex. (047), pp. 1–16. External Links: 1604.03640, Link Cited by: §II.
  • D. Linsley, J. Kim, V. Veerabadran, and T. Serre (2018) Learning long-range spatial dependencies with horizontal gated-recurrent units. External Links: 1805.08315, Link Cited by: §III.5.2, §IV.1.
  • W. Lotter, G. Kreiman, and D. Cox (2016) Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. pp. 1–18. External Links: Document, 1605.08104, ISBN 9781633212817, ISSN 15826163, Link Cited by: §III.4.2.
  • W. J. Ma and M. Jazayeri (2014) Neural Coding of Uncertainty and Probability.. Annual Review of Neuroscience 37, pp. 205–220. External Links: Document, ISSN 1545-4126, Link Cited by: §III.5.1.
  • Z. F. Mainen and T. J. Sejnowski (1995) Reliability of spike timing in neocortical neurons.. Science 268 (5216), pp. 1503–6. External Links: ISSN 0036-8075, Link Cited by: §III.5.1.
  • N. J. Majaj and D. G. Pelli (2018) Deep learning-Using machine learning to study biological vision. Journal of Vision 18 (13), pp. 1–13. External Links: Document, ISSN 15347362 Cited by: §I.
  • P. Mamassian and R. Goutcher (2001) Prior knowledge on the illumination position. Cognition 81 (1), pp. 1–9. External Links: Document, ISSN 00100277 Cited by: §III.3.2.
  • M. Manassi, B. Sayim, and M. H. Herzog (2012) Grouping, pooling, and when bigger is better in visual crowding. Journal of Vision 12, pp. 1–14. External Links: Document Cited by: §IV.1.
  • M. Manassi, S. Lonchampt, A. Clarke, and M. H. Herzog (2016) What crowding can tell us about object representations. Journal of Vision 16 (3), pp. 35. External Links: Document, ISSN 1534-7362, Link Cited by: §IV.1.
  • J. Marino, Y. Yue, and S. Mandt (2018) Iterative amortized inference. 35th International Conference on Machine Learning, ICML 2018 8, pp. 5444–5462. External Links: 1807.09356, ISBN 9781510867963 Cited by: §III.5.4, §III.5.4, §IV.2.
  • K. A.C. Martin (2002) Microcircuits in visual cortex. Current Opinion in Neurobiology 12 (4), pp. 418–425. External Links: Document, ISSN 09594388 Cited by: §I.
  • J. H.R. Maunsell and S. Treue (2006) Feature-based attention in visual cortex. Trends in Neurosciences 29 (6), pp. 317–322. External Links: Document, 0504378, ISBN 0166-2236 (Print), ISSN 01662236 Cited by: §I, §III.3.1.
  • N. Montobbio, L. Bonnasse-Gahot, G. Citti, and A. Sarti (2019) KerCNNs: biologically inspired lateral connections for classification of corrupted images. pp. 1–40. External Links: 1910.08336, Link Cited by: §III.5.2.
  • J. D. Murray, A. Bernacchia, D. J. Freedman, R. Romo, J. D. Wallis, X. Cai, C. Padoa-Schioppa, T. Pasternak, H. Seo, D. Lee, and X. Wang (2014) A hierarchy of intrinsic timescales across primate cortex. Nature Neuroscience 17 (12), pp. 1661–1663. External Links: Document, ISSN 1097-6256, Link Cited by: §III.4.1.
  • P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever (2019) Deep Double Descent: Where Bigger Models and More Data Hurt. pp. 1–24. External Links: 1912.02292, Link Cited by: §III.2.
  • A. Nayebi, D. Bear, J. Kubilius, K. Kar, S. Ganguli, D. Sussillo, J. J. DiCarlo, and D. L.K. Yamins (2018) Task-driven convolutional recurrent models of the visual system. Advances in Neural Information Processing Systems 2018-December (NeurIPS), pp. 5290–5301. External Links: ISSN 10495258 Cited by: §III.5.4.
  • J. Neyman and E. S. Pearson (1933) IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 231 (694-706), pp. 289–337. External Links: Document, ISSN 0264-3952, Link Cited by: §III.5.3.
  • J. K. O’Regan and A. Noë (2001) A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences 24 (5), pp. 939–973. External Links: Document, ISSN 0140-525X, Link Cited by: §IV.2.
  • J.-J. Orban de Xivry, S. Coppe, G. Blohm, and P. Lefevre (2013) Kalman Filtering Naturally Accounts for Visually Guided and Predictive Smooth Pursuit Dynamics. Journal of Neuroscience 33 (44), pp. 17301–17313. External Links: Document, ISSN 0270-6474, Link Cited by: §III.4.2.
  • G. Orbán, P. Berkes, J. Fiser, and M. Lengyel (2016) Neural Variability and Sampling-Based Probabilistic Representations in the Visual Cortex. Neuron 92 (2), pp. 530–543. External Links: Document, ISSN 08966273, Link Cited by: §III.5.1.
  • A. Pouget, J. M. Beck, W. J. Ma, and P. E. Latham (2013) Probabilistic brains: knowns and unknowns.. Nature neuroscience 16 (9), pp. 1170–8. External Links: Document, ISSN 1546-1726, Link Cited by: §III.5.1.
  • S. J. D. Prince (2012) Computer Vision: Models, Learning and Inference. Cambridge University Press, Cambridge. External Links: Document, ISBN 9780511996504, Link Cited by: §I, §III.5.
  • M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra (2014) Video (language) modeling: a baseline for generative models of natural videos. (December 2014). External Links: 1412.6604, Link Cited by: §III.4.2.
  • R. P. N. Rao and D. H. Ballard (1997) Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural computation 9 (November 1995), pp. 721–763. External Links: Document, ISBN 0899-7667 (Print), ISSN 0899-7667 Cited by: §III.4.2.
  • R. P. N. Rao and D. H. Ballard (1999) Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects.. Nature neuroscience 2 (1), pp. 79–87. External Links: Document, ISSN 1097-6256, Link Cited by: §I, §III.5.1, §III.5.3, §III.5.3, §IV.2.
  • R. P. N. Rao (2004) Bayesian computation in recurrent neural circuits.. Neural computation 16 (1), pp. 1–38. External Links: ISSN 0899-7667, Link Cited by: §III.4.2.
  • W. Rawat and Z. Wang (2017) Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review. Neural Computation 29 (9), pp. 2352–2449. External Links: Document, ISSN 0899-7667, Link Cited by: §III.2.
  • P. R. Roelfsema (2006) Cortical algorithms for perceptual grouping. Annual Review of Neuroscience 29 (1), pp. 203–227. External Links: Document, ISSN 0147-006X, Link Cited by: §III.5.2.
  • S. J. Russell (1997) Rationality and intelligence. Artificial Intelligence 94 (1-2), pp. 57–77. External Links: Document, ISSN 00043702, Link Cited by: §III.6.
  • S. Sabour, N. Frosst, and G. E. Hinton (2018) Matrix capsules with EM routing. ICLR 2018, pp. 1–12. External Links: 1710.09829, ISSN 10495258, Link Cited by: §IV.1.
  • S. Sabour, N. Frosst, and G. E. Hinton (2017) Dynamic routing between capsules. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3856–3866. External Links: Link Cited by: §IV.1.
  • J. Sacramento, R. Ponte Costa, Y. Bengio, and W. Senn (2018) Dendritic cortical microcircuits approximate the backpropagation algorithm. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 8721–8732. External Links: Link Cited by: §IV.2.
  • H. Sak, A. Senior, and F. Beaufays (2014) Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition. (Cd). External Links: 1402.1128, Link Cited by: §III.4.2.
  • P. A. Salin and J. Bullier (1995) Corticocortical connections in the visual system: structure and function. Physiological Reviews 75 (1), pp. 107–154. External Links: Document, ISSN 0031-9333, Link Cited by: §I.
  • P. H. Schiller, B. L. Finlay, and S. F. Volman (1976) Short-term response variability of monkey striate neurons. Brain Research 105 (2), pp. 347–349. External Links: ISSN 0006-8993, Link Cited by: §III.5.1.
  • L. Schott, J. Rauber, M. Bethge, and W. Brendel (2018) Towards the first adversarially robust neural network model on MNIST. ICLR, pp. 1–16. External Links: Document, 1805.09190, Link Cited by: §III.5.3, §IV.1.
  • M. Schrimpf, J. Kubilius, H. Hong, N. J. Majaj, R. Rajalingham, E. B. Issa, K. Kar, P. Bashivan, J. Prescott-Roy, K. Schmidt, D. L. K. Yamins, and J. J. DiCarlo (2018) Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like?. bioRxiv, pp. 407007. External Links: Document, Link Cited by: §I, §II.
  • C. J. Spoerer, P. McClure, and N. Kriegeskorte (2017) Recurrent convolutional neural networks: A better model of biological object recognition. Frontiers in Psychology 8 (SEP), pp. 1–14. External Links: Document, ISBN 978-3-642-24796-5, ISSN 16641078 Cited by: §III.5.2, §IV.1.
  • C. Spoerer, T. C. Kietzmann, and N. Kriegeskorte (2019) Recurrent networks can recycle neural resources to flexibly trade speed for accuracy in visual recognition. pp. 1–22. External Links: Document Cited by: §I, Figure 3, §III.1.2, §III.5.2, §III.5.4, §IV.1.
  • V. Srikumar, G. Kundu, and D. Roth (2012) On amortizing inference cost for structured prediction. EMNLP-CoNLL 2012 - 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Proceedings of the Conference, pp. 1114–1124. External Links: ISBN 9781937284435 Cited by: §III.5.4.
  • N. Srivastava, E. Mansimov, and R. Salakhutdinov (2015) Unsupervised Learning of Video Representations using LSTMs. External Links: Document, 1502.04681, ISBN 9781510810587, ISSN 1550-235X, Link Cited by: §III.4.2.
  • A. A. Stocker and E. P. Simoncelli (2006) Noise characteristics and prior expectations in human visual speed perception. Nature Neuroscience 9 (4), pp. 578–585. External Links: Document, ISSN 1097-6256, Link Cited by: §III.3.2.
  • A. Stuhlmüller, J. Taylor, and N. D. Goodman (2013) Learning stochastic inverses. Advances in Neural Information Processing Systems, pp. 1–9. External Links: ISSN 10495258 Cited by: §III.5.4.
  • C. Summerfield and T. Egner (2009) Expectation (and attention) in visual cognition. Trends in Cognitive Sciences 13 (9), pp. 403–409. External Links: Document, ISSN 13646613 Cited by: §III.3.2.
  • H. Supèr, H. Spekreijse, and V. A. F. Lamme (2001) Two distinct modes of sensory processing observed in monkey primary visual cortex (V1). Nature Neuroscience 4 (3), pp. 304–310. External Links: Document, ISSN 10976256 Cited by: §I.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27, pp. 3104–3112. External Links: 1409.3215, ISSN 10495258 Cited by: §III.4.1, §III.4.2.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings, pp. 1–10. External Links: 1312.6199 Cited by: §III.5.3.
  • H. Tang, M. Schrimpf, W. Lotter, C. Moerman, A. Paredes, J. O. Caro, W. Hardesty, D. Cox, and G. Kreiman (2018) Recurrent computations for visual pattern completion. Proceedings of the National Academy of Sciences of the United States of America 115 (35), pp. 8835–8840. External Links: Document, 1706.02240, ISSN 10916490 Cited by: §IV.1.
  • R. S. van Bergen and J. F. M. Jehee (2019) Probabilistic representation in human visual cortex reflects uncertainty in serial decisions. Journal of Neuroscience 39 (41), pp. 8164–8176. External Links: Document, ISSN 15292401 Cited by: §III.4.2.
  • H. von Helmholtz (1867) Handbuch der physiologischen Optik. Leopold Voss, Leipzig. Cited by: §III.3.2, §III.5.3, §III.5.
  • Y. Weiss, E. P. Simoncelli, and E. H. Adelson (2002) Motion illusions as optimal percepts. Nature Neuroscience 5 (6), pp. 598–604. External Links: Document, ISSN 10976256, Link Cited by: §III.3.2.
  • H. Wen, K. Han, J. Shi, Y. Zhang, E. Culurciello, and Z. Liu (2018) Deep Predictive Coding Network for Object Recognition. pp. 1–10. External Links: 1802.04762, Link Cited by: §III.5.3.
  • J. C.R. Whittington and R. Bogacz (2019) Theories of Error Back-Propagation in the Brain. Trends in Cognitive Sciences 23 (3), pp. 235–250. External Links: Document, ISSN 13646613, Link Cited by: §IV.2.
  • D. M. Wolpert, Z. Ghahramani, and M. I. Jordan (1995) An internal model for sensorimotor integration. Science 269 (5232), pp. 1880–1882. Cited by: §III.4.2.
  • A. Yuille and D. Kersten (2006) Vision as Bayesian inference: analysis by synthesis?. Trends in Cognitive Sciences 10 (7), pp. 301–308. External Links: Document, ISSN 13646613 Cited by: §I, §III.5, §IV.2.
  • S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr (2015) Conditional random fields as recurrent neural networks. Proceedings of the IEEE International Conference on Computer Vision, pp. 1529–1537. External Links: Document, 1502.03240, ISBN 9781467383912, ISSN 15505499 Cited by: §III.5.2, §IV.1.