A longstanding question in neuroscience is how animals excel at learning complex behaviors involving temporal dependencies across multiple timescales, and thereafter generalize this learned behavior to unseen data. This requires solving the temporal credit assignment problem: how to assign the contribution of past neural states to future outcomes. To address this, neuroscientists are increasingly turning to the mathematical framework provided by recurrent neural networks (RNNs) to model learning mechanisms in the brain [83, 123]. Temporal credit assignment in RNNs is typically achieved by backpropagation through time (BPTT) or other gradient descent-based optimization algorithms, none of which are biologically plausible (or bio-plausible for short). Therefore, the use of RNNs as a framework for understanding the computational principles of learning in the brain has motivated an influx of bio-plausible learning rules that approximate gradient descent [83, 95, 123].
Performance of such rules is typically quantified by accuracy. Although the accuracy achieved by these rules is often comparable to that of true gradient descent, little is known about the breadth of the emergent solutions, namely how well they generalize. Broadly speaking, generalization refers to a trained model's ability to adapt to previously unseen data, and is typically measured by the so-called generalization gap: the difference between training and testing error. This is especially important when learning complex tasks with nonlinear RNNs, where the loss landscape is non-convex and many solutions with comparable training accuracy can therefore exist. These solutions, characterized as (local) minima in the loss landscape, can nonetheless exhibit drastically different levels of generalization (Figure 1). It is not clear whether gradient-based methods like BPTT and the existing bio-plausible alternatives differ in their tendency to converge to loss minima that provide better or worse generalization.
While better predictors of generalization performance remain an open research question in deep learning, recent extensive studies identify flatness of the loss landscape at the solution point as a promising predictor of generalization [62, 119, 33, 109, 157]. Leveraging these empirical and theoretical findings, we ask: how do the proposed biologically motivated gradient approximations affect the flatness of the converged solution on the loss landscape, and thereby, generalization?
Our overarching goal is to investigate generalization trends for existing bio-plausible temporal credit assignment rules, and in particular, examine how truncation-based bio-plausible gradient approximations can affect such trends. Specifically, our contributions are summarized as follows:
In numerical experiments, we demonstrate across several well-known neuroscience and machine learning benchmark tasks that state-of-the-art (SoTA) bio-plausible learning rules for training RNNs exhibit worse and more variable generalization gaps compared to true (stochastic) gradient descent (Figure 2A-C).
Using the same experiments, we show that bio-plausible learning rules tend to approach high-curvature regions in synaptic weight space as measured by the loss’ Hessian eigenspectrum (Figure 2D-F). Further, we verify that this correlates with worse generalization (Figure 2D-F and 3), which is consistent with the literature.
Given the core components in designing artificial neural networks (data, objective functions, learning rules and architectures), we investigate different learning rules while holding the data, objective function and architecture constant. The SoTA RNN learning rules investigated include a three-factor rule with symmetric feedback (symmetric e-prop), a three-factor rule using random feedback weights (RFLO), and a multi-factor rule using local modulatory signaling (MDGL). To our knowledge, our analysis is the first to highlight and study this gap in solution quality between artificial and bio-plausible learning rules for RNNs, thereby motivating further investigations into how the brain learns solutions that generalize.
2 Related works
2.1 Bio-plausible gradient approximations
Investigating bio-plausible learning rules is of interest both to identify more efficient training strategies for artificial networks and to better understand learning in the brain [16, 9, 24, 112, 159, 98, 123, 95, 96, 51, 121, 40, 17, 55, 105, 152, 32, 106, 8, 48, 67, 160, 101, 7, 140, 94, 23, 35, 85, 84, 89, 165]. In order for learning to reach a certain goal quantified by an objective, learning algorithms often minimize a loss function. The error gradient, if available, tells us how each parameter should be adjusted to achieve the steepest local descent. For training RNNs, which are widely used as a model for neural circuits [162, 118, 142, 70, 138, 137, 174, 93, 47, 65, 31, 158, 108, 69, 161, 90, 156, 130, 29, 73], standard algorithms that follow this gradient — real time recurrent learning (RTRL) and BPTT — are not bio-plausible and have overwhelming memory storage demands [166, 95]. However, learning rules that only approximate the true gradient can sometimes be as effective as those that follow the gradient exactly [125, 86]. Because of this, bio-plausible learning rules that approximate the gradient using known properties of real neurons have been proposed and have led to successful learning outcomes across many different tasks in feedforward networks [82, 83, 122, 72, 100, 131, 6, 129, 115, 126, 81], with recent extensions to recurrently connected networks [104, 13, 87]. These existing bio-plausible update rules for RNNs estimate the gradient using known biological learning ingredients: eligibility traces, which maintain traces of preceding activity at the molecular level [91, 42, 134, 20, 173, 150], combined with top-down instructive signaling [42, 91, 139, 4, 38, 36, 75, 18, 114] as well as local cell-to-cell modulatory signaling within the network [145, 144, 87]. For efficient online learning in RNNs, other approximations (not necessarily bio-plausible) to RTRL [102, 151, 25, 127, 97, 180] have also been demonstrated to produce good performance. Given the impressive accuracy achieved by these approximate rules, several studies have begun to investigate their convergence properties, e.g. for random backpropagation weights in feedforward networks [146, 46]. However, the generalization capabilities of these rules, especially in RNNs, remain underexamined.
2.2 Loss landscape curvature and generalization performance
Given the central importance of understanding how neural networks perform in situations unseen during training [172, 61, 5, 3, 58, 154, 120, 11], the deep learning community has made tremendous efforts to develop tools for understanding generalization, which we leverage here. That flat minima could lead to better generalization was observed more than two decades ago. Intuitively, under the same amount of perturbation in parameter space (e.g. loss landscape changes due to the addition of new data), worse performance degradation will be seen around narrower minima (see Figure 1C). We note that perturbations in parameter space can be linked to those in the input space. Recently, many empirical studies have consistently supported the usefulness of this predictor [39, 21, 149, 68, 76, 37, 111, 163]. In particular, an extensive study found that flatness-based measures correlate more strongly with generalization than other alternatives. Motivated by this, several studies have characterized properties of the loss function's Hessian, whose eigenspectrum carries information about curvature [133, 43, 176, 80, 171]. Connections between flatness and generalization performance have shed light on the reason for greater generalization gaps in large batch training [68, 177, 170, 59, 181], and have also inspired optimization methods that favor flatter minima [53, 39, 21, 57, 10, 64, 183, 168, 66, 135]. Despite criticism that flatness is scale-dependent, i.e. parameter rescaling can drastically change flatness without always changing generalization quality, flatness measures that take parameter scales into account [79, 124] are connected to PAC-Bayesian generalization error bounds [33, 109, 157]. Moreover, a recent theoretical study rigorously connects flatness of the loss surface to generalization in classification tasks under the assumption that labels are (approximately) locally constant.
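As a toy illustration of this intuition (our own sketch, not an experiment from the paper): near a minimum with curvature $\mu$, a parameter perturbation of size $\delta$ raises a locally quadratic loss by roughly $\frac{1}{2}\mu\delta^2$, so the same perturbation hurts a sharp minimum far more than a flat one. The curvature and perturbation values below are arbitrary illustrative choices.

```python
# Hypothetical quadratic minima with equal training loss (zero) but different curvature.
mu_flat, mu_sharp = 0.1, 10.0   # curvature at a flat vs a sharp minimum
delta = 0.3                     # identical perturbation applied in parameter space

bump_flat = 0.5 * mu_flat * delta ** 2    # loss increase at the flat minimum
bump_sharp = 0.5 * mu_sharp * delta ** 2  # loss increase at the sharp minimum
# Same perturbation, but the sharp minimum suffers a 100x larger loss increase.
```

This second-order picture is exactly why the leading Hessian eigenvalue is used later in the paper as a curvature summary.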
Leveraging the great progress from the deep learning community, we aim to study the generalization properties of bio-plausible learning rules from a geometric perspective.
3 Results
In this section, we first describe the network and learning setup we use (Figure 1A). Next, we present a number of numerical experiments in which we compute the generalization gap directly on three commonly used ML and neuroscience tasks (Figure 2A-C) for truncation-based bio-plausible gradient approximations. For the same experiments, we also quantify loss landscape curvature along learning trajectories and connect these quantities to generalization behavior (Figures 2D-F and 3). Finally, we provide theoretical arguments and a theorem that explains how gradient alignment in bio-plausible gradient approximations can affect curvature preference (Theorem 1, Figure 3) and thus generalization. Through additional experiments, we verify the predictive power of our theory. We conclude with a discussion of potential remedies used by the brain (Appendix Figure 5) and future directions.
3.1 Network and learning setup
The detailed governing equations of our setup can be found in Methods (Appendix A). We consider an RNN with $N_{in}$ input units, $N$ hidden units and $N_{out}$ readout units (Figure 1A). We verified that trends hold for different network sizes and refer the reader to Appendix A.3 for more details. The update formula for $h_t$ (the hidden state at time $t$) is governed by:

$$h_t = F(h_{t-1}, x_t), \qquad (1)$$

where $F$ is the hidden state update function, $\phi$ is the activation function, $W^{rec}$ (resp. $W^{in}$) is the recurrent (resp. input) weight matrix and $x_t$ is the input. For $F$, we consider a discrete-time implementation of a rate-based recurrent neural network (RNN) similar to the form used in prior work (details in Appendix A). The readout $y_t$, with readout weights $W^{out}$, is defined as

$$y_t = W^{out} h_t. \qquad (2)$$

We performed experiments on three tasks: sequential MNIST, pattern generation and delayed match-to-sample tasks. The objective is to minimize the scalar loss $\mathcal{L}$, which (for regression tasks) is defined as

$$\mathcal{L} = \frac{1}{2B} \sum_{b=1}^{B} \sum_{t=1}^{T} \left\lVert y_{t,b} - y^{*}_{t,b} \right\rVert^2, \qquad (3)$$

given target readout $y^{*}$, task duration $T$ and batch size $B$. We consider cross-entropy loss for classification tasks (Methods in Appendix A).
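To make this setup concrete, here is a minimal NumPy sketch of a generic discrete-time rate RNN with a linear readout and a regression loss of the form above. The leak factor, sizes, initialization and toy target are illustrative assumptions, not the paper's exact hyperparameters (and the loss here averages over time with a single batch element).

```python
import numpy as np

rng = np.random.default_rng(0)
N_in, N, N_out, T = 3, 16, 1, 50  # assumed sizes, for illustration only

W_in = rng.normal(0.0, 1.0 / np.sqrt(N_in), (N, N_in))
W_rec = rng.normal(0.0, 1.0 / np.sqrt(N), (N, N))
W_out = rng.normal(0.0, 1.0 / np.sqrt(N), (N_out, N))
leak = 0.1  # assumed leak / integration factor

def forward(x):
    """h_t = (1 - leak) h_{t-1} + leak * tanh(W_rec h_{t-1} + W_in x_t); y_t = W_out h_t."""
    h = np.zeros(N)
    ys = []
    for t in range(T):
        h = (1 - leak) * h + leak * np.tanh(W_rec @ h + W_in @ x[t])
        ys.append(W_out @ h)
    return np.array(ys)

x = rng.normal(size=(T, N_in))
y_star = np.sin(np.linspace(0, 2 * np.pi, T))[:, None]   # toy target readout
y = forward(x)
loss = 0.5 * np.mean(np.sum((y - y_star) ** 2, axis=1))  # MSE-style regression loss (B = 1)
```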
The different learning algorithms examined in this work are BPTT (our benchmark), which updates weights by computing the exact gradient:

$$\Delta W = -\eta \, \nabla_W \mathcal{L}, \qquad (4)$$

and three SoTA bio-plausible learning rules that update weights using an approximate gradient:

$$\Delta W = -\hat{\eta} \, \widehat{\nabla_W \mathcal{L}}, \qquad (5)$$

where $\widehat{\nabla_W \mathcal{L}}$ denotes a gradient approximation and $\eta$ (resp. $\hat{\eta}$) denotes the learning rate. These three learning rules are explained further in Appendix A (Methods), but we note that these bio-plausible learning rules are based on truncations of dependency paths — on the computational graph for the exact gradient — that are biologically implausible to compute (Figure 1A). In all figures, the learning rules are labelled as "Three-factor" (symmetric e-prop), "RF Three-factor" (RFLO) and "MDGL", respectively. We remark that the focus here is on comparing artificial to bio-plausible learning rules, rather than comparing among biological rules.
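As a minimal illustration of truncation-based gradient approximation, the sketch below compares the exact BPTT gradient of a last-step loss in a small linear RNN against a deliberately crude one-step truncation that keeps only the most recent time step's dependency. This is much cruder than e-prop/RFLO/MDGL and is ours for intuition only, but it shows the shared idea of dropping long dependency paths on the computational graph.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 4, 10
W = rng.normal(0.0, 0.25, (N, N))        # small recurrent weights (stable dynamics)
xs = rng.normal(size=(T, N))
y_star = rng.normal(size=N)

def run(Wm):
    h = np.zeros(N)
    hs = [h]
    for t in range(T):
        h = Wm @ h + xs[t]               # linear RNN: h_t = W h_{t-1} + x_t
        hs.append(h)
    return hs

def loss(Wm):
    return 0.5 * np.sum((run(Wm)[-1] - y_star) ** 2)  # loss at the last step only

# Exact BPTT gradient: backpropagate the error through every time step.
hs = run(W)
delta = hs[-1] - y_star
g_exact = np.zeros_like(W)
for t in range(T, 0, -1):
    g_exact += np.outer(delta, hs[t - 1])
    delta = W.T @ delta

# Illustrative truncation: keep only the most recent step's dependency.
g_trunc = np.outer(hs[-1] - y_star, hs[-2])

# Sanity-check the exact gradient against central finite differences.
eps, i, j = 1e-6, 0, 1
Wp, Wn = W.copy(), W.copy()
Wp[i, j] += eps
Wn[i, j] -= eps
fd = (loss(Wp) - loss(Wn)) / (2 * eps)
```

The finite-difference check matches `g_exact`, while `g_trunc` differs from it: the truncated rule follows a systematically different direction, which is exactly the approximation error the paper analyzes.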
Finally, we note that tasks were learned with mostly comparable training accuracies for all learning rules, and that the generalization gaps reflect the departure of test performance from these values. We refer the reader to the supplemental materials for more details (Appendix A.3).
3.2 Generalization gap and loss landscape curvature
To study generalization performance, our first step is to compute the generalization gap empirically. The generalization gap is defined as the accuracy on the training dataset minus that on the test set; the larger the generalization gap, the worse the generalization performance. For the various learning rules, we plot the histogram of the generalization gap at the end of training across runs with distinct initializations; different colors represent different learning rules (Figure 2A-C). Notice that the bio-plausible rules achieve worse and more variable generalization performance than their machine learning counterpart (BPTT, in black).
We now investigate whether the generalization gap behavior described above correlates with loss landscape curvature. We use the leading eigenvalue of the loss' Hessian (where derivatives are taken with respect to model parameters) as a measure of curvature, following previous practice, and note that it is practical for both empirical and theoretical analyses (Theorem 1). Other measures of flatness exist, and due to the scale-dependence issue of the Hessian spectrum, we also test parameter scale-independent measures (see Appendix Figures 7 and 8). When we plot the generalization gap points from Figure 2A-C against the corresponding leading Hessian eigenvalue, a statistically significant correlation is observed (Figure 2D-F). We note that we did not expect this relation to be very tight since, in addition to a worse generalization gap on average, bio-plausible learning rules exhibit increased variance. This is an important and consistent trend we observed that might have been overlooked in previous studies.
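The leading Hessian eigenvalue can be estimated without ever forming the Hessian, e.g. by power iteration on Hessian-vector products. The sketch below (our illustration, not the paper's implementation) uses a quadratic with a known Hessian so the estimate can be checked; in practice the same loop runs on autodiff gradients of the network loss.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
Q = rng.normal(size=(n, n))
A = Q @ Q.T  # positive semi-definite matrix playing the role of the loss Hessian

def grad(w):
    # Gradient of the quadratic "loss" f(w) = 0.5 * w @ A @ w
    return A @ w

def hvp(w, v, eps=1e-5):
    """Hessian-vector product via central finite differences of the gradient."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def leading_hessian_eig(w, iters=1000):
    """Power iteration on Hessian-vector products; returns a Rayleigh-quotient estimate."""
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        Hv = hvp(w, v)
        lam = v @ Hv                    # Rayleigh quotient with normalized v
        v = Hv / np.linalg.norm(Hv)
    return lam

w0 = rng.normal(size=n)
lam_est = leading_hessian_eig(w0)
lam_true = float(np.max(np.linalg.eigvalsh(A)))
```

Each iteration needs only two gradient evaluations, which keeps the cost manageable even for large RNN weight vectors.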
So far, we have investigated the endpoints of optimization trajectories, where training performance has converged. Now, we visualize the whole training trajectory. This is done for two reasons: 1) to account for early stopping, which can halt training anywhere along the trajectory due to time constraints; 2) flatness during training could give indications of avoiding or escaping high-curvature regions. We observe that the biologically motivated gradient approximations tend to rapidly approach high-curvature regions compared to their machine learning counterpart (Figure 3). Together, these results demonstrate a clear trend and a link between generalization gap and loss landscape curvature: both are increased and more variable for bio-plausible rules. We also note that this curvature convergence behavior seems to be a problem shared by temporal truncations of the gradient in general (Appendix Figure 9), which is what the existing bio-plausible gradient approximations for RNNs are based on. Next, we provide a theoretical argument as to why truncated temporal credit assignment rules favor high-curvature regions of the loss landscape.
3.3 Theoretical analysis: link between curvature and gradient approximation error
We discussed how generalization can be linked to curvature; we now examine the link between curvature and gradient alignment during learning dynamics. We represent approximate gradients in a rule-agnostic manner, where an arbitrary approximation is represented in terms of its component along the gradient direction plus an arbitrary orthogonal vector (Figure 4A):

$$\hat{g} = \alpha g + \epsilon, \qquad (6)$$

where $g$ and $\hat{g}$ are the exact and approximate gradients, respectively (we reshaped $\nabla_W \mathcal{L}$ into a vector here); $\epsilon$ is an arbitrary vector orthogonal to $g$. Here, the scalar $\alpha$ represents the relative step length along the gradient direction that the approximate rule is making. As we will see in Theorem 1, $\alpha$ is an important quantity in our analysis. One can easily compute $\alpha$ from $g$ and $\hat{g}$ by $\alpha = \hat{g}^{\top} g / \lVert g \rVert^2$.
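A quick numerical check of this decomposition (with synthetic vectors standing in for real gradients): construct an approximation with a known component along the gradient plus an orthogonal error, then recover the relative step length by projection.

```python
import numpy as np

rng = np.random.default_rng(3)
g = rng.normal(size=100)                 # stand-in exact gradient (flattened)
e = rng.normal(size=100)
e -= (e @ g) / (g @ g) * g               # project out the component along g, so e is orthogonal to g
g_hat = 0.6 * g + e                      # synthetic approximate gradient with alpha = 0.6

alpha = (g_hat @ g) / (g @ g)            # recovered relative step length along g
```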
We express weight updates as discrete dynamical systems (with weights as the state variables):

$$W_{k+1} = G(W_k) = W_k - \eta \, g(W_k), \qquad (7)$$

$$W_{k+1} = \hat{G}(W_k) = W_k - \hat{\eta} \, \hat{g}(W_k), \qquad (8)$$

where $\eta$ (resp. $\hat{\eta}$) is the learning rate, $\hat{g}$ is an approximate gradient, and $G$ (resp. $\hat{G}$) denotes the map defined by the BPTT (resp. an approximate) weight update rule. We note in passing that this dynamical systems view of weight updates has been used previously [136, 45].
We introduce some additional notation before presenting our main theoretical result. $J$ (resp. $\hat{J}$) is the Jacobian of the dynamical system of BPTT (resp. an approximate rule) in Eq. 7 (resp. Eq. 8). $\lambda_1$ (resp. $\hat{\lambda}_1$) is the leading eigenvalue of the BPTT (resp. approximate rule) Jacobian. $\mu_1$ is the leading eigenvalue of the loss' Hessian matrix. $W^{*}$ (resp. $\hat{W}^{*}$) is the final fixed point for BPTT (resp. an approximate rule). We are now ready to present Theorem 1.
Theorem 1. Consider an RNN defined in Eq. 1 with a single scalar output and least squares loss as in Eq. 3, presented only at the last time step $T$, with weights updated according to the difference equation for BPTT (Eq. 7) (resp. the approximate rule, Eq. 8) on a single example (stochastic gradient descent) using learning rate $\eta$ (resp. $\hat{\eta}$). In the limit of stable fixed point convergence with zero training error, the dominant loss' Hessian eigenvalue attained by BPTT (resp. the approximate rule) is bounded by $\mu_1 \le 2/\eta$ (resp. $\mu_1 \le 2/(\alpha \hat{\eta})$).
Observe that the Jacobian of the BPTT discrete-time dynamical system (Eq. 7) is precisely the loss' Hessian scaled by a constant:

$$J = I - \eta H.$$

Using this relationship, we can bound $\mu_1$ from $|\lambda_1| \le 1$, which is the condition for the discrete-time dynamical system to converge to a stable fixed point: $|1 - \eta \mu_1| \le 1$ implies $\mu_1 \le 2/\eta$. However, the link between the Jacobian of an approximate update rule and the loss' Hessian is less obvious. Thus, we derive a link between $\hat{J}$ and $H$, and then apply the step above.
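This stability condition can be checked directly on a one-dimensional quadratic $\mathcal{L}(w) = \frac{1}{2}\mu w^2$, where the gradient-descent map is $w \mapsto (1 - \eta\mu)\,w$: minima with curvature below $2/\eta$ attract the iteration, while sharper ones repel it. The following is a toy sketch of ours, not the paper's experiment.

```python
eta = 0.1  # learning rate; stability bound on curvature is 2 / eta = 20

def run_gd(mu, steps=100, w0=1.0):
    """Gradient descent on L(w) = 0.5 * mu * w**2; the update map is w -> (1 - eta*mu) * w."""
    w = w0
    for _ in range(steps):
        w -= eta * mu * w
    return abs(w)

w_flat = run_gd(mu=0.5 * (2 / eta))   # |1 - eta*mu| = 0   -> converges to the minimum
w_sharp = run_gd(mu=1.2 * (2 / eta))  # |1 - eta*mu| = 1.4 -> iterates diverge from the minimum
```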
The consequence of the upper bound derived in Theorem 1 is that truncated gradient rules can converge to minima with a higher dominant Hessian eigenvalue than BPTT, with the leading eigenvalue bound inversely proportional to $\alpha$ (usually $\alpha < 1$). In practice, $\alpha$ can vary depending on task settings; in our setup, we observed it to remain below 1. We remark that this higher upper bound is consistent with the increased spread of curvature and generalization observed for bio-plausible rules in experiments. Theorem 1 highlights the scalar $\alpha$ (Eq. 6), the relative step length along the gradient direction, as an important factor in the curvature bound. To test this, we eliminate the factor of $\alpha$ by rescaling the BPTT update such that its step length is matched to that of the three-factor rule. This resulted in the blue curves in Figure 4, which are still trained using BPTT but with the update scaled by a factor of $\alpha$. By matching the step lengths of BPTT and a three-factor rule along the gradient direction, similar convergence behaviors were observed. This result attributes the curvature preference behavior to the relative step length along the gradient direction, thereby indicating a link between curvature and gradient alignment. Consistent with earlier results, when the step length of BPTT is matched to that of a three-factor rule, its corresponding generalization performance also worsened (Appendix Figure 6).
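The step-length-matching idea has a simple one-dimensional analogue (illustrative only, with arbitrary values): scaling the gradient step by $\alpha$ raises the curvature at which the iteration remains stable from $2/\eta$ to $2/(\alpha\eta)$, so the scaled update can settle in sharper minima that full-step gradient descent escapes.

```python
eta, a = 0.1, 0.5                  # learning rate and relative step length alpha
mu = 1.5 * (2 / eta)               # curvature above BPTT's 2/eta bound, below 2/(a*eta)

def run(scale, steps=200, w0=1.0):
    """Gradient descent on L(w) = 0.5 * mu * w**2 with the step scaled by `scale`."""
    w = w0
    for _ in range(steps):
        w -= eta * scale * mu * w  # update map: w -> (1 - scale*eta*mu) * w
    return abs(w)

w_full = run(1.0)   # |1 - eta*mu| = 2   -> diverges at this curvature
w_scaled = run(a)   # |1 - a*eta*mu| = 0.5 -> converges: stably sits in the sharper minimum
```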
We make a few more remarks regarding Theorem 1. $\alpha$ is usually less than 1 because otherwise the additional orthogonal component $\epsilon$ would imply a larger update size for the approximate rule compared to BPTT. This would make the approximate rule more prone to numerical instabilities (Eq. 9) and would require a change in solver hyperparameters, such as the overall learning rate, which in turn influences the update size. Curiously, $\epsilon$ did not seem to play a helpful role in finding flatter minima; we provide the reasoning for this in Figure 4B and Appendix Figure 10. Despite making assumptions including scalar output, MSE loss and loss available only at the last step, our empirical results indicate that our conclusions extend to other setups: vector output, cross-entropy loss and loss that accumulates over time steps (Figure 4). We discuss the generality of Theorem 1 in Appendix B.
One might assume that increasing the learning rate of the approximate rules could compensate for the reduced step length due to $\alpha$. However, this can raise other issues. Suppose $\hat{\eta} = \eta$ initially, and we increase the learning rate of the approximate update until $\alpha \hat{\eta} = \eta$ (the magnitude of $\epsilon$ is scaled accordingly); then, because $\epsilon \perp g$:

$$\lVert \hat{\eta} \hat{g} \rVert^2 = \lVert \eta g \rVert^2 + \lVert \hat{\eta} \epsilon \rVert^2 \ge \lVert \eta g \rVert^2. \qquad (9)$$

In other words, if approximate updates make the same amount of progress as BPTT along the gradient direction, $\hat{\eta} \hat{g}$ has a larger magnitude due to the orthogonal component $\epsilon$, and a large update magnitude can be very problematic for numerical stability. Thus, the magnitude of $\epsilon$ limits the learning rate that can be used. To balance this stability issue against the potential benefit of large learning rates, one can consider using a large learning rate early in training to prevent premature stabilization in sharp minima, followed by gradual decay to mitigate the stability issue (see Appendix Figure 5).
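A small numeric check of this argument (again with synthetic vectors standing in for gradients): matching the progress along $g$ by raising the learning rate to $\hat{\eta} = \eta/\alpha$ necessarily inflates the total update norm through the orthogonal component.

```python
import numpy as np

rng = np.random.default_rng(4)
g = rng.normal(size=50)                      # stand-in exact gradient
e = rng.normal(size=50)
e -= (e @ g) / (g @ g) * g                   # orthogonal error component, e perpendicular to g
alpha, eta = 0.5, 0.1
g_hat = alpha * g + e                        # approximate gradient

eta_hat = eta / alpha                        # raise the rate so alpha * eta_hat = eta
u_exact = eta * g
u_approx = eta_hat * g_hat                   # same progress along g...
progress = (u_approx @ g) / (g @ g)          # ...equal to eta, but with a strictly larger norm
```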
Taken together, Theorem 1 and Eq. 9 connect gradient alignment with the curvature of the converged solution. For an approximation that is poorly aligned with the exact gradient, a large approximation error vector $\epsilon$ limits the step length for stability reasons (Eq. 9), and a small relative step length corresponds to a larger curvature bound (Theorem 1). This is consistent with the larger spread in curvature and generalization gap observed for bio-plausible learning rules (Figure 2, Appendix Figure 6).
5 Discussion
While developing bio-plausible learning rules is of interest both for answering neuroscience questions and for finding more efficient training strategies for artificial networks, the generalization properties of the solutions found by these rules are severely underexamined. Through various well-known machine learning and working memory tasks, we first demonstrate empirically that existing bio-plausible learning rules in RNNs attain worse generalization performance, consistent with their tendency to converge to high-curvature regions of the loss landscape. Second, our theoretical analysis offers an explanation for this preference for high-curvature regions based on worse alignment with the true gradient. This regime corresponds to the situation where the step length along the gradient direction is small and the approximation error vector is large. Finally, we test this theory empirically by matching the relative step length along the gradient direction, resulting in similar convergence behavior (Figure 4).
Our study — a stepping stone toward understanding biological generalization using deep learning methods — raises many exciting questions for future investigations, both on the front of stronger deep learning theory and more sophisticated biological ingredients.
Deep learning theory and its implications:
In this study, we investigate generalization properties using loss landscape flatness, a promising predictor with a recent rigorous connection to the generalization gap. Yet, the extent to which flatness can explain generalization is still an open question in deep learning. For instance, curvature is a local measure, which limits how informative it can be about robustness to global perturbations. Moreover, the theoretical association between flatness and generalization is provided in the form of an upper bound [119, 157], and the bound may not always be tight, which is consistent with the greater variability in generalization gap for bio-plausible learning rules but offers less predictive power. Despite observing a significant correlation between generalization gap and a curvature-based measure in Figure 2, the relationship appears noisy, suggesting that other factors are involved in explaining generalization. Our results are also consistent with existing findings linking learning rate to loss landscape curvature in deep networks. In the case of bio-plausible gradient approximations, a small step length along the gradient direction cannot be compensated for by increasing the learning rate, as that would inadvertently increase the error vector, causing numerical issues (Eq. 9). Curiously, the approximation error vector did not seem to play a role other than restricting the learning rate. While it is well known that stochastic gradient noise (SGN) can help with finding flat minima due to the alignment of the SGN covariance and the Hessian near minima [185, 164, 170, 143], this may not apply to an approximation error vector resulting primarily from temporal truncation of the gradient (see Figure 1A and Appendix Figure 10). Given that developing better predictors of generalization is still a work in progress, we anticipate stronger theoretical tools to be applied to studying biological generalization in the future.
Toward more detailed biological mechanisms:
On the front of more sophisticated biological ingredients that may improve generalization performance, we see two lines of approach: (1) develop bio-plausible learning rules that align better with the gradient, as suggested by our rule-agnostic analysis (Theorem 1, Figure 4); and (2) instead of studying learning rules in isolation, consider also other neural circuit components [56, 15, 27, 28, 103, 178, 71, 14] that could interact with the learning rule. An important component would be the architecture, including the connectivity structure and neuron model, found through evolution. To address our main question, this study focuses on the effect of various learning rules while holding data, objective function and architecture constant. However, these different components can interact, and more sophisticated architectures can facilitate task learning [63, 117, 88, 175, 153, 167, 41, 147]. Given the exploding parameter space resulting from such interactions, we believe this requires careful future analysis and is outside the scope of this paper.
Additionally, learning rate modulation [77, 92] could be one of many possible remedies employed by the brain. Based on our scheduled learning rate experiments (Appendix Figure 5), we conjecture that neuromodulatory mechanisms could be coupled with these learning rules to improve convergence behavior: an initially high learning rate could prevent the learning trajectory from settling in sharp minima prematurely, followed by gradual decay to avoid instabilities. One possible way to realize such learning rate modulation could be through serotonin neurons via uncertainty tracking, where the learning rate is high when the reward prediction error is high (as can happen at the beginning of learning). Since inhibiting serotonin has been shown to lead to a failure of learning rate modulation, we conjecture such inhibition might have an impact on the generalization performance of learning outcomes. This also connects with the finding that sensory depletion during critical periods in training deep networks, which can be related to a small learning rate early in training, can impair learning and yield convergence to sharp minima. Other ingredients for future investigations could include intrinsic noise with certain structures [113, 184, 27]. Taken together, we hope to see follow-up investigations — riding on the rapid advancements both in deep learning theory and in sophisticated biological mechanisms — into how the brain attains solutions that generalize.
We thank Gauthier Gidel, Jonathan Cornford, Aristide Baratin, Thomas George and Mohammad Pezeshki for insightful discussions and helpful feedback at an early stage of this project. We also thank Henning Petzka and Michael Kamp for helpful exchanges. This research was supported by NSERC PGS-D (Y.H.L.); NSF AccelNet IN-BIC program, Grant No. OISE-2019976 AM02 (Y.H.L.); CIFAR Learning in Machines and Brains Program (B.A.R.); NSERC Discovery Grant RGPIN-2018-04821 (G.L.); FRQS Research Scholar Award, Junior 1 LAJGU0401-253188 (G.L.); Canada Research Chair in Neural Computations and Interfacing (G.L.); and the Canada CIFAR AI Chair program (G.L. & B.A.R.). We also thank the Allen Institute founder, Paul G. Allen, for his vision, encouragement, and support.
-  (2016) tensorflow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp. 265–283. Cited by: §A.3.
-  (2018) Critical learning periods in deep networks. In International Conference on Learning Representations, Cited by: §5.
-  (2020) High-dimensional dynamics of generalization error in neural networks. Neural Networks 132, pp. 428–446. Cited by: §2.2.
-  (2019) Cortical credit assignment by hebbian, neuromodulatory and inhibitory plasticity. arXiv preprint arXiv:1911.00307. Cited by: §2.1.
-  (2019) Can sgd learn recurrent neural networks with provable generalization?. Advances in Neural Information Processing Systems 32. Cited by: §2.2.
-  (2019) Deep learning with asymmetric connections and hebbian updates. Frontiers in computational neuroscience 13, pp. 18. Cited by: §2.1.
-  (2021) The functional role of sequentially neuromodulated synaptic plasticity in behavioural learning. PLoS Computational Biology 17 (6), pp. e1009017. Cited by: §2.1.
-  (2020) Inferring learning rules from animal decision-making. Advances in Neural Information Processing Systems 33, pp. 3442–3453. Cited by: §2.1.
A normative and biologically plausible algorithm for independent component analysis. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
-  (2020) Shaping the learning landscape in neural networks around wide flat minima. Proceedings of the National Academy of Sciences 117 (1), pp. 161–170. Cited by: §2.2.
Implicit regularization via neural feature alignment.
International Conference on Artificial Intelligence and Statistics, pp. 2269–2277. Cited by: §2.2.
-  (2018) Long short-term memory and learning-to-learn in networks of spiking neurons. Advances in neural information processing systems 31. Cited by: §A.3.
-  (2020-12) A solution to the learning dilemma for recurrent networks of spiking neurons. Nature Communications 11 (1), pp. 1–15. External Links: Cited by: §A.2, §A.2, §1, §2.1.
-  (2020) Systematic integration of structural and functional data into multi-scale models of mouse primary visual cortex. Neuron 106 (3), pp. 388–403. Cited by: §5.
-  (2021) Online learning of neural computations from sparse temporal feedback. Advances in Neural Information Processing Systems 34. Cited by: §5.
-  (2021) Impression learning: online representation learning with synaptic plasticity. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
-  (2020) Learning efficient task-dependent representations with synaptic plasticity. Advances in Neural Information Processing Systems 33, pp. 15714–15724. Cited by: §2.1.
-  (2019) Neuromodulation of spike-timing-dependent plasticity: past, present, and future. Neuron 103 (4), pp. 563–581. Cited by: §2.1.
-  (2020) Characterizing emergent representations in a space of candidate learning rules for deep networks. Advances in Neural Information Processing Systems 33, pp. 8660–8670. Cited by: §2.1.
-  (2012-02) Conditional modulation of spike-timing-dependent plasticity for olfactory learning. Nature 482 (7383), pp. 47–51. External Links: Cited by: §2.1.
-  (2019) Entropy-sgd: biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment 2019 (12), pp. 124018. Cited by: §2.2.
-  (2021) Analysis of visual processing capabilities and neural coding strategies of a detailed model for laminar cortical microcircuits in mouse v1. bioRxiv. Cited by: §A.1.
-  (2021) Credit assignment through broadcasting a global error vector. Advances in Neural Information Processing Systems 34, pp. 10053–10066. Cited by: §2.1.
-  (2020) A meta-learning approach to (re) discover plasticity rules that carve a desired function into a neural network. Advances in Neural Information Processing Systems 33, pp. 16398–16408. Cited by: §2.1.
-  (2019) On the variance of unbiased online recurrent optimization. arXiv preprint arXiv:1902.02405. Cited by: §2.1.
-  (2022) Surrogate gradients for analog neuromorphic computing. Proceedings of the National Academy of Sciences 119 (4). Cited by: §A.2.
-  (2021) Neural population geometry reveals the role of stochasticity in robust perception. Advances in Neural Information Processing Systems 34. Cited by: §5, §5.
-  (2016) Astrocytes: orchestrating synaptic plasticity?. Neuroscience 323, pp. 43–61. Cited by: §5.
-  (2018) Full-force: a target-based method for training recurrent networks. PloS one 13 (2), pp. e0191527. Cited by: §2.1.
-  (2017) Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pp. 1019–1028. Cited by: §A.1, §2.2, §3.2.
-  (2020) Reservoir computing meets recurrent kernels and structured transforms. Advances in Neural Information Processing Systems 33, pp. 16785–16796. Cited by: §2.1.
-  (2020) Organizing recurrent network dynamics by task-computation to enable continual learning. Advances in neural information processing systems 33, pp. 14387–14397. Cited by: §2.1.
-  (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008. Cited by: §1, §2.2.
-  (2021) PsychRNN: an accessible and flexible python package for training recurrent neural network models on cognitive tasks. Eneuro 8 (1). Cited by: §A.2, §3.1.
-  (2022) Towards scaling difference target propagation by learning backprop targets. arXiv preprint arXiv:2201.13415. Cited by: §2.1.
-  (2007-12) Reinforcement Learning With Modulated Spike Timing–Dependent Synaptic Plasticity. Journal of Neurophysiology 98 (6), pp. 3648–3665. Cited by: §2.1.
-  (2021) The inverse variance–flatness relation in stochastic gradient descent is critical for finding flat minima. Proceedings of the National Academy of Sciences 118 (9). Cited by: §2.2.
-  (2007) Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural computation 19 (6), pp. 1468–1502. Cited by: §2.1.
-  (2020) Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412. Cited by: §2.2.
-  (2021) Neural optimal feedback control with local learning rules. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
-  (2022) Goal-driven optimization of single-neuron properties in artificial networks reveals regularization role of neural diversity and adaptation. bioRxiv. Cited by: §5.
-  (2018-07) Eligibility Traces and Plasticity on Behavioral Time Scales: Experimental Support of NeoHebbian Three-Factor Learning Rules. Frontiers in Neural Circuits 12, pp. 53. Cited by: §2.1.
-  (2019) An investigation into neural net optimization via hessian eigenvalue density. In International Conference on Machine Learning, pp. 2232–2241. Cited by: §2.2, Assumption 1.
-  (2022) Investigating power laws in deep representation learning. arXiv preprint arXiv:2202.05808. Cited by: §A.1.
-  (2019) Implicit regularization of discrete gradient dynamics in linear neural networks. Advances in Neural Information Processing Systems 32. Cited by: §3.3.
-  (2021) Convergence analysis and implicit regularization of feedback alignment for deep linear networks. arXiv preprint arXiv:2110.10815. Cited by: §2.1.
-  (2020) Recurrent switching dynamical systems models for multiple interacting neural populations. Advances in neural information processing systems 33, pp. 14867–14878. Cited by: §2.1.
-  (2020) A simple normative network approximates local non-hebbian learning in the cortex. Advances in neural information processing systems 33, pp. 7283–7295. Cited by: §2.1.
-  (2016) Deep learning. MIT press. Cited by: §3.3.
-  (2021) Serotonin neurons modulate learning rate through uncertainty. Current Biology. Cited by: §5.
-  (2021) Latent equilibrium: a unified learning theory for arbitrarily fast computation with arbitrarily slow neurons. Advances in Neural Information Processing Systems 34, pp. 17839–17851. Cited by: §2.1.
-  (1997) Flat minima. Neural computation 9 (1), pp. 1–42. Cited by: §2.2.
-  (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in neural information processing systems 30. Cited by: §2.2.
-  (2018) Gradient descent for spiking neural networks. Advances in neural information processing systems 31. Cited by: §A.2.
-  (2021) Local plasticity rules can learn deep representations using self-supervised contrastive predictions. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
-  (2021) Increasing liquid state machine performance with edge-of-chaos dynamics organized by astrocyte-modulated plasticity. Advances in Neural Information Processing Systems 34. Cited by: §5.
-  (2018) Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407. Cited by: §2.2.
-  (2018) Neural tangent kernel: convergence and generalization in neural networks. Advances in neural information processing systems 31. Cited by: §2.2.
-  (2018) Finding flatter minima with sgd. Cited by: §2.2, §5.
-  (2018) On the relation between the sharpest directions of dnn loss and the sgd step length. arXiv preprint arXiv:1807.05031. Cited by: §A.1, Figure 10.
-  (2020) Neurips 2020 competition: predicting generalization in deep learning. arXiv preprint arXiv:2012.07976. Cited by: §1, §2.2, §5.
-  (2019) Fantastic generalization measures and where to find them. arXiv preprint arXiv:1912.02178. Cited by: §1, §2.2.
-  (2020) Can single neurons solve mnist? the computational power of biological dendritic trees. arXiv preprint arXiv:2009.01269. Cited by: §5.
-  (2022) Questions for flat-minima optimization of modern neural networks. arXiv preprint arXiv:2202.00661. Cited by: §2.2.
-  (2020) Predictive coding in balanced neural networks with noise, chaos and delays. Advances in neural information processing systems 33, pp. 16677–16688. Cited by: §2.1.
-  (2019) The normalization method for alleviating pathological sharpness in wide neural networks. Advances in neural information processing systems 32. Cited by: §2.2.
-  (2021) Curriculum learning as a tool to uncover learning principles in the brain. In International Conference on Learning Representations, Cited by: §2.1.
-  (2016) On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Cited by: §2.2.
-  (2021) Strong inhibitory signaling underlies stable temporal dynamics and working memory in spiking neural networks. Nature Neuroscience 24 (1), pp. 129–139. Cited by: §2.1.
-  (2021) A mechanistic multi-area recurrent network model of decision-making. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
-  (2022) Biological underpinnings for lifelong learning machines. Nature Machine Intelligence 4 (3), pp. 196–210. Cited by: §5.
-  (2021) Scaling equilibrium propagation to deep convnets by drastically reducing its gradient estimator bias. Frontiers in neuroscience 15, pp. 129. Cited by: §2.1.
-  (2022) Latent circuit inference from heterogeneous neural responses during cognitive tasks. bioRxiv. Cited by: §2.1.
-  (1998) The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/. Cited by: §A.3, §3.1.
-  (2008) A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. PLoS computational biology 4 (10), pp. e1000180. Cited by: §2.1.
-  (2018) Visualizing the loss landscape of neural nets. Advances in neural information processing systems 31. Cited by: §2.2.
-  (2019) Towards explaining the regularization effect of initial large learning rate in training neural networks. Advances in Neural Information Processing Systems 32. Cited by: §5.
-  (2021) Differentiable spike: rethinking gradient-descent for training spiking neural networks. Advances in Neural Information Processing Systems 34. Cited by: §A.2.
-  (2019) Fisher-rao metric, geometry, and complexity of neural networks. In The 22nd international conference on artificial intelligence and statistics, pp. 888–896. Cited by: §2.2.
-  (2021) Hessian eigenspectra of more realistic nonlinear models. Advances in Neural Information Processing Systems 34. Cited by: §2.2.
-  (2016) Random synaptic feedback weights support error backpropagation for deep learning. Nature communications 7 (1), pp. 1–10. Cited by: §A.2, §2.1.
-  (2020) Backpropagation and the brain. Nature Reviews Neuroscience, pp. 1–12. Cited by: §2.1.
-  (2019) Backpropagation through time and the brain. Current opinion in neurobiology 55, pp. 82–89. Cited by: §1, §2.1.
-  (2015) Inferring learning rules from distributions of firing rates in cortical neurons. Nature neuroscience 18 (12), pp. 1804–1810. Cited by: §2.1.
-  (2020) Learning to learn with feedback and local plasticity. Advances in Neural Information Processing Systems 33, pp. 21213–21223. Cited by: §2.1.
-  (2020) Stable and expressive recurrent vision models. arXiv preprint arXiv:2005.11362. Cited by: §2.1.
-  (2021) Cell-type–specific neuromodulation guides synaptic credit assignment in a spiking neural network. Proceedings of the National Academy of Sciences 118 (51). Cited by: §A.2, §A.2, §A.2, §1, §2.1.
-  (2021) Thalamic control of cortical dynamics in a model of flexible motor sequencing. Cell reports 35 (9), pp. 109090. Cited by: §5.
-  (2022-01) Neurons learn by predicting future activity. Nature Machine Intelligence 4 (1), pp. 62–72. Cited by: §2.1.
-  (2021) Natural and artificial intelligence: a brief introduction to the interplay between ai and neuroscience research. Neural Networks 144, pp. 603–613. Cited by: §2.1.
-  (2020-07) Synaptic Plasticity Forms and Functions. Annual Review of Neuroscience 43 (1), pp. 95–117. Cited by: §2.1.
-  (2021) Reverse engineering learned optimizers reveals known and novel mechanisms. Advances in Neural Information Processing Systems 34. Cited by: §5.
-  (2013) Context-dependent computation by recurrent dynamics in prefrontal cortex. nature 503 (7474), pp. 78–84. Cited by: §2.1.
-  (2019) Using local plasticity rules to train recurrent neural networks. arXiv preprint arXiv:1905.12100. Cited by: §2.1.
-  (2020) A unified framework of online learning algorithms for training recurrent neural networks. Journal of Machine Learning Research 21 (135), pp. 1–34. Cited by: §1, §2.1.
-  (2021) Learning rule influences recurrent network representations but not attractor structure in decision-making tasks. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
-  (2020) A practical sparse approximation for real time recurrent learning. arXiv preprint arXiv:2006.07232. Cited by: §2.1.
-  (2021) Credit assignment in neural networks through deep feedback control. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
-  (2011) Stimulus selectivity in dorsal and ventral prefrontal cortex after training in working memory tasks. Journal of neuroscience 31 (17), pp. 6266–6276. Cited by: §3.1.
-  (2020) Activation relaxation: a local dynamical approximation to backpropagation in the brain. arXiv preprint arXiv:2009.05359. Cited by: §2.1.
Using noise to probe recurrent neural network structure and prune synapses. Advances in Neural Information Processing Systems 33, pp. 14046–14057. Cited by: §2.1.
-  (2018) Approximating real-time recurrent learning with random kronecker factors. Advances in Neural Information Processing Systems 31. Cited by: §2.1.
-  (2020) Remembrance of things practiced with fast and slow learning in cortical and subcortical pathways. Nature Communications 11 (1), pp. 1–12. Cited by: §5.
-  (2019) Local online learning in recurrent networks with random feedback. ELife 8, pp. e43299. Cited by: §A.2, §A.3, §1, §2.1.
-  (2020) Meta-learning through hebbian plasticity in random networks. Advances in Neural Information Processing Systems 33, pp. 20719–20731. Cited by: §2.1.
-  (2020) Identifying learning rules from neural network observables. Advances in Neural Information Processing Systems 33, pp. 2639–2650. Cited by: §2.1.
-  (2019) Surrogate gradient learning in spiking neural networks: bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine 36 (6), pp. 51–63. Cited by: §A.2.
-  (2020) Unfolding recurrence by green’s functions for optimized reservoir computing. Advances in Neural Information Processing Systems 33, pp. 17380–17390. Cited by: §2.1.
-  (2017) Exploring generalization in deep learning. Advances in neural information processing systems 30. Cited by: §1, §2.2.
-  (2017) Supervised learning in spiking neural networks with force training. Nature communications 8 (1), pp. 1–15. Cited by: §3.1.
-  (2018) Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, Cited by: §2.2.
-  (2021) Tensor decompositions of higher-order correlations by nonlinear hebbian plasticity. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
-  (2022) Anticorrelated noise injection for improved generalization. arXiv preprint arXiv:2202.02831. Cited by: §5.
-  (2010) Timing is not everything: neuromodulation opens the stdp gate. Frontiers in synaptic neuroscience 2, pp. 146. Cited by: §2.1.
-  (2021) Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nature neuroscience, pp. 1–10. Cited by: §2.1.
-  (2021) Sparse spiking gradient descent. Advances in Neural Information Processing Systems 34. Cited by: §A.2.
-  (2021) Neural heterogeneity promotes robust learning. Nature communications 12 (1), pp. 1–9. Cited by: §5.
-  (2021) Inferring brain-wide interactions using data-constrained recurrent neural network models. bioRxiv, pp. 2020–12. Cited by: §2.1.
-  (2021) Relative flatness and generalization. Advances in Neural Information Processing Systems 34. Cited by: §A.1, Figure 8, §1, §2.2, §5.
-  (2021) Gradient starvation: a learning proclivity in neural networks. Advances in Neural Information Processing Systems 34. Cited by: §2.2.
-  (2021) Towards biologically plausible convolutional networks. Advances in Neural Information Processing Systems 34, pp. 13924–13936. Cited by: §2.1.
-  (2018) A biologically plausible learning rule for deep learning in the brain. arXiv preprint arXiv:1811.01768. Cited by: §2.1.
-  (2021) CCN gac workshop: issues with learning in biological recurrent neural networks. arXiv preprint arXiv:2105.05382. Cited by: §1, §2.1.
-  (2019) A scale invariant flatness measure for deep network minima. arXiv preprint arXiv:1902.02434. Cited by: §2.2.
-  (2019) A deep learning framework for neuroscience. Nature neuroscience 22 (11), pp. 1761–1770. Cited by: §1, §2.1, §5.
-  (2018-02) Control of synaptic plasticity in deep cortical networks. Nature Reviews Neuroscience 19 (3), pp. 166–180. Cited by: §2.1.
-  (2018) Kernel rnn learning (kernl). In International Conference on Learning Representations, Cited by: §2.1.
-  (2019) Towards spike-based machine intelligence with neuromorphic computing. Nature 575 (7784), pp. 607–617. Cited by: §A.2.
-  (2021) The credit assignment problem in cortico-basal ganglia-thalamic networks: a review, a problem and a possible solution. European Journal of Neuroscience 53 (7), pp. 2234–2253. Cited by: §2.1.
-  (2022) Probabilistic visual processing in humans and recurrent neural networks. Journal of Vision 22 (3), pp. 24–24. Cited by: §2.1.
-  (2018) Dendritic cortical microcircuits approximate the backpropagation algorithm. arXiv preprint arXiv:1810.11393. Cited by: §2.1.
-  (2016) Eigenvalues of the hessian in deep learning: singularity and beyond. arXiv preprint arXiv:1611.07476. Cited by: §A.1, Assumption 1.
-  (2017) Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454. Cited by: §2.2, Assumption 1.
-  (2013-02) The CaMKII/NMDAR complex as a molecular memory. Molecular Brain 6 (1), pp. 10. Cited by: §2.1.
-  (2020) A deeper look at the hessian eigenspectrum of deep neural networks and its applications to regularization. arXiv preprint arXiv:2012.03801. Cited by: §2.2.
-  (2019) A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences 116 (23), pp. 11537–11546. Cited by: §3.3.
-  (2020) Reverse-engineering recurrent neural network solutions to a hierarchical inference task for mice. bioRxiv. Cited by: §2.1.
-  (2020) The interplay between randomness and structure during learning in rnns. Advances in neural information processing systems 33, pp. 13352–13362. Cited by: §2.1.
-  (2016) Dopamine reward prediction-error signalling: a two-component response. Nature Reviews Neuroscience 17 (3), pp. 183. Cited by: §2.1.
-  (2021) A rapid and efficient learning rule for biological neural circuits. BioRxiv. Cited by: §2.1.
-  (2021) Learning to time-decode in spiking neural networks through the information bottleneck. arXiv preprint arXiv:2106.01177. Cited by: §A.2.
-  (2021) Reverse engineering recurrent neural networks with jacobian switching linear dynamical systems. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
-  (2017) A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451. Cited by: §5.
-  (2020) New light on cortical neuropeptides and synaptic network plasticity. Current Opinion in Neurobiology 63, pp. 176–188. Cited by: §A.2, §2.1.
-  (2019) Single-cell transcriptomic evidence for dense intracortical neuropeptide networks. Elife 8, pp. e47889. Cited by: §A.2, §2.1.
-  (2021) Convergence and alignment of gradient descent with random backpropagation weights. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
-  (2021) Probabilistic skeletons endow brain-like neural networks with innate computing capabilities. bioRxiv. Cited by: §5.
-  (2019) High-dimensional geometry of population responses in visual cortex. Nature 571 (7765), pp. 361–365. Cited by: §A.1, §A.3.
-  (2020) Exploring the vulnerability of deep neural networks: a study of parameter corruption. arXiv preprint arXiv:2006.05620. Cited by: §2.2.
-  (2019-02) Beyond STDP — towards diverse and functionally relevant plasticity rules. Current Opinion in Neurobiology 54, pp. 12–19. Cited by: §2.1.
-  (2017) Unbiased online recurrent optimization. arXiv preprint arXiv:1702.05043. Cited by: §2.1.
-  (2020) A local temporal difference code for distributional reinforcement learning. Advances in Neural Information Processing Systems 33, pp. 13662–13673. Cited by: §2.1.
Generalized leaky integrate-and-fire models classify multiple neuron types. Nature communications 9 (1), pp. 1–15. Cited by: §A.2, §5.
-  (2020) On the interplay between noise and curvature and its effect on optimization and generalization. In International Conference on Artificial Intelligence and Statistics, pp. 3503–3513. Cited by: §2.2.
-  (1997) Numerical linear algebra. Vol. 50, Siam. Cited by: §A.1.
-  (2022) Neuromodulators generate multiple context-relevant behaviors in a recurrent neural network by shifting activity hypertubes. bioRxiv, pp. 2021–05. Cited by: §2.1.
-  (2020) Normalized flat minima: exploring scale invariant definition of flat minima for neural networks using pac-bayesian analysis. In International Conference on Machine Learning, pp. 9636–9647. Cited by: §1, §2.2, §5.
-  (2021) Charting and navigating the space of solutions for recurrent neural networks. Advances in Neural Information Processing Systems 34, pp. 25320–25333. Cited by: §2.1.
-  (2021) Biological key-value memory networks. Advances in Neural Information Processing Systems 34. Cited by: §2.1.
-  (2022) Meta-learning synaptic plasticity and memory addressing for continual familiarity detection. Neuron 110 (3), pp. 544–557. Cited by: §2.1.
-  (2021) Probing the relationship between linear dynamical systems and low-rank recurrent neural network models. arXiv preprint arXiv:2110.09804. Cited by: §2.1.
-  (2020) Computation through neural population dynamics. Annual Review of Neuroscience 43, pp. 249–275. Cited by: §2.1.
-  (2018) Identifying generalization properties in neural networks. arXiv preprint arXiv:1809.07402. Cited by: §2.2.
-  (2019) How noise affects the hessian spectrum in overparameterized neural networks. arXiv preprint arXiv:1910.00195. Cited by: §5.
-  (2019) Theories of Error Back-Propagation in the Brain. Trends in Cognitive Sciences. Cited by: §2.1.
-  (1995) Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications, Y. Chauvin and D. E. Rumelhart (Eds.), pp. 433–486. Cited by: §2.1.
-  (2022) Heterogeneity in neuronal dynamics is learned by gradient descent for temporal processing tasks. bioRxiv. Cited by: §5.
-  (2020) Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems 33, pp. 2958–2969. Cited by: §2.2.
-  (2021) Training feedback spiking neural networks by implicit differentiation on the equilibrium state. Advances in Neural Information Processing Systems 34. Cited by: §A.2.
-  (2020) A diffusion theory for deep learning dynamics: stochastic gradient descent exponentially favors flat minima. arXiv preprint arXiv:2002.03495. Cited by: §B.3, §2.2, §5.
-  (2022) On the power-law spectrum in deep learning: a bridge to protein science. arXiv preprint arXiv:2201.13011. Cited by: §A.1, §2.2.
-  (2012) Robustness and generalization. Machine learning 86 (3), pp. 391–423. Cited by: §2.2.
-  (2014-09) A critical time window for dopamine actions on the structural plasticity of dendritic spines. Science 345 (6204), pp. 1616–1620. Cited by: §2.1.
-  (2019) Task representations in neural networks trained to perform many cognitive tasks. Nature neuroscience 22 (2), pp. 297–306. Cited by: §2.1.
-  (2021) Next-generation of recurrent neural network models for cognition. Cited by: §5.
-  (2020) Pyhessian: neural networks through the lens of the hessian. In 2020 IEEE international conference on big data (Big data), pp. 581–590. Cited by: §A.1, §2.2, §3.2.
-  (2018) Hessian-based analysis of large batch training and robustness to adversaries. Advances in Neural Information Processing Systems 31. Cited by: §A.1, §2.2, §3.2.
-  (2019) A critique of pure learning and what artificial neural networks can learn from animal brains. Nature communications 10 (1), pp. 1–7. Cited by: §5.
-  (2018) Superspike: supervised learning in multilayer spiking neural networks. Neural computation 30 (6), pp. 1514–1541. Cited by: §A.2.
-  (2020) Brain-inspired learning on neuromorphic substrates. arXiv preprint arXiv:2010.11931. Cited by: §A.2, §2.1.
-  (2018) Theory of deep learning iib: optimization properties of sgd. arXiv preprint arXiv:1801.02254. Cited by: §2.2.
-  (2020) Temporal spike sequence learning via backpropagation for deep spiking neural networks. Advances in Neural Information Processing Systems 33, pp. 12022–12033. Cited by: §A.2.
-  (2021) Regularizing neural networks via adversarial model perturbation. In , pp. 8156–8165. Cited by: §2.2.
-  (2019) Toward understanding the importance of noise in training neural networks. In International Conference on Machine Learning, pp. 7594–7602. Cited by: §5.
-  (2018) The anisotropic noise in stochastic gradient descent: its behavior of escaping from sharp minima and regularization effects. arXiv preprint arXiv:1803.00195. Cited by: §5.
Appendix A Methods
A.1 Hessian eigenspectrum analysis
As mentioned, we focus on the leading Hessian eigenvalue because it has been used previously [177, 60] and is tractable for both empirical [176, 132] and theoretical analyses (Theorem 1). The leading Hessian eigenvalue can be computed by power iteration on Hessian-vector products, without ever forming the full Hessian matrix (Algorithm 2 in ). To find multiple top eigenvalues, e.g. the top 200, one can use the generalized power method via QR decomposition. We focused on the Hessian of the loss with respect to the recurrent weights, but observed a similar trend for the input weights as well. The stopping procedure for the power iteration was similar to that of ; we set the tolerance to , and observed that all runs in our experiments converged before reaching the maximum number of iterations allowed, which was set to .
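As an illustration, the matrix-free computation of the leading eigenvalue can be sketched as follows (a minimal NumPy sketch; a toy explicit matrix stands in for the Hessian-vector product, which in practice is obtained from two nested automatic-differentiation passes, and all names are ours):

```python
import numpy as np

def leading_hessian_eigenvalue(hvp, dim, tol=1e-4, max_iter=100, seed=0):
    """Power iteration using only Hessian-vector products (no full Hessian).

    hvp: callable v -> H @ v. Returns the dominant eigenvalue estimate.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    eig = 0.0
    for _ in range(max_iter):
        Hv = hvp(v)
        new_eig = float(v @ Hv)              # Rayleigh quotient estimate
        v = Hv / (np.linalg.norm(Hv) + 1e-12)
        if abs(new_eig - eig) < tol * max(abs(new_eig), 1e-12):
            return new_eig
        eig = new_eig
    return eig

# Toy check: an explicit symmetric matrix plays the role of the Hessian.
H = np.diag([5.0, 2.0, -1.0])
lam = leading_hessian_eigenvalue(lambda v: H @ v, dim=3)
```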
Due to the scale-dependence issue of the Hessian spectrum , we also used scale-independent measures. For instance, we examined the recently proposed relative flatness measure  (Figure 8 in Appendix), as well as the power-law decay coefficient of the Hessian eigenvalues (Figure 7 in Appendix). We used the code in  to fit a power-law distribution to the top 200 eigenvalues; we found similar results when using the top 50 or 100 eigenvalues instead, and 200 was chosen mainly due to computational load. We note that the link between power-law decay and generalization has also been examined in recent studies [44, 171, 22].
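The decay coefficient can be estimated, for example, by least squares in log-log space (a simplified stand-in for the fitting code referenced above; the function name is ours):

```python
import numpy as np

def powerlaw_decay_coefficient(eigvals):
    """Fit lambda_k ~ c * k^(-s) to an eigenvalue list via log-log least
    squares; returns the decay exponent s (larger = faster decay)."""
    lam = np.sort(np.asarray(eigvals))[::-1]
    lam = lam[lam > 0]                       # log requires positive values
    k = np.arange(1, lam.size + 1)
    slope, _ = np.polyfit(np.log(k), np.log(lam), 1)
    return -slope

# Synthetic spectrum with known exponent s = 1.5
k = np.arange(1, 201)
s_hat = powerlaw_decay_coefficient(3.0 * k ** -1.5)
```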
A.2 Network setup and learning rule implementations
Neuron Model: We consider a discrete-time implementation of a rate-based recurrent neural network (RNN) similar to the form in . The model denotes the internal hidden state as $h_t$ and the observable states, i.e. firing rates, as $r_t$ at time $t$, and we use the ReLU activation for the nonlinearity $f$. The dynamics of those states are governed by

$$h_t = \alpha\, h_{t-1} + W^{rec} r_{t-1} + W^{in} x_t, \qquad r_t = f(h_t), \tag{10}$$

where $\alpha = e^{-\Delta t/\tau_m}$ denotes the leak factor for simulation time step $\Delta t$ and membrane time constant $\tau_m$, $W^{rec}$ denotes the recurrent synaptic connection weights, $W^{in}$ denotes the input synaptic connection weights, and $x_t$ denotes the external input at time $t$. We use subscripts to represent indices of neurons and time steps; for instance, $h_{i,t}$ represents the hidden activity of neuron $i$ at time $t$, and $W^{rec}_{ij}$ represents the $(i,j)$ entry of the recurrent weight matrix $W^{rec}$. The model in Eq. 10 was used for the sequential MNIST and pattern generation tasks.
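A minimal simulation of these dynamics might look as follows (a NumPy sketch assuming a zero initial state; not our released implementation):

```python
import numpy as np

def rnn_forward(x, W_in, W_rec, alpha):
    """Simulate the rate RNN of Eq. 10:
    h_t = alpha * h_{t-1} + W_rec @ r_{t-1} + W_in @ x_t,  r_t = relu(h_t).

    x: (T, n_in) input sequence. Returns firing rates r of shape (T, n).
    """
    T, n = x.shape[0], W_rec.shape[0]
    h = np.zeros(n)
    r = np.zeros(n)
    rates = np.zeros((T, n))
    for t in range(T):
        h = alpha * h + W_rec @ r + W_in @ x[t]
        r = np.maximum(h, 0.0)               # ReLU firing rates
        rates[t] = r
    return rates

# Example: two neurons, no recurrence, constant input.
rates = rnn_forward(np.ones((3, 1)),
                    W_in=np.array([[1.0], [-1.0]]),
                    W_rec=np.zeros((2, 2)), alpha=0.5)
```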
We mention in passing that the choice of ReLU activation, which has a discontinuous first derivative, means that the loss Hessian matrix is not guaranteed to be symmetric. A real matrix that is not symmetric can have complex eigenvalues, which come in conjugate pairs; if these were among the top eigenvalues, power iteration may not converge. However, all iterations converged in our experiments, as mentioned above. Because of the potential technical issues resulting from non-symmetric Hessian matrices, we also foresee challenges in applying our methodology to spiking neural networks (SNNs), whose activation functions are discontinuous. Given the energy efficiency and biological realism of SNNs [107, 128, 26, 54, 179, 182, 169, 116, 78, 141, 180], we believe extending our analysis to SNNs is an important future direction.
For the delayed match to sample task, a working memory task, it was found in  and  that units with an adaptive threshold as an additional hidden variable can play an important role in the computing capabilities of RNNs. Thus, we implemented adaptive threshold neuron units  for that task. In our rate-based implementation, this amounts to adding a second hidden variable that represents the dynamic threshold component:

$$a_t = \rho\, a_{t-1} + r_{t-1}, \qquad r_t = f(h_t - a_t), \tag{12}$$

where $a_t$ denotes the dynamic threshold that adapts based on past neuron activity. The decay factor is given by $\rho = e^{-\Delta t/\tau_a}$ for simulation time step $\Delta t$ and adaptation time constant $\tau_a$, which is typically chosen on the behavioral task time scale .
Network output and loss function: The readout is defined as

$$y_t = W^{out} r_t$$

for readout weights $W^{out}$. We quantify how well the network output matches the desired target using a loss function $E$, which in the case of regression tasks is defined as the mean squared error

$$E = \frac{1}{2BT} \sum_{b=1}^{B} \sum_{t=1}^{T} \sum_{k=1}^{N_{out}} \left( y_{k,t,b} - y^{*}_{k,t,b} \right)^2$$

for target output $y^{*}$, task duration $T$, $N_{out}$ output neurons and batch size $B$. For classification tasks, we use the cross-entropy loss

$$E = -\frac{1}{B} \sum_{b=1}^{B} \sum_{k=1}^{N_{out}} y^{*}_{k,b} \log \hat{p}_{k,b},$$

where $y^{*}_{k,b}$ is the one-hot encoded target and $\hat{p}_{k,b}$ is the predicted category probability.
Biological gradient approximations (truncation-based)
The goal of this subsection is to explain where the approximation happens for each of the bio-plausible learning rules. For full details regarding these rules, we encourage the reader to refer to the respective references. We start by writing down the gradient in terms of the real-time recurrent learning (RTRL) factorization:

$$\frac{dE}{dW^{rec}_{ij}} = \sum_{t} \sum_{k} \frac{\partial E}{\partial h_{k,t}}\, \frac{d h_{k,t}}{d W^{rec}_{ij}}. \tag{15}$$

Key problems that RTRL poses for biological plausibility and computational cost reside in the second factor that arises during the factorization of the gradient (Eq. 15). This factor keeps track of all recursive dependencies of $h_{k,t}$ on weight $W^{rec}_{ij}$ arising from recurrent connections. These recurrent dependencies can be obtained recursively as follows:

$$\frac{d h_{k,t}}{d W^{rec}_{ij}} = \sum_{l} \frac{\partial h_{k,t}}{\partial h_{l,t-1}}\, \frac{d h_{l,t-1}}{d W^{rec}_{ij}} + \delta_{ki}\, r_{j,t-1}, \qquad \frac{\partial h_{k,t}}{\partial h_{l,t-1}} = \alpha\, \delta_{kl} + W^{rec}_{kl}\, f'(h_{l,t-1}). \tag{17}$$

Thus, the factor $\frac{d h_{k,t}}{d W^{rec}_{ij}}$ is a third-order tensor, and it poses a serious problem for biological plausibility: it involves nonlocal terms that should be inaccessible to neural circuits, i.e., knowledge of all other weights in the network is required in order to update the weight $W^{rec}_{ij}$.
RFLO  (labeled as "RF Three-factor") and symmetric e-prop  (labeled as "Three-factor") seek to address this by truncating the expensive nonlocal term in Eq. 17 so that the update to weight $W^{rec}_{ij}$ depends only on pre- and post-synaptic activity:

$$\frac{d h_{i,t}}{d W^{rec}_{ij}} \approx \alpha\, \frac{d h_{i,t-1}}{d W^{rec}_{ij}} + r_{j,t-1} \;\equiv\; e_{ij,t},$$

which results in a much simpler factor, the per-synapse trace $e_{ij,t}$, than the triple tensor in Eq. 17.
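The truncated factor can be maintained online with a per-synapse trace, e.g. (an illustrative NumPy sketch; the function name is ours):

```python
import numpy as np

def update_eligibility(e, r_prev, alpha):
    """Truncated influence: each synapse (i, j) keeps a scalar trace driven
    only by its own presynaptic rate r_j, decaying with the leak factor.

    e: (n, n) trace matrix; r_prev: (n,) firing rates at t-1.
    """
    return alpha * e + r_prev[None, :]       # broadcasts r_j across rows i

e = np.zeros((2, 2))
for r_prev in [np.array([1.0, 2.0]), np.array([1.0, 2.0])]:
    e = update_eligibility(e, r_prev, alpha=0.5)
# After two steps: e[i, j] = 0.5 * r_j + r_j = 1.5 * r_j for every row i.
```

Note the memory saving: the full RTRL factor stores on the order of $n^3$ entries, while the truncated traces store only $n^2$.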
After the truncation, RFLO and e-prop implement an update of the form

$$\Delta W^{rec}_{ij} \propto \sum_t L_{i,t}\, e_{ij,t},$$

where $L_{i,t}$ is a learning signal obtained by projecting the output error back onto neuron $i$ through feedback weights. The main difference between the symmetric e-prop and RFLO implementations is the choice of feedback: symmetric e-prop uses symmetric feedback, i.e. the output weight $W^{out}$ is reused as the feedback weight for the learning signal, whereas RFLO uses fixed random feedback weights  for greater biological plausibility. We note in passing that the authors of e-prop have also tested their formulation with fixed random feedback weights. MDGL  also truncates RTRL, but it restores some of the non-local dependencies (those within one connection step) that could potentially be communicated via mechanisms similar to the abundant cell-type-specific local modulatory signaling unveiled by recent transcriptomics data [144, 145]. With that, the expensive memory trace term in Eq. 17 becomes, for $k \neq i$,

$$\frac{d h_{k,t}}{d W^{rec}_{ij}} \approx W^{rec}_{ki}\, f'(h_{i,t-1})\, e_{ij,t-1},$$

in addition to the direct term $e_{ij,t}$ for $k = i$.
MDGL introduces one additional approximation: it replaces the recurrent weights with type-specific weights to mimic the cell-type-specific nature of local modulatory signaling (for cell  in group  and cell  in group , for a total of  cell groups). For simplicity, we do not use the cell-type approximation here. This results in the overall MDGL implementation

$$\Delta W^{rec}_{ij} \propto \sum_t \Big( L_{i,t}\, e_{ij,t} + \sum_{k \neq i} L_{k,t}\, W^{rec}_{ki}\, f'(h_{i,t-1})\, e_{ij,t-1} \Big).$$

An interpretation of the above update rule in terms of biological processes can be found in the MDGL paper .
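Putting the pieces together, a minimal sketch of the truncated three-factor update (RFLO / symmetric e-prop style) could look as follows; the postsynaptic derivative factor and MDGL's neighbor terms are omitted for brevity, and all names are illustrative:

```python
import numpy as np

def three_factor_update(errors, rates, B, alpha, lr):
    """Truncated three-factor update: the change in W_rec[i, j] accumulates
    (feedback-projected error at neuron i) x (trace of synapse ij). B is
    W_out transposed for symmetric e-prop, or a fixed random matrix for RFLO.

    errors: (T, n_out) output errors; rates: (T, n) firing rates.
    """
    n = rates.shape[1]
    e = np.zeros((n, n))                 # per-synapse traces e_ij
    dW = np.zeros((n, n))
    r_prev = np.zeros(n)
    for t in range(rates.shape[0]):
        e = alpha * e + r_prev[None, :]  # local trace: decays, tracks r_j
        L = B @ errors[t]                # learning signal broadcast to i
        dW += L[:, None] * e             # three-factor product
        r_prev = rates[t]
    return -lr * dW

dW = three_factor_update(errors=np.ones((2, 1)),
                         rates=np.array([[1.0, 0.0], [0.0, 0.0]]),
                         B=np.ones((2, 1)), alpha=1.0, lr=1.0)
```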
We note that the input, recurrent, and output weights were all trained. This section illustrates the approximate gradient for updating the recurrent weights $W^{rec}$; similar expressions can be obtained for updating the input weights $W^{in}$. The approximations, however, do not apply to the output weights, as their gradient does not suffer from the aforementioned nonlocality issue (Eq. 17).
A.3 Simulation details
We used TensorFlow  version 1.14 and built on top of . We modified the code for rate-based neurons (Eq. 10 and 12). Our code is available at https://anonymous.4open.science/r/BiolHessRNN-4DA5/README.md. We used the code in  for the power-law analysis (Figure 7 in Appendix). The SGD optimizer was used to study the effect of gradient approximation in isolation, without the complication of additional factors, as the Adam optimizer's adaptive learning rate could confound our matching step length experiments in Figure 4. That said, we verified that the curvature convergence behavior is also observed for the Adam optimizer (Figure 11 in Appendix). Learning rates were optimized by picking within  for each algorithm. For the sequential MNIST task, we explored batch sizes within . Training was stopped when both the loss and the leading Hessian eigenvalue stabilized. As stated, we repeated runs with different random initializations to quantify uncertainty, and weights were initialized similarly to .
Simulations were completed on a computer server with two 20-core Intel(R) Xeon(R) E5-2698 v4 CPUs at 2.20GHz. The average times to complete one run of the sequential MNIST, pattern generation, and delayed match to sample tasks in Figure 3 were approximately 2 hours, 1 hour, and 1 hour, respectively. Since the computation of second-order gradients becomes prohibitively expensive as the sequence length grows, all tasks involved no more than 50 time steps; for the sequential MNIST task, this was achieved using the row-by-row implementation. Using fewer time steps should not affect the general trend, as the gradient truncation effects were still significant. Because of the use of fewer steps, we dropped the leak factor in Eq. 10 (i.e. set $\alpha = 1$).
For the matching step length experiments (Figure 4), we simply obtained the update step length for the three-factor learning rule and scaled the BPTT updates by that amount. For the scheduled learning rate experiments (Figure 5), the additional hyperparameters included the initial learning rate, decay percentage, and decay frequency. We used an initial learning rate three times the uniform rate (used in the other figures) and decayed the learning rate by  every X iterations, where X is roughly the total number of training iterations (used in the other figures) divided by 30. Since the point of that figure was to show that learning rate scheduling can lead to flatter minima than a fixed learning rate, we did not search extensively over these additional hyperparameters, as the first set we tried sufficed to demonstrate that point.
For the pattern generation task, our network consisted of neurons described in Eq. 10. Input to this network was provided by a random Gaussian input (). The fixed target signal had a duration of 50 steps and was given by the sum of four sinusoids with fixed periods of 10, 40, 70 and 100 steps. For learning, we used the mean squared error loss function. Training for this task used the full batch. For testing, we perturbed the input with additive zero-mean Gaussian noise (with picked uniformly between and across runs), to mimic the situation where the agent has to faithfully produce the desired pattern even under perturbations. Unlike the other tasks, this task measures accuracy by mean squared error, for which lower is better. To maintain the convention that a higher generalization gap is worse, the generalization gap for this task is computed as test error minus train error.
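The target signal and test-time perturbation described above can be sketched as follows; unit amplitudes, zero phases, and the noise scale 0.1 are our assumptions, as the excerpt does not state them.

```python
import numpy as np

T = 50                        # target duration in steps (from the text)
periods = [10, 40, 70, 100]   # fixed sinusoid periods (from the text)
t = np.arange(T)
# Sum of four sinusoids; amplitudes and phases are assumed here.
target = sum(np.sin(2.0 * np.pi * t / p) for p in periods)

# Test-time perturbation: additive zero-mean Gaussian noise on the input.
rng = np.random.default_rng(0)
clean_input = rng.standard_normal(T)                  # stand-in Gaussian input
test_input = clean_input + rng.normal(0.0, 0.1, size=T)  # 0.1 is illustrative
```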
For the delayed match to sample task, our network consisted of neurons, including 50 neurons with (Eq. 12) and 50 neurons without (Eq. 10) threshold adaptation. The task involved the presentation of two sequential cues, each taking on a binary value, lasting 2 steps, and separated by a delay of 16 steps. Input to this network was provided by neurons. The first (resp. second) input neuron sends a value of 1 when the presented cue takes on a value of 1 (resp. 0), and 0 otherwise. The network was trained to output 1 (resp. 0) when the two cues have matching (resp. non-matching) values. For learning, we used the cross-entropy loss function, and the target corresponding to the correct output was given at the end of the trial. Training for this task used the full batch. For testing, we used an increased delay, with the period picked uniformly between the training delay period and twice the training delay period, to mimic situations where the animal has to hold the memory longer than it did during the learning phase.
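The trial structure as we read it (cue, delay, cue, with two one-hot input neurons) can be sketched as below; the helper name and array layout are ours, for illustration only.

```python
import numpy as np

CUE_LEN, DELAY = 2, 16   # cue duration and delay, from the text

def dms_trial(cue1: int, cue2: int):
    """One delayed match-to-sample trial. Input neuron 0 is active while
    a cue of value 1 is shown; input neuron 1 is active while a cue of
    value 0 is shown; both are 0 otherwise."""
    T = CUE_LEN + DELAY + CUE_LEN
    x = np.zeros((T, 2))
    x[:CUE_LEN, 1 - cue1] = 1.0          # first cue (steps 0-1)
    x[CUE_LEN + DELAY:, 1 - cue2] = 1.0  # second cue (last 2 steps)
    target = 1 if cue1 == cue2 else 0    # match -> 1, non-match -> 0
    return x, target

x, y = dms_trial(1, 0)   # non-matching cues
```

At test time, `DELAY` would be drawn uniformly between the training delay and twice that value.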
For the sequential MNIST task, our network consisted of neurons described in Eq. 10. Input to this network was provided by units that represented the grey-scaled values of a single row, totalling 28 steps, and the network prediction was made at the last step. For learning, we used the cross-entropy loss function, and the target corresponding to the correct output was given at the end of the trial. For testing, we used the existing MNIST test set with additive zero-mean Gaussian input noise. As mentioned, this task did not use full-batch training, but we found the trend to hold across different batch sizes.
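The row-by-row presentation amounts to unrolling each 28x28 image into 28 time steps of 28-dimensional inputs; a sketch (helper name and the optional noise level are ours):

```python
import numpy as np

def image_to_row_sequence(image, noise_std=0.0, rng=None):
    """Present a 28x28 grey-scale image row by row: 28 time steps, each a
    28-dimensional input vector, with optional additive zero-mean Gaussian
    noise for testing (the noise level here is illustrative)."""
    assert image.shape == (28, 28)
    rng = rng or np.random.default_rng(0)
    rows = image + rng.normal(0.0, noise_std, size=image.shape)
    return [rows[t] for t in range(28)]   # step t presents row t

seq = image_to_row_sequence(np.zeros((28, 28)))
```

This keeps the sequence length at 28 steps, within the 50-step budget imposed by the cost of second-order gradients.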
Finally, we note that comparisons between BPTT and the approximate rules are done at comparable training accuracies for the pattern generation and delayed match to sample tasks. For the sequential MNIST task, the three-factor rule achieves only around training accuracy, and even then, the curvature convergence behavior does not predict training accuracy. To see this, note that the three-factor theory (blue in Fig. 4), which corresponds to BPTT with a reduced step length, achieves an accuracy of yet still attains curvature convergence similar to that of the three-factor rule.
Appendix B Theorem 1
B.1 Proof of Theorem 1
First, we note that the Jacobian of the dynamical system for the BPTT update (Eq. 7) is simply the loss Hessian scaled by . This implies that
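Spelled out in standard notation (our reconstruction of the elided relation; here $\eta$ denotes the learning rate, $H = \nabla^2 L$ the loss Hessian, and $\lambda_i$ eigenvalues), the gradient-descent map and the resulting eigenvalue correspondence read:

```latex
\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)
\quad\Rightarrow\quad
J = \frac{\partial \theta_{t+1}}{\partial \theta_t} = I - \eta H,
\qquad
\lambda_i(J) = 1 - \eta\,\lambda_i(H).
```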
where we remind the reader that (resp. ) is the leading eigenvalue of the BPTT (resp. approximate rule) Jacobian, and is the leading eigenvalue of the Hessian matrix of the loss.
Second, with the assumptions of a single output , a loss presented only at the last time step, and stochastic gradient descent (updates on a single data example as opposed to batch updates), the least squares loss in Eq. 3 simplifies to:
where we remind the reader that is the notation for an approximate gradient.
We then compute the Jacobian of the difference equations above:
In the limit of zero error , i.e. close to an optimum, the term involving becomes negligible. That simplifies the Jacobian to
In this case, and are rank-1 matrices. A rank-1 square matrix has only one nonzero eigenvalue, and by inspection, that one eigenvalue is
where equality (a) is explained as follows. We first remind the reader that is defined such that (Eq. 6). For the case of a scalar output , and . So if we divide both sides of by we get . Since we have by definition, then because is just a scaled when the output is a scalar. This leads to .
Since we assume the gradient descent dynamical system converges to an optimum, this corresponds to an asymptotically stable fixed point. Hence, and , which implies:
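For a discrete gradient-descent map, asymptotic stability of a fixed point requires every Jacobian eigenvalue to lie strictly inside the unit circle. In standard notation (our reconstruction, with $\eta$ the learning rate and $\lambda_i(H)$ the Hessian eigenvalues at the minimum), this yields the familiar step-size bound:

```latex
|\lambda_i(J)| = |1 - \eta\,\lambda_i(H)| < 1
\quad\Longrightarrow\quad
0 < \lambda_i(H) < \frac{2}{\eta}.
```

That is, gradient descent with step size $\eta$ can only settle into minima whose leading curvature is below $2/\eta$, which is why a larger effective step length biases convergence toward flatter minima.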
B.2 Discussion on tightness of the bound
Following the derivation, it is clear that the tightness of the bound depends on how close the magnitudes of the leading Jacobian eigenvalues are to upon convergence. This is related to the distribution of flatness across minima in the loss landscape, which determines the probability of a rule converging to a minimum with flatness in a certain range. Such a distribution is likely problem dependent. In the extreme scenario where the loss is convex, there is just one minimum, so the question of minima preference becomes irrelevant.
B.3 Discussion on generality of Theorem 1
The proof above examines a special case where the Jacobian of the weight update equations becomes rank-1. We remind the reader that in the case of a multivariate loss, a higher-rank Jacobian, or batch updates, we would not arrive at the rank-1 Jacobian step (Eq. 30). For many tasks considered in neuroscience, the rank-1 case applies, which explains the validity of Theorem 1. We also note in passing that in the case of stochastic gradient descent, the Hessian-Jacobian correspondence (Eq. 23) would point to the Hessian of the loss evaluated on a single example, which could reflect the robustness against perturbing that particular example. Moreover, simulation results show that our conclusion holds in higher-rank cases (Figure 4). The challenge of generalizing the proof to the higher-rank Jacobian case is that we are no longer guaranteed that the leading eigenvectors of the BPTT Jacobian coincide with the leading eigenvectors of the approximate rule Jacobian. Thus, it becomes much harder to relate and . Rather than providing a further proof, we provide an intuition for why our conclusion (that the convergence behavior between rules differs by their step length along the gradient direction) can hold in the higher-rank case under Assumption 1.
The approximation error vector (but not ) lies orthogonal to the subspace spanned by the leading Hessian eigenvectors. Here, leading Hessian eigenvectors refer to the eigenvectors corresponding to the outlier Hessian eigenvalues, given the well-known observation that there exist only a few large (outlier) eigenvalues while the rest are near zero [132, 133, 43].
The ramification of Assumption 1 is that will lie in the subspace spanned by eigenvectors corresponding to tiny eigenvalues, making tiny. In the extreme scenario where lies in the null space of , would be . We verify this assumption numerically in Figure 10. We saw from the proof above that this assumption is automatically satisfied in the rank-1 Jacobian case. We remark that this assumption should not hold for stochastic gradient noise (SGN), as the SGN covariance matrix is well aligned with the Hessian matrix near a minimum . This could be why , unlike stochastic gradient noise, does not seem to contribute much to escaping narrow minima.
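The ramification of Assumption 1 can be illustrated numerically on a toy matrix with the outlier-eigenvalue structure cited above (all values and the construction here are invented for illustration): a vector orthogonal to the leading eigenvector is mapped to something tiny by the Hessian.

```python
import numpy as np

# Toy "Hessian": one outlier eigenvalue, the rest near zero.
eigvals = np.array([10.0, 1e-3, 1e-3, 1e-3])
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthonormal basis
H = Q @ np.diag(eigvals) @ Q.T                    # symmetric, known spectrum

e = Q[:, 1]        # "approximation error" orthogonal to the leading eigenvector
He = H @ e         # lies in the tiny-eigenvalue subspace, so it is tiny
lead = H @ Q[:, 0] # contrast: the leading direction is amplified by 10
```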
We consider the case of weight updates small enough that the loss surface can be approximated by a second-order Taylor expansion. Thus, the loss change after one update becomes:
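The elided expansion is presumably the standard one (our reconstruction, with assumed symbols $g = \nabla L(\theta)$, $H = \nabla^2 L(\theta)$, and $\Delta\theta$ the weight update):

```latex
\Delta L \;=\; L(\theta + \Delta\theta) - L(\theta)
\;\approx\; \underbrace{g^{\top}\Delta\theta}_{\text{first-order term}}
\;+\; \underbrace{\tfrac{1}{2}\,\Delta\theta^{\top} H\,\Delta\theta}_{\text{second-order term}}.
```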
We next focus on the first and second Taylor terms ( and ) for the exact rule, as well as the corresponding terms ( and ) for the approximate rule:
and we note that the first and second Taylor terms can determine how likely the update is to be trapped in a local minimum: