Learning compositional functions via multiplicative weight updates

by Jeremy Bernstein et al.

Compositionality is a basic structural feature of both biological and artificial neural networks. Learning compositional functions via gradient descent incurs well-known problems like vanishing and exploding gradients, making careful learning rate tuning essential for real-world applications. This paper proves that multiplicative weight updates satisfy a descent lemma tailored to compositional functions. Based on this lemma, we derive Madam, a multiplicative version of the Adam optimiser, and show that it can train state-of-the-art neural network architectures without learning rate tuning. We further show that Madam is easily adapted to train natively compressed neural networks by representing their weights in a logarithmic number system. We conclude by drawing connections between multiplicative weight updates and recent findings about synapses in biology.




1 Introduction

Neural computation in living systems emerges from the collective behaviour of large numbers of low-precision and potentially faulty processing elements. This is a far cry from the precision and reliability of digital electronics. Looking at the numbers: a synapse on a computer is often represented using 32 bits, taking more than 4 billion distinct values. In contrast, a biological synapse is estimated to take 26 distinguishable strengths, requiring only 5 bits to store (Bartol et al., 2015). This discrepancy between nature and engineering spans many orders of magnitude. So why does the brain learn stably, whereas deep learning is notoriously finicky and sensitive to myriad hyperparameters?

Meanwhile, an industrial effort is underway to scale artificial networks up to run on supercomputers and down to run on resource-limited edge devices. While learning algorithms designed to run natively on low precision hardware could lead to smaller and more power efficient chips (Baker and Hammerstrom, 1989; Horowitz, 2014), progress is hampered by our poor understanding of how precision impacts learning. As such, existing numerical representations have developed somewhat independently from learning algorithms. The next generation of neural hardware could benefit from more principled algorithmic co-design (Sze et al., 2017).

Our contributions:

  1. Building on recent results in the perturbation analysis of compositional functions (Bernstein et al., 2020), we show that a multiplicative learning rule satisfies a descent lemma tailored to neural networks.

  2. We propose and benchmark Madam, a multiplicative version of the Adam optimiser. Empirically, Madam appears not to require learning rate tuning, and it may further be used to train neural networks with low bit width synapses stored in a logarithmic number system.

  3. We point out that multiplicative weight updates respect certain aspects of neuroanatomy. First, synapses are exclusively excitatory or inhibitory since their sign is preserved under the update. Second, multiplicative weight updates are most naturally implemented in a logarithmic number system, in line with anatomical findings about biological synapses (Bartol et al., 2015).

2 Related work

Multiplicative weight updates

Multiplicative algorithms have a storied history in computer science. Examples in machine learning include the Winnow algorithm (Littlestone, 1988) and the exponentiated gradient algorithm (Kivinen and Warmuth, 1997), both for learning linear classifiers in the face of irrelevant input features. The Hedge algorithm (Freund and Schapire, 1997), which underpins the AdaBoost framework for boosting weak learners, is also multiplicative. In algorithmic game theory, multiplicative weight updates may be used to solve two-player zero-sum games (Grigoriadis and Khachiyan, 1995). Arora et al. (2012) survey many more applications.

Multiplicative updates are typically viewed as appropriate for problems where the geometry of the optimisation domain is described by the relative entropy (Kivinen and Warmuth, 1997), as is often the case when optimising over probability distributions. Since the relative entropy is a Bregman divergence, the algorithm may then be studied under the framework of mirror descent (Dhillon and Tropp, 2008). We suggest that multiplicative updates may arise under a broader principle: when the geometry of the optimisation domain is described by any relative distance measure.

Deep network optimisation

Finding the right distance measure to describe deep neural networks is an ongoing research problem. Some theoretical work has supposed that the underlying geometry is Euclidean via the assumption of Lipschitz-continuous gradients (Bottou et al., 2018), but Zhang et al. (2020) suggest that this assumption may be too strong and consider relaxing it. Similarly, Azizan et al. (2019) consider more general Bregman divergences under the mirror descent framework, although these distance measures are not tailored to compositional functions like neural networks. Neyshabur et al. (2015), on the other hand, derive a distance measure based on paths through a neural network in order to better capture the scaling symmetries over layers. Still, it is difficult to place this distance within a formal optimisation theory in order to derive descent lemmas.

Recent research has looked at learning algorithms that make relative updates to network layers, such as LARS (You et al., 2017), LAMB (You et al., 2020), and Fromage (Bernstein et al., 2020). Empirically, these algorithms appear to stabilise large batch network training and require little to no learning rate tuning. Bernstein et al. (2020) suggested that these algorithms are accounting for the compositional structure of the neural network function class, and derived a new distance measure called deep relative trust to describe this analytically. It is the relative nature of deep relative trust that leads us to propose multiplicative updates in this work.

Numerics of a synapse

A basic goal of theoretical neuroscience is to connect the numerical properties of a synapse with network function. For example, following the observation that synapses are exclusively excitatory or inhibitory, van Vreeswijk and Sompolinsky (1996) studied how the balance of excitation and inhibition can affect network dynamics, and Amit et al. (1989) studied perceptron learning with sign-constrained synapses. More recently, based on the observation that synapse size and strength are correlated, Bartol et al. (2015) used the number of distinguishable synapse sizes to estimate the information content of a synapse. Their results suggest that biological synapses may occupy just 26 levels in a logarithmic number system, thus storing less than 5 bits of information, which in turn yields an estimate of the total information that a human brain may store.

Low-precision hardware

In their bid to outrun the end of Moore’s Law, chip designers have also taken an interest in understanding and improving the efficiency of artificial synapses. This work dates back at least to the 1980s and 1990s: for example, Iwata et al. (1989) designed a 24-bit neural network accelerator, while Baker and Hammerstrom (1989) suggested that learning may break down below 12 bits per synapse. Holt and Hwang (1993) analysed round-off error for compounding operators and proposed a heuristic connection between numerics and optimisation (their Equation 54).

Over the last decade there has been renewed interest in low precision synaptic weights both for deployment (Courbariaux et al., 2015; Hubara et al., 2018; Zhou et al., 2017) and training (Gupta et al., 2015; Müller and Indiveri, 2015; Sun et al., 2019; Wang et al., 2018; Wu et al., 2018) of artificial networks. This research has included the exploration of logarithmic number systems (Lee et al., 2017; Vogel et al., 2018). A general trend has emerged: a trained network may be quantised to just a few bits per synapse, but 8 to 16 bits are typically required for stable learning (Gupta et al., 2015; Wang et al., 2018; Sun et al., 2019). Given the lack of theoretical understanding of how precision relates to learning, these works often introduce subtle but significant complexities. For example, existing works using logarithmic number systems combine them with additive optimisation algorithms like Adam and SGD (Lee et al., 2017), thus requiring tuning of both the learning algorithm and the numerical representation. And many works must resort to using high precision weights in the final layer of the network to maintain accuracy (Sun et al., 2019; Wang et al., 2018; Wu et al., 2018).

3 Mathematical model

A basic question in the theory of neural networks is as follows:

How far can we perturb the synapses before we damage the network as a whole?

In this paper, this question is important on two fronts: first, our learning rule must not destroy the information contained in the synapses; second, our numerical representation must be precise enough to encode non-destructive perturbations. Once it has been established that multiplicative updates are a good learning rule (addressing the first point), it becomes natural to represent the weights in a logarithmic number system (addressing the second). Therefore, this section shall focus on establishing, first as a sketch and then rigorously, the benefits of multiplicative updates for learning compositional functions.

3.1 Sketch of the main idea

The raison d’être of a synaptic weight is to support learning in the network as a whole. In machine learning, this is formalised by constructing a loss function $\mathcal{L}(W)$ that measures the error of the network in weight configuration $W$. Learning proceeds by perturbing the synapses in order to reduce the loss function. A good perturbation direction is the negative gradient of the loss: $-\nabla_W \mathcal{L}(W)$. But how far can this direction be trusted?

Intuitively, the negative gradient should only be trusted until its approximation quality breaks down. This breakdown could be measured by the Hessian of the loss function, but this is intractable for large networks since it involves all pairs of weights. Instead, Bernstein et al. (2020) suggest how to operate without the Hessian. To get a handle on exactly how this is done, consider an $L$-layer multilayer perceptron and the gradient of its loss with respect to the weights $W_k$ at the $k$th layer:

$$\nabla_{W_k} \mathcal{L} \;=\; \underbrace{\frac{\partial \mathcal{L}}{\partial h_L}}_{\text{loss term}} \cdot \underbrace{\frac{\partial h_L}{\partial h_k}}_{\text{layers } k+1 \text{ to } L} \cdot \underbrace{\frac{\partial h_k}{\partial W_k}}_{\text{layers } 1 \text{ to } k-1}, \tag{1}$$

where $h_k$ denotes the activations at the $k$th hidden layer of the network. By the backpropagation algorithm (Rumelhart et al., 1986), and ignoring the nonlinearity for the sake of this sketch, the second term in Equation 1 depends on the product of weight matrices over layers $k+1$ to $L$, and the third term depends on the product of weight matrices over layers $1$ to $k-1$. It is therefore natural to model the relative change in the whole expression via the formula for the relative change of a product: if each layer is perturbed as $W_l \to W_l + \Delta W_l$, then the product of weight matrices changes in relative terms by roughly

$$\prod_{l} \left(1 + \frac{\|\Delta W_l\|}{\|W_l\|}\right) - 1.$$

This neglects the specific choice of loss function, which enters via the first term in Equation 1.

To recap, we have proposed that neural networks ought to be trained by following the gradient of their loss until that gradient breaks down. We then sketched that the relative breakdown in gradient depends on a product over relative perturbations to each network layer. The simplest perturbation that follows the gradient direction and keeps the layerwise relative perturbation small was first introduced by You et al. (2017):

$$\Delta W_k \;=\; -\eta\, \|W_k\|\, \frac{\nabla_{W_k} \mathcal{L}}{\|\nabla_{W_k} \mathcal{L}\|}.$$

The downside of this rule is that it requires knowing precisely which weights act together as a layer and normalising those updates jointly. It is difficult to imagine this happening in the brain, and for exotic artificial networks it is sometimes unclear what constitutes a layer (Huang et al., 2017). A convenient way to sidestep this issue is to update each individual weight $w$ multiplicatively, via:

$$w \;\leftarrow\; w \cdot \big(1 - \eta\, \mathrm{sign}(w)\, \mathrm{sign}(g)\big), \qquad g := \frac{\partial \mathcal{L}}{\partial w}.$$

This update ensures that the relative perturbation $\|\Delta W\| / \|W\|$ is small for every subset $W$ of the weights whilst only using information local to a synapse. Therefore, by appropriate choice of the signs of the multiplicative factors, it can be arranged that this perturbation is roughly aligned with the negative gradient whilst keeping the relative breakdown in gradient small.
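To make the per-weight rule concrete, here is a minimal NumPy illustration (our own example with an arbitrary learning rate, not the paper's reference implementation):

```python
import numpy as np

def multiplicative_sign_update(w, grad, eta=0.01):
    """Scale each weight by (1 - eta) where sign(w) and sign(grad) agree,
    and by (1 + eta) where they disagree; signs are never flipped."""
    return w * (1.0 - eta * np.sign(w) * np.sign(grad))

w = np.array([0.5, -0.2, 0.1])
g = np.array([1.0, 1.0, -1.0])
w_new = multiplicative_sign_update(w, g, eta=0.1)
# sign(w) * sign(g) = [+1, -1, -1], so the factors are [0.9, 1.1, 1.1].
```

Note that every weight changes by exactly a factor of $1 \pm \eta$, so the relative perturbation to any subset of the weights is exactly $\eta$.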

We shall now develop this sketch into a rigorous optimisation theory.

3.2 First-order optimisation of continuously differentiable functions

To elucidate the connection between gradient breakdown and optimisation, consider the following inequality. Though it applies to all continuously differentiable functions, we will think of $\mathcal{L}$ as a loss function measuring the performance of a neural network of depth $L$ at some task.

Lemma 1.

Consider a continuously differentiable function $\mathcal{L}$ that maps $\mathbb{R}^n \to \mathbb{R}$. Suppose that the parameter vector $W \in \mathbb{R}^n$ decomposes into $L$ parameter groups: $W = (W_1, \dots, W_L)$, and consider making a perturbation $\Delta W = (\Delta W_1, \dots, \Delta W_L)$. Let $\theta$ measure the angle between $\Delta W$ and the negative gradient $-\nabla_W \mathcal{L}(W)$. Then:

$$\mathcal{L}(W + \Delta W) - \mathcal{L}(W) \;\le\; -\,\|\nabla_W \mathcal{L}(W)\|\, \|\Delta W\| \left[\cos\theta \;-\; \max_{t \in [0,1]} \frac{\|\nabla_W \mathcal{L}(W + t\,\Delta W) - \nabla_W \mathcal{L}(W)\|}{\|\nabla_W \mathcal{L}(W)\|}\right].$$

The proof is in Appendix A. The result says: to reduce a function, follow its negative gradient until it breaks down. Descent is formally guaranteed when the bracketed term is positive. That is, when:

$$\max_{t \in [0,1]} \frac{\|\nabla_W \mathcal{L}(W + t\,\Delta W) - \nabla_W \mathcal{L}(W)\|}{\|\nabla_W \mathcal{L}(W)\|} \;<\; \cos\theta. \tag{2}$$


According to Equation 2, to rigorously guarantee descent for neural networks we must bound their relative breakdown in gradient. To this end, Bernstein et al. (2020) propose the notion of deep relative trust based on a perturbation analysis of compositional functions.

Modelling assumption 1.

[Deep relative trust] Consider a neural network with $L$ layers and parameters $W = (W_1, \dots, W_L)$. Consider a parameter perturbation $\Delta W = (\Delta W_1, \dots, \Delta W_L)$. Let $\nabla_{W_k} \mathcal{L}$ denote the gradient of the loss with respect to the $k$th layer. Then the gradient breakdown is bounded by:

$$\frac{\|\nabla_{W_k} \mathcal{L}(W + \Delta W) - \nabla_{W_k} \mathcal{L}(W)\|}{\|\nabla_{W_k} \mathcal{L}(W)\|} \;\le\; \prod_{l=1}^{L} \left(1 + \frac{\|\Delta W_l\|}{\|W_l\|}\right) - 1.$$

The product reflects the compositional structure of the network. Crucially, compared to a Hessian that may contain as many as $n^2$ entries for a network with $n$ parameters, this is a tractable analytic expression.
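To illustrate how tractable the bound is, the following NumPy sketch (ours, with toy layer shapes) evaluates the deep relative trust expression for a given set of layerwise perturbations:

```python
import numpy as np

def deep_relative_trust_bound(weights, perturbations):
    """Evaluate prod_l (1 + ||dW_l|| / ||W_l||) - 1, the deep relative
    trust bound on the relative change in gradient."""
    ratios = [np.linalg.norm(dW) / np.linalg.norm(W)
              for W, dW in zip(weights, perturbations)]
    return float(np.prod([1.0 + r for r in ratios]) - 1.0)

# Ten layers, each perturbed by 1% in relative terms:
Ws = [np.ones((4, 4)) for _ in range(10)]
dWs = [0.01 * W for W in Ws]
bound = deep_relative_trust_bound(Ws, dWs)  # (1.01)**10 - 1, about 0.105
```

The cost is one norm per layer, rather than the quadratically many Hessian entries.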

3.3 Descent via multiplicative weight updates

Since deep relative trust penalises the relative size of the perturbation to each layer, it is natural that our learning algorithm would bound these perturbations on a relative scale. A simple way to achieve this is via the following multiplicative update rule:

$$W \;\leftarrow\; W \odot \big(1 - \eta\, \mathrm{sign}(W) \odot \mathrm{sign}(\nabla_W \mathcal{L})\big), \tag{3}$$

where the sign is taken elementwise and $\odot$ denotes elementwise multiplication. Synapses shrink where the signs of $W$ and $\nabla_W \mathcal{L}$ agree and grow where the signs differ. In the following theorem, we establish descent under this update for compositional functions described by deep relative trust.

Theorem 1.

Let $\mathcal{L}$ be the continuously differentiable loss function of a neural network of depth $L$ that obeys deep relative trust. For each layer $k$, let $\theta_k$ denote the angle between $-\nabla_{W_k} \mathcal{L}$ and the perturbation $\Delta W_k$ induced by Equation 3. Then the multiplicative update in Equation 3 will decrease the loss function provided that:

$$\eta \;<\; \Big(1 + \min_k \cos\theta_k\Big)^{1/L} - 1.$$

Theorem 1 tells us that for small enough $\eta$, multiplicative updates achieve descent. It also tells us on what scale $\eta$ must be small. The proof is given in Appendix A.

We can bring Theorem 1 to life by plugging in numbers. First, we must consider what values the angles $\theta_k$ are likely to take. Since the relevant vectors are elementwise nonnegative, the angle between them can be no larger than $90°$, and this happens only when the support of the two vectors is totally disjoint. For problems occurring in practice we would expect the supports to overlap. Therefore it seems reasonable to substitute $\theta_k = 45°$ into Theorem 1, whence we obtain that:

  • For a 50 layer network, setting $\eta \approx 0.01$ guarantees descent.

  • For a 500 layer network, setting $\eta \approx 0.001$ guarantees descent.

  • For a 5000 layer network, setting $\eta \approx 0.0001$ guarantees descent.

We find that $\eta = 0.01$ works well in all our experiments with the Madam optimiser in later sections.
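Assuming the descent threshold takes the form $\eta < (1 + \cos\theta)^{1/L} - 1$, consistent with the depths and learning rates quoted above, these numbers can be checked directly:

```python
import math

def max_learning_rate(depth, theta_deg=45.0):
    """Largest learning rate guaranteeing descent under the assumed
    condition eta < (1 + cos(theta))**(1/depth) - 1."""
    return (1.0 + math.cos(math.radians(theta_deg))) ** (1.0 / depth) - 1.0

for depth in (50, 500, 5000):
    # Roughly 0.011, 0.0011, 0.00011: matching eta = 0.01, 0.001, 0.0001.
    print(depth, max_learning_rate(depth))
```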

Hyperparameters: initial weight scale $\sigma_{\text{init}}$; max weight $\sigma_{\max}$; typical perturbation $\eta$; max perturbation $p$; averaging constant $\beta$.

initialise weights randomly on scale $\sigma_{\text{init}}$, for example: $W \sim \mathrm{Uniform}(-\sigma_{\text{init}}, \sigma_{\text{init}})$
initialise second moment estimate $\hat{g}^2 \leftarrow 0$
repeat
      collect gradient $g \leftarrow \nabla_W \mathcal{L}(W)$
      update second moment estimate $\hat{g}^2 \leftarrow \beta\, \hat{g}^2 + (1 - \beta)\, g \odot g$
      update weights multiplicatively $W \leftarrow W \odot \exp\!\big(-\mathrm{sign}(W) \odot \mathrm{clamp}(\eta\, g / \sqrt{\hat{g}^2},\, -p,\, p)\big)$
      clamp weights between $\pm\sigma_{\max}$
until converged
Algorithm 1 Madam: a multiplicative adaptive moments based optimiser. A good default is $\eta = 0.01$; $\sigma_{\text{init}}$ can be lifted from a standard initialisation.

4 Making the algorithm practical—the Madam optimiser

In the previous section, we built an optimisation theory for the multiplicative update rule appearing in Equation 3. While that update yields to straightforward mathematical analysis, two modifications render it more practically useful. First, we use the fact that $1 + x \approx e^x$ for small $x$ to approximate Equation 3 by:

$$W \;\leftarrow\; W \odot \exp\!\big(-\eta\, \mathrm{sign}(W) \odot \mathrm{sign}(g)\big), \tag{4}$$

where $g$ is shorthand for $\nabla_W \mathcal{L}$. This change makes it easier to represent weights in a logarithmic number system, since the weights are restricted to integer multiples of $\eta$ in log space.

Second, in practice it may be overly stringent to restrict the gradient to its 1-bit sign. To retain more gradient precision, we propose:

$$W \;\leftarrow\; W \odot \exp\!\big(-\mathrm{sign}(W) \odot \mathrm{clamp}(\eta\, g / \sqrt{\hat{g}^2},\, -p,\, p)\big). \tag{5}$$

In this expression, $\sqrt{\hat{g}^2}$ denotes the root mean square gradient. This is estimated by a running average over iterations. Each iteration, this is updated by:

$$\hat{g}^2 \;\leftarrow\; \beta\, \hat{g}^2 + (1 - \beta)\, g \odot g.$$

This idea is borrowed from the Adam and RMSprop optimisers (Kingma and Ba, 2015; Tieleman and Hinton, 2012). It is important to realise that the quantity $g / \sqrt{\hat{g}^2}$ is typically $O(1)$, which explains why it may be viewed as a higher precision version of $\mathrm{sign}(g)$. Finally, the clamp function projects its argument onto the interval $[-p, p]$. This means that a weight can change by no more than a factor of $e^{\pm p}$ per iteration, ensuring that the algorithm still has bounded relative perturbations and respects deep relative trust. We refer to Equation 5 as Madam; full pseudocode is given in Algorithm 1.
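One Madam step can be sketched in NumPy as follows. This is our own illustrative re-implementation under the notation above, not the authors' released code (which lives at https://github.com/jxbz/madam), and the default values for $p$, $\beta$, the maximum weight and the small denominator guard are placeholders:

```python
import numpy as np

def madam_step(w, grad, g2_avg, eta=0.01, p=0.1, beta=0.999, w_max=1.0):
    """One Madam step: RMS-normalise the gradient, clamp the log-space
    step, update the weights multiplicatively, then cap their magnitude."""
    # Running estimate of the mean square gradient (Adam/RMSprop style).
    g2_avg = beta * g2_avg + (1.0 - beta) * grad ** 2
    # The RMS-normalised gradient is typically O(1); 1e-12 guards division.
    g_norm = grad / (np.sqrt(g2_avg) + 1e-12)
    # Clamp so that no weight changes by more than a factor of e**(+/- p).
    step = np.clip(eta * g_norm, -p, p)
    w = w * np.exp(-np.sign(w) * step)   # signs are never flipped
    w = np.clip(w, -w_max, w_max)        # cap the weight magnitudes
    return w, g2_avg

# One illustrative step from a toy state:
w, g2 = np.array([0.5, -0.5]), np.ones(2)
w, g2 = madam_step(w, np.array([1.0, -2.0]), g2)
```

Where weight and gradient signs agree, the weight shrinks; where they disagree, it grows, exactly as in the sign-based rule but with graded step sizes.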

4.1 $b$-bit Madam

The multiplicative nature of Madam suggests storing synapse strengths in a logarithmic number system, where numbers are represented just by a sign and exponent. To see why this is natural for Madam consider that, since $g / \sqrt{\hat{g}^2}$ is typically $O(1)$, Madam’s typical relative perturbation to a synapse is a factor of $e^{\pm\eta}$. Therefore, in log space the synapse strengths typically change by $\pm\eta$. This suggests efficiently representing a synapse by its sign and an integer multiple of $\eta$.

In practice, a slightly more fine-grained discretisation is beneficial. We define the base precision $\eta_{\text{base}}$, which divides both the learning rate $\eta$ and the maximum perturbation strength $p$. We then round the log-space step appearing in Madam to the nearest multiple of $\eta_{\text{base}}$. This leads to quantised multiplicative updates:

$$W \;\leftarrow\; W \odot \exp\!\left(-\mathrm{sign}(W) \odot \eta_{\text{base}}\, \mathrm{round}\!\left(\frac{\mathrm{clamp}(\eta\, g / \sqrt{\hat{g}^2},\, -p,\, p)}{\eta_{\text{base}}}\right)\right).$$

A good setting in practice takes the base precision several times smaller than the learning rate; the values used in our experiments are given in Section 6. Finally, to obtain a $b$-bit weight representation, we must restrict the number of allowed weight levels to $2^{b-1}$, reserving one bit for the sign. The resulting representation is:

$$w \;=\; \pm\, \sigma\, e^{-k\, \eta_{\text{base}}}, \qquad k \in \{0, 1, \dots, 2^{b-1} - 1\}.$$

The scale parameter $\sigma$ is shared by a whole network layer, and may be taken from an existing weight initialisation scheme. A good mental picture is that the weights live on a ladder in log space with rungs spaced $\eta_{\text{base}}$ apart. The Madam update moves the weights up or down the rungs of the ladder.
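The rounding onto the log-space ladder can be sketched as follows. This is our illustrative reading of the representation; the default base precision of 0.002 is a placeholder, not a value reported by the paper:

```python
import numpy as np

def quantise_log(w, sigma, eta_base=0.002, bits=12):
    """Round nonzero weights to the nearest representable level
    +/- sigma * exp(-k * eta_base), integer k in [0, 2**(bits-1) - 1]."""
    levels = 2 ** (bits - 1)
    # Integer rung in log space, clipped to the representable range.
    k = np.round(-np.log(np.abs(w) / sigma) / eta_base)
    k = np.clip(k, 0, levels - 1)
    return np.sign(w) * sigma * np.exp(-k * eta_base)
```

Quantisation preserves the sign and perturbs the log-magnitude by at most half a rung, so a quantised weight re-quantises to itself.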

5 Benchmarking Madam in FP32

Figure 1: Learning rate tuning for different optimisers. We tuned the learning rate $\eta$ over a logarithmic grid on various deep learning benchmarks. For each run, a fixed $\eta$ is used without decay. For each algorithm and setting of $\eta$, we plot the best error across all $\eta$ for that algorithm divided by the error for that specific $\eta$, meaning that a value of 1 (highlighted with a red cross) indicates the best setting of $\eta$ for that algorithm and task. For Madam, the best setting ($\eta = 0.01$) was independent of task.

Note: all Madam runs use initial learning rate $\eta = 0.01$.
For all algorithms, $\eta$ is decayed by a factor of 10 when the loss plateaus.

Dataset      Task         Optimiser   Initial η   Madam initial η
CIFAR-10     Resnet18     Adam        0.001       0.01
CIFAR-100    Resnet18     SGD         0.1         0.01
ImageNet     Resnet50     SGD         0.1         0.01
CIFAR-10     cGAN         Adam        0.0001      0.01
Wikitext-2   Transformer  SGD         1.0         0.01
Table 1: Results after tuning the learning rate $\eta$. For each task, we compare to the better performing optimiser out of SGD and Adam and list the associated initial $\eta$. For Madam we use the same initial $\eta = 0.01$ across all tasks. We quote top-1 test error, FID (Heusel et al., 2017) and perplexity for the classifiers, GAN and transformer respectively. Lower is better in all cases. The mean and range are based on three repeats.

In this section, we benchmark Madam (Algorithm 1) with weights represented in 32-bit floating point. In the next section, we shall benchmark $b$-bit Madam. The results in this section show that, across various tasks including image classification, language modelling and image generation, Madam without learning rate tuning is consistently competitive with a tuned SGD or Adam.

In Figure 1, we show the results of a learning rate grid search undertaken for Madam, SGD and Adam. The optimal learning rate setting for each benchmark is shown with a red cross. Notice that for Madam the optimal learning rate is always $\eta = 0.01$, whereas for SGD and Adam it varies across tasks.

In Table 1, we compare the final results using tuned learning rates for Adam and SGD and using $\eta = 0.01$ for Madam. Since we are comparing Madam to the better algorithm out of Adam and SGD for each benchmark, this comparison is technically biased against Madam. Still, Madam’s results are competitive, and substantially better in the GAN experiment, where SGD obtained a much worse FID.

Implementation details

The code for these experiments can be found at https://github.com/jxbz/madam. Because bias terms are initialised to zero by default in Pytorch (Paszke et al., 2019), a multiplicative update would have no effect on these parameters. Therefore we initialised these terms away from zero, which led to slight performance improvements in some experiments. Madam benefited from light tuning of the hyperparameter $\sigma_{\max}$ (see Algorithm 1), which regularises the network by controlling the maximum size of the weights. In each experiment, $\sigma_{\max}$ was set to be 1, 2 or 3 times larger than the initialisation scale on a layerwise basis. The effect of tuning $\sigma_{\max}$ is comparable to the effect of tuning weight decay in SGD.

6 Benchmarking $b$-bit Madam

The results in this section demonstrate that $b$-bit Madam can be used to train networks that use 8–12 bits per weight, often with little to no loss in accuracy compared to an FP32 baseline. This compression level is in the range of 8–16 bits suggested by prior work (Gupta et al., 2015; Wang et al., 2018; Sun et al., 2019). However, we must emphasise the ease with which these results were attained. Just as Madam did not require learning rate tuning (see Figure 1), neither did $b$-bit Madam. In all 12-bit runs, a learning rate of $\eta = 0.01$ combined with a fixed base precision could be relied upon to achieve stable learning.

The results are given in Table 2. Though little deterioration is experienced at 12 bits, we believe that the results could be improved by making minor hyperparameter tweaks. For example, in the 12-bit ImageNet experiment we were able to reduce the error by borrowing layerwise parameter scales (see Section 4.1) from a pre-trained model instead of using the standard Pytorch (Paszke et al., 2019) initialisation scale. Still, it would be against the spirit of this work to present results with over-tuned hyperparameters.

To get more of a feel for the relative simplicity of our approach, we shall briefly comment on some of the subtleties introduced by prior work that $b$-bit Madam avoids. Studies often maintain higher-precision copies of the weights as part of their low-precision training process (Wu et al., 2018; Wang et al., 2018). For example, in their paper on 8-bit training, Wang et al. (2018) actually maintain a 16-bit master copy of the weights. Furthermore, it is common to keep certain network layers such as the output layer at higher precision (Sun et al., 2019; Wang et al., 2018; Wu et al., 2018). In contrast, we use the same bit width to represent every layer’s weights, and weights are both stored and updated in their $b$-bit representation.

Furthermore, we want to emphasise how natural it is to combine multiplicative updates with a logarithmic number system. Prior research on deep network training using logarithmic number systems has combined them with additive optimisation algorithms like SGD (Lee et al., 2017). This necessitates tuning both the number system hyperparameters (dynamic range and base precision) and the optimisation hyperparameter (learning rate). As was demonstrated in Figure 1, tuning the SGD learning rate is already a computationally intensive task. Moreover, the cost of hyperparameter grid search grows exponentially in the number of hyperparameters.

Dataset      Task         FP32 Madam   12-bit   10-bit   8-bit
CIFAR-10     Resnet18
CIFAR-100    Resnet18
ImageNet     Resnet50
Wikitext-2   Transformer
Table 2: Benchmarking $b$-bit Madam. We tested 12-bit, 10-bit and 8-bit Madam on various tasks. The results for the FP32 baseline are reproduced from Table 1. For each result we give the mean and range over three repeats. In all experiments, an initial learning rate of $\eta = 0.01$ was used. In all 12-bit experiments, a fixed base precision was chosen. In order to reduce the bit width from 12 to 8 bits, we increased the base precision of the numerical representation, finding this to work better than the alternative of reducing the dynamic range of the numerical representation.

Implementation details

The code for these experiments can be found at https://github.com/jxbz/madam. The $b$-bit Madam hyperparameters are defined in Section 4.1. We choose the layerwise scale $\sigma$ in $b$-bit Madam following the same strategy as in Madam: 1, 2 or 3 times the default Pytorch (Paszke et al., 2019) initialisation scale, except for biases, where the default Pytorch initialisation is zero. We choose the initial learning rate to be $\eta = 0.01$ across all experiments. In the 12-bit experiments, we choose a fixed base precision $\eta_{\text{base}}$. In the 10-bit and 8-bit experiments, a larger base precision is used, still dividing $\eta$ and $p$. For each layer, we initialise the weights uniformly across the representable levels on scale $\sigma$.

Notice that $\eta_{\text{base}}$ sets the precision of the representation, and the dynamic range is given by $e^{\eta_{\text{base}} (2^{b-1} - 1)}$. For $b = 12$ bits and the base precision used in the 12-bit experiments, the dynamic range is roughly 60. Finally, during training the learning rate is decayed toward the base precision whenever the loss plateaus.
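The dynamic range formula can be checked numerically. A base precision of 0.002 is chosen here purely for illustration, since it reproduces the quoted dynamic range of roughly 60 at 12 bits; the exact experimental value is defined above:

```python
import math

def dynamic_range(bits, eta_base):
    """Ratio of the largest to smallest representable magnitude:
    exp(eta_base * (2**(bits-1) - 1))."""
    return math.exp(eta_base * (2 ** (bits - 1) - 1))

print(round(dynamic_range(12, 0.002), 1))  # prints 60.0
```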

[Figure 2 panels: the bfloat16 format comprises a 1-bit sign, 8-bit exponent and 7-bit mantissa; the logarithmic format suggested by our theory comprises a 1-bit sign and 12-bit exponent.]
Figure 2: Upper left: the bfloat16 number system used in Google’s TPU chips (Kalamkar et al., 2019). Lower left: the logarithmic number system suggested by our theory. Right: the synaptic number system suggested by Bartol et al. (2015) based on 3D electron microscope images of hippocampal neuropil in three adult male rats. Synapses are suggested to take 26 distinguishable strengths (believed to be correlated with spine head volume) on a logarithmic number system. This plot is reproduced from (Bartol et al., 2015, Figure 8).

7 Discussion

After studying the optimisation properties of compositional functions, we have confirmed that neural networks may be trained via multiplicative updates to weights stored in a logarithmic number system. We shall conclude by discussing the possible implications for both chip design and neuroscience.

Computer number systems

In an effort to accelerate and reduce the cost of machine learning workflows, chip designers are currently exploring low precision arithmetic in both GPUs and TPUs. For example, Google has used bfloat16—or brain floating point—in their TPUs (Kalamkar et al., 2019) and NVIDIA has developed a mixed precision GPU training system (Micikevicius et al., 2018). A basic question in the design of low-precision number systems is how the bits should be split between the exponent and mantissa. As shown in Figure 2 (left), bfloat16 opts for an 8-bit exponent and 7-bit mantissa. Our work supports a prior suggestion (Lee et al., 2017) that to represent network weights, a mantissa may not be needed at all.

Curiously, the same observation is made by Bartol et al. (2015) in the context of neuroscience. In Figure 2 (right), we reproduce a plot from their paper that illustrates this. The authors suggest that the brain may use “a form of non-uniform quantization which efficiently encodes the dynamic range of synaptic strengths at constant precision”, in other words, a logarithmic number system. The authors found that “spine head volumes ranged in size over a factor of 60 from smallest to largest”, where spine head volume is a correlate of synapse strength. Surprisingly, $b$-bit Madam with $b = 12$ and the base precision used in our experiments has the same dynamic range of roughly 60.

Frozen signs and Dale’s principle

The neuroscientific principle that synapses cannot change sign is sometimes referred to as Dale’s principle (Amit et al., 1989). When compositional functions are learnt via multiplicative updates, the signs of the weights are frozen, and so the weights may be viewed as satisfying Dale’s principle. This means that after training with multiplicative updates, there is no need to store the sign bits, since they may be regenerated from the random seed. This could also impact hardware design, since it would technically be possible to freeze a random sign pattern into the microcircuitry itself.
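Because multiplicative updates never flip signs, the sign pattern at initialisation is also the final one, so it can be regenerated on demand rather than stored. A minimal sketch (the seed and shape are hypothetical):

```python
import numpy as np

def signs_from_seed(seed, shape):
    """Regenerate a frozen random sign pattern from its seed."""
    rng = np.random.default_rng(seed)
    return np.where(rng.random(shape) < 0.5, -1.0, 1.0)

# The pattern at initialisation and after training is identical:
assert np.array_equal(signs_from_seed(42, (3, 3)), signs_from_seed(42, (3, 3)))
```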

Plasticity in the brain

The precise mechanisms for plasticity in the brain are still under debate. Popular models include Hebbian learning and its variant, spike timing dependent plasticity, both of which adjust synapse strengths based on local firing history. Although these rules are usually modelled via additive updates, it has been suggested that multiplicative updates may better explain the available data, both in terms of the induced stationary distribution of synapse strengths and also time-dependent observations (van Rossum et al., 2000; Barbour et al., 2007; Buzsáki and Mizuseki, 2014). For example, Loewenstein et al. (2011) imaged dendritic spines in the mouse auditory cortex over several weeks. Upon finding that changes in spine size are “proportional to the size of the spine”, the authors suggest that multiplicative updates are at play.

In contrast to these studies that concentrate on matching candidate update rules to available data—and thus probing how the brain learns—this paper has focused on using perturbation theory and calculus to link update rules to the stability of network function. This is a complementary approach that may shed light on why the brain learns the way it does—paving the way, perhaps, for computer microarchitectures that mimic it.

Broader Impact


This paper proposes that multiplicative update rules are better suited to the compositional structure of neural networks than additive update rules. It concludes by discussing possible implications of this idea for chip design and neuroscience. The authors believe the work to be fairly neutral in terms of propensity to cause positive or negative societal outcomes.

References

  • Bartol et al. [2015] Thomas M. Bartol, Jr., Cailey Bromer, Justin Kinney, Michael A. Chirillo, Jennifer N. Bourne, Kristen M. Harris, and Terrence J. Sejnowski. Nanoconnectomic upper bound on the variability of synaptic plasticity. eLife, 2015.
  • Baker and Hammerstrom [1989] Tom Baker and Dan Hammerstrom. Characterization of artificial neural network algorithms. In International Symposium on Circuits and Systems, 1989.
  • Horowitz [2014] Mark Horowitz. Computing’s energy problem (and what we can do about it). In International Solid-State Circuits Conference, 2014.
  • Sze et al. [2017] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 2017.
  • Bernstein et al. [2020] Jeremy Bernstein, Arash Vahdat, Yisong Yue, and Ming-Yu Liu. On the distance between two neural networks and the stability of learning, 2020. arXiv:2002.03432.
  • Littlestone [1988] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 1988.
  • Kivinen and Warmuth [1997] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997.
  • Freund and Schapire [1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.
  • Grigoriadis and Khachiyan [1995] Michael D. Grigoriadis and Leonid G. Khachiyan. A sublinear-time randomized approximation algorithm for matrix games. Operations Research Letters, 1995.
  • Arora et al. [2012] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 2012.
  • Dhillon and Tropp [2008] Inderjit S. Dhillon and Joel A. Tropp. Matrix nearness problems with Bregman divergences. SIAM Journal on Matrix Analysis and Applications, 2008.
  • Bottou et al. [2018] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 2018.
  • Zhang et al. [2020] Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2020.
  • Azizan et al. [2019] Navid Azizan, Sahin Lale, and Babak Hassibi. Stochastic mirror descent on overparameterized nonlinear models: Convergence, implicit regularization, and generalization, 2019. arXiv:1906.03830.
  • Neyshabur et al. [2015] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Neural Information Processing Systems, 2015.
  • You et al. [2017] Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32K for Imagenet training. Technical Report UCB/EECS-2017-156, University of California, Berkeley, 2017.
  • You et al. [2020] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020.
  • van Vreeswijk and Sompolinsky [1996] Carl van Vreeswijk and Haim Sompolinsky. Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 1996.
  • Amit et al. [1989] Daniel J. Amit, K. Y. Michael Wong, and Colin Campbell. Perceptron learning with sign-constrained weights. Journal of Physics A, 1989.
  • Iwata et al. [1989] Iwata, Yoshida, Matsuda, Sato, and Suzumura. An artificial neural network accelerator using general purpose 24 bit floating point digital signal processors. In International Joint Conference on Neural Networks, 1989.
  • Holt and Hwang [1993] Jordan L. Holt and Jenq-Neng Hwang. Finite precision error analysis of neural network hardware implementations. Transactions on Computers, 1993.
  • Courbariaux et al. [2015] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Neural Information Processing Systems, 2015.
  • Hubara et al. [2018] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 2018.
  • Zhou et al. [2017] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights, 2017. arXiv:1702.03044.
  • Gupta et al. [2015] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, 2015.
  • Müller and Indiveri [2015] Lorenz K. Müller and Giacomo Indiveri. Rounding methods for neural networks with low resolution synaptic weights, 2015. arXiv:1504.05767.
  • Sun et al. [2019] Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, Xiaodong Cui, Wei Zhang, and Kailash Gopalakrishnan. Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. In Neural Information Processing Systems, 2019.
  • Wang et al. [2018] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. In Neural Information Processing Systems, 2018.
  • Wu et al. [2018] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. In International Conference on Learning Representations, 2018.
  • Lee et al. [2017] Edward H. Lee, Daisuke Miyashita, Elaina Chai, Boris Murmann, and S. Simon Wong. LogNet: Energy-efficient neural networks using logarithmic computation. In International Conference on Acoustics, Speech and Signal Processing, 2017.
  • Vogel et al. [2018] Sebastian Vogel, Mengyu Liang, Andre Guntoro, Walter Stechele, and Gerd Ascheid. Efficient hardware acceleration of CNNs using logarithmic data representation with arbitrary log-base. In International Conference on Computer-Aided Design, 2018.
  • Rumelhart et al. [1986] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 1986.
  • Huang et al. [2017] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Computer Vision and Pattern Recognition, 2017.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.
  • Tieleman and Hinton [2012] Tijmen Tieleman and Geoffrey E. Hinton. Lecture 6.5—RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Neural Information Processing Systems, 2017.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Neural Information Processing Systems, 2019.
  • Kalamkar et al. [2019] Dhiraj D. Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, and Pradeep Dubey. A study of bfloat16 for deep learning training, 2019. arXiv:1905.12322.
  • Micikevicius et al. [2018] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018.
  • van Rossum et al. [2000] Mark C. van Rossum, Guo-qiang Bi, and Gina G. Turrigiano. Stable Hebbian learning from spike timing-dependent plasticity. Journal of Neuroscience, 2000.
  • Barbour et al. [2007] Boris Barbour, Nicolas Brunel, Vincent Hakim, and Jean-Pierre Nadal. What can we learn from synaptic weight distributions? Trends in Neurosciences, 2007.
  • Buzsáki and Mizuseki [2014] György Buzsáki and Kenji Mizuseki. The log-dynamic brain: how skewed distributions affect network operations. Nature Reviews Neuroscience, 2014.
  • Loewenstein et al. [2011] Yonatan Loewenstein, Annerose Kuras, and Simon Rumpel. Multiplicative dynamics underlie the emergence of the log-normal distribution of spine sizes in the neocortex in vivo. Journal of Neuroscience, 2011.

Appendix A Proofs

Lemma 1 (restated).


By the fundamental theorem of calculus,
\[
f(W + \Delta W) - f(W) \;=\; \nabla f(W)^\top \Delta W \;+\; \int_0^1 \left[\nabla f(W + t\,\Delta W) - \nabla f(W)\right]^\top \Delta W \,\mathrm{d}t.
\]
The result follows by replacing the first term on the right-hand side by the cosine formula for the dot product, and bounding the second term via the integral estimation lemma. ∎

Theorem 1 (restated).


Using the gradient reliability estimate from deep relative trust, we obtain that:

Descent is guaranteed if the bracketed terms in Lemma 1 are positive. By the previous inequality, this will occur provided that:


where measures the angle between and . For the update in Equation 3,

Therefore the perturbation is given by . For this perturbation, for any possible subset of weights . Also, letting return the angle between its arguments, and are related by:

Substituting these two results back into Equation 6 and rearranging, we are done. ∎
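For the reader's convenience, the gradient reliability estimate from deep relative trust invoked above takes the following form in Bernstein et al. (2020) (restated here as a sketch in generic notation; $f$ denotes the loss of an $L$-layer network with per-layer weights $W_1, \dots, W_L$, and the bound holds under that paper's modelling assumptions):

\[
\frac{\left\|\nabla_{W_k} f(W + \Delta W) - \nabla_{W_k} f(W)\right\|_F}{\left\|\nabla_{W_k} f(W)\right\|_F}
\;\le\;
\prod_{l=1}^{L} \left(1 + \frac{\|\Delta W_l\|_F}{\|W_l\|_F}\right) - 1.
\]

In particular, a perturbation of fixed relative size $\eta$ in every layer makes the right-hand side $(1+\eta)^L - 1$, which is exactly the quantity that a multiplicative update controls directly.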