madam
Pytorch and Jax code for the Madam optimiser.
Compositionality is a basic structural feature of both biological and artificial neural networks. Learning compositional functions via gradient descent incurs well-known problems like vanishing and exploding gradients, making careful learning rate tuning essential for real-world applications. This paper proves that multiplicative weight updates satisfy a descent lemma tailored to compositional functions. Based on this lemma, we derive Madam, a multiplicative version of the Adam optimiser, and show that it can train state-of-the-art neural network architectures without learning rate tuning. We further show that Madam is easily adapted to train natively compressed neural networks by representing their weights in a logarithmic number system. We conclude by drawing connections between multiplicative weight updates and recent findings about synapses in biology.
Neural computation in living systems emerges from the collective behaviour of large numbers of low precision and potentially faulty processing elements. This is a far cry from the precision and reliability of digital electronics. Looking at the numbers, a synapse on a computer is often represented using 32 bits taking more than 4 billion distinct values. In contrast, a biological synapse is estimated to take 26 distinguishable strengths requiring only 5 bits to store
(Bartol et al., 2015). This is a discrepancy between nature and engineering that spans many orders of magnitude. So why does the brain learn stably whereas deep learning is notoriously finicky and sensitive to myriad hyperparameters?
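These numbers are easy to check for oneself (a quick sketch, using the log base 2 of the level count as the information measure):

```python
import math

# A 32-bit float can encode 2**32 distinct bit patterns.
distinct_values_fp32 = 2 ** 32
assert distinct_values_fp32 > 4_000_000_000  # "more than 4 billion"

# A biological synapse with 26 distinguishable strengths stores
# log2(26) bits of information, which rounds up to 5 bits.
bits_per_synapse = math.log2(26)
print(round(bits_per_synapse, 2))  # → 4.7
```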
Meanwhile, an industrial effort is underway to scale artificial networks up to run on supercomputers and down to run on resource-limited edge devices. While learning algorithms designed to run natively on low precision hardware could lead to smaller and more power-efficient chips (Baker and Hammerstrom, 1989; Horowitz, 2014), progress is hampered by our poor understanding of how precision impacts learning. As such, existing numerical representations have developed somewhat independently from learning algorithms. The next generation of neural hardware could benefit from more principled algorithmic co-design (Sze et al., 2017).
Our contributions:
Building on recent results in the perturbation analysis of compositional functions (Bernstein et al., 2020), we show that a multiplicative learning rule satisfies a descent lemma tailored to neural networks.
We propose and benchmark Madam, a multiplicative version of the Adam optimiser. Empirically, Madam seems not to require learning rate tuning, and it may further be used to train neural networks with low bit width synapses stored in a logarithmic number system.
We point out that multiplicative weight updates respect certain aspects of neuroanatomy. First, synapses are exclusively excitatory or inhibitory since their sign is preserved under the update. Second, multiplicative weight updates are most naturally implemented in a logarithmic number system, in line with anatomical findings about biological synapses (Bartol et al., 2015).
Multiplicative algorithms have a storied history in computer science. Examples in machine learning include the Winnow algorithm
(Littlestone, 1988) and the exponentiated gradient algorithm (Kivinen and Warmuth, 1997)—both for learning linear classifiers in the face of irrelevant input features. The Hedge algorithm
(Freund and Schapire, 1997), which underpins the AdaBoost framework for boosting weak learners, is also multiplicative. In algorithmic game theory, multiplicative weight updates may be used to solve two-player zero-sum games
(Grigoriadis and Khachiyan, 1995). Arora et al. (2012) survey many more applications.

Multiplicative updates are typically viewed as appropriate for problems where the geometry of the optimisation domain is described by the relative entropy (Kivinen and Warmuth, 1997)
, as is often the case when optimising over probability distributions. Since the relative entropy is a Bregman divergence, the algorithm may then be studied under the framework of mirror descent
(Dhillon and Tropp, 2008). We suggest that multiplicative updates may arise under a broader principle: when the geometry of the optimisation domain is described by any relative distance measure.

Finding the right distance measure to describe deep neural networks is an ongoing research problem. Some theoretical work has supposed that the underlying geometry is Euclidean via the assumption of Lipschitz-continuous gradients (Bottou et al., 2018), but Zhang et al. (2020) suggest that this assumption may be too strong and consider relaxing it. Similarly, Azizan et al. (2019) consider more general Bregman divergences under the mirror descent framework, although these distance measures are not tailored to compositional functions like neural networks. Neyshabur et al. (2015), on the other hand, derive a distance measure based on paths through a neural network in order to better capture the scaling symmetries over layers. Still, it is difficult to place this distance within a formal optimisation theory in order to derive descent lemmas.
Recent research has looked at learning algorithms that make relative updates to network layers, such as LARS (You et al., 2017), LAMB (You et al., 2020), and Fromage (Bernstein et al., 2020). Empirically, these algorithms appear to stabilise large batch network training and require little to no learning rate tuning. Bernstein et al. (2020) suggested that these algorithms are accounting for the compositional structure of the neural network function class, and derived a new distance measure called deep relative trust to describe this analytically. It is the relative nature of deep relative trust that leads us to propose multiplicative updates in this work.
A basic goal of theoretical neuroscience is to connect the numerical properties of a synapse with network function. For example, following the observation that synapses are exclusively excitatory or inhibitory, van Vreeswijk and Sompolinsky (1996) studied how the balance of excitation and inhibition can affect network dynamics and Amit et al. (1989)
studied perceptron learning with signconstrained synapses. More recently, based on the observation that synapse size and strength are correlated,
Bartol et al. (2015) used the number of distinguishable synapse sizes to estimate the information content of a synapse. Their results suggest that biological synapses may occupy just 26 levels in a logarithmic number system, thus storing less than 5 bits of information. This leads to an estimate of the total storage capacity of a human brain.

In their bid to outrun the end of Moore's Law, chip designers have also taken an interest in understanding and improving the efficiency of artificial synapses. This work dates back at least to the 1980s and 1990s: for example, Iwata et al. (1989) designed a 24-bit neural network accelerator, while Baker and Hammerstrom (1989) suggested that learning may break down below 12 bits per synapse. In 1993, Holt and Hwang (1993) analysed round-off error for compounding operators and proposed a heuristic connection between numerics and optimisation (their Equation 54).
Last decade there was renewed interest in low precision synaptic weights both for deployment (Courbariaux et al., 2015; Hubara et al., 2018; Zhou et al., 2017) and training (Gupta et al., 2015; Müller and Indiveri, 2015; Sun et al., 2019; Wang et al., 2018; Wu et al., 2018) of artificial networks. This research has included the exploration of logarithmic number systems (Lee et al., 2017; Vogel et al., 2018). A general trend has emerged: a trained network may be quantised to just a few bits per synapse, but 8 to 16 bits are typically required for stable learning (Gupta et al., 2015; Wang et al., 2018; Sun et al., 2019). Given the lack of theoretical understanding of how precision relates to learning, these works often introduce subtle but significant complexities. For example, existing works using logarithmic number systems combine them with additive optimisation algorithms like Adam and SGD (Lee et al., 2017), thus requiring tuning of both the learning algorithm and the numerical representation. And many works must resort to using high precision weights in the final layer of the network to maintain accuracy (Sun et al., 2019; Wang et al., 2018; Wu et al., 2018).
A basic question in the theory of neural networks is as follows:
How far can we perturb the synapses before we damage the network as a whole?
In this paper, this question is important on two fronts: our learning rule must not destroy the information contained in the synapses. And our numerical representation must be precise enough to encode nondestructive perturbations. Once it has been established that multiplicative updates are a good learning rule (addressing the first point) it becomes natural to represent them using a logarithmic number system (addressing the second). Therefore, this section shall focus on establishing—first as a sketch, then rigorously—the benefits of multiplicative updates for learning compositional functions.
The raison d'être of a synaptic weight is to support learning in the network as a whole. In machine learning, this is formalised by constructing a loss function ℒ(W) that measures the error of the network in weight configuration W. Learning proceeds by perturbing the synapses in order to reduce the loss function. A good perturbation direction is the negative gradient of the loss: −∇ℒ(W). But how far can this direction be trusted?

Intuitively, the negative gradient should only be trusted until its approximation quality breaks down. This breakdown could be measured by the Hessian of the loss function, but this is intractable for large networks since it involves all pairs of weights. Instead, Bernstein et al. (2020) suggest how to operate without the Hessian. To get a handle on exactly how this is done, consider an L-layer multilayer perceptron with weight matrices W = (W_1, …, W_L), and the gradient of its loss with respect to the weights at the kth layer:

∇_{W_k} ℒ = (∂ℒ/∂h_L) · (∂h_L/∂h_k) · (∂h_k/∂W_k),    (1)

where h_l denotes the activations at the lth hidden layer of the network. By the backpropagation algorithm (Rumelhart et al., 1986), and ignoring the nonlinearity for the sake of this sketch, the second term in Equation 1 depends on the product of weight matrices over layers k+1 to L, and the third term depends on the product of weight matrices over layers 1 to k−1. It is therefore natural to model the relative change in the whole expression via the formula for the relative change of a product:

‖Δ(W_L ⋯ W_1)‖ / ‖W_L ⋯ W_1‖ ≲ ∏_{l=1}^{L} (1 + ‖ΔW_l‖/‖W_l‖) − 1.

This neglects the specific choice of loss function, which enters via the first term in Equation 1.
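The compounding behaviour in this sketch is easy to check numerically. The snippet below is a toy model of our own (scalar "layers", not the paper's code): it perturbs each factor of a product by the same relative amount ε and confirms that the product's relative change is (1 + ε)^L − 1, which grows rapidly with depth:

```python
# Toy check: relative change of a product of L positive scalar "layers",
# each perturbed by the same relative amount eps.
def product(xs):
    out = 1.0
    for x in xs:
        out *= x
    return out

L, eps = 50, 0.01
weights = [1.5] * L                      # any positive values work
perturbed = [w * (1 + eps) for w in weights]

rel_change = product(perturbed) / product(weights) - 1
predicted = (1 + eps) ** L - 1           # the compounding formula

assert abs(rel_change - predicted) < 1e-9
print(round(predicted, 3))  # → 0.645: a 1% change per layer moves a
                            # 50-layer product by roughly 64%
```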
To recap, we have proposed that neural networks ought to be trained by following the gradient of their loss until that gradient breaks down. We then sketched that the relative breakdown in gradient depends on a product over relative perturbations to each network layer. The simplest perturbation that follows the gradient direction and keeps the layerwise relative perturbation small was first introduced by You et al. (2017):

ΔW_k = −η · (‖W_k‖ / ‖∇_{W_k} ℒ‖) · ∇_{W_k} ℒ,    for learning rate η and each layer k.
The downside of this rule is that it requires knowing precisely which weights act together as a layer and normalising those updates jointly. It is difficult to imagine this happening in the brain, and for exotic artificial networks it is sometimes unclear what constitutes a layer (Huang et al., 2017). A convenient way to sidestep this issue is to update each individual weight w multiplicatively, via:

w ← w · (1 − η · sign w · sign ∇_w ℒ),    equivalently    Δw = −η · |w| · sign ∇_w ℒ.

This update ensures that ‖Δw‖/‖w‖ = η is small for every subset of the weights whilst only using information local to a synapse. Therefore, by appropriate choice of the signs of the multiplicative factors, it can be arranged that this perturbation is roughly aligned with the negative gradient whilst keeping the relative breakdown in gradient small.
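A minimal sketch of this per-weight rule (our illustration, with made-up weight and gradient values): every weight moves against its gradient sign, and every weight's relative change is exactly η.

```python
eta = 0.01

def sign(x):
    return (x > 0) - (x < 0)

# Hypothetical weights and gradients, one entry per synapse.
w = [0.5, -1.2, 0.003, -0.04]
g = [0.2, 0.7, -0.1, -0.9]

w_new = [wi * (1 - eta * sign(wi) * sign(gi)) for wi, gi in zip(w, g)]

for wi, gi, wni in zip(w, g, w_new):
    # The step is aligned with the negative gradient...
    assert sign(wni - wi) == -sign(gi)
    # ...and the relative perturbation is exactly eta, for every weight.
    assert abs(abs(wni - wi) / abs(wi) - eta) < 1e-9

print("relative change per weight:", eta)
```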
On the next page, we shall develop this sketch into a rigorous optimisation theory.
To elucidate the connection between gradient breakdown and optimisation, consider the following inequality. Though it applies to all continuously differentiable functions, we will think of ℒ as a loss function measuring the performance of a neural network of depth L at some task.

Consider a continuously differentiable function ℒ that maps ℝⁿ → ℝ. Suppose that the parameter vector w ∈ ℝⁿ decomposes into L parameter groups: w = (w_1, …, w_L), and consider making a perturbation Δw = (Δw_1, …, Δw_L). Let θ measure the angle between Δw and the negative gradient −∇ℒ(w). Then:

ℒ(w + Δw) ≤ ℒ(w) − ‖Δw‖ · ‖∇ℒ(w)‖ · [cos θ − max_{t∈[0,1]} ‖∇ℒ(w + t·Δw) − ∇ℒ(w)‖ / ‖∇ℒ(w)‖].

The proof is in Appendix A. The result says: to reduce a function, follow its negative gradient until it breaks down. Descent is formally guaranteed when the bracketed term is positive. That is, when:

max_{t∈[0,1]} ‖∇ℒ(w + t·Δw) − ∇ℒ(w)‖ / ‖∇ℒ(w)‖ < cos θ.    (2)
According to Equation 2, to rigorously guarantee descent for neural networks we must bound their relative breakdown in gradient. To this end, Bernstein et al. (2020) propose the notion of deep relative trust based on a perturbation analysis of compositional functions.
[Deep relative trust] Consider a neural network with L layers and parameters W = (W_1, …, W_L). Consider a parameter perturbation ΔW = (ΔW_1, …, ΔW_L), and let ∇_{W_k} ℒ denote the gradient of the loss with respect to the kth layer. Then the gradient breakdown is bounded by:

‖∇_{W_k} ℒ(W + ΔW) − ∇_{W_k} ℒ(W)‖ / ‖∇_{W_k} ℒ(W)‖ ≤ ∏_{l=1}^{L} (1 + ‖ΔW_l‖/‖W_l‖) − 1.
The product over layers reflects the compositional structure of the network. Crucially, compared to a Hessian that contains N² entries for a network with N parameters, this is a tractable analytic expression.
Since deep relative trust penalises the relative size of the perturbation to each layer, it is natural that our learning algorithm would bound these perturbations on a relative scale. A simple way to achieve this is via the following multiplicative update rule:
W ← W ⊙ (1 − η · sign W ⊙ sign ∇_W ℒ),    (3)

where the sign is taken elementwise and ⊙ denotes elementwise multiplication. Synapses shrink where the signs of W and ∇_W ℒ agree and grow where the signs differ. In the following theorem, we establish descent under this update for compositional functions described by deep relative trust.
Let ℒ be the continuously differentiable loss function of a neural network of depth L that obeys deep relative trust. Let θ denote the angle between the nonnegative vectors |W| and |∇_W ℒ|, where the absolute values are taken elementwise. Then the multiplicative update in Equation 3 will decrease the loss function provided that:

0 < η < (1 + cos θ)^{1/L} − 1.

Theorem 1 tells us that for small enough η, multiplicative updates achieve descent. It also tells us on what scale η must be small. The proof is given in Appendix A.
We can bring Theorem 1 to life by plugging in numbers. First, we must consider what values the angle θ is likely to take. Since |W| and |∇_W ℒ| are nonnegative vectors, the angle between them can be no larger than 90°, and this extreme occurs only when the supports of the two vectors are totally disjoint. For problems occurring in practice we would expect that the supports overlap. Therefore it seems reasonable to substitute θ = 45° into Theorem 1, whence we obtain that η < (1 + cos 45°)^{1/L} − 1 ≈ 0.53/L. For example:

For a 50-layer network, setting η = 0.01 guarantees descent.

For a 500-layer network, setting η = 0.001 guarantees descent.

For a 5000-layer network, setting η = 0.0001 guarantees descent.

We find that η = 0.01 works well in all our experiments with the Madam optimiser in later sections.
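If the descent condition takes the form η < (1 + cos θ)^{1/L} − 1 (our reading of Theorem 1; treat the exact functional form as an assumption), the admissible learning rate can be computed directly, and it shrinks roughly like 1/depth:

```python
import math

def max_eta(depth, theta_deg=45.0):
    """Largest learning rate satisfying eta < (1 + cos(theta))**(1/depth) - 1.

    This functional form is our reconstruction of the descent condition;
    the qualitative point is that the bound decays roughly like 1/depth.
    """
    return (1 + math.cos(math.radians(theta_deg))) ** (1 / depth) - 1

for depth in [50, 500, 5000]:
    print(depth, round(max_eta(depth), 4))
# The thresholds come out near 0.01, 0.001 and 0.0001 respectively,
# consistent with a fixed choice of eta = 0.01 at typical depths.
```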
In the previous section, we built an optimisation theory for the multiplicative update rule appearing in Equation 3. While that update yields to straightforward mathematical analysis, two modifications render it more practically useful. First, we use the fact that eˣ ≈ 1 + x for small x to approximate Equation 3 by:

W ← W ⊙ exp(−η · sign W ⊙ sign g),    (4)

where g is shorthand for the gradient ∇_W ℒ. This change makes it easier to represent weights in a logarithmic number system, since the weights are restricted to integer multiples of η in log space.
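To see the log-space structure concretely, here is a toy run of an exponential sign update of the form w ← w · exp(−η · sign w · sign g) (our sketch, with a stand-in random gradient): after any number of steps, log|w| sits an integer number of multiples of η away from log|w₀|.

```python
import math, random

eta = 0.01
random.seed(0)

w0 = 0.37            # hypothetical initial weight
w = w0
for _ in range(1000):
    g = random.uniform(-1, 1)                # stand-in gradient sample
    s = 1.0 if g >= 0 else -1.0
    w *= math.exp(-eta * (1.0 if w > 0 else -1.0) * s)

# log|w| differs from log|w0| by an integer multiple of eta, so a
# logarithmic number system with step eta represents w exactly.
steps = (math.log(abs(w)) - math.log(abs(w0))) / eta
assert abs(steps - round(steps)) < 1e-6
print(round(steps))  # the integer number of rungs moved
```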
Second, in practice it may be overly stringent to restrict the gradient to its 1-bit sign. To retain more gradient precision, we propose:

W ← W ⊙ exp(−η · sign W ⊙ clamp(g / g_rms)).    (5)

In this expression, g_rms denotes the root mean square gradient. This is estimated by a running average over iterations. Each iteration, it is updated by:

g_rms² ← β · g_rms² + (1 − β) · g²,    for a decay constant 0 < β < 1.

This idea is borrowed from the Adam and RMSprop optimisers (Kingma and Ba, 2015; Tieleman and Hinton, 2012). It is important to realise that the quantity g / g_rms is typically of order one, which explains why it may be viewed as a higher precision version of sign g. Finally, the function clamp projects its argument on to a bounded interval [−c, c]. This means that a weight can change by no more than a factor of e^{±ηc} per iteration, ensuring that the algorithm still has bounded relative perturbations and respects deep relative trust. We refer to Equation 5 as Madam; full pseudocode is given in Algorithm 1.

The multiplicative nature of Madam suggests storing synapse strengths in a logarithmic number system, where numbers are represented just by a sign and an exponent. To see why this is natural for Madam consider that, since g / g_rms is typically of order one, Madam's typical relative perturbation to a synapse is a factor of e^{±η} ≈ 1 ± η. Therefore, in log space the synapse strengths typically change by ±η. This suggests efficiently representing a synapse by its sign and an integer multiple of η.
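Putting the pieces together, a single step of this scheme might look like the sketch below. This is our illustration, not the paper's Algorithm 1: the decay constant beta and the clamp bound c are assumed hyperparameter values.

```python
import math

def madam_step(w, grad, v, eta=0.01, beta=0.999, c=1.0, eps=1e-12):
    """One multiplicative step per weight (a sketch, not the reference code).

    w, grad, v are lists; v holds the running average of squared gradients.
    beta and the clamp bound c are assumed values, not taken from the paper.
    """
    w_new, v_new = [], []
    for wi, gi, vi in zip(w, grad, v):
        vi = beta * vi + (1 - beta) * gi * gi        # RMS gradient estimate
        g_norm = gi / (math.sqrt(vi) + eps)          # typically order one
        g_norm = max(-c, min(c, g_norm))             # clamp bounds the step
        sign_w = 1.0 if wi >= 0 else -1.0
        w_new.append(wi * math.exp(-eta * sign_w * g_norm))
        v_new.append(vi)
    return w_new, v_new

w = [0.5, -0.25, 0.125]       # hypothetical weights
v = [0.0, 0.0, 0.0]
grad = [0.3, -0.2, 0.1]       # hypothetical gradients
w, v = madam_step(w, grad, v)

# No weight changed by more than roughly a factor of exp(eta * c),
# and no weight flipped sign.
for wi_old, wi in zip([0.5, -0.25, 0.125], w):
    assert math.exp(-0.011) < abs(wi / wi_old) < math.exp(0.011)
    assert (wi > 0) == (wi_old > 0)
```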
In practice, a slightly more fine-grained discretisation is beneficial. We define the base precision η_base, which divides both the learning rate η and the maximum perturbation strength η·c. We then round the exponent appearing in Madam to the nearest multiple of η_base. This leads to quantised multiplicative updates:

W ← W ⊙ exp(−sign W ⊙ round_{η_base}(η · clamp(g / g_rms))),

where round_{η_base} rounds its argument to the nearest multiple of η_base. A good setting in practice fixes η = 0.01 and takes η_base to be a small fraction of η. Finally, to obtain a b-bit weight representation, we must restrict the number of allowed weight levels to 2^{b−1}, reserving one bit for the sign. The resulting representation is:

w = ± σ · exp(−η_base · k),    k ∈ {0, 1, …, 2^{b−1} − 1}.

The scale parameter σ is shared by a whole network layer, and may be taken from an existing weight initialisation scheme. A good mental picture is that the weights live on a ladder in log space, with rungs spaced η_base apart. The Madam update moves the weights up or down the rungs of the ladder.
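The ladder picture can be made concrete. Below is our sketch of a b-bit representation (one sign bit, with magnitudes σ·e^(−η_base·k) for an integer rung k) together with a quantiser that snaps an arbitrary weight to its nearest rung; the value of eta_base here is assumed for illustration.

```python
import math

def quantise(w, scale=1.0, eta_base=0.002, bits=12):
    """Snap w to the nearest representable level: +/- scale * exp(-eta_base * k).

    One bit stores the sign; the remaining bits-1 store the integer rung k.
    eta_base = 0.002 is an assumed value, purely for illustration.
    """
    if w == 0:
        return 0.0
    sign = 1.0 if w > 0 else -1.0
    k = round(-math.log(abs(w) / scale) / eta_base)   # nearest rung in log space
    k = max(0, min(2 ** (bits - 1) - 1, k))           # clip to the allowed range
    return sign * scale * math.exp(-eta_base * k)

w = 0.3217                       # arbitrary weight
q = quantise(w)

# Within range, quantisation error is at most half a rung in log space.
assert abs(math.log(abs(q)) - math.log(abs(w))) <= 0.002 / 2 + 1e-9
assert (q > 0) == (w > 0)
print(q)
```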
Note: all Madam runs use initial learning rate η = 0.01.
For all algorithms, the learning rate is decayed by a factor of 10 when the loss plateaus.
Dataset  Task  Baseline optimiser  Baseline learning rate  Madam learning rate
CIFAR-10  ResNet-18  Adam  0.001  0.01
CIFAR-100  ResNet-18  SGD  0.1  0.01
ImageNet  ResNet-50  SGD  0.1  0.01
CIFAR-10  cGAN  Adam  0.0001  0.01
Wikitext-2  Transformer  SGD  1.0  0.01
In this section, we benchmark Madam (Algorithm 1) with weights represented in 32-bit floating point. In the next section, we shall benchmark b-bit Madam. The results in this section show that, across various tasks including image classification, language modelling and image generation, Madam without learning rate tuning is consistently competitive with a tuned SGD or Adam.
In Figure 1, we show the results of a learning rate grid search undertaken for Madam, SGD and Adam. The optimal learning rate setting for each benchmark is shown with a red cross. Notice that for Madam the optimal learning rate is always 0.01, whereas for SGD and Adam it varies across benchmarks.
In Table 1, we compare the final results using tuned learning rates for Adam and SGD and using η = 0.01 for Madam. Since we are comparing Madam to the better algorithm out of Adam and SGD for each benchmark, this comparison is technically biased against Madam. Still, Madam's results are competitive, and substantially better in the GAN experiment, where SGD obtained a far worse FID.
The code for these experiments may be found at https://github.com/jxbz/madam. Because bias terms are initialised to zero by default in Pytorch (Paszke et al., 2019), a multiplicative update would have no effect on these parameters. Therefore we initialised these terms away from zero, which led to slight performance improvements in some experiments. Madam benefited from light tuning of the σ hyperparameter (see Algorithm 1), which regularises the network by controlling the maximum size of the weights. In each experiment, σ was set to be 1, 2 or 3 times larger than the initialisation scale on a layerwise basis. The effect of tuning σ is comparable to the effect of tuning weight decay in SGD.

The results in this section demonstrate that b-bit Madam can be used to train networks that use 8–12 bits per weight, often with little to no loss in accuracy compared to an FP32 baseline. This compression level is in the range of 8–16 bits suggested by prior work (Gupta et al., 2015; Wang et al., 2018; Sun et al., 2019). However, we must emphasise the ease with which these results were attained. Just as Madam did not require learning rate tuning (see Figure 1), neither did b-bit Madam. In all 12-bit runs, a learning rate of η = 0.01 combined with a fixed base precision η_base could be relied upon to achieve stable learning.
The results are given in Table 2. Though little deterioration is experienced at 12 bits, we believe that the results could be improved by making minor hyperparameter tweaks. For example, in the 12-bit ImageNet experiment we were able to reduce the error by borrowing layerwise parameter scales (see Section 4.1) from a pretrained model instead of using the standard Pytorch (Paszke et al., 2019) initialisation scale. Still, it would be against the spirit of this work to present results with overtuned hyperparameters.
To get more of a feel for the relative simplicity of our approach, we shall briefly comment on some of the subtleties introduced by prior work that b-bit Madam avoids. Studies often maintain higher-precision copies of the weights as part of their low-precision training process (Wu et al., 2018; Wang et al., 2018). For example, in their paper on 8-bit training, Wang et al. (2018) actually maintain a 16-bit master copy of the weights. Furthermore, it is common to keep certain network layers such as the output layer at higher precision (Sun et al., 2019; Wang et al., 2018; Wu et al., 2018). In contrast, we use the same bit width to represent every layer's weights, and weights are both stored and updated in their b-bit representation.
Furthermore, we want to emphasise how natural it is to combine multiplicative updates with a logarithmic number system. Prior research on deep network training using logarithmic number systems has combined them with additive optimisation algorithms like SGD (Lee et al., 2017). This necessitates tuning both the number system hyperparameters (dynamic range and base precision) and the optimisation hyperparameter (learning rate). As was demonstrated in Figure 1, tuning the SGD learning rate is already a computationally intensive task. Moreover, the cost of hyperparameter grid search grows exponentially in the number of hyperparameters.
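The exponential growth is simple to quantify: with g candidate values per hyperparameter and k hyperparameters, a full grid has g^k points. The illustrative numbers below are our own:

```python
from itertools import product

candidates = [1, 2, 3, 4, 5]          # 5 trial values per hyperparameter

# Tuning the learning rate alone: 5 runs.
lr_only = len(list(product(candidates)))

# Tuning learning rate + dynamic range + base precision: 5**3 = 125 runs.
lr_range_precision = len(list(product(candidates, repeat=3)))

assert lr_only == 5
assert lr_range_precision == 125
print(lr_range_precision // lr_only)  # → 25: 25x more training runs
```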
Dataset  Task  FP32 Madam  12-bit  10-bit  8-bit
CIFAR-10  ResNet-18
CIFAR-100  ResNet-18
ImageNet  ResNet-50
CIFAR-10  cGAN
Wikitext-2  Transformer
The code for these experiments may be found at https://github.com/jxbz/madam. The b-bit Madam hyperparameters are defined in Section 4.1. We choose the layerwise scale σ in b-bit Madam following the same strategy as in Madam: 1, 2 or 3 times the default Pytorch (Paszke et al., 2019) initialisation scale, except for biases, where the default Pytorch initialisation is zero. We choose the initial learning rate to be η = 0.01 across all experiments. In the 12-bit experiments, we choose a fine base precision η_base. In the 10-bit and 8-bit experiments, a larger base precision is used, still dividing η. For each layer, we initialise the weights uniformly at random over the representable levels:

w = ± σ · exp(−η_base · k),    k ∈ {0, 1, …, 2^{b−1} − 1}.

Notice that η_base sets the precision of the representation, and the dynamic range, meaning the ratio of the largest to smallest representable magnitude, is given by exp(η_base · (2^{b−1} − 1)). Finally, during training the learning rate is decayed toward the base precision whenever the loss plateaus.
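For a representation with magnitudes σ·e^(−η_base·k), k = 0, …, N−1, the ratio of the largest to smallest representable magnitude is e^(η_base·(N−1)). As a worked example (with an assumed base precision of 0.002, chosen purely for illustration), 12 bits gives a dynamic range of about 60, matching the spine-volume range quoted from Bartol et al. (2015):

```python
import math

bits = 12
eta_base = 0.002                     # assumed base precision, for illustration
levels = 2 ** (bits - 1)             # one bit is reserved for the sign

# Largest magnitude is scale * e^0; smallest is scale * e^(-eta_base*(levels-1)).
dynamic_range = math.exp(eta_base * (levels - 1))
print(round(dynamic_range, 1))  # → 60.0
```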
After studying the optimisation properties of compositional functions, we have confirmed that neural networks may be trained via multiplicative updates to weights stored in a logarithmic number system. We shall conclude by discussing the possible implications for both chip design and neuroscience.
In an effort to accelerate and reduce the cost of machine learning workflows, chip designers are currently exploring low precision arithmetic in both GPUs and TPUs. For example, Google has used bfloat16—or brain floating point—in their TPUs (Kalamkar et al., 2019) and NVIDIA has developed a mixed precision GPU training system (Micikevicius et al., 2018). A basic question in the design of lowprecision number systems is how the bits should be split between the exponent and mantissa. As shown in Figure 2 (left), bfloat16 opts for an 8bit exponent and 7bit mantissa. Our work supports a prior suggestion (Lee et al., 2017) that to represent network weights, a mantissa may not be needed at all.
Curiously, the same observation is made by Bartol et al. (2015) in the context of neuroscience. In Figure 2 (right), we reproduce a plot from their paper that illustrates this. The authors suggest that the brain may use "a form of nonuniform quantization which efficiently encodes the dynamic range of synaptic strengths at constant precision", or in other words, a logarithmic number system. The authors found that "spine head volumes ranged in size over a factor of 60 from smallest to largest", where spine head volume is a correlate of synapse strength. Surprisingly, b-bit Madam with the bit width and base precision used in our experiments has roughly the same dynamic range of 60.
The neuroscientific principle that synapses cannot change sign is sometimes referred to as Dale’s principle (Amit et al., 1989). When compositional functions are learnt via multiplicative updates, the signs of the weights are frozen and can be thought to satisfy Dale’s principle. This means that after training with multiplicative updates, there is no need to store the sign bits since they may be regenerated from the random seed. This could also impact hardware design since it would technically be possible to freeze a random sign pattern into the microcircuitry itself.
The precise mechanisms for plasticity in the brain are still under debate. Popular models include Hebbian learning and its variant spike-timing-dependent plasticity, both of which adjust synapse strengths based on local firing history. Although these rules are usually modelled via additive updates, it has been suggested that multiplicative updates may better explain the available data, both in terms of the induced stationary distribution of synapse strengths and in terms of time-dependent observations (van Rossum et al., 2000; Barbour et al., 2007; Buzsáki and Mizuseki, 2014). For example, Loewenstein et al. (2011) imaged dendritic spines in the mouse auditory cortex over several weeks. Upon finding that changes in spine size are "proportional to the size of the spine", the authors suggest that multiplicative updates are at play.
In contrast to these studies that concentrate on matching candidate update rules to available data—and thus probing how the brain learns—this paper has focused on using perturbation theory and calculus to link update rules to the stability of network function. This is a complementary approach that may shed light on why the brain learns the way it does—paving the way, perhaps, for computer microarchitectures that mimic it.
This paper proposes that multiplicative update rules are better suited to the compositional structure of neural networks than additive update rules. It concludes by discussing possible implications of this idea for chip design and neuroscience. The authors believe the work to be fairly neutral in terms of propensity to cause positive or negative societal outcomes.
Zhang et al. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2020.
van Vreeswijk and Sompolinsky. Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 1996.
Buzsáki and Mizuseki. The log-dynamic brain: how skewed distributions affect network operations. Nature Reviews Neuroscience, 2014.
Loewenstein et al. Multiplicative dynamics underlie the emergence of the lognormal distribution of spine sizes in the neocortex in vivo. Journal of Neuroscience, 2011.
By the fundamental theorem of calculus,

ℒ(w + Δw) − ℒ(w) = ∇ℒ(w)ᵀ Δw + ∫₀¹ [∇ℒ(w + t·Δw) − ∇ℒ(w)]ᵀ Δw dt.

The result follows by replacing the first term on the right-hand side by the cosine formula for the dot product, and bounding the second term via the integral estimation lemma. ∎
Using the gradient reliability estimate from deep relative trust, we obtain that:

max_{t∈[0,1]} ‖∇ℒ(w + t·Δw) − ∇ℒ(w)‖ / ‖∇ℒ(w)‖ ≤ ∏_{k=1}^{L} (1 + ‖Δw_k‖/‖w_k‖) − 1.

Descent is guaranteed if the bracketed term in Lemma 1 is positive. By the previous inequality, this will occur provided that:

∏_{k=1}^{L} (1 + ‖Δw_k‖/‖w_k‖) − 1 < cos θ,    (6)

where θ measures the angle between Δw and −∇ℒ(w). For the update in Equation 3, the perturbation is given by Δw = −η · |w| ⊙ sign ∇ℒ(w). For this perturbation, ‖Δw_k‖/‖w_k‖ = η for any possible subset w_k of the weights. Also, letting ∠(·, ·) return the angle between its arguments, the two relevant angles are related by:

cos θ = cos ∠(Δw, −∇ℒ) = ⟨|w|, |∇ℒ|⟩ / (‖w‖ · ‖∇ℒ‖) = cos ∠(|w|, |∇ℒ|),

which is the angle appearing in Theorem 1. Substituting these two results back into Equation 6 and rearranging, we are done. ∎