Measuring Arithmetic Extrapolation Performance

by Andreas Madsen, et al.

The Neural Arithmetic Logic Unit (NALU) is a neural network layer that can learn exact arithmetic operations between the elements of a hidden state. The goal of the NALU is to learn perfect extrapolation, which requires learning the exact underlying logic of an unknown arithmetic problem. Evaluating the performance of the NALU is non-trivial, as one arithmetic problem might have many solutions. As a consequence, single-instance MSE has been used to evaluate and compare performance between models. However, it can be hard to interpret what magnitude of MSE represents a correct solution, and a single instance does not reveal a model's sensitivity to initialization. We propose using a success-criterion to measure if and when a model converges. Using a success-criterion, we can summarize the success-rate over many initialization seeds and calculate confidence intervals. We contribute a generalized version of the previous arithmetic benchmark to measure a model's sensitivity under different conditions. This is, to our knowledge, the first extensive evaluation with respect to convergence of the NALU and its sub-units. Using a success-criterion to summarize 4800 experiments, we find that consistently learning arithmetic extrapolation is challenging, in particular for multiplication.




Code Repositories


Code for Neural Arithmetic Units (ICLR) and Measuring Arithmetic Extrapolation Performance (SEDL|NeurIPS)


1 Introduction

When neural networks are used to learn simple arithmetic problems, such as counting, multiplication, or comparison, they systematically fail to extrapolate onto unseen ranges (stillNotSystematic; suzgun2019evaluating; trask-nalu). The absence of inductive bias makes it difficult for neural networks to extrapolate well on arithmetic tasks, as they lack the underlying logic to represent the required operations.

A recently proposed model, called NALU (trask-nalu), attempts to solve the problem of arithmetic extrapolation. However, for arithmetic extrapolation there are no broadly accepted guidelines for evaluating model performance. As a result, single-instance MSE is used for comparison.

As exact extrapolation requires correctly solving a logical problem we advocate that the performance metrics of interest should be: 1) has it learned the underlying logic, 2) how often does it learn the correct solution, and 3) how fast does it converge?

Motivated by these questions, we propose using a success-criterion to determine if the underlying logic has been learned. We measure the success-rate and provide a binomial confidence interval by initializing and training the NALU over multiple seeds. For each seed, we use the first iteration that satisfies the success-criterion to measure when the model has succeeded. As the success-criterion is based on an MSE divergence from an optimal solution, it can be generalized to any model.

Finally, we propose and report a sparsity measurement for models that satisfy the success-criterion. Sparsity of the parameters has previously been emphasized as important for a correct solution (trask-nalu).

2 Related work

Prior work (GridLSTM; lte; NeuralGPU; FreivaldsL17) solves integer arithmetic operations as a classification task and reports exact-match accuracy. Accuracy is useful for well-defined classification tasks, but is hard to use for real-number regression problems. Our criterion mimics exact match by defining an MSE threshold.

3 Simple Function Learning Tasks

Figure 1: Shows success-rate given different dataset parameters and the model's hidden-size, using the multiplication operation. Means are over 50 different seeds, with 95% confidence intervals.

The “Simple Function Learning Tasks” is a synthetic dataset that tests arithmetic extrapolation. The problem is defined as summing two random subsets of the input vector, followed by an arithmetic operation on these sums. Extrapolation can be tested by modifying the sampling range of the input.

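1: function Dataset(operation, input-size, subset-ratio, overlap-ratio, range)
2:     Uniform: sample input-size elements uniformly from the range
3:     Uniform: sample the subset offset (same for interpolation and extrapolation)
4:     Sum: create the first sum from a subset of the elements
5:     Sum: create the second sum from a subset of the elements
6:     Op: perform the operation on the two sums
7:     return the elements and the result of the operation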
Algorithm 1 Dataset sampling algorithm. Default values are specified for input-size (), subset-ratio (), and overlap-ratio (). Default interpolation range is and default extrapolation range is .
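The sampling procedure in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration using only the standard library: the function name `make_dataset`, the integer rounding of subset lengths, the way the offset is drawn, and the default values are all assumptions made for the example, not the paper's settings.

```python
import random

def make_dataset(op, input_size=10, subset_ratio=0.25, overlap_ratio=0.5, seed=0):
    """Return a sampler that shares one subset offset across all sampling ranges.

    All names and defaults here are hypothetical; they only illustrate the idea.
    """
    rng = random.Random(seed)
    subset_len = max(1, int(input_size * subset_ratio))
    overlap_len = int(subset_len * overlap_ratio)
    # Number of input positions covered by both subsets together.
    span = 2 * subset_len - overlap_len
    # The offset is drawn once, so interpolation and extrapolation use the same subsets.
    start_a = rng.randint(0, input_size - span)
    start_b = start_a + subset_len - overlap_len

    def sample(value_range):
        low, high = value_range
        # Sample the input elements uniformly from the requested range.
        x = [rng.uniform(low, high) for _ in range(input_size)]
        a = sum(x[start_a:start_a + subset_len])
        b = sum(x[start_b:start_b + subset_len])
        return x, op(a, b)

    return sample
```

Calling `sample((1.0, 2.0))` and `sample((2.0, 6.0))` on the same sampler would then give an interpolation and an extrapolation observation that depend on the same input positions.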

Solving the task on extrapolation requires learning the underlying logic of arithmetic operations from the training range. As logic is discrete, a solution to the problem is either correct or wrong.

To evaluate a solution, we propose comparing the MSE on the entire test set to the MSE of a nearly-perfect solution on the extrapolation range. The nearly-perfect solution is defined as performing the operation perfectly, but allowing a small error in the sum-of-subsets (lines 4 and 5 in Algorithm 1). This threshold can be simulated with for , where and is the perfect required to compute the optimal solution. We set .
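One way to simulate such a threshold is sketched below. It assumes the perfect sum-of-subsets uses weight 1 for selected elements and 0 otherwise, and that the nearly-perfect solution shifts every weight by a small `eps`. The function names, the same-sign perturbation, and the default `eps` are hypothetical choices for illustration only.

```python
def threshold_mse(test_inputs, mask_a, mask_b, op, eps=1e-5):
    """MSE of a nearly-perfect solution: exact operation, subset weights off by eps.

    mask_a and mask_b hold 1 for elements in each subset and 0 otherwise.
    The eps default is a placeholder, not a value taken from the paper.
    """
    total = 0.0
    for x in test_inputs:
        # Perfect subset sums.
        a = sum(m * v for m, v in zip(mask_a, x))
        b = sum(m * v for m, v in zip(mask_b, x))
        # Nearly-perfect subset sums: every weight is shifted by eps.
        a_eps = sum((m + eps) * v for m, v in zip(mask_a, x))
        b_eps = sum((m + eps) * v for m, v in zip(mask_b, x))
        total += (op(a_eps, b_eps) - op(a, b)) ** 2
    return total / len(test_inputs)

def is_success(model_mse, threshold):
    """A model counts as successful when its test MSE is below the threshold."""
    return model_mse < threshold
```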

Using a success-criterion has several advantages: it is more interpretable, models that failed to converge do not obscure the mean, and, because the number of successes follows a binomial distribution, we can calculate a confidence interval.
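As an illustration, the success-rate and a binomial confidence interval could be computed as below. The text only states that a binomial confidence interval is used; the Wilson score interval chosen here is an assumption, and the function names are hypothetical.

```python
import math

def wilson_interval(successes, trials, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def success_rate(solved_at_per_seed):
    """Summarize seeds; each entry is the first successful iteration or None."""
    trials = len(solved_at_per_seed)
    successes = sum(1 for it in solved_at_per_seed if it is not None)
    low, high = wilson_interval(successes, trials)
    return successes / trials, low, high
```

Unlike a normal approximation, this interval stays inside the unit interval even when no seed or every seed succeeds.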


With a success-criterion, we can also evaluate when a model succeeds, measured as the first iteration at which the criterion is satisfied. Since this metric cannot be negative, we model it with a gamma distribution and report a 95% confidence interval of the mean, using maximum likelihood profiling.

Finally, the parameters of the NALU are argued to be “biased to be close to -1, 0, 1” (trask-nalu). We propose to measure a sparsity error of the NALU parameters. As the sparsity error is bounded, we use a modified beta distribution with matching support and report a 95% confidence interval of the mean, using maximum likelihood profiling.
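A sparsity error of this kind can be sketched as follows. The exact formula is not recoverable from the text here, so this example assumes the error of a single weight is its distance to the nearest of the values -1, 0, and 1, and that a model is scored by its worst weight. Both choices, and the function name, are assumptions.

```python
def sparsity_error(weights):
    """Largest distance from any weight to the nearest of -1, 0, or 1.

    Assumes weights lie in [-1, 1], so the result lies in [0, 0.5].
    """
    # For |w| <= 1, min(|w|, 1 - |w|) is the distance to the closest target value.
    return max(min(abs(w), abs(1.0 - abs(w))) for w in weights)
```

Under this assumed definition, a fully sparse solution scores 0 and a weight of 0.5 gives the maximum error of 0.5.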

The choice of gamma and beta distributions may not be perfect. However, a normal distribution would be problematic when the mean is close to the bounds, as it would place a large probability mass outside of the support and thus provide inaccurate confidence intervals.

4 Results

Figure 2: Shows success-rate, when models converged, and sparsity error for the multiplication operation. Means are over 50 seeds. We provide experimental details in Appendix A.
Op Model Success Rate Solved at Sparsity error
Table 1: Shows success-rate, when models converged, and sparsity error. Means are over 100 seeds.

5 Conclusion

We provide the most extensive study of the Neural Arithmetic Logic Unit to date, using a generalized version of the “Simple Function Learning Tasks”. Our study, through varying task complexities, evaluates the NALU's ability to learn the logic of arithmetic operations.

To evaluate performance on solving arithmetic operations, we define a new success-criterion that approximates an exact match. With a success-criterion, we measure how often a model successfully solves the problem given different initialization seeds, report a binomial confidence interval, and record at what iteration the model satisfies the criterion. Our results show that the NALU and its sub-units can require many trials to learn, in particular for multiplication and division. Furthermore, we find that for subtraction and addition the solution is not always sparse.

Our results do not differ from the original results, but they highlight the importance of also discussing a model's sensitivity to initialization. We hope that future research will consider using success-rates as a comparison for the performance of arithmetic units.


We would like to thank Andrew Trask and the other authors of the NALU paper, for highlighting the importance and challenges of extrapolation in Neural Networks. We would also like to thank the students Raja Shan Zaker Kreen and William Frisch Møller from The Technical University of Denmark, who initially showed us that the NALU does not converge consistently.

This research is funded by the Innovation Foundation Denmark through the DABAI project.


Appendix A Experimental details

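A.1 NALU definition

The Neural Arithmetic Logic Unit (NALU) consists of two sub-units: the $\mathrm{NAC}_{+}$ and the $\mathrm{NAC}_{\bullet}$. The sub-units represent either the $\{+, -\}$ or the $\{\times, \div\}$ operations. The NALU then assumes that either $\mathrm{NAC}_{+}$ or $\mathrm{NAC}_{\bullet}$ will be selected exclusively, using a sigmoid gating mechanism.

The $\mathrm{NAC}_{+}$ and $\mathrm{NAC}_{\bullet}$ are defined accordingly,

$$\mathbf{W} = \tanh(\hat{\mathbf{W}}) \odot \sigma(\hat{\mathbf{M}})$$
$$\mathrm{NAC}_{+}: \quad \mathbf{z} = \mathbf{W} \mathbf{x}$$
$$\mathrm{NAC}_{\bullet}: \quad \mathbf{z} = \exp\left(\mathbf{W} \log(|\mathbf{x}| + \epsilon)\right)$$

where $\hat{\mathbf{W}}, \hat{\mathbf{M}}$ are weight matrices and $\mathbf{x}$ is the input. The matrices are combined using a tanh-sigmoid transformation to bias the parameters towards a $\{-1, 0, 1\}$ solution. Having $\{-1, 0, 1\}$ allows $\mathrm{NAC}_{+}$ to perform exact $\{+, -\}$ operations between elements of a vector. The $\mathrm{NAC}_{\bullet}$ uses an exponential-log transformation to create the $\{\times, \div\}$ operations (within $\epsilon$ precision).

The NALU combines these units with a gating mechanism $\mathbf{z} = \mathbf{g} \odot \mathrm{NAC}_{+} + (1 - \mathbf{g}) \odot \mathrm{NAC}_{\bullet}$ given $\mathbf{g} = \sigma(\mathbf{G} \mathbf{x})$, thus allowing the NALU to decide between all of the $\{+, -, \times, \div\}$ operations using backpropagation.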
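The forward pass described above can be sketched in plain Python. This follows the commonly stated NALU formulation (tanh-sigmoid weights, an additive sub-unit, a multiplicative sub-unit in log-space, and a sigmoid gate). The function names, the shared weight matrix between the two sub-units, and the `eps` default are assumptions made for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def nac_weights(w_hat, m_hat):
    """Combine two unconstrained matrices into weights biased towards -1, 0, 1."""
    return [[math.tanh(w) * sigmoid(m) for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(w_hat, m_hat)]

def nac_add(W, x):
    """Additive sub-unit: a plain linear map, giving addition and subtraction."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def nac_mul(W, x, eps=1e-7):
    """Multiplicative sub-unit: a linear map in log-space, then exponentiation."""
    return [math.exp(sum(w * math.log(abs(v) + eps) for w, v in zip(row, x)))
            for row in W]

def nalu(W, G, x, eps=1e-7):
    """Gate between the additive and multiplicative sub-units per output element."""
    gate = [sigmoid(sum(g * v for g, v in zip(row, x))) for row in G]
    add, mul = nac_add(W, x), nac_mul(W, x, eps)
    return [g * a + (1.0 - g) * m for g, a, m in zip(gate, add, mul)]
```

With saturated weights (every entry of `W` close to 1) and input `[2.0, 3.0]`, the additive path gives a value close to 5 and the multiplicative path a value close to 6. The gate decides which of the two reaches the output.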

A.2 Model definitions and setup

Models are defined in Table 2 and are all optimized with Adam optimization (adam-optimization) using default parameters, and trained over iterations. Training takes about 6 hours on a single CPU core (8-Core Intel Xeon E5-2665 2.4GHz). We run 4800 experiments on an HPC cluster.

The training dataset is continuously sampled from the interpolation range, with a different seed used for each experiment. All experiments use a mini-batch size of 128 observations, a fixed validation dataset with observations sampled from the interpolation range, and a fixed test dataset with observations sampled from the extrapolation range.

We evaluate each metric at regular iteration intervals on the test set, which uses the extrapolation range, and choose the best iteration based on the validation dataset, which uses the interpolation range.
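The selection procedure can be sketched as follows. The log format, the function name, and the decision to judge success at the best-validation iteration are assumptions for illustration; the text does not spell out these details.

```python
def summarize_run(log, threshold):
    """Summarize one seed from periodic evaluations.

    log: list of (iteration, validation_mse, test_mse) tuples, in training order.
    threshold: MSE below which the extrapolation test counts as solved.
    """
    # Pick the iteration using the interpolation (validation) data only.
    best_iteration, _, test_at_best = min(log, key=lambda row: row[1])
    # First evaluated iteration at which the extrapolation test is solved, if any.
    solved_at = next((it for it, _, test in log if test < threshold), None)
    return {
        "best_iteration": best_iteration,
        "test_mse_at_best": test_at_best,
        "success": test_at_best < threshold,
        "solved_at": solved_at,
    }
```

Selecting on validation data keeps the extrapolation test set out of the model-selection loop, so the reported test metrics are not chosen after the fact.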

For figure 2, the following extrapolation ranges were used: , , , , , .

Model Layer 1 Layer 2
Linear Linear Linear
Table 2: Model definitions