Machine Learning meets Number Theory: The Data Science of Birch-Swinnerton-Dyer

by   Laura Alessandretti, et al.

Empirical analysis is often the first step towards the birth of a conjecture. This is the case of the Birch-Swinnerton-Dyer (BSD) Conjecture describing the rational points on an elliptic curve, one of the most celebrated unsolved problems in mathematics. Here we extend the original empirical approach, to the analysis of the Cremona database of quantities relevant to BSD, inspecting more than 2.5 million elliptic curves by means of the latest techniques in data science, machine-learning and topological data analysis. Key quantities such as rank, Weierstrass coefficients, period, conductor, Tamagawa number, regulator and order of the Tate-Shafarevich group give rise to a high-dimensional point-cloud whose statistical properties we investigate. We reveal patterns and distributions in the rank versus Weierstrass coefficients, as well as the Beta distribution of the BSD ratio of the quantities. Via gradient boosted trees, machine learning is applied in finding inter-correlation amongst the various quantities. We anticipate that our approach will spark further research on the statistical properties of large datasets in Number Theory and more in general in pure Mathematics.


page 21

page 28

page 31


Supersingular Ratio of Elliptic Curves

This paper starts with an overview of elliptic curves and then summarize...

Cluster Algebras: Network Science and Machine Learning

Cluster algebras have recently become an important player in mathematics...

Murmurations of elliptic curves

We investigate the average value of the pth Dirichlet coefficients of el...

Machine-Learning the Sato–Tate Conjecture

We apply some of the latest techniques from machine-learning to the arit...

Machine-Learning Dessins d'Enfants: Explorations via Modular and Seiberg-Witten Curves

We apply machine-learning to the study of dessins d'enfants. Specificall...

Machine-Learning Mathematical Structures

We review, for a general audience, a variety of recent experiments on ex...

Three principles of data science: predictability, computability, and stability (PCS)

We propose the predictability, computability, and stability (PCS) framew...

1 Introduction and Summary

Elliptic curves occupy a central stage in modern mathematics, their geometry and arithmetic providing endless insights into the most profound structures. The celebrated Conjecture of Birch and Swinnerton-Dyer [BSD] is the key result dictating the behaviour of over finite number fields and thereby, arithmetic. Despite decades of substantial progress, the proof of the conjecture remains elusive. To gain intuition, a highly explicit and computational programme had been pursued by Cremona [Cre], in cataloguing all elliptic curves up to isogeny and expressed in canonical form, to conductors into the hundreds of thousands.

Interestingly, a somewhat similar situation exists for the higher dimensional analogue of elliptic curves considered as Ricci-flat Kähler manifolds, viz., Calabi-Yau manifolds. Though Yau [Yau] settled the Calabi-Yau Conjecture [Ca], much remains unknown about the landscape of such manifolds, even over the complex numbers. For instance, even seemingly simple questions of whether there is a finite number of topological types of Calabi-Yau -folds for is not known – even though it is conjectured so. Nevertheless, because of the pivotal importance of Calabi-Yau manifolds to superstring theory, theoretical physicists have been constructing ever-expanding datasets thereof over the last few decades (cf. [HeBook] for a pedagogical introduction).

Given the recent successes in the science of “big data” and machine learning, it is natural to examine the database of Cremona [Cre2] using the latest techniques of Data Science. Indeed, such a perspective has been undertaken for Calabi-Yau manifolds and the landscape of compactifications in superstring theory in high-energy physics, ranging from machine-learning [He] to statistics [Doug, HJP]. Indeed, [He, KS, CHKN, Rue] brought about a host of new activities in machine-learning within string theory; moreover, [He, HeBook] and the subsequent work in [BHJM, ACHN, HL, JKP, HK, BCDL, Ashmore:2019wzb] introduced the curious possibility that machine-learning should be applied to at least stochastically avoid expensive algorithms in geometry and combinatorics and to raise new conjectures.

Can artificial intelligence help with understanding the syntax and semantics of mathematics? While such profound questions are better left to the insights of Turing and Voevodsky, the more humble question of using machine-learning to recognizing patterns which might have been missed by standard methods should be addressed more immediately. Preliminary experiments such as being able to “guess”- to over 99% accuracy and confidence - the ranks of cohomology groups without exact-sequence-chasing (having seen tens of thousands of examples of known bundle cohomologies)

[HeBook], or whether a finite group is simple without recourse to theorem of Noether and Sylow (having been trained on thousands of Cayley tables) [HK] already point to this potentiality.

In our present case of number theory, extreme care should, of course, be taken. Patterns in the primes can be notoriously deceptive, as exemplified by the likes of Skewe’s constant. Indeed, a sanity check to let neural networks predict the next prime number in

[He] yielded a reassuring null result. Nevertheless, one should not summarily disregard all experimentation in number theory: after all, the best neural network of the 19th century - the mind of Gauß - was able to pattern-spot to raise the profound Prime Number Theorem years before the discovery of complex analysis to allow its proof.

The purpose of this paper is to open up dialogue between data scientists and number theorists, as the aforementioned works have done for the machine-learning community with geometers and physicists, using Cremona’s elliptic curve database as a concrete arena. The organization is as follows. We begin with a rapid introduction in Section 2, bearing in mind of the diversity of readership, to elliptic curves in light of BSD. In Section 3, we summarize Cremona’s database and perform preliminary statistical analyses beyond simple frequency count. Subsequently, Section 4 is devoted to machine-learning various aspects of the BSD quantities and Section 5, to their topological data analyses and persistent homology. Finally, we conclude in Section 5 with the key results and discuss prospects for the future.

2 Elliptic Curves and BSD

Our starting point is the Weierstraß model


of an elliptic curve over , where and the coefficients . The discriminant and J-invariant of are obtained in a standard way as


where , , , and . Smoothness is ensured by the non-vanishing of and isomorphism (isogeny) between two elliptic curves, by the equality of .

An algorithm of Tate and Laska [Tat, Las] *** In particular, consider the transformation between the coefficiens and between two elliptic curves and :

for , then relating the points and as
yields and hence , and thus the isomorphism. can then be used to bring the first 3 coefficients to be , transforming (2.1) into a minimal Weierstraß model. Thus, for our purposes, an elliptic curve is specified by the pair of integers together with a triple . From the vast subject of elliptic curves, we will need this 5-tuple, together with arithmetic data to be presented in the ensuing.

2.1 Rudiments on the Arithmetic of

This subsection serves as essentially a lexicon for the quantities which we require, presented in brief, itemized form.

Conductor and Good/Bad Reduction:

The conductor is the product over all (finite many) number of primes - the primes of bad reduction - where reduced modulo becomes singular (where ). All other primes are called good reduction.

Rank and Torsion:

The set of rational points on has the structure of an Abelian group, . The non-negative integer is called the rank, its non-vanishing would signify an infinite number of rational points on . The group is called the torsion group and can be only one of 15 groups by Mazur’s celebrated theorem [Maz], viz., the cyclic group for and , as well as the direct product for .

L-Function and Conductor:

The Hasse-Weil Zeta-function of can be defined, given a finite field , as the generating functions (the local) and (the global):


Here, in the local zeta function , is the number of points of over the finite field and the product is taken over all primes to give the global zeta function .

The definition (2.3) is applicable to general varieties, and for elliptic curves, the global zeta function simplifies (cf. [Sil]) to a product of the Riemann zeta function and a so-called L-function as



In the above, counts the number of points of mod for primes of good reduction and depending on type of bad reduction. The positive integer which controls, via its factorization, these primes, is the conductor of .

Importantly, the L-function has analytic continuation [TW] to so that the variable

is not a merely dummy variable like

in the local generating function, but renders a complex analytic function.

Real Period:

The periods of a complex variety is usually defined to be the integral of some globally defined holomorphic differential over a basis in homology. Here, we are interested in the real period, defined as (using the minimal Weierstraß model)


over the set of real points of the elliptic curve.

Tamagawa Number:

Now, over any field is a group, thus in particular we can define over the -adic field for a given prime, as well as its subgroup of points which have good reduction. We define the index in the sense of groups


which is clearly equal to 1 for primes of good reduction since then . The Tamagawa number is defined to be the product over all primes of bad-reduction of , i.e.,

Canonical Height:

For a rational point , written in minimal fraction form with , and , we can define a naive height . Then, a canonical height can be defined as


where (-times) is the addition of under the group law of . This limit exists and renders the unique quadratic form on such that is bounded. An explicit expression of in terms of can be found, e.g., in Eq4 and Eq5 of [BGZ]. The canonical height defines a natural bilinear form


for two points and as always, is done via the group law.


Given the infinite (free Abelian) part of , viz., , let its generators be , then we can define the regulator


where the pairing is with the canonical height and defines an integer matrix. For , is taken to be 1 by convention.

Tate-Shafarevich Group:

Finally, one defines group cohomologies and between which there is a homomorphism (cf. e.g., [Mil]IV.2 for a detailed description). We can then define the Tate-Shafarevich Group of as the kernel of the homomorphism


This is the most mysterious part of the arithmetic of elliptic curves, it is conjectured to be a finite Abelian group. For ranks , this has been proven (cf. the survey of [RS]) but in general this is not known.

2.2 The Conjecture

With the above definitions, we can at least present the celebrated

Conjecture 1 (Birch–Swinnerton-Dyer (Weak Version))

The order of the zero of at is equal to the rank ,

That is, the Taylor series around 1 is with some complex coefficient .

In fact, a stronger version of the conjecture predicts precisely what the Taylor coefficient should be:

Conjecture 2 (Birch–Swinnerton-Dyer (Strong Version))

The Taylor coefficient of at is given in terms of the regulator , Tamagawa number , (analytic) order of the Tate-Shafarevich group , and the order of the torsion group . Specifically, let , then

3 Elliptic Curve Data

BSD arose from extensive computer experimentation, the earliest of its kind, by Birch and Swinnerton-Dyer. Continuing along this vein, Cremona [Cre] then compiled an impressive of list of 2,483,649 isomorphism classes of elliptic curves over and explicitly computed the relevant quantities introduced above. This is available freely online at [Cre2].

3.1 Cremona Database

The database of Cremona, on which the rest of the paper will focus, associates to each minimal Weierstraß model (given by the coefficients and ; generically, these last two coefficients have very large magnitude), the following:

  • the conductor , ranging from 1 to 400,000;

  • the rank , ranging from 0 to 4;

  • the torsion group , whose size ranges from 1 to 16;

  • the real period , a real number ranging from approximately to .

  • the Tamagawa number , ranging from 1 to 87040;

  • the order of the Tate-Shafarevich group (exactly when known, otherwise given numerically), ranging from 1 to 2500;

  • the regulator , ranging from approximately to .

A typical entry, corresponding to the curve (labelled as “314226b1” and with and which are readily computed from(2.2)) would be


3.2 Weierstraß Coefficients

Let us begin with a statistical analysis of the minimal Weierstraß coefficients themselves. It so happens that in the entire database, there are only 12 different sets of values of , we tally all of the curves in the following histogram, against rank and :

We see that most of the curves are of smaller rank, with only a single instance of . This is in line with the recent result of [BS] that most elliptic curves are rank 1; in fact, over 2/3 of elliptic curves obey the BSD conjecture [BSZ].

To give an idea of the size of the a-coefficients involved, the largest one involved in the database is


which is of rank 0.

Even though it is a conjecture that the rank can be arbitrarily large, the largest confirmed rank [Elk] so far known in the literature is 19, corresponding (the last term being a 72-digit integer!) to


This extraordinary result is clearly not in the database due to the large rank.

One of the first steps in data visualization is a

principle component analysis where features of the largest variations are extracted. The minimal Weierstraß model gives a natural way of doing so, since the coefficients take only 12 values and we can readily see scatter plot of . Now, due to the large variation in these coefficients, we define a signed natural logarithm for as


We present this scatter plot of in Fig. 1. Therein, we plot all the data points (i.e., for all different values of ) together, distinguishing rank by colour (rank 4 has only a single point as seen from the table above).

Figure 1: A scatter plot of for all 2,483,649 elliptic curves in the Cremona database. Different ranks are marked with different colours.

The first thing one would notice is the approximate cross-like symmetry, even within rank. However, this is not trivial because the transformation


is by no means a rank preserving map. For instance, a single change in sign in , could result in rank change from 1 to 3:


Examples of a similar nature abound. The next feature to notice is that the size of the cross shrinks as the rank increases. This is rather curious since the largest rank case of (3.3) has far larger coefficients. This symmetry is somewhat reminiscent of mirror symmetry for Calabi-Yau 3-folds, where every compact smooth such manifold with Hodge numbers has a mirror manifold with these values exchanged.

3.3 Distributions of and

Figure 2: The distributions of and .

(A) Probability mass of

(blue line) and (orange line). (B) Joint probability of and . The colour indicates the density of points within each bin (see colour-bar). Note that the figure axes are in symlog scale (linear scale between and , logarithmic scale for other values in the range). Hence, we can see also the density corresponding to and (cross in the middle). (C) Probability density distribution for the logarithm(in base 10) of (when and , filled line), the corresponding Beta distribution fit with parameters , and (dashed line), and the median value (dotted line).

Fortified by the above initial observations, let us continued with more refined study of the distribution of and . First, let us plot the distribution of each individually, normalized by the total, i.e., as probability mass functions. These are shown in part (A) to the left of Fig 2. Note that the horizontal axes (binning) is done logarithmically. We see that the distributions of both and are symmetric with respect to (Fig. 2-A), with spanning orders of magnitude smaller as compared to . This is just to give an idea of the balanced nature of Cremona’s data, that elliptic curves with and are all constructed.

Next, in part (B) of Fig 2, we plot the joint probability mass function of the pair with colour showing the frequency as indicated by the colour-bar to the right. We see that, as discussed in Fig. 1, there is a cross-like symmetry. Here, since we are not separating by rank, the symmetry is merely a reflection of the constructions of the dataset, that and are all present. What is less explicable is that it should be a cross shape and what is the meaning of the boundary curve beyond which there does not seem to be any minimal models. For reference, the central rectilinear cross indicates the cases of and respectively.

Finally, we compute the Euclidean distance

from the origin and study its probability distribution. This is shown in part (C) of Fig. 

2, We find that half of the data lies within a radius of from the origin. The logarithm of can be well fitted with a Beta probability distribution:


with parameters , and . Thus, whilst there are a number of coefficients of enormous magnitude, the majority still have relatively small ones.

Figure 3: Different distributions of and for different ranks. (A-D) Joint probability distribution of and for values of the rank (A), (B), (C), (D). (E) Probability density distribution for the logarithm(in base 10) of (when and , filled lines) for various value of the rank , the corresponding Beta distribution fit (dashed lines), and the corresponding median values (dotted lines).
Differences by Rank:

As with our initial observations, we now study the variation of with respect to the rank of the elliptic curves. First, in Fig. 3

parts A-D, we plot the joint distributions of

and for respectively. We can see that they differ signigicantly from each other, under permutation test [PJ] at confidence level .

Next, Fig. 3 E show the probability distribution functions for our Euclidean distance for the different ranks. We find that the median Euclidean distance from the center decreases for higher values of . In fact, we see that the median values of and increase with the rank (see Fig. 4

D-E which we will discuss shortly). Again, each is individually well-fitted by the Gamma distribution. In tables

9 and 10 in the Appendix, we show some statistics of and

including their mean, standard deviation, median, and the number of zero entries, for given rank

and values of , , .

3.4 Distributions of various BSD Quantities

Figure 4: Characteristics of the elliptic curves based on their rank. Boxplots of , , , , in parts (A)-(E) respectively; boxplots of , , , , and in parts (F)-(K) respectively, all for different values of the rank . The red line shows the median value, the boxes enclose of the distribution, and the whiskers, of the distribution.

Now, the coefficients are the inputs of the dataset, each specifying a minimal model of an elliptic curve, and to each should be associated the numerical tuple for the rank, the conductor, the regulator, the real period, the Tamagawa number, the order of the torsion group and the order of Tate-Shafarevich group, respectively. It is now expedient to examine the distribution of “output” parametres.

As always, we arrange everything by rank and in Fig. 4 show the box plots of the variation around the median (drawn in red). The boxes enclose of the distribution, and the whiskers, of the distribution. We see that, as detailed above, have only variation and has many orders of magnitude more in variation that . The conductor has a fairly tame distribution while the other BSD quantities vary rather wildly – the is part of the difficulty of the conjecture, the relevant quantities behave quite unpredictably.

The RHS of Conjecture 2:
Figure 5: The RHS of Conjecture 2.

(A) Probability density function of the RHS (filled blue line) and the corresponding Beta distribution fit (dashed black line). (B) Probability density function of the RHS for different values of the rank

(filled lines), and the corresponding best fits chosen with Akaike information criterion (dashed lines) (C) Boxplot showing the value of RHS for different values of . The red line shows the median value, the boxes to the of the distribution, and the whiskers to the of the distribution.

We now put all the quantities together according to the RHS of the Strong BSD Conjecture, which we recall to be . We test which statistical distribution best describe this data, by comparing 85 continuous distribution under the Akaike information criterion. We find that the distribution best describing the data is the Beta distribution (see Figure 5 A):


where is the standard gamma function, , , and the original variable has been re-scaled such that .

The selected distribution changes for elliptic with a specific rank . For , the selected distribution is the exponentiated Weibull, while for larger values of , the selected distribution is the Johnson SB (see Figure 5 B). We find that the median value of the increases both as a function of the rank (see Figure 5 C) and (see Figure 5 D).

4 Topological Data Analysis

Let us gain some further intuition by visualizing the data. As far as the data is concerned, to each elliptic curve, specified by the Weierstraß coefficients, one associates a list of quantities, the conductor, the rank, the real-period, etc. The define a point cloud in Euclidean space of rather high dimension, each of point of which is defined by the list of these quantities which enter BSD. In the above, we have extracted the last two coefficients of the Weierstraß form of the elliptic curves and studied them against the variations of the first three which, in normal form, can only be one of the 9 possible 3-tuples of and 0. The normal form thus conveniently allows us to at least “see” the Weierstraß coefficients because we have projected to two dimensions. However, the full list of the relevant quantities for the elliptic curve has quite a number of entries and cannot be visualized directly.

Luckily, there is precisely a recently developed method in data science which allows for the visualization of “high dimensionality”, viz., persistent homolgy in topological data analysis [CZCG] (cf. an excellent introductory survey of [OPTGH]). In brief, one creates a Vietoris-Rips simplex from the data points in Euclidean space, with a notion of neighbourhood (by Euclidean distance). The Betti numbers of the simplex is then computed as one varies , whether the values are non-zero for each gives a notion of whether non-trivial topology (such as holes) persists for different scales . The result is a so-called barcode for the data. In the ensuing, we will use G. Henselman’s nice implementation of the standard methods in topological data analysis, the package Eirene for Julia/Python [Ei].

Figure 6: The barcodes for (a) all the Weierstraß coefficients at Betti numbers 0 and 1; (b) on the (principle component) coefficients .

4.1 The Weierstraß Coefficients

We begin by studying the barcodes for the full set of five Weierstraß coefficients (as always, we take for and due to their size). Of course, computing the homology of the Vietoris-Rips complex for over 2 million points is computationally impossible. The standard method is to consider random samples. Moreover, the first two Betti numbers and are usually sufficiently illustrative. Thus, we will take 1000 random samples (we do not separate the ranks since we find there is no significant difference for the barcodes amongst different ranks) of the coefficients (with the usual for the last two). The barcodes are shown in part (a) of Figure 6. For reference, we also plot the barcodes for the pair only since do not vary so much.

4.2 A 6-dimensional Point-Cloud

Let us now try to visualize the relevant BSD quantities together. Organized by the ranks which dominate the data by far, the 6-tuple


naturally form three point-clouds in . Due to the high dimensionality, we sample 100 random points for each of the values and compute the full barcodes . It turns out that the main visible features are in dimension 0. We present these in Figure 7 and observe that indeed there is some variation in the barcode amongst the different ranks.

Figure 7: The barcodes for 6-dimensional point-cloud , with 100 random samples, for .

4.3 Conductor Divisibility

The factors of the conductor is of defining importance in the L-function, which appear to the LHS of BSD, meanwhile, the RHS is governed, in the strong case, by the combination . It is therefore expedient to consider the point cloud of the 3-tuple organized by divisibility properties of . For instance, one could contrast the barcodes for the triple for even versus odd. Again, the features are prominent for dimension 0 and the barcodes are shown in Figure 8.

even odd
Figure 8: The barcodes for 3-dimensional point-cloud , with 200 random samples for even/odd .

Simiarly, we could group by modulo 3, as shown in Figure 9.

Figure 9: The barcodes at dimension 0 for 3-dimensional point-cloud , with 100 random samples, for distinguished modulo 3.

5 Machine Learning

In [He, HeBook]

, a paradigm was proposed to using machine-learning, and in particular deep neural networks to help computations in various problems in algebraic geometry. Exemplified by computing cohomology of vector bundles, it was shown that to very high precision, AI can guess the correct answer without using the standard method of Gröbner basis construction and chasing long exact sequences, both of which are computationally intensive. Likewise,


showed that machine-learning can identify algebraic structures such as distinguishing simple from non-simple finite groups. At over 99% precision, the requisite answers can be estimated without recourse to standard computations which are many orders of magnitude slower.

It is therefore natural to wonder whether the elliptic curve data can be “machine-learned”. Of course, we need to be careful. While computational algebraic geometry over hinged on finding kernels and cokernels of integer matrices, a task in which AI excels. Problems in number theory are much less controlled. Indeed, trying to predict prime numbers [He] seems like a hopeless task, as mentioned in the introduction. Nevertheless, let us see how far we can go with our present dataset for BSD.

5.1 Predicting from the Weierstraß Coefficients

Figure 10: Correlation matrix for the quantities characterizing ellipses. Colours are assigned based on the value of the Pearson correlation coefficient between all pairs of quantities. The strength of the correlation is also reported for each pair.

We begin with quantifying the performance of machine learning models in predicting, one by one, the quantities , , , , , , , together with the RHS of the Conjecture 2, given the Weierstraß coefficients , , , , alone. Straight-away, this is expected to be a difficult, if not impossible task (as impossible as, perhaps, the prediction of prime numbers). This is confirmed by the correlation between the Weierstraßcoefficients and the BSD quantities: from the correlation matrix, we see that the relationship is indeed weak (cf. Fig. 10), implying this is not a straightforward prediction task. Here, the correlation matrix contains the values of the Spearman correlation [Pear], that is the Pearson correlation [Pear] between the rank variables associated to each pairs of the original variables.


NMAE (XGBoost)

NMAE (Dummy) NMAE (Linear)
RHS Conj. 2
RMSE (XGBoost) RMSE (Dummy) RMSE (Linear)
RHS Conj. 2
Table 1: Performance of the regression models. The Normalized Median Absolute Error () and the Root Mean Squared Error (

), for XGboost (left column), the dummy regressor (central column) and a linear regression (right column). The reported values are averages across

-fold cross-validations, with the corresponding standard deviations.

Our analysis relies on gradient boosted trees [BST], using the implementation of XGboost [XGBoost]

, an open-source scalable machine learning system for tree boosting used in a number of winning Kaggle solutions (17/29 in 2015). In Appendix


, we present the similar results using a support vector machine, another highly popular machine-learning model, and see that the XGboost indeed performs better. Furthermore, based on the learning curves of the XGboost models (discussed in Appendix

A), we have chosen a fold cross-validation, such that the training set includes of the values, and the validation set the remaining .

5.1.1 Numerical Quantities

First, we train regression models to predict the values of , , , and . We evaluate the performance of the regression, by computing the normalized median absolute error:


where are the observed values and are the predicted values, and the rooted mean squared error:


where is the size of the test set. We desire that both NMAE and RMSE to be close to 0 for a good prediction.

We compare the result of the XGBoost regression with two baselines: (1) a linear regression model and (2) a dummy regressor, that always predicts the mean of the training set. We find that, in all cases, the machine learning algorithms perform significantly better than the baseline models (see Table 1) with respect to the NMAE and RMSE. However, XGboost performs only marginally better than the baselines in predicting the value of . We report also the so-called  Imimportance of the features for the XGBoost regressor in Fig. 15

. Here, importance indicates how useful each feature is in the construction of the boosted decision trees, and is calculated as the average importance across trees. For a single tree, the importance of a feature is computed as the relative increase in performance resulting from the tree splits based on that given feature


Figure 11: Feature importance. Feature importance of the XGBoost regression models for predicting (A), (B), (C), (D) and (E).

We see that overall our measures, NMAE and RMSE, are not too close to 0, except perhaps , for which even a simple linear regression does fairly well. In table 2, we report the values of the coefficients of the linear regression fit for .

coef std err t P>|t| [0.025 0.975]
const 1.5946 0.004 380.761 0.000 1.586 1.603
0.0658 0.005 14.524 0.000 0.057 0.075
-0.0065 0.004 -1.543 0.123 -0.015 0.002
-0.0518 0.005 -11.473 0.000 -0.061 -0.043
-0.6320 0.006 -110.282 0.000 -0.643 -0.621
0.4877 0.006 85.112 0.000 0.477 0.499
Table 2: Prediction of . Coefficients of the linear regression for each of the features (inputs), with the associated standard deviation, the value of the statistics and corresponding

-value. Here, we can reject the null hypothesis that the coefficients are equal to

at significance level , for all coefficients, except the one associated to ().

What Table 2 means is that .

Predicted variable R-squared Adj. R-squared F-statistic Prob (F-statistic)
0 0 95.67 3.86e-101
0.006 0.006 2777 0
0.001 0.001 387.2 0
0.005 0.005 2522 0
0.005 0.005 2377 0
RHS Conj. 2 0.012 0.012 6143 0
Table 3: Statistics of the linear regression models. We report the R squared, the adjusted R squared, the F statistics, and the p-value associated to the F statistics for the different linear models. When the p-value of the F-statistics is close to , we can reject the null hypothesis that the intercept-only model provides a better fit than the linear model.

Likewise, in table 3, we report the the statistics associated to the linear regression models for the prediction of the various quantities, where only the Weierstrass coefficients are used as features (inputs). The low R-squared values indicated that the regression is not so good in terms of the coefficients alone.

5.1.2 Categorical Quantities

Next, we train classifiers to predict the values of

and because these easily fall into discrete categories: the rank and the torsion group size can only be one of 16 integer values due to Mazur’s theorem. Again, we use a cross validation, and we evaluate the performance of the classifier, by computing the score:


where we have, in the predicted versus actual, the true positives (TP), false positives (FP), and false negatives (FN). Since we have several possible values for the rank , we compute both , by counting the total TP, FN and FP, as well as , the average value of computed across ranks.

In addition, we also compute the Matthew correlation coefficient [Matthew]

, to describe the confusion matrix:


Both F1-scaore and MCC are desired to be close to 1 for a good prediction.

Quantity (XGBoost) (Dummy) (Logistic)
(XGBoost) (Dummy) (Logistic)
(XGBoost) (Dummy) (Logistic)
Table 4: Performance of the classification models. The scores , , and the Matthew correlation coefficient

, for XGboost (left column), the dummy regressor (central column) and a logistic regression (right column). The reported values are averages across

-fold cross-validations, with the corresponding standard deviations.

For checks, we compare the XGBoost classifier with (1) a baseline classifier, that predicts always the predominant class in the training set, as well as with (2) a logistic regression. We find that XGboost performs better than the baseline models for predicting , but the performance of the prediction of is comparable to the baselines (see Table 4).

Figure 12: Prediction of . (A) Importance of the different features (inputs) to predict . (B) Confusion matrix (normalized by column) showing the fraction of entries with given . (C) Difference between the confusion matrix obtained for the XGBoost and the dummy classifier. Results are averaged over a -fold cross validation.
Figure 13: Prediction of . (A) Importance of the different features to predict . (B) Confusion matrix (normalized by column) showing the fraction of entries with rank with given . (C) Difference between the confusion matrix obtained for the XGBoost and the dummy classifier. Results are averaged over a -fold cross validation.

Analyzing the confusion matrices (see Figures 12 and 13), it appears clear that it is very hard to predict and from the Weierstraß coefficients alone. For both the prediction of and , the most important predictor is (Figures 12 and 13). This is the feature that contributed the most to increase the performance of the boosted tree [XGBoost].

5.2 Mixed Predictions

While the results in the previous subsection may seem disappointing in general, they do present a good sanity check: to obtain all the BSD quantities from the elliptic curve data in some straight-forward way would be an almost unimaginable feat in number theory. Nevertheless, in this section, let us build machine learning models to predict the values of , , , , , and among themselves, i.e., we consider as features (inputs) all the quantities characterizing the elliptic curves (except the predicted quantity), rather than the Weierstraß coefficients alone.

Quantity NMAE (XGBoost) NMAE (Dummy) NMAE (Linear)
Table 5: Performance of the regression models considering as features all the quantities characterizing an ellipses. The normalized median absolute error , for XGboost (left column), the dummy regressor (central column) and a linear regression (right column). The reported values are averages across -fold cross-validations, with the corresponding standard deviations. Results are considerably improved compared to table 1.
Figure 14: True vs Predicted values Results are shown for all the quantities, using XGBoost (left column) and the Linear model (right column).
Figure 15: Feature importance considering as features all the quantities characterizing an elliptic curve. Feature importance of the XGBoost regression models for predicting (Part A), (Part B), (Part C), (Part D) and (Part E).

We present the results in Table 5 of the accuracy measure NMAE by the 3 methods as in subsection 5.1: machine-learning by XGBoost, dummy regression and linear regression. To read the table, of each of the 5 quantities given in a row, we use the other 4 to train the ML in order to predict this row. We see that this is a significant improvement over the previous subsection and shows that, especially the XGBoost, the machine-learning can very confidently predict the Tamagawa number, the regulator and . The feature importance is shown in Fig. 15 and in table 6, we report the the statistics associated to the mixed predictions linear regression models. A comparison between the predictions of the linear models compared to XGboost is presented in Figure 14.

Predicted variable R-squared Adj. R-squared F-statistic Prob (F-statistic)
0.038 0.038 8999 0

0.054 0.054 12960 0

0.042 0.042 9889 0

0.017 0.017 3891 0

0.114 0.114 29110 0

Table 6: Statistics of the linear regression models (mixed predictions). We report the R squared, the adjusted R squared, the F statistics, and the p-value associated to the F statistics for the mixed predictions linear models.
Quantity (XGBoost) (Dummy) (Logistic)
(XGBoost) (Dummy) (Logistic)
(XGBoost) (Dummy) (Logistic)
Table 7: Performance of the classification models considering as features all the quantities characterizing an elliptic curve. The scores , and the Matthew correlation coefficient , for XGboost (left column), the dummy regressor (central column) and a logistic regression (right column). The reported values are averages across -fold cross-validations, with the corresponding standard deviations. Results are considerably improved compared to table 4.

Finally, let us use all quantities: the coefficients as well as , , , , to predict and . The accuracy measure and (which should be close to 1 ideally) are presented in Table 7 and the feature importance, in Figures 16 and 17. We see that these are considerably improved compared to those obtained in section 5.1 (cf. Tables 1 and 4). This is somehow to be expected in light of the correlations observed in Figure 10. In fact, even logistic regressions behave fairly well, with the , , and the scores subtiantially larger than those obtained by the dummy classifiers (see Table 7

). For reference, we include report the hyperplane equations of the linear regression models to see how each of the quantities can be fitted by all the others:

Figure 16: Prediction of considering as features all the quantities characterizing an ellipses. (A) Importance of the different features to predict . (B) Confusion matrix (normalized by column) showing the fraction of entries with rank with given . (C) Difference between the confusion matrix obtained for the XGBoost and the dummy classifier. Results are averaged over a -fold cross validation.
Figure 17: Prediction of considering as features all the quantities characterizing an elliptic curve. (A) Importance of the different features to predict . (B) Confusion matrix (normalized by column) showing the fraction of entries with value and given . (C) Difference between the confusion matrix obtained for the XGBoost and the dummy classifier. Results are averaged over a -fold cross validation.

6 Conclusions and Prospects

In this paper, we initiated the study of the data science of the arithmetic of elliptic curves in light of the Birch-Swinnerton-Dyer Conjecture. This is inspired by the the recent advances in the statistical investigation of Calabi-Yau manifolds, especially in the context of super-string theory [HJP, ACHN], as well as in the paradigm of machine-learning structures in geometry [He, HeBook] and algebra [HK]. Whilst we are still within the landscape of "Calabi-Yau-ness", it is expected that patterns in number theory should be much more subtle than those in geometry over

and in combinatorics. Nevertheless, BSD, residing at the crux between arithmetic and analysis, might be more susceptible to machine-learning and to pattern-recognition.

From our preliminary examinations on the extensive database of Cremona [Cre, Cre2] we have already found several interesting features. First, we find that in the minimal Weierstraß representation, where a pair of coefficients clearly constitutes the principle component of the data, the distribution thereof follows a curious cross-like symmetry across rank, as shown in Figures 1 and 2. This is a highly-non-trivial symmetry since does not preserve rank. This symmetry is reminiscent of mirror symmetry for Calabi-Yau threefolds. In addition, the absence of data-points beyond the boundaries of the cross is also of note, much like that Hodge plot for the Calabi-Yau threefolds.

Over all, the distribution of the Euclidean distance of to the origin, as well as that of the RHS of the Strong BSD, viz., the quantity (cf. conjectures in Sec. 2.2), are best described by a Beta-distribution, which is selected in both cases among 85 continuous distributions using the Akaike Information Criterion. Organized by rank, these distributions also vary.

One further visualize the data, the tuples consisting of the coefficients as well as the BSD tuple for ranks using the standard techniques from topological data analysis. The bar-codes are shown in Figures 6 and 7. While the Weierstraßcoefficients show little variation over rank, the BSD tuple does show differences over . Moreover, as expected, the divisibility of the conductor influences the barcodes.

Finally, emboldened by the recent success in using machine-learning to computing bundle cohomology on algebraic varieties without recourse to sequence-chasing and Gr’́obner bases as well as recognizing whether a finite group is simple directly by looking at the Cayley table, we asked the question of whether one can “predict” quantities otherwise difficult to compute directly from “looking” at the shape of the elliptic curve. Ideally, one would have hoped that training on , one could predict any of the BSD quantities to high precision, as was in the cohomology case. However, due to the very high variation in the size of , one could not find a good machine-learning technique, decision trees, support-vector machines or neural networks, which seems to achieve this. This is rather analogous to the (expected) failure of predicting prime numbers using AI. Nevertheless, the BSD quantities, when mixed with the Weierstraß coefficients, does behave well under machine-learning. For instance, the Matthew correlation coefficient between predicted and true values of and is .

At some level, the experiments here, in conjunction with those in [He, KS, Rue, CHKN, BHJM, JKP, HK, BCDL, Ashmore:2019wzb], tend to show a certain hierarchy of difficulty in how machine-learning responds to problems in mathematics. Understandably, number theory is the most difficult: as a reprobate, [He] checked that trying to predict the next prime number, for instance, seems unfeasible for simple neural networks. On the other hand, algebraic geometry over the complex numbers seems to present a host of amenable questions such as bundle cohomology, recognition of elliptic fibrations or calculating Betti numbers. In between lie algebra and combinatorics, such as the structure of finite groups, where precision/confidence of the cross-validation is somewhat intermediate. It is therefore curious that in this paper one sees that a problem such as BSD, which resides in between arithmetic and geometry, is better behaved under machine-learning than a direct attack on patterns in primes.


YHH would like to thank the Science and Technology Facilities Council, UK, for grant ST/J00037X/1, Nankai University, China for a chair professorship and Merton College, Oxford, for enduring support.

Appendix A Learning Curves

To prevent overfitting, we compute the learning curves of the regression (figure 18) and classification (figure 19) models in section 5.1. We find that of the data is a good choice for the training set size, suggesting a 5-fold cross validation.

Figure 18: Learning curves for the regression models. The Normalized Median Absolute Error as a function of the train set size of the XGBoost regression models for predicting (A), (B), (C), (D) and (E). The shaded areas correspond to standard deviation across a cross validation.
Figure 19: Learning curves for the classification models. The Matthew coefficient as a function of the train set size of the XGBoost regression models for predicting (A) and (B). The shaded areas correspond to standard deviation across a cross validation.

Appendix B Comparison with SVM

In this section, we compare the performance of the XGBoost models with Support Vector Machine (SVM) models. SVM models are very long to train, hence we focus, for this task, on a subset of examples. In table 8 we report the performance of the regression models used to predict , , , and . Only in the case of , the SVM model performs better than XGBoost.

Quantity NMAE (XGBoost) NMAE (Dummy) NMAE (SVM)