1 Introduction and Summary
Elliptic curves occupy a central stage in modern mathematics, their geometry and arithmetic providing endless insights into the most profound structures. The celebrated conjecture of Birch and Swinnerton-Dyer [BSD] is the key statement dictating the behaviour of elliptic curves over number fields and thereby, their arithmetic. Despite decades of substantial progress, a proof of the conjecture remains elusive. To gain intuition, a highly explicit and computational programme has been pursued by Cremona [Cre], cataloguing all elliptic curves up to isogeny, expressed in canonical form, for conductors into the hundreds of thousands.
Interestingly, a somewhat similar situation exists for a higher-dimensional analogue of elliptic curves considered as Ricci-flat Kähler manifolds, viz., Calabi–Yau manifolds. Though Yau [Yau] settled the Calabi Conjecture [Ca], much remains unknown about the landscape of such manifolds, even over the complex numbers. For instance, even the seemingly simple question of whether there is a finite number of topological types of Calabi–Yau $n$-folds for $n \ge 3$ is not settled, even though finiteness is conjectured. Nevertheless, because of the pivotal importance of Calabi–Yau manifolds to superstring theory, theoretical physicists have been constructing ever-expanding datasets thereof over the last few decades (cf. [HeBook] for a pedagogical introduction).
Given the recent successes in the science of "big data" and machine learning, it is natural to examine the database of Cremona [Cre2] using the latest techniques of data science. Such a perspective has already been undertaken for Calabi–Yau manifolds and the landscape of compactifications in superstring theory, ranging from machine-learning [He] to statistics [Doug, HJP]. Indeed, [He, KS, CHKN, Rue] brought about a host of new activities in machine-learning within string theory; moreover, [He, HeBook] and the subsequent work in [BHJM, ACHN, HL, JKP, HK, BCDL, Ashmore:2019wzb] introduced the curious possibility that machine-learning could be applied, at least stochastically, to avoid expensive algorithms in geometry and combinatorics and to raise new conjectures.
Can artificial intelligence help with understanding the syntax and semantics of mathematics? While such profound questions are better left to the insights of Turing and Voevodsky, the humbler question of using machine-learning to recognize patterns which might have been missed by standard methods can be addressed more immediately. Preliminary experiments, such as being able to "guess" (to over 99% accuracy and confidence) the ranks of cohomology groups without exact-sequence-chasing, having seen tens of thousands of examples of known bundle cohomologies [HeBook], or whether a finite group is simple without recourse to the theorems of Noether and Sylow, having been trained on thousands of Cayley tables [HK], already point to this potentiality. In our present case of number theory, extreme care should, of course, be taken. Patterns in the primes can be notoriously deceptive, as exemplified by the likes of Skewes's constant. Indeed, a sanity check of letting neural networks predict the next prime number in [He] yielded a reassuring null result. Nevertheless, one should not summarily disregard all experimentation in number theory: after all, the best neural network of the 19th century, the mind of Gauß, was able to pattern-spot and raise the profound Prime Number Theorem long before the development of the complex analysis needed for its proof.
The purpose of this paper is to open up a dialogue between data scientists and number theorists, as the aforementioned works have done for the machine-learning community with geometers and physicists, using Cremona's elliptic curve database as a concrete arena. The organization is as follows. We begin with a rapid introduction to elliptic curves in light of BSD in Section 2, bearing in mind the diversity of readership. In Section 3, we summarize Cremona's database and perform preliminary statistical analyses beyond a simple frequency count. Subsequently, Section 4 is devoted to topological data analysis and persistent homology, and Section 5 to machine-learning various aspects of the BSD quantities. Finally, we conclude in Section 6 with the key results and prospects for the future.
2 Elliptic Curves and BSD
Our starting point is the Weierstraß model
(2.1) $y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$
of an elliptic curve $E$ over $\mathbb{Q}$, where the coefficients $a_1, a_2, a_3, a_4, a_6 \in \mathbb{Z}$. The discriminant and $j$-invariant of $E$ are obtained in the standard way as
(2.2) $\Delta = -b_2^2 b_8 - 8 b_4^3 - 27 b_6^2 + 9 b_2 b_4 b_6, \qquad j = \frac{c_4^3}{\Delta},$
where $b_2 = a_1^2 + 4 a_2$, $b_4 = 2 a_4 + a_1 a_3$, $b_6 = a_3^2 + 4 a_6$, $b_8 = a_1^2 a_6 + 4 a_2 a_6 - a_1 a_3 a_4 + a_2 a_3^2 - a_4^2$, and $c_4 = b_2^2 - 24 b_4$. Smoothness is ensured by the non-vanishing of $\Delta$, and isomorphism (over $\overline{\mathbb{Q}}$) between two elliptic curves by the equality of $j$.
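These standard formulas are easy to check in code; a minimal Python sketch computing $\Delta$ and $j$ from the $a$-invariants (the test curve $y^2 + y = x^3 - x^2$ is the well-known curve 11a3, with $\Delta = -11$ and $j = -4096/11$):

```python
from fractions import Fraction

def disc_and_j(a1, a2, a3, a4, a6):
    """Discriminant and j-invariant from the a-invariants via the standard b- and c-quantities."""
    b2 = a1**2 + 4*a2
    b4 = 2*a4 + a1*a3
    b6 = a3**2 + 4*a6
    b8 = a1**2*a6 + 4*a2*a6 - a1*a3*a4 + a2*a3**2 - a4**2
    c4 = b2**2 - 24*b4
    disc = -b2**2*b8 - 8*b4**3 - 27*b6**2 + 9*b2*b4*b6
    return disc, Fraction(c4**3, disc)  # j = c4^3 / Delta, kept exact

# y^2 + y = x^3 - x^2 (curve 11a3): Delta = -11, j = -4096/11
print(disc_and_j(0, -1, 1, 0, 0))
```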
An algorithm of Tate and Laska [Tat, Las] brings any integral model to a unique minimal form, the minimal Weierstraß model, in which $|\Delta|$ is minimized over all integral models related by the admissible coordinate transformations between two elliptic curves.
2.1 Rudiments on the Arithmetic of $E$
This subsection serves as essentially a lexicon for the quantities which we require, presented in brief, itemized form.
 Conductor and Good/Bad Reduction:

The conductor $N$ of $E$ is a product of prime powers $p^{f_p}$, with $f_p \ge 1$, over the (finitely many) primes of bad reduction, i.e., the primes $p$ at which $E$ reduced modulo $p$ becomes singular (where $p \mid \Delta$). All other primes are called primes of good reduction.
 Rank and Torsion:

The set of rational points on $E$ has the structure of a finitely generated Abelian group, $E(\mathbb{Q}) \simeq \mathbb{Z}^r \times T$. The non-negative integer $r$ is called the rank; its non-vanishing would signify an infinite number of rational points on $E$. The finite group $T$ is called the torsion group and can be only one of 15 groups by Mazur's celebrated theorem [Maz], viz., the cyclic groups $\mathbb{Z}/n\mathbb{Z}$ for $n = 1, \ldots, 10$ and $n = 12$, as well as the direct products $\mathbb{Z}/2\mathbb{Z} \times \mathbb{Z}/2m\mathbb{Z}$ for $m = 1, \ldots, 4$.
 L-Function and Conductor:

The Hasse–Weil zeta-function of $E$ can be defined, given a finite field $\mathbb{F}_p$, via the generating functions (the local $Z$ and the global $\zeta_E$):
(2.3) $Z(E/\mathbb{F}_p;\, T) = \exp\Big( \sum_{n \ge 1} N_n \frac{T^n}{n} \Big), \qquad \zeta_E(s) = \prod_p Z(E/\mathbb{F}_p;\, p^{-s}).$
Here, in the local zeta function $Z$, $N_n$ is the number of points of $E$ over the finite field $\mathbb{F}_{p^n}$, and the product is taken over all primes $p$ to give the global zeta function $\zeta_E(s)$.
The definition (2.3) is applicable to general varieties, and for elliptic curves the global zeta function simplifies (cf. [Sil]) to a product of Riemann zeta functions and a so-called $L$-function as
(2.4) $\zeta_E(s) = \frac{\zeta(s)\,\zeta(s-1)}{L(E, s)}, \quad \text{where} \quad L(E, s) = \prod_{p \nmid N} \big(1 - a_p p^{-s} + p^{1-2s}\big)^{-1} \prod_{p \mid N} \big(1 - a_p p^{-s}\big)^{-1}.$
In the above, $a_p = p + 1 - \#E(\mathbb{F}_p)$ counts the number of points of $E$ mod $p$ for primes of good reduction, while $a_p \in \{0, \pm 1\}$ depending on the type of bad reduction. The positive integer $N$ which controls, via its factorization, these primes is the conductor of $E$.
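For a prime $p$ of good reduction, $a_p = p + 1 - \#E(\mathbb{F}_p)$ can be computed by brute-force point counting; a minimal sketch (the example is the curve 11a1, $y^2 + y = x^3 - x^2 - 10x - 20$, whose first few $a_p$ are well known):

```python
def count_points(p, a1, a2, a3, a4, a6):
    """#E(F_p): affine solutions of the Weierstrass equation mod p, plus the point at infinity."""
    n = 1  # the point at infinity
    for x in range(p):
        for y in range(p):
            if (y*y + a1*x*y + a3*y - (x**3 + a2*x*x + a4*x + a6)) % p == 0:
                n += 1
    return n

def a_p(p, coeffs):
    """a_p = p + 1 - #E(F_p), for p of good reduction."""
    return p + 1 - count_points(p, *coeffs)

# y^2 + y = x^3 - x^2 - 10x - 20 (curve 11a1)
print([a_p(p, (0, -1, 1, -10, -20)) for p in (2, 3, 5, 7)])  # -> [-2, -1, 1, -2]
```

Of course, this quadratic-time scan is only illustrative; serious computations use Schoof-type algorithms.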
Importantly, the $L$-function has analytic continuation [TW] to the whole of $\mathbb{C}$, so that $s$ is not merely a dummy variable like $T$ in the local generating function, but renders $L(E, s)$ a complex analytic function.
 Real Period:

The periods of a complex variety are usually defined as integrals of a globally defined holomorphic differential over a basis in homology. Here, we are interested in the real period, defined (using the minimal Weierstraß model) as
(2.5) $\Omega = \int_{E(\mathbb{R})} \frac{dx}{|2y + a_1 x + a_3|},$
over the set of real points of the elliptic curve.
 Tamagawa Number:

Now, $E$ over any field is a group; in particular we can consider $E(\mathbb{Q}_p)$ over the $p$-adic field for a given prime $p$, as well as its subgroup $E^0(\mathbb{Q}_p)$ of points which have good reduction. We define the index, in the sense of groups,
(2.6) $c_p = [E(\mathbb{Q}_p) : E^0(\mathbb{Q}_p)],$
which is clearly equal to 1 for primes of good reduction, since then $E(\mathbb{Q}_p) = E^0(\mathbb{Q}_p)$. The Tamagawa number is defined to be the product of $c_p$ over all primes of bad reduction, i.e.,
(2.7) $C = \prod_{p \mid N} c_p.$
 Canonical Height:

For a rational point $P = (x, y) \in E(\mathbb{Q})$, write $x = p/q$ in minimal fraction form with $\gcd(p, q) = 1$; we can then define a naive height $h(P) = \log \max(|p|, |q|)$. A canonical height can then be defined as
(2.8) $\hat{h}(P) = \lim_{n \to \infty} 4^{-n}\, h(2^n P),$
where $2^n P$ is the addition of $P$ to itself ($2^n$ times) under the group law of $E$. This limit exists and renders $\hat{h}$ the unique quadratic form on $E(\mathbb{Q})$ such that $\hat{h} - h$ is bounded. An explicit expression for $\hat{h}$ can be found, e.g., in Eqs. (4) and (5) of [BGZ]. The canonical height defines a natural bilinear form
(2.9) $\langle P, Q \rangle = \tfrac{1}{2}\big(\hat{h}(P + Q) - \hat{h}(P) - \hat{h}(Q)\big)$
for two points $P$ and $Q$, where, as always, $P + Q$ is done via the group law.
 Regulator:

Given the infinite (free Abelian) part of $E(\mathbb{Q})$, viz., $\mathbb{Z}^r$, let its generators be $P_1, \ldots, P_r$; then we can define the regulator
(2.10) $R = \det\big( \langle P_i, P_j \rangle \big)_{1 \le i, j \le r},$
where the pairing is with the canonical height and defines an $r \times r$ real symmetric matrix. For $r = 0$, $R$ is taken to be 1 by convention.
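Given the height-pairing matrix, the regulator is just its determinant; a minimal Python sketch (the rank-2 Gram matrix below is purely hypothetical, chosen only for illustration):

```python
def regulator(gram):
    """Determinant of the height-pairing matrix <P_i, P_j> (Laplace expansion; fine for small rank)."""
    n = len(gram)
    if n == 0:
        return 1.0  # convention for rank 0
    if n == 1:
        return gram[0][0]
    det = 0.0
    for j in range(n):
        minor = [row[:j] + row[j+1:] for row in gram[1:]]
        det += (-1)**j * gram[0][j] * regulator(minor)
    return det

# hypothetical rank-2 height-pairing matrix
print(regulator([[1.2, 0.3], [0.3, 2.0]]))  # 1.2*2.0 - 0.3*0.3 = 2.31
```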
 Tate–Shafarevich Group:

Finally, one defines the Galois cohomology groups $H^1(\mathbb{Q}, E)$ and $\prod_v H^1(\mathbb{Q}_v, E)$, between which there is a natural homomorphism (cf., e.g., [Mil] IV.2 for a detailed description). We can then define the Tate–Shafarevich group of $E$ as the kernel of this homomorphism,
(2.11) $Ш(E) = \ker\Big( H^1(\mathbb{Q}, E) \longrightarrow \prod_v H^1(\mathbb{Q}_v, E) \Big).$
This is the most mysterious part of the arithmetic of elliptic curves; it is conjectured to be a finite Abelian group. For (analytic) rank $r \le 1$ this has been proven (cf. the survey [RS]), but in general it is not known.
2.2 The Conjecture
With the above definitions, we can at least present the celebrated
Conjecture 1 (Birch–Swinnerton-Dyer (Weak Version))
The order of the zero of $L(E, s)$ at $s = 1$ is equal to the rank $r$ of $E(\mathbb{Q})$.
That is, the Taylor series around $s = 1$ is $L(E, s) = c\,(s-1)^r + O\big((s-1)^{r+1}\big)$ with some non-zero complex coefficient $c$.
In fact, a stronger version of the conjecture predicts precisely what the Taylor coefficient should be:
Conjecture 2 (Birch–Swinnerton-Dyer (Strong Version))
The Taylor coefficient of $L(E, s)$ at $s = 1$ is given in terms of the regulator $R$, the real period $\Omega$, the Tamagawa number $C$, the (analytic) order of the Tate–Shafarevich group $|Ш|$, and the order of the torsion group. Specifically, let $T = E(\mathbb{Q})_{\mathrm{tors}}$; then
$\lim_{s \to 1} \frac{L(E, s)}{(s-1)^r} = \frac{\Omega \cdot R \cdot |Ш| \cdot C}{|T|^2}.$
3 Elliptic Curve Data
BSD arose from extensive computer experimentation, the earliest of its kind, by Birch and Swinnerton-Dyer. Continuing in this vein, Cremona [Cre] compiled an impressive list of 2,483,649 isomorphism classes of elliptic curves over $\mathbb{Q}$ and explicitly computed the relevant quantities introduced above. This is available freely online at [Cre2].
3.1 Cremona Database
The database of Cremona, on which the rest of the paper will focus, associates to each minimal Weierstraß model (given by the coefficients $a_1, a_2, a_3$ and $a_4, a_6$; generically, these last two coefficients have very large magnitude) the following:

the conductor , ranging from 1 to 400,000;

the rank , ranging from 0 to 4;

the torsion group , whose size ranges from 1 to 16;

the real period , a real number ranging from approximately to .

the Tamagawa number , ranging from 1 to 87040;

the order of the TateShafarevich group (exactly when known, otherwise given numerically), ranging from 1 to 2500;

the regulator , ranging from approximately to .
A typical entry, corresponding to the curve labelled "314226b1" (with $\Delta$ and $j$ readily computed from (2.2)), would be
(3.1) 
3.2 Weierstraß Coefficients
Let us begin with a statistical analysis of the minimal Weierstraß coefficients themselves. It so happens that in the entire database there are only 12 different sets of values of $(a_1, a_2, a_3)$; we tally all of the curves in the following histogram, against the rank $r$ and these values:
We see that most of the curves are of small rank, with only a single instance of $r = 4$. This is in line with the recent result of [BS] that most elliptic curves have small rank; in fact, over 2/3 of elliptic curves obey the BSD conjecture [BSZ].
To give an idea of the size of the $a$-coefficients involved, the largest one in the database is
(3.2) 
which is of rank 0.
Even though it is conjectured that the rank can be arbitrarily large, the largest confirmed rank so far known in the literature is 19 [Elk], corresponding (the last term being a 72-digit integer!) to
(3.3) 
This extraordinary result is clearly not in the database due to the large rank.
One of the first steps in data visualization is a principal component analysis, where the features of largest variation are extracted. The minimal Weierstraß model gives a natural way of doing so: since the coefficients $(a_1, a_2, a_3)$ take only 12 values, we can readily examine a scatter plot of $(a_4, a_6)$. Now, due to the large variation in these coefficients, we define a signed natural logarithm as
(3.4) $\mathrm{slog}(x) := \mathrm{sgn}(x) \ln|x|$ for $x \neq 0$, and $\mathrm{slog}(0) := 0$.
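A minimal sketch of such a signed-logarithm transform, under the assumption (ours) that the convention is $\mathrm{sgn}(x)\ln|x|$ with the value 0 at the origin:

```python
import math

def slog(x):
    """Signed natural logarithm: sgn(x) * log|x| for x != 0, and 0 at x = 0 (assumed convention)."""
    if x == 0:
        return 0.0
    return math.copysign(math.log(abs(x)), x)

print(slog(100), slog(-100), slog(0))
```

Note that other conventions exist, e.g. $\mathrm{sgn}(x)\ln(1 + |x|)$, which is monotone through the origin; the choice only affects coefficients of very small magnitude.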
We present this scatter plot of $(\mathrm{slog}(a_4), \mathrm{slog}(a_6))$ in Fig. 1. Therein, we plot all the data points (i.e., for all values of $(a_1, a_2, a_3)$) together, distinguishing rank by colour (rank 4 has only a single point, as seen from the table above).
The first thing one notices is the approximate cross-like symmetry, even within each rank. This is not trivial, because the transformation
(3.5) $(a_4, a_6) \longmapsto (\pm a_4, \pm a_6)$
is by no means a rank-preserving map. For instance, a single change of sign in one coefficient could result in a rank change from 1 to 3:
(3.6) 
Examples of a similar nature abound. The next feature to notice is that the size of the cross shrinks as the rank increases. This is rather curious, since the largest-rank case of (3.3) has far larger coefficients. This symmetry is somewhat reminiscent of mirror symmetry for Calabi–Yau 3-folds, where every compact smooth such manifold with Hodge numbers $(h^{1,1}, h^{2,1})$ has a mirror manifold with these values exchanged.
3.3 Distributions of $a_4$ and $a_6$
Fortified by the above initial observations, let us continue with a more refined study of the distributions of $a_4$ and $a_6$. First, we plot the distribution of each individually, normalized by the total, i.e., as probability mass functions. These are shown in part (A), to the left of Fig. 2. Note that the horizontal axes (binning) are logarithmic. We see that the distributions of both $a_4$ and $a_6$ are symmetric about 0 (Fig. 2A), with $a_4$ spanning many fewer orders of magnitude than $a_6$. This is just to give an idea of the balanced nature of Cremona's data: elliptic curves with all sign combinations of $(a_4, a_6)$ are constructed.
Next, in part (B) of Fig. 2, we plot the joint probability mass function of the pair, with colour showing the frequency as indicated by the colour bar to the right. We see that, as discussed for Fig. 1, there is a cross-like symmetry. Here, since we are not separating by rank, the symmetry is merely a reflection of the construction of the dataset, in which all sign combinations of $(a_4, a_6)$ are present. What is less explicable is why it should be a cross shape, and what is the meaning of the boundary curve beyond which there do not seem to be any minimal models. For reference, the central rectilinear cross indicates the cases $a_4 = 0$ and $a_6 = 0$ respectively.
Finally, we compute the Euclidean distance
$d = \sqrt{\mathrm{slog}(a_4)^2 + \mathrm{slog}(a_6)^2}$
from the origin and study its probability distribution. This is shown in part (C) of Fig. 2. We find that half of the data lies within a fixed radius of the origin. The logarithm of $d$ can be well fitted by a Beta probability distribution,
(3.7) $f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1},$
for fitted shape parameters $\alpha$ and $\beta$. Thus, whilst there are a number of coefficients of enormous magnitude, the majority still have relatively small ones.
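The analysis above presumably selects and fits distributions by maximum likelihood; as a self-contained illustration only, the Beta shape parameters can be estimated by the method of moments:

```python
def beta_mom(samples):
    """Method-of-moments estimates (alpha, beta) for a Beta distribution on data in (0, 1)."""
    n = len(samples)
    m = sum(samples) / n                          # sample mean
    v = sum((x - m)**2 for x in samples) / n      # sample variance
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

# symmetric toy data around 0.5 should give alpha = beta
a, b = beta_mom([0.2, 0.4, 0.5, 0.6, 0.8])
print(round(a, 3), round(b, 3))  # 2.625 2.625
```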
Differences by Rank:
As with our initial observations, we now study the variation of $(a_4, a_6)$ with respect to the rank of the elliptic curves. First, in Fig. 3, parts A to D, we plot the joint distributions for each rank. We can see that they differ significantly from each other under a permutation test [PJ]. Next, Fig. 3E shows the probability distribution functions of our Euclidean distance for the different ranks. We find that the median Euclidean distance from the centre decreases for higher values of the rank. In fact, we see that the median values increase with the rank (see Fig. 4, parts D and E, which we will discuss shortly). Again, each distribution is individually well fitted by a Gamma distribution. In Tables 9 and 10 in the Appendix, we show some statistics of $a_4$ and $a_6$, including their mean, standard deviation, median, and the number of zero entries, for given rank.

3.4 Distributions of various BSD Quantities
Now, the coefficients are the inputs of the dataset, each specifying a minimal model of an elliptic curve, and to each is associated the numerical tuple $(r, N, R, \Omega, C, |T|, |Ш|)$ of the rank, the conductor, the regulator, the real period, the Tamagawa number, the order of the torsion group and the order of the Tate–Shafarevich group, respectively. It is now expedient to examine the distributions of these "output" parameters.
As always, we arrange everything by rank and in Fig. 4 show the box plots of the variation around the median (drawn in red). The boxes enclose the central part of each distribution, and the whiskers its tails. We see that, as detailed above, some quantities show only modest variation while others range over many orders of magnitude. The conductor has a fairly tame distribution while the other BSD quantities vary rather wildly; this is part of the difficulty of the conjecture, as the relevant quantities behave quite unpredictably.
The RHS of Conjecture 2:
We now put all the quantities together according to the RHS of the Strong BSD Conjecture, which we recall to be $\frac{\Omega \cdot R \cdot |Ш| \cdot C}{|T|^2}$. We test which statistical distribution best describes this data by comparing 85 continuous distributions under the Akaike information criterion. We find that the distribution best describing the data is the Beta distribution (see Figure 5A),
(3.8) $f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1},$
where $\Gamma$ is the standard gamma function and the original variable has been rescaled such that $x \in [0, 1]$.
The selected distribution changes for elliptic curves of a specific rank $r$. For small $r$, the selected distribution is the exponentiated Weibull, while for larger values of $r$ it is the Johnson SB (see Figure 5B). We find that the median value of the RHS increases both as a function of the rank (see Figure 5C) and of the conductor (see Figure 5D).
4 Topological Data Analysis
Let us gain some further intuition by visualizing the data. To each elliptic curve, specified by its Weierstraß coefficients, one associates a list of quantities: the conductor, the rank, the real period, etc. These define a point cloud in a Euclidean space of rather high dimension, each point of which is given by the list of quantities entering BSD. In the above, we extracted the last two coefficients of the Weierstraß form and studied them against the variations of the first three, which, in minimal form, can only be one of 12 possible 3-tuples with entries among $0, \pm 1$. This conveniently allowed us to at least "see" the Weierstraß coefficients, because we had projected to two dimensions. However, the full list of relevant quantities for an elliptic curve has quite a number of entries and cannot be visualized directly.
Luckily, there is a recently developed method in data science which allows precisely for the visualization of high-dimensional data, viz., persistent homology in topological data analysis [CZCG] (cf. the excellent introductory survey [OPTGH]). In brief, one creates a Vietoris–Rips complex from the data points in Euclidean space, with a notion of neighbourhood at scale $\epsilon$ (by Euclidean distance). The Betti numbers of the complex are then computed as one varies $\epsilon$; whether they are non-zero for each $\epsilon$ gives a notion of whether non-trivial topology (such as holes) persists across scales. The result is a so-called barcode for the data. In the ensuing, we will use G. Henselman's nice implementation of the standard methods in topological data analysis, the package Eirene for Julia/Python [Ei].
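The dimension-0 part of a barcode simply counts connected components of the Vietoris–Rips complex as the scale $\epsilon$ grows. A self-contained union-find sketch (not using Eirene; the toy point cloud below is hypothetical):

```python
from itertools import combinations

def betti0(points, eps):
    """Number of connected components of the Vietoris-Rips complex at scale eps (union-find)."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, j in combinations(range(len(points)), 2):
        d2 = sum((a - b)**2 for a, b in zip(points[i], points[j]))
        if d2 <= eps**2:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# two well-separated clusters merge into one component as eps grows
cloud = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print([betti0(cloud, e) for e in (0.5, 1.5, 20.0)])  # -> [5, 2, 1]
```

Sweeping $\epsilon$ and recording where components are born and die reproduces exactly the dimension-0 bars of the barcode.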
4.1 The Weierstraß Coefficients
We begin by studying the barcodes for the full set of five Weierstraß coefficients (as always, we take $\mathrm{slog}$ of $a_4$ and $a_6$ due to their size). Of course, computing the homology of the Vietoris–Rips complex for over 2 million points is computationally prohibitive. The standard method is to consider random samples. Moreover, the first two Betti numbers $b_0$ and $b_1$ are usually sufficiently illustrative. Thus, we take 1000 random samples of the coefficients (we do not separate the ranks, since we find no significant difference in the barcodes amongst different ranks). The barcodes are shown in part (a) of Figure 6. For reference, we also plot the barcodes for the pair $(a_4, a_6)$ only, since $(a_1, a_2, a_3)$ do not vary so much.
4.2 A 6-dimensional Point Cloud
Let us now try to visualize the relevant BSD quantities together. Organized by the ranks $r = 0, 1, 2$, which dominate the data by far, the 6-tuples
(4.9) $(N, \Omega, C, R, |T|, |Ш|)$
naturally form three point clouds in $\mathbb{R}^6$. Due to the high dimensionality, we sample 100 random points for each of the rank values and compute the full barcodes. It turns out that the main visible features are in dimension 0. We present these in Figure 7 and observe that there is indeed some variation in the barcode amongst the different ranks.
4.3 Conductor Divisibility
The factorization of the conductor is of defining importance to the $L$-function, which appears on the LHS of BSD; meanwhile, the RHS is governed, in the strong case, by the combination $\frac{\Omega \cdot R \cdot |Ш| \cdot C}{|T|^2}$. It is therefore expedient to consider the point cloud of a 3-tuple of BSD quantities organized by divisibility properties of $N$. For instance, one could contrast the barcodes for even versus odd $N$. Again, the features are prominent in dimension 0, and the barcodes are shown in Figure 8.
Similarly, we can group by the residue of the conductor modulo 3, as shown in Figure 9.
5 Machine Learning
In [He, HeBook], a paradigm was proposed to use machine-learning, and in particular deep neural networks, to help with computations in various problems in algebraic geometry. Exemplified by computing cohomology of vector bundles, it was shown that, to very high precision, AI can guess the correct answer without using the standard methods of Gröbner basis construction and chasing long exact sequences, both of which are computationally intensive. Likewise, [HK] showed that machine-learning can identify algebraic structures, such as distinguishing simple from non-simple finite groups. At over 99% precision, the requisite answers can be estimated without recourse to standard computations which are many orders of magnitude slower.
It is therefore natural to wonder whether the elliptic curve data can be "machine-learned". Of course, we need to be careful: while computational algebraic geometry over $\mathbb{C}$ hinges on finding kernels and cokernels of integer matrices, a task at which AI excels, problems in number theory are much less controlled. Indeed, trying to predict prime numbers [He] seems a hopeless task, as mentioned in the introduction. Nevertheless, let us see how far we can go with our present dataset for BSD.
5.1 Predicting BSD Quantities from the Weierstraß Coefficients
We begin by quantifying the performance of machine-learning models in predicting, one by one, the BSD quantities introduced above, together with the RHS of Conjecture 2, given the Weierstraß coefficients $a_1, a_2, a_3, a_4, a_6$ alone. Straight away, this is expected to be a difficult, if not impossible, task (as impossible, perhaps, as the prediction of prime numbers). This is confirmed by the correlation between the Weierstraß coefficients and the BSD quantities: from the correlation matrix, we see that the relationship is indeed weak (cf. Fig. 10), implying this is not a straightforward prediction task. Here, the correlation matrix contains the values of the Spearman correlation, that is, the Pearson correlation [Pear] between the rank variables associated to each pair of the original variables.
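The Spearman correlation, i.e., the Pearson correlation of rank-transformed variables, can be sketched as follows (a plain-Python illustration, not the implementation used for Fig. 10):

```python
def ranks(xs):
    """Rank transform, with average ranks for tied values (1-based)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx)**2 for x in xs)
    vy = sum((y - my)**2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

# a monotone (but nonlinear) relation has Spearman correlation exactly 1
print(spearman([1, 2, 3, 4], [1, 8, 27, 64]))  # -> 1.0
```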
Table 1: NMAE and RMSE for the prediction of each quantity (including the RHS of Conjecture 2), for XGBoost (left column), the dummy regressor (central column) and a linear regression (right column). The reported values are averages across 5-fold cross-validations, with the corresponding standard deviations.

Our analysis relies on gradient boosted trees [BST], using the implementation of XGBoost [XGBoost], an open-source scalable machine-learning system for tree boosting used in a number of winning Kaggle solutions (17/29 in 2015). In Appendix B, we present similar results using a support vector machine, another highly popular machine-learning model, and see that XGBoost indeed performs better. Furthermore, based on the learning curves of the XGBoost models (discussed in Appendix A), we have chosen a 5-fold cross-validation, such that the training set includes 80% of the values and the validation set the remaining 20%.

5.1.1 Numerical Quantities
First, we train regression models to predict the values of the numerical quantities. We evaluate the performance of the regression by computing the normalized median absolute error,
(5.10) $\mathrm{NMAE} = \frac{\operatorname{median}_i\, |y_i - \hat{y}_i|}{\max_i y_i - \min_i y_i},$
where the $y_i$ are the observed values and the $\hat{y}_i$ are the predicted values, and the root mean squared error,
(5.11) $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},$
where $n$ is the size of the test set. We desire both NMAE and RMSE to be close to 0 for a good prediction.
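The two error measures can be sketched as follows, under the convention that NMAE normalizes the median absolute error by the observed range (our assumption; other normalizations exist):

```python
def nmae(y, y_pred):
    """Median absolute error, normalized by the range of the observed values."""
    errs = sorted(abs(a - b) for a, b in zip(y, y_pred))
    n = len(errs)
    med = errs[n // 2] if n % 2 else (errs[n // 2 - 1] + errs[n // 2]) / 2
    return med / (max(y) - min(y))

def rmse(y, y_pred):
    """Root mean squared error."""
    return (sum((a - b)**2 for a, b in zip(y, y_pred)) / len(y)) ** 0.5

# the median is robust to the single large miss, while RMSE is not
y, y_hat = [0.0, 1.0, 2.0, 4.0], [0.0, 1.0, 2.0, 0.0]
print(nmae(y, y_hat), rmse(y, y_hat))  # -> 0.0 2.0
```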
We compare the results of the XGBoost regression with two baselines: (1) a linear regression model and (2) a dummy regressor that always predicts the mean of the training set. We find that, in all cases, the machine-learning algorithm performs significantly better than the baseline models (see Table 1) with respect to the NMAE and RMSE; however, XGBoost performs only marginally better than the baselines for one of the quantities. We also report the so-called importance of the features for the XGBoost regressor in Fig. 15. Here, importance indicates how useful each feature is in the construction of the boosted decision trees, and is calculated as the average importance across trees; for a single tree, the importance of a feature is computed as the relative increase in performance resulting from the tree splits based on that feature [XGBoost]. We see that overall our measures, NMAE and RMSE, are not too close to 0, except perhaps for one quantity, for which even a simple linear regression does fairly well. In Table 2, we report the values of the coefficients of this linear regression fit.
  coef  std err  t  P>|t|  [0.025  0.975]

const  1.5946  0.004  380.761  0.000  1.586  1.603
  0.0658  0.005  14.524  0.000  0.057  0.075
  -0.0065  0.004  -1.543  0.123  -0.015  0.002
  -0.0518  0.005  -11.473  0.000  -0.061  -0.043
  -0.6320  0.006  -110.282  0.000  -0.643  -0.621
  0.4877  0.006  85.112  0.000  0.477  0.499

Here, we can reject the null hypothesis that the coefficients are equal to 0, at standard significance levels, for all coefficients except the one with $p = 0.123$. What Table 2 means is that the predicted quantity is well approximated by an affine function of the features, with the coefficients shown.
Predicted variable  R-squared  Adj. R-squared  F-statistic  Prob (F-statistic)

  0.000  0.000  95.67  3.86e-101
  0.006  0.006  2777  0
  0.001  0.001  387.2  0
  0.005  0.005  2522  0
  0.005  0.005  2377  0
RHS Conj. 2  0.012  0.012  6143  0

Likewise, in Table 3, we report the statistics associated to the linear regression models for the prediction of the various quantities, where only the Weierstraß coefficients are used as features (inputs). The low R-squared values indicate that the regressions in terms of the coefficients alone are poor.
5.1.2 Categorical Quantities
Next, we train classifiers to predict the values of $r$ and $|T|$, since these easily fall into discrete categories: the rank takes only a few integer values in the database, and the torsion group can only be one of 15 groups by Mazur's theorem, so its size takes only finitely many integer values (between 1 and 16). Again, we use a 5-fold cross-validation, and we evaluate the performance of the classifiers by computing the $F_1$ score,
(5.12) $F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},$
where we count, in the predicted versus actual values, the true positives (TP), false positives (FP), and false negatives (FN). Since there are several possible values of the rank $r$, we compute both the micro-averaged $F_1$, by counting the total TP, FN and FP, as well as the macro-averaged $F_1$, the average value of $F_1$ computed across ranks.
In addition, we also compute the Matthews correlation coefficient [Matthew] to describe the confusion matrix:
(5.13) $\mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}.$
Both the $F_1$ score and the MCC are desired to be close to 1 for a good prediction.
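Both scores can be computed directly from confusion counts; a minimal sketch for the binary case:

```python
import math

def f1(tp, fp, fn):
    """F1 score from confusion counts."""
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient (binary case), with the conventional 0 for a zero denominator."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# a perfect classifier attains F1 = 1 and MCC = 1
print(f1(10, 0, 0), mcc(10, 10, 0, 0))  # -> 1.0 1.0
```

For the multi-class rank prediction, the micro-average applies these formulas to the summed counts, while the macro-average computes the score per class and then averages.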
Table 4: $F_1$ scores and MCC for the prediction of $r$ and $|T|$, for XGBoost (left column), the dummy classifier (central column) and a logistic regression (right column). The reported values are averages across 5-fold cross-validations, with the corresponding standard deviations.

As checks, we compare the XGBoost classifier with (1) a baseline classifier, which always predicts the predominant class in the training set, as well as with (2) a logistic regression. We find that XGBoost performs better than the baseline models for one of the two quantities, while for the other the performance is comparable to the baselines (see Table 4).
Analyzing the confusion matrices (see Figures 12 and 13), it appears clear that it is very hard to predict $r$ and $|T|$ from the Weierstraß coefficients alone. For both predictions, the most important predictor (shown in Figures 12 and 13) is the feature that contributed the most to increasing the performance of the boosted trees [XGBoost].
5.2 Mixed Predictions
While the results of the previous subsection may seem disappointing in general, they do constitute a good sanity check: obtaining all the BSD quantities from the Weierstraß coefficients in some straightforward way would be an almost unimaginable feat in number theory. Nevertheless, in this section, let us build machine-learning models to predict the values of the BSD quantities among themselves, i.e., we consider as features (inputs) all the quantities characterizing the elliptic curve (except the predicted quantity), rather than the Weierstraß coefficients alone.
Quantity  NMAE (XGBoost)  NMAE (Dummy)  NMAE (Linear) 

We present in Table 5 the results for the accuracy measure NMAE by the 3 methods of subsection 5.1: machine-learning by XGBoost, dummy regression and linear regression. To read the table: for each of the 5 quantities given in a row, we use the other 4 to train the model in order to predict that row. We see a significant improvement over the previous subsection; in particular, XGBoost can very confidently predict the Tamagawa number and the regulator. The feature importance is shown in Fig. 15, and in Table 6 we report the statistics associated to the linear regression models for these mixed predictions. A comparison between the predictions of the linear models and those of XGBoost is presented in Figure 14.
Predicted variable  R-squared  Adj. R-squared  F-statistic  Prob (F-statistic)

  0.038  0.038  8999  0
  0.054  0.054  12960  0
  0.042  0.042  9889  0
  0.017  0.017  3891  0
  0.114  0.114  29110  0
Quantity  F1 (XGBoost)  F1 (Dummy)  F1 (Logistic)  MCC (XGBoost)  MCC (Dummy)  MCC (Logistic)
Finally, let us use all quantities, the Weierstraß coefficients as well as the numerical BSD quantities, to predict $r$ and $|T|$. The accuracy measures $F_1$ and MCC (which should ideally be close to 1) are presented in Table 7, and the feature importance in Figures 16 and 17. We see that these are considerably improved compared to those obtained in section 5.1 (cf. Tables 1 and 4). This is somewhat to be expected in light of the correlations observed in Figure 10. In fact, even logistic regressions behave fairly well, with the $F_1$ and MCC scores substantially larger than those obtained by the dummy classifiers (see Table 7). For reference, we report the hyperplane equations of the linear regression models, to see how each of the quantities can be fitted by all the others:
(5.14) 
6 Conclusions and Prospects
In this paper, we initiated the study of the data science of the arithmetic of elliptic curves, in light of the Birch–Swinnerton-Dyer Conjecture. This is inspired by the recent advances in the statistical investigation of Calabi–Yau manifolds, especially in the context of superstring theory [HJP, ACHN], as well as by the paradigm of machine-learning structures in geometry [He, HeBook] and algebra [HK]. Whilst we are still within the landscape of "Calabi–Yau-ness", it is expected that patterns in number theory should be much more subtle than those in geometry over $\mathbb{C}$ and in combinatorics. Nevertheless, BSD, residing at the crux between arithmetic and analysis, might be more susceptible to machine-learning and to pattern-recognition.
From our preliminary examinations of the extensive database of Cremona [Cre, Cre2], we have already found several interesting features. First, we find that in the minimal Weierstraß representation, where the pair of coefficients $(a_4, a_6)$ clearly constitutes the principal component of the data, their distribution follows a curious cross-like symmetry across ranks, as shown in Figures 1 and 2. This is highly non-trivial, since the sign-flip map $(a_4, a_6) \mapsto (\pm a_4, \pm a_6)$ does not preserve the rank. The symmetry is reminiscent of mirror symmetry for Calabi–Yau threefolds. In addition, the absence of data points beyond the boundaries of the cross is also of note, much like the Hodge plot for Calabi–Yau threefolds.
Overall, the distribution of the Euclidean distance of $(\mathrm{slog}(a_4), \mathrm{slog}(a_6))$ to the origin, as well as that of the RHS of the Strong BSD conjecture (cf. the conjectures in Sec. 2.2), are best described by Beta distributions, selected in both cases among 85 continuous distributions using the Akaike Information Criterion. Organized by rank, these distributions also vary.
One can further visualize the data, i.e., the tuples consisting of the Weierstraß coefficients as well as the BSD tuple for ranks $r = 0, 1, 2$, using standard techniques from topological data analysis. The barcodes are shown in Figures 6 and 7. While the Weierstraß coefficients show little variation over rank, the BSD tuple does show differences over $r$. Moreover, as expected, the divisibility of the conductor influences the barcodes.
Finally, emboldened by the recent success in using machine-learning to compute bundle cohomology on algebraic varieties without recourse to sequence-chasing and Gröbner bases, as well as to recognize whether a finite group is simple directly from its Cayley table, we asked whether one can "predict" quantities that are otherwise difficult to compute directly from "looking" at the shape of the elliptic curve. Ideally, one would have hoped that, training on the Weierstraß coefficients, one could predict any of the BSD quantities to high precision, as was the case for cohomology. However, due to the very high variation in the sizes of the quantities involved, no machine-learning technique we tried, be it decision trees, support vector machines or neural networks, seems to achieve this. This is rather analogous to the (expected) failure of predicting prime numbers with AI. Nevertheless, the BSD quantities, when mixed with the Weierstraß coefficients, do behave well under machine-learning; for instance, the Matthews correlation coefficients between predicted and true values of $r$ and $|T|$ are then substantially improved (cf. Table 7).
At some level, the experiments here, in conjunction with those in [He, KS, Rue, CHKN, BHJM, JKP, HK, BCDL, Ashmore:2019wzb], point to a certain hierarchy of difficulty in how machine-learning responds to problems in mathematics. Understandably, number theory is the most difficult: as a sanity check, [He] found that trying to predict the next prime number, for instance, seems unfeasible for simple neural networks. On the other hand, algebraic geometry over the complex numbers presents a host of amenable questions, such as bundle cohomology, recognition of elliptic fibrations, or the calculation of Betti numbers. In between lie algebra and combinatorics, such as the structure of finite groups, where the precision/confidence of the cross-validation is intermediate. It is therefore curious that, as seen in this paper, a problem such as BSD, which resides between arithmetic and geometry, is better behaved under machine-learning than a direct attack on patterns in the primes.
Acknowledgments
YHH would like to thank the Science and Technology Facilities Council, UK, for grant ST/J00037X/1, Nankai University, China for a chair professorship and Merton College, Oxford, for enduring support.
Appendix A Learning Curves
To prevent overfitting, we compute the learning curves of the regression (Figure 18) and classification (Figure 19) models in section 5.1. We find that 80% of the data is a good choice for the training set size, suggesting a 5-fold cross-validation.
Appendix B Comparison with SVM
In this section, we compare the performance of the XGBoost models with Support Vector Machine (SVM) models. SVM models are very slow to train; hence we focus, for this task, on a subset of the examples. In Table 8, we report the performance of the regression models used to predict the numerical quantities. Only in one case does the SVM model perform better than XGBoost.
Quantity  NMAE (XGBoost)  NMAE (Dummy)  NMAE (SVM) 
