1 Introduction
Accurate computational solutions to the Schrödinger equation are critical for quantum chemistry and condensed matter physics. The time-independent form of the equation solves for a wavefunction $\psi(\mathbf{x})$, which is an eigenfunction of the Hamiltonian $\hat{H}$ of the system:

$$\hat{H}\psi(\mathbf{x}) = E\psi(\mathbf{x}) \qquad (1)$$

$$\hat{H} = -\frac{1}{2}\sum_i \nabla_i^2 + \sum_{i>j}\frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} - \sum_{i,I}\frac{Z_I}{|\mathbf{r}_i - \mathbf{R}_I|} + \sum_{I>J}\frac{Z_I Z_J}{|\mathbf{R}_I - \mathbf{R}_J|} \qquad (2)$$

Here $E$ is the real-valued energy of the wavefunction, $\mathbf{x}_i = (\mathbf{r}_i, \sigma_i)$ is the position and spin ($\mathbf{r}_i$ is the position) of the $i$th electron, $\mathbf{R}_I$ is the position of the $I$th nucleus and $Z_I$ is the charge of the $I$th nucleus. Under the Born–Oppenheimer approximation, the nuclei are regarded as stationary and we can solve for the lowest-energy solution of this equation by choosing a class of unnormalized wavefunctions $\psi_\theta(\mathbf{x})$ parameterized by $\theta$, and minimizing the Rayleigh quotient:

$$E_\theta = \frac{\int \psi_\theta(\mathbf{x})\hat{H}\psi_\theta(\mathbf{x})\,d\mathbf{x}}{\int \psi_\theta^2(\mathbf{x})\,d\mathbf{x}} = \mathbb{E}_{\mathbf{x}\sim\psi_\theta^2(\mathbf{x})}\left[\psi_\theta^{-1}(\mathbf{x})\hat{H}\psi_\theta(\mathbf{x})\right] \qquad (3)$$
where in the last line we are performing the integral by Monte Carlo sampling from the distribution $\psi_\theta^2(\mathbf{x})$. This method of directly minimizing the energy of a system using a parametric approximation to the true wavefunction – also known as a wavefunction Ansatz – and Monte Carlo sampling is known as variational quantum Monte Carlo (VMC) Foulkes et al. (2001).
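The quantity inside the expectation in Eq. 3 is the local energy, and its Monte Carlo average is the VMC energy estimate. A minimal sketch of this procedure (not the FermiNet itself) uses the hydrogen atom with the trial wavefunction $\psi(\mathbf{r}) = e^{-|\mathbf{r}|}$, for which the local energy is analytically $-0.5$ Hartree everywhere, so the sampled average recovers the exact ground-state energy; all names below are illustrative:

```python
import jax
import jax.numpy as jnp

def log_psi(r):
    # Hydrogen ground-state trial wavefunction: psi(r) = exp(-|r|).
    return -jnp.linalg.norm(r)

def local_energy(r):
    # E_L = psi^-1 H psi = -0.5 (lap psi)/psi - 1/|r|, using the identity
    # (lap psi)/psi = lap(log psi) + |grad log psi|^2.
    grad = jax.grad(log_psi)(r)
    lap = jnp.trace(jax.hessian(log_psi)(r))
    kinetic = -0.5 * (lap + jnp.dot(grad, grad))
    potential = -1.0 / jnp.linalg.norm(r)
    return kinetic + potential

def metropolis_step(key, r, step=0.3):
    # One Metropolis move targeting psi^2: accept with min(1, psi(r')^2/psi(r)^2).
    key1, key2 = jax.random.split(key)
    proposal = r + step * jax.random.normal(key1, r.shape)
    log_ratio = 2.0 * (log_psi(proposal) - log_psi(r))
    accept = jnp.log(jax.random.uniform(key2)) < log_ratio
    return jnp.where(accept, proposal, r)

key = jax.random.PRNGKey(0)
walkers = jax.random.normal(key, (512, 3)) + 1.0  # batch of electron positions
step_fn = jax.jit(jax.vmap(metropolis_step))
energy_fn = jax.vmap(local_energy)
for _ in range(200):
    key, subkey = jax.random.split(key)
    walkers = step_fn(jax.random.split(subkey, walkers.shape[0]), walkers)
print(float(energy_fn(walkers).mean()))  # -0.5 Hartree (exact for this trial function)
```

For a parameterized Ansatz, one would additionally differentiate this energy estimate with respect to $\theta$ and descend; here the trial function is already exact, so the estimate has zero variance.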
Many classes of wavefunction Ansätze have been proposed in the last fifty years. Deep neural networks have recently come to the fore as a promising, flexible and expressive class of functions for the many-electron problem Ruggeri et al. (2018); Luo and Clark (2019); Han et al. (2019); Choo et al. (2020); Hermann et al. (2020). The Fermionic Neural Network (FermiNet) Pfau et al. (2020) has reached higher absolute accuracy on many small atomic and molecular systems than other neural network Ansätze. This is likely due to a combination of architectural choices and the use of Kronecker-Factored Approximate Curvature (KFAC) Martens and Grosse (2015) as an optimizer, while other neural network Ansätze typically use first-order methods like Adam. Some methods use the second-order stochastic reconfiguration method Sorella (1998), which is closely related to natural gradient descent, but are limited in their capacity to scale to larger networks.
Despite the promising FermiNet results, the original work has many limitations. Optimization requires large computational resources, and as system size grows, the accuracy diminishes slightly. In this paper, we investigate routes to improving the scaling, accuracy and optimization of the FermiNet, showing how to reach chemical accuracy on second-row atoms and how to accelerate the training of large systems by an order of magnitude.
2 Improving FermiNet Accuracy
Table 1: FermiNet energies ($E_\mathrm{h}$) for second-row atoms. Columns give # hidden units / # determinants.

| Atom | 256 / 16 | 512 / 16 | 256 / 32 | 512 / 32 | Exact Chakravorty et al. (1993) |
|---|---|---|---|---|---|
| P | −341.2561(1) | −341.2570(1) | −341.2565(1) | −341.2578(1) | −341.259 |
| S | −398.1049(1) | −398.1066(1) | −398.1072(1) | −398.1082(1) | −398.110 |
| Cl | −460.1452(1) | −460.1451(1) | −460.1463(1) | −460.1477(1) | −460.148 |
| Ar | −527.5374(1) | −527.5384(1) | −527.5396(1) | −527.5405(1) | −527.540 |
The original FermiNet achieved chemical accuracy, defined as being within 1 kcal/mol (1.594 m$E_\mathrm{h}$) of the exact energy, on many small systems, such as first-row atoms (lithium to neon), but the accuracy declined with larger systems. Ablation studies on N$_2$, CO, and a hydrogen chain of 10 atoms showed that both the number of determinants and the width of the layers in the one-electron stream play an important role in the convergence of the FermiNet energy. We investigated the role of both in the performance on second-row atoms, specifically phosphorus through argon. While the settings used in the original paper were insufficient for reaching chemical accuracy relative to exact results Chakravorty et al. (1993), we found that increasing both the number of determinants and the width of the one-electron stream was sufficient to reach chemical accuracy on all systems except sulphur. We also increased the number of MCMC steps between weight updates from 10 to 20 to reduce noise and improve equilibration, and ran training for 300,000–400,000 weight updates instead of 200,000 to guarantee convergence. Results are presented in Table 1.
3 Scaling and Simplifying FermiNets
Table 2: Bicyclobutane training cost and energy for the original TensorFlow implementation and the simplified JAX implementation.

| Framework | MCMC steps | Det weights | Envelope | GPU hours | Energy ($E_\mathrm{h}$) |
|---|---|---|---|---|---|
| TensorFlow Pfau et al. (2020) | 10 | Yes | Full covariance | 11520 | −155.9263(6) |
| JAX | 50 | No | Full covariance | 1880 | −155.9348(1) |
| JAX | 50 | No | Isotropic | 1104 | −155.9348(1) |
Table 3: Energies (kcal/mol) relative to bicyclobutane along the bicyclobutane → 1,3-butadiene pathway (con/dis TS: conrotatory/disrotatory transition states; g-but, t-but: gauche- and trans-butadiene; gt TS: the transition state between them).

| Method | con TS | dis TS | g-but | gt TS | t-but |
|---|---|---|---|---|---|
| CCSD(T) Kinal and Piecuch (2007) | 40.4 | 21.8 | −25.1 | −22.3 | −28.0 |
| CR-CC(2,3) Kinal and Piecuch (2007) | 41.1 | 66.1 | −24.9 | −22.1 | −27.9 |
| CCSDt Shen and Piecuch (2012) | 40.1 | 59.0 | −27.2 | −25.3 | −31.1 |
| CC(t;3) Shen and Piecuch (2012) | 40.2 | 60.1 | −25.3 | −22.6 | −28.3 |
| DMC Berner and Lüchow (2010) | 40.4±0.5 | 58.6±0.5 | −25.2±0.5 | −22.2±0.5 | −27.9±0.5 |
| FermiNet | 40.2±0.1 | 57.7±0.1 | −25.3±0.1 | −22.5±0.1 | −28.4±0.1 |
| Experiment Srinivasan et al. (1965); Wiberg and Fenoglio (1968) | 40.6±2.5 | | | | −25.9±0.4 |
The large computational overhead of the FermiNet is a limitation for practical adoption and scaling to larger systems. For instance, computing the energy of bicyclobutane, a 30-electron system, took roughly one month on 16 V100 GPUs Pfau et al. (2020). This is outside the reach of many research groups. We took several steps to improve this performance without sacrificing accuracy. First, we wrote a new implementation of the FermiNet in JAX Bradbury et al. (2018), which led to immediate improvements in performance. On large systems, GPU utilization went from ~60% to ~90%. The memory overhead also declined dramatically, possibly due to the availability of forward-mode gradients in JAX, which were used in computing the kinetic energy. This reduced the number of GPUs needed to run bicyclobutane with a batch size of 4096 from 16 to 4. On its own, this is enough to achieve a 6x improvement in efficiency.
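The kinetic-energy savings come from evaluating the Laplacian of $\log\psi$ with forward-mode Jacobian-vector products over the gradient, one coordinate at a time, instead of materializing the full Hessian. The following is a minimal sketch of this trick under illustrative assumptions: `log_psi` is a smooth stand-in for the log of a network wavefunction, not the FermiNet itself.

```python
import jax
import jax.numpy as jnp

def log_psi(x):
    # Stand-in for the log of a network wavefunction: any smooth
    # scalar function of the flattened electron coordinates.
    return -jnp.sum(jnp.sqrt(1.0 + x ** 2))

def laplacian(f, x):
    # Sum of second derivatives of f at x, one Hessian diagonal entry
    # at a time via forward-mode JVPs through grad f. Memory stays
    # O(n) instead of the O(n^2) of a full Hessian.
    n = x.shape[0]
    eye = jnp.eye(n)
    grad_f = jax.grad(f)
    def diag_entry(i):
        # jvp of grad f along the basis vector e_i gives H @ e_i;
        # its i-th component is the diagonal entry H_ii.
        _, hvp = jax.jvp(grad_f, (x,), (eye[i],))
        return hvp[i]
    return jnp.sum(jax.vmap(diag_entry)(jnp.arange(n)))

def kinetic_energy(x):
    # -(1/2) (lap psi)/psi = -(1/2)(lap log psi + |grad log psi|^2).
    grad = jax.grad(log_psi)(x)
    return -0.5 * (laplacian(log_psi, x) + jnp.dot(grad, grad))

x = jnp.arange(1.0, 7.0)  # 6 flattened coordinates
print(kinetic_energy(x))
```

The `vmap` over diagonal indices trades a small amount of recomputation for a much smaller peak memory footprint, which is what allows larger batch sizes per GPU.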
To improve efficiency even further, we removed several unnecessary features from the FermiNet. At a high level, the FermiNet Ansatz can be written as:

$$\psi(\mathbf{x}) = \sum_k \omega_k \det\left[\Phi^{k\uparrow}\right]\det\left[\Phi^{k\downarrow}\right] \qquad (4)$$

$$\Phi^{k\alpha}_{ij} = \left(\mathbf{w}^{k\alpha}_i \cdot \mathbf{h}^{L\alpha}_j + g^{k\alpha}_i\right)\sum_m \pi^{k\alpha}_{im}\exp\left(-\left|\boldsymbol{\Sigma}^{k\alpha}_{im}\left(\mathbf{r}^\alpha_j - \mathbf{R}_m\right)\right|\right) \qquad (5)$$

where $\alpha \in \{\uparrow, \downarrow\}$ indexes spins, $i$ and $j$ index electrons, $m$ indexes atoms and $k$ indexes determinants. $\mathbf{h}^{L\alpha}_j$ is the last layer of a permutation-equivariant neural network, and the last term in Eq. 5 involving weights $\pi^{k\alpha}_{im}$ and $\boldsymbol{\Sigma}^{k\alpha}_{im}$ is a multiplicative envelope that enforces the boundary condition that the wavefunction goes to zero at infinity. The weights $\omega_k$ on the determinants are redundant and can be absorbed into the linear weights $\mathbf{w}^{k\alpha}_i$, and so we remove them from the network. Each "covariance" parameter $\boldsymbol{\Sigma}^{k\alpha}_{im}$ in the envelope is a $3\times 3$ matrix. Computing $\boldsymbol{\Sigma}^{k\alpha}_{im}(\mathbf{r}^\alpha_j - \mathbf{R}_m)$ adds quite a large computational overhead with unclear benefit. We find that if we replace the full covariance with a single parameter $\sigma^{k\alpha}_{im}$ and compute $\exp\left(-\sigma^{k\alpha}_{im}\left|\mathbf{r}^\alpha_j - \mathbf{R}_m\right|\right)$ instead, effectively making the covariance isotropic, the performance on bicyclobutane is unaffected, while the computational overhead is reduced by a full 40%. In Table 2, it can be seen that by combining these simplifications and the JAX implementation, training can be accelerated by a full order of magnitude, and the overall energy can actually be improved by a full 8 m$E_\mathrm{h}$ relative to the original results. Note that this is without the wider networks and larger determinant counts used in the previous section.
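The two envelope variants can be contrasted in a short sketch. Everything below is illustrative rather than the FermiNet code: the shapes cover a single (determinant, orbital) slice, and `pi`, `sigma_full` and `sigma_iso` stand in for the envelope parameters $\pi_{im}$, $\boldsymbol{\Sigma}_{im}$ and $\sigma_{im}$.

```python
import jax
import jax.numpy as jnp

# Illustrative shapes: n_elec electrons, n_atom nuclei, one orbital slice.
key_r, key_R = jax.random.split(jax.random.PRNGKey(0))
n_elec, n_atom = 10, 4
r = jax.random.normal(key_r, (n_elec, 3))       # electron positions
R = jax.random.normal(key_R, (n_atom, 3))       # nuclear positions
pi = jnp.ones((n_atom,))                        # mixing weights pi_im
sigma_full = jnp.stack([jnp.eye(3)] * n_atom)   # 3x3 matrices Sigma_im
sigma_iso = jnp.ones((n_atom,))                 # scalars sigma_im

def envelope_full(r, R, pi, sigma):
    # sum_m pi_m exp(-|Sigma_m (r - R_m)|): one 3x3 matvec per
    # (electron, atom) pair before the norm.
    diff = r[:, None, :] - R[None, :, :]              # (n_elec, n_atom, 3)
    scaled = jnp.einsum('mab,nmb->nma', sigma, diff)
    return jnp.sum(pi * jnp.exp(-jnp.linalg.norm(scaled, axis=-1)), axis=-1)

def envelope_iso(r, R, pi, sigma):
    # sum_m pi_m exp(-sigma_m |r - R_m|): a single scalar per pair.
    diff = r[:, None, :] - R[None, :, :]
    return jnp.sum(pi * jnp.exp(-sigma * jnp.linalg.norm(diff, axis=-1)), axis=-1)

# With Sigma_m = sigma_m * I the two envelopes coincide: the isotropic
# form is a strict special case of the full covariance.
print(jnp.allclose(envelope_full(r, R, pi, sigma_full),
                   envelope_iso(r, R, pi, sigma_iso)))  # True
```

The matvec in the full-covariance form is the term whose cost the isotropic simplification removes; the isotropic envelope keeps only the per-pair distance already needed elsewhere.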
Bicyclobutane to 1,3-butadiene transition
The simplified FermiNet is fast enough that we can investigate multiple geometries of complex systems. We start with the transition of bicyclobutane to 1,3-butadiene. This has been investigated by many computational techniques Kinal and Piecuch (2007); Shen and Piecuch (2012); Berner and Lüchow (2010), because the popular CCSD(T) method dramatically underestimates the energy of one possible transition, wrongly predicting that it is the preferred pathway. Both DMC and CC(t;3) methods agree with experimental results, and we compare against the DMC results in Fig. 1 and the full suite of results in Table 3. In both cases, we add the zero-point vibrational energy computed by CASSCF from Kinal and Piecuch (2007). We find that with no fine-tuning, the same FermiNet is able to match the DMC energy differences between bicyclobutane and all other states, both equilibrium and transition, to within chemical accuracy. In Table 3, it can be seen that the FermiNet matches CC(t;3) even more closely, except on the disrotatory pathway, on which the FermiNet predicts slightly lower energies than any other accurate method. This is an impressive feat for a method applied essentially out-of-the-box, with no system-specific tuning.
Cyclobutadiene automerization and comparison to the PauliNet
The automerization of cyclobutadiene is another system on which CCSD(T) struggles, due to its multireference nature Lyakh et al. (2012). The PauliNet, another neural network Ansatz which is faster but less accurate than the FermiNet, has been shown to match the performance of multireference coupled cluster on this system Hermann et al. (2020). We found that the FermiNet achieves energies ~70 m$E_\mathrm{h}$, or 44 kcal/mol, lower than the PauliNet on this system, but only after more iterations than the PauliNet was run for, so it is possible that the PauliNet did not fully converge. The relative energies between the ground and transition state were comparable between the PauliNet and FermiNet – 9.9±0.6 kcal/mol for the PauliNet and 10.3±0.1 kcal/mol for the FermiNet – and both were at the high end of the experimentally-observed range. An exact comparison of the relative efficiency of the PauliNet and FermiNet is difficult. Assuming full GPU utilization for both models, one iteration of the PauliNet took 50 s on a GTX 1080 Ti, which at 11.3 TFLOP/s comes to 565 TFLOP/iteration. One iteration of the FermiNet took 2.5 s across 8 V100s, which at 125 TFLOP/s per device comes to 2.5 PFLOP/iteration, or roughly 5x the computational cost of the PauliNet. We have already made significant strides in accelerating the FermiNet, and we hope that future innovation will bring this figure down further.
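The cost comparison above is back-of-the-envelope arithmetic; the sketch below makes its assumptions (peak throughput, full utilization for both models) explicit:

```python
# Back-of-the-envelope cost comparison under the stated assumptions:
# peak throughput and full GPU utilization for both models.
pauli_flop = 50 * 11.3e12      # 50 s/iter on one GTX 1080 Ti at 11.3 TFLOP/s
fermi_flop = 2.5 * 8 * 125e12  # 2.5 s/iter on 8 V100s at 125 TFLOP/s each
print(pauli_flop / 1e12)       # ~565 TFLOP/iteration
print(fermi_flop / 1e15)       # ~2.5 PFLOP/iteration
print(fermi_flop / pauli_flop) # ~4.4, i.e. roughly 5x
```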
4 Conclusions
We have shown that through a combination of careful engineering and simplification of the FermiNet, we can greatly increase the speed of training, while at the same time we can increase the accuracy simply by making the network larger. We can extend the range of atoms for which we can reach chemical accuracy to the entire second row of the periodic table, beyond which exact methods become impractical. While the original FermiNet results were not within chemical accuracy of CCSD(T) extrapolated to the complete basis set limit for large systems like bicyclobutane, we have shown here that the relative energies between different steps in the transition from bicyclobutane to butadiene are within chemical accuracy of the best available methods. This includes the disrotatory transition state, for which CCSD(T) fails dramatically. In comparison against the PauliNet, we find that the FermiNet is able to achieve far better absolute energies on cyclobutadiene, and comparable relative energies.
The only major downside of the FermiNet relative to the PauliNet is that it is significantly slower to train. One possible reason, advanced by the PauliNet authors, is that the PauliNet incorporates more "physical prior knowledge": namely, the exact cusp conditions are built into the PauliNet, while the FermiNet must learn them. Another possible reason is simply that the PauliNet sacrifices accuracy for speed, given that it has an order of magnitude fewer parameters and does not achieve the same absolute energies as the FermiNet. The exact reason for the discrepancy in training speed remains a topic for future study. We are hopeful that future developments in neural network wavefunction Ansätze will lead to models that combine the speed and light weight of the PauliNet with the accuracy of the FermiNet.
Broader Impact
This work could lead to the adoption of new computational techniques by the chemistry and materials science communities. While the present work is still too small-scale to be applied to cutting-edge chemistry and has only been used for already well-understood systems, in the future these methods could help accelerate chemical research by predicting chemical reactions before they can be observed in the lab. In the best case, this could lead to discoveries with positive social impact like new life-saving drugs or more efficient batteries or catalysts for carbon capture. In the worst case, this could lead to harmful discoveries of the sort that have happened in applied chemistry in the past – but this is a risk shared by all computational chemistry methods development, and a strong professional ethos discouraging such research makes it unlikely.
Thanks to Piotr Piecuch and Frank Noé for sharing data and results, and James Martens for discussions around KFAC and optimization.
References
 [1] Berner and Lüchow (2010) Isomerization of bicyclo[1.1.0]butane by means of the Diffusion Quantum Monte Carlo method. The Journal of Physical Chemistry A 114 (50), pp. 13222–13227.
 [2] Bradbury et al. (2018) JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax.
 [3] Chakravorty et al. (1993) Ground-state correlation energies for atomic ions with 3 to 18 electrons. Physical Review A 47 (5), pp. 3649.
 [4] Choo et al. (2020) Fermionic neural-network states for ab-initio electronic structure. Nature Communications 11 (1), pp. 1–7.
 [5] Foulkes et al. (2001) Quantum Monte Carlo simulations of solids. Reviews of Modern Physics 73 (1), pp. 33.
 [6] Han et al. (2019) Solving many-electron Schrödinger equation using deep neural networks. Journal of Computational Physics 399, pp. 108929.
 [7] Hermann et al. (2020) Deep-neural-network solution of the electronic Schrödinger equation. Nature Chemistry.
 [8] Kinal and Piecuch (2007) Computational investigation of the conrotatory and disrotatory isomerization channels of bicyclo[1.1.0]butane to buta-1,3-diene: a completely renormalized coupled-cluster study. The Journal of Physical Chemistry A 111 (4), pp. 734–742.
 [9] Luo and Clark (2019) Backflow transformations via neural networks for quantum many-body wave functions. Physical Review Letters 122 (22), pp. 226401.
 [10] Lyakh et al. (2012) Multireference nature of chemistry: the coupled-cluster view. Chemical Reviews 112 (1), pp. 182–243.
 [11] Martens and Grosse (2015) Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning (ICML), pp. 2408–2417.
 [12] Pfau et al. (2020) Ab initio solution of the many-electron Schrödinger equation with deep neural networks. Physical Review Research 2 (3), pp. 033429.
 [13] Ruggeri et al. (2018) Nonlinear network description for many-body quantum systems in continuous space. Physical Review Letters 120 (20), pp. 205302.
 [14] Shen and Piecuch (2012) Combining active-space coupled-cluster methods with moment energy corrections via the CC(P;Q) methodology, with benchmark calculations for biradical transition states. The Journal of Chemical Physics 136 (14), pp. 144104.
 [15] Sorella (1998) Green function Monte Carlo with stochastic reconfiguration. Physical Review Letters 80 (20), pp. 4558.
 [16] Srinivasan et al. (1965) The thermal decomposition of bicyclo[1.1.0]butane. The Journal of Physical Chemistry 69 (5), pp. 1775–1777.
 [17] Wiberg and Fenoglio (1968) Heats of formation of C4H6 hydrocarbons. Journal of the American Chemical Society 90 (13), pp. 3395–3397.