I Introduction
Quantitative and formal models are widely employed in economics, sociology, and political science [47, 24] for studying diverse subjects ranging from economic growth [52] to participation in social movements [29]. They are also used to describe empirical regularities, such as patterns in voting behavior and city scaling laws, from which mechanisms can be derived and subsequently tested [11, 23]. While qualitative methods remain a dominant paradigm in much of social science due to their ability to include normative arguments, capture greater complexity and nuance, and extrapolate from sparse data, mathematical models provide a complementary way to reason about causal mechanisms, generate questions, and produce empirically testable predictions [21].
These models are typically constructed using a combination of existing theoretical paradigms, intuition about the relevant variables, and insights on how they interact. Although intuition can serve as a good inductive prior, especially since researchers – being human agents themselves – can have valuable insights on social phenomena, it suffers from two key drawbacks. First, because available data can be high-dimensional, noisy, or strongly nonlinear, it may be difficult to recognize certain empirical regularities. Second, social phenomena are often explained by a large space of plausible and intuitive theories [33]. This means that personal and experiential factors [30], together with hidden assumptions made during the modeling process, can bias the models. The inclusion or omission of certain variables may even change the sign of the estimated parameters. These biases can threaten the external validity of these models, especially if they are tested in isolation without a systematic examination of the entire solution space
[31].

Machine learning provides a collection of powerful tools for detecting patterns in data. However, these models are often uninterpretable and lack the ability to generalize beyond the observed data. Symbolic regression – the process of discovering symbolic expressions from empirical data – has thus evolved as a central method for uncovering hidden relationships in complex data. In recent years, these methods have seen substantive use in the discovery of fundamental laws in the natural sciences, including mathematical conjectures [33], physical laws [56, 17, 59], and ecological models [10, 44].
There have also been several applications of symbolic regression to social science data [48, 55, 62, 20], indicating the promise of AI-assisted model discovery in these fields. However, most existing methods lack flexibility in incorporating user-specified inductive biases: a crucial element for seamless human-machine collaboration, particularly in cases where inference from sparse data can benefit greatly from human input. They are also unable to take advantage of the additional statistical power offered by longitudinal (panel) datasets, where information shared between panels can improve the generalizability of the learned model.
We address these problems in our work and propose a general framework for using neurosymbolic regression to assist the discovery of quantitative models in social science. Building on OccamNet [16], a neural method for efficient symbolic regression, our system searches for parsimonious functional relationships or differential equations and evaluates them according to their explanatory power. (The code used for our model and experiments is available at https://github.com/druidowm/OccamNet_SocialSci.) We extend existing interpretable machine learning methods on two fronts: (i) improving noise tolerance in small datasets and (ii) enabling the system to learn symbolic relationships across longitudinal data (e.g., in a panel dataset across countries) using a weight-sharing approach. This allows one neurosymbolic model to be trained on multiple panels with different parameters simultaneously. Using the proposed workflow shown in Fig. 1, these contributions allow neurosymbolic methods to be more readily applied to ab initio model discovery in social science, where small and noisy datasets are often available across contexts with different model parameters.
We highlight several key applications of this system in a social science context. In section III.1, we show that complexity regularization can be used to find higher-order corrections in network degree distributions, which we extend in mapping the complexity-accuracy Pareto front of economic production functions in section III.1.2. We argue that such maps can help researchers trade off various quantitative models and debias the search process. Next, we apply this method to dynamical models in section III.2, where we use real-world, noisy data to discover the Lotka-Volterra equations and show that weight-sharing learning can be applied across epidemic waves to provide better performance in finding infection models. Finally, we highlight the utility of neurosymbolic regression in evaluating existing models in section III.2.3, using weight-sharing learning to map the complexity-accuracy front of economic growth models across 18 OECD countries and evaluate the generalizability of Solow and Swan's exogenous growth model [52, 54]. We hope that the capabilities demonstrated in this article will inspire greater collaborative use of these machine learning methods in social science.
II Method
We employ a neural network architecture built on OccamNet [16] to find sparse, interpretable formal models that fit the input data. The architecture is designed to accept a range of inductive priors from the user to regularize the regression or reduce the size of the search space.

In the most general formulation, OccamNet takes as input a panel dataset of the form $D = \{D_1, \ldots, D_M\}$ on $n$ variables, where each dataset $D_i$ with $N_i$ samples is partitioned by the user into an input set

(1) $X_i = \{\mathbf{x}_j \in \mathbb{R}^{u}\}_{j=1}^{N_i}$

and a target set

(2) $Y_i = \{\mathbf{y}_j \in \mathbb{R}^{v}\}_{j=1}^{N_i}$

for symbolic regression, i.e., each sample of $D_i$ splits into an input vector and a target vector with $u + v = n$. When the data is not longitudinal, we simply have $M = 1$ with a single dataset $D_1$.
Alternatively, the user can also allow OccamNet to automatically find a partition by having it search through all possible combinations of inputs and targets and fitting an implicit function to the dataset(s). This mode of operation is useful if the user is seeking new empirical regularities in data and wishes to obtain a ranked list of candidate relations based on their error and compactness.
The goal of the regression problem is to find a vector of functions $\mathbf{f}$ and hidden constant parameters $\mathbf{c}$ that approximate or exactly specify each target variable, $\mathbf{y} \approx \mathbf{f}(\mathbf{x}; \mathbf{c})$, with input vector $\mathbf{x}$, based on a predefined set of basis functions. These basis functions can have any number of arguments, be specified over different domains, or include constants to be optimized over (e.g., power laws of the form $x^c$). They may include predetermined quantities, such as the measured rate of capital growth or the basic reproduction number of a virus. Furthermore, OccamNet can operate with piecewise functions or non-differentiable, potentially recursive programs, enabling a wide range of possibilities for constructing formal models.

OccamNet works by representing a probability distribution over a space of user-specified functions. During training, OccamNet samples functions from its probability distribution, evaluates their effectiveness at fitting the given data, and updates its probability distribution so as to shift probability toward better-fitting solutions. An overview of the implementation of OccamNet and its advantages over other neurosymbolic regression methods for social science data is provided in Appendix A.
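The sample-evaluate-reweight loop described above can be illustrated with a minimal sketch. This is a toy stand-in for OccamNet, not its actual implementation: a discrete distribution over a tiny, hypothetical basis library is sharpened toward the candidates that best fit the data.

```python
import math
import random

# Toy stand-in for OccamNet's search loop: keep unnormalized probabilities
# over a small basis library, sample candidates, score them on the data,
# and shift probability mass toward the better-fitting functions.
BASES = {
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "exp(x)": lambda x: math.exp(min(x, 20.0)),  # clipped to avoid overflow
}

def fit(xs, ys, epochs=200, samples=5, top_k=1, seed=0):
    rng = random.Random(seed)
    weights = {name: 1.0 for name in BASES}
    for _ in range(epochs):
        names = list(weights)
        batch = rng.choices(names, weights=[weights[n] for n in names], k=samples)
        # Rank the sampled candidates by mean squared error on the data.
        batch.sort(key=lambda n: sum((BASES[n](x) - y) ** 2 for x, y in zip(xs, ys)))
        for name in batch[:top_k]:  # reinforce the best-fitting samples
            weights[name] *= 1.1
    return max(weights, key=weights.get)

xs = [0.1 * i for i in range(20)]
print(fit(xs, [x * x for x in xs]))  # the quadratic basis wins
```

The real system represents compositions of bases as paths through a network and trains the distribution with gradient-based updates, but the sample-score-reinforce structure is the same.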
II.1 Regularization
Our model is built on two key inductive priors: the use of symbolic bases and complexity regularization. Symbolic regression enables us to find interpretable expressions (as opposed to blackbox or fully nonparametric results), which can inform an understanding of the mechanisms underlying the formal model. For instance, the form of differential equations, the existence of scaling laws, or interaction terms between certain variables can provide insight into the behavior of the social system. Furthermore, although symbolic expressions may only be suited to certain kinds of social science theories, they describe precisely specified relationships within the dataset and can be used to generate testable predictions across populations or timeframes.
Complexity regularization enables model discovery to follow the maxim that “the theory should be just as complicated as all our evidence suggests” [33]. Such a requirement does not necessarily depend on the assumption of parsimony: although it may be true that simple theories have higher probabilities [63], it also ensures that there is sufficient evidence relative to the complexity of the theory to produce determinate research designs. In the computational context, for many systems, the function relating the variables of interest only consists of a few terms, leading to a sparse regression problem which makes model discovery computationally tractable and avoids the need to perform a bruteforce search over functions [9].
To bias OccamNet toward simpler solutions, we include two regularization terms in the optimization objective, which the user can tune to control the simplicity of OccamNet's output: Activation Regularization and Constant Regularization. These terms penalize the number of activation bases and the number of tunable constants used in the resulting expression, respectively.
We may further constrain the model’s search space by accounting for the units of the variables present in the data. For instance, a dataset on economic growth might include variables with units such as dollars per capita or units of labor per year. Upon user specification of the units of each input and target variable, OccamNet checks whether the units of a sampled function match those of the target variable. It then adds a Unit Regularization term to penalize functions where the units do not match. While OccamNet may also add a unitnormalizing constant in front of the mismatched terms, this would come at the cost of additional complexity. The implementation of all of the regularization terms is described in more detail in Appendix A.1.
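The combined objective can be sketched as follows; the penalty-weight names (`lambda_act`, `lambda_const`, `lambda_unit`) are illustrative placeholders, not OccamNet's actual parameter names.

```python
# Illustrative combined objective: data error plus penalties on the number of
# activation bases, the number of tunable constants, and unit mismatch.
def regularized_loss(mse, n_activations, n_constants, units_match,
                     lambda_act=0.01, lambda_const=0.01, lambda_unit=1.0):
    penalty = lambda_act * n_activations + lambda_const * n_constants
    if not units_match:
        penalty += lambda_unit  # flat penalty for dimensionally inconsistent fits
    return mse + penalty

# A slightly worse but much simpler fit can win under regularization:
complex_fit = regularized_loss(mse=0.10, n_activations=6, n_constants=4, units_match=True)
simple_fit = regularized_loss(mse=0.12, n_activations=2, n_constants=1, units_match=True)
print(simple_fit < complex_fit)  # True
```

Raising the penalty weights trades accuracy for simplicity, which is the knob swept in the Pareto-front experiments of section III.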
II.2 Differential equation discovery
For time-dependent datasets, we wish to discover dynamical systems of the form $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x})$, where $\mathbf{x}(t)$ is a vector denoting the state of the system at time $t$. The input to OccamNet is a time-series dataset

(3) $D = \{\mathbf{x}(t_0), \mathbf{x}(t_0 + \Delta t), \ldots, \mathbf{x}(t_0 + N\Delta t)\}$

where we have $N + 1$ samples that are a time step $\Delta t$ apart, starting from an initial state $\mathbf{x}(t_0)$ at $t = t_0$. Our regression target is then given by $\dot{\mathbf{x}}$. Fitting dynamical equations follows the same process as described in II, with the exception that $\dot{\mathbf{x}}$ is first computed numerically from the input data using the central difference formula $\dot{\mathbf{x}}(t) \approx [\mathbf{x}(t + \Delta t) - \mathbf{x}(t - \Delta t)]/(2\Delta t)$, which accounts for second-order error in the derivative approximation. Since each $\mathbf{x}(t)$ is often contaminated with noise, it may be necessary to filter both $\mathbf{x}$ and $\dot{\mathbf{x}}$ to counter differentiation error, as described in Appendix A.2.
While we primarily evaluate OccamNet on systems of nonlinear differential equations in which the target variables are decoupled, we note that our model can also be applied to coupled systems by fitting the equations individually and adding the output of each fit as an input variable. For instance, for a system of two ODEs with $\dot{x}_1$ and $\dot{x}_2$ as regression targets, one can use OccamNet to fit $\dot{x}_1$ and append the predicted time series to the input data for fitting $\dot{x}_2$.
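The derivative targets described above can be sketched with a central-difference helper (a generic numerical recipe, not code from our system):

```python
import numpy as np

# Central-difference derivative for an evenly sampled trajectory:
# x'(t) ~ (x(t + dt) - x(t - dt)) / (2 dt), second-order accurate,
# at the cost of dropping the two endpoint samples.
def central_difference(x, dt):
    x = np.asarray(x, dtype=float)
    return (x[2:] - x[:-2]) / (2.0 * dt)

dt = 0.01
t = np.arange(0.0, 1.0, dt)
dxdt = central_difference(np.sin(t), dt)       # estimate d/dt sin(t)
max_err = np.max(np.abs(dxdt - np.cos(t[1:-1])))
print(max_err < dt ** 2)  # True: error scales as O(dt^2)
```

For noisy data, this step is preceded by the smoothing described in Appendix A.2, since differentiation amplifies high-frequency noise.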
II.3 Data ensemble learning by weight-sharing
Datasets used in social science often contain both cross-sectional and time series information. Take for example a dataset containing economic indicators across $M$ countries, where the rows of each panel $D_i$ are time-dependent vectors of variables $\mathbf{x}_i(t)$. Furthermore, each country has a vector of hidden parameters $\mathbf{c}_i$ that is inaccessible to the researcher. Even if the variables are related by the same functions $\mathbf{f}$ for each country $i$, solutions to the dynamical system may be qualitatively very different because each country may have different parameters and initial conditions.
Learning on ensembles of data by sharing the weights of OccamNet across multiple panels allows us to aggregate the statistical power of such datasets and reconstruct the dynamical equations even if $\mathbf{c}_i$ is inaccessible and individual country trajectories do not traverse the full phase space of the system. This is done by fixing a shared set of functions $\mathbf{f}$ for all panels and fitting the optimal constants $\mathbf{c}_i$ for each panel individually. The system repeats this step for a specified number of epochs and selects the optimal functions $\mathbf{f}^*$ with the lowest mean squared error with respect to the target data. To support ensembles with datasets of varying sizes, we normalize the total error by dividing each individual dataset error by the number of samples in the given panel.

Since the loss computed at each epoch is a distribution across the panels, different summary statistics may be used to select the optimal symbolic model. For instance, the loss may be computed as the mean or median. However, because the model is generally nonlinear, errors of the form $\mathbf{f}(\mathbf{x} + \boldsymbol{\epsilon})$, where $\boldsymbol{\epsilon}$ is a vector of Gaussian-distributed entries with zero mean, can lead to highly skewed distributions that require alternative measures of central tendency. It may also be the case that certain outliers require additional variables or theoretical models to explain. In this case, the researcher can choose to identify the optimal formal model from the error histogram. The data ensemble learning modality is demonstrated in Fig. 1.

III Case Studies
We validate our AI-assisted approach on synthetic and real-world datasets related to social science, covering both static distributions and dynamical systems. All dataset preprocessing details and model parameters are available in Appendix B.
III.1 Static distributions
We first evaluate our neurosymbolic regression method for the discovery of social science models on static distributions without time dependence. We extract symbolic relations as illustrated in Fig. 2, which summarizes our results on fitting datasets of single variables, such as network densification and degree distributions, as well as a multivariable distribution for the Cobb-Douglas production function [14].
III.1.1 Network properties
Beginning with Barabási and Albert's seminal paper on scale-free networks [3], graph formation properties and degree distributions have become an increasingly active area of research. Network structures are studied in economics and sociology in the context of clustering, information spread, and skewed distributions, such as those exemplifying the "rich get richer" effect in preferential attachment processes. Here, we focus on two sets of persistent properties observed in human networks: densification laws and degree distributions.
In Fig. 2 (a), we use neurosymbolic regression to find the network densification process in citation networks. Densification occurs when the number of edges in a graph grows superlinearly in the number of nodes over time. We use OccamNet to reconstruct the densification law of the citation graph of high energy physics preprints on arXiv from January 1993 to April 2003, where the number of edges $e(t)$ grows according to $e(t) \propto n(t)^a$ with $a > 1$, as shown by Leskovec et al. [38]. Our fit closely matches their baseline densification exponent, indicating that the number of citations in each paper on average grows over time. Note that we expect OccamNet to produce a fit with slightly higher error compared to the Ordinary Least Squares solution, since our model has the additional task of finding the correct functional form before fitting the constants using gradient descent. For more accurate constant parameters, one may opt to take the functional form output by OccamNet and then fit the parameters using nonlinear least squares.
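As a sketch of the baseline procedure, the densification exponent can be recovered by ordinary least squares in log-log space; the node/edge counts below are synthetic stand-ins, not the arXiv citation data.

```python
import numpy as np

# Densification law e(t) = c * n(t)^a  =>  log e = log c + a * log n,
# so the exponent a is the slope of an OLS fit in log-log space.
def densification_exponent(nodes, edges):
    a, log_c = np.polyfit(np.log(nodes), np.log(edges), 1)
    return a, np.exp(log_c)

n = np.array([100.0, 300.0, 1000.0, 3000.0, 10000.0])  # node counts over time
e = 0.5 * n ** 1.6                                     # superlinear edge growth
a, c = densification_exponent(n, e)
print(a > 1)  # True: the synthetic graph densifies
```

OccamNet must instead discover the power-law form itself before its constants are fit, which is why its error is slightly higher than this direct OLS baseline.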
Next, we use OccamNet to discover the functional form of various network degree distributions. Power-law distributions of the form $p(k) \propto k^{-\alpha}$ have been found in contexts ranging from hyperlink graphs [1] to biological networks [2]. These distributions have been proposed to be universal across a wide range of systems, spawning extensive literature about their formation mechanisms and their links to network dynamics. However, alternative degree distributions like lognormal, Weibull, and exponential have also been observed in real-world networks, stimulating new work on finding alternative mechanisms that explain these functional forms [8]. Employing neurosymbolic regression, we can go beyond the standard set of distributions to find any function drawn from the specified bases.
Fig. 2 (b) shows the degree distribution for both Wikipedia article links [58] and the high energy physics citation network. OccamNet is trained on both datasets using the same complexity regularization. In the Wikipedia dataset, the system discovers the power-law distribution independently, while in the citation dataset, it finds that the degree distribution is better fit by a more complex functional form. The two examples underscore the principle of Occam's razor. For a given level of complexity regularization, the system picks the simpler functional form for the Wikipedia dataset because this dataset more closely resembles a power law. On the other hand, OccamNet discovers a more complex but far more accurate expression for the HEP-TH dataset because this dataset deviates significantly from simpler models. The citation network fit performs better than both the power-law and lognormal distributions.
It is worth noting that mean-square error minimization does not generally provide the optimal estimator for any particular degree distribution. For instance, least-squares fitting can produce inaccurate estimates for power-law distributions, and goodness-of-fit tests such as likelihood ratios and the Kolmogorov-Smirnov statistic must be used to evaluate the model [13]. However, in the case where the functional form is not known a priori, we opt to use mean-square error minimization as a Bayesian posterior with a noninformative prior. Other, informative priors can be used once the set of functions is narrowed down, for instance, by performing maximum-likelihood estimation for each of the hypotheses proposed by the symbolic regression system.
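For instance, once symbolic regression has narrowed the candidate family to a power law, the continuous maximum-likelihood estimator $\hat{\alpha} = 1 + n\,[\sum_i \ln(x_i/x_{\min})]^{-1}$ discussed in [13] can replace least squares. A sketch on synthetic power-law samples:

```python
import math
import random

# Continuous power-law MLE: alpha_hat = 1 + n / sum(ln(x / xmin)),
# applied to the tail x >= xmin.
def powerlaw_mle(xs, xmin):
    tail = [x for x in xs if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

# Draw samples from p(x) ~ x^(-alpha) for x >= 1 via inverse-CDF sampling.
rng = random.Random(42)
alpha_true = 2.5
xs = [(1.0 - rng.random()) ** (-1.0 / (alpha_true - 1.0)) for _ in range(50_000)]
print(abs(powerlaw_mle(xs, xmin=1.0) - alpha_true) < 0.05)  # True
```

This two-stage pattern (symbolic search for the form, then a likelihood-based refit of the constants) mirrors the workflow suggested above.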
III.1.2 Production functions
One of the best-known symbolic relations in economics is the Cobb-Douglas production function [14]. Its introduction in 1928 represented the first time an aggregate production function was developed and statistically estimated [5]. Predicting that a stable relationship between inputs and outputs could be derived from empirical data, the authors found a robust linear relationship between the log output-to-capital ratio and the log labor-to-capital ratio of the American manufacturing industry. This relationship was later shown to hold across countries and timeframes [19]. The function has since been widely adopted as a measure of productivity, as a utility function, and as a foundation of subsequent models such as the Solow-Swan model of long-run economic growth [52, 54].
In its most common form, the law states that the total production (output) $Y$ of a single good with two input factors, capital $K$ and labor $L$, is given by

(4) $Y = A K^{\alpha} L^{\beta}$

where $A$ is typically a constant representing total factor productivity (the ratio of output not explained by the quantity of inputs), and $\alpha$ and $\beta$ are the output elasticities of capital and labor, respectively. In cases where we expect to see constant returns to scale (i.e., an increase in the input quantity leads to an equivalent increase in the output), $\alpha$ and $\beta$ are additionally assumed to satisfy the constraint $\alpha + \beta = 1$.
Cobb and Douglas used a US manufacturing dataset covering annual production from 1899-1922 to fit the linear regression $\log(Y/K) = \log A + \beta \log(L/K)$ by ordinary least squares. The resulting constant parameters were $A = 1.01$ and $\beta = 0.75$, giving the equation $Y = 1.01\,L^{0.75} K^{0.25}$, which is plotted in Fig. 2 (c). We explore the expressions for $Y$ predicted by OccamNet for the same dataset at varying levels of complexity regularization. We define the complexity value of an expression as the sum of the number of activation bases and constants. While there are other possible measures of complexity, such as the number of bits needed to store the symbol string of the expression [56], our definition matches the regularization options provided for OccamNet training as described in II.1. We also note that any complexity metric will not be an exact measure of parsimony, which is typically defined according to the social and paradigmatic context through which the expression is interpreted.

The Pareto front of the complexity-accuracy trade-off from sweeping across Activation Regularization parameter values is shown in Fig. 2 (c). The full specification of model parameters is included in Appendix B.3. Without any regularization, OccamNet discovers an expression with lower loss and higher complexity than the Cobb-Douglas baseline. Moreover, with a small amount of regularization, our model finds an equation with both lower loss and lower complexity than the baseline. The Cobb-Douglas function, therefore, lies outside the Pareto front – under our particular definition of complexity – for the given US manufacturing dataset. We recover a Cobb-Douglas-like expression of the form $Y = A K^{\alpha} L^{\beta}$ only after restricting the set of possible bases for the regression to the ones found in (4). With only 24 samples available in the data, our model benefits from a stronger inductive bias on the target function to reduce the risk of overfitting.
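The Pareto-front construction used here can be sketched as a simple dominance filter over (complexity, loss) pairs collected from the regularization sweep; the candidate expressions below are hypothetical placeholders, not our actual results.

```python
# Keep only candidates not dominated by another candidate, i.e. not
# weakly worse on both complexity and loss than some other expression.
def pareto_front(candidates):
    front = []
    for expr, comp, loss in candidates:
        dominated = any(c2 <= comp and l2 <= loss and (c2, l2) != (comp, loss)
                        for _, c2, l2 in candidates)
        if not dominated:
            front.append((expr, comp, loss))
    return sorted(front, key=lambda t: t[1])

# Hypothetical sweep results: (expression label, complexity, loss).
candidates = [
    ("A*K^a*L^b", 4, 0.020),   # Cobb-Douglas-like form
    ("c*K",       2, 0.080),   # very simple, worse fit
    ("f_complex", 7, 0.015),   # unregularized, best fit
    ("f_bad",     5, 0.060),   # dominated: more complex AND worse than row 1
]
print(pareto_front(candidates))
```

In this toy sweep, `f_bad` is removed because the first expression fits better with lower complexity; the remaining three trace out the complexity-accuracy trade-off a researcher would inspect.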
In contrast to the complexity metric, the true definition of parsimony, of course, depends on how the expression is interpreted through an economic lens. This is why the use of symbolic regression must be viewed as a collaborative human-machine process. While the Cobb-Douglas function has a strong theoretical foundation, OccamNet offers a method both for discovering new models and for validating the existing expression. The ability to customize the library of bases and the depth of each symbolic expression allows for the straightforward injection of existing knowledge about the expected model. Sparse and noisy datasets such as the one above would often be better suited to this modality.
III.2 Dynamical laws
We next apply our neurosymbolic regression method to differential equation discovery. OccamNet is primarily evaluated on systems of nonlinear differential equations, where each ODE is fit independently. We use both synthetic and realworld datasets to fit models of population dynamics, disease spread, and economic growth.
III.2.1 Predator-prey relationships
Ecologists have long relied on mathematical models of population dynamics for studying ecological processes such as predator-prey interactions. While these models – often taking the form of differential equations – were historically derived from first principles and rigorous theory-building, symbolic regression offers a promising tool for reverse-engineering ecological models from data [44].
A simple yet paramount model of predator-prey interactions between two species is given by the Lotka-Volterra equations [40],

(5) $\dot{x} = \alpha x - \beta x y$, $\quad \dot{y} = \delta x y - \gamma y$

where $x$ and $y$ are the populations of the prey and predator, respectively. The constant parameter $\alpha$ is the rate of exponential population growth for the prey, $\gamma$ is the rate of exponential death for the predator, $\beta$ is the rate at which the predators destroy the prey, and $\delta$ is the rate of predator population growth due to the consumption of prey. Inspired by the chemical law of mass action [4], the Lotka-Volterra equations state that the rate of predation is directly proportional to the product of the populations of the prey and predator.
The Lotka-Volterra predator-prey model has significant implications beyond ecology. Richard Goodwin was one of the first to adopt the model to describe the cyclical relationship between employment and wages in economics, otherwise known as Goodwin's class struggle model [25]. Other applications of generalized Lotka-Volterra systems in economics include the modeling of wealth distributions in society and values of firms in the stock market [43].
In Fig. 3 (a), we simulate a system of Lotka-Volterra equations as defined in (5) with fixed synthetic parameters $\alpha$, $\beta$, $\gamma$, and $\delta$. We then use OccamNet to fit $\dot{x}$ and $\dot{y}$ and simulate our generated differential equations using the same initial conditions. Our model is able to rediscover the exact functional form and constant parameters.
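A sketch of this synthetic setup, integrating (5) with a standard RK4 stepper; the parameter values here are illustrative, not the ones used in our experiment.

```python
import numpy as np

# Lotka-Volterra vector field (5) with illustrative parameters.
def lotka_volterra(z, alpha=1.1, beta=0.4, gamma=0.4, delta=0.1):
    x, y = z
    return np.array([alpha * x - beta * x * y, delta * x * y - gamma * y])

# Classic fourth-order Runge-Kutta integrator for an autonomous system.
def rk4(f, z0, dt, steps):
    traj = [np.asarray(z0, dtype=float)]
    for _ in range(steps):
        z = traj[-1]
        k1 = f(z)
        k2 = f(z + 0.5 * dt * k1)
        k3 = f(z + 0.5 * dt * k2)
        k4 = f(z + dt * k3)
        traj.append(z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(traj)

traj = rk4(lotka_volterra, [10.0, 5.0], dt=0.01, steps=5000)
print(traj.min() > 0)  # True: both populations stay positive over the cycles
```

The resulting trajectory, differentiated numerically as in section II.2, is exactly the kind of input from which OccamNet rediscovers the functional form.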
We then apply OccamNet to recover the Lotka-Volterra model from the Hudson Bay Company's data on the lynx and hare populations in Canada from 1845-1935 [41]. As shown in Fig. 3 (b), we use the subset of records from 1900-1920, as it contains two population cycles with the least apparent noise. We additionally apply cubic-spline interpolation to reduce noise before fitting the numerically computed derivatives, as discussed in Appendix A.2.
When fitting the interpolated data using OccamNet, we recover a system of ODEs that closely resembles the Lotka-Volterra equations. In particular, our model is able to learn the exponential population growth and decay terms $\alpha x$ and $-\gamma y$, as well as the critical interaction terms of the form $-\beta x y$ and $\delta x y$. Additionally, the best OccamNet fit includes a small constant correction term in each equation. These constants have much smaller magnitudes relative to the other terms and may be due to overfitting. A researcher may choose either to ignore these kinds of correction terms or to increase the level of constant regularization in OccamNet to obtain a fit with higher loss but potentially better generalizability.
III.2.2 Epidemic spread
Next, we use OccamNet to discover compartmental models of epidemics. In particular, we use synthetic and real datasets for the well-known Susceptible-Infected-Recovered (SIR) [34] model for fixed population sizes, which is given by the ODE system

(6) $\dot{S} = -\beta S I$, $\quad \dot{I} = \beta S I - \gamma I$, $\quad \dot{R} = \gamma I$

where $S$, $I$, and $R$ represent the proportions of the susceptible, infected, and recovered populations at time $t$, respectively, and where $S + I + R = 1$.
The SIR model and numerous variants are often used to describe the spread of infectious diseases such as measles [6] or COVID-19 [15]. Such models are valuable for predicting the magnitude or duration of an epidemic, estimating epidemiological parameters such as the basic reproduction number [57] (given by $R_0 = \beta/\gamma$ in the SIR model), and for forecasting the impact of public health interventions. Beyond modeling disease spread, SIR variants have also been used to study the diffusion of information on social networks [60, 37], and thus carry substantial relevance to social science.
In Fig. 3 (c), we simulate a synthetic time series using the SIR model. OccamNet discovers the correct functional form of the SIR model along with approximately correct constant parameters up to rounding. The deviations of the learned constant parameters likely stem from higher-order errors in the numerical derivative estimate that are not addressed by the central difference formula. One could opt to use higher-order derivative approximations to account for such errors if necessary.
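A sketch of this synthetic experiment: simulate (6), differentiate the trajectory numerically, and invert $\dot{S} = -\beta S I$ pointwise to recover the contact rate. The parameter values below are illustrative, not the ones used in our run.

```python
import numpy as np

# Forward-Euler simulation of the SIR system (6) on proportions S + I + R = 1.
beta, gamma, dt, steps = 0.3, 0.1, 0.1, 1000
S, I, R = [0.99], [0.01], [0.0]
for _ in range(steps):
    s, i, r = S[-1], I[-1], R[-1]
    S.append(s - beta * s * i * dt)
    I.append(i + (beta * s * i - gamma * i) * dt)
    R.append(r + gamma * i * dt)

S, I = np.array(S), np.array(I)
dS = (S[2:] - S[:-2]) / (2 * dt)               # central-difference derivative of S
beta_hat = np.mean(-dS / (S[1:-1] * I[1:-1]))  # invert dS/dt = -beta*S*I, averaged
print(abs(beta_hat - beta) < 0.01)  # True: close to the true contact rate
```

The small residual bias in `beta_hat` comes from the finite-difference and integration errors, the same source of parameter deviation noted above for the OccamNet fit.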
Next, we demonstrate the data ensemble learning functionality of OccamNet on a panel dataset of measles infection data in the UK. Horrocks and Bauch used the SINDy (Sparse Identification of Nonlinear Dynamics) algorithm [9] to fit a variant of the SIR model with time-varying contact rates to this measles dataset, as well as to chickenpox and rubella data [32]. We note that SINDy requires the specification of a preset library of possible terms for which the algorithm learns sparse vectors of coefficients. OccamNet instead uses a specified library of simple bases to compose more complex terms with no additional inductive bias necessary. Our method requires less prior knowledge about the expression structure and is thus better suited to deriving new functional forms.
Horrocks and Bauch fit the entire measles time series to a single equation for the infected population with a time-varying contact rate [32]. We instead demonstrate OccamNet's ability to discover the SIR model for each cycle of the epidemic with different $\beta$ and $\gamma$ parameters. Using a subset of the data from 1948-1958, we generate an ensemble of five measles incidence cycles. We then apply the denoising techniques of [9] and fit each dataset both individually and as an ensemble in which we learn the same functional form for all five datasets with varying constant parameters.
Fig. 3 (d) highlights the advantage of ensemble learning over individual fits. When each two-year cycle is fit independently, OccamNet struggles to learn expressions with the SIR-like form of $\beta S I$. Due to the sparsity and noisiness of each individual dataset, it extracts the interaction term from only one of the five periods. The ensemble fit, however, discovers a function that includes the key form $c S I$, as shown in the last row of the table in Fig. 3 (d), where the parameter $c$ is a placeholder for a constant that varies for each cycle. Ensemble learning can therefore help avoid overfitting to individual datasets and improve generalization. While our model also learns higher-order terms for the ensemble, these terms are of much smaller magnitude compared to the leading terms and are thus of less importance to the correct fit. This is another example in which OccamNet's custom regularization capabilities could be applied to eliminate higher-order terms.
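The per-panel scoring behind this ensemble fit can be sketched as follows; this is a toy illustration of the weight-sharing objective from section II.3, not the OccamNet implementation.

```python
import numpy as np

# A shared functional form f(x; c) is evaluated on every panel with
# panel-specific constants; each panel's squared error is normalized by its
# sample count, and a robust summary (here the median) ranks the shared form.
def panel_error(f, x, y, c):
    return np.sum((f(x, c) - y) ** 2) / len(y)  # per-sample normalization

def ensemble_score(f, panels, constants, summary=np.median):
    errors = [panel_error(f, x, y, c) for (x, y), c in zip(panels, constants)]
    return summary(errors)

# Toy ensemble: the same form y = c * x with a different c per panel,
# mimicking epidemic waves that share structure but not parameters.
rng = np.random.default_rng(0)
panels, constants = [], []
for c in (1.0, 2.5, -0.7):
    x = rng.uniform(-1, 1, size=50)
    panels.append((x, c * x))
    constants.append(c)

f = lambda x, c: c * x
print(ensemble_score(f, panels, constants))  # 0.0 for the true shared form
```

Because every panel contributes evidence for the same functional form while the constants remain free, sparse individual waves can jointly pin down structure that no single wave identifies.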
III.2.3 Long-run economic growth
The final dynamical system we consider is the neoclassical Solow-Swan model of economic growth [52]. The model postulates that long-run economic growth can be explained by capital accumulation, labor growth, and technological progress. The model typically assumes Cobb-Douglas-type aggregate production with constant returns to scale, given by

(7) $y = k^{\alpha}$

where $y$ is the output per unit of effective labor, $k$ is the capital intensity (capital stock per unit of effective labor), and $\alpha$ is the elasticity of output with respect to capital. The central differential equation in the Solow model describes the dynamics of $k$,

(8) $\dot{k} = s k^{\alpha} - (n + g + \delta) k$

where $s$ is the savings rate, $n$ is the population growth rate, $g$ is the technological growth rate, and $\delta$ is the capital stock depreciation rate. This equation states that the rate of change of capital stock is equal to the difference between the rate of investment and the rate of depreciation. The key result of the Solow growth model is that greater savings and investment do not affect the rate of economic growth in the long run.
While the Solow model was originally derived to describe U.S. economic growth, it has also been applied to other countries. If the model were perfectly universal, we would expect to see every country's capital grow according to equation (8), with varying hidden parameters. We thus generate a synthetic example of country panel data for economic growth by simulating the Solow equations for $k$ and $y$, with the goal of rediscovering equation (8). As an additional level of complexity, we model the savings rate $s$ and the population growth rate $n$ as time-dependent variables that grow according to their own simple growth equations. Each panel dataset is generated by randomly sampling initial conditions and parameters from uniform distributions of values outlined in Appendix B.6.1. The resulting cross-country time series is displayed in Fig. 1. As demonstrated in the figure, OccamNet is able to recover the exact equation for $\dot{k}$ (denoted as Output 1) with varying parameters for each of the 20 synthetic panels.

We then attempt to fit the Solow model of capital growth to real-world, noisy country data. In particular, we select macroeconomic data from 18 of the original 20 countries in the Organisation for Economic Co-operation and Development (OECD) for which data was available. The input to our symbolic regression includes data on capital per capita, income per capita, savings rate, and population growth compiled by Khoo et al. [35], where it was utilized for regression under the Green Solow model – a variant of Solow-Swan for modeling sustainable growth [7]. The data are originally sourced from the Penn World Tables [22], the Maddison Project database [42], and the World Development Indicators [61]. There is no available data for the remaining parameters $g$ and $\delta$, so they are instead treated as learnable constants. We also apply Savitzky-Golay filtering to smooth the data before running the regression, as described in Appendix B.6.2.
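A sketch of the synthetic panel generation, with illustrative sampling ranges in place of the ones listed in Appendix B.6.1: each "country" draws hidden parameters and an initial condition, then integrates equation (8).

```python
import numpy as np

# Generate Solow-style panels: sample (s, n, g, delta, alpha) and k(0) per
# country, then integrate dk/dt = s*k^alpha - (n + g + delta)*k with Euler.
rng = np.random.default_rng(1)

def solow_panel(T=100, dt=0.1):
    s, n, g, delta, alpha = rng.uniform([0.10, 0.00, 0.00, 0.02, 0.2],
                                        [0.40, 0.03, 0.03, 0.08, 0.5])
    k = [rng.uniform(0.5, 2.0)]  # initial capital intensity
    for _ in range(T - 1):
        k.append(k[-1] + dt * (s * k[-1] ** alpha - (n + g + delta) * k[-1]))
    return np.array(k), dict(s=s, n=n, g=g, delta=delta, alpha=alpha)

panels = [solow_panel() for _ in range(20)]
print(all(k.min() > 0 for k, _ in panels))  # True: capital stays positive
```

Feeding these 20 trajectories to the weight-sharing fit of section II.3 with the true parameter dictionaries hidden reproduces the rediscovery task shown in Fig. 1.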
In Fig. 4(b), we compare the Solow-Swan model baseline to three expressions produced by OccamNet under increasing levels of complexity regularization. The Solow baseline is generated by finding the best-fit parameters in a least-squares fit of equation (8). The expression learned with no regularization contains constants that vary across countries. While this expression has a lower weighted mean-squared error (WMSE) than the baseline, its functional form does not carry immediate economic intuition like the Solow model.
We then add Constant Regularization as described in II.1, which results in an equation closely matching the functional form of (8). This suggests that the Solow model has strong external validity, as it can be discovered without any strong human priors. Finally, we apply Activation Regularization in addition to Constant Regularization; the resulting expression contains only one term from the Solow model and has the highest WMSE. The example in Fig. 4(b) concretely demonstrates the trade-off between accuracy and simplicity in the discovery of symbolic models. A researcher would thus benefit from running OccamNet with several specifications of regularization to select a result with the desired level of precision and complexity.
IV Discussion
This article aims to test and demonstrate the usefulness of neurosymbolic methods in social science. We argue that their contribution is twofold. First, human-engineered quantitative models remain a critical part of social science research, as they provide a way to formalize intuition or qualitative arguments about a social system and to bridge micro-macro theories. Recognizing that, neurosymbolic methods help researchers benchmark these formal models against counterfactuals on the complexity-accuracy Pareto front: a practice not often followed when new quantitative models are proposed. This is partly due to the difficulty of generating alternatives aside from baseline linear models or theories from the existing literature. The selection strategy of symbolic methods allows intentional tuning of the trade-off between complexity (or meaningfulness, with human-aided post-selection) and prediction accuracy that traditional parametric regression and deep learning methods resolve implicitly.
Second, neurosymbolic methods can also help social scientists discover new, interpretable models or generate novel hypotheses from data. Expressions on the Pareto front can be interpreted and tested without prejudice. Aside from complete models, terms that repeatedly appear along the front may also uncover persistent interactions [51]. Our proposed approach is powerful because it allows the user to tune the amount of bias entered into the model. While models developed ab initio from a theoretical basis can be powerful in elucidating the implications of a present paradigm and incorporating human knowledge about human systems, the underconstrained nature of social science theories means that the researcher has to be conscious of their personal, experiential, or theoretical biases [31, 30]. Using neurosymbolic regression, the user can opt to inject a minimal amount of bias (e.g., provide a wide range of functional bases) or intentionally specify terms that existing theories may suggest.
Broadly speaking, two classes of quantitative models are typically employed in social science [12]. Parametric regression has proven invaluable over the past few decades. Since data and computational resources have become widely available, researchers have made thorough use of methods like generalized linear models for causal inference [46]. Although parametric regression offers less expressive power than formal models, it provides benefits in interpretability and data efficiency, especially when there is strong reason to believe that the underlying relationship takes a simple (e.g., linear) form. Its key drawback, however, is that this interpretability and efficiency are derived from strong assumptions about how variables interact. More complex behaviors, such as temporal dynamics or nonlinear relations between multiple variables, are ubiquitous in social science but can be hidden by using simpler parametric methods. Likewise, nonlinear optimization methods for fitting specified models [45] also require an a priori structure.
As an alternative approach, social scientists can use nonparametric methods that make weak assumptions about the distribution of the data, such as kernel density estimation, local regression, or deep neural networks. These models are useful for prediction policy problems [36] – where prediction accuracy, not causal identifiability, is prioritized – and for capturing relationships in systems that often operate in highly nonlinear regimes. However, these methods come at the cost of interpretability. We suggest that neurosymbolic regression can offer a bridge between these two classes of statistical methods. Between specifying a complete model and letting the system search for optimal parameters, on the one hand, and leaving the search open-ended to seek a parsimonious expression, on the other, the user can provide any amount of human insight. For instance, they may specify the presence of a power-law relationship, an interaction term, or a derivative, and use the system to compose the rest of the model. In sections III.1 to III.2, we demonstrate each of these use cases and show that the framework can successfully help social scientists discover empirical regularities in the data with interpretable expressions.

Beyond interpretability, human-supplied biases are also useful in reducing the model search space, particularly when the problem faces a dearth of informative data. Since this is often the case in social science, we stress that the use of neurosymbolic methods must be treated as a collaborative human-machine process in which a dictionary of motifs helps make inference problems tractable, as has been suggested by past work [10, 18]. Humans also play a critical role in the validation and interpretation of the resultant models. Neurosymbolic methods merely suggest functional fits; using these equations to paint a meaningful picture in the context of social science theory remains a human endeavor. The discovery of a formal model is often just the beginning, informing additional data collection, experiments, and qualitative inquiry [21]. A productive approach to quantitative social science needs to consider both the scientific and sociological roles of modeling beyond its predictive capability alone.
Neurosymbolic regression holds the potential to be a powerful tool for helping social scientists with hypothesis creation and model discovery. An exciting next step would be to collaborate with social scientists in applying OccamNet to their research, particularly to datasets that are conjectured to capture a symbolic law. In problems with more available data and multiple independent panels, a machine learning approach allows data to be divided into training and validation sets – provided that unexplained errors between the two sets are uncorrelated – to improve the out-of-distribution performance, or external validity, of the learned model. We hope to see our proposed human-machine framework applied to unearth entirely new quantitative or formal models. Furthermore, OccamNet's flexibility in incorporating user-defined bases such as piecewise functions and logical operators allows for moving beyond learning arithmetic expressions. Future work could include applying OccamNet to learn more complex symbolic rules such as strategies in economic games [20] or network-structured models.
Acknowledgements.
We would like to thank Megan Yamoah for useful discussions during the drafting of this article. This work was sponsored in part by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work was also sponsored in part by the National Science Foundation under Cooperative Agreement PHY-2019786 (The NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/) and in part by the Air Force Office of Scientific Research under award number FA9550-21-1-0317.

References
 [1] (199909) Diameter of the World-Wide Web. Nature 401 (6749), pp. 130–131. External Links: ISSN 00280836, 14764687, Link, Document Cited by: §III.1.1.
 [2] (200511) Scale-free networks in cell biology. Journal of Cell Science 118 (21), pp. 4947–4957. External Links: ISSN 14779137, 00219533, Link, Document Cited by: §III.1.1.
 [3] (199910) Emergence of Scaling in Random Networks. Science 286 (5439), pp. 509–512. Note: Publisher: American Association for the Advancement of Science Section: Report External Links: ISSN 00368075, 10959203, Link, Document Cited by: §III.1.1.
 [4] (1992) The Origins and Evolution of Predator-Prey Theory. Ecology 73 (5), pp. 1530–1535. Note: Publisher: Ecological Society of America External Links: ISSN 00129658, Link, Document Cited by: §III.2.1.
 [5] (201205) Retrospectives: The Introduction of the Cobb-Douglas Regression. Journal of Economic Perspectives 26 (2), pp. 223–236. External Links: ISSN 08953309, Link, Document Cited by: §III.1.2.
 [6] (2002) Dynamics of Measles Epidemics: Estimating Scaling of Transmission Rates Using a Time Series SIR Model. Ecological Monographs 72 (2), pp. 169–184. Note: Publisher: Ecological Society of America External Links: ISSN 00129615, Link, Document Cited by: §III.2.2.
 [7] (200406) The Green Solow Model. Working Paper Technical Report 10557, National Bureau of Economic Research. Note: Series: Working Paper Series External Links: Link, Document Cited by: §III.2.3.
 [8] (201912) Scale-free networks are rare. Nature Communications 10 (1), pp. 1017. External Links: ISSN 20411723, Link, Document Cited by: §III.1.1.
 [9] (201604) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences 113 (15), pp. 3932–3937. External Links: ISSN 00278424, 10916490, Link, Document Cited by: §A.2, §II.1, §III.2.2, §III.2.2.
 [10] (202012) Automated Discovery of Relationships, Models, and Principles in Ecology. Frontiers in Ecology and Evolution 8, pp. 530135. External Links: ISSN 2296701X, Link, Document Cited by: §I, §IV.
 [11] (201301) Universality in voting behavior: an empirical analysis. Scientific Reports 3 (1), pp. 1049. Note: Number: 1 Publisher: Nature Publishing Group External Links: ISSN 20452322, Link, Document Cited by: §I.
 [12] (201912) Revealing Complex Ecological Dynamics via Symbolic Regression. BioEssays 41 (12), pp. 1900069. External Links: ISSN 02659247, 15211878, Link, Document Cited by: §IV.
 [13] (200911) Power-Law Distributions in Empirical Data. SIAM Review 51 (4), pp. 661–703. External Links: ISSN 00361445, 10957200, Link, Document Cited by: §III.1.1.
 [14] (1928) A Theory of Production. The American Economic Review 18 (1), pp. 139–165. Note: Publisher: American Economic Association External Links: ISSN 00028282, Link Cited by: §B.3, §III.1.2, §III.1.
 [15] (202010) A SIR model assumption for the spread of COVID19 in different communities. Chaos, Solitons & Fractals 139, pp. 110057. External Links: ISSN 09600779, Link, Document Cited by: §III.2.2.
 [16] (202107) Fast Neural Models for Symbolic Regression at Scale. arXiv:2007.10784 [cs, stat]. Note: arXiv: 2007.10784 External Links: Link Cited by: Appendix B, Figure 1, §I, §II.
 [17] (2020) Discovering Symbolic Models from Deep Learning with Inductive Biases. In Advances in Neural Information Processing Systems, Vol. 33, pp. 17429–17442. Cited by: §I.

 [18] (201012) GPTIPS: an open source genetic programming toolbox for multigene symbolic regression. Lecture Notes in Engineering and Computer Science 2180, pp. . Cited by: §IV.
 [19] (197610) The Cobb-Douglas Production Function Once Again: Its History, Its Testing, and Some New Empirical Values. Journal of Political Economy 84 (5), pp. 903–915. External Links: ISSN 00223808, 1537534X, Link, Document Cited by: §III.1.2.
 [20] (2002) Using Symbolic Regression to Infer Strategies from Experimental Data. In Evolutionary Computation in Economics and Finance, S. Chen (Ed.), Studies in Fuzziness and Soft Computing, pp. 61–82. External Links: ISBN 9783790817843, Link, Document Cited by: §I, §IV.
 [21] (200810) Why Model?. Journal of Artificial Societies and Social Simulation 11 (4). External Links: Link Cited by: §I, §IV.
 [22] (201510) The next generation of the penn world table. American Economic Review 105 (10), pp. 3150–82. External Links: Document, Link Cited by: §III.2.3.
 [23] (199908) Zipf’s Law for Cities: An Explanation. The Quarterly Journal of Economics 114 (3), pp. 739–767. External Links: ISSN 00335533, 15314650, Link, Document Cited by: §I.
 [24] (201206) Formal Models of Bureaucracy. Annual Review of Political Science 15 (1), pp. 353–377. External Links: ISSN 10942939, 15451577, Link, Document Cited by: §I.
 [25] (200712) The Lotka-Volterra Equations in Economics: An Italian Precursor. Economia Politica XXIV, pp. 343–348. Cited by: §III.2.1.

 [26] (201408) The Optimal Hard Threshold for Singular Values is 4/√3. IEEE Transactions on Information Theory 60 (8), pp. 5040–5053. External Links: ISSN 00189448, 15579654, Link, Document Cited by: §A.2.
 [27] (2003) Overview of the 2003 KDD Cup. ACM SIGKDD Explorations Newsletter 5 (2), pp. 149–151. External Links: Document Cited by: §B.1.
 [28] (2014) A dynamical system for pagerank with timedependent teleportation. Internet Mathematics, pp. 1–30. Cited by: §B.2.2.
 [29] (1978) Threshold models of collective behavior. American Journal of Sociology 83 (6), pp. 1420–1443. External Links: ISSN 00029602, 15375390, Link Cited by: §I.
 [30] (199703) Bias in Social Research. Sociological Research Online 2 (1), pp. 7–19. External Links: ISSN 13607804, 13607804, Link, Document Cited by: §I, §IV.
 [31] (201706) Fuck Nuance. Sociological Theory 35 (2), pp. 118–127. External Links: ISSN 07352751, 14679558, Link, Document Cited by: §I, §IV.
 [32] (202012) Algorithmic discovery of dynamic models from infectious disease data. Scientific Reports 10 (1), pp. 7061. External Links: ISSN 20452322, Link, Document Cited by: §B.5.2, §III.2.2, §III.2.2.
 [33] (2021) Designing social inquiry: scientific inference in qualitative research. New edition edition, Princeton University Press, Princeton. External Links: ISBN 9780691224626 9780691224633 Cited by: §I, §I, §II.1.
 [34] (1927) A contribution to the mathematical theory of epidemics. Proceedings of the royal society of london. Series A, Containing papers of a mathematical and physical character 115 (772), pp. 700–721. Cited by: §III.2.2.

 [35] (2021) Neural Ordinary Differential Equations for the Regression of Macroeconomics Data Under the Green Solow Model. In Database and Expert Systems Applications, C. Strauss, G. Kotsis, A. M. Tjoa, and I. Khalil (Eds.), Lecture Notes in Computer Science, Cham, pp. 78–90. External Links: ISBN 9783030864729, Document Cited by: §B.6.2, §III.2.3.
 [36] (201505) Prediction Policy Problems. American Economic Review 105 (5), pp. 491–495. External Links: ISSN 00028282, Link, Document Cited by: §IV.
 [37] (202112) Information diffusion modeling and analysis for socially interacting networks. Social Network Analysis and Mining 11 (1), pp. 11. External Links: ISSN 18695450, 18695469, Link, Document Cited by: §III.2.2.
 [38] (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining  KDD ’05, Chicago, Illinois, USA, pp. 177. External Links: ISBN 9781595931351, Link, Document Cited by: §B.1, §III.1.1.
 [39] (201406) SNAP Datasets: Stanford large network dataset collection. Note: http://snap.stanford.edu/data Cited by: Figure 2.
 [40] (191003) Contribution to the Theory of Periodic Reactions. The Journal of Physical Chemistry 14 (3), pp. 271–274. External Links: ISSN 00927325, 15415740, Link, Document Cited by: §III.2.1.
 [41] (1937) Fluctuations in the Numbers of the Varying Hare (Lepus Americanus). University of Toronto Press. External Links: ISBN 9781487581787, Link Cited by: §B.4.2, §III.2.1.
 [42] (201711) Maddison Database 2010. External Links: Link Cited by: §III.2.3.
 [43] (200209) Theoretical analysis and simulations of the generalized Lotka-Volterra model. Physical Review E 66 (3), pp. 031102. Note: Publisher: American Physical Society External Links: Link, Document Cited by: §III.2.1.
 [44] (201805) Reverseengineering ecological theory from data. Proceedings of the Royal Society B: Biological Sciences 285 (1878), pp. 20180422. External Links: ISSN 09628452, Link, Document Cited by: §I, §III.2.1.
 [45] (196501) A Simplex Method for Function Minimization. The Computer Journal 7 (4), pp. 308–313. External Links: ISSN 00104620, 14602067, Link, Document Cited by: §IV.
 [46] (1972) Generalized Linear Models. Journal of the Royal Statistical Society. Series A (General) 135 (3), pp. 370. External Links: ISSN 00359238, Link, Document Cited by: §IV.
 [47] (199308) Formal Models of Collective Action. Annual Review of Sociology 19 (1), pp. 271–300. External Links: ISSN 03600572, 15452115, Link, Document Cited by: §I.
 [48] (201905) Influential factors of carbon emissions intensity in OECD countries: Evidence from symbolic regression. Journal of Cleaner Production 220, pp. 1194–1201. External Links: ISSN 09596526, Link, Document Cited by: §I.
 [49] (2015) The network data repository with interactive graph analytics and visualization. In AAAI, External Links: Link Cited by: §B.2.2.
 [50] (199211) Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60 (14), pp. 259–268. External Links: ISSN 01672789, Link, Document Cited by: §A.2.
 [51] (200904) Distilling Free-Form Natural Laws from Experimental Data. Science 324 (5923), pp. 81–85. External Links: ISSN 00368075, 10959203, Link, Document Cited by: §IV.
 [52] (195602) A Contribution to the Theory of Economic Growth. The Quarterly Journal of Economics 70 (1), pp. 65. External Links: ISSN 00335533, Link, Document Cited by: Figure 1, §I, §I, §III.1.2, §III.2.3.
 [53] (199705) Population regulation in snowshoe hare and Canadian lynx: Asymmetric food web configurations between hare and lynx. Proceedings of the National Academy of Sciences 94 (10), pp. 5147–5152. External Links: ISSN 00278424, 10916490, Link, Document Cited by: §B.4.2.
 [54] (195611) Economic Growth and Capital Accumulation. Economic Record 32 (2), pp. 334–361. External Links: ISSN 00130249, 14754932, Link, Document Cited by: §I, §III.1.2.
 [55] (2014) Explaining Unemployment Rates with Symbolic Regression. In Genetic Programming Theory and Practice XI, R. Riolo, J. H. Moore, and M. Kotanchek (Eds.), pp. 119–135. Note: Series Title: Genetic and Evolutionary Computation External Links: ISBN 9781493903740 9781493903757, Link, Document Cited by: §I.
 [56] (202004) AI Feynman: A physics-inspired method for symbolic regression. Science Advances 6 (16), pp. eaay2631. Note: Publisher: American Association for the Advancement of Science Section: Research Article External Links: ISSN 23752548, Link, Document Cited by: §I, §III.1.2.
 [57] (201708) Reproduction numbers of infectious disease models. Infectious Disease Modelling 2 (3), pp. 288–303. External Links: ISSN 24680427, Link, Document Cited by: §III.2.2.
 [58] (2009) Wikipedia database dump. Note: Version from 20090306. External Links: Link Cited by: §B.2.2, §III.1.1.
 [59] (201909) Symbolic regression in materials science. MRS Communications 9 (3), pp. 793–805. Note: Publisher: Cambridge University Press External Links: ISSN 21596859, 21596867 Cited by: §I.
 [60] (201612) Epidemic model for information diffusion in web forums: experiments in marketing exchange and political dialog. SpringerPlus 5 (1), pp. 66. External Links: ISSN 21931801, Link, Document Cited by: §III.2.2.
 [61] World Development Indicators  DataBank. External Links: Link Cited by: §III.2.3.
 [62] (201507) Modeling oil production based on symbolic regression. Energy Policy 82, pp. 48–61. External Links: ISSN 03014215, Link, Document Cited by: §I.
 [63] (1984) Basic issues in econometrics. University of Chicago Press, Chicago. External Links: ISBN 9780226979830 Cited by: §II.1.
Appendix A OccamNet
OccamNet works by representing a probability distribution over a space of user-specified basis functions. During training, OccamNet samples functions from its probability distribution, evaluates their effectiveness at fitting the given data, and updates its probability distribution so as to shift probability mass toward better-fitting solutions. Specifically, OccamNet consists of alternating Activation and T-Softmax Layers, which together represent a probability distribution over a set of functions. The Activation Layers consist of user-specified basis functions, the building blocks out of which OccamNet constructs symbolic expressions. The T-Softmax Layers represent the probabilities of connecting an Activation Layer's outputs to the next layer, which is either another Activation Layer or the output layer. To generate trial functions, OccamNet samples from the T-Softmax Layer probabilities, obtaining a set of connections that represents a trial function: the input variables are traced along the generated graph and transformed by the basis functions along the path. Because the user specifies the bases included in each Activation Layer as well as other properties of the network, such as the number of layers, OccamNet allows for a highly flexible user-specified search space. An example set of specified basis functions may be given by
(9) 
where x and y are either variables or subexpressions and where c is a placeholder for a constant parameter that is tuned via gradient descent. The user may also define custom bases that are relevant to the data.
Additionally, the user may specify the maximum depth of the symbolic expression by stacking basis layers. The depth of an expression is given by the maximum number of nested basis functions. For instance, two stacked basis layers indicate a maximum expression depth of 2, which allows expressions such as sin(x + y) but not sin(cos(x) + y), which has a depth of 3. Moreover, each basis layer can contain a unique set of bases if desired.
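The sampling procedure described above can be sketched in miniature. The following toy example (an assumed simplification, not the actual OccamNet implementation) draws a one-step expression from a temperature-controlled softmax over a small basis library; lower temperatures concentrate probability on the currently favored connection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy library of one-argument bases from which an expression is composed.
bases = [("sin", np.sin), ("exp", np.exp), ("square", np.square)]

def t_softmax(logits, T=1.0):
    """Softmax with temperature T; smaller T gives a sharper distribution."""
    z = np.exp((logits - logits.max()) / T)
    return z / z.sum()

logits = rng.normal(size=len(bases))    # learnable connection weights
probs = t_softmax(logits, T=1.0)        # connection probabilities
idx = rng.choice(len(bases), p=probs)   # sample one trial connection
name, f = bases[idx]

x = np.linspace(0.0, 1.0, 5)
y = f(x)                                # evaluate the sampled expression on data
```

In the full network, one such sample is drawn per connection in every T-Softmax layer, and the resulting path through the stacked Activation layers defines the trial symbolic expression.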
OccamNet’s architecture leads to a number of advantages over other symbolic regression methods:

OccamNet allows the user to take full advantage of existing knowledge by encoding inductive biases into the network. For example, if the user expects the expression to be a sum of distinct terms, they can set the final activation layer to include only addition bases. In many Symbolic Regression architectures, such inductive biases cannot be directly incorporated into the architecture.

OccamNet's training procedure allows for a non-differentiable fitness term, thereby allowing the user to include discrete factors in the fitness, such as whether the units match or the number of constants used. Many neural approaches to Symbolic Regression cannot handle such factors.

OccamNet is highly efficient and scales on a GPU, something that most symbolic regression methods cannot offer. This allows the user to find better fits in the same amount of time as many other methods.

OccamNet is a neural network that can be made fully differentiable, thereby enabling it to be combined with other neural architectures. Although we do not test this in our paper, OccamNet could, for example, be used to synthesize programs or to perform regression with image or text data. Thus, OccamNet can handle a broader range of potential Symbolic Regression tasks than many other approaches.
A.1 Regularization terms
OccamNet uses two terms for complexity regularization: Activation Regularization and Constant Regularization. For an expression f, let N_a(f) denote the number of activation functions and let N_c(f) denote the number of constants. The complexity regularization penalty is then given by

(10)  R(f) = w_a N_a(f) + w_c N_c(f),

where w_a and w_c are scalar weights for the respective regularization terms. OccamNet subtracts this penalty from the unregularized fitness of a given expression f. The weight parameters w_a and w_c allow the user to effectively customize the complexity of the resulting expression.
OccamNet also uses Unit Regularization when indicated. For a dataset with n variables, the user may specify an array of exponents U with n rows and u columns, where u is the number of unique units across the dataset, such that each entry U_ij contains the exponent of unit j for variable i. For example, if the i-th variable has units of dollars/year, the i-th row in U could be represented by the vector (1, -1), where the first entry corresponds to dollars and the second entry corresponds to years.
OccamNet then feeds these units through the sampled function f. Each basis function receives a set of variables with units, may have requirements on those units for them to be consistent, and returns a new set of units. For example, a basis such as sin receives one variable that it requires to be dimensionless and returns dimensionless units. Similarly, addition takes two variables with units that it requires to be equal and returns the same units. Using these rules, OccamNet propagates units through the function until it obtains units for the output. If at any point the input units for a basis function do not meet that basis's requirements, that basis returns a failure token, which we denote ⊥. Any basis function that receives ⊥ also returns ⊥. Finally, if the output units of f do not match the desired output units (including if f outputs ⊥), OccamNet marks f as not preserving units.
The multiplication-by-a-constant basis function can produce any output units. As such, it returns a wildcard unit, which we denote ∗. If any basis function receives ∗, it will either return ∗ if it has no constraints on the input units, or it will treat ∗ as being the units required to meet the basis function's consistency conditions. For example, if the addition function receives the units (∗, u), it will treat the input as (u, u) and return u.
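The propagation rules above can be sketched with exponent vectors. This is an assumed toy semantics (with `None` as the failure token and the string `"ANY"` as the wildcard), not OccamNet's actual unit checker:

```python
# Units are exponent vectors, e.g. [1, -1] = dollars^1 * years^-1.
# None marks a unit-consistency failure; "ANY" is the wildcard produced
# by multiplication-by-constant bases.

def add_units(u, v):
    """Addition requires equal units and returns them unchanged."""
    if u is None or v is None:
        return None          # failures propagate
    if u == "ANY":
        return v             # wildcard adapts to the other operand
    if v == "ANY":
        return u
    return u if u == v else None

def mul_units(u, v):
    """Multiplication adds unit exponents elementwise."""
    if u is None or v is None:
        return None
    if u == "ANY" or v == "ANY":
        return "ANY"
    return [a + b for a, b in zip(u, v)]

dollars_per_year = [1, -1]
assert add_units(dollars_per_year, dollars_per_year) == [1, -1]
assert add_units(dollars_per_year, [0, 1]) is None    # mismatch fails
assert add_units("ANY", dollars_per_year) == [1, -1]  # wildcard adapts
```

A sampled expression is then marked as unit-preserving only if propagating its input units through these rules yields the desired output units without producing a failure.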
After sampling functions from OccamNet, we determine which functions do not preserve units. Because we wish to avoid these functions entirely, we bypass evaluating their normal fitness (thereby saving compute time) and instead assign a fixed penalty fitness set by a user-specified hyperparameter.
A.2 Differential equation fitting
The target variables for differential equation fitting are computed numerically from time-series data. In particular, we use the central difference formula, which has second-order error in the derivative approximation. We apply this formula to each variable (column) of the input data to obtain the estimated time derivatives.
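The centered-derivative computation can be sketched as follows; the interior formula is exact for quadratics, and one-sided differences (an assumed choice here) handle the endpoints:

```python
import numpy as np

def central_difference(X, dt):
    """Second-order centered derivative of a uniformly sampled time series;
    first-order one-sided differences at the endpoints."""
    dX = np.empty_like(X, dtype=float)
    dX[1:-1] = (X[2:] - X[:-2]) / (2 * dt)
    dX[0] = (X[1] - X[0]) / dt
    dX[-1] = (X[-1] - X[-2]) / dt
    return dX

t = np.linspace(0, 1, 101)
x = t**2
dx = central_difference(x, dt=t[1] - t[0])  # interior values equal 2*t exactly
```

For multivariate data, the same function is applied column by column to produce one derivative target per state variable.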
When the time-series data is sufficiently noisy, it is helpful to counter differentiation error by applying preprocessing techniques such as total variation regularization [50] or the hard thresholding of singular values [26], as discussed in reference [9]. Alternatively, the data may be filtered or regularized via frequency-based methods or polynomial interpolation prior to numerical differentiation.

Appendix B Case studies
Unless otherwise specified, all of the experiments are run with the following hyperparameters: 1000 epochs, a learning rate of 5, a constant learning rate of 0.05, a decay rate of 1, an activation regularization weight of 0, and a constant regularization weight of 0.
For each experiment, we perform a grid search over the remaining hyperparameters, which are described in the original OccamNet paper [16]. We search over the following parameter values and select the results with the lowest error:

Standard deviation (σ): [0.5, 5, 50]

Top number: [1, 5, 10]

Equalization: [0, 1, 5]
B.1 HEP-TH citation densification
We reconstruct the densification law of the citation graph of high energy physics preprints on arXiv (HEP-TH) from January 1993 to April 2003, with 29555 papers and 352807 edges [27]. For each month t starting from January 1993, we aggregate the number of edges formed up to the end of t. We then take logarithms of the node and edge counts to replicate the methodology of Leskovec et al. [38] in fitting a line on the log-log scale.
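The log-log fitting step can be sketched as follows; the node and edge counts here are synthetic, generated from an assumed densification exponent of 1.68 rather than the actual HEP-TH data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly snapshots obeying a densification law e(t) ~ n(t)**a:
n = np.logspace(3, 5, 50)                       # nodes per snapshot
e = 0.5 * n**1.68 * rng.lognormal(0, 0.05, 50)  # edges, with mild noise

# A straight-line fit on (log n, log e) recovers the exponent a as the slope.
slope, intercept = np.polyfit(np.log(n), np.log(e), 1)
```

Because the noise is multiplicative and mild, the recovered slope stays close to the generating exponent; the same transformation turns the densification regression into an ordinary linear fit.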
Selected hyperparameters for the reported output: standard deviation 5, top number 5, equalization 5.
Basis library:
B.2 Degree distributions
B.2.1 HEP-TH degree distribution
Using the same dataset as above, we fit the degree distribution of the citation graph at the last available time period. To mitigate the effects of the long tail of nodes with high degrees, we only include nodes with degree less than 200. We additionally remove the data points for degree 0 (papers with no citations) and degree 1. Finally, we fit the distribution in log-log space.
Selected hyperparameters for the reported output: standard deviation 5, top number 10, equalization 0, activation regularization weight 0.05, constant regularization weight 0.05.
Basis library:
B.2.2 Wikipedia hyperlink degree distribution
We fit the degree distribution of the graph of Wikipedia hyperlinks [58] sourced from the Network Repository [49, 28]. The graph consists of nodes representing individual articles in the English Wikipedia, with directed edges indicating wikilinks from one article to another. We preprocess the distribution similarly to the HEP-TH citation graph, discarding nodes with degree 0 or degree greater than 200. We fit the distribution in log-log space using the same parameters as for the HEP-TH degree distribution above.
Selected hyperparameters for the reported output: standard deviation 5, top number 10, equalization 0, activation regularization weight 0.05, constant regularization weight 0.05.
Basis library:
B.3 Cobb-Douglas production
The dataset used by Charles Cobb and Paul Douglas to empirically test the Cobb-Douglas production function contains annual capital (K), labor (L), and output (Y) data for the US manufacturing sector from 1899-1922 [14]. We sweep across several values of the activation regularization weight to explore the complexity-accuracy trade-off of this dataset.
Selected hyperparameters for the four reported outputs: standard deviation 50, top number 10, equalization 5, with activation regularization weights of 0, 0.1, 0.2, and 0.3, respectively.
Basis library:
We then restrict the bases to the bare minimum needed to construct the Cobb-Douglas function.
Selected hyperparameters: standard deviation 5, top number 5, equalization 5, activation regularization weight 0.
Restricted basis library:
B.4 Predator-prey dynamics
B.4.1 Synthetic Lotka-Volterra model
We simulate a system of Lotka-Volterra equations as defined in (5) with synthetic parameters and prey and predator populations. The simulation spans 200 time steps, with initial conditions chosen such that it contains 2 periods of the predator-prey cycle. The target ODE system is defined to be
(11)  
(12) 
We then fit the two equations separately, since each derivative can be regressed as an independent target.
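A synthetic panel of this kind can be generated with a short fixed-step RK4 integrator; the parameter values below are illustrative, not those used in the paper:

```python
import numpy as np

def lotka_volterra(z, a=1.0, b=0.1, c=1.5, d=0.075):
    """x' = a*x - b*x*y (prey), y' = -c*y + d*x*y (predator)."""
    x, y = z
    return np.array([a * x - b * x * y, -c * y + d * x * y])

def rk4(f, z0, dt, steps):
    """Classical fourth-order Runge-Kutta with a fixed step size."""
    traj = np.empty((steps, len(z0)))
    traj[0] = z0
    for i in range(1, steps):
        z = traj[i - 1]
        k1 = f(z)
        k2 = f(z + dt / 2 * k1)
        k3 = f(z + dt / 2 * k2)
        k4 = f(z + dt * k3)
        traj[i] = z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return traj

traj = rk4(lotka_volterra, np.array([10.0, 5.0]), dt=0.01, steps=2000)
```

The two columns of the trajectory, together with their numerical derivatives, then serve as the two independent regression targets.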
Selected hyperparameters for both fitted equations: standard deviation 5, top number 1, equalization 5.
Basis library:
B.4.2 Lynx-hare population
One of the first datasets supporting the Lotka-Volterra model was the Hudson Bay Company's data on the lynx and hare populations in Canada from 1845-1935 [41]. The dataset is compiled from annual fur records on lynx and hare across 10 regions of boreal Canada, with the entries prior to 1903 being derived from questionnaires [53]. Specifically, we use the subset of records from 1900-1920. Given the sparsity of the data, we utilize cubic spline interpolation to generate 100 samples at equidistant time steps spanning the 20-year range. This allows us to compute a smoother time series when numerically differentiating the data.
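The interpolation step can be sketched as follows; the population values here are synthetic stand-ins for the fur records, not the actual data:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Sparse annual observations (synthetic ~10-year cycle as a stand-in).
years = np.arange(1900, 1921)
counts = 50 + 30 * np.sin(2 * np.pi * (years - 1900) / 10)

# Upsample to 100 equidistant samples over the 20-year range.
spline = CubicSpline(years, counts)
t_fine = np.linspace(years[0], years[-1], 100)
counts_fine = spline(t_fine)
d_counts = spline(t_fine, 1)   # spline derivative, an alternative to
                               # finite differences on the upsampled series
```

Evaluating the spline's analytic derivative, rather than differencing the raw annual counts, avoids amplifying the sampling noise of the sparse records.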
Selected hyperparameters: standard deviation 5, top number 1, equalization 0 for the first fitted equation; standard deviation 5, top number 10, equalization 1 for the second.
Basis library:
B.5 Compartmental models in epidemiology
B.5.1 Synthetic SIR model
We simulate a synthetic time series using the SIR model with fixed initial conditions for 60 time steps. The target model is defined to be
(13)  
(14)  
(15) 
We then fit each of the three equations separately.
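A synthetic SIR series of this kind can be generated with a few lines of forward-Euler integration; the transmission and recovery rates below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def simulate_sir(s0, i0, r0, beta=0.3, gamma=0.1, dt=1.0, steps=60):
    """Forward-Euler SIR in population fractions:
    S' = -beta*S*I, I' = beta*S*I - gamma*I, R' = gamma*I."""
    S, I, R = np.empty(steps), np.empty(steps), np.empty(steps)
    S[0], I[0], R[0] = s0, i0, r0
    for t in range(1, steps):
        new_inf = beta * S[t-1] * I[t-1] * dt
        new_rec = gamma * I[t-1] * dt
        S[t] = S[t-1] - new_inf
        I[t] = I[t-1] + new_inf - new_rec
        R[t] = R[t-1] + new_rec
    return S, I, R

S, I, R = simulate_sir(0.99, 0.01, 0.0)
```

Each compartment's numerical derivative is then regressed separately against (S, I, R), mirroring the three independent fits described above.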
Output | Standard deviation | Top number | Equalization
$\dot{S}$ | 5 | 5 | 5
$\dot{I}$ | 5 | 5 | 0
$\dot{R}$ | 5 | 10 | 5
Basis library:
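A minimal simulation of this kind of SIR system is sketched below; the rates and the fractional initial conditions are illustrative assumptions, not the paper's synthetic values.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative rates and (fractional) initial conditions; the paper's
# synthetic values are not reproduced here.
beta, gamma = 0.3, 0.1

def sir(t, z):
    S, I, R = z
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

t = np.linspace(0.0, 59.0, 60)  # 60 time steps
sol = solve_ivp(sir, (t[0], t[-1]), [0.99, 0.01, 0.0], t_eval=t, rtol=1e-8)
S, I, R = sol.y

# Each derivative is then regressed separately against basis terms of
# (S, I, R), e.g. I_dot against {S*I, I}.
```

Note that $\dot{S} + \dot{I} + \dot{R} = 0$, so the population is conserved, which is a quick sanity check on any fitted system.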
B.5.2 Measles spread
We apply OccamNet’s ensemble learning capability to a panel dataset of measles infections in the UK from 1948 to 1967. The measles cycle is mostly biennial, which allows us to split the data into 2-year cycles for ensemble fitting. Since there is no data on susceptible individuals, we follow the methodology outlined by Horrocks and Bauch [32] to generate synthetic data for the susceptible compartment $S$. We also replicate their data preprocessing by applying a Savitzky-Golay filter with a window length of 19 and polynomial order of 3 to smooth the infection time series. As an additional preprocessing step, we scale the derivative by a constant multiplicative factor to improve OccamNet loss convergence on small regression outputs.
Dataset | Output | Standard deviation | Top number | Equalization
1952–1953 | $\dot{I}$ | 5 | 5 | 1
1954–1955 | $\dot{I}$ | 5 | 5 | 5
1956–1957 | $\dot{I}$ | 5 | 1 | 1
1958–1959 | $\dot{I}$ | 5 | 1 | 1
1960–1961 | $\dot{I}$ | 5 | 1 | 5
Ensemble | $\dot{I}$ | 5 | 5 | 1
Basis library:
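The smoothing step described above can be sketched with SciPy's `savgol_filter`, using the quoted window length (19) and polynomial order (3); the noisy case counts below are made-up stand-ins for the measles series.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)

# Made-up noisy biweekly case counts standing in for the measles series.
t = np.linspace(0.0, 2.0, 52)  # one two-year cycle
cases = 100.0 * (1.0 + np.sin(np.pi * t)) + rng.normal(0.0, 5.0, size=52)

# Savitzky-Golay smoothing with the window length (19) and polynomial
# order (3) quoted in the preprocessing description.
smoothed = savgol_filter(cases, window_length=19, polyorder=3)
```

Smoothing before numerical differentiation matters because differencing amplifies high-frequency noise.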
B.6 Economic growth
B.6.1 Synthetic Solow-Swan model
Twenty synthetic “country” datasets are generated by simulating the following equations for 30 time steps:
(16) $\dot{K} = sY - \delta K$
(17) $\dot{L} = nL$
(18) $\dot{A} = gA$
(19) $Y = AK^{\alpha}L^{1-\alpha}$
For each of the 20 datasets, we randomly sample the constant parameters as well as the initial conditions for the ODE system from uniform distributions. The initial condition for output is computed using the Cobb-Douglas-based assumption $Y_0 = A_0 K_0^{\alpha} L_0^{1-\alpha}$. We use OccamNet to fit $\dot{K}$, which is computed using the centered derivative.
Output | Standard deviation | Top number | Equalization
$\dot{K}$ | 0.5 | 5 | 1
Basis library:
Unit checker: enabled for units [USD (millions), capita]
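The centered-derivative target mentioned above can be computed with `np.gradient`, which uses centered differences at interior points; the exponential capital path below is a made-up stand-in for one synthetic country, not the paper's sampled data.

```python
import numpy as np

# Made-up capital series standing in for one synthetic "country".
t = np.arange(30, dtype=float)       # 30 annual time steps
K = 100.0 * np.exp(0.03 * t)         # smooth exponential growth path

# np.gradient uses centered differences at interior points (and one-sided
# differences at the endpoints), giving the centered-derivative target K_dot.
K_dot = np.gradient(K, t)
```

For a smooth exponential path the interior estimates closely match the true derivative $\dot{K} = 0.03\,K$, while the endpoint estimates are first-order accurate only.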
B.6.2 OECD country capital growth
We use macroeconomic data compiled by Khoo et al. [35] to attempt to recover the Solow model of economic growth. Eighteen of the original 20 OECD countries are present in this dataset: Austria, Belgium, Canada, Denmark, France, Germany, Greece, Ireland, Italy, Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, Turkey, United Kingdom, and United States. The time series for each country have varying starting points as well as varying lengths, ranging from 4 to 39 years.
Output | Standard deviation | Top number | Equalization | Activation regularization weight | Constant regularization weight
$\dot{K}$ | 0.1 | 1 | 0 | 0 | 0
$\dot{K}$ | 0.1 | 1 | 0 | 0 | 10
$\dot{K}$ | 0.1 | 1 | 0 | 1 | 10
Basis library:
Unit checker: enabled for units [USD (millions), capita]
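A unit checker over base units [USD (millions), capita] can be sketched as a dimensional-consistency test: each variable carries an exponent vector over the base units, monomial units add linearly in exponent space, and terms may only be added if their unit vectors match. The representation and helper names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Exponent vectors over the base units [USD (millions), capita];
# the variable names and this representation are assumptions.
UNITS = {
    "K": np.array([1, 0]),  # capital: USD (millions)
    "Y": np.array([1, 0]),  # output: USD (millions)
    "L": np.array([0, 1]),  # labor: capita
}

def monomial_units(factors):
    """Units of a product prod v**e add linearly in exponent space."""
    total = np.zeros(2, dtype=int)
    for var, exp in factors:
        total = total + exp * UNITS[var]
    return total

def addition_valid(*terms):
    """Terms may only be added or subtracted if their unit vectors match."""
    units = [monomial_units(t) for t in terms]
    return all(np.array_equal(units[0], u) for u in units[1:])

# sY - dK: both terms carry units of USD (treating s and d as
# dimensionless scalars here), so the candidate passes the check.
ok = addition_valid([("Y", 1)], [("K", 1)])
# Y + L mixes USD with capita and is rejected.
bad = addition_valid([("Y", 1)], [("L", 1)])
```

A check of this kind prunes dimensionally inconsistent candidate expressions before they are scored, shrinking the search space.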