1 Introduction
In this paper, we present a framework for deep fundamental factor (DFF) models. The key aspect is a statistically interpretable fitted deep learner without modification of the network. Our method explicitly identifies interaction effects and ranks the importance of the factors. In the case when the network contains no hidden layers, we recover pointwise OLS or GLS estimators. For one or more hidden layers, we show how uncertainty bands for the sensitivity of the model to each input arise from the network weights. Moreover, for a certain choice of activation functions, we prove that such a distribution has bounded moments, with bounds expressed as a function of the weights and biases in the network.
Deep learning
Deep learning applies hierarchical layers of hidden variables to construct nonlinear predictors which scale to high-dimensional input spaces. The deep learning paradigm for data analysis is algorithmic rather than probabilistic (see Breiman (2001)). Deep learning has been shown to 'compress' the input space by projecting the input variables into a lower-dimensional space using autoencoders, as in deep portfolio theory (Heaton et al., 2017). A related approach introduced by Fan et al. (2017), referred to as sufficient forecasting, provides a set of sufficient predictive indices which are inferred from high-dimensional predictors. The approach uses projected principal component analysis under a semiparametric factor model and has a direct correspondence with deep learning.
1.1 Why Deep Neural Networks?
Artificial neural networks have a long history in business and economic statistics. Building on the seminal work of Gallant and White (1988), Andrews (1989), and Hornik et al. (1989), a number of studies developed the approach in the finance, economics and business literature (Swanson and White, 1995; Kuan and White, 1994; Lo, 1994; Hutchinson et al., 1994; Baillie and Kapetanios, 2007; Racine, 2001). Most recently, the literature has been extended to include deep neural networks (Sirignano et al., 2016; Dixon et al., 2016; Feng et al., 2018; Heaton et al., 2017; Chen et al., 2019). It is well known that shallow neural networks are furnished with the universal approximation theorem, which states that any shallow feedforward neural network (i.e. with a single hidden layer) can represent all continuous functions (Hornik et al., 1989), provided there are enough hidden units. The extension to deep neural networks has only recently been motivated on theoretical grounds (Tishby and Zaslavsky, 2015; Poggio, 2016; Mhaskar et al., 2016; Martin and Mahoney, 2018; Bartlett et al., 2017). Poggio (2016) shows that deep networks can achieve superior performance versus linear additive models, such as linear regression, while avoiding the curse of dimensionality.
Martin and Mahoney (2018) show that deep networks are implicitly self-regularizing, and Tishby and Zaslavsky (2015) characterize the layers as 'statistically decoupling' the input variables. There are additionally many recent theoretical developments which characterize the approximation behavior as a function of network depth, width and sparsity level (Polson and Rockova, 2018). Recently, Bartlett et al. (2017) prove upper and lower bounds on the expressability of deep feedforward neural network classifiers with piecewise linear activation functions, such as ReLU. They show that the relationship between expressability and depth is determined by the degree of the activation function.
There is further ample theoretical evidence to suggest that shallow networks cannot approximate the class of nonlinear functions represented by deep ReLU networks without blow-up. Telgarsky (2016) shows that there is a ReLU network with $\Theta(k^3)$ layers and constant width such that any network approximating it with only $\mathcal{O}(k)$ layers must have $\Omega(2^k)$ units. Mhaskar et al. (2016) discuss the differences between composition versus additive models and show that it is possible to approximate higher-order polynomials much more efficiently with several hidden layers than with a single hidden layer.
Feng et al. (2018) show that deep neural networks provide powerful expressability when combined with regularization; however, their use in factor modeling presents some fundamental obstacles, one of which we shall address in this paper. Neural networks have been presented to the investment management industry as 'black boxes'. As such, they are not viewed as interpretable and their internal behavior cannot be reasoned about on statistical grounds. We add to the literature by introducing a method of interpretability which holds for almost any feedforward network, shallow or deep, whether fitted with homoscedastic or heteroscedastic error. One important caveat is that we do not attempt to solve the causation problem in economic modeling.
Nonlinearity
While high-dimensional data representation is one distinguishing aspect of machine learning over linear regression, it is not the only one. Deep learning resolves predictor nonlinearities and interaction effects without overfitting through a bias-variance tradeoff. As such, it provides a highly expressive regression model for complex data which relies on compositions of simple nonlinear functions rather than being additive.
On the theoretical side, we introduce a generalized least squares regression estimation approach for neural networks and develop a method of interpretability which holds for almost any neural network, shallow or deep, whether fitted with homoscedastic or heteroscedastic error. On the empirical side, we show that the interpretability confidence intervals tighten with an increasing number of hidden units in a feedforward network. We also show that the interpretability method compares favorably with other classical methods on standard benchmark problems. Finally, we develop a deep fundamental factor model with six factors for S&P 500 listed stocks and compare factor interpretability and performance with the linear fundamental factor model.
The rest of the paper is outlined as follows. Section 1.2 provides the connection with deep learning and factor models in finance. Section 2 introduces the terminology and notation for defining neural network based fundamental factor models. Section 3 introduces our neural network parameter estimation approach which, in principle, is sufficiently general for fundamental factor models such as the BARRA model. Section 3.1 describes our method for interpreting neural network based factor models and compares the approach with other interpretability techniques. Section 4.2 demonstrates the application of our framework to neural network based factor models with heteroscedastic error. Finally, Section 5 concludes with directions for future research.
1.2 Connection with Finance Factor Models
Linear cross-sectional factor models, such as Fama–MacBeth (Fama and MacBeth, 1973), FF(3) and FF(5) (Fama and French, 1992, 1993, 2015), and BARRA factor models (see Rosenberg and Marathe (1976); Carvalho et al. (2012)), are appealing because of their simplicity and their economic interpretability, generating tradable portfolios. Factor realizations are estimated in the BARRA model by generalized least squares regression. Least squares linear regression can have poor expressability and relies on independent Gaussian errors. Generalizing to nonlinearities and incorporating interaction effects is a harder task.
Asset managers seek novel predictive firm characteristics to explain anomalies which are not captured by classical capital asset pricing and factor models. Recently, a number of independent empirical studies, rooted in a data science approach, have shown the importance of using a larger number of economically interpretable predictors related to firm characteristics and other common factors (Moritz and Zimmermann, 2016; Harvey et al., 2015; Gu et al., 2018; Feng et al., 2018). Gu et al. (2018) analyze a dataset of more than 30,000 individual stocks over a 60-year period from 1957 to 2016, and determine over 900 baseline signals. Both Moritz and Zimmermann (2016) and Gu et al. (2018) highlight the inadequacies of OLS regression in variable selection over high-dimensional datasets. Our work builds on recent literature that finds evidence of superior performance of nonlinear regression techniques such as regression trees and neural networks (Gu et al., 2018). Feng et al. (2018) demonstrate the ability of a three-layer deep neural network to effectively predict asset returns from fundamental factors.
2 Deep Fundamental Factor Models
Rosenberg and Marathe (1976) introduced a cross-sectional fundamental factor model to capture the effects of macroeconomic events on individual securities. The choice of factors is microeconomic characteristics – essentially common factors, such as industry membership, financial structure, or growth orientation (Nielsen and Bender, 2010).
The BARRA fundamental factor model expresses the linear relationship between fundamental factors and asset returns:

$$r_t = B_t f_t + \epsilon_t, \qquad (1)$$

where $B_t$ is the $n \times (K+1)$ matrix of known factor loadings (betas): $\beta_{i,k}(t)$ is the exposure of asset $i$ to factor $k$ at time $t$. The factors are asset-specific attributes such as market capitalization, industry classification, and style classification. $f_t$ is the $(K+1)$-vector of unobserved factor realizations at time $t$, including the intercept $\alpha_t$. $r_t$ is the $n$-vector of asset returns at time $t$. The errors are assumed independent of the factor realizations, with heteroscedastic Gaussian error, $\epsilon_{i,t} \sim N(0, \sigma_i^2)$.
2.1 Deep Factor Models
Consider a nonlinear cross-sectional fundamental factor model of the form

$$r_t = F_t(B_t) + \epsilon_t, \qquad (2)$$

where $r_t$ are the asset returns and $F_t$ is a differentiable nonlinear function that maps the $i$th row of $B_t$ to the $i$th asset return at time $t$. The map is assumed to incorporate a bias term, playing the role of the intercept $\alpha_t$. In the special case when $F_t$ is linear, the map is $F_t(B_t) = B_t f_t$ and we recover the model in Equation (1).
A key feature is that we do not assume that $\epsilon_t$ is described by a parametric distribution, such as a Gaussian distribution. In our setup, the model is used only to predict the next-period returns, and stationarity of the factor realizations is not required.
We approximate the nonlinear map, $F_t$, with a feedforward neural network cross-sectional factor model:

$$r_t = F_{W,b}(B_t) + \epsilon_t, \qquad (3)$$

where $F_{W,b}$ is a deep neural network with $L$ layers, that is, a superposition of univariate semi-affine functions, $f^{(1)}, \dots, f^{(L)}$, to give

$$F_{W,b}(B_t) := \left(f^{(L)}_{W_L,b_L} \circ \cdots \circ f^{(1)}_{W_1,b_1}\right)(B_t), \qquad (4)$$

and the unknown model parameters are a set of weight matrices $W = (W_1, \dots, W_L)$ and a set of bias vectors $b = (b_1, \dots, b_L)$. The semi-affine function $f^{(l)}$ is itself defined as the composition of the activation function, $\sigma^{(l)}$, and an affine map:

$$f^{(l)}_{W_l,b_l}(Z^{(l-1)}) := \sigma^{(l)}\!\left(W_l Z^{(l-1)} + b_l\right), \qquad (5)$$

where $Z^{(l-1)}$ is the output from the previous layer, with $Z^{(0)} := B_t$.
The activation functions, $\sigma^{(l)}$, e.g. $\tanh$, are critical to the nonlinear behavior of the model. Without them, $F_{W,b}$ would be a linear map and, as such, would be incapable of capturing interaction effects between the inputs. This is true even if the network has many layers.
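As a concrete illustration of the composition in Equations (4)–(5), the forward pass can be sketched in a few lines of NumPy. The layer sizes and the tanh activation here are illustrative choices, not a fitted architecture:

```python
import numpy as np

def forward(x, weights, biases, sigma=np.tanh):
    """Evaluate F_{W,b}(x): semi-affine hidden layers, affine output layer."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = sigma(W @ z + b)             # semi-affine hidden layer
    return weights[-1] @ z + biases[-1]  # affine output layer

# Illustrative two-input, one-output network with one hidden layer of four units.
rng = np.random.default_rng(0)
x0 = np.array([0.5, -1.0])
weights = [rng.standard_normal((4, 2)), rng.standard_normal((1, 4))]
biases = [np.zeros(4), np.zeros(1)]
y = forward(x0, weights, biases)
```

Evaluating the same network with an identity activation collapses it to a single affine map, which is precisely the point made above about the role of the activation functions.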
2.2 Architecture Design
We begin by considering a simple feedforward network with only two features, as illustrated in Figure 1. The simplest configuration we shall consider has just two inputs and one output unit -- this is a multivariate regression model. The next configuration we shall consider has one hidden layer, where the number of hidden units is equal to the number of input neurons. This choice serves as a useful reference point, as many hidden units are often needed for sufficient expressability. The final configuration has substantially more hidden units. Note that the second layer has been introduced purely to visualize the output from the hidden layer. This set of simple configurations (a.k.a. architectures) is ample to illustrate how a feedforward neural network method works.
There are many types of architectures for statistical and econometric modeling. Recurrent neural networks, Gated Recurrent Units and LSTMs are used for dynamic factor modeling (Gu et al., 2018) and modeling of the limit order book (Dixon, 2018a,b). Dixon et al. (2018) introduce architectures for spatio-temporal modeling. Chen et al. (2019) use a combination of LSTMs and feedforward architectures in a Generative Adversarial Network (GAN) to enforce a no-arbitrage constraint. Determining the weight and bias matrices, together with how many hidden units are needed for generalizable performance, is the goal of training and inference. However, we emphasize that some conceptual understanding of neural networks is needed to derive interpretability, and the utility of our framework rests on being able to run it as either a linear factor model or a nonlinear factor model.
Figure 1: (a) no hidden units (linear); (b) two hidden units; (c) many hidden units.
3 Training and Inference
The standard $L_2$-norm loss function used in ordinary least squares regression is only suitable when the errors are independent and identically distributed. Under generalized least squares (GLS), we can extend our loss function to account for heteroscedasticity in the error by minimizing the squared Mahalanobis length of the residual vector.
To construct and evaluate a deep learner for fundamental factor modeling, we start with training data of input–output pairs $\{(x_{i,t}, r_{i,t})\}_{i=1}^{n}$ over all times $t$, where the input $x_{i,t}$ is the $i$th row of $B_t$ at each time $t$. Recall that the responses, $r_{i,t}$, denote the asset returns and the inputs are the asset's sensitivities to the factors at time $t$. The goal is to find the deep learner $\hat{r} = F_{W,b}(x)$, where we have a loss function $\mathcal{L}(r, \hat{r})$ for a predictor, $\hat{r}$, of the output signal, $r$, at time $t$. Where there is no ambiguity, we drop subscripts for ease of notation.
In its simplest form, we then solve a weighted optimization problem using the data:

$$\underset{W,b}{\arg\min} \;\; \sum_{i=1}^{n} \frac{\left(r_{i,t} - F_{W,b}(x_{i,t})\right)^2}{\sigma_i^2} + \lambda\,\phi(W,b),$$

where the $\sigma_i^2$ are the variances of the residual errors and $\lambda\,\phi(W,b)$ is a regularization penalty term. The loss function is non-convex, possessing many local minima, and it is generally difficult to find a global minimum.
The diagonal conditional covariance matrix of the error, $D$, is not known and must be estimated. This is performed as follows, using the notation $v^T$ to denote the transpose of a vector $v$ and $(\hat{W}^h, \hat{b}^h)$ to denote model parameters fitted under heteroscedastic error.

1. For each $t$, estimate the residual error, $\hat{\epsilon}_t$, using unweighted least squares minimization to find the weights, $\hat{W}$, and biases, $\hat{b}$, where the error is

$$\hat{\epsilon}_t = r_t - F_{\hat{W},\hat{b}}(B_t). \qquad (6)$$

2. The sample conditional covariance matrix is obtained as

$$\hat{D} = \mathrm{diag}\!\left(\frac{1}{T-1}\sum_{t=1}^{T}\left(\hat{\epsilon}_t - \bar{\epsilon}\right)\left(\hat{\epsilon}_t - \bar{\epsilon}\right)^T\right). \qquad (7)$$

3. Perform the generalized least squares minimization to obtain a fitted heteroscedastic neural network model, $F_{\hat{W}^h,\hat{b}^h}$, with refined error

$$\hat{\epsilon}^{\,h}_t = r_t - F_{\hat{W}^h,\hat{b}^h}(B_t). \qquad (8)$$

4. Finally, the resampled conditional covariance, $\hat{D}^h$, of the heteroscedastic error, $\hat{\epsilon}^{\,h}_t$, is estimated and combined with the covariance of the neural network to give a model for the covariance of excess returns

$$\widehat{\mathrm{Cov}}[r_t] = \mathrm{Cov}\!\left[F_{\hat{W}^h,\hat{b}^h}(B_t)\right] + \hat{D}^h, \qquad (9)$$

where the functional form of the factor model covariance will be given in the next section.
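In the zero-hidden-layer case, where the two-step procedure reduces to pointwise OLS followed by weighted least squares, the steps can be sketched directly with NumPy. The simulated data and the log-linear variance model below are illustrative assumptions, not the paper's specification:

```python
import numpy as np

# Simulated cross-section: returns linear in the loadings, with
# heteroscedastic noise whose scale depends on the first loading.
rng = np.random.default_rng(1)
n, k = 200, 3
B = rng.standard_normal((n, k))                    # factor loadings
f_true = np.array([1.0, -0.5, 2.0])                # true factor realizations
sig = 0.1 + np.abs(B[:, 0])                        # per-asset error scale
r = B @ f_true + sig * rng.standard_normal(n)      # asset returns

# Step 1: unweighted (OLS) fit and residuals.
f_ols, *_ = np.linalg.lstsq(B, r, rcond=None)
resid = r - B @ f_ols

# Step 2: estimate the diagonal error covariance (here via a log-linear
# variance model, an illustrative choice) and refit by weighted least squares.
X_var = np.column_stack([np.ones(n), np.abs(B[:, 0])])
gamma, *_ = np.linalg.lstsq(X_var, np.log(resid**2 + 1e-12), rcond=None)
D_hat = np.exp(X_var @ gamma)                      # estimated error variances
W = 1.0 / D_hat                                    # GLS weights
BW = B * W[:, None]
f_gls = np.linalg.solve(B.T @ BW, BW.T @ r)        # refined factor estimates
```

With one or more hidden layers, the same two steps apply, with the least squares fits replaced by (weighted) stochastic gradient descent on the network parameters.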
Note that no look-ahead bias is introduced in the factor model, since the out-of-sample estimation is performed using only information available at the current time. Note, for the avoidance of doubt, that we consider only one-period-ahead forecasts in this paper.
Regularization
$\lambda$ is a global regularization parameter which we tune using the out-of-sample predictive weighted mean-squared error (MSE) of the model. The regularization penalty, $\phi(W,b)$, introduces a bias-variance tradeoff. The gradient of the loss function is given in closed form by the chain rule and, through backpropagation, each layer's weights and biases are fitted with stochastic gradient descent.
3.1 Factor Interpretability
Once the neural network has been trained, a number of important issues surface around how to interpret the model parameters. This aspect is by far the most prominent issue in deciding whether to use neural networks over other machine learning and statistical methods for estimating factor realizations, sometimes even if the latter's predictive accuracy is inferior.
In this section, we shall introduce a method for interpreting multilayer perceptrons which imposes minimal restrictions on the neural network design. There are numerous techniques for interpreting machine learning methods which treat the model as a black box. A good example is Partial Dependence Plots (PDPs), as described by Greenwell et al. (2018). However, PDPs are compute-intensive and cannot easily resolve interaction effects. Other approaches also exist in the literature. Garson (1991) partitions hidden-output connection weights into components associated with each input neuron using absolute values of connection weights. Olden and Jackson (2002) determine the relative importance of the output to each predictor variable of the model as a function of the weights, according to a simple linear expression. We seek to understand the limitations on the choice of activation functions, and the effect of increasing the number of layers and neurons on probabilistic interpretability. For example, under standard Gaussian i.i.d. data, how robust are the model's estimates of the importance of each input variable as the number of neurons varies?
3.2 Sensitivities
We turn to a 'white-box' technique for determining the importance of the input variables. This approach generalizes Dimopoulos et al. (1995) to a deep neural network with interaction terms. Moreover, the method is directly consistent with how coefficients are interpreted in linear regression -- they are model sensitivities. Model sensitivities are the change of the fitted model output with respect to each input.
As a control, we shall use this property to empirically evaluate how reliably neural networks, even deep networks, learn data from a linear model.
Such an approach is appealing to practitioners who evaluate the comparative performance of linear regression with neural networks and need the assurance that a neural network model is at least able to reproduce and match the coefficients on a linear dataset.
We also dispel the common misconception that the activation functions must be deactivated for a neural network model to produce a linear output. Under linear data, any nonlinear statistical model should be able to reproduce a statistical linear model under some choice of parameter values. Irrespective of whether data is linear or nonlinear in practice, the best control experiment for comparing a neural network estimator with an OLS estimator is to simulate data under a linear regression model. In this scenario, the correct model coefficients are known and the error in the coefficient estimator can be studied.
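This point can be made concrete: a one-hidden-layer tanh network reproduces a linear model exactly in the limit of small first-layer weights, since $\tanh(u) \approx u$ near zero, with no activation removed. The coefficients below are illustrative:

```python
import numpy as np

beta = np.array([1.0, 2.0, 3.0])      # illustrative linear coefficients
a = 1e-3                              # small scale keeps tanh in its linear regime
W1, b1 = a * np.eye(3), np.zeros(3)   # hidden layer: tanh(a * x) ~ a * x
w2, b2 = beta / a, 0.0                # output layer rescales back

def net(x):
    return w2 @ np.tanh(W1 @ x + b1) + b2

x = np.array([0.2, -0.4, 0.1])
y_net = net(x)                        # network output
y_lin = beta @ x                      # linear model output

# Sensitivities dY/dx of the network, by the chain rule.
sens = (w2 * (1.0 - np.tanh(W1 @ x + b1) ** 2)) @ W1
```

The network output matches the linear model to high precision and its sensitivities recover the linear coefficients, which is the control property exploited in the experiments below.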
To evaluate fitted model sensitivities analytically, we require that the function $F_{W,b}(x)$ is continuous and differentiable everywhere. Furthermore, for stability of the interpretation, we shall require that $F_{W,b}(x)$ is Lipschitz continuous: there is a positive real constant $M$ such that $|F_{W,b}(x) - F_{W,b}(y)| \le M\|x - y\|$ for all $x, y$. (If Lipschitz continuity is not imposed, then a small change in one of the input values could result in an undesirably large variation in the derivative.) Such a constraint is necessary for the first derivative to be bounded and hence for the derivatives with respect to the inputs to provide interpretability.
Fortunately, provided that the weights and biases are finite, each semi-affine function is Lipschitz continuous everywhere. For example, the function $\tanh(x)$ is continuously differentiable and its derivative, $1 - \tanh^2(x)$, is globally bounded. With finite weights, the composition of $\tanh$ with an affine function is also Lipschitz. Clearly, ReLU is not continuously differentiable, and one cannot use the approach described here. Note that for the following examples, we are indifferent to the choice of homoscedastic or heteroscedastic error, since the model sensitivities are independent of the error.
In a linear regression model

$$\hat{Y} = \beta_0 + \sum_{j=1}^{K} \beta_j X_j, \qquad (10)$$

the model sensitivities are

$$\frac{\partial \hat{Y}}{\partial X_j} = \beta_j. \qquad (11)$$
In a FFWD neural network, we can use the chain rule to obtain the model sensitivities

$$\frac{\partial \hat{Y}}{\partial X_j} = \frac{\partial}{\partial X_j} F_{W,b}(X). \qquad (12)$$

For example, with one hidden layer, $\hat{Y} = W^{(2)}\sigma(I^{(1)}) + b^{(2)}$ and $I^{(1)} = W^{(1)}X + b^{(1)}$:

$$\frac{\partial \hat{Y}}{\partial X_j} = \sum_{i} W^{(2)}_{i}\,\sigma'\!\left(I^{(1)}_i\right) W^{(1)}_{ij}. \qquad (13)$$
In matrix form, with a general activation $\sigma$, the Jacobian of $\hat{Y}$ with respect to $X$ is

$$J = W^{(2)} D\, W^{(1)}, \qquad (14)$$

where $D = \mathrm{diag}\!\left(\sigma'(I^{(1)})\right)$. (When $\sigma$ is the identity function, the Jacobian is $J = W^{(2)}W^{(1)}$.) Since $|\sigma'| \le 1$ for $\sigma = \tanh$, bounds on the sensitivities are given by the product of the absolute weight matrices

$$|J| \le \left|W^{(2)}\right|\left|W^{(1)}\right|. \qquad (15)$$
Multiple hidden layers
The model sensitivities can be readily generalized to an $L$-layer deep network by evaluating the Jacobian matrix:

$$J = W^{(L)} D^{(L-1)} W^{(L-1)} \cdots D^{(1)} W^{(1)}, \qquad D^{(l)} = \mathrm{diag}\!\left(\sigma'(I^{(l)})\right). \qquad (16)$$
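A minimal NumPy implementation of the layer-by-layer Jacobian, with a finite-difference check, is sketched below; the network weights are random and purely illustrative:

```python
import numpy as np

def forward(x, weights, biases):
    """tanh feedforward network with a linear output layer."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = np.tanh(W @ z + b)
    return weights[-1] @ z + biases[-1]

def jacobian(x, weights, biases):
    """Chain-rule Jacobian J = W^(L) D^(L-1) ... D^(1) W^(1), where
    D^(l) = diag(1 - tanh^2) is evaluated at the pre-activations."""
    z, J = x, np.eye(len(x))
    for W, b in zip(weights[:-1], biases[:-1]):
        pre = W @ z + b
        J = (W * (1.0 - np.tanh(pre) ** 2)[:, None]) @ J  # D^(l) W^(l) J
        z = np.tanh(pre)
    return weights[-1] @ J

rng = np.random.default_rng(2)
weights = [rng.standard_normal((5, 3)), rng.standard_normal((4, 5)),
           rng.standard_normal((1, 4))]
biases = [rng.standard_normal(5), rng.standard_normal(4), np.zeros(1)]
x = np.array([0.3, -0.7, 1.1])
J = jacobian(x, weights, biases)

# Central finite-difference check of each sensitivity.
eps = 1e-6
fd = np.array([(forward(x + eps * e, weights, biases)
                - forward(x - eps * e, weights, biases)) / (2 * eps)
               for e in np.eye(3)]).T
```

Because every $|\tanh'| \le 1$, the entries of $|J|$ never exceed the corresponding entries of the product of the absolute weight matrices, the multi-layer analogue of the single-layer bound.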
Example: Step test
To illustrate our interpretability approach, we shall consider a simple example. The model is trained on the following data generation process, where the coefficients of the features are stepped and the error, $\epsilon$, is i.i.d. uniform:

$$Y = \sum_{j=1}^{K} j\,X_j + \epsilon. \qquad (17)$$
Figure 2 shows the ranked importance of the input variables in a neural network with one hidden layer. Our interpretability method is compared with well-known black-box interpretability methods, such as Garson's algorithm (Garson, 1991) and Olden's algorithm (Olden and Jackson, 2002). Our approach is the only technique whose interpretation of the fitted neural network is consistent with how a linear regression model would interpret the input variables.
3.3 Interaction Effects
The previous example is too simplistic to illustrate another important property of our interpretability method, namely the ability to capture pairwise interaction terms. The pairwise interaction effects are readily available by evaluating the elements of the Hessian matrix. For example, with one hidden layer, the Hessian takes the form:

$$\frac{\partial^2 \hat{Y}}{\partial X_j \partial X_k} = \sum_{i} W^{(2)}_{i}\,\sigma''\!\left(I^{(1)}_i\right) W^{(1)}_{ij} W^{(1)}_{ik}, \qquad (18)$$

where it is assumed that the activation function is at least twice differentiable everywhere, e.g. $\tanh$.
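The Hessian is equally direct to evaluate. The sketch below uses tanh, for which $\sigma''(u) = -2\tanh(u)(1 - \tanh^2(u))$, with random illustrative weights and a finite-difference check of one interaction term:

```python
import numpy as np

def net(x, W1, b1, w2):
    return w2 @ np.tanh(W1 @ x + b1)

def hessian(x, W1, b1, w2):
    """Pairwise interactions: H[j,k] = sum_i w2_i sigma''(I_i) W1[i,j] W1[i,k]."""
    t = np.tanh(W1 @ x + b1)
    s2 = -2.0 * t * (1.0 - t ** 2)    # sigma'' at the pre-activations
    return (W1.T * (w2 * s2)) @ W1

rng = np.random.default_rng(3)
W1, b1 = rng.standard_normal((6, 4)), rng.standard_normal(6)
w2 = rng.standard_normal(6)
x = rng.standard_normal(4)
H = hessian(x, W1, b1, w2)

# Finite-difference check of the (0, 1) interaction term.
h = 1e-4
e0, e1 = np.eye(4)[0], np.eye(4)[1]
fd01 = (net(x + h*e0 + h*e1, W1, b1, w2) - net(x + h*e0 - h*e1, W1, b1, w2)
        - net(x - h*e0 + h*e1, W1, b1, w2) + net(x - h*e0 - h*e1, W1, b1, w2)) / (4 * h * h)
```

The Hessian is symmetric by construction, so only the upper triangle needs to be reported when ranking interaction effects.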
Example: Friedman data
To illustrate our input variable and interaction effect ranking approach, we will use one of the classical benchmark regression problems described in Friedman (1991) and Breiman (1996). The input space consists of ten i.i.d. uniform random variables; however, only five out of these ten actually appear in the true model. The response is related to the inputs according to the formula

$$Y = 10\sin(\pi X_1 X_2) + 20\left(X_3 - 0.5\right)^2 + 10X_4 + 5X_5 + \epsilon,$$

using white-noise error, $\epsilon \sim N(0, \sigma^2)$.
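The Friedman data generation process can be simulated in a few lines; the unit noise level below is an assumption (a common choice for this benchmark):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
X = rng.uniform(size=(n, 10))    # ten i.i.d. U(0,1) inputs; only X1..X5 matter
eps = rng.standard_normal(n)     # white-noise error (sigma = 1, assumed)
y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
     + 20 * (X[:, 2] - 0.5) ** 2
     + 10 * X[:, 3] + 5 * X[:, 4] + eps)
```

A ranking method applied to a model fitted on these data should place the five active inputs above the five pure-noise inputs.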
We fit a NN with one hidden layer containing eight units and a weight decay of 0.01 (these parameters were chosen using 5-fold cross-validation) to 500 observations simulated from the above model. The cross-validated $R^2$ value was 0.94.

4 Applications
4.1 Simulated Example
In this section, we demonstrate the estimation properties of neural network sensitivities applied to data simulated from a linear model. We show that the sensitivities in a neural network are consistent with the linear model, even if the neural network model is nonlinear. We also show that the confidence intervals, estimated by sampling, converge with increasing hidden units.
We generate 400 simulated training samples from the following linear model with i.i.d. Gaussian error:

$$Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon, \qquad \epsilon \sim N(0, \sigma^2). \qquad (19)$$
Table 1 compares an OLS estimator with a zero-hidden-layer feedforward network (NN$_0$) and a one-hidden-layer feedforward network with 10 hidden neurons and tanh activation functions (NN$_1$). The functional form of the first two regression models is equivalent, although the OLS estimator has been computed using a matrix solver whereas the zero-hidden-layer network parameters have been fitted with stochastic gradient descent.
Model  Intercept  Sensitivity of $X_1$  Sensitivity of $X_2$

OLS  0.01  1.0154  1.018
NN$_0$  0.02  1.0184141  1.02141815
NN$_1$  0.02  1.013887  1.02224
The fitted parameter values will vary slightly with each optimization as the stochastic gradient descent is randomized. However, the sensitivity terms are given in closed form and easily mapped to the linear model. In an industrial setting, such a one-to-one mapping is useful for migrating to a deep factor model where, for model validation purposes, compatibility with linear models should be recovered in a limiting case. Clearly, if the data is not generated from a linear model, then the parameter values would vary across models.
Figure 5 and Tables 2 and 3 show the empirical distribution of the fitted sensitivities using the single-hidden-layer model with an increasing number of hidden units. The sharpness of the distributions is observed to converge monotonically with the number of hidden units. The confidence intervals are estimated under a nonparametric distribution.
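The nonparametric confidence intervals can be formed by evaluating the sensitivity at every training point and taking empirical quantiles. A minimal sketch for a one-hidden-layer tanh network with random, illustrative weights:

```python
import numpy as np

def sens(x, W1, b1, w2):
    """Sensitivities dY/dx of Y = w2 . tanh(W1 x + b1) at the point x."""
    return (w2 * (1.0 - np.tanh(W1 @ x + b1) ** 2)) @ W1

rng = np.random.default_rng(5)
d, m, n = 2, 50, 400
W1, b1 = rng.standard_normal((m, d)) / np.sqrt(d), np.zeros(m)
w2 = rng.standard_normal(m) / np.sqrt(m)
X = rng.standard_normal((n, d))

S = np.array([sens(x, W1, b1, w2) for x in X])   # n x d fitted sensitivities
lo, hi = np.quantile(S, [0.01, 0.99], axis=0)    # nonparametric 1%/99% C.I.
```

Because $|\tanh'| \le 1$, every row of S is bounded by the product of the absolute weight matrices, which is what keeps the intervals finite.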
In general, provided the weights and biases of the network are finite, the variances of the sensitivities are bounded for any input and choice of activation function.
Probabilistic bounds on the variance of the Jacobians for the case when the network is ReLU-activated are given in Appendix B. We do not recommend using ReLU activation because it does not permit identification of the interaction terms and has provably non-convergent sensitivity variances as a function of the number of hidden units (see Appendix B).
(a) density of the sensitivity to $X_1$  (b) density of the sensitivity to $X_2$
Hidden Units  Mean  Median  Std.dev  1% C.I.  99% C.I. 

2  0.980875  1.0232913  0.10898393  0.58121675  1.0729908 
10  0.9866159  1.0083131  0.056483902  0.76814914  1.0322522 
50  0.99183553  1.0029879  0.03123002  0.8698967  1.0182846 
100  1.0071343  1.0175397  0.028034585  0.89689034  1.0296803 
200  1.0152218  1.0249312  0.026156902  0.9119074  1.0363332 
Hidden Units  Mean  Median  Std.dev  1% C.I.  99% C.I. 

2  0.98129386  1.0233982  0.10931312  0.5787732  1.073728 
10  0.9876832  1.0091512  0.057096474  0.76264584  1.0339714 
50  0.9903236  1.0020974  0.031827927  0.86471796  1.0152498 
100  0.9842479  0.9946766  0.028286876  0.87199813  1.0065105 
200  0.9976638  1.0074166  0.026751818  0.8920307  1.0189484 
4.2 S&P 500 Factor Modeling
This section presents the application of our framework to the fitting of a fundamental factor model under heteroscedastic errors, as in the BARRA model. The BARRA model includes many more explanatory variables than used in experiments below, but the purpose, here, is to illustrate the application of our framework to financial data.
We define the universe as the top 250 stocks from the S&P 500 index, ranked by market cap. Factors are given by Bloomberg and reported monthly over the period from November 2008 to November 2018. We remove stocks with too many missing factor values, leaving 218 stocks.
We follow the two-step procedure defined in Section 3. We begin by generating the model errors from our neural network over each period in the dataset. The historical factors are inputs to the model and are standardized to enable model interpretability. These factors are (i) current enterprise value; (ii) Price-to-Book ratio; (iii) current enterprise value to trailing 12-month EBITDA; (iv) Price-to-Sales ratio; (v) Price-to-Earnings ratio; and (vi) log market cap. The responses are the monthly asset returns for each stock in our universe.
We use a moving window of observations to estimate the sample covariance of the model error. For each period $t$, the covariance matrix is re-estimated and used to weight the error in the refined model fitting. The next-period monthly asset returns are then forecasted.
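The moving-window covariance estimate can be sketched as follows; the universe size, window length and error scales are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(6)
n_assets, T, window = 5, 60, 12      # illustrative sizes
scales = 0.5 + np.arange(n_assets)   # each asset has its own error scale
resid = rng.standard_normal((T, n_assets)) * scales   # stand-in model errors

def rolling_diag_cov(resid, t, window):
    """Diagonal sample covariance of the model errors over [t - window, t)."""
    chunk = resid[t - window:t]
    return np.diag(chunk.var(axis=0, ddof=1))

D_t = rolling_diag_cov(resid, T, window)
gls_weights = 1.0 / np.diag(D_t)     # weights for the refined (GLS) fit
```

At each period the window rolls forward by one observation and the weights are refreshed before refitting.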
We use TensorFlow (Abadi et al., 2016) to implement a two-hidden-layer feedforward network and develop a custom implementation for the generalized least squares error and variable sensitivities. The GLS model is implemented based on the two-step procedure using the OLS regression in the Python StatsModels module.
All results are shown using regularization and tanh activation functions. The number of hidden units and the regularization parameters are found by five-fold cross-validation and reported alongside the results.
Figure 6 compares the performance of a GLS estimator with the feedforward neural network with 50 hidden units in the first hidden layer and 10 hidden units in the second layer. Both models are fitted with heteroscedastic error and the in-sample performance is compared with the out-of-sample error.

(a) General Linear Regression  (b) FFWD Neural Network
Figure 7 shows the in-sample MSE as a function of the number of hidden units in the hidden layer. The neural networks are trained here without regularization to demonstrate the effect of solely increasing the number of hidden units in the first layer. Increasing the number of hidden units reduces the bias in the model.
Figure 8 shows the effect of regularization on the MSE for a network with 50 units in the first hidden layer. Increasing the level of regularization increases the in-sample bias but reduces the out-of-sample error, and hence the variance of the estimation error.
(a) In-sample  (b) Out-of-sample
Figure 9 shows the sensitivities of the GLS regression model and neural network models to each factor over a 48 month period from November 2014 to November 2018. The black line shows the sensitivities evaluated by our neural network based method. The red line shows the same sensitivities using linear regression.
Figure 10 compares the distribution of sensitivities to each factor over a 48 month period from November 2014 to November 2018 as estimated by neural networks and GLS regression.
Finally, although not the primary purpose of this paper, we provide evidence that our neural network factor model generates higher information ratios than the linear factor model when used to sort portfolios from our universe. Figure 11 shows the information ratios of a portfolio selection strategy which selects the stocks with the highest predicted monthly returns. The information ratios are evaluated for various size portfolios, using the S&P 500 index as the benchmark.
The information ratios are observed to increase by a factor of approximately three. Also shown, for control, are randomly selected portfolios. (Note that the reason why the information ratios are always positive, even under random portfolio selection, is that we have defined a universe of only 218 stocks with sufficient historical data available over the period from November 2014 to November 2018.) We observe in some cases that even though the factors in the linear regression barely exceed white noise, they are indeed predictive when used in a neural network model. Even for this limited set of factors, one can produce positive information ratios with neural networks.
(a) General Linear Regression  (b) FFWD Neural Network  (c) White Noise 
5 Summary
An important aspect in the adoption of neural networks in factor modeling is the existence of a statistical framework which provides the transparency and statistical interpretability of linear least squares estimation. Moreover, one should expect to be able to apply such a framework to linear data and obtain similar results to linear regression, thus isolating the effects of nonlinearity from the effects of using different optimization algorithms and model implementation environments.
In this paper, we introduce a deep learning framework for fundamental factor modeling which generalizes linear fundamental factor models by capturing nonlinearity, interaction effects and nonparametric shocks in financial econometrics. Our framework provides interpretability, with confidence intervals, and ranking of the factor importances and interaction effects. Moreover, our framework can be used to estimate factor realizations under either homoscedastic or heteroscedastic error. In the case when the network contains no hidden layers, our approach recovers a linear fundamental factor model. The framework allows the impact of nonlinearity and of nonparametric treatment of the error on the factors to be assessed over time, and forms the basis for generalized interpretability of fundamental factors. Our neural network model is observed to generate information ratios a factor of three higher than generalized linear regression.
References
 Abadi et al. (2016) Abadi, M., P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, et al. (2016). TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pp. 265–283.
 Andrews (1989) Andrews, D. (1989). A unified theory of estimation and inference for nonlinear dynamic models (A. R. Gallant and H. White). Econometric Theory 5(01), 166–171.
 Baillie and Kapetanios (2007) Baillie, R. T. and G. Kapetanios (2007). Testing for neglected nonlinearity in long-memory models. Journal of Business & Economic Statistics 25(4), 447–461.
 Bartlett et al. (2017) Bartlett, P., N. Harvey, C. Liaw, and A. Mehrabian (2017). Nearly-tight VC-dimension bounds for piecewise linear neural networks. CoRR abs/1703.02930.
 Breiman (1996) Breiman, L. (1996, Aug). Bagging predictors. Machine Learning 24(2), 123–140.
 Breiman (2001) Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science 16(3), 199–231.
 Carvalho et al. (2012) Carvalho, C. M., H. Lopes, O. Aguilar, and M. Mendoza (2012, 01). Dynamic stock selection strategies: A structured factor model framework. Bayesian Statistics 9 9.
 Chen et al. (2019) Chen, L., M. Pelger, and J. Zhu (2019, March). Deep learning in asset pricing. Technical report, Stanford University.
 Dimopoulos et al. (1995) Dimopoulos, Y., P. Bourret, and S. Lek (1995, Dec). Use of some sensitivity criteria for choosing networks with good generalization ability. Neural Processing Letters 2(6), 1–4.

 Dixon (2018a) Dixon, M. (2018a). A high frequency trade execution model for supervised learning. High Frequency 1(1), 32–52.
 Dixon (2018b) Dixon, M. (2018b). Sequence classification of the limit order book using recurrent neural networks. Journal of Computational Science 24, 277–286.
 Dixon et al. (2016) Dixon, M., D. Klabjan, and J. H. Bang (2016). Classification-based financial markets prediction using deep neural networks. CoRR abs/1603.08604.
 Dixon et al. (2018) Dixon, M. F., N. G. Polson, and V. O. Sokolov (2018). Deep learning for spatiotemporal modeling: Dynamic traffic flows and high frequency trading. Applied Stochastic Models in Business and Industry 0(0).
 Fama and MacBeth (1973) Fama, E. and J. D. MacBeth (1973). Risk, return, and equilibrium: Empirical tests. Journal of Political Economy 81(3), 607–36.
 Fama and French (1992) Fama, E. F. and K. R. French (1992). The cross-section of expected stock returns. The Journal of Finance 47(2), 427–465.
 Fama and French (1993) Fama, E. F. and K. R. French (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics 33(1), 3 – 56.
 Fama and French (2015) Fama, E. F. and K. R. French (2015). A five-factor asset pricing model. Journal of Financial Economics 116(1), 1–22.
 Fan et al. (2017) Fan, J., L. Xue, and J. Yao (2017). Sufficient forecasting using factor models. Journal of Econometrics 201(2), 292 – 306. Theoretical and Financial Econometrics: Essays in honor of C. Gourieroux.
 Feng et al. (2018) Feng, G., J. He, and N. G. Polson (2018, Apr). Deep Learning for Predicting Asset Returns. arXiv e-prints, arXiv:1804.09314.
 Friedman (1991) Friedman, J. H. (1991, 03). Multivariate adaptive regression splines. Ann. Statist. 19(1), 1–67.
 Gallant and White (1988) Gallant, A. and H. White (1988, July). There exists a neural network that does not make avoidable mistakes. In IEEE 1988 International Conference on Neural Networks, pp. 657–664 vol.1.
 Garson (1991) Garson, G. D. (1991, April). Interpreting neural-network connection weights. AI Expert 6(4), 46–51.
 Greenwell et al. (2018) Greenwell, B. M., B. C. Boehmke, and A. J. McCarthy (2018, May). A Simple and Effective Model-Based Variable Importance Measure. arXiv e-prints, arXiv:1805.04755.
 Gu et al. (2018) Gu, S., B. T. Kelly, and D. Xiu (2018). Empirical asset pricing via machine learning. Chicago Booth Research Paper 1804.
 Harvey et al. (2015) Harvey, C. R., Y. Liu, and H. Zhu (2015, 10). … and the Cross-Section of Expected Returns. The Review of Financial Studies 29(1), 5–68.
 Heaton et al. (2017) Heaton, J. B., N. G. Polson, and J. H. Witte (2017). Deep learning for finance: deep portfolios. Applied Stochastic Models in Business and Industry 33(1), 3–12.
 Hornik et al. (1989) Hornik, K., M. Stinchcombe, and H. White (1989, July). Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366.
 Hutchinson et al. (1994) Hutchinson, J. M., A. W. Lo, and T. Poggio (1994). A nonparametric approach to pricing and hedging derivative securities via learning networks. The Journal of Finance 49(3), 851–889.
 Kuan and White (1994) Kuan, C.-M. and H. White (1994). Artificial neural networks: an econometric perspective. Econometric Reviews 13(1), 1–91.
 Lo (1994) Lo, A. (1994). Neural networks and other nonparametric techniques in economics and finance. In AIMR Conference Proceedings, Number 9.
 Martin and Mahoney (2018) Martin, C. H. and M. W. Mahoney (2018). Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. CoRR abs/1810.01075.
 Mhaskar et al. (2016) Mhaskar, H., Q. Liao, and T. A. Poggio (2016). Learning real and boolean functions: When is deep better than shallow. CoRR abs/1603.00988.
 Moritz and Zimmermann (2016) Moritz, B. and T. Zimmermann (2016). Treebased conditional portfolio sorts: The relation between past and future stock returns.
 Nielsen and Bender (2010) Nielsen, F. and J. Bender (2010). The fundamentals of fundamental factor models. Technical Report 24, MSCI Barra Research Paper.
 Olden and Jackson (2002) Olden, J. D. and D. A. Jackson (2002). Illuminating the ’black box’: a randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling 154(1), 135 – 150.
 Poggio (2016) Poggio, T. (2016). Deep Learning: Mathematics and Neuroscience. A Sponsored Supplement to Science: Brain-Inspired Intelligent Robotics: The Intersection of Robotics and Neuroscience, 9–12.
 Polson and Rockova (2018) Polson, N. and V. Rockova (2018, Mar). Posterior Concentration for Sparse Deep Learning. arXiv e-prints, arXiv:1803.09138.
 Racine (2001) Racine, J. (2001). On the nonlinear predictability of stock returns using financial and economic variables. Journal of Business & Economic Statistics 19(3), 380–382.
 Rosenberg and Marathe (1976) Rosenberg, B. and V. Marathe (1976). Common factors in security returns: Microeconomic determinants and macroeconomic correlates. Research Program in Finance Working Papers 44, University of California at Berkeley.
 Sirignano et al. (2016) Sirignano, J., A. Sadhwani, and K. Giesecke (2016, July). Deep Learning for Mortgage Risk. arXiv e-prints.
 Swanson and White (1995) Swanson, N. R. and H. White (1995). A model-selection approach to assessing the information in the term structure using linear models and artificial neural networks. Journal of Business & Economic Statistics 13(3), 265–275.
 Telgarsky (2016) Telgarsky, M. (2016). Benefits of depth in neural networks. CoRR abs/1602.04485.
 Tishby and Zaslavsky (2015) Tishby, N. and N. Zaslavsky (2015). Deep learning and the information bottleneck principle. CoRR abs/1503.02406.
Appendix A Other Interpretability Methods
Partial Dependency Plots (PDPs) evaluate the expected output w.r.t. the marginal density function of each input variable, and allow the importance of the predictors to be ranked. More precisely, partitioning the input variables into an interest set, $z_s$, and its complement, $z_c$, the “partial dependence” of the response on $z_s$ is defined as

(20) $f_s(z_s) = \mathbb{E}_{z_c}\left[f(z_s, z_c)\right] = \int f(z_s, z_c)\, p_c(z_c)\, dz_c,$

where $p_c(z_c)$ is the marginal probability density of $z_c$: $p_c(z_c) = \int p(z_s, z_c)\, dz_s$. Equation (20) can be estimated from a set of training data by

(21) $\bar{f}_s(z_s) = \frac{1}{n} \sum_{i=1}^{n} f(z_s, z_{i,c}),$

where $z_{1,c}, \ldots, z_{n,c}$ are the observations of $z_c$ in the training set; that is, the effects of all the other predictors in the model are averaged out. There are a number of challenges with using PDPs for model interpretability. First, interaction effects are ignored by the simplest version of this approach. While Greenwell et al. (2018) propose a methodological extension to potentially address the modeling of interaction effects, PDPs do not provide a one-to-one correspondence with the coefficients in a linear regression. Instead, we would like to know, under strict control conditions, how the fitted weights and biases of the MLP correspond to the fitted coefficients of linear regression. Moreover, in the context of neural networks, by treating the model as a black box, it is difficult to gain theoretical insight into how the choice of the network architecture affects its interpretability from a probabilistic perspective.
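As a concrete sketch, the estimator in Equation (21) can be implemented in a few lines. The model `f` and the data below are hypothetical toys, not the paper's factor model:

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Estimate Eq. (21): clamp the interest feature z_s at each grid value
    and average the model's predictions over the training observations of
    the complement features z_c."""
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value              # fix z_s at the grid value
        pd_values.append(model(X_mod).mean())  # average out z_c
    return np.array(pd_values)

# Toy illustration with a known nonlinear response f(x) = x_0^2 + x_1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
f = lambda Z: Z[:, 0]**2 + Z[:, 1]

grid = np.linspace(-2.0, 2.0, 5)
pd = partial_dependence(f, X, feature=0, grid=grid)
# pd traces x_0^2 shifted by the (near-zero) sample mean of x_1.
```

The clamp-and-average loop makes the limitation above explicit: every grid value is averaged over the same complement observations, so interaction effects between $z_s$ and $z_c$ are averaged away.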
Garson (1991) partitions the hidden-output connection weights into components associated with each input neuron. Because the algorithm uses the absolute values of the connection weights when calculating variable contributions, it does not provide the direction of the relationship between the input and output variables.
Olden and Jackson (2002) determine the relative importance, $r_{jk}$, of the $k$-th output to the $j$-th predictor variable of the model as a function of the weights, according to the expression

(22) $r_{jk} = \sum_{i=1}^{n} u_{ki}\, w_{ij},$

where $w_{ij}$ is the weight connecting predictor $x_j$ to hidden unit $i$, and $u_{ki}$ is the weight connecting hidden unit $i$ to output $y_k$.
The approach does not account for the nonlinearity introduced by the activation functions, which is one of the most critical aspects of the model. Furthermore, the approach is limited to a single hidden layer.
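For reference, Olden and Jackson's connection-weight measure reduces to a product of weight matrices. The sketch below assumes a single hidden layer with toy weight matrices `W1` and `W2`, not taken from any fitted model:

```python
import numpy as np

def olden_importance(W1, W2):
    """Connection-weight importance in the sense of Olden and Jackson (2002):
    the contribution of input j to output k is the sum over hidden units of
    the products of input-hidden and hidden-output weights, preserving sign.

    W1 : (n_hidden, n_inputs) input-to-hidden weight matrix
    W2 : (n_outputs, n_hidden) hidden-to-output weight matrix
    Returns an (n_outputs, n_inputs) matrix of relative importances."""
    return W2 @ W1

# Toy weights for a 3-input, 2-hidden-unit, 1-output network.
W1 = np.array([[0.5, -1.0, 0.0],
               [1.0,  0.5, 0.2]])
W2 = np.array([[2.0, -1.0]])
R = olden_importance(W1, W2)
# R = [[2*0.5 - 1*1.0, 2*(-1.0) - 1*0.5, 2*0.0 - 1*0.2]] = [[0.0, -2.5, -0.2]]
```

Because the activation is ignored, this matrix product is exact only for a linear network, which is precisely the criticism noted above.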
Appendix B Bounds on the Variance of the Jacobian
Consider a ReLU-activated single-layer feedforward network, $\hat{Y}(X) = W^{(2)} \max(W^{(1)} X + b^{(1)}, 0) + b^{(2)}$. In matrix form, with $I^{(1)} := W^{(1)} X + b^{(1)}$, the Jacobian, $J := \partial_X \hat{Y}$, can be written as a linear combination of Heaviside functions:

(23) $J = W^{(2)}\, \mathrm{diag}\!\left(H(I^{(1)})\right) W^{(1)},$

where $H$ denotes the Heaviside function, $H(x) := \mathbb{1}_{x > 0}$. The Jacobian can be written in matrix element form as

(24) $J_{ij} = \sum_{k=1}^{n} w^{(2)}_{ik}\, H(I^{(1)}_k)\, w^{(1)}_{kj},$

where $I^{(1)}_k = \sum_{j} w^{(1)}_{kj} x_j + b^{(1)}_k$ and $n$ is the number of hidden units. As a linear combination of indicator functions, we have

(25) $|J_{ij}| \le \sum_{k=1}^{n} \left| w^{(2)}_{ik}\, w^{(1)}_{kj} \right|.$
Alternatively, the Jacobian can be expressed in terms of the Bernoulli random variables $B_k := H(I^{(1)}_k) \in \{0, 1\}$:

(26) $J_{ij} = \sum_{k=1}^{n} a_k B_k, \qquad a_k := w^{(2)}_{ik} w^{(1)}_{kj}.$

In the case when the $B_k$ are independent, this simplifies to a weighted sum of independent Bernoulli trials:

(27) $J_{ij} = \sum_{k=1}^{n} a_k B_k, \qquad B_k \sim \mathrm{Bernoulli}(p_k),$

where $p_k := \Pr(I^{(1)}_k > 0)$. The expectation of the Jacobian is given by

(28) $\mathbb{E}[J_{ij}] = \sum_{k=1}^{n} a_k p_k,$

where the expectation is taken over the distribution of the input $X$. For finite weights, the expectation is bounded above by $\sum_{k=1}^{n} |a_k|$. We can write the variance of the Jacobian as:

(29) $\mathrm{Var}[J_{ij}] = \sum_{k=1}^{n} a_k^2\, p_k (1 - p_k).$

If the weights are constrained so that the mean is constant, $\mathbb{E}[J_{ij}] = \mu$, with $a_k \in [0, 1]$, then the variance is bounded by the mean:

(30) $\mathrm{Var}[J_{ij}] = \sum_{k=1}^{n} a_k^2\, p_k (1 - p_k) \le \sum_{k=1}^{n} a_k p_k = \mu.$
Otherwise, in general, we have the bound:

(31) $\mathrm{Var}[J_{ij}] \le \frac{1}{4} \sum_{k=1}^{n} a_k^2,$

since $p_k(1 - p_k) \le 1/4$.
That the variance increases monotonically with the number of hidden units, combined with the inability to evaluate interaction terms, gives two good reasons to avoid using ReLU activation functions.
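A minimal numerical sketch of Equation (24), assuming a toy single-layer ReLU network with randomly drawn weights (not a fitted model), confirms that the Heaviside form of the Jacobian matches central finite differences away from the ReLU kinks:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 3, 10
W1 = rng.normal(size=(n_hidden, n_in))   # W^(1)
b1 = rng.normal(size=n_hidden)           # b^(1)
W2 = rng.normal(size=(1, n_hidden))      # W^(2)

relu = lambda z: np.maximum(z, 0.0)
f = lambda x: W2 @ relu(W1 @ x + b1)     # single-layer ReLU network

def jacobian(x):
    """Eq. (24): J_ij = sum_k W2_ik H(I_k) W1_kj, with H the Heaviside step."""
    B = (W1 @ x + b1 > 0).astype(float)  # the Bernoulli variables B_k
    return (W2 * B) @ W1

# Central finite differences agree with the analytic Jacobian except
# exactly at a kink, i.e. when some preactivation I_k changes sign.
x = rng.normal(size=n_in)
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e))[0] / (2 * eps)
               for e in np.eye(n_in)])
```

Sampling `jacobian(x)` over random inputs gives a Monte Carlo view of the distribution whose moments are bounded above.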
B.1 Chernoff Bounds on the Jacobian
We can also derive probabilistic bounds on the Jacobian. Let $a_1, \ldots, a_n$ be reals in $[0, 1]$ and let $B_1, \ldots, B_n$ be independent Bernoulli trials with $\Pr(B_k = 1) = p_k$, so that

(32) $S := \sum_{k=1}^{n} a_k B_k, \qquad \mu := \mathbb{E}[S] = \sum_{k=1}^{n} a_k p_k.$

The following Chernoff-type bound exists on deviations of $S$ above the mean: for any $\delta > 0$,

(33) $\Pr\left(S > (1+\delta)\mu\right) < \left[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right]^{\mu}.$

A similar bound exists for deviations of $S$ below the mean. For $0 < \delta \le 1$:

(34) $\Pr\left(S < (1-\delta)\mu\right) < e^{-\mu \delta^2 / 2}.$
These bounds are generally weak and are best suited to large deviations, i.e. the tail regions. [Figure: the Chernoff bounds plotted for different values of $\mu$, with $\mu$ increasing towards the upper right-hand corner of the plot.]
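A Monte Carlo sketch, with hypothetical weights $a_k$ and probabilities $p_k$ (arbitrary illustrative values), shows both the validity and the looseness of the upper-tail bound in Equation (33):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
a = rng.uniform(0.0, 1.0, size=n)    # weights a_k in [0, 1]
p = rng.uniform(0.2, 0.8, size=n)    # success probabilities p_k
mu = (a * p).sum()                   # mean of S = sum_k a_k B_k

# Monte Carlo frequency of the upper-tail event S > (1 + delta) * mu.
delta = 0.3
B = (rng.uniform(size=(100_000, n)) < p).astype(float)
S = B @ a
freq = (S > (1 + delta) * mu).mean()

# Chernoff bound of Eq. (33); it holds, but is typically far from tight.
bound = (np.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
```

In runs like this, the empirical tail frequency sits well below the bound, consistent with the remark that the bounds are weak outside the far tails.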