Deep_Fundamental_Factors
Source code for Deep Fundamental Factor Models, https://arxiv.org/abs/1903.07677
Deep fundamental factor models are developed to interpret and capture non-linearity, interaction effects and non-parametric shocks in financial econometrics. Uncertainty quantification provides interpretability with interval estimation, ranking of factor importances and estimation of interaction effects. Estimating factor realizations under either homoscedastic or heteroscedastic error is also available. With no hidden layers we recover a linear factor model and, for one or more hidden layers, uncertainty bands for the sensitivity to each input naturally arise from the network weights. To illustrate our methodology, we construct a six-factor model of assets in the S&P 500 index and generate information ratios that are three times greater than generalized linear regression. We show that the factor importances are materially different from the linear factor model when accounting for non-linearity. Finally, we conclude with directions for future research.
In this paper, we present a framework for deep fundamental factor (DFF) models. The key aspect is a statistically interpretable fitted deep learner without modification of the network. Our method explicitly identifies interaction effects and ranks the importance of the factors. In the case when the network contains no hidden layers, we recover point-wise OLS or GLS estimators. For one or more hidden layers, we show how uncertainty bands for the sensitivity of the model to each input arise from the network weights. Moreover, for a certain choice of activation functions, we prove that such a distribution has bounded moments, with bounds expressed as a function of the weights and biases in the network.
Deep learning applies hierarchical layers of hidden variables to construct nonlinear predictors which scale to high dimensional input space. The deep learning paradigm for data analysis is algorithmic rather than probabilistic (see Breiman (2001)). Deep learning has been shown to 'compress' the input space by projecting the input variables into a lower dimensional space using auto-encoders, as in deep portfolio theory (Heaton et al., 2017). A related approach introduced by Fan et al. (2017), referred to as sufficient forecasting, provides a set of sufficient predictive indices which are inferred from high-dimensional predictors. The approach uses projected principal component analysis under a semi-parametric factor model and has a direct correspondence with deep learning.
Artificial neural networks have a long history in business and economic statistics. Building on the seminal work of Gallant and White (1988); Andrews (1989); Hornik et al. (1989), Swanson and White (1995); Kuan and White (1994); Lo (1994); Hutchinson et al. (1994); Baillie and Kapetanios (2007); Racine (2001) develop various studies in the finance, economics and business literature. Most recently, the literature has been extended to include deep neural networks (Sirignano et al., 2016; Dixon et al., 2016; Feng et al., 2018; Heaton et al., 2017; Chen et al., 2019).

It is well known that shallow neural networks are furnished with the universal representation theorem, which states that a shallow feedforward neural network (i.e. with a single hidden layer) can approximate any continuous function (Hornik et al., 1989), provided there are enough hidden units. The extension to deep neural networks has only recently been motivated on theoretical grounds (Tishby and Zaslavsky, 2015; Poggio, 2016; Mhaskar et al., 2016; Martin and Mahoney, 2018; Bartlett et al., 2017). Poggio (2016) show that deep networks can achieve superior performance versus linear additive models, such as linear regression, while avoiding the curse of dimensionality.

Martin and Mahoney (2018) show that deep networks are implicitly self-regularizing and Tishby and Zaslavsky (2015) characterize the layers as 'statistically decoupling' the input variables.

There are additionally many recent theoretical developments which characterize the approximation behavior as a function of network depth, width and sparsity level (Polson and Rockova, 2018). Recently, Bartlett et al. (2017) prove upper and lower bounds on the expressability of deep feedforward neural network classifiers with piecewise linear activation functions, such as ReLU. They show that the relationship between expressability and depth is determined by the degree of the activation function.
There is further ample theoretical evidence to suggest that shallow networks cannot approximate the class of non-linear functions represented by deep ReLU networks without blow-up. Telgarsky (2016) shows that there is a ReLU network with $L$ layers and $U = \Theta(L)$ units such that any network approximating it with only $O(L^{1/3})$ layers must have $\Omega(2^{L^{1/3}})$ units. Mhaskar et al. (2016) discuss the differences between composition versus additive models and show that it is possible to approximate higher polynomials much more efficiently with several hidden layers than a single hidden layer.
Feng et al. (2018) show that deep neural networks provide powerful expressability when combined with regularization. However, their use in factor modeling presents some fundamental obstacles, one of which we address in this paper. Neural networks have been presented to the investment management industry as 'black-boxes'. As such, they are not viewed as interpretable and their internal behavior cannot be reasoned about on statistical grounds. We add to the literature by introducing a method of interpretability which holds for almost any feedforward network, shallow or deep, whether fitted with homoscedastic or heteroscedastic error. One important caveat is that we do not attempt to solve the causation problem in economic modeling.
While high dimensional data representation is one distinguishing aspect of machine learning over linear regression, it is not the only one. Deep learning resolves predictor non-linearities and interaction effects without over-fitting, by means of a bias-variance tradeoff. As such, it provides a highly expressive regression model for complex data which relies on compositions of simple non-linear functions rather than being additive.
On the theoretical side, we introduce a generalized least squares regression estimation approach for neural networks and develop a method of interpretability which holds for almost any neural network, shallow or deep, whether fitted with homoscedastic or heteroscedastic error. On the empirical side, we show that the interpretability confidence intervals tighten with increasing hidden units in a feedforward network. We also show that the interpretability method compares favorably with other classical methods on standard benchmark problems. Finally, we develop a deep fundamental factor model with six factors for S&P 500 listed stocks and compare factor interpretability and performance with the linear fundamental factor model.
The rest of the paper is outlined as follows. Section 1.2 provides the connection with deep learning and factor models in finance. Section 2 introduces the terminology and notation for defining neural network based fundamental factor models. Section 3 introduces our neural network parameter estimation approach which, in principle, is sufficiently general for fundamental factor models such as the BARRA model. Section 3.1 describes our method for interpreting neural network based factor models and compares the approach with other interpretability techniques. Section 4.2 demonstrates the application of our framework to neural network based factor models with heteroscedastic error. Finally, Section 5 concludes with directions for future research.
Linear cross-sectional factor models, such as (Fama and MacBeth, 1973), FF(3) and FF(5) (Fama and French, 1993, 1992, 2015) and BARRA factor models (see Rosenberg and Marathe (1976); Carvalho et al. (2012)) are appealing because of their simplicity and their economic interpretability, generating tradable portfolios. Factor realizations are estimated in the BARRA model by generalized least squares regression. Least squares linear regression can have poor expressability and relies on independent Gaussian errors. Generalizing to non-linearities and incorporating interaction effects is a harder task.
Asset managers seek novel predictive firm characteristics to explain anomalies which are not captured by classical capital asset pricing and factor models. Recently a number of independent empirical studies, rooted in a data science approach, have shown the importance of using a higher number of economically interpretable predictors related to firm characteristics and other common factors (Moritz and Zimmermann, 2016; Harvey et al., 2015; Gu et al., 2018; Feng et al., 2018). Gu et al. (2018) analyze a dataset of more than 30,000 individual stocks over a 60 year period from 1957 to 2016, and determine over 900 baseline signals. Both Moritz and Zimmermann (2016) and Gu et al. (2018) highlight the inadequacies of OLS regression in variable selection over high dimensional datasets.

Our work builds on recent literature that finds evidence of superior performance of non-linear regression techniques such as regression trees and neural networks (Gu et al., 2018). Feng et al. (2018) demonstrate the ability of a three-layer deep neural network to effectively predict asset returns from fundamental factors.
Rosenberg and Marathe (1976) introduced a cross-sectional fundamental factor model to capture the effects of macroeconomic events on individual securities. The factors chosen are microeconomic characteristics – essentially common factors, such as industry membership, financial structure, or growth orientation (Nielsen and Bender, 2010).
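The BARRA fundamental factor model expresses the linear relationship between fundamental factors and asset returns:

$$r_t = B_t f_t + \epsilon_t, \quad t = 1, \dots, T, \tag{1}$$

where $B_t = [\mathbf{1} \mid b_1(t) \mid \cdots \mid b_K(t)]$ is the $N \times (K+1)$ matrix of known time invariant factor loadings (betas): $b_{i,k}(t) := (b_k)_i(t)$ is the exposure of asset $i$ to factor $k$ at time $t$.

The factors are asset specific attributes such as market capitalization, industry classification, style classification. $f_t = [\alpha_t, f_{1,t}, \dots, f_{K,t}]$ is the $(K+1)$-vector of unobserved factor realizations at time $t$, including $\alpha_t$.

$r_t$ is the $N$-vector of asset returns at time $t$. The errors are assumed independent of the factor realizations, $\rho(f_{i,t}, \epsilon_{j,t}) = 0$ for all $i, j, t$, with heteroscedastic Gaussian error, $\mathbb{E}[\epsilon_{j,t}^2] = \sigma_j^2$.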
Consider a non-linear cross-sectional fundamental factor model of the form

$$r_t = F_t(B_t) + \epsilon_t, \tag{2}$$

where $r_t$ are asset returns and $F_t: \mathbb{R}^K \to \mathbb{R}$ is a differentiable non-linear function that maps the $i$-th row of $B_t$ to the $i$-th asset return at time $t$. The map is assumed to incorporate a bias term so that $F_t(0) = \alpha_t$. In the special case when $F_t(B_t)$ is linear, the map is $F_t(B_t) = B_t f_t$.

A key feature is that we do not assume that $\epsilon_t$ is described by a parametric distribution, such as a Gaussian distribution. In our setup, the model is used only to predict the next period returns, and stationarity of the factor realizations is not required.
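We approximate a non-linear map, $F_t(B_t)$, with a feedforward neural network cross-sectional factor model:

$$r_t = F_{W_t, b_t}(B_t) + \epsilon_t, \tag{3}$$

where $F_{W_t, b_t}$ is a deep neural network with $L$ layers, that is, a super-position of univariate semi-affine functions, $\sigma^{(\ell)}_{W^{(\ell)}, b^{(\ell)}}$, to give

$$\hat{Y}(X) := F_{W,b}(X) = \left(\sigma^{(L)}_{W^{(L)}, b^{(L)}} \circ \cdots \circ \sigma^{(1)}_{W^{(1)}, b^{(1)}}\right)(X), \tag{4}$$

and the unknown model parameters are a set of $L$ weight matrices $W = (W^{(1)}, \dots, W^{(L)})$ and a set of $L$ bias vectors $b = (b^{(1)}, \dots, b^{(L)})$.

The $\ell$-th semi-affine function is itself defined as the composition of the activation function, $\sigma^{(\ell)}(\cdot)$, and an affine map:

$$\sigma^{(\ell)}_{W^{(\ell)}, b^{(\ell)}}(Z^{(\ell-1)}) := \sigma^{(\ell)}\left(W^{(\ell)} Z^{(\ell-1)} + b^{(\ell)}\right), \tag{5}$$

where $Z^{(\ell-1)}$ is the output from the previous layer, with $Z^{(0)} = X$.

The activation functions, $\sigma^{(\ell)}(\cdot)$, e.g. $\tanh(\cdot)$, are critical to the non-linear behavior of the model. Without them, $F_{W,b}$ would be a linear map and, as such, would be incapable of capturing interaction effects between the inputs. This is true even if the network has many layers.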
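The composition of semi-affine layers can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' TensorFlow implementation: the layer sizes, the random weights, the linear output layer and the names `semi_affine` and `feedforward` are all assumptions made for the example. It also shows the point made above: replacing the activation with the identity collapses the network to a single affine map.

```python
import numpy as np

def semi_affine(W, b, z, activation):
    """One layer: the activation applied to the affine map W z + b."""
    return activation(W @ z + b)

def feedforward(weights, biases, x, activation=np.tanh):
    """Compose the hidden semi-affine layers, then apply a linear output layer
    (the linear output layer is an assumption of this sketch)."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = semi_affine(W, b, z, activation)
    return weights[-1] @ z + biases[-1]

rng = np.random.default_rng(0)
K, n_hidden = 6, 10  # illustrative sizes: six inputs, ten hidden units
weights = [rng.normal(size=(n_hidden, K)), rng.normal(size=(1, n_hidden))]
biases = [rng.normal(size=n_hidden), rng.normal(size=1)]
x = rng.normal(size=K)

# Non-linear network output.
y_tanh = feedforward(weights, biases, x)

# With the identity in place of tanh, the layers collapse to one affine map.
y_identity = feedforward(weights, biases, x, activation=lambda v: v)
W_eff = weights[1] @ weights[0]
b_eff = weights[1] @ biases[0] + biases[1]
y_collapsed = W_eff @ x + b_eff
```

Because `y_identity` and `y_collapsed` agree, a network without activations has no cross terms in its inputs, however many layers it has.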
We begin by considering a simple feedforward binary classifier with only two features, as illustrated in Figure 1. The simplest configuration we shall consider has just two inputs and one output unit - this is a multivariate regression model.

The next configuration we shall consider has one hidden layer - the number of hidden units shall be equal to the number of input neurons. This choice serves as a useful reference point as many hidden units are often needed for sufficient expressability. The final configuration has substantially more hidden units. Note that the second layer has been introduced purely to visualize the output from the hidden layer. This set of simple configurations (a.k.a. architectures) is ample to illustrate how a feedforward neural network method works.
There are many types of architectures for statistical and econometric modeling. Recurrent neural networks, Gated Recurrent Units and LSTMs are used for dynamic factor modeling (Gu et al., 2018) and modeling of the limit order book (Dixon, 2018b, a). Dixon et al. (2018) introduce architectures for spatio-temporal modeling. Chen et al. (2019) use a combination of LSTMs and feed-forward architectures in a Generative Adversarial Network (GAN) to enforce a no-arbitrage constraint.

Determining the weight and bias matrices, together with how many hidden units are needed for generalizable performance, is the goal of training and inference. However, we emphasize that some conceptual understanding of neural networks is needed to derive interpretability, and the utility of our framework rests on being able to run as either a linear factor model or a non-linear factor model.

Figure 1: (a) No hidden units (linear); (b) Two hidden units; (c) Many hidden units.
The standard $\ell_2$ norm loss function used in ordinary least squares regression is only suitable when the errors are independent and identically distributed. Under generalized least squares (GLS), we can extend our loss function to account for heteroscedasticity in the error by minimizing the squared Mahalanobis length of the residual vector.
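To construct and evaluate a deep learner for fundamental factor modeling, we start with training data $D_t = \{(y_t^{(i)}, x_t^{(i)})\}_{i=1}^{N}$ of input-output pairs over all times $t = 1, \dots, T$, where at each time $x_t^{(i)} \in \mathbb{R}^K$ and $y_t^{(i)} \in \mathbb{R}$. Recall that the responses, $y_t^{(i)} = r_{i,t}$, denote the asset return and the $x_t^{(i)}$, the $i$-th row of $B_t$, are the asset's sensitivities to the factors at time $t$. The goal is to find the deep learner of $y = F(x)$, where we have a loss function $\mathcal{L}(y, \hat{y})$ for a predictor, $\hat{y}$, of the output signal, $y$, at time $t$. Where there is no ambiguity, we drop subscripts for ease of notation.

In its simplest form, we then solve a weighted optimization problem using data, $D_t$:

$$\min_{W_t, b_t} \; \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sigma_{i,t}^2} \, \mathcal{L}\left(y_t^{(i)}, \hat{y}(x_t^{(i)})\right) + \lambda \phi(W_t, b_t),$$

where the $\sigma_{i,t}^2$ are the variances of the residual error and $\phi(W_t, b_t)$ is a regularization penalty term. The loss function is non-convex, possessing many local minima, and it is generally difficult to find a global minimum.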
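The diagonal conditional covariance matrix of the error, $\Sigma$, is not known and must be estimated. This is performed as follows, using the notation $(\cdot)^{\top}$ to denote the transpose of a vector and $\tilde{(\cdot)}$ to denote model parameters fitted under heteroscedastic error.

For each $t = 1, \dots, T$, estimate the residual error, $\epsilon_t \in \mathbb{R}^N$, using unweighted least squares minimization to find the weights, $\hat{W}_t$, and biases, $\hat{b}_t$, where the error is

$$\epsilon_t = r_t - F_{\hat{W}_t, \hat{b}_t}(B_t). \tag{6}$$

The sample conditional covariance matrix $\hat{\Sigma}$ is obtained as

$$\hat{\Sigma} = \frac{1}{T-1} \sum_{t=1}^{T} \operatorname{diag}\left(\epsilon_t \epsilon_t^{\top}\right). \tag{7}$$

Perform the generalized least squares minimization to obtain a fitted heteroscedastic neural network model, with refined error

$$\tilde{\epsilon}_t = r_t - F_{\tilde{W}_t, \tilde{b}_t}(B_t). \tag{8}$$

Finally, the re-sampled conditional covariance, $\tilde{\Sigma}$, of the heteroscedastic error, $\tilde{\epsilon}_t$, is estimated and combined with the covariance of the neural network, $\Sigma_F$, to give a model for the covariance of excess returns

$$\Sigma_r = \Sigma_F + \tilde{\Sigma}, \tag{9}$$

where the functional form of the factor model covariance will be given in the next section.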
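The estimation procedure above can be sketched with NumPy. To keep the example short and runnable, a linear cross-sectional map stands in for the neural network, the error variances are taken to be constant over time, and the variances are estimated from all in-sample residuals rather than a moving window; all three are simplifying assumptions, as are the simulated data and the helper name `fit_weighted`.

```python
import numpy as np

def fit_weighted(B, r, var):
    """Minimise sum_i (r_i - B_i f)^2 / var_i by rescaling each row by 1/sqrt(var_i)."""
    w = 1.0 / np.sqrt(var)
    f, *_ = np.linalg.lstsq(B * w[:, None], r * w, rcond=None)
    return f

rng = np.random.default_rng(1)
T, N, K = 60, 200, 3                       # periods, assets, factors (illustrative)
sigma = rng.uniform(0.5, 3.0, size=N)      # asset-specific error volatilities
B = rng.normal(size=(T, N, K))             # factor exposures per period
f_true = rng.normal(size=(T, K))           # factor realizations per period
r = np.einsum("tnk,tk->tn", B, f_true) + sigma * rng.normal(size=(T, N))

# Step 1: unweighted least squares in each period, keeping the residuals.
ones = np.ones(N)
f_ols = np.array([fit_weighted(B[t], r[t], ones) for t in range(T)])
resid = r - np.einsum("tnk,tk->tn", B, f_ols)

# Step 2: diagonal of the sample covariance of the residuals.
var_hat = (resid ** 2).mean(axis=0)

# Step 3: refit each period, weighting each asset by its inverse error variance.
f_gls = np.array([fit_weighted(B[t], r[t], var_hat) for t in range(T)])

err_ols = np.mean((f_ols - f_true) ** 2)
err_gls = np.mean((f_gls - f_true) ** 2)
```

On this simulated data the weighted fit recovers the factor realizations with a smaller mean squared error than the unweighted fit, which is the motivation for the second pass.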
Note that no-look ahead bias is introduced in the factor model, if the out-of-sample estimation is performed at time . Note, for avoidance of doubt, we just consider in this paper.
$\lambda$ is a global regularization parameter which we tune using the out-of-sample predictive weighted mean-squared error (MSE) of the model. The regularization penalty, $\phi(W, b)$, introduces a bias-variance tradeoff.

The gradient of the loss, $\nabla \mathcal{L}$, is given in closed form by the chain rule and, through back-propagation, each layer's weights $\hat{W}^{(\ell)}$ and biases $\hat{b}^{(\ell)}$ are fitted with stochastic gradient descent.
Once the neural network has been trained, a number of important issues surface around how to interpret the model parameters. This aspect is by far the most prominent issue in deciding whether to use neural networks in favor of other machine learning and statistical methods for estimating factor realizations, sometimes even if the latter’s predictive accuracy is inferior.
In this section, we shall introduce a method for interpreting multi-layer perceptrons which imposes minimal restrictions on the neural network design. There are numerous techniques for interpreting machine learning methods which treat the model as a black-box. A good example is Partial Dependence Plots (PDPs), as described by Greenwell et al. (2018). However, PDPs are compute intensive and cannot easily resolve interaction effects. Other approaches also exist in the literature. Garson (1991) partitions hidden-output connection weights into components associated with each input neuron using absolute values of connection weights. Olden and Jackson (2002) determine the relative importance of each predictor variable to the output of the model as a function of the weights, according to a simple linear expression.

We seek to understand the limitations on the choice of activation functions and understand the effect of increasing layers and numbers of neurons on probabilistic interpretability. For example, under standard Gaussian i.i.d. data, how robust is the model's estimate of the importance of each input variable as the number of neurons varies?
We turn to a ’white-box’ technique for determining the importance of the input variables. This approach generalizes Dimopoulos et al. (1995) to a deep neural network with interaction terms. Moreover, the method is directly consistent with how coefficients are interpreted in linear regression - they are model sensitivities. Model sensitivities are the change of the fitted model output w.r.t. input.
As a control, we shall use this property to empirically evaluate how reliably neural networks, even deep networks, learn data from a linear model.
Such an approach is appealing to practitioners who evaluate the comparative performance of linear regression with neural networks and need the assurance that a neural network model is at least able to reproduce and match the coefficients on a linear dataset.
We also offset the common misconception that the activation functions must be deactivated for a neural network model to produce a linear output. Under linear data, any non-linear statistical model should be able to reproduce a statistical linear model under some choice of parameter values. Irrespective of whether data is linear or non-linear in practice - the best control experiment for comparing a neural network estimator with an OLS estimator is to simulate data under a linear regression model. In this scenario, the correct model coefficients are known and the error in the coefficient estimator can be studied.
To evaluate fitted model sensitivities analytically, we require that the function $F$ is continuous and differentiable everywhere. Furthermore, for stability of the interpretation, we shall require that $F(x)$ is Lipschitz continuous (if Lipschitz continuity is not imposed, then a small change in one of the input values could result in an undesirably large variation in the derivative). That is, there is a positive real constant $C$ s.t. $\forall x_1, x_2 \in \mathbb{R}^p$, $|F(x_1) - F(x_2)| \le C |x_1 - x_2|$. Such a constraint is necessary for the first derivative to be bounded and hence for the derivatives with respect to the inputs to provide interpretability.

Fortunately, provided that the weights and biases are finite, each semi-affine function is Lipschitz continuous everywhere. For example, the function $\tanh(x)$ is continuously differentiable and its derivative, $1 - \tanh^2(x)$, is globally bounded. With finite weights, the composition of $\tanh(x)$ with an affine function is also Lipschitz. ReLU, $\max(x, 0)$, is not continuously differentiable, so the approach described here cannot be used with it. Note that for the following examples, we are indifferent to the choice of homoscedastic or heteroscedastic error, since the model sensitivities are independent of the error.
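In a linear regression model

$$\hat{Y} = F_{\beta}(X) := \beta_0 + \beta_1 X_1 + \dots + \beta_K X_K, \tag{10}$$

the model sensitivities are

$$\partial_{X_i} \hat{Y} = \beta_i. \tag{11}$$

In a FFWD neural network, we can use the chain rule to obtain the model sensitivities

$$\partial_{X_i} \hat{Y} = \partial_{X_i} F_{W,b}(X) = \partial_{X_i} \left(\sigma^{(L)}_{W^{(L)}, b^{(L)}} \circ \cdots \circ \sigma^{(1)}_{W^{(1)}, b^{(1)}}\right)(X). \tag{12}$$

For example, with one hidden layer, $\sigma(x) := \tanh(x)$ and $\sigma^{(1)}_{W^{(1)}, b^{(1)}}(X) := \sigma(I^{(1)}) := \sigma(W^{(1)} X + b^{(1)})$:

$$\partial_{X_j} \hat{Y} = \sum_i w^{(2)}_i \left(1 - \tanh^2(I^{(1)}_i)\right) w^{(1)}_{ij}, \quad \text{where } \partial_x \tanh(x) = 1 - \tanh^2(x). \tag{13}$$

In matrix form, with general $\sigma$, the Jacobian of $\sigma$ is $J(I^{(1)}) = D(I^{(1)}) W^{(1)}$ (when $\sigma$ is an identity function, the Jacobian is $J(I^{(1)}) = W^{(1)}$), so that

$$\partial_X \hat{Y} = W^{(2)} J(I^{(1)}) = W^{(2)} D(I^{(1)}) W^{(1)}, \tag{14}$$

where $D_{ii}(I) = \sigma'(I_i)$ and $D_{ij} = 0$ for $i \neq j$. Bounds on the sensitivities are given by the product of the weight matrices. For an activation with $0 \le \sigma'(x) \le 1$, such as $\tanh$,

$$\sum_i \min\left(w^{(2)}_i w^{(1)}_{ij}, 0\right) \le \partial_{X_j} \hat{Y} \le \sum_i \max\left(w^{(2)}_i w^{(1)}_{ij}, 0\right). \tag{15}$$

The model sensitivities can be readily generalized to an $L$ layer deep network by evaluating the Jacobian matrix:

$$\partial_X \hat{Y} = W^{(L)} J(I^{(L-1)}) = W^{(L)} D(I^{(L-1)}) W^{(L-1)} \cdots D(I^{(1)}) W^{(1)}. \tag{16}$$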
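As a hedged illustration of the closed-form sensitivities, the sketch below evaluates the Jacobian of a one-hidden-layer $\tanh$ network with a linear output unit and compares it with central finite differences. The weights are random placeholders rather than fitted values, and the function names `predict` and `sensitivities` are hypothetical. The last lines evaluate elementwise lower and upper bounds that follow from $0 < 1 - \tanh^2(x) \le 1$.

```python
import numpy as np

def predict(W1, b1, W2, b2, x):
    """One hidden tanh layer followed by a linear output layer."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

def sensitivities(W1, b1, W2, x):
    """Jacobian dY/dX = W2 diag(1 - tanh^2(I)) W1, with I = W1 x + b1."""
    I = W1 @ x + b1
    D = np.diag(1.0 - np.tanh(I) ** 2)
    return W2 @ D @ W1

rng = np.random.default_rng(2)
p, n = 4, 8                                 # inputs and hidden units (illustrative)
W1, b1 = rng.normal(size=(n, p)), rng.normal(size=n)
W2, b2 = rng.normal(size=(1, n)), rng.normal(size=1)
x = rng.normal(size=p)

J = sensitivities(W1, b1, W2, x)            # shape (1, p)

# Central finite differences, one input direction at a time.
h = 1e-6
J_fd = np.array([
    (predict(W1, b1, W2, b2, x + h * e) - predict(W1, b1, W2, b2, x - h * e)) / (2 * h)
    for e in np.eye(p)
]).T

# Elementwise bounds: each term w2_k * sigma'(I_k) * w1_kj lies between
# min(w2_k * w1_kj, 0) and max(w2_k * w1_kj, 0) because 0 < sigma' <= 1.
C = W2.T * W1                               # entries w2_k * w1_kj, shape (n, p)
lower = np.minimum(C, 0.0).sum(axis=0)
upper = np.maximum(C, 0.0).sum(axis=0)
```

The agreement between `J` and `J_fd` is a cheap sanity check that can be reused when the same formula is ported to a fitted network.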
To illustrate our interpretability approach, we shall consider a simple example. The model is trained to the following data generation process where the coefficients of the features are stepped and the error, here, is i.i.d. uniform:
(17) |
Figure 2 shows the ranked importance of the input variables in a neural network with one hidden layer. Our interpretability method is compared with well known black-box interpretability methods such as Garson’s algorithm (Garson, 1991) and Olden’s algorithm (Olden and Jackson, 2002). Our approach is the only technique to interpret the fitted neural network which is consistent with how a linear regression model would interpret the input variables.
The previous example is too simplistic to illustrate another important property of our interpretability method, namely the ability to capture pairwise interaction terms. The pairwise interaction effects are readily available by evaluating the elements of the Hessian matrix. For example, with one hidden layer, the Hessian takes the form:

$$\partial^2_{X_i X_j} \hat{Y} = \sum_k w^{(2)}_k \, \sigma''(I^{(1)}_k) \, w^{(1)}_{ki} w^{(1)}_{kj}, \tag{18}$$

where it is assumed that the activation function is at least twice differentiable everywhere, e.g. $\tanh(x)$.
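The interaction terms can be checked numerically in the same way as the sensitivities. The sketch below assumes a single output unit, one hidden $\tanh$ layer and placeholder random weights; `gradient` and `hessian` are hypothetical helper names. It uses $\partial_x^2 \tanh(x) = -2 \tanh(x)(1 - \tanh^2(x))$.

```python
import numpy as np

def gradient(W1, b1, w2, x):
    """Sensitivities of the scalar output with respect to each input."""
    t = np.tanh(W1 @ x + b1)
    return W1.T @ (w2 * (1.0 - t ** 2))

def hessian(W1, b1, w2, x):
    """Pairwise interaction terms: sum_k w2_k * tanh''(I_k) * w1_ki * w1_kj."""
    t = np.tanh(W1 @ x + b1)
    d2 = -2.0 * t * (1.0 - t ** 2)          # second derivative of tanh
    return W1.T @ np.diag(w2 * d2) @ W1

rng = np.random.default_rng(7)
p, n = 3, 5                                 # inputs and hidden units (illustrative)
W1, b1, w2 = rng.normal(size=(n, p)), rng.normal(size=n), rng.normal(size=n)
x = rng.normal(size=p)

H = hessian(W1, b1, w2, x)

# Finite-difference check: differentiate the gradient along each input direction.
h = 1e-5
H_fd = np.column_stack([
    (gradient(W1, b1, w2, x + h * e) - gradient(W1, b1, w2, x - h * e)) / (2 * h)
    for e in np.eye(p)
])
```

The off-diagonal entry `H[i, j]` is the interaction effect between inputs `i` and `j` at the point `x`; with a piecewise linear activation it would be zero almost everywhere, which is why a twice differentiable activation is required.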
To illustrate our input variable and interaction effect ranking approach, we will use one of the classical benchmark regression problems described in (Friedman, 1991) and (Breiman, 1996). The input space consists of ten i.i.d. uniform $U(0,1)$ random variables; however, only five out of these ten actually appear in the true model. The response is related to the inputs according to the formula

$$Y = 10 \sin(\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \epsilon,$$

using white noise error, $\epsilon \sim N(0, \sigma^2)$. We fit a NN with one hidden layer containing eight units and a weight decay of 0.01 (these parameters were chosen using 5-fold cross-validation) to 500 observations simulated from the above model with $\sigma = 1$. The cross-validated $R^2$ value was 0.94.

In this section, we demonstrate the estimation properties of neural network sensitivities applied to data simulated from a linear model. We show that the sensitivities in a neural network are consistent with the linear model, even if the neural network model is non-linear. We also show that the confidence intervals, estimated by sampling, converge with increasing hidden units.
We generate 400 simulated training samples from the following linear model with i.i.d. Gaussian error:
(19) |
Table 1 compares an OLS estimator with a zero hidden layer feedforward network (NN$_0$) and a one hidden layer feedforward network with 10 hidden neurons and tanh activation functions (NN$_1$). The functional form of the first two regression models is equivalent, although the OLS estimator has been computed using a matrix solver whereas the zero hidden layer network parameters have been fitted with stochastic gradient descent.

Model | Intercept | Sensitivity of $X_1$ | Sensitivity of $X_2$ |
---|---|---|---|
OLS | 0.01 | 1.0154 | 1.018 |
NN$_0$ | 0.02 | 1.0184141 | 1.02141815 |
NN$_1$ | 0.02 | 1.013887 | 1.02224 |
The fitted parameter values will vary slightly with each optimization as the stochastic gradient descent is randomized. However, the sensitivity terms are given in closed form and easily mapped to the linear model. In an industrial setting, such a one-to-one mapping is useful for migrating to a deep factor model where, for model validation purposes, compatibility with linear models should be recovered in a limiting case. Clearly, if the data is not generated from a linear model, then the parameter values would vary across models.
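The control experiment can be reproduced in outline with NumPy. The data generating process below (unit coefficients on two standard normal inputs, standard normal noise) is an assumption chosen to be consistent with sensitivities near one; it is not taken from the paper. Full-batch gradient descent is used as a deterministic stand-in for stochastic gradient descent when fitting the zero hidden layer network.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=(n, 2))
# Assumed linear model: unit coefficients and unit-variance Gaussian noise.
y = X[:, 0] + X[:, 1] + rng.normal(size=n)

# OLS via a matrix solver, with an intercept column.
A = np.column_stack([np.ones(n), X])
theta_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

# Zero hidden layer network: the same affine map, fitted by gradient descent
# on the mean squared error.
theta_nn = np.zeros(3)
learning_rate = 0.1
for _ in range(2000):
    grad = 2.0 / n * A.T @ (A @ theta_nn - y)
    theta_nn -= learning_rate * grad

# For an affine map the sensitivities are simply the slope coefficients.
sensitivities_nn = theta_nn[1:]
```

With no hidden layer the two estimators minimise the same convex loss, so any difference between them reflects the optimizer rather than the model, which is the property exploited in Table 1.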
Figure 5 and Tables 2 and 3 show the empirical distribution of the fitted sensitivities using the single hidden layer model with increasing hidden units. The distributions sharpen as the number of hidden units increases: the standard deviations reported in Tables 2 and 3 fall monotonically. The confidence intervals are estimated under a non-parametric distribution.
In general, provided the weights and biases of the network are finite, the variances of the sensitivities are bounded for any input and choice of activation function.
Probabilistic bounds on the variance of the Jacobians, for the case when the network is ReLU activated, are given in Appendix B. We do not recommend using ReLU activation because it does not permit identification of the interaction terms and has provably non-convergent sensitivity variances as a function of the number of hidden units (see Appendix B).

Figure 5: (a) density of the fitted sensitivity to $X_1$; (b) density of the fitted sensitivity to $X_2$.
Hidden Units | Mean | Median | Std.dev | 1% C.I. | 99% C.I. |
---|---|---|---|---|---|
2 | 0.980875 | 1.0232913 | 0.10898393 | 0.58121675 | 1.0729908 |
10 | 0.9866159 | 1.0083131 | 0.056483902 | 0.76814914 | 1.0322522 |
50 | 0.99183553 | 1.0029879 | 0.03123002 | 0.8698967 | 1.0182846 |
100 | 1.0071343 | 1.0175397 | 0.028034585 | 0.89689034 | 1.0296803 |
200 | 1.0152218 | 1.0249312 | 0.026156902 | 0.9119074 | 1.0363332 |
Hidden Units | Mean | Median | Std.dev | 1% C.I. | 99% C.I. |
---|---|---|---|---|---|
2 | 0.98129386 | 1.0233982 | 0.10931312 | 0.5787732 | 1.073728 |
10 | 0.9876832 | 1.0091512 | 0.057096474 | 0.76264584 | 1.0339714 |
50 | 0.9903236 | 1.0020974 | 0.031827927 | 0.86471796 | 1.0152498 |
100 | 0.9842479 | 0.9946766 | 0.028286876 | 0.87199813 | 1.0065105 |
200 | 0.9976638 | 1.0074166 | 0.026751818 | 0.8920307 | 1.0189484 |
This section presents the application of our framework to the fitting of a fundamental factor model under heteroscedastic errors, as in the BARRA model. The BARRA model includes many more explanatory variables than used in experiments below, but the purpose, here, is to illustrate the application of our framework to financial data.
We define the universe as the top 250 stocks from the S&P 500 index, ranked by market cap. Factors are given by Bloomberg and reported monthly over the period from November 2008 to November 2018. We remove stocks with too many missing factor values, leaving 218 stocks.
We follow the two-step procedure defined in Section 3. We begin by generating the model errors from our neural network over each period $t$ in the dataset. The historical factors are inputs to the model and are standardized to enable model interpretability. These factors are (i) current enterprise value; (ii) Price-to-Book ratio; (iii) current enterprise value to trailing 12 month EBITDA; (iv) Price-to-Sales ratio; (v) Price-to-Earnings ratio; and (vi) log market cap. The responses are the monthly asset returns for each stock in our universe.

We use a moving window of observations to estimate the sample covariance of the model error. For each $t$, the covariance matrix is re-estimated and used to weight the error in the refined model fitting. The next period monthly asset returns are then forecasted.
We use TensorFlow (Abadi et al., 2016) to implement a two hidden layer feed-forward network and develop a custom implementation for the generalized least squares error and variable sensitivities. The GLS model is implemented based on the two step procedure using the OLS regression in the Python StatsModels module.
All results are shown using regularization and activation functions. The number of hidden units and regularization parameters are found by five-fold cross-validation and reported alongside the results.
Figure 6 compares the performance of a GLS estimator with the feed-forward neural network with 50 hidden units in the first hidden layer, 10 hidden units in the second layer, and the regularization parameter $\lambda$ found by cross-validation. Both models are fitted with heteroscedastic error and the in-sample performance is compared with the out-of-sample error.

Figure 6: (a) General Linear Regression; (b) FFWD Neural Network.
Figure 7 shows the in-sample MSE as a function of the number of hidden units in the hidden layer. The neural networks are trained here without regularization to demonstrate the effect of solely increasing the number of hidden units in the first layer. Increasing the number of hidden units reduces the bias in the model.
Figure 8 shows the effect of regularization on the MSE errors for a network with 50 units in the first hidden layer. Increasing the level of regularization increases the in-sample bias but reduces the out-of-sample bias, and hence the variance of the estimation error.
Figure 8: (a) In-sample; (b) Out-of-sample.
Figure 9 shows the sensitivities of the GLS regression model and neural network models to each factor over a 48 month period from November 2014 to November 2018. The black line shows the sensitivities evaluated by our neural network based method. The red line shows the same sensitivities using linear regression.
Figure 10 compares the distribution of sensitivities to each factor over a 48 month period from November 2014 to November 2018 as estimated by neural networks and GLS regression.
Finally, although not the primary purpose of this paper, we provide evidence that our neural network factor model generates higher information ratios than the linear factor model when used to sort portfolios from our universe. Figure 11 shows the information ratios of a portfolio selection strategy which selects the stocks with the highest predicted monthly returns. The information ratios are evaluated for various size portfolios, using the S&P 500 index as the benchmark.
The information ratios are observed to increase by a factor of approximately three. Also shown, for control, are randomly selected portfolios. (Note that the reason why the information ratios are always positive, even under random portfolio selection, is because we have defined a universe of only 218 stocks, with sufficient historical data available over the period from November 2014 to November 2018.) We observe in some cases that even though the factors in the linear regression barely exceed white noise, they are indeed predictive when used in a neural network model. Even for this limited set of factors, one can produce positive information ratios with neural networks.

Figure 11: (a) General Linear Regression; (b) FFWD Neural Network; (c) White Noise.
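For readers who want to reproduce the portfolio comparison, the following is a minimal sketch of the two calculations involved. The equal weighting of the selected stocks, the annualization by the square root of the number of periods per year, and the function names are assumptions of this sketch; the paper does not specify them in the text above.

```python
import numpy as np

def top_n_portfolio_return(predicted, realized, n):
    """Equal-weighted realized return of the n stocks with the highest predicted return."""
    predicted = np.asarray(predicted, dtype=float)
    realized = np.asarray(realized, dtype=float)
    selected = np.argsort(predicted)[-n:]
    return realized[selected].mean()

def information_ratio(portfolio_returns, benchmark_returns, periods_per_year=12):
    """Mean active return divided by its sample standard deviation, annualized."""
    active = np.asarray(portfolio_returns, dtype=float) - np.asarray(benchmark_returns, dtype=float)
    return np.sqrt(periods_per_year) * active.mean() / active.std(ddof=1)

# Toy usage with made-up numbers: three stocks, choose the top two by prediction.
example_return = top_n_portfolio_return([0.1, 0.3, 0.2], [1.0, 2.0, 3.0], n=2)

# Toy usage: two periods of portfolio and benchmark returns, no annualization.
example_ir = information_ratio([0.02, 0.05], [0.01, 0.02], periods_per_year=1)
```

In practice `top_n_portfolio_return` would be applied month by month to the model forecasts, and the resulting return series passed to `information_ratio` with the S&P 500 index returns as the benchmark.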
An important aspect in adoption of neural networks in factor modeling is the existence of a statistical framework which provides the transparency and statistical interpretability of linear least squares estimation. Moreover, one should expect to use such a framework applied to linear data and obtain similar results to linear regression, thus isolating the effects of non-linearity versus the effect of using different optimization algorithms and model implementation environments.
In this paper, we introduce a deep learning framework for fundamental factor modeling which generalizes the linear fundamental factor models by capturing non-linearity, interaction effects and non-parametric shocks in financial econometrics. Our framework provides interpretability, with confidence intervals, and ranking of the factor importances and interaction effects. Moreover, our framework can be used to estimate factor realizations under either homoscedastic or heteroscedastic error. In the case when the network contains no hidden layers, our approach recovers a linear fundamental factor model. The framework allows the impact of non-linearity and of the non-parametric treatment of the error on the factors to be studied over time, and forms the basis for generalized interpretability of fundamental factors. Our neural network model is observed to generate information ratios which are a factor of three higher than generalized linear regression.
Dixon, M. (2018). A high frequency trade execution model for supervised learning. High Frequency 1(1), 32–52.

Partial Dependence Plots (PDPs) evaluate the expected output w.r.t. the marginal density function of each input variable, and allow the importance of the predictors to be ranked. More precisely, partitioning the data $X$ into an interest set, $X_s$, and its complement, $X_c = X \setminus X_s$, the "partial dependence" of the response on $X_s$ is defined as

$$f_s(X_s) = \mathbb{E}_{X_c}\left[\hat{f}(X_s, X_c)\right] = \int \hat{f}(X_s, X_c) \, p_c(X_c) \, dX_c, \tag{20}$$

where $p_c(X_c)$ is the marginal probability density of $X_c$: $p_c(X_c) = \int p(x) \, dx_s$. Equation (20) can be estimated from a set of training data by

$$\bar{f}_s(X_s) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(X_s, X_{i,c}), \tag{21}$$

where $X_{i,c}$ $(i = 1, 2, \dots, n)$ are the observations of $X_c$ in the training set; that is, the effects of all the other predictors in the model are averaged out. There are a number of challenges with using PDPs for model interpretability. First, the interaction effects are ignored by the simplest version of this approach. While Greenwell et al. (2018) propose a methodology extension to potentially address the modeling of interactive effects, PDPs do not provide a 1-to-1 correspondence with the coefficients in a linear regression. Instead, we would like to know, under strict control conditions, how the fitted weights and biases of the MLP correspond to the fitted coefficients of linear regression. Moreover, in the context of neural networks, by treating the model as a black box, it is difficult to gain theoretical insight into how the choice of the network architecture affects its interpretability from a probabilistic perspective.
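The training-data estimate of partial dependence amounts to fixing the column of interest at a grid value and averaging the model output over the observed values of the remaining columns. The sketch below is a generic NumPy version; the toy model, the data and the name `partial_dependence` are illustrative assumptions.

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Average model output over the training rows with one column held fixed."""
    X = np.asarray(X, dtype=float)
    values = []
    for value in grid:
        X_fixed = X.copy()
        X_fixed[:, feature] = value         # overwrite the feature of interest
        values.append(model(X_fixed).mean())
    return np.array(values)

rng = np.random.default_rng(4)
X = rng.uniform(size=(200, 3))

def toy_model(Z):
    """Additive term in column 0 plus an interaction between columns 1 and 2."""
    return 2.0 * Z[:, 0] + Z[:, 1] * Z[:, 2]

grid = np.linspace(0.0, 1.0, 5)
pd_feature0 = partial_dependence(toy_model, X, feature=0, grid=grid)
pd_feature1 = partial_dependence(toy_model, X, feature=1, grid=grid)
```

For column 1 the estimate is a straight line whose slope is the sample mean of column 2, so the interaction between columns 1 and 2 is averaged out rather than identified, which is the first limitation noted above. The cost is one full pass over the training set per grid point.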
Garson (1991) partitions hidden-output connection weights into components associated with each input neuron using absolute values of connection weights. Garson’s algorithm uses the absolute values of the connection weights when calculating variable contributions, and therefore does not provide the direction of the relationship between the input and output variables.
Olden and Jackson (2002) determine the relative importance, $r_{ij}$, of the $i$-th output to the $j$-th predictor variable of the model as a function of the weights, according to the expression

$$r_{ij} = \sum_{k} W^{(2)}_{ik} W^{(1)}_{kj}. \tag{22}$$

The approach does not account for the non-linearity introduced by the activation, which is one of the most critical aspects of the model. Furthermore, the approach presented was limited to a single hidden layer.
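Both weight-based measures are simple to compute from the fitted weight matrices. The sketch below implements the connection weight product described above and one common formulation of Garson's algorithm for a single output unit; the exact normalization used by Garson (1991) may differ, so the second function should be read as an assumption. The small weight matrices are made-up values.

```python
import numpy as np

def olden_importance(W1, W2):
    """Connection weights: sum over hidden units of W2[i, k] * W1[k, j]."""
    return W2 @ W1

def garson_importance(W1, w2):
    """One common formulation of Garson's algorithm for a single output unit.

    Absolute weights are used, so the sign of each relationship is lost.
    """
    contrib = np.abs(W1) * np.abs(w2)[:, None]            # (hidden, inputs)
    share = contrib / contrib.sum(axis=1, keepdims=True)  # split each hidden unit across inputs
    total = share.sum(axis=0)
    return total / total.sum()

W1 = np.array([[1.0, -2.0],
               [0.5, 1.0]])        # hidden x inputs
W2 = np.array([[2.0, -1.0]])       # outputs x hidden

olden = olden_importance(W1, W2)
garson = garson_importance(W1, W2[0])
```

The connection weight product coincides with the model sensitivities only when the activation is the identity; with a non-linear activation the sensitivities depend on the input through the derivative of the activation, which neither measure captures.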
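Consider a ReLU activated single layer feed-forward network. In matrix form, with $\sigma(x) = \max(x, 0)$, the Jacobian, $J$, can be written as a linear combination of Heaviside functions:

$$J := J(X) = \partial_X \hat{Y}(X) = W^{(2)} H(W^{(1)} X + b^{(1)}) W^{(1)}, \tag{23}$$

where $H_{ii}(Z) = H(Z_i) = \mathbb{1}_{Z_i > 0}$, $H_{ij} = 0$, $j \neq i$. The Jacobian can be written in matrix element form as

$$J_{ij} = [\partial_X \hat{Y}]_{ij} = \sum_{k=1}^{n} w^{(2)}_{ik} w^{(1)}_{kj} H(I^{(1)}_k) = \sum_{k=1}^{n} c_k H_k(I), \tag{24}$$

where $c_k := c_{ijk} := w^{(2)}_{ik} w^{(1)}_{kj}$ and $H_k(I) := H(I^{(1)}_k)$. As a linear combination of indicator functions, we have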
(25) |
Alternatively, the Jacobian can be expressed in terms of
(26) |
In the case when , this simplifies to a weighted sum of independent Bernoulli trials:
(27) |
where . The expectation of the Jacobian is given by
(28) |
where For finite weights, the expectation is bounded above by . We can write the variance of the Jacobian as:
(29) |
If the weights are constrained so that the mean is constant, , then the weights are . Then the variance is bounded by the mean:
(30) |
Otherwise, in general, we have the bound:
(31) |
The fact that the variance is slightly monotonically increasing with the number of hidden units, combined with the inability to evaluate interaction terms, are two good reasons to avoid using ReLU activation functions.
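We can also derive probabilistic bounds on the Jacobians. Let $\delta > 0$ and $a_1, \dots, a_n$ be reals in $(0, 1]$. Let $X_1, \dots, X_n$ be independent Bernoulli trials with $\mathbb{E}[X_k] = p_k$ and let $J = \sum_{k=1}^{n} a_k X_k$, so that

$$\mathbb{E}[J] = \sum_{k=1}^{n} a_k p_k = \mu. \tag{32}$$

The Chernoff-type bound exists on deviations of $J$ above the mean

$$\Pr\left(J > (1 + \delta)\mu\right) < \left[\frac{e^{\delta}}{(1 + \delta)^{1 + \delta}}\right]^{\mu}. \tag{33}$$

A similar bound exists for deviations of $J$ below the mean. For $\gamma \in (0, 1]$:

$$\Pr\left(J - \mu < -\gamma\mu\right) < e^{-\gamma^2 \mu / 2}. \tag{34}$$

These bounds are generally weak and are suited to large deviations, i.e. the tail regions. The bounds are shown in the figure below for different values of $\mu$ - $\mu$ is increasing towards the upper right hand corner of the plot.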
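To give a feel for how loose such bounds are, the sketch below evaluates the standard Chernoff-type bounds for a weighted sum of independent Bernoulli trials with weights in $(0, 1]$ and compares them with Monte Carlo tail frequencies. The specific functional forms used here, $[e^{\delta}/(1+\delta)^{1+\delta}]^{\mu}$ for the upper tail and $e^{-\gamma^2 \mu / 2}$ for the lower tail, are the textbook versions and are assumed to match the ones intended above; the weights and probabilities are random placeholders.

```python
import numpy as np

def chernoff_upper(mu, delta):
    """Bound on P(S > (1 + delta) * mu) for a weighted Bernoulli sum S with mean mu."""
    return (np.exp(delta) / (1.0 + delta) ** (1.0 + delta)) ** mu

def chernoff_lower(mu, gamma):
    """Bound on P(S < (1 - gamma) * mu), for gamma in (0, 1]."""
    return np.exp(-0.5 * gamma ** 2 * mu)

rng = np.random.default_rng(5)
n = 50
a = rng.uniform(0.1, 1.0, size=n)           # weights in (0, 1]
p = rng.uniform(0.2, 0.8, size=n)           # success probabilities
mu = float(a @ p)

# Monte Carlo draws of the weighted sum of independent Bernoulli trials.
trials = (rng.random(size=(100_000, n)) < p).astype(float)
draws = trials @ a

delta = gamma = 0.3
upper_tail = np.mean(draws > (1.0 + delta) * mu)
lower_tail = np.mean(draws < (1.0 - gamma) * mu)
upper_bound = chernoff_upper(mu, delta)
lower_bound = chernoff_lower(mu, gamma)
```

In this configuration the simulated tail frequencies sit far below the bounds, consistent with the remark that the bounds are weak away from the extreme tails.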