Recently, there has been rapid and significant success in applying machine learning methods to a wide range of applications including vision (Szeliski:2010:CVA:1941882), text categorization (Sebastiani:2002:MLA:505282.505283), medicine (medical_imaging), finance (finance), etc. In sensitive applications such as medicine, we would like to explain test-time model predictions to humans. An important question is why the model makes a certain prediction for a particular test sample. One way to address this is to trace model predictions back to the training data. More specifically, one can ask which training samples were the most influential ones for a given test prediction.
Influence functions (cook_influence) from robust statistics measure the dependency of optimal model parameters on training samples. Previously, (influence1) used first-order approximations of influence functions to estimate how much model parameters would change if a training point were up-weighted by an infinitesimal amount. Such an approximation can be used to identify the most influential training samples for a test prediction. Moreover, this approximation is similar to leave-one-out re-training; thus the first-order influence function proposed in (influence1) bypasses the expensive process of repeatedly re-training the model to find influential training samples for a test-time prediction.
In some applications, one may want to understand how model parameters would change when large groups of training samples are removed from the training set. This could be useful to identify groups of training data which drive the decision for a particular test prediction. As shown in (influence2), finding influential groups can be useful in real-world applications such as diagnosing batch effects (sc-Batch), apportioning credit between different data sources (credit), understanding effects of different demographic groups (demograph) or in a multi-party learning setting (multiparty). (influence2) approximates the group influence by the sum of first-order individual influences over training samples in the considered group. However, removal of a large group from training can lead to a large perturbation to model parameters. Therefore, influence functions based on first-order approximations may not be accurate in this setup. Moreover, approximating the group influence by adding individual sample influences ignores possible cross-correlations that may exist among samples in the group.
In this paper, we relax the first-order approximations of current influence functions and study how second-order approximations can be used to capture model changes when a potentially large group of training samples is up-weighted. Considering a training set and a group , existing first-order approximations of the group influence function (influence2) can be written as the sum of first-order influences of individual points. That is,
where is the first-order group influence function and is the first-order influence for the sample in . On the other hand, our proposed second-order group influence function has the following form:
captures informative cross-dependencies among samples in the group and is a function of gradient vectors and the Hessian matrix evaluated at the optimal model parameters. We present a more precise statement of this result in Theorem 1. We note that the proposed second-order influence function can be computed efficiently even for large models. We discuss its computational complexity in Section 4.2.
Our analysis shows that the proposed second-order influence function captures model changes efficiently even when the size of the groups is relatively large. For example, in an MNIST classification problem using logistic regression, when of the training samples are removed, the correlation between the ground truth estimate and second-order influence values improves by over when compared to the existing first-order influence values. We note that higher-order influence functions have been used in statistics (highorder) for point and interval estimates of non-linear functionals in parametric, semi-parametric and non-parametric models. However, to the best of our knowledge, this is the first time higher-order influence functions have been used for the interpretability task in the machine learning community.
Similar to (influence1), our main results for the second-order influence functions hold for linear prediction models where the underlying optimization is convex. Next, we explore the effectiveness of both first-order and second-order group influence functions in the case of deep neural networks. Using a fully-connected network with two hidden layers and sigmoid activations, we train a classifier on the MNIST dataset with 10 labels. We observe that neither the existing first-order nor the proposed second-order influence functions provide good estimates of the ground-truth influence over all training samples (note that the experiments of (influence1) focus on the most influential training samples, not all). We observe that the influence estimates in the case of deep neural networks are very low when compared to the ground truth influences. We take the eigen-decomposition of the Hessian matrix in the first-order influence function and empirically show that the contribution to the influence score from each component of the eigen-decomposition is low when compared to linear models.
In summary, we make the following contributions:
We propose second-order group influence functions that consider cross dependencies among the samples in the considered group.
Through several experiments over linear models and across different group sizes, we show that the second-order influence estimates have higher correlations with the ground truth compared to the first-order ones.
In the case of deep neural networks, we show that neither the first-order nor the second-order influence functions provide good estimates of the ground truth. We show that the contribution to the influence score from each eigenvector of the Hessian matrix is smaller compared to that of linear models.
2 Related Works
Influence functions, a classical technique from robust statistics introduced by (cook_influence; cook_inf_2), were first used in the machine learning community for interpretability by (influence1) to approximate the effect of upweighting a training point on the model parameters. This approximate change in the parameters could be used to compute the change in loss for a particular test point when a sample in the training set is removed or its weight is perturbed by an infinitesimal amount. More recently, (pmlr_influence) focused on the behaviour of influence functions on self-loss. Influence functions can also be used to craft adversarial training images which are visually indistinguishable from the original images and can be used to flip labels at test time. In the past few years, there has been an increase in the applications of influence functions for a variety of machine learning tasks. (Schulam2019CanYT) used influence functions to produce confidence intervals for a prediction and to audit the reliability of predictions. (model_fairness) used influence functions to approximate the gradient in order to recover a counterfactual distribution and increase model fairness. (koh2019stronger) crafted stronger data poisoning attacks using influence functions. More recently, influence functions were used to detect extrapolation (extrapolation) and validate causal inference models (causal).
We consider the classical supervised learning problem setup, where the task is to learn a function (also called the hypothesis) mapping from the input space to an output space . We denote the input-output pair as . We assume that our learning algorithm is given training examples drawn i.i.d. from some unknown distribution . Let be the space of the parameters of the considered hypothesis class. The goal is to select model parameters to minimize the empirical risk as follows:
where , denotes the cardinality of the training set, the subscript indicates that the whole set is used in training and is the associated loss function. We refer to the optimal parameters computed by the above optimization as.
Let and be the gradient and the Hessian of the loss function, respectively.
First, we discuss the case where we want to compute the effect of an individual training sample on optimal model parameters as well as the test predictions made by the model. The effect or influence of a training sample on the model parameters could be characterized by removing that particular training sample and retraining the model as follows:
Then, we can compute the change in model parameters as , due to removal of a training point . However, re-training the model for every such training sample is expensive when is large. Influence functions based on first-order approximations, introduced by (cook_influence; cook_inf_2), were used by (influence1) to approximate this change. Up-weighting a training point by an infinitesimal amount leads to new optimal model parameters, , obtained by solving the following optimization problem:
Removing a point is similar to up-weighting its corresponding weight by . The main idea used by (influence1) is to approximate by minimizing the first-order Taylor series approximation around . Following the classical result by (cook_influence), the change in the model parameters on up-weighting can be approximated by the influence function (influence1) denoted by :
A detailed proof can be found in (influence1). Using the given formulation, we can track the change with respect to any function of . The change in the test loss for a particular test point when a training point is up-weighted can be approximated as a closed form expression:
This result is based on the assumption (influence1) that the loss function is strictly convex in the model parameters and the Hessian is therefore positive-definite. This approximation is very similar to forming a quadratic approximation around the optimal parameters and taking a single Newton step. However, explicitly computing and its inverse is not required. Using Hessian-vector products (Pearlmutter), the influence function can be computed efficiently.
4 Group Influence Function
Our goal in this section is to understand how the model parameters would change if a particular group of samples in the training set was up-weighted. However, up-weighting a group can lead to large perturbations to the training data distribution and therefore to the model parameters, which violates the small perturbation assumption of the first-order influence functions. In this section, we extend influence functions using second-order approximations to better capture changes in model parameters due to up-weighting a group of training samples. In Section 4.1, we show that our proposed second-order group influence function can be used in conjunction with optimization techniques to select the most influential training groups in a test prediction.
The empirical risk minimization (ERM) when we remove samples from training can be written as:
To approximate how the optimal solution of this optimization is related to , we study the effect of up-weighting a group of training samples on model parameters. Note that in this case, the updated weights should still form a valid distribution, i.e. if a group of training samples has been up-weighted, the rest of the samples should be down-weighted to preserve the sum-to-one constraint of weights in the ERM formulation. In the individual influence function case (when the size of the group is one), up-weighting a sample by leads to down-weighting other samples by , whose effect can be neglected, similar to the formulation of (influence1).
In our formulation for the group influence function, we assume that the weights of samples in the set have all been up-weighted by and use to denote the fraction of up-weighted training samples. This leads to a down-weighting of the rest of the training samples by , to preserve the empirical weight distribution of the training data. Therefore, the resulting ERM can be written as:
In the above formulation, if we get the original loss function (where none of the training samples are removed) and if , we get the loss function (where samples are removed from training).
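The re-weighted objective described above can be written out as follows. The symbols are assumptions chosen for illustration: $\mathcal{S}$ is the training set, $\mathcal{U}$ the up-weighted group, $\ell$ the per-sample loss, $\epsilon$ the up-weighting amount and $p$ the fraction of up-weighted samples. The down-weighting amount $\tilde{\epsilon}$ is determined by requiring the total weight to remain $|\mathcal{S}|$.

```latex
\begin{align}
L_{\mathcal{U}}^{\epsilon}(\theta)
  &= \frac{1}{|\mathcal{S}|}\Big(
       \sum_{z \in \mathcal{S}\setminus\mathcal{U}} (1-\tilde{\epsilon})\,\ell(z;\theta)
     + \sum_{z \in \mathcal{U}} (1+\epsilon)\,\ell(z;\theta) \Big), \\
\tilde{\epsilon}
  &= \frac{|\mathcal{U}|\,\epsilon}{|\mathcal{S}|-|\mathcal{U}|}
   = \frac{p\,\epsilon}{1-p},
  \qquad p = \frac{|\mathcal{U}|}{|\mathcal{S}|}.
\end{align}
```

Under these assumptions, $\epsilon = 0$ recovers the original empirical risk, while $\epsilon = -1$ gives weight zero to the samples in $\mathcal{U}$ and weight $1/(1-p)$ to the rest, which is the average loss over the remaining $|\mathcal{S}|-|\mathcal{U}|$ samples.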
Let denote the optimal parameters for minimization. Essentially, we are concerned with the change in the model parameters (i.e. ) when each training sample in a group of size is upweighted by a factor of . The key step of the derivation is to expand around (the minimizer of , or ) with respect to the order of , the upweighting parameter. In order to do that, we use perturbation theory (perturbation) to expand around .
Frequently used in quantum mechanics and also in other areas of physics such as particle physics, condensed matter and atomic physics, perturbation theory finds an approximate solution to a problem () by starting from the exact solution of a closely related and simpler problem (). As gets smaller and smaller, the higher-order terms of the expansion become less significant. However, for large model perturbations (such as the case of group influence functions), using higher-order terms can reduce approximation errors significantly. The following perturbation series forms the core of our derivation for second-order influence functions:
where characterizes the first-order (in ) perturbation vector of model parameters while is the second-order (in ) model perturbation vector. We hide the dependencies of these perturbation vectors on constants (such as ) with the notation.
In the case of computing influence of individual points, as shown by (influence1), the scaling of is in the order of while the scaling of the second-order coefficient is which is very small when is large. Thus, in this case, the second-order term can be ignored. In the case of computing the group influence, the second-order coefficient is in the order of , which can be large when the size of is large. Thus, in our definition of the group influence function, both and are taken into account.
The first-order group influence function (denoted by ) when all the samples in a group are up-weighted by can be defined as:
To capture the dependency of the terms in , on the group influence function, we define as follows:
Although one can consider even higher-order terms, in this paper, we restrict our derivations up to the second-order approximations of the group influence function. We now state our main result in the following theorem:
If the third derivative of the loss function at is sufficiently small, the second-order group influence function (denoted by ) when all samples in a group are up-weighted by is:
The full proof of this result is presented in the appendix. This result is based on the assumption that the third-order derivatives of the loss function at are small. For the quadratic loss, the third-order derivatives of the loss are zero. Our experiments with the cross-entropy loss function indicate that this assumption approximately holds for the classification problem as well. Below, we present a sketch of the proof.
We now derive and to be used in the second-order group influence function . As is the optimal parameter set for the interpolated loss function, due to the first-order stationarity condition, we have the following equality:
The main idea is to use Taylor’s series for expanding around along with the perturbation series defined in Equation (9) and compare the terms of the same order in :
Similarly, we expand around using Taylor series expansion. To derive we compare terms with the coefficient of in Equation (13) and for we compare terms with coefficient . Based on this, can be written in the following way:
We expand Equation (13) and compare the terms with coefficient :
is the first-order approximation of group influence function and can be denoted by . Note that our first-order approximation of group influence function , is slightly different from (influence2) with an additional in the denominator. This extra term arises due to the conservation of the weight distribution of the training samples, which is essential when a large group is upweighted.
For we compare the terms with coefficients of the same order of in Equation (13):
For the term, we ignore the third-order term due to it being small. Now we substitute the value of and equate the terms with coefficient in the order of :
Rearranging the equation, we get the following identity:
It can be observed that the additional term () in our second-order approximation captures cross-dependencies among the samples in through a function of gradients and Hessians of the loss function at the optimal model parameters. This makes the second-order group influence function more informative when training samples are correlated. In Section 6, we empirically show that the addition of improves correlation with the ground truth influence as well.
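The order-by-order matching described in this proof sketch can be written out as follows. This is a reconstruction under assumed notation, not a verbatim copy of the original displays: $\mathcal{S}$ is the training set, $\mathcal{U}$ the group, $p = |\mathcal{U}|/|\mathcal{S}|$, $L_{\emptyset}$ the full empirical risk with minimizer $\theta^{*}$ and Hessian $H = \nabla^{2} L_{\emptyset}(\theta^{*})$, and $\theta^{\epsilon} = \theta^{*} + \epsilon\,\theta^{(1)} + \epsilon^{2}\theta^{(2)} + \dots$ is the perturbation series. Third-derivative terms are dropped, as in the theorem's assumption.

```latex
\begin{align}
0 &= \Big(1 - \tfrac{p\,\epsilon}{1-p}\Big)\nabla L_{\emptyset}(\theta^{\epsilon})
   + \frac{\epsilon}{(1-p)\,|\mathcal{S}|}\sum_{z\in\mathcal{U}} \nabla \ell(z;\theta^{\epsilon})
   && \text{(stationarity)} \\
0 &= H\,\theta^{(1)}
   + \frac{1}{(1-p)\,|\mathcal{S}|}\sum_{z\in\mathcal{U}} \nabla \ell(z;\theta^{*})
   && \text{(order } \epsilon\text{)} \\
0 &= H\,\theta^{(2)} - \frac{p}{1-p}\,H\,\theta^{(1)}
   + \frac{1}{(1-p)\,|\mathcal{S}|}\sum_{z\in\mathcal{U}} \nabla^{2}\ell(z;\theta^{*})\,\theta^{(1)}
   && \text{(order } \epsilon^{2}\text{)} \\
\theta^{(1)} &= -\frac{1}{1-p}\,\frac{1}{|\mathcal{S}|}\,H^{-1}\sum_{z\in\mathcal{U}} \nabla \ell(z;\theta^{*}) \\
\theta^{(2)} &= \frac{p}{1-p}\Big(I - H^{-1}\,\frac{1}{|\mathcal{U}|}\sum_{z\in\mathcal{U}} \nabla^{2}\ell(z;\theta^{*})\Big)\theta^{(1)}
\end{align}
```

The last line uses $\frac{1}{(1-p)|\mathcal{S}|} = \frac{p}{(1-p)|\mathcal{U}|}$. The factor $1/(1-p)$ in $\theta^{(1)}$ comes from preserving the weight distribution, and the group-averaged Hessian in $\theta^{(2)}$ is where the cross-dependencies among samples enter.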
For tracking the change in the test loss for a particular test point when a group is removed, we use the chain rule to compute the influence score as follows:
Our second-order approximation of group influence function consists of a first-order term that is similar to the one proposed in (influence2) with an additional scaling term . This scaling is due to the fact that our formulation preserves the empirical weight distribution constraint in ERM. Our second-order influence function has an additional term that is directly proportional to and captures large perturbations to the model parameters more effectively.
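The benefit of the second-order term can be checked numerically on a model where the re-weighted minimizer is available in closed form. The sketch below is a sanity check under stated assumptions, not the paper's experiment: it uses a least-squares loss (whose third derivatives vanish), a random toy dataset, a hypothetical group made of the first 40 samples, and first- and second-order perturbation coefficients of the form obtained by matching powers of the up-weighting parameter in the stationarity condition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
group = np.arange(40)   # hypothetical group U
p = len(group) / n      # fraction of up-weighted samples

def minimizer(eps):
    """Exact minimizer of the re-weighted risk with loss 0.5 * (x^T theta - y)^2."""
    w = np.full(n, 1.0 - p * eps / (1.0 - p))  # down-weight samples outside U
    w[group] = 1.0 + eps                        # up-weight samples in U
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

theta_star = minimizer(0.0)
H = X.T @ X / n                             # Hessian of the full empirical risk
resid = X @ theta_star - y
grad_sum_U = X[group].T @ resid[group]      # sum of per-sample gradients over U
H_U = X[group].T @ X[group] / len(group)    # mean per-sample Hessian over U

# First-order and second-order perturbation coefficients.
theta1 = -np.linalg.solve(H, grad_sum_U) / ((1.0 - p) * n)
theta2 = (p / (1.0 - p)) * (theta1 - np.linalg.solve(H, H_U @ theta1))

eps = -0.1
exact = minimizer(eps) - theta_star
err_first = np.linalg.norm(exact - eps * theta1)
err_second = np.linalg.norm(exact - (eps * theta1 + eps**2 * theta2))
print(err_first, err_second)
```

In this setting the first-order error shrinks like the square of the up-weighting parameter and the second-order error like its cube, so the second printed value should be the smaller one.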
4.1 Selection of influential groups
In this section, we explain how the second-order group influence function can be used to select the most influential group of training samples for a particular test prediction. In case of the existing first-order approximations for group influence functions, selecting the most influential group can be done greedily by ranking the training points with the highest individual influence since the group influence is the sum of influence of the individual points. However, with the second-order approximations such greedy selection is not optimal since the group influence is not additive in terms of the influence of individual points. To deal with this issue, we first decompose the second-order group influence function into two terms as:
where . While is additive with respect to the samples, has pairwise dependencies among samples.
To simplify notation, we define the constant vector as . Ideally for a given fixed group of size , we want to find training samples amongst the total training samples that maximize the influence for a given test point . We can define this in the form of a quadratic optimization problem as follows:
where is composed of two matrices and i.e. . contains the weights associated with each sample in the training set. The entries of contain and the rows of contain . In case of , the columns contain . We define the constant as and as . This optimization can be relaxed using the relaxation from compressed sensing (Donoho:2006:CS:2263438.2272089; Candes:2005:DLP:2263433.2271950). The relaxed optimization can then be solved efficiently using projected gradient descent (projection1; Duchi:2008:EPL:1390156.1390191).
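To make the relaxed optimization concrete, the following is a minimal sketch of projected gradient ascent with a sort-based Euclidean projection in the style of Duchi et al. The constraint set (non-negative weights summing to the group size k), the quadratic objective with a linear score vector `a` and a pairwise matrix `B`, and the final rounding step are assumptions made for illustration; they are not taken from the paper's exact formulation.

```python
import numpy as np

def project_to_scaled_simplex(v, k):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = k}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - k) / j > 0)[0][-1]
    tau = (css[rho] - k) / (rho + 1)
    return np.maximum(v - tau, 0.0)

def select_group(a, B, k, steps=500, lr=0.01):
    """Projected gradient ascent on a^T w + w^T B w over the scaled simplex."""
    w = np.full(len(a), k / len(a))
    for _ in range(steps):
        grad = a + (B + B.T) @ w
        w = project_to_scaled_simplex(w + lr * grad, k)
    return w

# Hypothetical inputs: additive scores and a concave pairwise term.
rng = np.random.default_rng(0)
n, k = 30, 5
a = rng.normal(size=n)
M = 0.05 * rng.normal(size=(n, n))
B = -(M @ M.T)
w = select_group(a, B, k)
chosen = np.argsort(w)[-k:]  # round the relaxed weights to k samples
print(sorted(chosen.tolist()))
```

Rounding to the k largest weights is one simple way to recover a discrete group from the relaxed solution; other rounding schemes are possible.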
4.2 Computational Complexity
For models with a large number of parameters, computing the inverse of the Hessian can be expensive and is of the order of . However, computing the Hessian-vector product (Pearlmutter) is relatively inexpensive. In our experiments, we used conjugate gradients (a second-order optimization technique) (conjugate_gradient) to compute the inverse Hessian-vector product, which uses a Hessian-vector product in the routine and thus avoids the expense of inverting the Hessian directly. The second-order group influence function can be computed similarly to the first-order approximations with an additional Hessian-vector product step.
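The inverse Hessian-vector product via conjugate gradients can be sketched as below. This is an illustrative NumPy version under assumptions: the Hessian is that of an L2-regularized logistic regression on random data, and `hvp` is a hypothetical helper that applies the Hessian to a vector without ever forming the matrix. In a deep learning framework the same role would be played by an automatic-differentiation Hessian-vector product.

```python
import numpy as np

def conjugate_gradient(hvp, b, tol=1e-10, max_iter=100):
    """Solve H x = b for symmetric positive-definite H, given only v -> H v."""
    x = np.zeros_like(b)
    r = b.copy()            # residual b - H x for x = 0
    direction = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Hd = hvp(direction)
        alpha = rs / (direction @ Hd)
        x = x + alpha * direction
        r = r - alpha * Hd
        rs_new = r @ r
        direction = r + (rs_new / rs) * direction
        rs = rs_new
    return x

# Assumed toy problem: logistic-regression Hessian with L2 regularization.
rng = np.random.default_rng(0)
n, d, lam = 500, 20, 0.01
X = rng.normal(size=(n, d))
theta = 0.5 * rng.normal(size=d)
q = 1.0 / (1.0 + np.exp(-(X @ theta)))
weights = q * (1.0 - q)

def hvp(v):
    # Hessian-vector product computed from matrix-vector products only.
    return X.T @ (weights * (X @ v)) / n + lam * v

g = rng.normal(size=d)                      # stand-in gradient vector
ihvp = conjugate_gradient(hvp, g)

# Explicit Hessian, formed here only to check the result.
H = (X * weights[:, None]).T @ X / n + lam * np.eye(d)
print(np.max(np.abs(ihvp - np.linalg.solve(H, g))))
```

Each iteration costs one Hessian-vector product, so the explicit inverse is never needed.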
5 Influence on Deep Networks
For both first-order and second-order influence functions, it is assumed that the empirical risk is strictly convex and twice differentiable. However, such assumptions do not hold in general for deep neural networks, where the loss function is non-convex. Previously (influence1) had shown that in the case of individual effects, there is still a satisfactory correlation between the approximated influence and the ground truth influence for deep networks when the Hessian is regularized. This observation holds only for the top few influential points.
The behavior of influence functions in the case of deep networks across all training points has not been well-explored. In our experiments, we observe that in this case both first-order and second-order influence function values are very small compared to the ground truth influences and lead to a low correlation. The details of these experiments can be found in the appendix. In order to understand this phenomenon, we take the eigen-decomposition of the Hessian matrix and study how different components contribute to the influence score. The eigen-decomposition of the Hessian matrix is defined as follows:
where is the number of model parameters and denotes the eigenvector of the Hessian matrix. We substitute the decomposed Hessian in the first-order influence function to obtain the following:
Note that the influence function is inversely related to eigenvalues. For each component, we compute its contribution to the influence score averaged across all the training points. More specifically, for the component we compute the score as:
We report the contribution of the largest component-score to the overall influence in the following table:
|Neural Network||Logistic Regression|
It can be observed that in case of neural networks, the contribution of each component to the overall influence is very small which leads to a low influence score in general. However, in the case of the logistic regression, where the influence function provides an estimate close to the ground truth influence, we observe that the individual components contribute significantly to the influence score leading to a better estimate in general. Note that this experiment was done using the Iris dataset (Dua:2019), on a relatively simple neural network with one hidden layer and with sigmoid activations consisting of only 27 parameters, whereas the logistic regression model had 15 parameters. Nevertheless, this highlights the importance of understanding influence functions when the prediction model is a deep neural network.
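The per-component decomposition of the first-order influence can be sketched as follows. The Hessian and gradients here are random stand-ins rather than quantities from a trained model. Averaging the absolute per-component contributions over training points is an assumption about how a component score could be formed, since the exact scoring formula is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(d, d))
H = A @ A.T / d + 0.1 * np.eye(d)    # stand-in positive-definite Hessian
g_train = rng.normal(size=(n, d))    # stand-in per-sample training gradients
g_test = rng.normal(size=d)          # stand-in test-loss gradient

# H = sum_i eigvals[i] * e_i e_i^T, with e_i the columns of eigvecs.
eigvals, eigvecs = np.linalg.eigh(H)

# contributions[j, i]: part of the influence of training point j
# that comes from eigen-component i (note the division by the eigenvalue).
contributions = -(g_train @ eigvecs) * (eigvecs.T @ g_test) / eigvals
influence = contributions.sum(axis=1)

# Direct computation for comparison.
direct = -g_train @ np.linalg.solve(H, g_test)

# Assumed score: mean absolute contribution of each component.
component_score = np.abs(contributions).mean(axis=0)
print(component_score)
```

Summing the contributions over components recovers the usual first-order influence, so the decomposition only redistributes the score across eigen-directions.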
Our goal through the experiments is to observe if the second-order approximations of group influence functions improve the correlation with the ground truth estimate. We compare the computed second-order group influence score with the ground truth influence (which is computed by leave--out retraining for a group with size ). Our metric for evaluation is the Pearson correlation which measures how linearly the computed influence and the actual ground truth estimate are related. We perform our experiments on logistic regression and neural networks.
In our first experiments, we use a synthetic dataset along with logistic regression. The synthetic dataset has 10,000 points drawn from a Gaussian distribution, consisting of 5 features and 2 classes. The details for the synthetic data can be found in the appendix. The second set of experiments is done with the standard handwritten digits database MNIST (mnist), which consists of 10 classes of different digits. For understanding how group influence functions behave in the case of neural networks, we use the MNIST dataset. For each of the two datasets, we pick random groups with sizes ranging from to of the entire training points. We investigated the computed influence score for a misclassified test point with the highest test loss.
6.3 Observations and Analysis
6.3.1 Logistic Regression
For logistic regression, the general observation was that the second-order group influence function improves the correlation with the ground truth estimates across different group sizes in both the synthetic dataset and MNIST. For both datasets, the gain in correlation was larger when the size of the considered group was large. For example, it can be seen in Fig. 2 that when more than of the samples were removed, the gain in correlation is almost always more than .
6.4 Neural Networks
In the case of neural networks, the Hessian is not guaranteed to be positive semi-definite, which violates the assumptions of influence functions. Previously, (influence1) regularized the Hessian in the form of , and showed that for the top few influential points (not groups), for a given test point, the correlation with the ground truth influence is still satisfactory, if not highly significant. For MNIST, we adopted a similar approach with a regularized Hessian with a value of and conducted experiments for a relatively simple feed-forward network with two hidden layers and sigmoid activations for both first-order and second-order group influence functions. The general observation was that both first-order and second-order group influence functions underestimate the ground truth influence values across different group sizes. However, we observed that while the second-order influence values still suffer from the underestimation issue, they improve the correlation marginally across different group sizes.
7 Conclusion and Future Work
In this paper, we proposed second-order group influence functions for approximating model changes when a group from the training set is up-weighted. Empirically, in the case of linear models and across different group sizes, we showed that the second-order influence has a higher correlation with ground truth values compared to the first-order ones. We showed that the second-order approximation is particularly informative when the size of the groups is large. We showed that the proposed second-order group influence function can be used in conjunction with optimization techniques to select the most influential group in the training set for a particular test prediction. For non-linear models like deep neural networks, we observed that both first-order and second-order influence functions provide low influence scores in comparison to the ground truth values. Developing proper influence functions for neural network models and training neural networks to have improved influence functions are among the interesting directions for future work.
Authors would like to acknowledge NSF award, . The authors would also like to thank Pang Wei Koh for insights about the neural network experiments.
Appendix A Proofs
If the third derivative of the loss function at is sufficiently small, the second-order group influence function (denoted by ) when all samples in a group are up-weighted by is:
We consider the empirical risk minimization problem where the learning algorithm is given training samples drawn i.i.d from some distribution . Let be the space of the parameters and be the hypothesis to learn. The goal is to select model parameters to minimize the empirical risk as follows:
The ERM problem with a subset removed from is as follows:
In our formulation for the group influence function, we assume that the weights of samples in the set have all been up-weighted by . This leads to a down-weighting of the remaining training samples by , to conserve the empirical weight distribution of the training data. We use to denote the fraction of up-weighted training samples. Therefore, the resulting ERM optimization can be written as:
and . We consider the stationary condition where the gradient of is zero. More specifically:
Next we expand around the optimal parameter using Taylor’s expansion and retrieve the terms with coefficients to find :
At the optimal parameters , and , thus simplifying Equation (A):
Substituting as , we get the following identity:
where is the first-order approximation of group influence function. We denote the first-order approximation as .
Next we derive by comparing terms to the order of (i.e. ) in Equation (30) by expanding around :
Taking the common coefficients out and rearranging Equation (36), we get the following identity:
Multiplying both sides of Equation (37) by the Hessian inverse, i.e. , we obtain the cross-term involving the gradients and the Hessians of the removed points in the second-order influence function as follows:
Appendix B Synthetic Dataset
The synthetic data was sampled from a Gaussian distribution with 5 dimensions and consisted of two classes. We sampled a total of 10,000 points for our experiments. For the first class, the mean was kept at 0.1 for all the dimensions and for the second class the mean was kept at 0.8. The covariance matrix in both classes was a diagonal matrix whose entries were sampled randomly between 0 and 1.
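A dataset of this kind could be generated along the following lines. This is a sketch under assumptions that the description leaves open: the two classes are taken to be balanced (5,000 points each), each class draws its own diagonal covariance entries uniformly from [0, 1], and the seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, dim = 10_000, 5
n_per_class = n_total // 2  # assumed class balance

def sample_class(mean_value, size):
    """Draw points from a Gaussian with constant mean and diagonal covariance."""
    variances = rng.uniform(0.0, 1.0, size=dim)  # diagonal covariance entries
    return rng.normal(loc=mean_value, scale=np.sqrt(variances), size=(size, dim))

X = np.vstack([sample_class(0.1, n_per_class), sample_class(0.8, n_per_class)])
y = np.concatenate([
    np.zeros(n_per_class, dtype=int),
    np.ones(n_per_class, dtype=int),
])
print(X.shape, y.mean())
```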