1 Introduction
Recently, there has been rapid and significant success in applying machine learning methods to a wide range of applications, including vision (Szeliski:2010:CVA:1941882), text classification (Sebastiani:2002:MLA:505282.505283), medicine (medical_imaging), and finance (finance). In sensitive applications such as medicine, we would like to explain test-time model predictions to humans. An important question is: why does the model make a certain prediction for a particular test sample? One way to address this is to trace model predictions back to the training data. More specifically, one can ask which training samples were the most influential for a given test prediction.
Influence functions (cook_influence) from robust statistics measure the dependency of optimal model parameters on training samples. Previously, (influence1) used first-order approximations of influence functions to estimate how much model parameters would change if a training point were upweighted by an infinitesimal amount. Such an approximation can be used to identify the training samples most influential for a test prediction. Moreover, this approximation is similar to leave-one-out retraining; thus the first-order influence function proposed in (influence1) bypasses the expensive process of repeatedly retraining the model to find the influential training samples for a test-time prediction.
In some applications, one may want to understand how model parameters would change when large groups of training samples are removed from the training set. This can be useful for identifying groups of training data that drive the decision for a particular test prediction. As shown in (influence2), finding influential groups can be useful in real-world applications such as diagnosing batch effects (scBatch), apportioning credit between different data sources (credit), understanding the effects of different demographic groups (demograph), or in a multi-party learning setting (multiparty). (influence2) approximates the group influence by the sum of first-order individual influences over the training samples in the considered group. However, removing a large group from training can lead to a large perturbation of the model parameters, so influence functions based on first-order approximations may not be accurate in this setup. Moreover, approximating the group influence by adding individual sample influences ignores possible cross-correlations that may exist among samples in the group.
In this paper, we relax the first-order approximations of current influence functions and study how second-order approximations can be used to capture model changes when a potentially large group of training samples is upweighted. Considering a training set and a group of samples within it, existing first-order approximations of the group influence function (influence2) can be written as the sum of first-order influences of individual points. That is,
where the left-hand side is the first-order group influence function and each summand is the first-order influence of an individual sample in the group. On the other hand, our proposed second-order group influence function has the following form:
where the additional second-order term
captures informative cross-dependencies among samples in the group and is a function of gradient vectors and the Hessian matrix evaluated at the optimal model parameters. We present a more precise statement of this result in Theorem 1. We note that the proposed second-order influence function can be computed efficiently even for large models; we discuss its computational complexity in Section 4.2. Our analysis shows that the proposed second-order influence function captures model changes effectively even when the size of the group is relatively large. For example, in an MNIST classification problem using logistic regression, when a large fraction
of the training samples is removed, the correlation between the ground-truth estimate and the second-order influence values improves substantially over the existing first-order influence values. We note that higher-order influence functions have been used in statistics (highorder) for point and interval estimates of nonlinear functionals in parametric, semiparametric and nonparametric models. However, to the best of our knowledge, this is the first time higher-order influence functions have been used for the interpretability task in the machine learning community.
Similar to (influence1), our main results for the second-order influence functions hold for linear prediction models where the underlying optimization is convex. Next, we explore the effectiveness of both first-order and second-order group influence functions in the case of deep neural networks. Using a two-hidden-layer fully-connected network with sigmoid activations, we train a classifier on the MNIST dataset with 10 labels. We observe that neither the existing first-order nor the proposed second-order influence functions provide good estimates of the ground-truth influence over all training samples (note that the experiments of (influence1) focus on the most influential training samples, not all). We observe that the influence estimates in the case of deep neural networks are very low compared to the ground-truth influences. We take the eigendecomposition of the Hessian matrix in the first-order influence function and empirically show that the contribution to the influence score from each component of the eigendecomposition is low compared to linear models.
In summary, we make the following contributions:

We propose second-order group influence functions that consider cross-dependencies among the samples in the considered group.

Through several experiments on linear models and across different group sizes, we show that the second-order influence estimates have higher correlations with the ground truth than the first-order ones.

In the case of deep neural networks, we show that neither first- nor second-order influence functions provide good estimates of the ground truth. We show that the contribution to the influence score from each eigenvector of the Hessian matrix is smaller than that of linear models.
2 Related Work
Influence functions, a classical technique from robust statistics introduced by (cook_influence; cook_inf_2), were first used in the machine learning community for interpretability by (influence1) to approximate the effect of upweighting a training point on the model parameters. This approximate change in the parameters can be used to compute the change in loss for a particular test point when a sample in the training set is removed or its weight is perturbed by an infinitesimal amount. More recently, (pmlr_influence) focused on the behaviour of influence functions on self-loss. Influence functions can also be used to craft adversarial training images that are visually imperceptible perturbations of the original images and can flip labels at test time. In the past few years, there has been an increase in the applications of influence functions to a variety of machine learning tasks. (Schulam2019CanYT) used influence functions to produce confidence intervals for a prediction and to audit the reliability of predictions. (model_fairness) used influence functions to approximate the gradient in order to recover a counterfactual distribution and increase model fairness. (koh2019stronger) crafted stronger data poisoning attacks using influence functions. More recently, influence functions were used to detect extrapolation (extrapolation) and to validate causal inference models (causal).
3 Background
We consider the classical supervised learning problem setup, where the task is to learn a function (also called the hypothesis) mapping from the input space to an output space . We denote the input-output pair as . We assume that our learning algorithm is given training examples drawn i.i.d. from some unknown distribution . Let be the space of parameters of the considered hypothesis class. The goal is to select model parameters to minimize the empirical risk as follows:
(1) 
where denotes the cardinality of the training set, the subscript indicates that the whole set is used in training, and is the associated loss function. We refer to the optimal parameters computed by the above optimization as . Let and be the gradient and the Hessian of the loss function, respectively.
First, we discuss the case where we want to compute the effect of an individual training sample on the optimal model parameters as well as on the test predictions made by the model. The effect, or influence, of a training sample on the model parameters can be characterized by removing that particular training sample and retraining the model as follows:
(2) 
Then, we can compute the change in model parameters as , due to the removal of a training point . However, retraining the model for every such training sample is expensive when is large. Influence functions based on first-order approximations, introduced by (cook_influence; cook_inf_2), were used by (influence1) to approximate this change. Upweighting a training point by an infinitesimal amount leads to new optimal model parameters, , obtained by solving the following optimization problem:
(3) 
Removing a point is similar to upweighting its corresponding weight by . The main idea used by (influence1) is to approximate by minimizing the first-order Taylor series approximation around . Following the classical result of (cook_influence), the change in the model parameters upon upweighting can be approximated by the influence function (influence1), denoted by :
(4) 
A detailed proof can be found in (influence1). Using this formulation, we can track the change in any function of . The change in the test loss for a particular test point when a training point is upweighted can be approximated in closed form:
(5) 
This result is based on the assumption (influence1) that the loss function is strictly convex in the model parameters, so that the Hessian is positive-definite. The approximation is very similar to forming a quadratic approximation around the optimal parameters and taking a single Newton step. However, explicitly computing the Hessian and its inverse is not required: using the Hessian-vector product rule (Pearlmutter), the influence function can be computed efficiently.
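As an illustrative sketch (not the authors' implementation), the closed-form influence above can be computed directly for a small regularized logistic-regression model, where materializing the Hessian is cheap; all function names here are our own, and for large models the linear solve would be replaced by Hessian-vector-product techniques:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(theta, x, y):
    # Gradient of the logistic loss for a single sample (label y in {0, 1}).
    return (sigmoid(x @ theta) - y) * x

def hessian(theta, X, lam=0.1):
    # Regularized Hessian of the empirical risk; dense form is fine for small d.
    p = sigmoid(X @ theta)
    return (X.T * (p * (1 - p))) @ X / len(X) + lam * np.eye(X.shape[1])

def influence_on_test_loss(theta, X, z_train, z_test, lam=0.1):
    # Closed-form first-order influence of upweighting z_train on the test loss:
    # I(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z).
    H = hessian(theta, X, lam)
    return -grad_loss(theta, *z_test) @ np.linalg.solve(H, grad_loss(theta, *z_train))
```

Since the regularized Hessian is positive-definite, the influence of a point on its own test loss is always non-positive: upweighting a sample can only decrease its own loss to first order.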
4 Group Influence Function
Our goal in this section is to understand how the model parameters would change if a particular group of samples were upweighted in the training set. Upweighting a group, however, can lead to large perturbations of the training data distribution and hence of the model parameters, violating the small-perturbation assumption behind first-order influence functions. In this section, we extend influence functions using second-order approximations to better capture changes in model parameters due to upweighting a group of training samples. In Section 4.1, we show that our proposed second-order group influence function can be used in conjunction with optimization techniques to select the most influential training groups for a test prediction.
The empirical risk minimization (ERM) when we remove samples from training can be written as:
(6) 
To approximate how the optimal solution of this optimization is related to , we study the effect of upweighting a group of training samples on the model parameters. Note that in this case the updated weights should still form a valid distribution: if a group of training samples has been upweighted, the remaining samples should be downweighted to preserve the sum-to-one constraint on the weights in the ERM formulation. In the individual influence function case (when the size of the group is one), upweighting a sample by leads to downweighting the other samples by , whose effect can be neglected, as in the formulation of (influence1).
In our formulation of the group influence function, we assume that the weights of the samples in the group have all been upweighted by , and we use to denote the fraction of upweighted training samples. This leads to downweighting the rest of the training samples by , to preserve the empirical weight distribution of the training data. Therefore, the resulting ERM can be written as:
where
(7)  
Or, equivalently,
(8)  
In the above formulation, when the upweighting parameter is zero we recover the original loss function (where none of the training samples are removed), and if , we recover the loss function where the samples in the group are removed from training.
Let denote the optimal parameters of this minimization. Essentially, we are concerned with the change in the model parameters (i.e., ) when each training sample in a group of size is upweighted by a factor of . The key step of the derivation is to expand around (the minimizer of , or ) with respect to the order of , the upweighting parameter. To do so, we use perturbation theory (perturbation) to expand around .
Frequently used in quantum mechanics and in other areas of physics such as particle physics, condensed matter, and atomic physics, perturbation theory finds an approximate solution to a problem () by starting from the exact solution of a closely related, simpler problem (). As the perturbation gets smaller and smaller, the higher-order terms become less significant. However, for large model perturbations (as in the case of group influence functions), using higher-order terms can reduce approximation errors significantly. The following perturbation series forms the core of our derivation of second-order influence functions:
(9) 
where characterizes the first-order (in ) perturbation vector of the model parameters, while is the second-order (in ) perturbation vector. We hide the dependencies of these perturbation vectors on constants (such as ) with this notation.
In the case of computing the influence of individual points, as shown by (influence1), the scaling of is on the order of , while the scaling of the second-order coefficient is , which is very small when is large. Thus, in that case, the second-order term can be ignored. In the case of computing the group influence, the second-order coefficient is on the order of , which can be large when the size of is large. Thus, in our definition of the group influence function, both and are taken into account.
The first-order group influence function (denoted by ), when all the samples in a group are upweighted by , can be defined as:
To capture the dependency of the terms in on the group influence function, we define as follows:
(10) 
Although one can consider even higher-order terms, in this paper we restrict our derivations to second-order approximations of the group influence function. We now state our main result in the following theorem:
Theorem 1.
If the third derivative of the loss function at is sufficiently small, the second-order group influence function (denoted by ) when all samples in a group are upweighted by is:
(11) 
where:
(12) 
and
The full proof of this result is presented in the appendix. The result is based on the assumption that the third-order derivatives of the loss function at are small. For the quadratic loss, the third-order derivatives are zero. Our experiments with the cross-entropy loss function indicate that this assumption approximately holds for classification problems as well. Below, we present a sketch of the proof.
Proof Sketch.
We now derive and , to be used in the second-order group influence function . As is the optimal parameter set for the interpolated loss function , the first-order stationarity condition gives the following equality:
(13)  
The main idea is to use a Taylor series to expand around , along with the perturbation series defined in Equation (9), and to compare terms of the same order in :
(14) 
Similarly, we expand around using a Taylor series expansion. To derive , we compare terms with the coefficient in Equation (13), and for we compare terms with the coefficient . Based on this, can be written in the following way:
(15) 
We expand Equation (13) and compare the terms with coefficient :
(16) 
This expression is the first-order approximation of the group influence function and can be denoted by . Note that our first-order approximation of the group influence function is slightly different from (influence2), with an additional factor in the denominator. This extra term arises from conserving the weight distribution of the training samples, which is essential when a large group is upweighted.
For , we compare the terms with coefficients of the same order in in Equation (13):
(17) 
For the term, we ignore the third-order term since it is small. Now we substitute the value of and equate the terms with coefficients on the order of :
(18)  
Rearranging the equation, we get the following identity:
∎
It can be observed that the additional term () in our second-order approximation captures cross-dependencies among the samples in the group through a function of gradients and Hessians of the loss at the optimal model parameters. This makes the second-order group influence function more informative when training samples are correlated. In Section 6, we empirically show that adding this term improves the correlation with the ground-truth influence as well.
To track the change in the test loss for a particular test point when a group is removed, we use the chain rule to compute the influence score as follows:
(19) 
Our second-order approximation of the group influence function consists of a first-order term similar to the one proposed in (influence2), with an additional scaling term . This scaling arises because our formulation preserves the empirical weight-distribution constraint in ERM. Our second-order influence function has an additional term that is directly proportional to and captures large perturbations to the model parameters more effectively.
4.1 Selection of influential groups
In this section, we explain how the second-order group influence function can be used to select the most influential group of training samples for a particular test prediction. With the existing first-order approximations of group influence functions, selecting the most influential group can be done greedily by ranking the training points with the highest individual influences, since the group influence is the sum of the influences of the individual points. With the second-order approximation, however, such greedy selection is not optimal, since the group influence is no longer additive over individual points. To deal with this issue, we first decompose the second-order group influence function into two terms:
(20) 
where . While is additive with respect to the samples, captures pairwise dependencies among samples.
To simplify notation, we define the constant vector as . Ideally, for a fixed group size , we want to find the training samples among the total training samples that maximize the influence for a given test point . We can formulate this as a quadratic optimization problem:
(21)  
s.t.  (22) 
where is composed of two matrices and , i.e., . contains the weights associated with each sample in the training set. The entries of contain , and the rows of contain . In the case of , the columns contain . We define the constant as and as . This optimization can be relaxed using the relaxation from compressed sensing (Donoho:2006:CS:2263438.2272089; Candes:2005:DLP:2263433.2271950). The relaxed optimization can then be solved efficiently using projected gradient descent (projection1; Duchi:2008:EPL:1390156.1390191).
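A minimal sketch of this selection step, assuming projected gradient ascent with the Euclidean simplex projection of (Duchi:2008:EPL:1390156.1390191) as the projection routine; the function names, step size, and iteration count are our own, and the vector `a` and matrix `B` stand in for the linear and pairwise influence terms described above:

```python
import numpy as np

def project_simplex(v, z=1.0):
    # Euclidean projection onto {w : w >= 0, sum(w) = z} (Duchi et al., 2008).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - z))[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def select_group(a, B, k, steps=200, lr=0.05):
    # Projected gradient ascent on the relaxed objective f(w) = a.w + w'Bw,
    # a surrogate for picking the k most influential training points.
    n = len(a)
    w = np.full(n, k / n)
    for _ in range(steps):
        w = project_simplex(w + lr * (a + (B + B.T) @ w), z=float(k))
    return np.argsort(w)[-k:]   # indices of the top-k relaxed weights
```

After the ascent converges, the relaxed weights are rounded by simply taking the k largest entries; more careful rounding schemes are possible but not needed for a sketch.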
4.2 Computational Complexity
For models with a large number of parameters, computing the inverse of the Hessian can be expensive, on the order of . However, computing a Hessian-vector product (Pearlmutter) is relatively inexpensive. In our experiments, we used the conjugate gradient method (conjugate_gradient) to compute inverse-Hessian-vector products; it relies only on Hessian-vector products in its inner routine, which avoids inverting the Hessian directly. The second-order group influence function can be computed similarly to the first-order approximation, with an additional Hessian-vector product.
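A minimal conjugate-gradient sketch for the inverse-Hessian-vector product, written against an abstract `hvp` callback so that the Hessian is never materialized (the routine and its defaults are our own simplification; it assumes a positive-definite Hessian):

```python
import numpy as np

def conjugate_gradient(hvp, b, tol=1e-10, max_iter=100):
    # Solve H x = b using only Hessian-vector products: `hvp(v)` returns H @ v
    # (e.g. via Pearlmutter's trick), so H itself is never formed or inverted.
    x = np.zeros_like(b)
    r = b - hvp(x)          # residual
    p = r.copy()            # search direction
    rs = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

The solve for a fixed test-gradient right-hand side can then be cached and reused across all candidate training points or groups, which is what makes influence computation tractable.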
5 Influence on Deep Networks
For both first-order and second-order influence functions, it is assumed that the empirical risk is strictly convex and twice differentiable. However, such assumptions do not hold in general for deep neural networks, where the loss function is non-convex. (influence1) previously showed that, in the case of individual effects, there is still a satisfactory correlation between the approximated influence and the ground-truth influence for deep networks when the Hessian is regularized. This observation holds only for the top few influential points.
The behavior of influence functions across all training points has not been well explored for deep networks. In our experiments, we observe that in this case both first-order and second-order influence values are very small compared to the ground-truth influences, leading to low correlation. The details of these experiments can be found in the appendix. To understand this phenomenon, we take the eigendecomposition of the Hessian matrix and study how different components contribute to the influence score. The eigendecomposition of the Hessian matrix is defined as follows:
where is the number of model parameters and denotes the eigenvector of the Hessian matrix. We substitute the decomposed Hessian into the first-order influence function to obtain the following:
Note that the influence function is inversely related to the eigenvalues. For each component, we compute its contribution to the influence score averaged across all the training points. More specifically, for the component we compute the score as:
We report the contributions of the largest component scores to the overall influence in the following table:
Component  Neural Network  Logistic Regression
1
2
3
4
5
It can be observed that in the case of neural networks, the contribution of each component to the overall influence is very small, which leads to a low influence score in general. In the case of logistic regression, however, where the influence function provides an estimate close to the ground-truth influence, the individual components contribute significantly to the influence score, leading to a better estimate in general. Note that this experiment was done using the Iris dataset (Dua:2019) on a relatively simple neural network with one hidden layer and sigmoid activations, consisting of only 27 parameters, whereas the logistic regression model had 15 parameters. Nevertheless, this highlights the importance of understanding influence functions when the prediction model is a deep neural network.
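The per-component analysis above can be sketched as follows; the helper below is our own construction and simply splits the first-order influence along the Hessian eigenvectors, so the per-component scores sum back to the total influence:

```python
import numpy as np

def component_contributions(H, g_test, G_train):
    # Eigendecompose H = sum_i lam_i u_i u_i^T and split the first-order
    # influence -g_test' H^{-1} g_z into per-eigenvector scores, averaged
    # over the training-point gradients stored in the rows of G_train.
    lam, U = np.linalg.eigh(H)                  # eigenvalues in ascending order
    proj_test = U.T @ g_test                    # g_test projected on each u_i
    proj_train_mean = (G_train @ U).mean(axis=0)
    contrib = -(proj_test / lam) * proj_train_mean
    return lam, contrib                         # contrib.sum() == mean influence
```

Small per-component scores across the whole spectrum, as we observe for neural networks, therefore directly translate into a small total influence.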
6 Experiments
6.1 Setup
Our goal in the experiments is to determine whether the second-order approximation of the group influence function improves the correlation with the ground-truth estimate. We compare the computed second-order group influence score with the ground-truth influence (computed by leave-group-out retraining for a group of a given size). Our evaluation metric is the Pearson correlation, which measures how linearly related the computed influence and the ground-truth estimate are. We perform our experiments on logistic regression and neural networks.
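This evaluation protocol can be sketched generically; the function names and the `train_fn`/`loss_fn` callbacks are our own abstraction (in the paper, retraining means re-running the ERM solver on the reduced training set):

```python
import numpy as np

def ground_truth_group_influence(train_fn, loss_fn, X, Y, groups, z_test):
    # Leave-group-out retraining: for each group U of training indices,
    # retrain without U and record the change in loss at the test point.
    theta_full = train_fn(X, Y)
    base_loss = loss_fn(theta_full, *z_test)
    scores = []
    for U in groups:
        keep = np.setdiff1d(np.arange(len(Y)), U)
        scores.append(loss_fn(train_fn(X[keep], Y[keep]), *z_test) - base_loss)
    return np.array(scores)

def correlation_with_ground_truth(approx, exact):
    # Pearson correlation between approximate and retrained influences.
    return np.corrcoef(approx, exact)[0, 1]
```

Retraining once per candidate group is exactly the cost that the first- and second-order approximations are meant to avoid, so this loop is only run to produce the evaluation targets.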
6.2 Datasets
In our first set of experiments, we use a synthetic dataset with logistic regression. The synthetic dataset has 10,000 points drawn from Gaussian distributions, with 5 features and 2 classes. Details of the synthetic data can be found in the appendix. The second set of experiments uses the standard handwritten-digit database MNIST (mnist), which consists of 10 classes of digits. To understand how group influence functions behave in the case of neural networks, we also use the MNIST dataset. For each of the two datasets, we pick random groups with sizes ranging from to of the training points. We investigate the computed influence score for a misclassified test point with the highest test loss.
6.3 Observations and Analysis
6.3.1 Logistic Regression
For logistic regression, the general observation is that the second-order group influence function improves the correlation with the ground-truth estimates across different group sizes on both the synthetic dataset and MNIST. For both datasets, the gain in correlation is larger when the considered group is large. For example, it can be seen in Fig. (2) that when more than of the samples are removed, the gain in correlation is almost always more than .
6.3.2 Neural Networks
In the case of neural networks, the Hessian is not positive semi-definite, which violates the assumptions of influence functions. (influence1) previously regularized the Hessian in the form of , and showed that for the top few influential points (not groups) for a given test point, the correlation with the ground-truth influence is still satisfactory, if not highly significant. For MNIST, we adopted a similar approach, regularizing the Hessian with a value of , and conducted experiments on a relatively simple two-hidden-layer feedforward network with sigmoid activations for both first-order and second-order group influence functions. The general observation is that both first- and second-order group influence functions underestimate the ground-truth influence values across different group sizes. However, while the second-order influence values still suffer from this underestimation issue, they improve the correlation marginally across different group sizes.
7 Conclusion and Future Work
In this paper, we proposed second-order group influence functions for approximating model changes when a group of training samples is upweighted. Empirically, for linear models and across different group sizes, we showed that the second-order influence has a higher correlation with the ground-truth values than the first-order one, and that the second-order approximation is especially informative when the group is large. We also showed that the proposed second-order group influence function can be used in conjunction with optimization techniques to select the most influential group in the training set for a particular test prediction. For nonlinear models such as deep neural networks, we observed that both first-order and second-order influence functions provide low influence scores compared to the ground-truth values. Developing proper influence functions for neural network models, or training neural networks to have improved influence functions, are interesting directions for future work.
8 Acknowledgements
The authors would like to acknowledge NSF award . The authors would also like to thank Pang Wei Koh for insights about the neural network experiments.
References
Appendix A Proofs
Theorem 1.
If the third derivative of the loss function at is sufficiently small, the second-order group influence function (denoted by ) when all samples in a group are upweighted by is:
(24) 
where:
(25) 
and
Proof.
We consider the empirical risk minimization problem where the learning algorithm is given training samples drawn i.i.d. from some distribution . Let be the space of parameters and the hypothesis to learn. The goal is to select model parameters to minimize the empirical risk as follows:
(26) 
The ERM problem with a subset removed from is as follows:
(27) 
In our formulation of the group influence function, we assume that the weights of the samples in the group have all been upweighted by . This leads to downweighting the remaining training samples by , to conserve the empirical weight distribution of the training data. We use to denote the fraction of upweighted training samples. Therefore, the resulting ERM optimization can be written as:
(28) 
where:
(29) 
and . We consider the stationary condition where the gradient of is zero. More specifically:
(30)  
Next, we expand around the optimal parameters using a Taylor expansion and retrieve the terms with coefficient to find :
(31)  
At the optimal parameters , and , thus simplifying Equation (A):
(32)  
Substituting as , we get the following identity:
(33) 
where is the first-order approximation of the group influence function, which we denote by .
Next, we derive by comparing terms of order (i.e., ) in Equation (30), expanding around :
(34)  
where:
(35) 
Substituting Equation (35) into Equation (34) and expanding in , we get the following identity:
(36)  
Taking the common coefficients out and rearranging Equation (36), we get the following identity:
(37)  
Now, multiplying both sides of Equation (37) by the inverse Hessian , we obtain the cross-term involving the gradients and Hessians of the removed points in the second-order influence function as follows:
(38) 
where , and we denote as . Combining Equations (33) and (38), we write the second-order influence function as:
(39) 
∎
Appendix B Synthetic Dataset
The synthetic data were sampled from Gaussian distributions in 5 dimensions, with two classes and a total of 10,000 points. For the first class, the mean was 0.1 in every dimension; for the second class, the mean was 0.8. The covariance matrix for each class was a diagonal matrix whose entries were sampled uniformly at random between 0 and 1.
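A sketch of this sampling procedure (the function name, seed handling, and the equal split between the two classes are our own assumptions):

```python
import numpy as np

def make_synthetic(n=10000, d=5, seed=0):
    # Two Gaussian classes in d dimensions: class-0 mean 0.1, class-1 mean 0.8,
    # each with a diagonal covariance whose entries are drawn uniformly in (0, 1).
    rng = np.random.default_rng(seed)
    n0 = n // 2
    cov0 = np.diag(rng.uniform(0, 1, d))
    cov1 = np.diag(rng.uniform(0, 1, d))
    X0 = rng.multivariate_normal(0.1 * np.ones(d), cov0, size=n0)
    X1 = rng.multivariate_normal(0.8 * np.ones(d), cov1, size=n - n0)
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n0), np.ones(n - n0)])
    return X, y
```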